Johnatan Bonilla

Humboldt-Universität zu Berlin

Institut für Romanistik

I am a research associate in project A09, which focuses on register and dialect variation in Canarian Spanish. My research interests include corpus linguistics, computational linguistics, and the intersection of sociolinguistics, dialectology, and natural language processing (NLP).

I hold a PhD in Linguistics from Ghent University and a Master’s in Linguistics from the Instituto Caro y Cuervo (Colombia). Previously, I led projects on dialectology, lexicography, and NLP, including the digitization of the Linguistic and Ethnographic Atlas of Colombia (ALEC), the development of the New Linguistic Atlas of Colombia (NALAC), and the creation of the first morphosyntactic treebank for spoken Spanish (COSER-UD).

Repositories

GitHub | HuggingFace | UD Spanish-COSER

Previous Projects

Recent Publications

  • Bonilla, J. E., Merino Hernández, L. M., & Marttinnen Larsson, M. (2025). BERT’s Interpretation of Literalmente ‘Literally’: What Deep Learning Models Can Tell Us about Synchronic Layering and Diachronic Shifts. Cognitive Semantics. https://doi.org/10.1163/23526416-bja10079
  • Monsalve Muñoz, U. de C., Bonilla, J. E., Rubio López, R. Y., & Luna Cortés, A. S. (2025). LEXICC: The Design and Development of an Online Dictionary Writing System. Lexikos. https://doi.org/10.5788/35-1-1989
  • Bonilla, J. E. (2024). Spoken Spanish PoS tagging: Gold standard dataset. Language Resources and Evaluation. https://doi.org/10.1007/s10579-024-09751-x
  • Fernández, J. O., Bonilla, J. E., & Rocha, L. Á. (2024). The influence of geographic variables in linguistic variation. Dialectologia. https://doi.org/10.1344/DIALECTOLOGIA2023.32.7
  • Bouzouita, M., Bonilla, J. E., & Segundo Díaz, R. L. (2024). Gaming for dialects: Creating an annotated and parsed corpus of European Spanish dialects through GWAPs. In Linguistic Corpora and Big Data in Spanish and Portuguese. https://doi.org/10.1515/9783110781465-005
  • Bonilla, J. E., Segundo Díaz, R. L., & Bouzouita, M. (2023). Using GWAPs for verifying PoS tagging of spoken dialectal Spanish. In Conference paper. https://doi.org/10.1109/besc59560.2023.10386542
  • Bonilla, J. E. (2023). Superdialectos, dialectos y subdialectos del español de Colombia. Lexis. https://doi.org/10.18800/lexis.202302.002
  • Segundo Díaz, R. L., Bonilla, J. E., Bouzouita, M., & Rovelo Ruiz, G. (2023). Juegos con propósito para la anotación del Corpus Oral Sonoro del Español rural. Dialectologia et Geolinguistica. https://doi.org/10.1515/dialect-2023-0007

Projects

A09 On the interplay between register and socio-geographic variation in Canarian Spanish
MGK Integrated Graduate School

Contact

Hausvogteiplatz 5-7, R. 22, 10117 Berlin

j.bonilla@hu-berlin.de https://orcid.org/0000-0002-8166-3548

Publications & Presentations

    Publications

  • Bonilla, Johnatan (2025) LEXICC: The Design and Development of an Online Dictionary Writing System In: Lexikos [DOI] [ViVo]
    Abstract: The Instituto Caro y Cuervo (Caro and Cuervo Institute, ICC) was initially founded to complete Rufino José Cuervo's Diccionario de Construcción y Régimen (Dictionary of Construction and Usage) (Cuervo and ICC 1998) and has since expanded its mission to include the research and promotion of Colombia's linguistic heritage. Following this lexicographic tradition, the Institute developed the Diccionario de Colombianismos (Dictionary of Colombianisms, DiCol) (ICC 2018) using the proprietary software TshwaneLex, which facilitated the production of its print version but created a dependency on third-party resources. As a result, the need for a more flexible and independent solution became apparent. In response, this report introduces LEXICC: Diccionarios y Lenguajes (Dictionaries and Languages, LEXICC), a new, tailored online Dictionary Writing System (DWS) developed from scratch as an open-source solution. LEXICC empowers researchers, linguists, lexicographers, and anyone interested in dictionaries to create and manage their lexicographic resources independently. This paper details the design and development process of LEXICC, highlights its main functionalities, and discusses the electronic adaptation of the DiCol, now accessible online through LEXICC.
    Keywords: electronic dictionaries, dictionary writing system, Colombian Spanish, Caro and Cuervo Institute, Dictionary of Colombianisms, non-functional requirements, functional requirements, demo dictionary, lexicographer director
  • Bonilla, Johnatan (2025) BERT’s Interpretation of Literalmente ‘Literally’: What Deep Learning Models Can Tell Us about Synchronic Layering and Diachronic Shifts In: Cognitive Semantics [DOI] [ViVo]
    Abstract: How do language models disambiguate semantically and pragmatically complex and polysemic meanings? In this study, we present a computational approach to the analysis of Spanish’s polysemic literalmente ‘literally,’ an adverb whose meaning and pragmatic functions range from strict word-by-word denotation to (inter)subjective intensification and emphasis. Focusing on the Spanish pre-trained BERT model (BETO), two objectives are pursued: i) to shed light on how artificial language processors interpret pragmatically polyfunctional and semantically polysemic words, and ii) to showcase how the contextual cues drawn on by an artificial language processor can help elucidate semantic polysemy and change in natural language. Using Local Interpretable Model-Agnostic Explanations (LIME), our results show that more innovative and grammaticalized uses exhibit a higher degree of syntactic polyfunctionality. We discuss parallelisms and cross-pollination potential between the uncovered computational dynamics of polysemic literalmente and theories of grammaticalization and semantic change.
  • Bonilla, Johnatan (2025) Spoken Spanish PoS tagging: gold standard dataset In: Language Resources and Evaluation [DOI] [ViVo]
    Abstract: The development of a benchmark for part-of-speech (PoS) tagging of spoken dialectal European Spanish is presented, which will serve as the foundation for a future treebank. The benchmark is constructed using transcriptions of the Corpus Oral y Sonoro del Español Rural (COSER; “Audible corpus of spoken rural Spanish”) and follows the Universal Dependencies project guidelines. We describe the methodology used to create a gold standard, which serves to evaluate different state-of-the-art PoS taggers (spaCy, Stanza NLP, and UDPipe) originally trained on written data, and to fine-tune and evaluate a model for spoken Spanish. It is shown that the accuracy of these taggers drops from 0.98–0.99 to 0.94–0.95 when tested on spoken data. Of these three taggers, spaCy’s trf (transformers) and Stanza NLP models performed the best. Finally, the spaCy trf model is fine-tuned using our gold standard, which resulted in an accuracy of 0.98 for coarse-grained tags (UPOS) and 0.97 for fine-grained tags (FEATS). Our benchmark will enable the development of more accurate PoS taggers for spoken Spanish and facilitate the construction of a treebank for European Spanish varieties.
    Presentations

  • Bouzouita, Miriam; Bonilla, Johnatan  () Variación morfosintáctica en el español de Canarias: Un marco experimental a partir de corpus multimodales  In: XXXI CILFR, Università del Salento  [ViVo]
  • Bonilla, Johnatan; Bouzouita, Miriam  () The pluralization of the existential verb haber ‘there is/are’ in written and recorded parliamentary speeches in Canarian Spanish  In: 11th Inter-Varietal Applied Corpus Studies, University of Cambridge [ViVo]
  • Bonilla, Johnatan; Bouzouita, Miriam  () Corpus de YouTube para el análisis morfosintáctico del español canario: la pluralización del verbo existencial haber  In: Universidad Nacional Autónoma de México [ViVo]