Johnatan Bonilla

I am a research associate in project A09, which focuses on register and dialect variation in Canarian Spanish. My research interests include corpus linguistics, computational linguistics, and the intersection of sociolinguistics, dialectology, and natural language processing (NLP).

I hold a PhD in Linguistics from Ghent University and a Master’s in Linguistics from the Instituto Caro y Cuervo (Colombia). Previously, I led projects on dialectology, lexicography, and NLP, including the digitization of the Linguistic and Ethnographic Atlas of Colombia (ALEC), the development of the New Linguistic Atlas of Colombia (NALAC), and the creation of the first morphosyntactic treebank for spoken Spanish (COSER-UD).

Repositories

GitHub | HuggingFace | UD Spanish-COSER

Previous Projects

Projects

A09 On the interplay between register and socio-geographic variation in Canarian Spanish
Integrated Research Training Group (IRTG/MGK) Integrated Graduate School

Contact

Hausvogteiplatz 5-7, R. 22, 10117 Berlin

030 2093 85092

j.bonilla@hu-berlin.de https://orcid.org/0000-0002-8166-3548

Publications & Presentations

Publications

Bonilla, Johnatan (2025) LEXICC: The Design and Development of an Online Dictionary Writing System In: Lexikos [DOI] [ViVo]
The Instituto Caro y Cuervo (Caro and Cuervo Institute, ICC) was initially founded to complete Rufino José Cuervo's Diccionario de Construcción y Régimen (Dictionary of Construction and Usage) (Cuervo and ICC 1998) and has since expanded its mission to include the research and promotion of Colombia's linguistic heritage. Following this lexicographic tradition, the Institute developed the Diccionario de Colombianismos (Dictionary of Colombianisms, DiCol) (ICC 2018) using the proprietary software TshwaneLex, which facilitated the production of its print version but created a dependency on third-party resources. As a result, the need for a more flexible and independent solution became apparent. In response, this report introduces LEXICC: Diccionarios y Lenguajes (Dictionaries and Languages, LEXICC), a new, tailored online Dictionary Writing System (DWS) developed from scratch as an open-source solution. LEXICC enables researchers, linguists, lexicographers, and anyone interested in dictionaries to create and manage their lexicographic resources independently. This paper details the design and development process of LEXICC, highlights its main functionalities, and discusses the electronic adaptation of the DiCol, now accessible online through LEXICC. Keywords: electronic dictionaries, dictionary writing system, Colombian Spanish, Caro and Cuervo Institute, Dictionary of Colombianisms, non-functional requirements, functional requirements, demo dictionary, lexicographer director
Bonilla, Johnatan (2025) BERT’s Interpretation of Literalmente ‘Literally’: What Deep Learning Models Can Tell Us about Synchronic Layering and Diachronic Shifts In: Cognitive Semantics [DOI] [ViVo]
How do language models disambiguate semantically and pragmatically complex and polysemic meanings? In this study, we present a computational approach to the analysis of the polysemic Spanish adverb literalmente ‘literally,’ whose meaning and pragmatic functions range from strict word-by-word denotation to (inter)subjective intensification and emphasis. Focusing on the Spanish pre-trained BERT model (BETO), two objectives are pursued: i) to shed light on how artificial language processors interpret pragmatically polyfunctional and semantically polysemic words, and ii) to showcase how the contextual cues drawn on by an artificial language processor can help elucidate semantic polysemy and change in natural language. Using Local Interpretable Model-Agnostic Explanations (LIME), our results show that more innovative and grammaticalized uses exhibit a higher degree of syntactic polyfunctionality. We discuss parallelisms and cross-pollination potential between the uncovered computational dynamics of polysemic literalmente and theories of grammaticalization and semantic change.
Bonilla, Johnatan (2025) Spoken Spanish PoS tagging: gold standard dataset In: Language Resources and Evaluation [DOI] [ViVo]
The development of a benchmark for part-of-speech (PoS) tagging of spoken dialectal European Spanish is presented, which will serve as the foundation for a future treebank. The benchmark is constructed using transcriptions of the Corpus Oral y Sonoro del Español Rural (COSER; “Audible corpus of spoken rural Spanish”) and follows the Universal Dependencies project guidelines. We describe the methodology used to create a gold standard, which serves to evaluate different state-of-the-art PoS taggers (spaCy, Stanza NLP, and UDPipe), originally trained on written data, and to fine-tune and evaluate a model for spoken Spanish. It is shown that the accuracy of these taggers drops from 0.98–0.99 to 0.94–0.95 when tested on spoken data. Of these three taggers, spaCy’s trf (transformers) and Stanza NLP models performed the best. Finally, the spaCy trf model is fine-tuned using our gold standard, which resulted in an accuracy of 0.98 for coarse-grained tags (UPOS) and 0.97 for fine-grained tags (FEATS). Our benchmark will enable the development of more accurate PoS taggers for spoken Spanish and facilitate the construction of a treebank for European Spanish varieties.
Bonilla, Johnatan (2023) The Influence of Geographic Variables in Linguistic Variation In: Dialectologia [DOI] [ViVo]

Presentations

Bouzouita, Miriam; Bonilla, Johnatan () The pluralization of the existential verb haber ‘there is/are’ in written and recorded parliamentary speeches in Canarian Spanish In: 11th Inter-Varietal Applied Corpus Studies, University of Cambridge [ViVo]
Bouzouita, Miriam; Bonilla, Johnatan () A09 On the interplay between register and socio-geographic variation in Canarian Spanish In: CRC 1412 - Area A Retreat 2024 [ViVo]
Bouzouita, Miriam; Bonilla, Johnatan () Mapping Register Variation in Canarian Spanish Using NLP and Emerging Language Technologies In: 39. Romanistiktag, Universität Konstanz [ViVo]
Bouzouita, Miriam; Bonilla, Johnatan () Corpus de YouTube para el análisis morfosintáctico del español canario: la pluralización del verbo existencial haber In: Universidad Nacional Autónoma de México [ViVo]
Bonilla, Johnatan; Bouzouita, Miriam () Beyond Traditional Corpus Creation: Integrating NLP, AI, and Social Mapping for the Study of the Interaction between Register and Socio-Geographic Variation In: 7th Wedisyn, Humboldt-Universität zu Berlin [ViVo]
Bonilla, Johnatan; Bouzouita, Miriam () Variación morfosintáctica en el español de Canarias: Un marco experimental a partir de corpus multimodales In: XXXI CILFR, Università del Salento [ViVo]