Felix Bildhauer
Leibniz-Institut für Deutsche Sprache Mannheim
Projects
A04
Building register into the architecture of language – an HPSG account
Contact
bildhauer@ids-mannheim.de
Website https://orcid.org/0000-0002-6567-5987
Publications and Presentations
Pescuma, Valentina Nicole; Serova, Dina; Lukassek, Julia; Sauermann, Antje; Schäfer, Roland; Adli, Aria; Bildhauer, Felix; Egg, Markus; Hülk, Kristina; Ito, Aine; Jannedy, Stefanie; Kordoni, Valia; Kühnast, Milena; Kutscher, Silvia; Lange, Robert; Lehmann, Nico; Liu, Mingya; Lütke, Beate; Maquate, Katja; Mooshammer, Christine; Mortezapour, Vahid; Müller, Stefan; Norde, Muriel; Pankratz, Elizabeth; Patarroyo, Angela Giovanna; Plesca, Ana-Maria; Ronderos, Camilo R.; Rotter, Stephanie; Sauerland, Uli; Schulte, Britta; Schüppenhauer, Gediminas; Sell, Bianca Maria; Solt, Stephanie; Terada, Megumi; Tsiapou, Dimitra; Verhoeven, Elisabeth; Weirich, Melanie; Wiese, Heike; Zaruba, Kathy; Zeige, Lars Erik; Lüdeling, Anke; Knoeferle, Pia; Schnelle, Gohar (2023) Situating language register across the ages, languages, modalities, and cultural aspects: Evidence from complementary methods In: Frontiers in Psychology
In the present review paper by members of the collaborative research center ‘Register: Language Users’ Knowledge of Situational-Functional Variation’ (CRC 1412), we assess the pervasiveness of register phenomena across different time periods, languages, modalities, and cultures. We define ‘register’ as recurring variation in language use depending on the function of language and on the social situation. Informed by rich data, we aim to better understand and model the knowledge involved in situation- and function-based use of language register. To achieve this goal, we use complementary methods and measures. In the review, we start by clarifying the concept of ‘register’, by reviewing the state of the art, and by setting out our methods and modeling goals. Against this background, we discuss three key challenges, two at the methodological level and one at the theoretical level:
1. To better uncover registers in text and spoken corpora, we propose changes to established analytical approaches.
2. To tease apart between-subject variability from the linguistic variability at issue (intra-individual, situation-based register variability), we use within-subject designs and the modeling of individuals’ social, language, and educational background.
3. We highlight a gap in cognitive modeling, viz. modeling the mental representations of register (processing), and present our first attempts at filling this gap.
We argue that the targeted use of multiple complementary methods and measures supports investigating the pervasiveness of register phenomena and yields comprehensive insights into the cross-methodological robustness of register-related language variability. These comprehensive insights in turn provide a solid foundation for associated cognitive modeling.

Weber, Thilo; Bildhauer, Felix; Münzberg, Franziska (2023) Finite vs. infinite Attributsätze: zu/dass-Alternation bei Substantiven In: Fugenelemente, Präfix- und Partikelverben, Attributsätze

Machicao y Priemer, Antonio; Müller, Stefan; Schäfer, Roland; Bildhauer, Felix (2022) Towards a treatment of register phenomena in HPSG In: Proceedings of the 29th International Conference on Head-Driven Phrase Structure Grammar, Nagoya University & Institute for Japanese Language and Linguistics

Machicao y Priemer, Antonio; Schäfer, Roland; Bildhauer, Felix; Müller, Stefan (2022) Towards a treatment of register phenomena in HPSG In: The 29th International Conference on Head-Driven Phrase Structure Grammar, Nagoya University & the National Institute for Japanese Language and Linguistics

Schäfer, Roland; Bildhauer, Felix (2020) Beyond Multidimensional Analysis: Probabilistic Register Induction for Large
Corpora In: Humboldt-Universität zu Berlin: Kolloquium Syntax und Semantik (2020)
The analysis of the register in which a corpus document is written is prominently associated with Biber’s (1988; 1995) Multidimensional Analysis (MDA). We present an approach that is superficially similar to MDA but solves three major conceptual problems of MDA by using Bayesian inference to uncover registers or, rather, potential registers. First, in Biber’s MDA, registers are associated discretely with documents, and each document can only instantiate one specific register, whereas we allow registers to be associated probabilistically with documents, and we allow mixtures of registers in single documents. Given that many linguistic phenomena are now understood as being probabilistic in nature (cf. Schäfer 2018), we suggest that this is a much more realistic assumption. Second, we assume the surface features to be associated with registers in a probabilistic manner for similar reasons. Third, we do not use a catalogue of registers assumed to exist a priori, but instead we merely infer potential registers (pregisters) via clusters of surface features. The question of which pregisters actually correspond to registers with an identifiable situational communicative setting will be dealt with in a future stage of the project using theory-driven evaluation and experimental validation. Given our assumptions about the nature of the mapping between features and pregisters and between pregisters and documents, an obvious algorithm to use is Bayesian inference in the form of Latent Dirichlet Allocation (LDA; Blei et al. 2003; Blei 2012) as used in Topic Modelling. In our approach, we deal with pregisters instead of topics and with distributions of lexico-grammatical surface features instead of lexical words. The LDA algorithm otherwise performs an exactly parallel inference task. We first show how we extended the COReX feature extraction framework (Bildhauer & Schäfer in prep.)
developed at FU Berlin and the IDS Mannheim in order to provide a large enough number of features for the LDA algorithm to work. We then present first results and discuss how we tuned the LDA algorithm and the feature set to obtain interpretable results. In order to be able to interpret the pregisters found by LDA, we extract the documents which most strongly instantiate the inferred pregisters. We introduce the PreCOX20 sub-corpus of the DECOW German web corpus, in which those prototypical documents are collected for further analysis with respect to their situational communicative setting.
References:
Biber, D. (1988). Variation across Speech and Writing. CUP.
Biber, D. (1995). Dimensions of Register Variation: A Cross-Linguistic Comparison. CUP.
Bildhauer, F. & R. Schäfer (in prep.). Automatic register annotation and alternation modelling.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM 55(4), 77-84.
Blei, D. M., A. Y. Ng & M. I. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
Schäfer, R. (2018). Probabilistic German Morphosyntax. Habilitation thesis. HU Berlin.
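
Illustration (not part of the abstract): the final step described above, extracting the documents that most strongly instantiate each inferred pregister, might look roughly like the sketch below. It assumes that a document-by-pregister probability matrix has already been inferred by LDA or some comparable model. The function name `prototypical_documents`, the document ids, and the toy matrix are hypothetical and are not taken from COReX, PreCOX20, or DECOW.

```python
from typing import Dict, List, Sequence


def prototypical_documents(
    doc_pregister: Sequence[Sequence[float]],
    doc_ids: Sequence[str],
    top_n: int = 2,
) -> Dict[int, List[str]]:
    """For each pregister, return the ids of the top_n documents with the highest weight.

    doc_pregister[d][k] is the (assumed, already inferred) probability that
    document d instantiates pregister k; each row is a mixture over pregisters.
    """
    if len(doc_pregister) != len(doc_ids):
        raise ValueError("one id per document row is required")
    n_pregisters = len(doc_pregister[0]) if doc_pregister else 0
    result: Dict[int, List[str]] = {}
    for k in range(n_pregisters):
        # Rank document indices by their weight for pregister k, highest first.
        ranked = sorted(
            range(len(doc_ids)),
            key=lambda d: doc_pregister[d][k],
            reverse=True,
        )
        result[k] = [doc_ids[d] for d in ranked[:top_n]]
    return result


# Hypothetical mixtures for four documents and two pregisters (rows sum to 1).
# These numbers are invented for illustration; they are not results from the talk.
theta = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.55, 0.45],
    [0.05, 0.95],
]
ids = ["doc_a", "doc_b", "doc_c", "doc_d"]
prototypes = prototypical_documents(theta, ids, top_n=2)
```

Because each row is a mixture rather than a single label, a document such as `doc_c` can rank among the prototypes of one pregister while still carrying substantial weight for another. This corresponds to the probabilistic document-register association the abstract contrasts with MDA.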