A04
Building register into the architecture of language – an HPSG account

A04 aims to develop a formal grammatical model of morphosyntactic register variation within Head-Driven Phrase Structure Grammar. The project investigates phenomena pertaining to four domains of variation in German and Brazilian Portuguese: pronoun choice (lexicon), agreement (morphology), valence alternations (argument structure), and the structure of the pre-field (constituent structure). The initial hypothesis is that probabilistic relations between subsets of the grammar (i.e. registers) and situational parameters are mediated by social meanings, understood broadly to cover any kind of non-truth-conditional content that indexes some socially relevant property of context coordinates (e.g. the speaker, the hearer, or the relationship between them). The social meanings of particular register variants are tested by means of perception studies.

Project in Phase I

Project Title for Phase I

Situated syntax: Exploring and modeling syntactic register variation in German

Project Description for Phase I

A04 aims at uncovering and modeling aspects of register variation in German morphosyntax. In the first part, the project will utilize existing formal and empirical methods like Biber’s Multidimensional Analysis and develop them further by (i) using a probabilistic approach to registers, (ii) using a more sophisticated method for automatically inferring registers from distributions of features in texts, namely Latent Dirichlet Allocation, and (iii) validating the cognitive reality of the automatic identification of registers using psycholinguistic methods. In the second part, it will extend the PI’s CoreGram implementation in HPSG to model speakers’ probabilistic knowledge of register-related effects in morphosyntax. This is innovative inasmuch as no similar formal implementation of register effects has been attempted before.

Publications & Presentations

    Publications

    2024

  • Schäfer, Roland  (2024) Between syntax and morphology: German noun+verb units In:  Glossa: a journal of general linguistics [DOI] [ViVo]
    We show that graphemic variation, at least in some writing systems, can be analysed in terms of grammatical variation given a usage-based probabilistic view of the grammar-graphemics interface. Concretely, we examine a type of noun+verb unit in German, which can be written as one word or two. We argue that the variation in writing is rooted in the units’ ambiguous status in between morphology (one word) and syntax (two words). The major influencing factors are shown to be the semantic relation between the noun and the verb (argument or oblique relation) and the morphosyntactic context. In prototypically nominal contexts, a reinterpretation of the unit as a noun+noun compound is facilitated, which favours spelling as one word, while in prototypically verbal contexts, a syntactic realisation and consequently spelling as two words is preferred. We report the results of two large-scale corpus studies and a controlled production experiment to corroborate our analysis.
    2023

  • Müller, Stefan (2023) Germanic Syntax: A Constraint-Based View [DOI] [ViVo]
    This book is an introduction to the syntactic structures that can be found in the Germanic languages. The analyses are couched in the framework of HPSG light, which is a simplified version of HPSG that uses trees to depict analyses rather than complicated attribute value matrices. The book is written for students with basic knowledge about case, constituent tests, and simple phrase structure grammars (advanced BA or MA level) and for researchers with an interest in the Germanic languages and/or an interest in Head-Driven Phrase Structure Grammar/Sign-Based Construction Grammar without having the time to deal with all the details of these theories.
  • Müller, Stefan (2023) Grammatical theory: From transformational grammar to constraint-based approaches. Fifth revised edition [DOI] [ViVo]
    This book introduces formal grammar theories that play a role in current linguistic theorizing (Phrase Structure Grammar, Transformational Grammar/Government & Binding, Generalized Phrase Structure Grammar, Lexical Functional Grammar, Categorial Grammar, Head-Driven Phrase Structure Grammar, Construction Grammar, Tree Adjoining Grammar). The key assumptions are explained and it is shown how the respective theory treats arguments and adjuncts, the active/passive alternation, local reorderings, verb placement, and fronting of constituents over long distances. The analyses are explained with German as the object language. The second part of the book compares these approaches with respect to their predictions regarding language acquisition and psycholinguistic plausibility. The nativism hypothesis, which assumes that humans possess genetically determined innate language-specific knowledge, is critically examined and alternative models of language acquisition are discussed. The second part then addresses controversial issues of current theory building such as the question of flat or binary branching structures being more appropriate, the question whether constructions should be treated on the phrasal or the lexical level, and the question whether abstract, non-visible entities should play a role in syntactic analyses. It is shown that the analyses suggested in the respective frameworks are often translatable into each other. The book closes with a chapter showing how properties common to all languages or to certain classes of languages can be captured. The book is a translation of the German book Grammatiktheorie, which was published by Stauffenburg in 2010. The following quotes are taken from reviews:
    “With this critical yet fair reflection on various grammatical theories, Müller fills what was a major gap in the literature.” (Karen Lehmann, Zeitschrift für Rezensionen zur germanistischen Sprachwissenschaft, 2012)
    “Stefan Müller’s recent introductory textbook, Grammatiktheorie, is an astonishingly comprehensive and insightful survey for beginning students of the present state of syntactic theory.” (Wolfgang Sternefeld and Frank Richter, Zeitschrift für Sprachwissenschaft, 2012)
    “This is the kind of work that has been sought after for a while [...] The impartial and objective discussion offered by the author is particularly refreshing.” (Werner Abraham, Germanistik, 2012)
  • Pescuma, Valentina Nicole; Serova, Dina; Lukassek, Julia; Sauermann, Antje; Schäfer, Roland; Adli, Aria; Bildhauer, Felix; Egg, Markus; Hülk, Kristina; Ito, Aine; Jannedy, Stefanie; Kordoni, Valia; Kühnast, Milena; Kutscher, Silvia; Lange, Robert; Lehmann, Nico; Liu, Mingya; Lütke, Beate; Maquate, Katja; Mooshammer, Christine; Mortezapour, Vahid; Müller, Stefan; Norde, Muriel; Pankratz, Elizabeth; Patarroyo, Angela Giovanna; Plesca, Ana-Maria; Ronderos, Camilo R.; Rotter, Stephanie; Sauerland, Uli; Schulte, Britta; Schüppenhauer, Gediminas; Sell, Bianca Maria; Solt, Stephanie; Terada, Megumi; Tsiapou, Dimitra; Verhoeven, Elisabeth; Weirich, Melanie; Wiese, Heike; Zaruba, Kathy; Zeige, Lars Erik; Lüdeling, Anke; Knoeferle, Pia; Schnelle, Gohar  (2023) Situating language register across the ages, languages, modalities, and cultural aspects: Evidence from complementary methods In:  Frontiers in Psychology [DOI] [ViVo]
    In the present review paper by members of the collaborative research center ‘Register: Language Users’ Knowledge of Situational-Functional Variation’ (CRC 1412), we assess the pervasiveness of register phenomena across different time periods, languages, modalities, and cultures. We define ‘register’ as recurring variation in language use depending on the function of language and on the social situation. Informed by rich data, we aim to better understand and model the knowledge involved in situation- and function-based use of language register. In order to achieve this goal, we are using complementary methods and measures. In the review, we start by clarifying the concept of ‘register’, by reviewing the state of the art, and by setting out our methods and modeling goals. Against this background, we discuss three key challenges, two at the methodological level and one at the theoretical level: 1. To better uncover registers in text and spoken corpora, we propose changes to established analytical approaches. 2. To tease apart between-subject variability from the linguistic variability at issue (intra-individual situation-based register variability), we use within-subject designs and the modeling of individuals’ social, language, and educational background. 3. We highlight a gap in cognitive modeling, viz. modeling the mental representations of register (processing), and present our first attempts at filling this gap. We argue that the targeted use of multiple complementary methods and measures supports investigating the pervasiveness of register phenomena and yields comprehensive insights into the cross-methodological robustness of register-related language variability. These comprehensive insights in turn provide a solid foundation for associated cognitive modeling.
  • Weber, Thilo; Bildhauer, Felix; Münzberg, Franziska  (2023) Finite vs. infinite Attributsätze: zu/dass-Alternation bei Substantiven In:  Fugenelemente, Präfix-und Partikelverben, Attributsätze [ViVo]
  • Varaschin, Giuseppe; Culicover, Peter W.; Winkler, Susanne  (2023) In pursuit of Condition C: (Non-)coreference in grammar, discourse and processing In:  Information Structure and Discourse in Generative Grammar [ViVo]
    2021

  • Machicao y Priemer, Antonio; Müller, Stefan  (2021) NPs in German: Locality, theta roles, possessives, and genitive arguments In:  Glossa: a journal of general linguistics [DOI] [ViVo]
    Since Abney (1987), the DP-analysis has been the standard analysis for nominal complexes, but in the last decade, the NP analysis has experienced a revival. In this spirit, we provide an NP analysis for German nominal complexes in HPSG. Our analysis deals with the fact that relational nouns assign case and theta role to their arguments. We develop an analysis in line with selectional localism (Sag 2012: 149), accounting for the asymmetry between prenominal and postnominal genitives, as well as for the complementarity between higher arguments and possessives, providing a syntactic and semantic analysis.
    2020

  • Kutscher, Silvia; Alexiadou, Artemis; Adli, Aria; Donhauser, Karin; Dreyer, Malte; Egg, Markus; Feulner, Anna Helene; Gagarina, Natalia; Hock, Wolfgang; Jannedy, Stefanie; Kammerzell, Frank; Knoeferle, Pia; Krause, Thomas; Krifka, Manfred; Lüdeling, Anke; Maquate, Katja; McFadden, Thomas; Meyer, Roland; Mooshammer, Christine; Lütke, Beate; Müller, Stefan; Norde, Muriel; Sauerland, Uli; Szucsich, Luka; Verhoeven, Elisabeth; Waltereit, Richard; Wolfsgruber, Anne (2020) Register: Language Users’ Knowledge of Situational-Functional Variation In: REALIS: Register Aspects of Language in Situation [DOI] [ViVo]
    The Collaborative Research Center 1412 “Register: Language Users’ Knowledge of Situational-Functional Variation” (CRC 1412) investigates the role of register in language, focusing in particular on what constitutes a language user’s register knowledge and which situational-functional factors determine a user’s choices. The following paper is an extract from the frame text of the proposal for the CRC 1412, which was submitted to the Deutsche Forschungsgemeinschaft in 2019, followed by a successful onsite evaluation that took place in 2019. The CRC 1412 then started its work on January 1, 2020. The theoretical part of the frame text gives an extensive overview of the theoretical and empirical perspectives on register knowledge from the viewpoint of 2019. Due to the high collaborative effort of all PIs involved, the frame text is unique in its scope on register research, encompassing register-relevant aspects from variationist approaches, psycholinguistics, grammatical theory, acquisition theory, historical linguistics, phonology, phonetics, typology, corpus linguistics, and computational linguistics, as well as qualitative and quantitative modeling. Although our positions and hypotheses since its submission have developed further, the frame text is still a vital resource as a compilation of state-of-the-art register research and a documentation of the start of the CRC 1412. The theoretical part without administrative components therefore presents an ideal starter publication to kick off the CRC’s publication series REALIS. For an overview of the projects and more information on the CRC, see https://sfb1412.hu-berlin.de/.
  • Machicao y Priemer, Antonio; Fritz-Huechante, Paola  (2020) Boundaries at play In:  Interfaces in Romance [DOI] [ViVo]
    In this paper, we model the left-bounded state reading and the true reflexive reading of the se clitic in the Spanish psychological domain. We argue that a lexical analysis of se provides us with a more accurate description of the different classes of psychological verbs that occur with the clitic. We provide a unified analysis where the two readings of se are modeled by means of lexical rules. We take the morphologically simple but semantically more complex basic items (e.g. asustar ‘frighten’) as input of the lexical rules, getting as the output a morphologically more complex but semantically simpler verb (e.g. asustarse ‘get frightened’). The analysis for psych verbs correctly allows only those verbs assigning accusative to the experiencer or the stimulus to combine with se, hence preventing dative verbs from entering the lexical rules. The analysis also demonstrates how to account for punctual and non-punctual readings of psych verbs with se by incorporating ‘boundaries’ into the type hierarchy of eventualities.
    Presentations

    2023

  • Sailer, Manfred  (2023) Explicit or redundant: The social meaning of multiple exponence In:  Humboldt-Universität zu Berlin: Kolloquium Syntax und Semantik (2023) [ViVo]
  • Sailer, Manfred  (2023) Explicit or redundant: The social meaning of multiple exponence In:  Kolloquium SFB1412 (2023) [ViVo]
    2022

  • Varaschin, Giuseppe; Machicao y Priemer, Antonio  (2022) Agreement mismatches and register-driven variation in Brazilian Portuguese In:  Oberseminar Syntax and Semantics, Institut für England- und Amerikastudien, Goethe-Universität Frankfurt am Main [ViVo]
    2020

  • Schäfer, Roland  (2020) Grammatische Variation zwischen Individuen und Situationen: Perspektiven für Linguistik und Bildungsspracherwerb In:  Humboldt-Universität zu Berlin: Kolloquium Syntax und Semantik (2020) [ViVo]
  • Schäfer, Roland; Bildhauer, Felix  (2020) Beyond Multidimensional Analysis: Probabilistic Register Induction for Large Corpora In:  Humboldt-Universität zu Berlin: Kolloquium Syntax und Semantik (2020) [ViVo]
    The analysis of the register in which a corpus document is written is prominently associated with Biber’s (1988; 1995) Multidimensional Analysis (MDA). We present an approach superficially similar to MDA but which solves three major conceptual problems of MDA by using Bayesian inference to uncover registers or, rather, potential registers. First, in Biber’s MDA, registers are associated discretely with documents, and each document can only instantiate one specific register, whereas we allow registers to be associated probabilistically with documents, and we allow mixtures of registers in single documents. Given that many linguistic phenomena are now understood as being probabilistic in nature (cf. Schäfer 2018), we suggest that this is a much more realistic assumption. Second, we assume the surface features to be associated with registers in a probabilistic manner for similar reasons. Third, we do not use a catalogue of registers assumed to exist a priori, but instead we merely infer potential registers (pregisters) via clusters of surface features. The question of which pregisters actually correspond to registers with an identifiable situational communicative setting will be dealt with in a future stage of the project using theory-driven evaluation and experimental validation. Given our assumptions about the nature of the mapping between features and pregisters and between pregisters and documents, an obvious algorithm to use is Bayesian inference in the form of Latent Dirichlet Allocation (LDA; Blei et al. 2003; Blei 2012) as used in Topic Modelling. In our approach, we deal with pregisters instead of topics and with distributions of lexico-grammatical surface features instead of lexical words. The LDA algorithm otherwise performs an exactly parallel inference task. We first show how we extended the COReX feature extraction framework (Bildhauer & Schäfer in prep.) developed at FU Berlin and the IDS Mannheim in order to provide a large enough number of features for the LDA algorithm to work. We then present first results and discuss how we tuned the LDA algorithm and the feature set to lead to interpretable results. In order to be able to interpret the pregisters found by LDA, we extract the documents which most strongly instantiate the inferred pregisters. We introduce the PreCOX20 sub-corpus of the DECOW German web corpus, in which those prototypical documents are collected for further analysis w.r.t. their situational communicative setting.
    References:
    Biber, D. (1988). Variation across Speech and Writing. CUP.
    Biber, D. (1995). Dimensions of Register Variation: A Cross-Linguistic Comparison. CUP.
    Bildhauer, F. & R. Schäfer (in prep.). Automatic register annotation and alternation modelling.
    Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM 55(4), 77-84.
    Blei, D. M., A. Y. Ng & M. I. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
    Schäfer, R. (2018). Probabilistic German Morphosyntax. Habilitation thesis. HU Berlin.
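    The inference task described in this abstract can be illustrated with a minimal sketch. It is not the project's implementation and does not use COReX: it is a toy collapsed Gibbs sampler for LDA in plain Python, in which documents are lists of surface-feature occurrences and the latent "topics" play the role of pregisters. The function name, the four feature ids, the toy documents, and the hyperparameter values are all assumptions made for illustration.

```python
import random

def infer_pregisters(docs, n_pregisters, n_features,
                     alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only).

    docs: list of documents, each a list of feature ids (one per occurrence).
    Returns (theta, phi): pregister mixture per document and
    feature distribution per pregister.
    """
    rng = random.Random(seed)
    K, V = n_pregisters, n_features
    doc_k = [[0] * K for _ in docs]        # occurrences in doc d assigned to k
    k_feat = [[0] * V for _ in range(K)]   # occurrences of feature f assigned to k
    k_total = [0] * K                      # all occurrences assigned to k
    assign = []                            # current pregister of every occurrence

    # Random initial assignment of every feature occurrence to a pregister.
    for d, doc in enumerate(docs):
        zs = []
        for f in doc:
            k = rng.randrange(K)
            zs.append(k)
            doc_k[d][k] += 1
            k_feat[k][f] += 1
            k_total[k] += 1
        assign.append(zs)

    # Resample each assignment conditioned on all the others.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, f in enumerate(doc):
                k = assign[d][i]
                doc_k[d][k] -= 1
                k_feat[k][f] -= 1
                k_total[k] -= 1
                weights = [
                    (doc_k[d][j] + alpha)
                    * (k_feat[j][f] + beta) / (k_total[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = k
                doc_k[d][k] += 1
                k_feat[k][f] += 1
                k_total[k] += 1

    # Smoothed point estimates from the final sample.
    theta = [[(doc_k[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(k_feat[k][f] + beta) / (k_total[k] + V * beta) for f in range(V)]
           for k in range(K)]
    return theta, phi

# Hypothetical feature ids: 0 = passive, 1 = nominalisation,
# 2 = second-person pronoun, 3 = modal particle.
docs = [
    [0, 1, 0, 1, 1, 0],
    [0, 0, 1, 1],
    [2, 3, 3, 2, 2],
    [3, 2, 3, 3],
    [0, 1, 2, 3, 0, 2],   # a document mixing both feature groups
]
theta, phi = infer_pregisters(docs, n_pregisters=2, n_features=4)
for d, row in enumerate(theta):
    print(d, [round(p, 2) for p in row])
```

    Each row of theta is a probability distribution over pregisters for one document, which is what allows a single document to instantiate a mixture of pregisters rather than exactly one. Each row of phi associates a pregister probabilistically with surface features. Ranking documents by their largest theta value would correspond to the step of extracting the documents that most strongly instantiate a pregister.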
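
The inference task described in the abstract above (documents as mixtures of pregisters, pregisters as distributions over surface features, and retrieval of the documents that most strongly instantiate a pregister) can be illustrated with a small toy sketch. It is not the authors' implementation and does not use COReX or DECOW data: the feature names, the toy documents, the hyperparameters, and the function names `lda_gibbs` and `prototypical_docs` are all assumptions made for illustration. Collapsed Gibbs sampling is used here as one standard way of fitting LDA; the abstract does not say which inference procedure the project uses.

```python
import random

# Hypothetical surface features (illustrative names, not taken from COReX).
FEATURES = ["passive", "genitive_attribute", "second_person_pronoun", "clitic_contraction"]


def lda_gibbs(docs, n_features, n_pregisters, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of documents, each a list of feature ids (one id per occurrence).
    Returns (theta, phi): per-document pregister mixtures and
    per-pregister feature distributions.
    """
    rng = random.Random(seed)
    K, V = n_pregisters, n_features
    doc_k = [[0] * K for _ in docs]       # occurrences in doc d assigned to pregister k
    k_feat = [[0] * V for _ in range(K)]  # occurrences of feature v assigned to pregister k
    k_total = [0] * K                     # all occurrences assigned to pregister k
    assign = []                           # current pregister of every feature occurrence

    # Random initial assignment of every feature occurrence to a pregister.
    for d, doc in enumerate(docs):
        zs = []
        for v in doc:
            k = rng.randrange(K)
            zs.append(k)
            doc_k[d][k] += 1
            k_feat[k][v] += 1
            k_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_k[d][k] -= 1
                k_feat[k][v] -= 1
                k_total[k] -= 1
                # Resample in proportion to the collapsed conditional.
                weights = [
                    (doc_k[d][j] + alpha) * (k_feat[j][v] + beta) / (k_total[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = k
                doc_k[d][k] += 1
                k_feat[k][v] += 1
                k_total[k] += 1

    # Smoothed point estimates of both sets of distributions.
    theta = [[(c + alpha) / (len(doc) + K * alpha) for c in doc_k[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (k_total[k] + V * beta) for c in k_feat[k]] for k in range(K)]
    return theta, phi


def prototypical_docs(theta, k, n=2):
    """Indices of the n documents with the largest share of pregister k."""
    return sorted(range(len(theta)), key=lambda d: theta[d][k], reverse=True)[:n]


# Toy corpus: each document is a bag of feature ids indexing FEATURES.
docs = [
    [0, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [2, 3, 3, 2, 2, 3],
    [3, 2, 2, 3],
    [0, 1, 2, 3, 0, 2],
]
theta, phi = lda_gibbs(docs, n_features=len(FEATURES), n_pregisters=2)
for k in range(2):
    print("pregister", k, "prototypical documents:", prototypical_docs(theta, k))
```

Each row of `theta` is a probability distribution over pregisters for one document, which is what allows a single document to mix several pregisters. Each row of `phi` is a distribution over surface features for one pregister.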

Featured Bachelor's and Master's Theses

Boeke, S. (2021). Funktionen des Vorgangspassivs im Deutschen. B.A. Germanistische Linguistik. IdSL, HU Berlin

Reiß, Pauline (2024). Untersuchung von Genitiv- und Präpositionalattributen als ein Registerphänomen des Deutschen: Eine Korpusstudie und Implementierung in HPSG. M.A. Linguistik. IdSL, HU Berlin

Contact