Prof. Dr. phil. Roland Schäfer

Friedrich-Schiller-Universität Jena

Institut für Germanistische Sprachwissenschaft

I’m a linguist focussing on German morphosyntax in written language (including non-standard written language) as well as the grammar-graphemics interface. I hold a ‘venia legendi’ for German Linguistics and Theoretical Linguistics from the Faculty of Language Sciences at Humboldt-Universität zu Berlin. As a ‘Privatdozent’, I’m a member of the faculty.

My linguistic approach is cognitively oriented, theory-driven, and strongly empirical. I use corpus-linguistic and experimental methods. I also have a strong interest in statistical methods, epistemology, and methods of large-scale data analysis. Furthermore, I’m the principal creator of a suite of very large web corpora (COW), which researchers can access at webcorpora.org. From 2015 to 2018, I worked on my own DFG-funded project about the grammar on the German web at Freie Universität Berlin. I was visiting professor for German Grammar at Freie Universität Berlin in 2016 and from 2018 to 2019.

Finally, I have a strong interest in teaching methodology and the education of future schoolteachers of German, focussing on the role of linguistic knowledge in the acquisition of educated language and register awareness. I have broad teaching experience in German Linguistics and English Linguistics as well as General/Theoretical Linguistics and Applied Computational Linguistics.

Projects

A04 Building register into the architecture of language – an HPSG account

Contact

https://orcid.org/0000-0003-3233-7874

Publications & Presentations

    Publications

  • Schäfer, Roland  (2024) Between syntax and morphology: German noun+verb units  In: Glossa: a journal of general linguistics [DOI] [ViVo]
    We show that graphemic variation—at least in some writing systems—can be analysed in terms of grammatical variation given a usage-based probabilistic view of the grammar-graphemics interface. Concretely, we examine a type of noun+verb unit in German, which can be written as one word or two. We argue that the variation in writing is rooted in the units’ ambiguous status in between morphology (one word) and syntax (two words). The major influencing factors are shown to be the semantic relation between the noun and the verb (argument or oblique relation) and the morphosyntactic context. In prototypically nominal contexts, a reinterpretation of the unit as a noun+noun compound is facilitated, which favours spelling as one word, while in prototypically verbal contexts, a syntactic realisation and consequently spelling as two words is preferred. We report the results of two large-scale corpus studies and a controlled production experiment to corroborate our analysis.
  • Pescuma, Valentina Nicole; Serova, Dina; Lukassek, Julia; Sauermann, Antje; Schäfer, Roland; Adli, Aria; Bildhauer, Felix; Egg, Markus; Hülk, Kristina; Ito, Aine; Jannedy, Stefanie; Kordoni, Valia; Kühnast, Milena; Kutscher, Silvia; Lange, Robert; Lehmann, Nico; Liu, Mingya; Lütke, Beate; Maquate, Katja; Mooshammer, Christine; Mortezapour, Vahid; Müller, Stefan; Norde, Muriel; Pankratz, Elizabeth; Patarroyo, Angela Giovanna; Plesca, Ana-Maria; Ronderos, Camilo R.; Rotter, Stephanie; Sauerland, Uli; Schulte, Britta; Schüppenhauer, Gediminas; Sell, Bianca Maria; Solt, Stephanie; Terada, Megumi; Tsiapou, Dimitra; Verhoeven, Elisabeth; Weirich, Melanie; Wiese, Heike; Zaruba, Kathy; Zeige, Lars Erik; Lüdeling, Anke; Knoeferle, Pia; Schnelle, Gohar  (2023) Situating language register across the ages, languages, modalities, and cultural aspects: Evidence from complementary methods  In: Frontiers in Psychology [DOI] [PDF] [ViVo]
    In the present review paper by members of the collaborative research center ‘Register: Language Users’ Knowledge of Situational-Functional Variation’ (CRC 1412), we assess the pervasiveness of register phenomena across different time periods, languages, modalities, and cultures. We define ‘register’ as recurring variation in language use depending on the function of language and on the social situation. Informed by rich data, we aim to better understand and model the knowledge involved in situation- and function-based use of language register. In order to achieve this goal, we are using complementary methods and measures. In the review, we start by clarifying the concept of ‘register’, by reviewing the state of the art, and by setting out our methods and modeling goals. Against this background, we discuss three key challenges, two at the methodological level and one at the theoretical level: 1. To better uncover registers in text and spoken corpora, we propose changes to established analytical approaches. 2. To tease apart between-subject variability from the linguistic variability at issue (intra-individual situation-based register variability), we use within-subject designs and the modeling of individuals’ social, language, and educational background. 3. We highlight a gap in cognitive modeling, viz. modeling the mental representations of register (processing), and present our first attempts at filling this gap. We argue that the targeted use of multiple complementary methods and measures supports investigating the pervasiveness of register phenomena and yields comprehensive insights into the cross-methodological robustness of register-related language variability. These comprehensive insights in turn provide a solid foundation for associated cognitive modeling.

    Presentations

  • Schäfer, Roland  (2020) Grammatische Variation zwischen Individuen und Situationen: Perspektiven für Linguistik und Bildungsspracherwerb  In: Humboldt-Universität zu Berlin: Kolloquium Syntax und Semantik (2020) [ViVo]
  • Schäfer, Roland; Bildhauer, Felix  (2020) Beyond Multidimensional Analysis: Probabilistic Register Induction for Large Corpora  In: Humboldt-Universität zu Berlin: Kolloquium Syntax und Semantik (2020) [ViVo]
    The analysis of the register in which a corpus document is written is prominently associated with Biber’s (1988; 1995) Multidimensional Analysis (MDA). We present an approach superficially similar to MDA but which solves three major conceptual problems of MDA by using Bayesian inference to uncover registers or, rather, potential registers. First, in Biber’s MDA, registers are associated discretely with documents, and each document can only instantiate one specific register, whereas we allow registers to be associated probabilistically with documents, and we allow mixtures of registers in single documents. Given that many linguistic phenomena are now understood as being probabilistic in nature (cf. Schäfer 2018), we suggest that this is a much more realistic assumption. Second, we assume the surface features to be associated with registers in a probabilistic manner for similar reasons. Third, we do not use a catalogue of registers assumed to exist a priori, but instead we merely infer potential registers (pregisters) via clusters of surface features. The question of which pregisters actually correspond to registers with an identifiable situational communicative setting will be dealt with in a future stage of the project using theory-driven evaluation and experimental validation. Given our assumptions about the nature of the mapping between features and pregisters and pregisters and documents, an obvious algorithm to use is Bayesian inference in the form of Latent Dirichlet Allocation (LDA; Blei et al. 2003; Blei 2012) as used in Topic Modelling. In our approach, we deal with pregisters instead of topics and with distributions of lexico-grammatical surface features instead of lexical words. The LDA algorithm otherwise performs an exactly parallel inference task. We first show how we extended the COReX feature extraction framework (Bildhauer & Schäfer in prep.) developed at FU Berlin and the IDS Mannheim in order to provide a large enough number of features for the LDA algorithm to work. We then present first results and discuss how we tuned the LDA algorithm and the feature set to lead to interpretable results. In order to be able to interpret the pregisters found by LDA, we extract the documents which most strongly instantiate the inferred pregisters. We introduce the PreCOX20 sub-corpus of the DECOW German web corpus, in which those prototypical documents are collected for further analysis w.r.t. their situational communicative setting. References: Biber, D. (1988). Variation across Speech and Writing. CUP. Biber, D. (1995). Dimensions of Register Variation: A Cross-Linguistic Comparison. CUP. Bildhauer, F. & R. Schäfer (in prep.). Automatic register annotation and alternation modelling. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM 55(4), 77-84. Blei, D. M., A. Y. Ng & M. I. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022. Schäfer, R. (2018). Probabilistic German Morphosyntax. Habilitation thesis. HU Berlin.
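
    Illustration

The idea in the abstract above, LDA with pregisters in place of topics and lexico-grammatical surface features in place of lexical words, can be illustrated with a toy sketch. This is not the COReX pipeline and not the authors' tuned setup. The feature names, the toy documents, the hyperparameters, and the function name `induce_pregisters` are all hypothetical. Collapsed Gibbs sampling is used here only because it is a standard, compact way to fit LDA; the abstract does not say which inference procedure was used.

```python
import random

# Hypothetical surface features; real feature sets are far larger.
FEATURES = ["passive", "nominalisation", "genitive",
            "2nd_person", "emoticon", "clitic"]


def induce_pregisters(docs, n_pregisters, n_features,
                      alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of documents, each a list of integer feature ids
          (one id per feature occurrence in the document).
    Returns (theta, phi): theta[d][k] is the share of pregister k in
    document d; phi[k][v] is the probability of feature v in pregister k.
    """
    rng = random.Random(seed)
    K, V = n_pregisters, n_features
    doc_k = [[0] * K for _ in docs]        # occurrences in doc d assigned to k
    k_feat = [[0] * V for _ in range(K)]   # occurrences of feature v assigned to k
    k_total = [0] * K                      # all occurrences assigned to k
    assign = []

    # Random initial assignment of every feature occurrence to a pregister.
    for d, doc in enumerate(docs):
        z_d = []
        for v in doc:
            k = rng.randrange(K)
            z_d.append(k)
            doc_k[d][k] += 1
            k_feat[k][v] += 1
            k_total[k] += 1
        assign.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                # Remove the current assignment from the counts ...
                k = assign[d][i]
                doc_k[d][k] -= 1
                k_feat[k][v] -= 1
                k_total[k] -= 1
                # ... and resample it given all other assignments.
                weights = [
                    (doc_k[d][j] + alpha)
                    * (k_feat[j][v] + beta) / (k_total[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = k
                doc_k[d][k] += 1
                k_feat[k][v] += 1
                k_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [[(doc_k[d][k] + alpha) / (len(doc) + K * alpha)
              for k in range(K)] for d, doc in enumerate(docs)]
    phi = [[(k_feat[k][v] + beta) / (k_total[k] + V * beta)
            for v in range(V)] for k in range(K)]
    return theta, phi


# Toy documents: two lean towards features 0-2, two towards 3-5, one is mixed.
docs = [
    [0, 0, 1, 1, 2, 2, 0, 1],
    [0, 1, 2, 2, 1, 0, 0],
    [3, 3, 4, 5, 5, 4, 3],
    [4, 4, 5, 3, 3, 5, 5, 4],
    [0, 1, 2, 3, 4, 5, 0, 3],
]

theta, phi = induce_pregisters(docs, n_pregisters=2, n_features=len(FEATURES))

for d, mix in enumerate(theta):
    print(f"document {d}: pregister mixture {[round(p, 2) for p in mix]}")
```

Each row of `theta` is a probability distribution over pregisters, so a single document can instantiate a mixture of pregisters rather than exactly one. Sorting documents by a column of `theta` would give the documents that most strongly instantiate a given pregister, which is the kind of selection the abstract describes for collecting prototypical documents.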