Several types of data beyond traditional publications are used and created throughout the
CRC. We are committed to publishing our research data and software wherever possible under an
Open Access license.
We also list here research data and research software produced by others to
appropriately acknowledge their work. By sharing our resources, we want to enable reproducibility
and re-use by the research community.
» Corpora (102)
» Experimental research data (3)
» Documents/Other (5)
» Software (36)
↑ ()
Kommunikanten-Pronomen als Mittel der situierten Variation in Erklärungen
Type: Contribution in Edited volume
Status:
Details: [ViVo]
↑ Corpora (102)
A Grammatically Annotated Corpus of the Old Latvian Postil of Georg Mancelius
Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]
This grammatically annotated corpus aims to facilitate linguistic research on Old Latvian based on the Postil of Georg Mancelius from the year 1654. The corpus is divided into two subcorpora, "pericopes" and "homilies", to make register-related research easier.
The pericopes were annotated using SIL Toolbox and converted to be used in the search-tool ANNIS using the conversion tool PEPPER.
Three formats are provided in this release: 1. the Toolbox files, 2. the transitional Excel files and 3. a zipped folder to be imported into ANNIS.
Created in the project B02, Emergence and change of registers: The case of Lithuanian and Latvian of the CRC 1412 "Register" (funded by the Deutsche Forschungsgemeinschaft: DFG, German Research Foundation: 416591334).
BeDiaCo - Berlin Dialogue Corpus
Type: Corpus
Status: re-used
Details: [ViVo] [URL]
The corpus consists of acoustic recordings of spontaneous dialogues of German native speakers with both task-free and task-based parts and additional read word lists.
Malte Belz, Christine Mooshammer, Alina Zöllner, and Lea‑Sophie Adam. Berlin Dialogue Corpus (BeDiaCo): Version 2, 2021.
BeMeCo v1
Type: Corpus
Status: re-used
Details: [ViVo] [URL]
1, 2021. URL https://rs.cms.hu‑berlin.de/phon
BiNoKo V. 1.0 Birgitta-Notker-Korpus
Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]
The Birgitta-Notker-Korpus (BiNoKo) is a resource dedicated to comparative research on historical registers. The corpus comprises two sources: The Old High German Book of Psalms by Notker III of Saint Gall and the Old Swedish Revelations of Birgitta of Sweden. The subcorpus of Birgitta's Revelations and the subcorpus of Notker's Psalms are available as separate zip files. The corpus format is ANNIS. For local installation, use ANNIS Desktop. The documentation for ANNIS can be found here:
https://corpus-tools.org/annis/
https://corpus-tools.org/annis/download.html
The guidelines (see 'related identifiers') are published in REALIS 2/3 and include information about the corpus design, annotation layers, metadata, and annotation principles.
Bonner Totenbuchprojekt
Type: Corpus
Status: used
Details: [ViVo] [URL]
British National Corpus (BNC)
Type: Corpus
Status: used
Details: [ViVo] [URL]
CaeMmCom
Type: Corpus
Status: used
Details: [ViVo] [URL]
Corpus of Ancient Egyptian Multimodal Communication.
CAEMmCom – Corpus of Ancient Egyptian Multimodal Communication: Getting Started [pdf]
2020 Multimodale graphische Kommunikation im pharaonischen Ägypten: Entwurf einer Analysemethode, Lingua Aegyptia 28: 81-116.
2020 (with Rebecca Döhl and Jens-Martin Loebel) CaeMmCom – Corpus altaegyptischer multimodaler Communication. Der Aufbau einer multimodalen Datensammlung altägyptischer Kommunikate, Zeitschrift für digitale Geisteswissenschaft, [open access].
CoACan: Corpus del español académico de Canarias (1.0)
Type: Corpus
Status:
Details: [ViVo] [URL]
COACAN (Corpus of Academic Canarian Spanish) is a linguistic corpus that documents and analyzes varieties of Spanish spoken in the Canary Islands, specifically within the university academic context. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" of the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.
This corpus was developed at the University of La Laguna with students from three specific degree programs in October-November 2024:
• Physical Activity and Sports Sciences (CAFYD)
• Hispanic Linguistics and Literature
• Education (Pedagogy)
The corpus focuses on spontaneous speech from Canarian university students, capturing both dialectal features of Canarian Spanish and academic/colloquial registers used in higher education settings, as well as their relationship with insular territory.
CoCoYum
Type: Corpus
Status: created
Details: [ViVo] [URL]
Size: 159.00 tokens
Description: natural language production (spoken) and elicited data
Features: morpheme, glosses, translations, comments
Access: CC-BY-NC-ND
The Collective Corpus of Yucatec Maya (CoCoYum) is a collection of data on the Yucatec Maya language contributed by various researchers. It contains transcriptions of recordings (e.g. storytelling, dialogue, public events), written data, and elicited data. The corpus will grow over time as new data are collected and as further researchers add their data.
CoMeCaYo: Corpus Mediático de Canarias en YouTube (1.0)
Type: Corpus
Status:
Details: [ViVo] [URL]
CoPaCaYo: Corpus del Parlamento de Canarias en YouTube (1.0)
Type: Corpus
Status:
Details: [ViVo] [URL]
The corpus spans over 14 years of video content from major Canarian governmental institutions, parliamentary sessions, and official channels, providing a valuable resource for linguistic research, natural language processing, and computational linguistics studies focusing on Spanish language evolution and regional institutional discourse patterns.
CoParCan: Corpus del Parlamento de Canarias (1.0)
Type: Corpus
Status:
Details: [ViVo] [URL]
Corpus of Non-Native Addressee Register (CoNNAR). Version 1
Type: Corpus
Status: created
Details: [ViVo] [URL]
Czech corpus Koditex
Type: Corpus
Status: re-used
Details: [ViVo] [URL]
Compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech.
Zasina, A. J. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Řehořková, A.: Koditex: A corpus of diversified texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: www.korpus.cz
DNam corpus + DNam Wenker corpus
Type: Corpus
Status: used
Details: [ViVo] [DOI] [URL]
ENCOW16
Type: Corpus
Status: re-used
Details: [ViVo] [URL]
Engines: NoSkE and RStudio Server
FOLK excerpt
Type: Corpus
Status: used
Details: [ViVo]
Size: 194,716 tokens
Description: conversations in various situations
Features: rich metadata, lemma, POS, speech unit segmentation, some dependencies
Access: internal
Falko Corpus
Type: Corpus
Status: re-used
Details: [ViVo] [URL]
Further information about the Falko-project: https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko
GeWISS excerpt
Type: Corpus
Status: used
Details: [ViVo] [URL]
GeWiss is a research project on spoken academic language. It provides a multilingual (German/English/Polish/Italian) corpus of audio recordings and transcriptions of academic communication as an empirical foundation for comparative research.
To this end, the GeWiss corpus focuses on two main genres of spoken academic language:
- talks including discussions, and
- oral exams,
and it explicitly distinguishes between L1 and L2 subcorpora. The corpus is continuously enlarged and developed.
GermaNet
Type: Corpus
Status: used
Details: [ViVo] [URL]
GermaParl Corpus of Plenary Protocols
Type: Corpus
Status: used
Details: [ViVo] [DOI] [URL]
Icelandic Parsed Historical Corpus (IcePaHC)
Type: Corpus
Status: used
Details: [ViVo] [URL]
Kobalt_RST: Die Annotation von rhetorischen Strukturen im Kobalt-DaF-Korpus
Type: Corpus
Status: created
Details: [ViVo] [DOI]
The Kobalt-DaF corpus is a systematically collected and deeply annotated corpus of learner German. It contains 80 argumentative German texts written by L1 speakers of German and by learners of German with various L1s. This repository makes an additional annotation of the Kobalt-DaF corpus for rhetorical structure freely available. It contains: (1) a description of the annotation process (annotation framework, guidelines, and procedure); (2) the annotated rs3 files.
*Version notes: So far, only the texts by the Chinese learners of German and by the German L1 speakers (40 texts in total) are available. Annotation of the remaining texts will follow soon.
*The annotation work was funded by the Chinese Scholarship Council and the Deutsche Forschungsgemeinschaft (DFG) – SFB 1412, 416591334.
Lang*Reg: A multi-lingual corpus of intra-individual variation across situations
Type: Corpus
Status: created
Details: [ViVo] [DOI]
Size: 36 hours
Description: same speakers varied by mode, acquaintance, professionalism, and expertise
Features: transcription, syntactic segmentation, normalization, token, glossing or POS-tags, some syntax
Access: transcription or annotation in progress; CC-BY-NC-ND
Lithuanian Corpus
Type: Corpus
Status: re-used
Details: [ViVo]
The text files used in the research were generated from the facsimiles.
Potsdam Commentary Corpus
Type: Corpus
Status: used
Details: [ViVo] [URL]
[Bourgonje & Stede 2020] Bourgonje, Peter and Stede, Manfred (2020). The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing. Proc. of the Language Resources and Evaluation Conference (LREC), Marseille.
PreCOXX25: Register-annotated German webcorpus
Type: Corpus
Status: re-used
Details: [ViVo]
Engines: NoSkE and RStudio Server
Prestudy "situational context" in Czech
Type: Corpus
Status: re-used
Details: [ViVo]
Ramsès Project
Type: Corpus
Status: used
Details: [ViVo] [URL]
ReFlexAE
Type: Corpus
Status: created
Details: [ViVo]
Russian National Corpus
Type: Corpus
Status: used
Details: [ViVo] [URL]
The Russian National Corpus is a representative collection of texts in Russian, comprising about 1.5 billion tokens and supplemented with linguistic annotation and search tools.
SENIE Corpus
Type: Corpus
Status: re-used
Details: [ViVo] [URL]
of Latvia. http://senie.korpuss.lv/toc.jsp
Andronova, Everita (2007). The Corpus of Early Written Latvian: current state and future tasks. Proceedings of the Corpus Linguistics Conference. CL2007. University of Birmingham, UK. 27-30 July 2007. Edited by Matthew Davies, Paul Rayson, Susan Hunston, Pernilla Danielsson. ISSN 1747-9398. (http://ucrel.lancs.ac.uk/publications/CL2007/paper/245_Paper.pdf)
Simulated Zoom-Corpus
Type: Corpus
Status: created
Details: [ViVo]
The grammatically annotated corpus of the pericopes of the Old Lithuanian Postil of Jonas Bretkūnas
Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]
This grammatically annotated corpus aims to facilitate linguistic research on Old Lithuanian based on the Postil of Jonas Bretkūnas from the year 1591. The corpus is divided into two subcorpora, "pericopes" and "homilies", to make register-related research easier.
The pericopes were annotated using SIL Toolbox and converted to be used in the search-tool ANNIS using the conversion tool PEPPER.
Three formats are provided in this release: 1. the Toolbox files, 2. the transitional Excel files and 3. a zipped folder to be imported into ANNIS.
Created in the project B02, Emergence and change of registers: The case of Lithuanian and Latvian of the CRC 1412 "Register" (funded by the Deutsche Forschungsgemeinschaft: DFG, German Research Foundation: 416591334).
The written ReFlexAE corpus (Register Flexibility in Academic Education)
Type: Corpus
Status:
Details: [ViVo]
Thesaurus Linguae Aegyptiae (TLA)
Type: Corpus
Status: used
Details: [ViVo] [URL]
WroDiaCo v2
Type: Corpus
Status: re-used
Details: [ViVo] [URL]
Version 2, 2020. URL https://rs.cms.hu‑berlin.de/phon.
sgs corpus
Type: Corpus
Status: re-used
Details: [ViVo] [URL]
Size: 26 h
Description: free spoken dialogues with interviewer on fictive crime scenario
Features: social metadata, syntax
Access: internal
↑ ()
Experimental Data: A register approach to negative concord vs. negative polarity items in English
Type: Dataset
Status:
Details: [ViVo] [DOI]
The repository contains all related files used for the experimental work reported in the paper "A register approach to negative concord vs. negative polarity items in English". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.
Experimental Data: Bias and modality in conditionals: experimental evidence and theoretical implications
Type: Dataset
Status:
Details: [ViVo] [DOI]
Experimental Data: Comparing comparatives: Appropriateness ratings of synthetic, analytic and double comparatives in American and British English
Type: Dataset
Status:
Details: [ViVo] [DOI]
Experimental Data: Interlocutor relation predicts the formality of the conversation: an experiment in American and British English
Type: Dataset
Status:
Details: [ViVo] [DOI]
The repository contains all related files used for the experimental work reported in the paper "Interlocutor relation predicts the formality of the conversation: an experiment in American and British English". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.
Experimental Data: Modal concord in American and British English: A register-based experimental study
Type: Dataset
Status:
Details: [ViVo] [DOI]
The repository contains all related files used for the experimental work reported in the paper "Modal concord in American and British English: A register-based experimental study". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.
Experimental Data: Counterfactual language, emotion, and perspective: a sentence completion study during the COVID-19 pandemic
Type: Dataset
Status:
Details: [ViVo] [DOI]
Experimental Data: Less formal and more rebellious – An experiment on the social meaning of negative concord in American English
Type: Dataset
Status:
Details: [ViVo] [DOI]
Experimental data: A register approach to negative concord vs. negative polarity items in English
Type: Dataset
Status:
Details: [ViVo] [DOI]
The repository contains all related files used for the experimental work reported in the paper "A register approach to negative concord vs. negative polarity items in English". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.
Lang*Reg: A multi-lingual corpus of intra-speaker variation across situations
Type: Dataset
Status:
Details: [ViVo] [DOI]
The Lang*Reg corpus records intra-speaker variation across languages and different situational-functional contexts, presumed to result in different registers. It has been prepared in the SFB1412 Register with data collections taking place in 2021-2022 for the following languages included in this version: German, Persian, Kurdish, Javanese. The data sets for each language comprise the speech of the same language users in a variety of spoken conversations and one written interaction. A minimum of 12 participants per language traversed a course of 6 situations in which they were asked to produce language in three types of activities: telling a story to a friend, talking freely with various interlocutors (friend, stranger, taxi driver) and engaging in an interview with a (university) professor. Moreover, our design included the storytelling in two modes, which allows for the comparison between spoken and written modes of the same language user.
Lang*Reg has a basic syntactic segmentation (one matrix clause and all its dependent clauses per segment). v0.2.0 includes the data sets with transcriptions, normalizations and tokens for each language as well as additional language-specific annotations such as glosses and syntactic annotations. We prepared each data set also for use with the browser-based search and visualization architecture ANNIS. For further language-specific morpho-syntactic and sociolinguistic annotations, refer to the respective data set description. For an overview of all data set characteristics, please see the corpus documentation in each data set.
↑ Documents/Other (5)
Big Five Inventory (BFI-10)
Type: Document
Status: used
Details: [ViVo] [DOI] [URL]
Rammstedt, B., Kemper, C. J., Klein, M. C., Beierlein, C., & Kovaleva, A. (2014). Big Five Inventory (BFI-10). Zusammenstellung sozialwissenschaftlicher Items und Skalen (ZIS). https://doi.org/10.6102/zis76
Depressions-Angst-Stress Skalen (DASS 21)
Type: Document
Status: used
Details: [ViVo] [DOI] [URL]
© Lovibond, P.F., Lovibond, S.H., Nilges, P. & Essau, C.
Epidemic - Pandemic Impacts Inventory (EPII)
Type: Document
Status: used
Details: [ViVo] [URL]
Interpersonal Reactivity Index (IRI)
Type: Document
Status: used
Details: [ViVo] [DOI]
Skalen zur motivationalen Regulation beim Lernen im Studium (SMR-LS)
Type: Document
Status: used
Details: [ViVo] [DOI] [URL]
↑ Experimental research data (3)
Experimental data: Addressee identification study
Type: Experimental research data
Status: created
Details: [ViVo]
Rating data (on a 9-point scale) on the probable addressee of spoken texts in three conditions: a) entirely Standard German, b) with Namibian-specific lexical features, and c) with non-standard grammatical features. The open-guise method was used to collect the data.
Participants: adults and adolescents in Namibia
Experimental data: Newspaper correction study
Type: Experimental research data
Status: created
Details: [ViVo]
Data containing corrections of Namibian-German vs. Standard German features (lexical, morpho-syntactic, and grammatical) presented in a written mock newspaper article.
Participants: Adults and adolescents in Namibia and Germany
Experimental data: speaker evaluation study
Type: Experimental research data
Status: created
Details: [ViVo]
Ratings (on a 9-point scale) of social meaning (competence and solidarity assessments) and inferences (origin, place of residence) regarding speakers of spoken texts in three conditions: a) entirely Standard German, b) with Namibian-specific lexical features, and c) with non-standard grammatical features, collected using the open-guise method.
Participants: adults and adolescents in Namibia
↑ Software (36)
Stanford Log-linear Part-Of-Speech Tagger
Type: Software publication
Status: used
Details: [ViVo] [URL]
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other tokens), although computational applications generally use more fine-grained POS tags like 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):
Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
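To illustrate what "log-linear" means in this context, the following is a minimal Python sketch, not the Stanford implementation (which is written in Java and uses a much richer feature set and bidirectional inference). Each candidate tag gets a score that is the sum of the weights of the active features, and the scores are normalised with a softmax. The tag set, the feature templates, and all weights below are hypothetical values chosen by hand for the example; a real tagger learns its weights from annotated data.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy tag set (not the tag set of any real tagger).
TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical hand-set weights for (feature, tag) pairs.
# A trained log-linear tagger would estimate these from a corpus.
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("word=the", "DET"): 3.0,
    ("suffix=s", "NOUN"): 1.0,
    ("suffix=s", "VERB"): 0.5,
    ("prev=DET", "NOUN"): 2.0,
    ("prev=NOUN", "VERB"): 1.5,
}


def features(word: str, prev_tag: str) -> List[str]:
    """Extract a few simple features for one token in its left context."""
    lower = word.lower()
    return [f"word={lower}", f"suffix={lower[-1]}", f"prev={prev_tag}"]


def tag_distribution(word: str, prev_tag: str) -> Dict[str, float]:
    """Log-linear model: P(tag | context) is proportional to exp(sum of weights)."""
    feats = features(word, prev_tag)
    scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in feats) for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())  # normalisation constant
    return {t: math.exp(s) / z for t, s in scores.items()}


def greedy_tag(words: List[str]) -> List[Tuple[str, str]]:
    """Tag left to right, always keeping the most probable tag (a simplification)."""
    prev_tag = "<s>"  # sentence-start marker
    tagged = []
    for word in words:
        dist = tag_distribution(word, prev_tag)
        best = max(dist, key=dist.get)
        tagged.append((word, best))
        prev_tag = best
    return tagged


print(greedy_tag(["the", "dogs", "bark"]))
```

With these hand-set weights, the determiner feature decides the first token, and the previous-tag features decide the other two. Greedy left-to-right decoding is used here only to keep the sketch short.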
ANNIS3
Type: Software publication
Status: used
Details: [ViVo] [DOI] [URL]
A web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.
Available from: https://corpus-tools.org/annis/
Documentation: https://corpus-tools.org/annis/documentation.html
Cite as: Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. in: Digital Scholarship in the Humanities 2016 (31). http://dsh.oxfordjournals.org/content/31/1/118
AntConc
Type: Software publication
Status: used
Details: [ViVo] [URL]
ELAN
Type: Software publication
Status: used
Details: [ViVo] [DOI] [URL]
Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands
Lausberg, H., & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, & Computers, 41(3), 841-849. doi:10.3758/BRM.41.3.841.
EXMARaLDA
Type: Software publication
Status: used
Details: [ViVo] [URL]
Schmidt, T. and Wörner, K. (2014). "EXMARaLDA". In: Handbook on Corpus Phonology, pp. 402-419. Oxford University Press.
Field Linguist's Toolbox
Type: Software publication
Status: used
Details: [ViVo] [URL]
HAL-Inria
Type: Software publication
Status: used
Details: [ViVo] [URL]
INCEpTION
Type: Software publication
Status: used
Details: [ViVo] [URL]
The INF project hosts an INCEpTION instance at https://inception.sfb1412.hu-berlin.de (intranet only).
PennController for Internet Based Experiments (IBEX)
Type: Software publication
Status: used
Details: [ViVo] [URL]
Pepper
Type: Software publication
Status: used
Details: [ViVo] [URL]
A highly extensible platform for converting and manipulating linguistic data between an unbounded set of formats. Pepper can be used stand-alone via a command line interface, or be integrated as an API into other software products.
Available from: https://corpus-tools.org/pepper/
Documentation: https://corpus-tools.org/pepper/userGuide.html
Cite as: F. Zipser & L. Romary (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/
RStudio Server
Type: Software publication
Status: used
Details: [ViVo] [URL]
RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.
TextSTAT
Type: Software publication
Status: used
Details: [ViVo] [URL]
ToolboxTools
Type: Software publication
Status: created
Details: [ViVo] [URL]
This project provides tools to read and write files in the Toolbox format.
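As a rough illustration of what reading and writing the Toolbox format involves, the following Python sketch parses and serialises text in the backslash-marker layout used by Toolbox files (one `\marker value` field per line). It is not the ToolboxTools API: the function names `parse_toolbox` and `write_toolbox`, the choice of `lx` as the record marker, and the sample entries are assumptions made for this example only.

```python
def parse_toolbox(text, record_marker="lx"):
    """Parse backslash-marker text into a list of records.

    Each record is a list of (marker, value) pairs. A new record starts
    whenever the record marker (assumed here to be 'lx') is seen.
    """
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate records visually
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker == record_marker or current is None:
                current = []
                records.append(current)
            current.append((marker, value.strip()))
        elif current:
            # A line without a marker continues the previous field.
            marker, value = current[-1]
            current[-1] = (marker, (value + " " + line.strip()).strip())
    return records


def write_toolbox(records):
    """Serialise records back to backslash-marker text."""
    blocks = []
    for record in records:
        lines = [f"\\{marker} {value}".rstrip() for marker, value in record]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Made-up sample data: lexeme (lx), part of speech (ps), English gloss (ge).
sample = "\\lx suns\n\\ps n\n\\ge dog\n\n\\lx kaķis\n\\ps n\n\\ge cat\n"

records = parse_toolbox(sample)
print(records[0])
# [('lx', 'suns'), ('ps', 'n'), ('ge', 'dog')]
```

In this sketch any header line that precedes the first record marker ends up in a record of its own, and continuation lines are joined with a single space, so the round trip is only lossless for simple files like the sample.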
Tools rPraat and mPraat - Interfacing Phonetic Analyses with Signal Processing
Type: Software publication
Status: used
Details: [ViVo] [DOI]
emuR
Type: Software publication
Status: used
Details: [ViVo] [URL]
flairNLP / flair
Type: Software publication
Status: used
Details: [ViVo] [URL]
Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
spaCy
Type: Software publication
Status: used
Details: [ViVo] [URL]