Resources - SFB 1412

A Grammatically Annotated Corpus of the Old Latvian Postil of Georg Mancelius

Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]

This grammatically annotated corpus aims to facilitate linguistic research on Old Latvian based on the Postil of Georg Mancelius from the year 1654. The corpus is divided into two subcorpora, "pericopes" and "homilies", to make register-related research easier.

The pericopes were annotated using SIL Toolbox and converted for use in the search tool ANNIS with the conversion tool PEPPER.

Three formats are provided in this release: 1. the Toolbox files, 2. the transitional Excel files and 3. a zipped folder to be imported into ANNIS.
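For readers who want to inspect the Toolbox files directly, the following is a minimal sketch of reading Toolbox-style "standard format" text, in which each field starts with a backslash marker. The marker names used here (\ref, \tx, \ge, \ps) and the sample record are assumptions for illustration only; the markers actually used in this release may differ, and this is not the project's own conversion pipeline.

```python
def parse_toolbox(text, record_marker="ref"):
    """Split Toolbox-style text into records.

    Each record is a list of (marker, value) pairs. A new record starts
    whenever the (assumed) record marker is seen. Fields that appear before
    the first record marker, such as a file header, are skipped.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # "\tx some words" -> marker "tx", value "some words"
            marker, _, value = line[1:].partition(" ")
            if marker == record_marker:
                current = []
                records.append(current)
            if current is not None:
                current.append((marker, value.strip()))
        elif current:
            # A line without a marker continues the previous field.
            prev_marker, prev_value = current[-1]
            current[-1] = (prev_marker, (prev_value + " " + line).strip())
    return records


# Hypothetical sample; not taken from the corpus.
SAMPLE = r"""
\_sh v3.0  400  Text
\ref 001
\tx word1 word2
\ge gloss1 gloss2
\ps N V

\ref 002
\tx word3
\ge gloss3
\ps ADV
"""

records = parse_toolbox(SAMPLE)
for rec in records:
    print(dict(rec))
```

Each record is kept as an ordered list of pairs rather than a dictionary so that repeated markers within one record are not lost.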

Created in project B02, "Emergence and change of registers: The case of Lithuanian and Latvian", of the CRC 1412 "Register" (funded by the Deutsche Forschungsgemeinschaft: DFG, German Research Foundation: 416591334).

Used by: B02

BeDiaCo - Berlin Dialogue Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

The corpus consists of acoustic recordings of spontaneous dialogues of German native speakers with both task-free and task-based parts and additional read word lists.

Malte Belz, Christine Mooshammer, Alina Zöllner, and Lea-Sophie Adam. Berlin Dialogue Corpus (BeDiaCo): Version 2, 2021.

Used by: C06

BeMeCo v1

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Alina Zöllner, Christine Mooshammer, and Silke Hamann. Berlin Menutask Corpus (BeMeCo): Version 1, 2021. URL https://rs.cms.hu-berlin.de/phon

Used by: C06

BiNoKo V. 1.0 Birgitta-Notker-Korpus

Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]

The Birgitta-Notker-Korpus (BiNoKo) is a resource dedicated to comparative research on historical registers. The corpus comprises two sources: The Old High German Book of Psalms by Notker III of Saint Gall and the Old Swedish Revelations of Birgitta of Sweden. The subcorpus of Birgitta's Revelations and the subcorpus of Notker's Psalms are available as separate zip files. The corpus format is ANNIS. For local installation, use ANNIS Desktop. The documentation for ANNIS can be found here:
https://corpus-tools.org/annis/
https://corpus-tools.org/annis/download.html

The guidelines (see 'related identifiers') are published in REALIS 2/3 and include information about the corpus design, annotation layers, metadata, and annotation principles.

Used by: B04

Bislama Spoken Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: A02

Bonner Totenbuchprojekt

Type: Corpus
Status: used
Details: [ViVo] [URL]

For more than 1,500 years in ancient Egypt, the Book of the Dead served as a store of knowledge for the deceased, placed in the tomb with them in written form.

Used by: B03

British National Corpus (BNC)

Type: Corpus
Status: used
Details: [ViVo] [URL]

The British National Corpus (BNC) was originally created by Oxford University Press between the 1980s and the early 1990s. It contains 100 million words of text from a wide range of genres (e.g. spoken, fiction, magazines, newspapers, and academic).

Used by: A05

C02-Corpus

Type: Corpus
Status:
Details: [ViVo]

Used by: C06

CaeMmCom

Type: Corpus
Status: used
Details: [ViVo] [URL]

Corpus of Ancient Egyptian Multimodal Communication.

CAEMmCom – Corpus of Ancient Egyptian Multimodal Communication: Getting Started [pdf]

2020 Multimodale graphische Kommunikation im pharaonischen Ägypten: Entwurf einer Analysemethode, Lingua Aegyptia 28: 81-116.
2020 (with Rebecca Döhl and Jens-Martin Loebel) CaeMmCom – Corpus altaegyptischer multimodaler Communication. Der Aufbau einer multimodalen Datensammlung altägyptischer Kommunikate, Zeitschrift für digitale Geisteswissenschaft, [open access].

Used by: B03

CoACan: Corpus del español académico de Canarias (1.0)

Type: Corpus
Status:
Details: [ViVo] [URL]

COACAN (Corpus of Academic Canarian Spanish) is a linguistic corpus that documents and analyzes varieties of Spanish spoken in the Canary Islands, specifically within the university academic context. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" of the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.

This corpus was developed at the University of La Laguna with students from three specific degree programs in October-November 2024:

• Physical Activity and Sports Sciences (CAFYD)
• Hispanic Linguistics and Literature
• Education (Pedagogy)

The corpus focuses on spontaneous speech from Canarian university students, capturing both dialectal features of Canarian Spanish and academic/colloquial registers used in higher education settings, as well as their relationship with insular territory.

Used by: A09

CoCoYum

Type: Corpus
Status: created
Details: [ViVo] [URL]

Language: Yucatec Maya
Size: 159.00 tokens
Description: natural language production (spoken) and elicited data
Features: morpheme, glosses, translations, comments
Access: CC-BY-NC-ND

The Collective Corpus of Yucatec Maya (CoCoYum) is a collection of data on the Yucatec Maya language contributed by various researchers. It contains transcriptions of recordings (e.g. storytelling, dialogue, public events), written data, and elicited data. The corpus will grow over time as new data are collected and as further researchers add their data.

Used by: A06

CoMeCaYo: Corpus Mediático de Canarias en YouTube (1.0)

Type: Corpus
Status:
Details: [ViVo] [URL]

Comecayo (Corpus mediático de Canarias en YouTube) is a comprehensive corpus of Spanish-language YouTube videos from Canarian media outlets containing timestamped transcriptions. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" within the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin. The corpus spans over 15 years of video content from major Canarian television channels, radio stations, and digital media platforms, providing a valuable resource for linguistic research, natural language processing, and computational linguistics studies focusing on Spanish language evolution and regional media discourse patterns.

Used by: A09

CoPaCaYo: Corpus del Parlamento de Canarias en YouTube (1.0)

Type: Corpus
Status:
Details: [ViVo] [URL]

COPACAYO (Corpus del Parlamento de Canarias en YouTube) is a comprehensive corpus of Spanish-language YouTube videos from Canarian parliamentary and governmental institutions containing timestamped transcriptions. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" within the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.

The corpus spans over 14 years of video content from major Canarian governmental institutions, parliamentary sessions, and official channels, providing a valuable resource for linguistic research, natural language processing, and computational linguistics studies focusing on Spanish language evolution and regional institutional discourse patterns.

Used by: A09

CoParCan: Corpus del Parlamento de Canarias (1.0)

Type: Corpus
Status:
Details: [ViVo] [URL]

Corpus of institutional declarations from the Parliament of the Canary Islands that documents contemporary Canarian political-institutional discourse. This corpus is part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" of the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.

Used by: A09

Corpus of Non-Native Addressee Register (CoNNAR). Version 1

Type: Corpus
Status: created
Details: [ViVo] [URL]

Used by: C06

Czech corpus Koditex

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

A synchronic, representative and reference 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech.

Zasina, A. J. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Řehořková, A.: Koditex: A corpus of diversified texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: www.korpus.cz

Used by: A03

DNam

Type: Corpus
Status:
Details: [ViVo]

Used by: C06

DNam corpus + DNam Wenker corpus

Type: Corpus
Status: used
Details: [ViVo] [DOI] [URL]

The corpus "German in Namibia" ("Deutsch in Namibia", DNam) was created between 2016 and 2021 in the DFG project "NamDeutsch: Die Dynamik des Deutschen im mehrsprachigen Kontext Namibias" ("NamDeutsch: The Dynamics of German in Namibia's Multilingual Context", WI 2155/9-1 and SI 750/4-1, directed by Heike Wiese and Horst Simon in cooperation with Marianne Zappen-Thomson) at the University of Potsdam (until 2019), HU Berlin (since 2019), FU Berlin, and UNAM Windhoek.

Used by: C07

ENCOW16

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Access: webcorpora.org (free access for academic use)
Engines: NoSkE and RStudio Server

Used by: A04, B01

ENCOW16A-NANO

Type: Corpus
Status:
Details: [ViVo]

Used by: B01

Eye-Tracking Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: C03

FOLK excerpt

Type: Corpus
Status: used
Details: [ViVo]

Language: German
Size: 194,716 tokens
Description: conversations in various situations
Features: rich metadata, lemma, POS, speech unit segmentation, some dependencies
Access: internal

Used by: A06

Falko Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

L1- and L2-authored argumentative essays collected in a controlled setting.
Further information about the Falko-project: https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko

Used by: C04, C05

GeWISS excerpt

Type: Corpus
Status: used
Details: [ViVo] [URL]

GeWiss is a research project in spoken academic language. It provides a multilingual (German/English/Polish/Italian) corpus of audio recordings and transcriptions of academic communications, as an empirical foundation for comparative research.

To this end, the GeWiss corpus focusses on two main genres of spoken academic language:

• talks including discussions, and
• oral exams,

and it explicitly distinguishes between L1 and L2 subcorpora. The corpus is enlarged and developed continuously.

Used by: A06

GermaNet

Type: Corpus
Status: used
Details: [ViVo] [URL]

GermaNet is a lexical-semantic word net that relates German nouns, verbs, and adjectives semantically by grouping lexical units that express the same concept into synsets and by defining semantic relations between these synsets. GermaNet has much in common with the English WordNet® and can be viewed as an online thesaurus or a lightweight ontology.

Used by: A01

GermaParl Corpus of Plenary Protocols

Type: Corpus
Status: used
Details: [ViVo] [DOI] [URL]

The GermaParl Corpus has been prepared in the PolMine Project (http://polmine.github.io) and comprises all protocols of plenary sessions in the German Bundestag (1996 - 2016). This version of the corpus is based on plain text documents issued by the German Bundestag. For a period between 2008 and 2010, txt files are not available. To fill the gap, pdf documents were processed. As part of the corpus preparation pipeline, the data has been linguistically annotated (using the TreeTagger) and imported into the Corpus Workbench (CWB). See the GermaParl documentation website (http://polmine.github.io/GermaParl) for further information.

Used by: A01

Icelandic Parsed Historical Corpus (IcePaHC)

Type: Corpus
Status: used
Details: [ViVo] [URL]

The Icelandic Parsed Historical Corpus (IcePaHC) is a project that has built a diachronic corpus with samples of written Icelandic from all periods from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. For historical texts spelling is modernized for phonological change.

Used by: A05

Kobalt_RST: Die Annotation von rhetorischen Strukturen im Kobalt-DaF-Korpus

Type: Corpus
Status: created
Details: [ViVo] [DOI]

The Kobalt-DaF corpus is a systematically collected and deeply annotated corpus of learner German. It contains 80 argumentative texts in German written by German L1 speakers and by learners of German with various L1s. This repository makes an additional annotation of the Kobalt-DaF corpus for rhetorical structures freely available. It contains: (1) a description of the annotation process (annotation framework, guidelines, and procedure); (2) the annotated rs3 files.

*Version notes: So far, only the texts by the Chinese learners of German and the German L1 speakers (40 texts in total) are available. Annotation of the remaining texts will follow soon.

*The annotation work was funded by the China Scholarship Council and the Deutsche Forschungsgemeinschaft (DFG), SFB 1412, 416591334.

Used by: C04

Lang*Reg: A multi-lingual corpus of intra-individual variation across situations

Type: Corpus
Status: created
Details: [ViVo] [DOI]

Language: German, Persian, Yucatec Maya, Kurdish, Javanese
Size: 36 hours
Description: same speakers varied by mode, acquaintance, professionalism, and expertise
Features: transcription, syntactic segmentation, normalization, token, glossing or POS-tags, some syntax
Access: transcription or annotation in progress; CC-BY-NC-ND

Used by: A06, C06

Lithuanian Corpus

Type: Corpus
Status: re-used
Details: [ViVo]

The Old Lithuanian corpus is the postil of Jonas Bretkūnas published as a facsimile edition by Ona Aleknavičienė (Jono Bretkūno Postilė, parengė Ona Aleknavičienė. Vilnius: Lietuvių kalbos institutas, 2005. ISBN 9986-668-96-4).
The text files used in the research were generated from the facsimiles.

Used by: B02

Luther-Bretke-Korpus

Type: Corpus
Status:
Details: [ViVo]

Used by: B04

Morisien Spoken Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: A02

Online production experiment on Imprecision

Type: Corpus
Status: created
Details: [ViVo]

Used by: A05

Penn-Helsinki Corpus of Early Modern English

Type: Corpus
Status:
Details: [ViVo]

Used by: B01

Penn-Helsinki Corpus of Middle English

Type: Corpus
Status:
Details: [ViVo]

Used by: B01

Potsdam Commentary Corpus

Type: Corpus
Status: used
Details: [ViVo] [URL]

The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus). It is annotated with a range of different types of linguistic information.

[Bourgonje & Stede 2020] Bourgonje, Peter and Stede, Manfred (2020). The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing. Proc. of the Language Resources and Evaluation Conference (LREC), Marseille.

Used by: A01

PreCOXX25: Register-annotated German webcorpus

Type: Corpus
Status: re-used
Details: [ViVo]

Access: webcorpora.org (free access for academic use)
Engines: NoSkE and RStudio Server

Used by: A04

Prestudy "situational context" in Czech

Type: Corpus
Status: re-used
Details: [ViVo]

Ibex farm project (Zehr and Schwarz, 2018)

Used by: A03

RUEG-GER

Type: Corpus
Status:
Details: [ViVo]

Used by: C06

Ramsès Project

Type: Corpus
Status: used
Details: [ViVo] [URL]

Morphologically annotated, lemmatized text corpus of Late Egyptian texts (c. 1550 – 1000 BCE) by the University of Liège (http://ramses.ulg.ac.be)

Used by: B03

ReFlexAE

Type: Corpus
Status: created
Details: [ViVo]

The ReFlexAE corpus (Register Flexibility in Academic Education) is a longitudinal corpus of written grammatical explanations built to investigate late register development in the context of higher education. The data are collected through a longitudinal written elicitation study with German L1 students enrolled in programs for primary school teachers. According to a repeated measures design, the longitudinal written study elicits data at three time points: before and after linguistic courses and before graduation. Each participant completes the same test battery comprising four written elicitation tasks, a grammar test, a demographic questionnaire and standard psychological questionnaires assessing personal traits and motivation for learning.

Used by: C05

Russian National Corpus

Type: Corpus
Status: used
Details: [ViVo] [URL]

The Russian National Corpus is a representative collection of texts in Russian, comprising about 1.5 billion tokens and supplemented with linguistic annotation and search tools.

Used by: A03

SENIE Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Latvian texts provided by the SENIE project of the University of Latvia. http://senie.korpuss.lv/toc.jsp

Andronova, Everita (2007). The Corpus of Early Written Latvian: current state and future tasks. Proceedings of the Corpus Linguistics Conference. CL2007. University of Birmingham, UK. 27-30 July 2007. Edited by Matthew Davies, Paul Rayson, Susan Hunston, Pernilla Danielsson. ISSN 1747-9398. (http://ucrel.lancs.ac.uk/publications/CL2007/paper/245_Paper.pdf)

Used by: B02

Simulated Zoom-Corpus

Type: Corpus
Status: created
Details: [ViVo]

Simulated Zoom interaction with choreographed videos (variation of interlocutor persona [formality] and variation of topic / at-stakeness). Simultaneous laboratory recordings of audio and video.

Used by: C02

The GeWiss corpus (Gesprochene Wissenschaftssprache)

Type: Corpus
Status:
Details: [ViVo]

Used by: C05

The grammatically annotated corpus of the pericopes of the Old Lithuanian Postil of Jonas Bretkūnas

Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]

This grammatically annotated corpus aims to facilitate linguistic research on Old Lithuanian based on the Postil of Jonas Bretkūnas from the year 1591. The corpus is divided into two subcorpora, "pericopes" and "homilies", to make register-related research easier.

The pericopes were annotated using SIL Toolbox and converted for use in the search tool ANNIS with the conversion tool PEPPER.

Three formats are provided in this release: 1. the Toolbox files, 2. the transitional Excel files and 3. a zipped folder to be imported into ANNIS.

Created in project B02, "Emergence and change of registers: The case of Lithuanian and Latvian", of the CRC 1412 "Register" (funded by the Deutsche Forschungsgemeinschaft: DFG, German Research Foundation: 416591334).

Used by: B02

The oral ReFlexAE corpus

Type: Corpus
Status:
Details: [ViVo]

Used by: C05, C06

The written ReFlexAE corpus (Register Flexibility in Academic Education)

Type: Corpus
Status:
Details: [ViVo]

Used by: C05

Thesaurus Linguae Aegyptiae (TLA)

Type: Corpus
Status: used
Details: [ViVo] [URL]

Digital text corpus of the ancient Egyptian and Demotic languages, morphosyntactically annotated and lemmatized. Largest corpus of Egyptian texts of different types and periods (c. 2500 BCE – 450 CE).

Used by: B03

WroDiaCo v2

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Sarah Wesolek, Malte Belz, and Christine Mooshammer. Wroclaw Dialogue Corpus (WroDiaCo): Version 2, 2020. URL https://rs.cms.hu-berlin.de/phon.

Used by: C06

sgs corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Language: Persian
Size: 26 h
Description: free spoken dialogues with an interviewer about a fictional crime scenario
Features: social metadata, syntax
Access: internal

Used by: A06

A Grammatically Annotated Corpus of the Old Latvian Postil of Georg Mancelius

Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]

This grammatically annoted corpus aims at facilitating linguistic research on Old Latvian based on the Postil of Georg Mancelius from the year 1654. The corpus is divided into two subcorpora, "pericopes" and "homilies" to make register related research easier.

The pericopes were annotated using SIL Toolbox and converted to be used in the search-tool ANNIS using the conversion tool PEPPER.

Three formats are provided in this release: 1. the Toolbox files, 2. the transitional Excel files and 3. a zipped folder to be imported into ANNIS.

Created in the project B02, Emergence and change of registers: The case of Lithuanian and Latvian of the CRC 1412 "Register" (funded by the Deutsche Forschungsgemeinschaft: DFG, German Research Foundation: 416591334).

Used by: B02

BeDiaCo - Berlin Dialogue Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

The corpus consists of acoustic recordings of spontaneous dialogues of German native speakers with both task-free and task-based parts and additional read word lists.
Malte Belz, Christine Mooshammer, Alina Zöllner, and Lea‑Sophie Adam. Berlin Dialogue Corpus
(BeDiaCo): Version 2, 2021.

Used by: C06

BeMeCo v1

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

lina Zöllner, Christine Mooshammer, and Silke Hamann. Berlin Menutask Corpus (BeMeCo): Version
1, 2021. URL https://rs.cms.hu‑berlin.de/phon

Used by: C06

BiNoKo V. 1.0 Birgitta-Notker-Korpus

Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]

The Birgitta-Notker-Korpus (BiNoKo) is a resource dedicated to comparative research on historical registers. The corpus comprises two sources: The Old High German Book of Psalms by Notker III of Saint Gall and the Old Swedish Revelations of Birgitta of Sweden. The subcorpus of Birgitta's Revelations and the subcorpus of Notker's Psalms are available as separate zip files. The corpus format is ANNIS. For local installation, use ANNIS Desktop. The documentation for ANNIS can be found here:
https://corpus-tools.org/annis/
https://corpus-tools.org/annis/download.html

The guidelines (see 'related identifiers') are published in REALIS 2/3 and include information about the corpus design, annotation layers, meta data, and annotation principles.

Used by: B04

Bislama Spoken Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: A02

Bonner Totenbuchprojekt

Type: Corpus
Status: used
Details: [ViVo] [URL]

Das Totenbuch bildete im Alten Ägypten über 1500 Jahre hinweg einen Wissensschatz für den Verstorbenen, der ihm in Schriftform mit ins Grab gegeben wurde.

Used by: B03

British National Corpus (BNC)

Type: Corpus
Status: used
Details: [ViVo] [URL]

The British National Corpus (BNC) was originally created by Oxford University press in the 1980s - early 1990s, and it contains 100 million words of text from a wide range of genres (e.g. spoken, fiction, magazines, newspapers, and academic).

Used by: A05

C02-Corpus

Type: Corpus
Status:
Details: [ViVo]

Used by: C06

CaeMmCom

Type: Corpus
Status: used
Details: [ViVo] [URL]

Corpus of Ancient Egyptian Multimodal Communication.

CAEMmCom – Corpus of Ancient Egyptian Multimodal Communication: Getting Started [pdf]

2020 Multimodale graphische Kommunikation im pharaonischen Ägypten: Entwurf einer Analysemethode, Lingua Aegyptia 28: 81-116.
2020 (mit Rebecca Döhl und Jens-Martin Loebel) CaeMmCom – Corpus altaegyptischer multimodaler Communication. Der Aufbau einer multimodalen Datensammlung altägyptischer Kommunikate, Zeitschrift für digitale Geisteswissenschaft, [open access].

Used by: B03

CoACan: Corpus del español académico de Canarias (1.0)

Type: Corpus
Status:
Details: [ViVo] [URL]

COACAN (Corpus of Academic Canarian Spanish) is a linguistic corpus that documents and analyzes varieties of Spanish spoken in the Canary Islands, specifically within the university academic context. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" of the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.

This corpus was developed at the University of La Laguna with students from three specific degree programs in October-November 2024:

• Physical Activity and Sports Sciences (CAFYD)
• Hispanic Linguistics and Literature
• Education (Pedagogy)

The corpus focuses on spontaneous speech from Canarian university students, capturing both dialectal features of Canarian Spanish and academic/colloquial registers used in higher education settings, as well as their relationship with insular territory.

Used by: A09

CoCoYum

Type: Corpus
Status: created
Details: [ViVo] [URL]

Language: Yucatec, Maya
Size: 159.00 tokens
Description: natural language production (spoken) and elicited data
Features: morpheme, glosses, translations, comments
Access: CC-BY-NC-ND

The Collective Corpus of Yucatec Maya (CoCoYum) is a collection of data from various researchers about the Yucatec Mayan language. It contains transcriptions of recordings (e.g. story telling, dialogue, public events), written data as well as elicited data. The corpus will be enlarged in time with fresh data collections and when further researchers add their data to the corpus.

Used by: A06

CoMeCaYo: Corpus Mediático de Canarias en YouTube (1.0)

Type: Corpus
Status:
Details: [ViVo] [URL]

Comecayo (Corpus mediático de Canarias en YouTube) is a comprehensive corpus of Spanish-language YouTube videos from Canarian media outlets containing timestamped transcriptions. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" within the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin. The corpus spans over 15 years of video content from major Canarian television channels, radio stations, and digital media platforms, providing a valuable resource for linguistic research, natural language processing, and computational linguistics studies focusing on Spanish language evolution and regional media discourse patterns.

Used by: A09

CoPaCaYo: Corpus del Parlamento de Canarias en YouTube (1.0)

Type: Corpus
Status:
Details: [ViVo] [URL]

COPACAYO (Corpus del Parlamento de Canarias en YouTube) is a comprehensive corpus of Spanish-language YouTube videos from Canarian parliamentary and governmental institutions containing timestamped transcriptions. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" within the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.

The corpus spans over 14 years of video content from major Canarian governmental institutions, parliamentary sessions, and official channels, providing a valuable resource for linguistic research, natural language processing, and computational linguistics studies focusing on Spanish language evolution and regional institutional discourse patterns.

Used by: A09

CoParCan: Corpus del Parlamento de Canarias (1.0)

Type: Corpus
Status:
Details: [ViVo] [URL]

Corpus of institutional declarations from the Parliament of the Canary Islands that documents contemporary Canarian political-institutional discourse. This corpus is part of project A09 "On the interplay between register and sociogeographic variation in Canarian Spanish" of the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of SituationalFunctional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.

Used by: A09

Corpus of Non-Native Addressee Register (CoNNAR). Version 1

Type: Corpus
Status: created
Details: [ViVo] [URL]

Used by: C06

Czech corpus Koditex

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

A synchronic, representative and reference 9‑million‑word corpus (excl. punctuation)
compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech.

Zasina, A. J. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Řehořková, A.: Koditex: A corpus of diversified texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: www.korpus.cz

Used by: A03

DNam

Type: Corpus
Status:
Details: [ViVo]

Used by: C06

DNam corpus + DNam Wenker corpus

Type: Corpus
Status: used
Details: [ViVo] [DOI] [URL]

The corpus "German in Namibia" („Deutsch in Namibia“ –DNam) was created in the period 2016-2021, in the DFG project „NamDeutsch: Die Dynamik des Deutschen im mehrsprachigen Kontext Namibias“ ("NamDeutsch: The Dynamics of German in Namibia's Multilingual Context" – WI 2155/9-1 and SI 750/4-1, directed by Heike Wiese and Horst Simon in cooperation with Marianne Zappen-Thomson) at the University of Potsdam (until 2019) and at HU Berlin (since 2019), at the FU Berlin and at UNAM Windhoek.

Article PDF

Used by: C07

ENCOW16

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Access: webcorpora.org (free access for academic use)
Engines: NoSkE and RStudio Server

Used by: A04, B01

ENCOW16A-NANO

Type: Corpus
Status:
Details: [ViVo]

Used by: B01

Eye-Tracking Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: C03

FOLK excerpt

Type: Corpus
Status: used
Details: [ViVo]

Language: German
Size: 194,716 tokens
Description: conversations in various situations
Features: rich metadata lemma, POS, speech unit segmentation, some dependencies
Access: internal

Used by: A06

Falko Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

L1 and L2- authored argumentative essays collected in a controlled setting.
Further information about the Falko-project: https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko

Used by: C04, C05

GeWISS excerpt

Type: Corpus
Status: used
Details: [ViVo] [URL]

GeWiss is a research project in spoken academic language. It provides a multilingual (German/English/Polish/Italian) corpus of audio recordings and transcriptions of academic communications, as an empirical foundation for comparative research.

To this end, the GeWiss corpus focusses on two main genres of spoken adademic language:

talks including discussions, and
oral exams,

and it explicitly distinguishes between L1 and L2 subcorpora. The corpus is enlarged and developed continuously.

Used by: A06

GermaNet

Type: Corpus
Status: used
Details: [ViVo] [URL]

GermaNet ist ein lexikalisch-semantisches Wortnetz, das deutsche Nomina, Verben und Adjektive semantisch zueinander in Beziehung setzt, indem es lexikalische Einheiten, die dasselbe Konzept ausdrücken, in Synsets zusammenfasst und semantische Relationen zwischen diesen Synsets definiert. GermaNet hat viel mit dem Englischen WordNet® gemeinsam und kann als ein Online-Thesaurus oder als eine Lightweight-Ontologie betrachtet werden.

Used by: A01

GermaParl Corpus of Plenary Protocols

Type: Corpus
Status: used
Details: [ViVo] [DOI] [URL]

The GermaParl Corpus has been prepared in the PolMine Project (http://polmine.github.io) and comprises all protocols of plenary sessions in the German Bundestag (1996 - 2016). This version of the corpus is based on plain text documents issued by the German Bundestag. For a period between 2008 and 2010, txt files are not available. To fill the gap, pdf documents were processed. As part of the corpus preparation pipeline, the data has been linguistically annotated (using the TreeTagger) and imported into the Corpus Workbench (CWB). See the GermaParl documentation website (http://polmine.github.io/GermaParl) for further information.

Used by: A01

Icelandic Parsed Historical Corpus (IcePaHC)

Type: Corpus
Status: used
Details: [ViVo] [URL]

The Icelandic Parsed Historical Corpus (IcePaHC) is a project that has built a diachronic corpus with samples of written Icelandic from all periods from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. For historical texts spelling is modernized for phonological change.

Used by: A05

Kobalt_RST: Die Annotation von rhetorischen Strukturen im Kobalt-DaF-Korpus

Type: Corpus
Status: created
Details: [ViVo] [DOI]

The Kobalt-DaF corpus is a systematically collected and deeply annotated corpus of learner German. It contains 80 German-language argumentative texts written by German L1 speakers and by learners of German with various L1s. This repository makes an additional annotation of the Kobalt-DaF corpus for rhetorical structures freely available. It contains: (1) a description of the annotation process (annotation framework, guidelines, and procedure); (2) the annotated rs3 files.

*Version note: So far, only the texts by the Chinese learners of German and the German L1 speakers (40 texts in total) are available. The annotation of the remaining texts will follow soon.

*The annotation work was funded by the China Scholarship Council and the Deutsche Forschungsgemeinschaft (DFG) – SFB 1412, 416591334.

Used by: C04

Lang*Reg: A multi-lingual corpus of intra-individual variation across situations

Type: Corpus
Status: created
Details: [ViVo] [DOI]

Language: German, Persian, Yucatec Maya, Kurdish, Javanese
Size: 36 hours
Description: same speakers varied by mode, acquaintance, professionalism, and expertise
Features: transcription, syntactic segmentation, normalization, token, glossing or POS-tags, some syntax
Access: transcription or annotation in progress; CC-BY-NC-ND

Used by: A06, C06

Lithuanian Corpus

Type: Corpus
Status: re-used
Details: [ViVo]

The Old Lithuanian corpus is the postil of Jonas Bretkūnas published as a facsimile edition by Ona Aleknavičienė (Jono Bretkūno Postilė, parengė Ona Aleknavičienė. Vilnius: Lietuvių kalbos institutas, 2005. ISBN 9986-668-96-4).
The text files used in the research were generated from the facsimiles.

Used by: B02

Luther-Bretke-Korpus

Type: Corpus
Status:
Details: [ViVo]

Used by: B04

Morisien Spoken Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: A02

Online production experiment on Imprecision

Type: Corpus
Status: created
Details: [ViVo]

Used by: A05

Penn-Helsinki Corpus of Early Modern English

Type: Corpus
Status:
Details: [ViVo]

Used by: B01

Penn-Helsinki Corpus of Middle English

Type: Corpus
Status:
Details: [ViVo]

Used by: B01

Potsdam Commentary Corpus

Type: Corpus
Status: used
Details: [ViVo] [URL]

The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus). It is annotated with a range of different types of linguistic information.

[Bourgonje & Stede 2020] Bourgonje, Peter and Stede, Manfred (2020). The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing. Proc. of the Language Resources and Evaluation Conference (LREC), Marseille.

Used by: A01

PreCOXX25: Register-annotated German webcorpus

Type: Corpus
Status: re-used
Details: [ViVo]

Access: webcorpora.org (free access for academic use)
Engines: NoSkE and RStudio Server

Used by: A04

Prestudy “situational context” in Czech

Type: Corpus
Status: re-used
Details: [ViVo]

Ibex farm project (Zehr and Schwarz, 2018)

Used by: A03

RUEG-GER

Type: Corpus
Status:
Details: [ViVo]

Used by: C06

Ramsès Project

Type: Corpus
Status: used
Details: [ViVo] [URL]

Morphologically annotated, lemmatized text corpus of Late Egyptian texts (c. 1550 – 1000 BCE) by the University of Liège (http://ramses.ulg.ac.be)

Used by: B03

ReFlexAE

Type: Corpus
Status: created
Details: [ViVo]

The ReFlexAE corpus (Register Flexibility in Academic Education) is a longitudinal corpus of written grammatical explanations built to investigate late register development in the context of higher education. The data are collected through a longitudinal written elicitation study with German L1 students enrolled in programs for primary school teachers. According to a repeated measures design, the longitudinal written study elicits data at three time points: before and after linguistic courses and before graduation. Each participant completes the same test battery comprising four written elicitation tasks, a grammar test, a demographic questionnaire and standard psychological questionnaires assessing personal traits and motivation for learning.

Used by: C05

Russian National Corpus

Type: Corpus
Status: used
Details: [ViVo] [URL]

The Russian National Corpus is a representative collection of texts in Russian, comprising about 1.5 billion tokens and equipped with linguistic annotation and search tools.

Used by: A03

SENIE Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Latvian texts provided by the SENIE project of the University of Latvia. http://senie.korpuss.lv/toc.jsp

Andronova, Everita (2007). The Corpus of Early Written Latvian: current state and future tasks. Proceedings of the Corpus Linguistics Conference. CL2007. University of Birmingham, UK. 27-30 July 2007. Edited by Matthew Davies, Paul Rayson, Susan Hunston, Pernilla Danielsson. ISSN 1747-9398. (http://ucrel.lancs.ac.uk/publications/CL2007/paper/245_Paper.pdf)

Used by: B02

Simulated Zoom-Corpus

Type: Corpus
Status: created
Details: [ViVo]

Simulated Zoom interaction with choreographed videos (variation of interlocutor persona [formality] and variation of topic / at-stakeness). Simultaneous laboratory recordings of audio and video.

Used by: C02

The GeWiss corpus (Gesprochene Wissenschaftssprache)

Type: Corpus
Status:
Details: [ViVo]

Used by: C05

The grammatically annotated corpus of the pericopes of the Old Lithuanian Postil of Jonas Bretkūnas

Type: Corpus
Status: created
Details: [ViVo] [DOI] [URL]

This grammatically annotated corpus aims to facilitate linguistic research on Old Lithuanian based on the Postil of Jonas Bretkūnas from the year 1591. The corpus is divided into two subcorpora, "pericopes" and "homilies", to make register-related research easier.

The pericopes were annotated using SIL Toolbox and converted with the conversion tool Pepper for use in the search tool ANNIS.

Three formats are provided in this release: 1. the Toolbox files, 2. the transitional Excel files and 3. a zipped folder to be imported into ANNIS.

Created in the project B02, Emergence and change of registers: The case of Lithuanian and Latvian of the CRC 1412 "Register" (funded by the Deutsche Forschungsgemeinschaft: DFG, German Research Foundation: 416591334).

Used by: B02

The oral ReFlexAE corpus

Type: Corpus
Status:
Details: [ViVo]

Used by: C05, C06

The written ReFlexAE corpus (Register Flexibility in Academic Education)

Type: Corpus
Status:
Details: [ViVo]

Used by: C05

Thesaurus Linguae Aegyptiae (TLA)

Type: Corpus
Status: used
Details: [ViVo] [URL]

Digital text corpus of Ancient Egyptian and Demotic, morphosyntactically annotated and lemmatized. It is the largest corpus of Egyptian texts of different types and periods (c. 2500 BCE – 450 CE).

Used by: B03

WroDiaCo v2

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Sarah Wesolek, Malte Belz, and Christine Mooshammer. Wroclaw Dialogue Corpus (WroDiaCo): Version 2, 2020. URL https://rs.cms.hu-berlin.de/phon.

Used by: C06

sgs corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]

Language: Persian
Size: 26 h
Description: free spoken dialogues with interviewer on fictive crime scenario
Features: social metadata, syntax
Access: internal

Used by: A06

Conversations

Type: Dataset
Status:
Details: [ViVo]

Used by: A09

Experimental Data: A register approach to negative concord vs. negative polarity items in English

Type: Dataset
Status:
Details: [ViVo] [DOI]

The repository contains all related files used for the experimental work reported in the paper "A register approach to negative concord vs. negative polarity items in English". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.

Used by: A07

Experimental Data: Bias and modality in conditionals: experimental evidence and theoretical implications

Type: Dataset
Status:
Details: [ViVo] [DOI]

The repository contains all related files used for the experimental work reported in the paper "Bias and modality in conditionals: experimental evidence and theoretical implications". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.

Used by: A07

Experimental Data: Comparing comparatives: Appropriateness ratings of synthetic, analytic and double comparatives in American and British English

Type: Dataset
Status:
Details: [ViVo] [DOI]

The repository contains all related files used for the experimental work reported in the paper "Comparing comparatives: Appropriateness ratings of synthetic, analytic and double comparatives in American and British English". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.

Used by: A07

Experimental Data: Interlocutor relation predicts the formality of the conversation: an experiment in American and British English

Type: Dataset
Status:
Details: [ViVo] [DOI]

The repository contains all related files used for the experimental work reported in the paper "Interlocutor relation predicts the formality of the conversation: an experiment in American and British English". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.

Used by: A07

Experimental Data: Modal concord in American and British English: A register-based experimental study

Type: Dataset
Status:
Details: [ViVo] [DOI]

The repository contains all related files used for the experimental work reported in the paper "Modal concord in American and British English: A register-based experimental study". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.

Used by: A07

Experimental Data: Counterfactual language, emotion, and perspective: a sentence completion study during the COVID-19 pandemic

Type: Dataset
Status:
Details: [ViVo] [DOI]

The repository contains all related files used for the experimental work reported in the paper "Counterfactual language, emotion, and perspective: a sentence completion study during the COVID-19 pandemic". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.

Used by: A07

Experimental Data: Language use and perception in social interaction – Another experiment on the social meaning of negative concord in American English

Type: Dataset
Status:
Details: [ViVo] [DOI]

Used by: A07

Experimental Data: Less formal and more rebellious --- An experiment on the social meaning of negative concord in American English

Type: Dataset
Status:
Details: [ViVo] [DOI]

The repository contains all related files used for the experimental work reported in the paper "Less formal and more rebellious --- An experiment on the social meaning of negative concord in American English". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.

Used by: A07

Experimental data: A register approach to negative concord vs. negative polarity items in English

Type: Dataset
Status:
Details: [ViVo] [DOI]

The repository contains all related files used for the experimental work reported in the paper "A register approach to negative concord vs. negative polarity items in English". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.

Used by: A07

Experimental data: demographic and language background

Type: Dataset
Status:
Details: [ViVo]

Used by: A07, C03

Interview data

Type: Dataset
Status:
Details: [ViVo]

Used by: C07

Lang*Reg Social Survey

Type: Dataset
Status:
Details: [ViVo]

Used by: A06

Lang*Reg: A multi-lingual corpus of intra-speaker variation across situations

Type: Dataset
Status:
Details: [ViVo] [DOI]

The Lang*Reg corpus records intra-speaker variation across languages and different situational-functional contexts, presumed to result in different registers. It has been prepared in the SFB1412 Register with data collections taking place in 2021-2022 for the following languages included in this version: German, Persian, Kurdish, Javanese. The data sets for each language comprise the speech of the same language users in a variety of spoken conversations and one written interaction. A minimum of 12 participants per language traversed a course of 6 situations in which they were asked to produce language in three types of activities: telling a story to a friend, talking freely with various interlocutors (friend, stranger, taxi driver) and engaging in an interview with a (university) professor. Moreover, our design included the storytelling in two modes, which allows for the comparison between spoken and written modes of the same language user.

Lang*Reg has a basic syntactic segmentation (one matrix clause and all its dependent clauses per segment). v0.2.0 includes the data sets with transcriptions, normalizations and tokens for each language as well as additional language-specific annotations such as glosses and syntactic annotations. We prepared each data set also for use with the browser-based search and visualization architecture ANNIS. For further language-specific morpho-syntactic and sociolinguistic annotations, refer to the respective data set description. For an overview of all data set characteristics, please see the corpus documentation in each data set.

Used by: A06

Big Five Inventory (BFI-10)

Type: Document
Status: used
Details: [ViVo] [DOI] [URL]

The BFI-10 is a highly economical scale for measuring personality according to the five-factor model. The scale is easy to administer in different survey modes. Evidence from the validation studies suggests that the BFI-10 provides not only an economical but also a reliable and valid measurement of the Big Five. It allows a rough measurement of the individual personality structure of adult respondents from the German-speaking general population.

Rammstedt, B., Kemper, C. J., Klein, M. C., Beierlein, C., & Kovaleva, A. (2014). Big Five Inventory (BFI-10). Zusammenstellung sozialwissenschaftlicher Items und Skalen (ZIS). https://doi.org/10.6102/zis76

Used by: A07, C05

Depressions-Angst-Stress Skalen (DASS 21)

Type: Document
Status: used
Details: [ViVo] [DOI] [URL]

The DASS are suited to assessing the burden of depression, anxiety, and stress without confounding somatic factors such as chronic pain problems. The scales can, however, also be used with clients who have no somatic complaints. The DASS are available in a short version with 21 items and a long version with 42 items, each in German and English.
© Lovibond, P.F., Lovibond, S.H., Nilges, P. & Essau, C.

Used by: A07

Epidemic - Pandemic Impacts Inventory (EPII)

Type: Document
Status: used
Details: [ViVo] [URL]

The EPII is a tool designed to assess tangible impacts of epidemics and pandemics across personal and social life domains.

Used by: A07

Interpersonal Reactivity Index (IRI)

Type: Document
Status: used
Details: [ViVo] [DOI]

Davis, M. (1983). Measuring individual differences in empathy: Evidence for a multidimensional approach. Journal of Personality and Social Psychology, 44, 1114–1126.

Used by: C05

Skalen zur motivationalen Regulation beim Lernen im Studium (SMR-LS)

Type: Document
Status: used
Details: [ViVo] [DOI] [URL]

Used by: C05

Experimental data: Addressee identification study

Type: Experimental research data
Status: created
Details: [ViVo]

Rating data (on a 9-point scale) on the probable addressee of spoken texts in three conditions: a) entirely Standard German, b) with Namibian-specific lexical features, and c) with non-standard grammatical features. The open-guise method was used to collect the data.

Participants: adults and adolescents in Namibia

Used by: C07

Experimental data: Newspaper correction study

Type: Experimental research data
Status: created
Details: [ViVo]

Data containing corrections of Namibian-German vs. Standard German features (lexical, morpho-syntactic, and grammatical) presented in a written mock newspaper article.
Participants: Adults and adolescents in Namibia and Germany

Used by: C07

Experimental data: speaker evaluation study

Type: Experimental research data
Status: created
Details: [ViVo]

Ratings (on a 9-point scale) of social meaning (competence and solidarity assessments) and inferences (origin, place of residence) regarding speakers of spoken texts in three conditions: a) entirely Standard German, b) with Namibian-specific lexical features, and c) with non-standard grammatical features. The data were collected using the open-guise method.
Participants: adults and adolescents in Namibia

Used by: C07

Stanford Log-linear Part-Of-Speech Tagger

Type: Software publication
Status: used
Details: [ViVo] [URL]

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.

Used by: A01, INF

ANNIS3

Type: Software publication
Status: used
Details: [ViVo] [DOI] [URL]

A web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.

Available from: https://corpus-tools.org/annis/
Documentation: https://corpus-tools.org/annis/documentation.html
Cite as: Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. in: Digital Scholarship in the Humanities 2016 (31). http://dsh.oxfordjournals.org/content/31/1/118

Used by: A03, A06, B04, C04, C06, INF

AntConc

Type: Software publication
Status: used
Details: [ViVo] [URL]

A freeware corpus analysis toolkit for concordancing and text analysis.

Used by: B02

CorpusSearch

Type: Software publication
Status:
Details: [ViVo]

Used by: B01

ELAN

Type: Software publication
Status: used
Details: [ViVo] [DOI] [URL]

ELAN is computer software, a professional tool to manually and semi-automatically annotate and transcribe audio or video recordings. It has a tier-based data model that supports multi-level, multi-participant annotation of time-based media.

Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands

Lausberg, H., & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, & Computers, 41(3), 841-849. doi:10.3758/BRM.41.3.841.

Used by: A06

EXMARaLDA

Type: Software publication
Status: used
Details: [ViVo] [URL]

EXMARaLDA is a system for working with oral corpora on a computer. It consists of a transcription and annotation tool (Partitur-Editor), a tool for managing corpora (Corpus-Manager) and a query and analysis tool (EXAKT). Further parts of EXMARaLDA are FOLKER and OrthoNormal, which were both developed in and for the FOLK project.
Schmidt, T. and Wörner, K. (2014). "EXMARaLDA". In Handbook on Corpus Phonology, pp. 402-419. Oxford University Press.

Used by: A06, C05

Field Linguist's Toolbox

Type: Software publication
Status: used
Details: [ViVo] [URL]

Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. Toolbox is free to download and use.

Used by: A06, B02

HAL-Inria

Type: Software publication
Status: used
Details: [ViVo] [URL]

In this paper, we present SALT, a framework for mapping heterogeneous linguistic formats onto one another using a model-based approach, i.e. independently of the actual formats in which the corresponding linguistic data is expressed. While describing the underlying concept of this framework, we identify how it echoes past and ongoing standardisation activities within ISO committee TC 37/SC 4, in particular the possible conceptual equivalences with ISO CD 24612 (LAF) combined with ISO 24610-1 (FSR), as well as the possible role of the central data category registry (ISOCat), currently under deployment. We thus show the adequacy of our methodology and its capacity to integrate a wide range of possible linguistic annotation models.

Used by: A03

INCEpTION

Type: Software publication
Status: used
Details: [ViVo] [URL]

We introduce INCEpTION, a new annotation platform for tasks including interactive and semantic annotation (e.g., concept linking, fact linking, knowledge base population, semantic frame annotation). These tasks are very time consuming and demanding for annotators, especially when knowledge bases are used. We address these issues by developing an annotation platform that incorporates machine learning capabilities which actively assist and guide annotators. The platform is both generic and modular. It targets a range of research domains in need of semantic annotation, such as digital humanities, bioinformatics, or linguistics. INCEpTION is publicly available as open-source software.

INF is hosting INCEpTION at https://inception.sfb1412.hu-berlin.de (Intranet only)

Used by: A01, A06, B04, INF

PennController for Internet Based Experiments (IBEX)

Type: Software publication
Status: used
Details: [ViVo] [URL]

PennController for Internet Based Experiments (“PennController” or “PCIbex” for short) provides the tools to build and run online experiments, from familiar paradigms like self-paced reading to completely custom-designed paradigms.

Used by: A03, A06, A07, C03

Pepper

Type: Software publication
Status: used
Details: [ViVo] [URL]

A highly extensible platform for conversion and manipulation of linguistic data between an unbound set of formats. Pepper can be used stand-alone as a command line interface, or be integrated as an API into other software products.

Available from: https://corpus-tools.org/pepper/
Documentation: https://corpus-tools.org/pepper/userGuide.html
Cite as: F. Zipser & L. Romary (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/

Used by: A01, A06, INF

RStudio Server

Type: Software publication
Status: used
Details: [ViVo] [URL]

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

Used by: A04, A07, C03, INF

TextSTat

Type: Software publication
Status: used
Details: [ViVo] [URL]

Textstat is an easy to use library to calculate statistics from text. It helps determine readability, complexity, and grade level.

Used by: B02

ToolboxTools

Type: Software publication
Status: created
Details: [ViVo] [URL]

This project provides tools to read and write files in the Toolbox format.
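
To give a rough idea of what reading and writing such files involves, here is a minimal Python sketch. It does not reproduce the ToolboxTools code or API. It assumes only the general Toolbox convention that each field starts with a backslash marker followed by its value. The function names `parse_toolbox` and `write_toolbox`, the choice of `\ref` as record marker, and the sample data are all hypothetical.

```python
import re

# A field line looks like "\marker value"; the marker is any run of non-space characters.
MARKER_LINE = re.compile(r"^\\(\S+)\s*(.*)$")


def parse_toolbox(text, record_marker="ref"):
    """Split Toolbox-style text into records, each a list of (marker, value) pairs."""
    records = []
    current = None
    for line in text.splitlines():
        match = MARKER_LINE.match(line)
        if match:
            marker, value = match.group(1), match.group(2).strip()
            if marker == record_marker:
                # The record marker opens a new record.
                current = []
                records.append(current)
            if current is not None:
                # Fields before the first record marker (file header) are skipped.
                current.append((marker, value))
        elif current and line.strip():
            # A non-marker line continues the value of the previous field.
            marker, value = current[-1]
            current[-1] = (marker, f"{value} {line.strip()}".strip())
    return records


def write_toolbox(records):
    """Serialise records back to marker lines, one blank line between records."""
    blocks = ["\n".join(f"\\{m} {v}".rstrip() for m, v in rec) for rec in records]
    return "\n\n".join(blocks) + "\n"


# Invented sample; marker names are assumptions and vary between projects.
SAMPLE = r"""\_sh v3.0  400  Text
\ref 001
\tx word1 word2
\ge gloss1 gloss2

\ref 002
\tx word3
\ge gloss3
"""

records = parse_toolbox(SAMPLE)
print(len(records), records[0])
```

Keeping each record as an ordered list of pairs, rather than a dictionary, preserves field order and repeated markers, both of which matter for interlinear text.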

Used by: B02

Tools rPraat and mPraat - Interfacing Phonetic Analyses with Signal Processing

Type: Software publication
Status: used
Details: [ViVo] [DOI]

The paper presents the rPraat package for R and the mPraat toolbox for Matlab, which constitute an interface between Praat, the most popular software for phonetic analyses, and these two more general programmes. The package adds to the functionality of Praat and is shown to be superior to other tools in terms of processing speed, while maintaining the interconnection with the data structures of R and Matlab, which provides a wide range of subsequent processing possibilities. The use of the proposed tool is demonstrated on a comparison of real speech data with synthetic speech generated by means of dynamic unit selection.

Used by: A03, A06, C06

emuR

Type: Software publication
Status: used
Details: [ViVo] [URL]

Raphael Winkelmann, Klaus Jaensch, Steve Cassidy, and Jonathan Harrington. emuR: Main Package of the EMU Speech Database Management System, 2018.

Used by: C06, INF

flairNLP / flair

Type: Software publication
Status: used
Details: [ViVo] [URL]

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Used by: INF, MGK, Z

spaCy

Type: Software publication
Status: used
Details: [ViVo] [URL]

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

Used by: INF

Stanford Log-linear Part-Of-Speech Tagger

Type: Software publication
Status: used
Details: [ViVo] [URL]

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.

Used by: A01, INF

ANNIS3

Type: Software publication
Status: used
Details: [ViVo] [DOI] [URL]

A web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.

Available from: https://corpus-tools.org/annis/
Documentation: https://corpus-tools.org/annis/documentation.html
Cite as: Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. in: Digital Scholarship in the Humanities 2016 (31). http://dsh.oxfordjournals.org/content/31/1/118

Used by: A03, A06, B04, C04, C06, INF

AntConc

Type: Software publication
Status: used
Details: [ViVo] [URL]

A freeware corpus analysis toolkit for concordancing and text analysis.

Used by: B02

CorpusSearch

Type: Software publication
Status:
Details: [ViVo]

Used by: B01

ELAN

Type: Software publication
Status: used
Details: [ViVo] [DOI] [URL]

ELAN is computer software, a professional tool to manually and semi-automatically annotate and transcribe audio or video recordings. It has a tier-based data model that supports multi-level, multi-participant annotation of time-based media.

Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands

Lausberg, H., & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, & Computers, 41(3), 841-849. doi:10.3758/BRM.41.3.841.

Used by: A06

EXMARaLDA

Type: Software publication
Status: used
Details: [ViVo] [URL]

EXMARaLDA is a system for working with oral corpora on a computer. It consists of a transcription and annotation tool (Partitur-Editor), a tool for managing corpora (Corpus-Manager) and a query and analysis tool (EXAKT). Further parts of EXMARaLDA are FOLKER and OrthoNormal, which were both developed in and for the FOLK project.
Schmidt T and Wörner K (2014), „EXMARaLDA“, In Handbook on Corpus Phonology, pp. 402-419. Oxford University Press.

Used by: A06, C05

Field Linguist's Toolbox

Type: Software publication
Status: used
Details: [ViVo] [URL]

Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. Toolbox is free to download and use.

Used by: A06, B02

HAL-Inria

Type: Software publication
Status: used
Details: [ViVo] [URL]

In this paper, we present, SALT, a framework for mapping heterogeneous linguistic formats from one another based on a model-based approach, i.e. independently of the actual formats in which the corresponding linguistic data is being expressed. While we describe the underlying concept of this framework, we identify how it echoes past ongoing standardisation activities within ISO committee TC 37/SC 4, and in particular, the possible conceptual equivalences with ISO CD 24612 (LAF) combined with ISO 24610-1 (FSR), as well as the possible role of the central data category registry (ISOCat), currently under deployment. We thus show the adequacy of our methodology and its capacity to integrate a wide range of possible linguistic annotation models.

Used by: A03

INCEpTION

Type: Software publication
Status: used
Details: [ViVo] [URL]

We introduce INCEpTION, a new annotation platform for tasks including interactive and semantic annotation (e.g., concept linking, fact linking, knowledge base population, semantic frame annotation). These tasks are very time consuming and demanding for annotators, especially when knowledge bases are used. We address these issues by developing an annotation platform that incorporates machine learning capabilities which actively assist and guide annotators. The platform is both generic and modular. It targets a range of research domains in need of semantic annotation, such as digital humanities, bioinformatics, or linguistics. INCEpTION is publicly available as open-source software.

INF is hosting INCEpTION at https://inception.sfb1412.hu-berlin.de (Intranet only)

Used by: A01, A06, B04, INF

PennController for Internet Based Experiments (IBEX)

Type: Software publication
Status: used
Details: [ViVo] [URL]

PennController for Internet Based Experiments (“PennController” or “PCIbex” for short) provides the tools to build and run online experiments, from familiar paradigms like self-paced reading to completely custom-designed paradigms.

Used by: A03, A06, A07, C03

Pepper

Type: Software publication
Status: used
Details: [ViVo] [URL]

A highly extensible platform for conversion and manipulation of linguistic data between an unbound set of formats. Pepper can be used stand-alone as a command line interface, or be integrated as an API into other software products.

Available from: https://corpus-tools.org/pepper/
Documentation: https://corpus-tools.org/pepper/userGuide.html
Cite as: F. Zipser & L. Romary (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/

Used by: A01, A06, INF

RStudio Server

Type: Software publication
Status: used
Details: [ViVo] [URL]

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

Used by: A04, A07, C03, INF

TextSTat

Type: Software publication
Status: used
Details: [ViVo] [URL]

Textstat is an easy to use library to calculate statistics from text. It helps determine readability, complexity, and grade level.
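
As a rough illustration of the kind of readability statistic mentioned above, the following sketch computes the Flesch Reading Ease score in plain Python. It does not use Textstat's own API; the function names and the vowel-group syllable heuristic are assumptions made for this example, and the heuristic is only a crude approximation for English text.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate (assumption): count groups of adjacent vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    # Sentences are split on runs of ., ! or ? (a simplification).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of ASCII letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    print(round(flesch_reading_ease("The cat sat."), 2))
```

Short sentences with short words score high, while long sentences with polysyllabic words score low. A dedicated library should be preferred for real analyses, since syllable counting is language-specific.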

Used by: B02

ToolboxTools

Type: Software publication
Status: created
Details: [ViVo] [URL]

This project provides tools to read and write files in the Toolbox format.
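
To give an idea of what reading the Toolbox format involves, here is a minimal sketch of a parser for backslash-marker records. It is not the ToolboxTools API: the function name, the choice of `ref` as record marker and the sample markers `tx` and `ge` are assumptions for illustration, and the sample values are placeholders rather than corpus data.

```python
from typing import Dict, List, Optional

# Placeholder sample: a header line followed by two records (not real corpus data).
SAMPLE = r"""\_sh v3.0 400 Text

\ref 001
\tx word1 word2
\ge gloss1 gloss2

\ref 002
\tx word3
\ge gloss3
"""


def parse_toolbox(text: str, record_marker: str = "ref") -> List[Dict[str, str]]:
    """Parse Toolbox-style text into a list of {marker: value} records.

    A new record starts at each line carrying the record marker. Lines before
    the first record (e.g. the file header) are skipped, and lines without a
    leading backslash continue the previous field.
    """
    records: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    last_marker: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information in this sketch
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker == record_marker:
                current = {}
                records.append(current)
            if current is None:
                continue  # header material before the first record
            current[marker] = value.strip()
            last_marker = marker
        elif current is not None and last_marker is not None:
            # Continuation line: append to the most recent field.
            current[last_marker] = (current[last_marker] + " " + line).strip()
    return records


if __name__ == "__main__":
    for record in parse_toolbox(SAMPLE):
        print(record)
```

The sketch keeps only the last value when a marker repeats within a record and ignores interlinear alignment, both of which a full reader would need to handle.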

Used by: B02

Tools rPraat and mPraat - Interfacing Phonetic Analyses with Signal Processing

Type: Software publication
Status: used
Details: [ViVo] [DOI]

The paper presents the rPraat package for R and the mPraat toolbox for Matlab, which constitute an interface between Praat, the most popular software for phonetic analyses, and the two more general programmes. The package adds to the functionality of Praat and is shown to be superior in processing speed to other tools, while maintaining the interconnection with the data structures of R and Matlab, which provides a wide range of subsequent processing possibilities. The use of the proposed tool is demonstrated on a comparison of real speech data with synthetic speech generated by means of dynamic unit selection.

Used by: A03, A06, C06

emuR

Type: Software publication
Status: used
Details: [ViVo] [URL]

Raphael Winkelmann, Klaus Jaensch, Steve Cassidy, and Jonathan Harrington. emuR: Main Package of the EMU Speech Database Management System, 2018.

Used by: C06, INF

flairNLP / flair

Type: Software publication
Status: used
Details: [ViVo] [URL]

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Used by: INF, MGK, Z

spaCy

Type: Software publication
Status: used
Details: [ViVo] [URL]

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

Used by: INF

↑ Corpora (102)

A Grammatically Annotated Corpus of the Old Latvian Postil of Georg Mancelius

BeDiaCo - Berlin Dialogue Corpus

BeMeCo v1

BiNoKo V. 1.0 Birgitta-Notker-Korpus

Bislama Spoken Corpus

Bonner Totenbuchprojekt

British National Corpus (BNC)

C02-Corpus

CaeMmCom

CoACan: Corpus del español académico de Canarias (1.0)

CoCoYum

CoMeCaYo: Corpus Mediático de Canarias en YouTube (1.0)

CoPaCaYo: Corpus del Parlamento de Canarias en YouTube (1.0)

CoParCan: Corpus del Parlamento de Canarias (1.0)

Corpus of Non-Native Addressee Register (CoNNAR). Version 1

Czech corpus Koditex

DNam

DNam corpus + DNam Wenker corpus

ENCOW16

ENCOW16A-NANO

Eye-Tracking Corpus

FOLK excerpt

Falko Corpus

GeWISS excerpt

GermaNet

GermaParl Corpus of Plenary Protocols

Icelandic Parsed Historical Corpus (IcePaHC)

Kobalt_RST: Die Annotation von rhetorischen Strukturen im Kobalt-DaF-Korpus

Lang*Reg: A multi-lingual corpus of intra-individual variation across situations

Lithuanian Corpus

Luther-Bretke-Korpus

Morisien Spoken Corpus

Online production experiment on Imprecision

Penn-Helsinki Corpus of Early Modern English

Penn-Helsinki Corpus of Middle English

Potsdam Commentary Corpus

PreCOXX25: Register-annotated German webcorpus

Prestudy ”situational context” in Czech

RUEG-GER

Ramsès Project

ReFlexAE

Russian National Corpus

SENIE Corpus

Simulated Zoom-Corpus

The GeWiss corpus (Gesprochene Wissenschaftssprache)

The grammatically annotated corpus of the pericopes of the Old Lithuanian Postil of Jonas Bretkūnas

The oral ReFlexAE corpus

The written ReFlexAE corpus (Register Flexibility in Academic Education)

Thesaurus Linguae Aegyptiae (TLA)

WroDiaCo v2

sgs corpus
