Resources (re-)used by the CRC

Several types of data beyond traditional publications are used and created throughout the CRC. We are committed to publishing our research data and software under an Open Access license wherever possible. By sharing our resources, we want to enable reproducibility and re-use by the research community.



Corpora


BeDiaCo - Berlin Dialogue Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [DOI] [URL]

The corpus consists of acoustic recordings of spontaneous dialogues of German native speakers with both task-free and task-based parts and additional read word lists.
Malte Belz, Christine Mooshammer, Alina Zöllner, and Lea-Sophie Adam. Berlin Dialogue Corpus (BeDiaCo): Version 2, 2021.


Used by: C06

BeMeCo v1

Type: Corpus
Status: re-used
Details: [ViVo]
Alina Zöllner, Christine Mooshammer, and Silke Hamann. Berlin Menutask Corpus (BeMeCo): Version 1, 2021. URL https://rs.cms.hu-berlin.de/phon
Used by: C06

Birgitta-Notker-Korpus (BiNoKo)

Type: Corpus
Status: re-used
Details: [ViVo]
Annotated multilayer corpus including social role relationships. Content: Notker’s Book of Psalms and St. Birgitta’s Heavenly Revelations (Book IV), plus Birgittine-Norwegian
Used by: B04

Bislama Spoken Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: A02

Bonner Totenbuchprojekt

Type: Corpus
Status: used
Details: [ViVo] [URL]
For more than 1,500 years in ancient Egypt, the Book of the Dead formed a store of knowledge for the deceased, which was placed in the tomb with them in written form.
Used by: B03

British National Corpus (BNC)

Type: Corpus
Status: used
Details: [ViVo] [URL]
The British National Corpus (BNC) was originally created by Oxford University Press in the 1980s and early 1990s. It contains 100 million words of text from a wide range of genres (e.g. spoken, fiction, magazines, newspapers, and academic).
Used by: A05

CaeMmCom

Type: Corpus
Status: used
Details: [ViVo] [URL]

Corpus of Ancient Egyptian Multimodal Communication. 

CAEMmCom – Corpus of Ancient Egyptian Multimodal Communication: Getting Started [pdf]

2020 Multimodale graphische Kommunikation im pharaonischen Ägypten: Entwurf einer Analysemethode, Lingua Aegyptia 28: 81-116.
2020 (with Rebecca Döhl and Jens-Martin Loebel) CaeMmCom – Corpus altaegyptischer multimodaler Communication. Der Aufbau einer multimodalen Datensammlung altägyptischer Kommunikate, Zeitschrift für digitale Geisteswissenschaft, [open access].


Used by: B03

CoCoYum

Type: Corpus
Status: created
Details: [ViVo] [URL]
Language: Yucatec Maya
Size: 159.00 tokens
Description: natural language production (spoken) and elicited data
Features: morpheme, glosses, translations, comments
Access: CC-BY-NC-ND

The Collective Corpus of Yucatec Maya (CoCoYum) is a collection of data on the Yucatec Maya language contributed by various researchers. It contains transcriptions of recordings (e.g. storytelling, dialogue, public events), written data, and elicited data. The corpus will grow over time as new data are collected and as further researchers add their data to it.
Used by: A06

CoNNAR

Type: Corpus
Status: re-used
Details: [ViVo]
(Corpus of Non‑native Addressee Register)
Used by: C06

Czech corpus Koditex

Type: Corpus
Status: re-used
Details: [ViVo] [URL]
A synchronic, representative, and reference 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech.

Zasina, A. J. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Řehořková, A.: Koditex: A corpus of diversified texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: www.korpus.cz
Used by: A03

DNam corpus + DNam Wenker corpus

Type: Corpus
Status: used
Details: [ViVo] [URL]
The corpus "German in Namibia" ("Deutsch in Namibia", DNam) was created between 2016 and 2021 in the DFG project "NamDeutsch: Die Dynamik des Deutschen im mehrsprachigen Kontext Namibias" ("NamDeutsch: The Dynamics of German in Namibia's Multilingual Context"; WI 2155/9-1 and SI 750/4-1). The project was directed by Heike Wiese and Horst Simon in cooperation with Marianne Zappen-Thomson and was based at the University of Potsdam (until 2019), HU Berlin (since 2019), FU Berlin, and UNAM Windhoek.

Article PDF
Used by: C07

ENCOW16

Type: Corpus
Status: re-used
Details: [ViVo] [URL]
Access: webcorpora.org (free access for academic use)
Engines: NoSkE and RStudio Server
Used by: A04

Eye-Tracking Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: C03

FOLK excerpt

Type: Corpus
Status: used
Details: [ViVo]
Language: German
Size: 194,716 tokens
Description: conversations in various situations
Features: rich metadata, lemma, POS, speech unit segmentation, some dependencies
Access: internal



Used by: A06

Falko Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]
L1- and L2-authored argumentative essays collected in a controlled setting.
Used by: C04

GeWiss excerpt

Type: Corpus
Status: used
Details: [ViVo] [URL]

GeWiss is a research project in spoken academic language. It provides a multilingual (German/English/Polish/Italian) corpus of audio recordings and transcriptions of academic communications, as an empirical foundation for comparative research.

To this end, the GeWiss corpus focusses on two main genres of spoken academic language:

  • talks including discussions, and
  • oral exams,

and it explicitly distinguishes between L1 and L2 subcorpora. The corpus is enlarged and developed continuously.


Used by: A06

GermaNet

Type: Corpus
Status: used
Details: [ViVo] [URL]
GermaNet is a lexical-semantic wordnet that relates German nouns, verbs, and adjectives semantically by grouping lexical units that express the same concept into synsets and by defining semantic relations between these synsets. GermaNet has much in common with the English WordNet® and can be viewed as an online thesaurus or a lightweight ontology.
Used by: A01

GermaParl Corpus of Plenary Protocols

Type: Corpus
Status: used
Details: [ViVo] [DOI] [URL]
The GermaParl Corpus has been prepared in the PolMine Project (http://polmine.github.io) and comprises all protocols of plenary sessions in the German Bundestag (1996 - 2016). This version of the corpus is based on plain text documents issued by the German Bundestag. For a period between 2008 and 2010, txt files are not available. To fill the gap, pdf documents were processed. As part of the corpus preparation pipeline, the data has been linguistically annotated (using the TreeTagger) and imported into the Corpus Workbench (CWB). See the GermaParl documentation website (http://polmine.github.io/GermaParl) for further information.
Used by: A01

Icelandic Parsed Historical Corpus (IcePaHC)

Type: Corpus
Status: used
Details: [ViVo] [URL]
The Icelandic Parsed Historical Corpus (IcePaHC) is a project that has built a diachronic corpus with samples of written Icelandic from all periods from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. For historical texts spelling is modernized for phonological change.
Used by: A05

Lang*Reg: A multi-lingual corpus of intra-individual variation across situations

Type: Corpus
Status: created
Details: [ViVo]
Language: German, Persian, Yucatec Maya, Kurdish, Javanese
Size: 36 hours
Description: same speakers varied by mode, acquaintance, professionalism, and expertise
Features: transcription, syntactic segmentation, normalization, token, glossing or POS-tags, some syntax
Access: transcription or annotation in progress; CC-BY-NC-ND
Used by: A06

Lithuanian Corpus

Type: Corpus
Status: re-used
Details: [ViVo]
The facsimile of the Lithuanian text was published in 2005 by Ona Aleknavičienė. The corpus texts were generated from the facsimiles by OCR.
Used by: B02

Morisien Spoken Corpus

Type: Corpus
Status: created
Details: [ViVo]

Used by: A02

Online production experiment on Imprecision

Type: Corpus
Status: created
Details: [ViVo]

Used by: A05

Potsdam Commentary Corpus

Type: Corpus
Status: used
Details: [ViVo] [URL]
The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus). It is annotated with a range of different types of linguistic information.

[Bourgonje & Stede 2020] Bourgonje, Peter and Stede, Manfred (2020). The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing. In Proc. of the Language Resources and Evaluation Conference (LREC), Marseille.
Used by: A01

PreCOXX25: Register-annotated German webcorpus

Type: Corpus
Status: re-used
Details: [ViVo]
Access: webcorpora.org (free access for academic use)
Engines: NoSkE and RStudio Server
Used by: A04

Prestudy "situational context" in Czech

Type: Corpus
Status: re-used
Details: [ViVo]
Ibex farm project (Zehr and Schwarz, 2018)
Used by: A03

Ramsès Project

Type: Corpus
Status: used
Details: [ViVo] [URL]
Morphologically annotated, lemmatized text corpus of Late Egyptian texts (c. 1550 – 1000 BCE) by the University of Liège (http://ramses.ulg.ac.be)
Used by: B03

Russian National Corpus

Type: Corpus
Status: used
Details: [ViVo] [URL]

The Russian National Corpus is a representative collection of texts in Russian, comprising about 1.5 billion tokens and supplemented with linguistic annotation and search tools.


Used by: A03

SENIE Corpus

Type: Corpus
Status: re-used
Details: [ViVo] [URL]
Latvian texts provided by the SENIE project of the University of Latvia. http://senie.korpuss.lv/toc.jsp
Used by: B02

Simulated Zoom-Corpus

Type: Corpus
Status: created
Details: [ViVo]
Simulated Zoom interaction with choreographed videos (variation of interlocutor persona [formality] and variation of topic / at-stakeness). Simultaneous laboratory recordings of audio and video.
Used by: C02

The ReFlex Corpus

Type: Corpus
Status: re-used
Details: [ViVo]
A longitudinal production study of register development
Used by: C05

Thesaurus Linguae Aegyptiae (TLA)

Type: Corpus
Status: used
Details: [ViVo] [URL]
Digital text corpus of the ancient Egyptian and Demotic languages, morphosyntactically annotated and lemmatized. Largest corpus of Egyptian texts of different types and times (c. 2500 BCE – 450 CE)
Used by: B03

WroDiaCo v2

Type: Corpus
Status: re-used
Details: [ViVo]
Sarah Wesolek, Malte Belz, and Christine Mooshammer. Wroclaw Dialogue Corpus (WroDiaCo): Version 2, 2020. URL https://rs.cms.hu-berlin.de/phon.
Used by: C06

sgs corpus

Type: Corpus
Status: created
Details: [ViVo]
Language: Persian
Size: 26 h
Description: free spoken dialogues with an interviewer about a fictitious crime scenario
Features: social metadata, syntax
Access: internal
Used by: A06

Documents/Other


Big Five Inventory (BFI-10)

Type: Document
Status: used
Details: [ViVo] [DOI] [URL]
The BFI-10 is a highly economical scale for measuring personality according to the five-factor model. The scale is easy to administer in different survey modes. The empirical evidence from the validation studies suggests that the BFI-10 measures the Big Five not only economically but also reliably and validly. It provides a rough measurement of the individual personality structure of adult interviewees from the German-speaking general population.

Rammstedt, B., Kemper, C. J., Klein, M. C., Beierlein, C., & Kovaleva, A. (2014). Big Five Inventory (BFI-10). Zusammenstellung sozialwissenschaftlicher Items und Skalen (ZIS). https://doi.org/10.6102/zis76



Used by: A07, C05

Depressions-Angst-Stress Skalen (DASS 21)

Type: Document
Status: used
Details: [ViVo] [DOI] [URL]
The DASS are suited to measuring the burden of depression, anxiety, and stress without confounding somatic factors such as chronic pain problems. However, the scales can also be used for clients without somatic complaints. The DASS are available in a short version with 21 items and a long version with 42 items, each in German and English.
© Lovibond, P.F., Lovibond, S.H., Nilges, P. & Essau, C.
Used by: A07

Epidemic - Pandemic Impacts Inventory (EPII)

Type: Document
Status: used
Details: [ViVo] [URL]
The EPII is a tool designed to assess tangible impacts of epidemics and pandemics across personal and social life domains.
Used by: A07

Interpersonal Reactivity Index (IRI)

Type: Document
Status: used
Details: [ViVo] [DOI]
Davis, M. (1983). Measuring individual differences in empathy: Evidence for a multidimensional approach. Journal of Personality and Social Psychology, 44, 1114–1126.
Used by: C05

Skalen zur motivationalen Regulation beim Lernen im Studium (SMR-LS)

Type: Document
Status: used
Details: [ViVo] [DOI] [URL]

Used by: C05

Experimental research data


Experimental data: Addressee identification study

Type: Experimental research data
Status: created
Details: [ViVo]
Participants: adults and adolescents in Namibia
Used by: C07

Experimental data: Newspaper correction study

Type: Experimental research data
Status: created
Details: [ViVo]
Participants: adults and adolescents in Namibia and Germany


Used by: C07

Experimental data: Speaker evaluation study

Type: Experimental research data
Status: created
Details: [ViVo]
Participants: adults and adolescents in Namibia
Used by: C07

Software


Stanford Log-linear Part-Of-Speech Tagger

Type: Software publication
Status: used
Details: [ViVo] [URL]

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other tokens), although computational applications generally use more fine-grained POS tags like 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
Used by: A01, INF

ANNIS3

Type: Software publication
Status: used
Details: [ViVo] [DOI] [URL]

A web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.

Available from: https://corpus-tools.org/annis/
Documentation: https://corpus-tools.org/annis/documentation.html
Cite as: Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. in: Digital Scholarship in the Humanities 2016 (31). http://dsh.oxfordjournals.org/content/31/1/118


Used by: A03, A06, B04, C04, C06, INF

AntConc

Type: Software publication
Status: used
Details: [ViVo] [URL]
A freeware corpus analysis toolkit for concordancing and text analysis.


Used by: B02

ELAN

Type: Software publication
Status: used
Details: [ViVo] [DOI] [URL]
ELAN is computer software, a professional tool to manually and semi-automatically annotate and transcribe audio or video recordings. It has a tier-based data model that supports multi-level, multi-participant annotation of time-based media.

Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands

Lausberg, H., & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, & Computers, 41(3), 841-849. doi:10.3758/BRM.41.3.841.
Used by: A06

EXMARaLDA

Type: Software publication
Status: used
Details: [ViVo] [URL]
EXMARaLDA is a system for working with oral corpora on a computer. It consists of a transcription and annotation tool (Partitur-Editor), a tool for managing corpora (Corpus-Manager) and a query and analysis tool (EXAKT). Further parts of EXMARaLDA are FOLKER and OrthoNormal, which were both developed in and for the FOLK project.
Used by: A06, C05

Field Linguist's Toolbox

Type: Software publication
Status: used
Details: [ViVo] [URL]
Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. Toolbox is free to download and use.
Used by: A06, B02

HAL-Inria

Type: Software publication
Status: used
Details: [ViVo] [URL]
In this paper, we present SALT, a framework for mapping heterogeneous linguistic formats to one another using a model-based approach, i.e. independently of the actual formats in which the corresponding linguistic data is being expressed. While we describe the underlying concept of this framework, we identify how it echoes past ongoing standardisation activities within ISO committee TC 37/SC 4, and in particular, the possible conceptual equivalences with ISO CD 24612 (LAF) combined with ISO 24610-1 (FSR), as well as the possible role of the central data category registry (ISOCat), currently under deployment. We thus show the adequacy of our methodology and its capacity to integrate a wide range of possible linguistic annotation models.
Used by: A03

INCEpTION

Type: Software publication
Status: used
Details: [ViVo] [URL]
We introduce INCEpTION, a new annotation platform for tasks including interactive and semantic annotation (e.g., concept linking, fact linking, knowledge base population, semantic frame annotation). These tasks are very time consuming and demanding for annotators, especially when knowledge bases are used. We address these issues by developing an annotation platform that incorporates machine learning capabilities which actively assist and guide annotators. The platform is both generic and modular. It targets a range of research domains in need of semantic annotation, such as digital humanities, bioinformatics, or linguistics. INCEpTION is publicly available as open-source software.

INF is hosting INCEpTION at https://inception.sfb1412.hu-berlin.de (Intranet only)
Used by: A01, A06, B04, INF

PennController for Internet Based Experiments (IBEX)

Type: Software publication
Status: used
Details: [ViVo] [URL]
PennController for Internet Based Experiments (“PennController” or “PCIbex” for short) provides the tools to build and run online experiments, from familiar paradigms like self-paced reading to completely custom-designed paradigms.
Used by: A03, A06, A07, C03

Pepper

Type: Software publication
Status: used
Details: [ViVo] [URL]

A highly extensible platform for conversion and manipulation of linguistic data between an unbound set of formats. Pepper can be used stand-alone as a command line interface, or be integrated as an API into other software products.

Available from: https://corpus-tools.org/pepper/
Documentation: https://corpus-tools.org/pepper/userGuide.html
Cite as: F. Zipser & L. Romary (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/


Used by: A01, A06, INF

RStudio Server

Type: Software publication
Status: used
Details: [ViVo] [URL]

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.


Used by: A04, A07, C03, INF

TextSTat

Type: Software publication
Status: used
Details: [ViVo] [URL]
Textstat is an easy-to-use library for calculating statistics from text. It helps determine readability, complexity, and grade level.
Used by: B02

Tools rPraat and mPraat - Interfacing Phonetic Analyses with Signal Processing

Type: Software publication
Status: used
Details: [ViVo] [DOI]
The paper presents the rPraat package for R and the mPraat toolbox for Matlab, which constitute an interface between the most popular software for phonetic analyses, Praat, and the two more general programmes. The package adds to the functionality of Praat and is shown to be superior in processing speed to other tools, while maintaining the interconnection with the data structures of R and Matlab, which provides a wide range of subsequent processing possibilities. The use of the proposed tool is demonstrated on a comparison of real speech data with synthetic speech generated by means of dynamic unit selection.
Used by: A03, A06, C06

emuR

Type: Software publication
Status: used
Details: [ViVo] [URL]
Raphael Winkelmann, Klaus Jaensch, Steve Cassidy, and Jonathan Harrington. emuR: Main Package of the EMU Speech Database Management System, 2018.
Used by: C06, INF

flairNLP / flair

Type: Software publication
Status: used
Details: [ViVo] [URL]
A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Used by: INF

spaCy

Type: Software publication
Status: used
Details: [ViVo] [URL]
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
Used by: INF