
06/20/2024 - 06/21/2024, Mohrenstraße 40/41, room 415

Methods in Historical Corpus Building

The workshop features aspects of corpus building (sampling, architecture, pipeline, digitization, OCR), annotation (conception, tagset design, tagging, parsing) and corpus use (search, re-use, re-annotation), throwing a spotlight on a number of historical languages (Old High German, Old Lithuanian, Early New High German, Belarusian) and corpora (RIDGES, Referenzkorpus Altdeutsch, PosTiMe, SLIEKKAS, Lutherkorpus).
In the designated interactive slots there will be plenty of opportunities to discuss your own data issues, try out software and/or adapt presented state-of-the-art techniques to your own research.

Invited Speakers

  • Dr. Loïc Boizou (Universität zu Köln)
  • Prof. Dr. Jolanta Gelumbeckaitė (Goethe-Universität Frankfurt am Main)
  • Ercong Nie (Ludwig-Maximilians-Universität München)


Thursday, 06/20/2024

  • 9:00 – 9:30: opening, introduction to project context (B04)
  • 9:30 – 10:15: Martin Klotz: Introduction to corpora, research questions, terminology, and corpus infrastructure
  • 10:15 – 10:45: coffee break
  • 10:45 – 11:30: Loïc Boizou: How to build an NLP pipeline on free tools for relatively under-resourced languages with available textual resources
  • 11:30 – 13:00: interactive session, focus on corpus building
  • 13:00 – 14:00: lunch break
  • 14:00 – 15:00: Ercong Nie: Automatic annotation, with demos for Middle High German and Early New High German
  • 15:00 – 15:30: coffee break
  • 15:30 – 17:00: interactive session, focus on (semi-)automatic annotation
  • 17:00 – 18:00: wrap-up
  • 19:00: conference dinner (not included)

Friday, 06/21/2024

  • 9:00 – 10:30: Anke Lüdeling, Thomas Krause: Introduction to the RIDGES corpus
  • 10:30 – 11:00: coffee break
  • 11:00 – 12:00: Jolanta Gelumbeckaitė: SLIEKKAS – Developing a standard tagset for Old Lithuanian
  • 12:00 – 13:00: formation of working groups; topics of interest such as flexible corpus (re)use, finding data, Toolbox annotation…
  • 13:00 – 14:00: lunch break
  • 14:00 – 17:30: discussion, coaching, task-solving within the working groups
  • 17:30 – 18:00: wrap-up, final discussion

The software and data needed for the interactive sessions can be downloaded here.


Please e-mail your name, affiliation and (if applicable) a short description of your project (research project, PhD project, student project) to Gohar Schnelle (

If you already have concrete ideas, you are welcome to give further information on your own data-based research, so that we can tailor the discussion to your specific issues:

  • a short characterisation of the data, such as:
    • language(s)
    • data type (e.g. texts, text length, text type, historical source [handwritten, printed] etc.)
    • data formats (e.g. spreadsheet, txt, xml etc.)
    • data size 
    • data complexity
  • software you use, or would like to use
  • specific questions, topics or problems you would like to address


We look forward to hearing from you!

The organising committee,
Mortimer Drach (B04)
Anna Helene Feulner (B04)
Jürg Fleischer (B04)
Martin Klotz (INF)
Thomas Krause (INF)
Gohar Schnelle (B04)
Lars Erik Zeige (B04)