Standardising language data through the conversion pipeline TEIWorLD
DOI:
https://doi.org/10.21248/idsopen.15.2026.54Schlagworte:
conversion, formats, pipeline, use case, TEIWorLDAbstract
The conversion of data into a standard format is a crucial step in many research workflows. Standardisation enables data exchange, reuse, and analysis, which are essential for advancing knowledge in various fields. In this publication, we describe the conversion pipeline TEIWorLD (TEI Workflow for Language Data) that transforms written and spoken language data into standardised formats, specifically I5/TEI P5 XML for written data and ISO/TEI Transcriptions of Spoken Language for spoken data. The pipeline leverages existing tools to convert specific formats into these standards, with an additional transformation step for written data into the archival I5 (short for IDS TEI P5) format used at the Leibniz Institute for the German Language (IDS). We also present two use cases that demonstrate the practical application of standardisation with our conversion pipeline TEIWorLD in language data management on a corpus consisting of more than one format, enabling researchers to efficiently analyse and share their data.
