The MERLIN Corpus

The MERLIN corpus was created jointly within the MERLIN project by the following partners: University of Technology Dresden (Germany), Eurac Research Bolzano (Italy), Eberhard Karls University Tübingen (Germany), Charles University Prague (Czech Republic), telc Frankfurt/Main (Germany), and Berufsförderungsinstitut Oberösterreich, Linz (Austria).

The corpus contains 2,286 texts written by learners of Italian, German and Czech, taken from the written examinations of recognised test institutions. The exams aim to test knowledge across levels A1-C1 of the Common European Framework of Reference (CEFR).

The MERLIN data have been enriched with multi-level annotation. The main annotations, available for almost all learner texts (for detailed figures please visit the MERLIN homepage), are target hypotheses (target hypotheses 1) and annotations of grammatical and orthographical learner language features (error annotation 1).

Corpus Information

size: ca. 340,000 tokens
writers: 2,286 adults
text type: various (informal and formal emails/letters for different purposes, opinion texts on different topics), based on standardised language tests
languages: Italian, German, Czech
year of data collection: 2012
reference paper: Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Schöne, K., Štindlová, B. & Vettori, C. (2014): The MERLIN corpus: learner language and the CEFR. Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, 26-31 May 2014, pp. 1281-1288.
Documents regarding transcription and annotation of the MERLIN corpus are available on the MERLIN platform.

Corpus Access

The MERLIN corpus can be queried via the ANNIS interface or downloaded from the Eurac Research CLARIN Repository.