The MERLIN corpus was created jointly within the MERLIN project by the following partners: University of Technology Dresden (Germany), Eurac Research Bolzano (Italy), Eberhard Karls University Tübingen (Germany), Charles University Prague (Czech Republic), telc Frankfurt/Main (Germany), and Berufsförderungsinstitut Oberösterreich, Linz (Austria).
The corpus contains 2,286 texts written by learners of Italian, German and Czech, taken from the written examinations of recognised test institutions. The exams test proficiency at levels A1-C1 of the Common European Framework of Reference (CEFR).
The MERLIN data have been enriched with multi-level annotation. The main annotations, available for almost all learner texts (for detailed figures please visit the MERLIN homepage), are target hypotheses (target hypotheses 1) and annotations of grammatical and orthographic features of learner language (error annotation 1).
| size: | ca. 340,000 tokens |
| text type: | various (informal and formal email/letter for different purposes, opinion text on different topics), based on standardised language tests |
| languages: | Italian, German, Czech |
| year of data collection: | 2012 |
| reference paper: | Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Schöne, K., Štindlová, B. & Vettori, C. (2014): The MERLIN corpus: learner language and the CEFR. Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, 26-31 May 2014, pp. 1281-1288. |
Documents regarding transcription and annotation of the MERLIN corpus are available on the MERLIN platform.
The MERLIN corpus can be queried via the ANNIS interface or downloaded from the Eurac Research CLARIN Repository.