The Kolipsi Corpus Family is a collection of Italian and German learner texts gathered during the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). Both the original project and the follow-up study aimed to analyse the second language competences of South-Tyrolean upper secondary school pupils (aged 16 to 18) and to contextualize the results by commenting on the crucial sociolinguistic and psychosocial aspects that influence these competences. The results of the follow-up study were meant to be compared with those of the original KOLIPSI project.
All sub-corpora of the Kolipsi Corpus Family contain manual transcription annotations. These annotations reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level. In addition, all texts were annotated automatically with tokenisation, sentence splitting, POS-tagging and lemmatization.
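To give a rough idea of what the first two automatic annotation layers produce, the following is a minimal sketch of sentence splitting and tokenisation. It is not the tool chain used for the Kolipsi corpora, which is not named here; the regular expressions, the function names `split_sentences` and `tokenize`, and the sample text are all assumptions made for illustration. POS-tagging and lemmatization require trained language models and are left out.

```python
import re

# Hypothetical sample text, not taken from the corpus.
SAMPLE = "Ciao Marco! Ieri sono andato al supermercato. Non ci crederai."


def split_sentences(text):
    """Naive sentence splitting: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenisation: word forms (optionally with an apostrophe) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


sentences = split_sentences(SAMPLE)
tokens = [tokenize(s) for s in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, "->", sentence_tokens)
```

Real learner texts contain abbreviations, orthographic errors and unconventional punctuation, so rules this simple would likely mis-split many sentences; the sketch only shows the shape of the output.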
The Kolipsi-1 Corpus
The Kolipsi-1 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.
The data collection took place during the school year 2007/2008 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre).
All L2 learner texts have been reliably assigned to CEFR levels. The corpus also contains a variety of metadata, such as school type, gender, origin, socioeconomic status and L1 of the learners.
The L1 students completed the same two tasks. However, no further metadata such as CEFR level were collected for them.
Corpus Information
sub-corpus | year | # tokens | # texts | # writers | writers’ age | language |
---|---|---|---|---|---|---|
Kolipsi-1_L2_IT | 2007/08 | ca. 387,000 | 1,990 | 1,000 | 16-18 years | Italian |
Kolipsi-1_L2_DE | 2007/08 | ca. 87,000 | 523 | 267 | 16-18 years | German |
Kolipsi-1_L1_IT | 2007/08 | ca. 11,000 | 80 | 43 | 16-18 years | Italian |
Kolipsi-1_L1_DE | 2007/08 | ca. 80,000 | 363 | 183 | 16-18 years | German |
TOTAL Kolipsi-1 | 2007/08 | ca. 565,000 | 2,956 | 1,449 | 16-18 years | Italian/German |
The Kolipsi-2 Corpus
The Kolipsi-2 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI II project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”.
The data collection took place during the school year 2014/2015 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket (narrative text genre) and (2) writing an e-mail about negative aspects of social-media chats prompted by a letter to the editor in a youth magazine (argumentative/narrative text genre). The first task was used in the same way as in the original KOLIPSI project.
Corpus Information
sub-corpus | year | # tokens | # texts | # writers | writers’ age | language |
---|---|---|---|---|---|---|
Kolipsi-2_IT | 2014/15 | ca. 400,000 | 2,063 | 1,035 | 16-18 years | Italian |
Kolipsi-2_DE | 2014/15 | ca. 106,000 | 700 | 357 | 16-18 years | German |
TOTAL Kolipsi-2 | 2014/15 | ca. 506,000 | 2,763 | 1,392 | 16-18 years | Italian/German |
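As a small worked example of the arithmetic behind the table, the sketch below sums the two Kolipsi-2 rows, compares them with the TOTAL row, and derives the average text length per sub-corpus. The figures are copied from the table above; the token counts are rounded ("ca."), so the averages are approximations. The dictionary layout and the helper names `column_sum` and `tokens_per_text` are assumptions chosen for illustration.

```python
# Figures copied from the Kolipsi-2 table; token counts are rounded ("ca.").
KOLIPSI_2 = {
    "Kolipsi-2_IT": {"tokens": 400_000, "texts": 2_063, "writers": 1_035},
    "Kolipsi-2_DE": {"tokens": 106_000, "texts": 700, "writers": 357},
}
TOTAL = {"tokens": 506_000, "texts": 2_763, "writers": 1_392}


def column_sum(rows, column):
    """Add up one column over all sub-corpora."""
    return sum(row[column] for row in rows.values())


def tokens_per_text(row):
    """Approximate mean text length in tokens for one sub-corpus."""
    return row["tokens"] / row["texts"]


sums = {column: column_sum(KOLIPSI_2, column) for column in TOTAL}

for name, row in KOLIPSI_2.items():
    print(f"{name}: about {tokens_per_text(row):.0f} tokens per text")
print("row sums match the TOTAL row:", sums == TOTAL)
```

The same check can be applied to the other tables by swapping in their rows.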
The Kolipsi-Matura Corpus
The Kolipsi-Matura Corpus is an L1 corpus composed of final school examinations in Italian and German written by a sample of randomly selected participants of the KOLIPSI project whose L2 texts are part of the Kolipsi-1 Corpus. The data was collected in 2009.
Corpus Information
sub-corpus | year | # tokens | # texts | # writers | writers’ age | language |
---|---|---|---|---|---|---|
Kolipsi-Matura_L1_IT | 2009 | ca. 41,000 | 53 | 53 | 18-19 years | Italian |
Kolipsi-Matura_L1_DE | 2009 | ca. 64,000 | 99 | 99 | 18-19 years | German |
TOTAL Kolipsi-Matura | 2009 | ca. 105,000 | 152 | 152 | 18-19 years | Italian/German |
Documents
- Transcription and annotation guidelines
- Writing task Kolipsi-1 Italian
- Writing task Kolipsi-1 German
- Writing tasks Kolipsi-2 Italian
- Writing tasks Kolipsi-2 German
- Descriptions of the CEFR proficiency level used for Kolipsi-1 (Italian)
- Descriptions of the CEFR proficiency level used for Kolipsi-1 (German)
- Descriptions of the CEFR proficiency level used for Kolipsi-2 (Italian)
- Descriptions of the CEFR proficiency level used for Kolipsi-2 (German)
- Official examination sheet Kolipsi-Matura_DE
- Official examination sheet Kolipsi-Matura_IT
Reference Paper
Glaznieks, A., Frey, J.-C., Abel, A., Nicolas, L. & Vettori, C. (2023): The Kolipsi Corpus Family: Resources for learner corpus research in Italian and German. In: Italian Journal of Computational Linguistics 9(2). https://doi.org/10.4000/ijcol.1210
Corpus Access
All sub-corpora of the Kolipsi Corpus Family can be queried via the ANNIS interface or downloaded from the Eurac Research Clarin Repository.