The Kolipsi Corpus Family is a collection of Italian and German learner texts gathered during the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). Both the original project and the follow-up study aimed to analyse the second language competences of South-Tyrolean pupils in upper secondary schools (aged 16 to 18) and to contextualize the results by examining the sociolinguistic and psychosocial factors that influence these competences. The results of the follow-up study were intended to be compared with those of the original KOLIPSI project.
All sub-corpora of the Kolipsi Corpus Family contain manual transcription annotations. These annotations reflect surface features of the text, such as its graphical arrangement, and include error annotation on the orthographic level. In addition, all texts were annotated automatically with tokenisation, sentence splitting, POS tagging and lemmatisation.
The Kolipsi-1 Corpus
The Kolipsi-1 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.
The data collection took place during the school year 2007/2008 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre).
All L2 learner texts have been reliably assigned to CEFR levels. The corpus contains a variety of additional metadata such as school type, gender, origin, socioeconomic status and L1 of the learners.
The L1 students completed the same two tasks. However, no further metadata, such as CEFR levels, were collected for them.
Corpus Information
sub-corpus | year | # tokens | # texts | # writers | writers’ age | language |
---|---|---|---|---|---|---|
Kolipsi-1_L2_IT | 2007/08 | ca. 387,000 | 1,990 | 1,000 | 16-18 years | Italian |
Kolipsi-1_L2_DE | 2007/08 | ca. 87,000 | 523 | 267 | 16-18 years | German |
Kolipsi-1_L1_IT | 2007/08 | ca. 11,000 | 80 | 43 | 16-18 years | Italian |
Kolipsi-1_L1_DE | 2007/08 | ca. 80,000 | 363 | 183 | 16-18 years | German |
TOTAL Kolipsi-1 | 2007/08 | ca. 565,000 | 2,956 | 1,449 | 16-18 years | Italian/German |
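The figures in the table above can be used to estimate average text length per sub-corpus. The following is a minimal sketch, assuming the rounded ("ca.") token counts from the table; the dictionary `KOLIPSI_1` and the helper names `mean_tokens_per_text` and `totals` are hypothetical and not part of any corpus tooling. Because the token counts are approximate, the resulting means are rough estimates only.

```python
# Figures copied from the Kolipsi-1 table; token counts are rounded ("ca.").
# The names below are illustrative only, not part of any official tooling.
KOLIPSI_1 = {
    "Kolipsi-1_L2_IT": {"tokens": 387_000, "texts": 1_990},
    "Kolipsi-1_L2_DE": {"tokens": 87_000, "texts": 523},
    "Kolipsi-1_L1_IT": {"tokens": 11_000, "texts": 80},
    "Kolipsi-1_L1_DE": {"tokens": 80_000, "texts": 363},
}


def mean_tokens_per_text(stats):
    """Return the approximate mean number of tokens per text for each sub-corpus."""
    return {name: row["tokens"] / row["texts"] for name, row in stats.items()}


def totals(stats):
    """Sum tokens and texts over all sub-corpora."""
    return {
        "tokens": sum(row["tokens"] for row in stats.values()),
        "texts": sum(row["texts"] for row in stats.values()),
    }


if __name__ == "__main__":
    for name, mean in mean_tokens_per_text(KOLIPSI_1).items():
        print(f"{name}: roughly {mean:.0f} tokens per text")
    print("Totals:", totals(KOLIPSI_1))
```

The summed tokens and texts reproduce the TOTAL row of the table (ca. 565,000 tokens and 2,956 texts).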
The Kolipsi-2 Corpus
The Kolipsi-2 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI II project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”.
The data collection took place during the school year 2014/2015 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket (narrative text genre) and (2) writing an e-mail about negative aspects of social-media chats prompted by a letter to the editor in a youth magazine (argumentative/narrative text genre). The first task was used in the same way as in the original KOLIPSI project.
Corpus Information
sub-corpus | year | # tokens | # texts | # writers | writers’ age | language |
---|---|---|---|---|---|---|
Kolipsi-2_IT | 2014/15 | ca. 400,000 | 2,063 | 1,035 | 16-18 years | Italian |
Kolipsi-2_DE | 2014/15 | ca. 106,000 | 700 | 357 | 16-18 years | German |
TOTAL Kolipsi-2 | 2014/15 | ca. 506,000 | 2,763 | 1,392 | 16-18 years | Italian/German |
The Kolipsi-Matura Corpus
The Kolipsi-Matura Corpus is an L1 corpus composed of final school examinations in Italian and German written by a randomly selected sample of KOLIPSI project participants whose L2 texts are part of the Kolipsi-1 Corpus. The data was collected in 2009.
Corpus Information
sub-corpus | year | # tokens | # texts | # writers | writers’ age | language |
---|---|---|---|---|---|---|
Kolipsi-Matura_L1_IT | 2009 | ca. 41,000 | 53 | 53 | 18-19 years | Italian |
Kolipsi-Matura_L1_DE | 2009 | ca. 64,000 | 99 | 99 | 18-19 years | German |
TOTAL Kolipsi-Matura | 2009 | ca. 105,000 | 152 | 152 | 18-19 years | Italian/German |
Corpus Access
All sub-corpora of the Kolipsi Corpus Family can be queried via the ANNIS interface or downloaded from the Eurac Research Clarin Repository.
Reference Paper
Glaznieks, A., Frey, J.-C., Abel, A., Nicolas, L. & Vettori, C. (2023): The Kolipsi Corpus Family: Resources for learner corpus research in Italian and German. In: Italian Journal of Computational Linguistics 9(2). https://doi.org/10.4000/ijcol.1210
Documentation
- Transcription and annotation guidelines
- Writing task Kolipsi-1 Italian
- Writing task Kolipsi-1 German
- Writing tasks Kolipsi-2 Italian
- Writing tasks Kolipsi-2 German
- Descriptions of the CEFR proficiency level used for Kolipsi-1 (Italian)
- Descriptions of the CEFR proficiency level used for Kolipsi-1 (German)
- Descriptions of the CEFR proficiency level used for Kolipsi-2 (Italian)
- Descriptions of the CEFR proficiency level used for Kolipsi-2 (German)
- Official examination sheet Kolipsi-Matura_DE
- Official examination sheet Kolipsi-Matura_IT
Related Publications
Spina, S., Glaznieks, A. & Abel, A. (2025). Intensification in Written L2 Italian: Insights from the multilingual region of South Tyrol. International Journal of Learner Corpus Research 11(2). [https://doi.org/10.1075/ijlcr.23041.spi]
Bienati, A., & Frey, J.-C. (2025). Development of causal connectives in Italian L1 and L2 student writing: A comparison of argumentative texts from lower and upper secondary school. In Continuing Learner Corpus Research: Challenges and Opportunities, Presses universitaires Louvain, 197-212. [https://pul.uclouvain.be/book/?gcoi=29303100485770]
Spina, S., Glaznieks, A. & Abel, A. (2024). L’intensificazione dell’aggettivo in italiano L2: uno studio sugli studenti delle scuole dell’Alto Adige. Italiano LinguaDue 16 (1), 311-331. [https://doi.org/10.54103/2037-3597/23843]
Glaznieks, A., Frey, J.-C. & Abel, A. (2023). Weil-Sätze bei Lernenden des Deutschen. Vergleich zwischen immersiv und nicht-immersiv Deutschlernenden in Südtirol. In: Michael Beißwenger, Eva Gredel, Lothar Lemnitzer & Roman Schneider (Hgg.): Korpusgestützte Sprachanalyse. Grundlagen, Anwendungen und Analysen. Tübingen: Narr Francke Attempto, 401-423.
König, A., Frey, J.-C., & Stemle, E. W. (2021). Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora. Information, 12(5), Article 5. [https://doi.org/10.3390/info12050199]
If you have used the Kolipsi Corpus Family in your work and want to list your publications here, please email porta@eurac.edu!