The Kolipsi Corpus Family – PORTA Eurac Research Learner Corpus Portal

The Kolipsi Corpus Family is a collection of Italian and German learner texts gathered during the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). Both the original project and the follow-up study aimed to analyse the second language competences of South-Tyrolean pupils at upper secondary schools (aged 16 to 18) and to contextualize the findings by examining the sociolinguistic and psychosocial factors that influence these competences. The follow-up study was intended to allow comparison of its results with those of the original KOLIPSI project.

All sub-corpora of the Kolipsi Corpus Family contain manual transcription annotations. These annotations reflect surface features of the text, such as its graphical arrangement, and include error annotation on the orthographic level. In addition, all texts were annotated automatically, covering tokenisation, sentence splitting, POS tagging and lemmatization.

The Kolipsi-1 Corpus

The Kolipsi-1 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.

The data collection took place during the school year 2007/2008 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre).

All L2 learner texts have been reliably assigned to CEFR levels. The corpus contains a variety of additional metadata such as school type, gender, origin, socioeconomic status and L1 of the learners.

The L1 students completed the same two tasks. However, no further metadata such as CEFR level were collected for them.

Corpus Information

sub-corpus	year	# tokens	# texts	# writers	writers’ age	language
Kolipsi-1_L2_IT	2007/08	ca. 387,000	1,990	1,000	16-18 years	Italian
Kolipsi-1_L2_DE	2007/08	ca. 87,000	523	267	16-18 years	German
Kolipsi-1_L1_IT	2007/08	ca. 11,000	80	43	16-18 years	Italian
Kolipsi-1_L1_DE	2007/08	ca. 80,000	363	183	16-18 years	German
TOTAL Kolipsi-1	2007/08	ca. 565,000	2,956	1,449	16-18 years	Italian/German
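
The figures in the table can be turned into rough per-text and per-writer ratios. The following is a minimal sketch, not part of the corpus tooling: the function name `summarise` and the shortened column headers are illustrative assumptions, and because the token counts are the rounded "ca." values from the table, all derived numbers are approximate.

```python
import csv
import io

# Rows copied from the Kolipsi-1 table above (sub-corpus rows only, no TOTAL).
# Token counts are the rounded "ca." figures, so derived values are approximate.
# The column headers are shortened here for convenience (an assumption of this sketch).
TABLE = """sub-corpus\ttokens\ttexts\twriters
Kolipsi-1_L2_IT\t387000\t1990\t1000
Kolipsi-1_L2_DE\t87000\t523\t267
Kolipsi-1_L1_IT\t11000\t80\t43
Kolipsi-1_L1_DE\t80000\t363\t183
"""


def summarise(table_text):
    """Return {sub-corpus: (tokens per text, texts per writer)}."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    summary = {}
    for row in reader:
        tokens = int(row["tokens"])
        texts = int(row["texts"])
        writers = int(row["writers"])
        summary[row["sub-corpus"]] = (tokens / texts, texts / writers)
    return summary


if __name__ == "__main__":
    for name, (tok_per_text, texts_per_writer) in summarise(TABLE).items():
        print(f"{name}: ~{tok_per_text:.0f} tokens/text, "
              f"{texts_per_writer:.2f} texts/writer")
```

In every sub-corpus the texts-per-writer ratio comes out slightly below two, which is consistent with each participant having been given two writing tasks.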

The Kolipsi-2 Corpus

The Kolipsi-2 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI II project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”.

The data collection took place during the school year 2014/2015 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket (narrative text genre) and (2) writing an e-mail about negative aspects of social-media chats prompted by a letter to the editor in a youth magazine (argumentative/narrative text genre). The first task was used in the same way as in the original KOLIPSI project.

Corpus Information

sub-corpus	year	# tokens	# texts	# writers	writers’ age	language
Kolipsi-2_IT	2014/15	ca. 400,000	2,063	1,035	16-18 years	Italian
Kolipsi-2_DE	2014/15	ca. 106,000	700	357	16-18 years	German
TOTAL Kolipsi-2	2014/15	ca. 506,000	2,763	1,392	16-18 years	Italian/German

The Kolipsi-Matura Corpus

The Kolipsi-Matura Corpus is an L1 corpus composed of final school examinations in Italian and German, written by a randomly selected sample of KOLIPSI project participants whose L2 texts are part of the Kolipsi-1 Corpus. The data were collected in 2009.

Corpus Information

sub-corpus	year	# tokens	# texts	# writers	writers’ age	language
Kolipsi-Matura_L1_IT	2009	ca. 41,000	53	53	18-19 years	Italian
Kolipsi-Matura_L1_DE	2009	ca. 64,000	99	99	18-19 years	German
TOTAL Kolipsi-Matura	2009	ca. 105,000	152	152	18-19 years	Italian/German

Corpus Access

All sub-corpora of the Kolipsi Corpus Family can be queried via the ANNIS interface or downloaded from the Eurac Research CLARIN Repository.

Corpus Query

Download (Kolipsi-1)

Download (Kolipsi-2)

Reference Paper

Glaznieks, A., Frey, J.-C., Abel, A., Nicolas, L. & Vettori, C. (2023): The Kolipsi Corpus Family: Resources for learner corpus research in Italian and German. In: Italian Journal of Computational Linguistics 9(2). https://doi.org/10.4000/ijcol.1210

Documentation

Related Publications

Spina, S. & Glaznieks, A. (online-first 2025, 2026): Morphological and syntactic adjective intensification in L2 Italian and German in a multilingual context. Folia Linguistica. [https://doi.org/10.1515/flin-2025-0110]

Spina, S., Glaznieks, A. & Abel, A. (2025). Intensification in Written L2 Italian: Insights from the multilingual region of South Tyrol. International Journal of Learner Corpus Research 11(2), 276-308. [https://doi.org/10.1075/ijlcr.23041.spi]

Bienati, A., & Frey, J.-C. (2025). Development of causal connectives in Italian L1 and L2 student writing: A comparison of argumentative texts from lower and upper secondary school. In Continuing Learner Corpus Research: Challenges and Opportunities, Presses universitaires Louvain, 197-212. [https://pul.uclouvain.be/book/?gcoi=29303100485770]

Spina, S., Glaznieks, A. & Abel, A. (2024). L’intensificazione dell’aggettivo in italiano L2: uno studio sugli studenti delle scuole dell’Alto Adige. Italiano LinguaDue 16 (1), 311-331. [https://doi.org/10.54103/2037-3597/23843]

Glaznieks, A., Frey, J.-C. & Abel, A. (2023). Weil-Sätze bei Lernenden des Deutschen. Vergleich zwischen immersiv und nicht-immersiv Deutschlernenden in Südtirol. In: Michael Beißwenger, Eva Gredel, Lothar Lemnitzer & Roman Schneider (Hgg.): Korpusgestützte Sprachanalyse. Grundlagen, Anwendungen und Analysen. Tübingen: Narr Francke Attempto, 401-423.

König, A., Frey, J.-C., & Stemle, E. W. (2021). Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora. Information, 12(5), Article 5. [https://doi.org/10.3390/info12050199]

If you have used the Kolipsi Corpus Family in your work and want to list your publications here, please email porta@eurac.edu!
