The Kolipsi Corpus Family

The Kolipsi Corpus Family is a collection of Italian and German learner texts gathered during the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both the original project and the follow-up study was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results by examining the key sociolinguistic and psychosocial factors that influence these competences. The results of the follow-up study are intended to be compared with those of the original KOLIPSI project.

All sub-corpora of the Kolipsi Corpus Family contain manual transcription annotations. These annotations reflect surface features of the text, such as its graphical arrangement, and include error annotation at the orthographic level. In addition, all texts were annotated automatically; this covers tokenisation, sentence splitting, POS-tagging and lemmatization.

The Kolipsi-1 Corpus

The Kolipsi-1 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.

The data collection took place during the school year 2007/2008 and is based on two standardized tests of written production. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre).

All L2 learner texts have been reliably assigned to CEFR (Common European Framework of Reference for Languages) levels. The corpus also contains a variety of additional metadata, such as the learners' school type, gender, origin, socioeconomic status and L1.

The L1 pupils completed the same two tasks. However, no further metadata, such as CEFR level, were collected for them.

Corpus Information

sub-corpus       | year    | # tokens    | # texts | # writers | writers’ age | language
Kolipsi-1_L2_IT  | 2007/08 | ca. 365,000 | 1,889   | 947       | 16-18 years  | Italian
Kolipsi-1_L2_DE  | 2007/08 | ca. 89,000  | 537     | 274       | 16-18 years  | German
Kolipsi-1_L1_IT  | 2007/08 | ca. 11,000  | 82      | 44        | 16-18 years  | Italian
Kolipsi-1_L1_DE  | 2007/08 | ca. 80,000  | 365     | 184       | 16-18 years  | German
TOTAL Kolipsi-1  | 2007/08 | ca. 545,000 | 2,873   | 1,449     | 16-18 years  | Italian/German

The Kolipsi-2 Corpus

The Kolipsi-2 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI II project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”.

The data collection took place during the school year 2014/2015 and is based on two standardized tests of written production. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket (narrative text genre) and (2) writing an e-mail about negative aspects of social-media chats, prompted by a letter to the editor in a youth magazine (argumentative/narrative text genre). The first task was identical to the one used in the original KOLIPSI project.

Corpus Information

sub-corpus       | year    | # tokens    | # texts | # writers | writers’ age | language
Kolipsi-2_IT     | 2014/15 | ca. 400,000 | 2,063   | 1,035     | 16-18 years  | Italian
Kolipsi-2_DE     | 2014/15 | ca. 105,000 | 700     | 357       | 16-18 years  | German
TOTAL Kolipsi-2  | 2014/15 | ca. 505,000 | 2,763   | 1,392     | 16-18 years  | Italian/German
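
As a quick sanity check, the TOTAL rows of the two Corpus Information tables can be recomputed from the sub-corpus rows. The following is only a minimal sketch: the figures are copied by hand from the tables, the token counts are approximate ("ca."), and the names KOLIPSI_1, KOLIPSI_2 and totals are illustrative and not part of any official corpus tooling.

```python
# Figures copied from the two Corpus Information tables.
# Each entry: (approx. tokens, texts, writers). Names are illustrative only.
KOLIPSI_1 = {
    "Kolipsi-1_L2_IT": (365_000, 1_889, 947),
    "Kolipsi-1_L2_DE": (89_000, 537, 274),
    "Kolipsi-1_L1_IT": (11_000, 82, 44),
    "Kolipsi-1_L1_DE": (80_000, 365, 184),
}
KOLIPSI_2 = {
    "Kolipsi-2_IT": (400_000, 2_063, 1_035),
    "Kolipsi-2_DE": (105_000, 700, 357),
}


def totals(sub_corpora):
    """Sum tokens, texts and writers over all sub-corpora of one corpus."""
    tokens = sum(t for t, _, _ in sub_corpora.values())
    texts = sum(x for _, x, _ in sub_corpora.values())
    writers = sum(w for _, _, w in sub_corpora.values())
    return tokens, texts, writers


for name, corpus in (("Kolipsi-1", KOLIPSI_1), ("Kolipsi-2", KOLIPSI_2)):
    tokens, texts, writers = totals(corpus)
    print(
        f"{name}: ca. {tokens:,} tokens, {texts:,} texts, {writers:,} writers, "
        f"{texts / writers:.2f} texts per writer"
    )
```

The sums agree with the TOTAL rows of both tables. The ratio of texts to writers comes out just under two in both corpora, which fits the design of two writing tasks per pupil.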

Documents

Reference Paper

Glaznieks, A., Frey, J.-C., Nicolas, L., Abel, A. & Vettori, C. (in preparation): The Kolipsi Corpus Family. A collection of Italian and German L2 learner texts from secondary school pupils.

Corpus Access

All sub-corpora of the Kolipsi Corpus Family can be queried via the ANNIS interface or downloaded from the Eurac Research Clarin Repository.