The ITACA Corpus

The ITACA Corpus is a corpus of argumentative essays written in Italian by upper secondary school students from South Tyrol. It was created to investigate and describe the students’ textual competences, with a special focus on text coherence.

The ITACA Corpus consists of 635 texts collected during the 2021/2022 school year in schools with Italian as the language of instruction.

The whole corpus has been automatically tokenized, lemmatized, and annotated for part of speech and dependency relations. A subset of 388 texts is additionally annotated for textual features such as punctuation, connectives, agreement, anaphora, argumentative structure, off-topic passages and contradictions.

The corpus also provides metadata on the students’ age, gender, language background, reading and writing habits, and performance in a standardized language test, as well as holistic and analytic coherence evaluations for each text.

Corpus Information

size: 382,964 tokens
texts: 635 (388 texts further annotated with features of text cohesion and coherence)
writers: 635 students in 12th grade (upper secondary school, aged 17 to 19)
text type: argumentative essay
year of data collection: 2022
reference paper:


Corpus Access

The corpus can be queried via the ANNIS interface or downloaded from the Eurac Research CLARIN Repository after the end of the project (March 2024).