The ITACA Corpus is a corpus of argumentative essays written in Italian by upper secondary school students from South Tyrol. It was created to investigate and describe the students’ textual competences, with a special focus on text coherence.
The ITACA Corpus consists of 635 texts collected during the 2021/2022 school year in schools with Italian as the language of instruction.
The whole corpus has been automatically tokenized, lemmatized, and annotated for part-of-speech and dependency relations. A subset of 388 texts is additionally annotated for textual features such as punctuation, connectives, agreement, anaphora, argumentative structure, off-topic passages, and contradictions.
The corpus also provides metadata on the students’ age, gender, language background, and reading and writing habits, their performance in a standardized language test, and holistic and analytic coherence evaluations for each text.
Corpus Information
size: | 382,964 tokens |
texts: | 635 (388 texts further annotated with features of text cohesion and coherence) |
writers: | 635 students in 12th grade (upper secondary schools, aged 17 to 19) |
text type: | argumentative essay |
language: | Italian |
year of data collection: | 2022 |
reference paper: |
Documents
- Annotation schema for textual annotations (in Italian)
- Additional documents will be added here.
Corpus Access
The corpus can be queried via the ANNIS interface or downloaded from the Eurac Research CLARIN Repository after the end of the project (March 2024).