The ITACA corpus consists of argumentative essays written in Italian by upper secondary school students from South Tyrol. Created to investigate and describe students’ textual competences, it places special focus on text coherence.
The corpus comprises 635 texts collected during the school year 2021/2022 in schools with Italian as a language of instruction.
The entire collection is automatically tokenized, lemmatized, and annotated with part-of-speech tags and dependency relations. Additionally, a subset of 388 texts includes further manual annotations on textual features, such as punctuation, connectives, agreement, anaphora, argumentative structure, off-topic elements, and contradictions.
The corpus also provides metadata on each student’s age, gender, language background, reading and writing habits, and performance in a standardized language test, as well as holistic and analytic coherence evaluations for each text.
Corpus Information
size: 424,693 tokens
texts: 635 (388 of them further annotated with features of text cohesion and coherence)
writers: 635 12th-grade students (upper secondary school, aged 17 to 19)
text type: argumentative essay
language: Italian
year of data collection: 2022
Corpus Access
The corpus can be queried via the ANNIS interface or downloaded from the Eurac Research CLARIN Repository after the end of the project (spring/summer 2025).
Documentation
- Corpus handbook
- Writing task administered to students (in Italian)
- Sociolinguistic questionnaire (in Italian)
- Metadata on reading and writing habits
- Coherence ratings
- Rating scale
- Annotation schema for textual annotations
Related Publications
Bienati, A., & Frey, J.-C. (2025). Development of causal connectives in Italian L1 and L2 student writing: A comparison of argumentative texts from lower and upper secondary school. In Continuing Learner Corpus Research: Challenges and Opportunities, Presses universitaires de Louvain, 197–212. [https://pul.uclouvain.be/book/?gcoi=29303100485770]
Leone-Pizzighella, A. R., Bienati, A., & Frey, J.-C. (2024). Discourse markers in the curricularization of ‘academic language’. A mixed methods analysis of tipo and praticamente in Italian secondary schools. In L. Cirillo & R. Nodari (Eds.), Studi AItLA 18: Contesti, pratiche e risorse della comunicazione multimodale, 149–162. [https://hdl.handle.net/10863/42783]
Pellegrino, F., Frey, J.-C., & Zanasi, L. (2024). Towards an Automatic Evaluation of (In)coherence in Student Essays. In Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), 4–6 December 2024, Pisa, Italy. [https://ceur-ws.org/Vol-3878/82_main_long.pdf]
Zanasi, L., Bienati, A., Frey, J.-C., & Vettori, C. (2024). Condizioni di coerenza e procedure di coesione nella scrittura scolastica: Il caso dei connettivi. CLUB Working Papers in Linguistics, 8, 131–152. [https://amsacta.unibo.it/id/eprint/8065/1/CLUB_WPL_volume8_2024.pdf#page=132]
If you have used the ITACA Corpus in your work and want to list your publications here, please email porta@eurac.edu!