The ITACA Corpus

The ITACA corpus consists of argumentative essays written in Italian by upper secondary school students from South Tyrol. Created to investigate and describe students’ textual competences, it places special focus on text coherence.

The corpus comprises 635 texts collected during the school year 2021/2022 in schools with Italian as a language of instruction.

The entire collection is automatically tokenized, lemmatized, and annotated for part-of-speech and dependency relations. Additionally, a subset of 388 texts includes further manual annotations on textual features, such as punctuation, connectives, agreement, anaphora, argumentative structure, off-topic elements and contradictions.

The corpus also provides metadata on each student’s age, gender, language background, reading and writing habits, and performance in a standardized language test, as well as holistic and analytic coherence evaluations for each text.

Corpus Information

size: 424,693 tokens
texts: 635 (388 texts further annotated with features of text cohesion and coherence)
writers: 635 students in 12th grade (upper secondary school, aged 17 to 19)
text type: argumentative essay
language: Italian
year of data collection: 2022

Corpus Access

The corpus can be queried via the ANNIS interface or downloaded from the Eurac Research CLARIN Repository after the end of the project (spring/summer 2025).

Documentation

Related Publications

Bienati, A., & Frey, J.-C. (2025). Development of causal connectives in Italian L1 and L2 student writing: A comparison of argumentative texts from lower and upper secondary school. In Continuing Learner Corpus Research: Challenges and Opportunities, Presses universitaires de Louvain, 197–212. [https://pul.uclouvain.be/book/?gcoi=29303100485770]

Leone-Pizzighella, A. R., Bienati, A., & Frey, J.-C. (2024). Discourse markers in the curricularization of ‘academic language’. A mixed methods analysis of tipo and praticamente in Italian secondary schools. In L. Cirillo & R. Nodari (Eds.), Studi AItLA 18: Contesti, pratiche e risorse della comunicazione multimodale, 149–162. [https://hdl.handle.net/10863/42783]

Pellegrino, F., Frey, J.-C., & Zanasi, L. (2024). Towards an Automatic Evaluation of (In)coherence in Student Essays. Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), 4–6 December 2024, Pisa, Italy. [https://ceur-ws.org/Vol-3878/82_main_long.pdf]

Zanasi, L., Bienati, A., Frey, J.-C., & Vettori, C. (2024). Condizioni di coerenza e procedure di coesione nella scrittura scolastica: Il caso dei connettivi. CLUB Working Papers in Linguistics, 8, 131–152. [https://amsacta.unibo.it/id/eprint/8065/1/CLUB_WPL_volume8_2024.pdf#page=132]

If you have used the ITACA Corpus in your work and want to list your publications here, please email porta@eurac.edu!