TUT is a project for the development of a collection of morphologically, syntactically and semantically annotated Italian sentences; it includes:
the definition of a native representation format (i.e. the TUT format), which is dependency-oriented and aims at capturing the richness of the predicate-argument structure, a crucial layer of representation for several NLP tasks such as parsing, Information Extraction, Machine Translation and Question Answering;
the conversion into Penn Treebank and other constituency-based formats, which increases the possibilities for comparison, evaluation and portability of the resource.
The native TUT format
TUT adopts a representation format based on the dependency paradigm and centred upon the notion of predicate-argument structure, as described with reference to major Italian linguistic phenomena in the Linguistic notes. This paradigm, which describes syntactic structures using dependency relations between pairs of words, was chosen because the reference language is only partially configurational, i.e. Italian has a relatively free word order.
In TUT the dependency relations are annotated following the Augmented Relational Structure (ARS), where each relation is implemented as a feature structure that can include values for a morpho-syntactic, a functional-syntactic and a syntactic-semantic component.
![[image]](http://mowser.com/img?url=http%3A%2F%2Fwww.di.unito.it%2F%7Etutreeb%2Fimages%2Fimage001.gif)
The need for a description of grammatical relations that is more detailed and closer to semantics has led to the development of a rich and flexible grammatical relation system for TUT: around 250 relations, organized hierarchically and annotated at a variable degree of specificity. When the annotator cannot select a specific relation to label the dependency edge linking two words, he/she can select a more generic relation from the higher levels of this taxonomy.
To represent phenomena involving discontinuity and deletion, as well as pro-drop subjects, TUT has been enriched with a trace-filler notation (see 1. Traces and co-indexing in Linguistic notes for examples and further details).
See the following example:
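The fallback from a specific relation to a more generic one can be pictured with a small sketch. The relation labels and the shape of the hierarchy below are illustrative assumptions only; they are not taken from the actual TUT inventory of around 250 relations.

```python
from typing import Dict, List, Optional

# Hypothetical fragment of a relation hierarchy: child -> parent.
# These labels are placeholders, not the real TUT taxonomy.
PARENT: Dict[str, Optional[str]] = {
    "SUBJ": "ARG",
    "OBJ": "ARG",
    "ARG": "DEPENDENT",
    "MOD": "DEPENDENT",
    "DEPENDENT": None,  # root of this toy hierarchy
}


def generalizations(relation: str) -> List[str]:
    """Return the relation followed by its increasingly generic ancestors."""
    chain = []
    current: Optional[str] = relation
    while current is not None:
        if current not in PARENT:
            raise KeyError(f"unknown relation: {current}")
        chain.append(current)
        current = PARENT[current]
    return chain


print(generalizations("SUBJ"))
```

An annotator who cannot decide between two sibling labels would, in this picture, pick their common ancestor further up the chain.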
![[image]](http://mowser.com/img?url=http%3A%2F%2Fwww.di.unito.it%2F%7Etutreeb%2Fimages%2Fesempio.gif)
Each line contains all the information concerning a single node-word X:
the position of X within the linear order of the sentence,
the morphological features of X (in round brackets),
the position of the head-word Y from which X depends, and the name of the relation linking X to Y (both in square brackets).
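As a rough illustration of the line layout just described, the following Python sketch reads one such line. The exact layout is an assumption: position, word form, features in round brackets, then head position and relation name in square brackets separated by a semicolon. The sample line is made up for the example and is not taken from the treebank.

```python
import re
from typing import List, NamedTuple


class TutNode(NamedTuple):
    position: str        # position of the word in the sentence (kept as text)
    form: str            # the word itself
    features: List[str]  # morphological features from the round brackets
    head: str            # position of the head word
    relation: str        # name of the relation linking the word to its head


# Assumed layout: POSITION FORM (FEATURES) [HEAD;RELATION]
LINE_RE = re.compile(
    r"^\s*(?P<pos>\S+)\s+(?P<form>\S+)\s+"
    r"\((?P<feats>[^)]*)\)\s+"
    r"\[(?P<head>[^;\]]+);(?P<rel>[^\]]+)\]\s*$"
)


def parse_line(line: str) -> TutNode:
    """Parse one node line; raise ValueError if it does not fit the assumed layout."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"line does not match the assumed layout: {line!r}")
    return TutNode(m["pos"], m["form"], m["feats"].split(), m["head"], m["rel"])


# Hypothetical sample line, for illustration only.
node = parse_line("2 governo (GOVERNO NOUN COMMON M SING) [3;VERB-SUBJ]")
print(node.form, node.head, node.relation)
```

Positions are kept as strings rather than integers because the numbering used for traces is not specified here.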
According to the ARS, the name of each relation may include three components separated by hyphens:
MORPHOSYNTACTIC - FUNCTIONALSYNTACTIC - SEMANTIC (the symbol + is used as a separator between 2 parts of a single component).
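A minimal sketch of how such a relation name could be split into its components, assuming that components are filled from left to right and that any missing trailing component is simply empty. That ordering rule, the function name and the sample names are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    morphosyntactic: List[str]
    functional_syntactic: List[str]
    semantic: List[str]


def split_relation(name: str) -> Relation:
    """Split a relation name on '-' into components and on '+' into sub-parts."""
    parts = name.split("-")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected one to three components, got: {name!r}")
    # Assumption: missing trailing components are treated as empty.
    parts += [""] * (3 - len(parts))
    return Relation(*[p.split("+") if p else [] for p in parts])


# Made-up names, used only to show the splitting.
print(split_relation("VERB-SUBJ"))
print(split_relation("A-B+C-D"))
```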
The converted formats
Parallel treebanks may serve as a suitable infrastructure for comparing parsers from different linguistic frameworks, thus contributing to the investigation of why state-of-the-art results are hard to reproduce on annotations other than the Penn Wall Street Journal and on languages other than English.
A conversion tool has been applied to the native TUT in order to generate a Penn-like annotation. As a side effect, two other formats have been developed that show intermediate layers of variation/similarity with respect to TUT and Penn, in terms of both richness of functional-syntactic information (i.e. amount and specificity of grammatical relations) and type of linguistic framework (i.e. constituency versus dependency, or minimal versus maximal projection). The following image shows the cascade of formats (a rich selection of examples in parallel formats is available in Parallel annotations in TUT formats):
![[image]](http://mowser.com/img?url=http%3A%2F%2Fwww.di.unito.it%2F%7Etutreeb%2Fimages%2Filgoverno-total.jpg)
a) native TUT
b) Constituency-TUT, where terminal nodes represent words and non-terminal nodes represent the constituents, which are projections of grammatical categories. Constituency-TUT applies a maximal projection strategy, i.e. all grammatical categories project intermediate and maximal projections (e.g. Verb projects first to Vbar and then to VP; Noun projects first to Nbar and then to NP). Here the functional-syntactic relations are annotated on constituents rather than on edges.
c) Augmented Penn, which annotates the dependencies of TUT (on constituents, when possible), thus offering a repertory of functional relations larger than standard Penn, but draws trees structurally identical to those of Penn. Like Penn, it applies a minimal projection strategy, i.e. a terminal category projects only when the projected constituent includes more than one word.
d) Penn, which includes a few functional relations and implements a minimal projection strategy.
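The difference between the two projection strategies shows up most clearly on a constituent made of a single word. The toy sketch below covers only that case; the terminal labels (N, V), the bracketed output style and the function name are assumptions, and the attachment of dependents in multi-word constituents is deliberately not modelled.

```python
def project_single_word(category: str, word: str, strategy: str) -> str:
    """Bracket a one-word constituent under the given projection strategy."""
    leaf = f"({category} {word})"
    if strategy == "maximal":
        # Every category projects an intermediate and a maximal level.
        return f"({category}P ({category}bar {leaf}))"
    if strategy == "minimal":
        # A terminal projects only if the constituent has more than one word,
        # so a single word stays unprojected.
        return leaf
    raise ValueError(f"unknown strategy: {strategy!r}")


print(project_single_word("N", "governo", "maximal"))
print(project_single_word("N", "governo", "minimal"))
```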
The procedures for the treebank development
The TUT development exploits the TULE dependency parser.
Formats other than TUT are obtained by automatic conversion.
The treebank currently consists of 2,200 Italian sentences and 200 English sentences.
TUT corpus and subcorpora
The current Italian corpus (see below for download and description of current and previous releases) is composed of:
Civil law corpus: currently 1,100 sentences from the Italian civil law code; it was designed to support the activity on the topic Ontologies and legal knowledge and will enable us to start up an activity on semantic interpretation
Newspaper corpus: currently 1,100 sentences, of which 400 are from the Italian newspapers La Stampa and La Repubblica, 600 are thematically collected from Italian newspapers and journals on Albanians, and 100 are from academic writing and novels
The first contest among parsing systems for Italian, the Evalita 2007 - Parsing Task (see publications), used TUT as its development data set, in both native dependency and Penn format. In particular, the most recently added Italian materials, i.e. 100 sentences for the civil law corpus and 100 for the newspaper corpus, form the test set of this contest.
Moreover, a small English corpus of 200 sentences has been added, simply to help non-Italian speakers understand the annotation scheme. It is composed of:
50 sentences from an Internet site about atheism, often rather complex in syntactic structure, reflecting the intricacies of philosophical writing
50 from an Internet site about interculturality, likewise complex in syntactic structure, reflecting the intricacies of philosophical writing
50 from a Directive of the European Union, examples of technical-formal writing
50 from a manual written by a non-native English speaker (may include some errors of English)
For peculiar features (and extensions) of the TUT scheme for representing English structures see the section Applying TUT on English of the "Linguistic Notes".
Download the treebank
All the treebank data are freely available for download. Each zip file contains the data files and, if needed, a readme.txt with a brief description of the data.
ROUGH TEXT
RELEASE
TUT FORMAT
PENN FORMAT*
*The other formats are available on request.
**This is the Evalita 2007 - Parsing Task development data set, which includes the CoNLL-compliant version; other materials of the contest are available at this page.
TUT releases
After the first releases (rel. 0.0 in 2004 in native TUT format only; rel. 0.1 in 2006 in native TUT and Penn format; rel. evalita in 2007), the TUT Italian treebank has been progressively enlarged and improved, reaching its current size in rel. 1.1.
The currently maintained and recommended release is 1.1, which is an improved and enlarged version of the evalita one.
Publications
C. Bosco, A. Mazzei, V. Lombardo, G. Attardi, A. Corazza, A. Lavelli, L. Lesmo, G. Satta, M. Simi. Comparing Italian parsers on a common treebank: the Evalita experience. In Proceedings of LREC'08, Marrakesh, Morocco, 2008.
B. Magnini, A. Cappelli, F. Tamburini, C. Bosco, A. Mazzei, V. Lombardo, F. Bertagna, N. Calzolari, A. Toral, V. Bartalesi Lenzi, R. Sprugnoli, M. Speranza. Evaluation of Natural Language Tools for Italian: EVALITA 2007. In Proceedings of LREC'08, Marrakesh, Morocco, 2008.
C. Bosco, A. Mazzei, V. Lombardo. EVALITA PARSING TASK: an analysis of the first parsing system contest for Italian. Intelligenza artificiale, anno IV, num 2, June 2007.
C. Bosco. Multiple-step treebank conversion: from dependency to Penn format. In Proceedings of Linguistic Annotation Workshop (LAW) at ACL'07, Prague, Czech Republic, 2007.
C. Bosco. Linguistic knowledge extraction from corpus parallel annotations. In Proceedings of XL Congresso della Società di Linguistica Italiana, Vercelli, Italy, 2006.
C. Bosco, V. Lombardo. Comparing linguistic information in treebank annotations. In Proceedings of LREC'06, Genova, Italy, 2006.
C. Bosco, V. Lombardo. Dependency and relational structure in treebank annotation. In Proceedings of Workshop on Recent Advances in Dependency Grammar at COLING'04, Geneva, Switzerland, 2004.
C. Bosco. A grammatical relation system for treebank annotation. Unpublished PhD thesis discussed at University of Torino.
C. Bosco, V. Lombardo. A relation-schema for treebank annotation. In Proceedings of AI*IA2003, Pisa, Italy, 2003.
L. Lesmo, V. Lombardo, C. Bosco. Treebank Development: the TUT Approach. In Proceedings of ICON 2002, Mumbai, India, 2002.
C. Bosco. Grammatical relation's system in treebank annotation. In E. Miltsakaki, C. Monz, and A. Ribeiro, editors, Proceedings of Student Research Workshop of Joint ACL/EACL Meeting, pages 1-6, Toulouse, 2001.
C. Bosco. A richer annotation schema for an Italian treebank. In C. Piliere, editor, Proc. of ESSLLI-2000 Student Session, pages 22-33, Birmingham, 2000.
C. Bosco, V. Lombardo, D. Vassallo, and L. Lesmo. Building a treebank for Italian: a data-driven annotation schema. In Proc. 2nd International Conference on Language Resources and Evaluation LREC 2000, pages 99-105, Athens, 2000.
Treebank projects:
Linguistic resources:
UCREL, University centre for computer Corpus REsearch on Language (University of Lancaster)
LDC, Linguistic Data Consortium
Linguistic Annotation page, by S. Bird at UPenn
Conferences: