Treebank annotation

By –
Mohit Jasapara – 2012EEB1059
Aashish Kholiya – 2012MEB1083

Treebank
 The term "treebank" was coined by linguist Geoffrey Leech in the 1980s because
both syntactic and semantic structure are commonly represented compositionally
as a tree structure.
 In linguistics, a treebank is a parsed text corpus that annotates syntactic or
semantic sentence structure.
 In simple words, treebanks are collections of manually checked syntactic analyses
of sentences.

Construction
 Treebanks are often created on top of a corpus that has already been annotated
with part-of-speech tags.
 Treebanks are sometimes enhanced with semantic or other linguistic information.
 Treebanks can be created completely manually, where linguists annotate each
sentence with syntactic structure, or semi-automatically, where a parser assigns
some syntactic structure which linguists then check and, if necessary, correct.

Construction
 In practice, fully checking and completing the parsing of natural language corpora
is a labour-intensive project that can take teams of graduate linguists several years.
 The level of annotation detail and the breadth of the linguistic sample determine
the difficulty of the task and the length of time required to build a treebank.

Construction
 Some treebanks follow a specific linguistic theory in their syntactic annotation
(e.g. the BulTreeBank follows HPSG) but most try to be less theory-specific.
However, two main groups can be distinguished:
treebanks that annotate phrase structure (for example the Penn Treebank or ICE-GB)
and
those that annotate dependency structure (for example the Prague Dependency
Treebank or the Quranic Arabic Dependency Treebank).
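To make the two groups concrete, here is a minimal sketch of one sentence annotated both ways. The nested-tuple layout, the 1-based head indices, the relation labels (nsubj, obj, root) and the helper names are assumptions chosen for illustration; they are not the actual formats or label sets of the treebanks named above.

```python
# Phrase structure: words are grouped into nested, labelled constituents.
phrase_structure = (
    "S",
    [
        ("NP", [("NNP", ["John"])]),
        ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])]),
    ],
)

# Dependency structure: every word points to its head word.
# Heads are 1-based word positions; 0 marks the root of the sentence.
words = ["John", "loves", "Mary"]
heads = [2, 0, 2]
relations = ["nsubj", "root", "obj"]


def count_labels(node):
    """Count the labelled nodes in a phrase-structure tree."""
    label, children = node
    return 1 + sum(count_labels(c) for c in children if not isinstance(c, str))


def dependents(head_index):
    """Return (word, relation) pairs for all words attached to a head."""
    return [
        (words[i], relations[i])
        for i, head in enumerate(heads)
        if head == head_index
    ]


print(count_labels(phrase_structure))  # labelled nodes above the words
print(dependents(2))                   # words governed by "loves"
```

The phrase-structure version introduces nodes that are not words (S, NP, VP), while the dependency version has exactly one node per word and encodes structure only through head links.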

Construction
 It is important to clarify the distinction between the formal representation and the
file format used to store the annotated data.
 Treebanks are necessarily constructed according to a particular grammar. The same
grammar may be implemented by different file formats.

Construction
For example, the syntactic analysis for John loves Mary, shown in the figure on the
right, may be represented by simple labelled brackets in a text file, like this (following
the Penn Treebank notation):
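As a hedged illustration of the labelled-bracket style, the sketch below writes the John loves Mary analysis as a bracketed string and reads it back into a nested structure. NNP and VBZ are standard Penn Treebank part-of-speech tags; the exact bracketing shown and the helper names parse_brackets and leaves are assumptions made for this example, not part of any treebank toolkit.

```python
EXAMPLE = """
(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))
"""


def parse_brackets(text):
    """Parse a labelled-bracket string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing bracket")
    return tree


def leaves(node):
    """Collect the words of a tree from left to right."""
    label, children = node
    result = []
    for child in children:
        if isinstance(child, str):
            result.append(child)
        else:
            result.extend(leaves(child))
    return result


tree = parse_brackets(EXAMPLE)
print(tree[0])       # label of the root node
print(leaves(tree))  # the words of the sentence
```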

Construction
 This type of representation is popular because it is light on resources, and the tree
structure is relatively easy to read without software tools. However, as corpora
become increasingly complex, other file formats may be preferred. Alternatives
include treebank-specific XML schemes, numbered indentation and various types
of standoff notation.
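The sketch below shows how one tree can be written out in two file formats: labelled brackets and a small XML scheme. The element name node, the attribute cat and the element name word are hypothetical choices for this example and do not correspond to the schema of any particular treebank.

```python
import xml.etree.ElementTree as ET

# One tree, held in memory as nested (label, children) tuples.
TREE = (
    "S",
    [
        ("NP", [("NNP", ["John"])]),
        ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])]),
    ],
)


def to_brackets(node):
    """Serialise the tree as a single-line labelled-bracket string."""
    label, children = node
    parts = [c if isinstance(c, str) else to_brackets(c) for c in children]
    return "(" + label + " " + " ".join(parts) + ")"


def to_xml(node):
    """Serialise the same tree as nested XML elements (hypothetical scheme)."""
    label, children = node
    element = ET.Element("node", {"cat": label})
    for child in children:
        if isinstance(child, str):
            ET.SubElement(element, "word").text = child
        else:
            element.append(to_xml(child))
    return element


brackets = to_brackets(TREE)
xml_text = ET.tostring(to_xml(TREE), encoding="unicode")

print(brackets)
print(xml_text)
```

Both outputs carry the same analysis. The XML form is more verbose, but it leaves room for extra attributes on each node, which is one reason richer corpora may prefer it.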

Applications
Computational perspective
 From a computational perspective, treebanks have been used to engineer
state-of-the-art natural language processing systems such as part-of-speech
taggers, parsers, semantic analyzers and machine translation systems.
 Most computational systems utilize gold-standard treebank data.
 However, an automatically parsed corpus that is not corrected by human linguists
can still be useful.

Applications
 It can provide evidence of rule frequency for a parser.
 A parser may be improved by applying it to large amounts of text and gathering
rule frequencies.
 However, only by correcting and completing a corpus by hand is it possible to
identify rules that are absent from the parser's knowledge base. In addition,
frequencies drawn from a hand-corrected corpus are likely to be more accurate.
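Gathering rule frequencies can be sketched as follows: walk each tree, record every rewrite rule (parent label and the labels of its children), and turn the counts into relative frequencies per parent label. The two toy trees, the tuple layout and the function name rules are assumptions for illustration only; a real treebank would supply many thousands of trees.

```python
from collections import Counter

# Two toy trees as nested (label, children) tuples.
TREES = [
    ("S", [("NP", [("NNP", ["John"])]),
           ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])])]),
    ("S", [("NP", [("NNP", ["Mary"])]),
           ("VP", [("VBZ", ["sleeps"])])]),
]


def rules(node):
    """Yield (parent, child labels) for every non-lexical rule in a tree."""
    label, children = node
    subtrees = [c for c in children if not isinstance(c, str)]
    if subtrees:  # skip tag-to-word rules such as NNP -> John
        yield (label, tuple(child[0] for child in subtrees))
        for child in subtrees:
            yield from rules(child)


# Raw counts of each rule across the whole collection.
counts = Counter(rule for tree in TREES for rule in rules(tree))

# Relative frequency of each rule among rules with the same parent label.
parent_totals = Counter()
for (parent, _), n in counts.items():
    parent_totals[parent] += n
probabilities = {rule: n / parent_totals[rule[0]] for rule, n in counts.items()}

for (parent, children), p in sorted(probabilities.items()):
    print(parent, "->", " ".join(children), round(p, 2))
```

If the trees come from an uncorrected parser, the counts can only contain rules the parser already knows, which is why hand correction is needed to discover missing rules.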

Applications
Corpus linguistics
 In corpus linguistics, treebanks are used to study syntactic phenomena. For
example, diachronic corpora can be used to study the time course of syntactic
change.
 Once parsed, a corpus will contain frequency evidence showing how common
different grammatical structures are in use.
 Treebanks also provide evidence of coverage and support the discovery of new,
unanticipated, grammatical phenomena.

Applications
 Interaction research is particularly fruitful as further layers of annotation
(e.g. semantic and pragmatic) are added to a corpus.
 It is then possible to evaluate the impact of non-syntactic phenomena on
grammatical choices.

Applications
Theoretical linguistics and Psycholinguistics
 Another use of treebanks in theoretical linguistics and psycholinguistics is
interaction evidence.
 A completed treebank can help linguists carry out experiments on how the
decision to use one grammatical construction tends to influence the decision to
form others, and to try to understand how speakers and writers make decisions as
they form sentences.

Penn Treebank Project
 The Penn Treebank Project annotates naturally-occurring text for linguistic
structure.
 Most notably, it produces skeletal parses showing rough syntactic and semantic
information: a bank of linguistic trees.
 It also annotates text with part-of-speech tags and, for the Switchboard corpus
of telephone conversations, dysfluency annotation.
 It is located in the LINC Laboratory of the Computer and Information Science
Department at the University of Pennsylvania.

Penn Treebank Project
 The Linguistic Data Consortium (LDC) provides tools and formats for creating and
managing linguistic annotations.
 'Linguistic annotation' covers any descriptive or analytic notations applied to raw
language data.
 The Penn Treebank is a human-annotated and partially 'skeletally' parsed corpus
consisting of over 4.5 million words of American English.
 It includes the Brown Corpus (retagged) and the Wall Street Journal Corpus, as well
as Department of Energy abstracts, Dow Jones Newswire stories, Department of
Agriculture bulletins, Library of America texts, MUC-3 messages, IBM Manual
sentences, WBUR radio transcripts, and ATIS sentences.

References
 http://en.wikipedia.org/wiki/Treebank
 http://www.cis.upenn.edu/~treebank/
 https://catalog.ldc.upenn.edu/LDC97S62
 http://mshang.ca/syntree/
 http://faculty.washington.edu/fxia/LAWVI/workshop_presentation_slides/special_session/pml/
 http://www.seas.upenn.edu/~pdtb/tools.shtml