1. Do we need annotated corpora
in the era of the data deluge?
Martin Wynne
martin.wynne@oucs.ox.ac.uk
researchsupport@oucs.ox.ac.uk
Oxford e-Research Centre &
IT Services (formerly OUCS) &
Faculty of Linguistics, Philology and Phonetics,
University of Oxford
ACRH2, Lisbon
Thursday 29th November 2012
3. Problems with annotation
It can:
• lead to circular reasoning
• be incorrect
• be inconsistent
• follow a particular theory
• have a specific level of granularity
• use a particular tag-set
• introduce subjective interpretations
5. The case for the corpus today
(against “the web as corpus”)
The spoken corpus: spoken, and other non-computer-mediated data
The historical corpus: pre-internet data (beyond books)
The specialised corpus: with integrity, provenance and controlled
sampling and representativeness
The annotated corpus: adding and sharing linguistic annotation
The web corpus: filtering and organising the data deluge (aka "the web
for corpus")
6. The case for the corpus today
(against “the web as corpus”)
But we do need to go beyond the finite text corpus:
• speech
• video
• the language of the internet - new genres, new media, new modes
• capturing the context, especially other data streams
• engaging with the non-finite corpora (aka "the web as corpus")
9. Image by James Cridland from Flickr. Some rights reserved.
10. Annotation - Why?
• To perform identification, categorization and analysis
of features of the text
• It enables certain types of search and analysis,
especially beyond the word form (e.g. “search for all
inflected forms of cause as a verb”)
• It can be the foundation for further automatic analysis
of a corpus (e.g. POS tags can be used for parsing)
• Preserving the analysis, enabling replicability of
research, and reusability of the annotated corpus
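The search example above ("all inflected forms of cause as a verb") can be sketched in a few lines. This is only an illustration: the `Token` type, the toy corpus, the tag names and the `find_by_lemma_and_pos` helper are all assumptions made for the example, not part of any particular corpus tool.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One corpus token with two hypothetical annotation layers."""
    word: str   # the word form as it appears in the text
    lemma: str  # annotation: dictionary headword
    pos: str    # annotation: part-of-speech tag (illustrative tag-set)


# A toy annotated corpus (invented for this sketch).
corpus = [
    Token("Smoking", "smoking", "NOUN"),
    Token("causes", "cause", "VERB"),
    Token("cancer", "cancer", "NOUN"),
    Token(".", ".", "PUNCT"),
    Token("The", "the", "DET"),
    Token("cause", "cause", "NOUN"),
    Token("was", "be", "AUX"),
    Token("unknown", "unknown", "ADJ"),
    Token(".", ".", "PUNCT"),
    Token("Stress", "stress", "NOUN"),
    Token("caused", "cause", "VERB"),
    Token("it", "it", "PRON"),
    Token(".", ".", "PUNCT"),
]


def find_by_lemma_and_pos(tokens, lemma, pos):
    """Return (position, word form) for tokens matching both annotations."""
    return [
        (i, t.word)
        for i, t in enumerate(tokens)
        if t.lemma == lemma and t.pos == pos
    ]


hits = find_by_lemma_and_pos(corpus, "cause", "VERB")
print(hits)
```

A plain word-form search for "cause" would miss "causes" and "caused" and would wrongly return the noun; the lemma and POS layers are what make the query possible.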
11. Annotation: less than the text?
“Annotation of a text is a procedure which loses
information. There is no point in arguing that the
information is in the computer's memory somewhere
- annotation is the substitution of a general
category for a specific item, and with respect to
that area of the classification, the item has lost its
uniqueness.”
(John Sinclair, personal communication, 2001)
12. Annotation: how?
• Annotations should be separable
• Detailed and explicit documentation should be provided
• Annotation practices should be linguistically consensual
• Annotation should observe standards
(Leech 2005)
http://www.ota.ox.ac.uk/documents/creating/dlc/
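One common way to keep annotations separable is stand-off annotation, where the text is left untouched and each annotation layer points into it by character offsets. The sketch below is a minimal illustration under that assumption; the sample sentence, the dictionary layout, the tag names and the `render_inline` helper are hypothetical and do not follow any specific standard.

```python
# The source text is stored once and never modified.
text = "Smoking causes cancer."

# A separate annotation layer: each entry points into the text
# by character offsets (start inclusive, end exclusive).
pos_layer = [
    {"start": 0, "end": 7, "tag": "NOUN"},
    {"start": 8, "end": 14, "tag": "VERB"},
    {"start": 15, "end": 21, "tag": "NOUN"},
    {"start": 21, "end": 22, "tag": "PUNCT"},
]


def render_inline(source, layer):
    """Combine the text and one layer into a word_TAG view on demand."""
    return " ".join(
        f"{source[a['start']:a['end']]}_{a['tag']}" for a in layer
    )


inline_view = render_inline(text, pos_layer)
print(inline_view)
```

Because the layer is a separate object, it can be dropped, replaced by a different tag-set, or shipped alongside other layers without changing the text itself.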
13. Annotation standards?
Use of standards can help to ensure successful:
• interpretation,
• interchange,
• preservation,
• incorporation into other resources,
• processing by generic software.
And is a way of resolving tricky encoding decisions, and
of justifying and documenting your decisions.
14. Potential problems with annotation
1. Annotation is liable to be subjective and inconsistent
2. Annotation is sometimes intellectual and painstaking,
sometimes trivial and automatic
3. Annotation leads to digital silos
4. Annotation makes building a shared services
infrastructure difficult
15. Interoperability and sustainability
for digital textual scholarship
Well-known problems with digital resources in the humanities:
• fragmentation of communities, resources, tools;
• lack of connectedness and interoperability;
• sustainability of online services;
• lack of deployment of tools as reliable and available services.
There is a potential solution in distributed, federated infrastructure
services.
18. The CLARIN Vision
A researcher in Darmstadt, from his desktop computer, can:
• do a single sign-on, with local authentication, and then:
• search for, find and obtain authorization to use corpora in Oxford,
Prague and Berlin
• select the precise dataset to work on, and save that selection
• run semantic analysis tools from Budapest and statistical tools from
Tübingen over the dataset
• use computational power from the local, national or other
computing centre where necessary
• obtain advice and support for carrying out all technical and
methodological procedures
• save the workflow and results of the analysis, and share those
results with collaborators in Paris, Vienna and Zagreb
• discuss and iteratively adapt and re-run the analyses with
collaborators
20. Silos or fishtanks??
Let's talk about fishtanks rather than silos...
There are lots of fishtanks out there, some very elaborate, big, pretty...
But they're all in different places and
unconnected.
And if I want to keep a fish I have to
build a fishtank (or put it in yours)...
And who's going to carry on feeding
the fish?
Let's not all make our own fishtanks.
21. Wouldn't it be better to have an ecosystem where we can all set our
fishes free?
You can access all of the riches of the deep and it's a lot easier to get
into fish research
25. CLARIN
http://www.clarin.eu/
Infrastructure services for research in the humanities and
social sciences using language resources and tools.
Services to include:
• Access and identity federation
• Network of service centres
• Concept and component metadata registries
• Federated resource discovery
• Federated search across resources
• SOA for connecting tools
• PID services
Bamboo
http://www.project-bamboo.org/
Project Bamboo is building applications and shared
infrastructure for humanities research, principally:
• Research environments for humanities scholars
• Infrastructure allowing librarians and technologists to support
humanities scholarship
• Evolution of shared applications for the curation and exploration of
widely distributed content collections
• Build a community for uptake, expansion and sustainability
DARIAH
http://www.dariah.eu
Enhance and support digitally-enabled research across the
humanities and arts.
DARIAH is working with communities of practice to:
• Explore and apply ICT-based methods and tools
• Improve research opportunities and outcomes through linking
distributed digital source materials of many kinds
• Exchange knowledge, expertise, methodologies and practices
across domains and disciplines
31. Player One (a man)
Player Two (a woman)
[Enter two players]
What news, Borachio?
[Don John, Much Ado About Nothing, I, 3]
I came yonder from a great supper: I can give
you intelligence of an intended marriage.
[Borachio, Much Ado About Nothing, I, 3]
They say the lady is fair; 'tis a truth,
I can bear them witness; and virtuous;
'tis so, I cannot reprove it
[Benedick, Much Ado About Nothing, II, 3]
A married man! that's most intolerable.
[Earl of Warwick, Henry VI Part I, V, 4]
Yet hasty marriage seldom proveth well.
[Richard III, Henry VI Part III, IV, 1]
Is the single man therefore blessed?
No; as a wall'd town is more worthier than a
village, so is the forehead of a married man
more honourable than the bare brow of a
bachelor
[Touchstone, As You Like It, III, 3]
Many a good hanging prevents a bad
marriage
[Feste, Twelfth Night, I, 5]
By this marriage, All little jealousies, which
now seem great,
And all great fears, which now import their
dangers,
Would then be nothing
[Agrippa, Antony and Cleopatra, II, 2]
I may chance have some odd quirks and
remnants of wit broken on me, because I
have railed so long against marriage: but doth
not the appetite alter? a man loves the meat
in his youth that he cannot endure in his age.
[Benedick, Much Ado About Nothing, II, 3]
They are in the very wrath of love, and they
will together. Clubs cannot part them.
[Rosalind, As You Like It, V, 2]
Speak low, if you speak love.
[Don Pedro, Much Ado About Nothing, II, 1]
41. "[There is] a monolithic conception of social space, according to which
it would suffice to have the right information to make the right decisions.
But in point of fact, information itself is far from homogenous and no
purely quantitative approach is satisfying. Having ever greater amounts
of information at our fingertips not only does not make us more
virtuous, as Rousseau already predicted, but it does not even make us
more knowledgeable."
[Tzvetan Todorov, In Defence of the Enlightenment, 2009]
42. The simple challenge then...
... to transform the Humanities by promoting shared digital services,
facilities, resources and tools, without destroying the justification and
arguments for the Humanities for the Humanities' sake, and thus
accidentally contributing to the decline and eventual destruction of
civilization
44. The 'take-home messages'
• in the era of the data deluge, web science and digital scholarship,
we need to rethink the case for the corpus today, and the case
for doing annotation
• we need an ecosystem, not separate 'fishtanks'
• annotation risks more fragmentation
• we need to follow the physical sciences in deciding priorities &
adopting standards, reducing complexity and variety, to promote
shared facilities and infrastructures
• but, at the same time, we need to avoid arguments for scientism
and instrumentalism, and to defend the humanities