Do we need annotated corpora
in the era of the data deluge?

Martin Wynne
martin.wynne@oucs.ox.ac.uk
researchsupport@oucs.ox.ac.uk
Oxford e-Research Centre &
IT Services (formerly OUCS) &
Faculty of Linguistics, Philology and Phonetics,
University of Oxford

ACRH2
Lisbon
Thursday 29th November 2012

Problems with annotation

It can:
• lead to circular reasoning
• be incorrect
• be inconsistent
• follow a particular theory
• have a specific level of granularity
• use a particular tag-set
• introduce subjective interpretations
The data deluge
The case for the corpus today
(against “the web as corpus”)

The spoken corpus: spoken, and other non-computer-mediated data
The historical corpus: pre-internet data (beyond books)
The specialised corpus: with integrity, provenance and controlled
sampling and representativeness
The annotated corpus: adding and sharing linguistic annotation
The web corpus: filtering and organising the data deluge (aka "the web
for corpus")

The case for the corpus today
(against “the web as corpus”)

But we do need to go beyond the finite text corpus:
● speech
● video
● the language of the internet - new genres, new media, new modes
● capturing the context, especially other data streams
● engaging with the non-finite corpora (aka "the web as corpus")

Image by James Cridland from Flickr. Some rights reserved.
Annotation - Why?

• To perform identification, categorization and analysis
of features of the text
• It enables certain types of search and analysis,
especially beyond the word form (e.g. “search for all
inflected forms of cause as a verb”)
• It can be the foundation for further automatic analysis
of a corpus (e.g. POS tags can be used for parsing)
• Preserving the analysis, enabling replicability of
research, and reusability of the annotated corpus
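
As a minimal sketch of the kind of search that annotation enables (all inflected forms of "cause" as a verb), the example below filters a tiny hand-tagged sample by lemma and part of speech. The Token type, the search function, the sample sentence and its Penn-Treebank-style tags are illustrative assumptions, not taken from any real corpus or tool.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # dictionary headword (annotation)
    pos: str    # part-of-speech tag (annotation)


def search(tokens: List[Token], lemma: str, pos_prefix: str) -> List[Tuple[int, str]]:
    """Return (position, word form) for tokens matching a lemma and a POS prefix."""
    return [
        (i, t.form)
        for i, t in enumerate(tokens)
        if t.lemma == lemma and t.pos.startswith(pos_prefix)
    ]


# Hypothetical, hand-tagged sample sentence (tags assigned for illustration only).
SAMPLE = [
    Token("The", "the", "DT"),
    Token("cause", "cause", "NN"),
    Token("was", "be", "VBD"),
    Token("unclear", "unclear", "JJ"),
    Token(",", ",", ","),
    Token("but", "but", "CC"),
    Token("it", "it", "PRP"),
    Token("caused", "cause", "VBD"),
    Token("delays", "delay", "NNS"),
    Token(".", ".", "."),
]

# Verb uses of "cause" only; the noun at position 1 is excluded.
verb_hits = search(SAMPLE, "cause", "V")
```

A search on the word form alone could not separate the noun "cause" from the verb "caused", or link "caused" back to its lemma.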

Annotation: less than the text?

“Annotation of a text is a procedure which loses
information. There is no point in arguing that the
information is in the computer's memory somewhere
- annotation is the substitution of a general
category for a specific item, and with respect to
that area of the classification, the item has lost its
uniqueness.”
(John Sinclair, personal communication, 2001)

Annotation: how?

• Annotations should be separable
• Detailed and explicit documentation should be provided
• Annotation practices should be linguistically consensual
• Annotation should observe standards
(Leech 2005)

http://www.ota.ox.ac.uk/documents/creating/dlc/
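
One common way to keep annotations separable is stand-off annotation, where the text is stored untouched and the labels point into it by character offset. The sketch below is only an illustration of that idea: the Span type, the annotated_forms function and the part-of-speech labels are assumptions made for this example, not a format recommended by Leech (2005) or by any particular standard.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    start: int  # character offset where the annotated item begins
    end: int    # character offset just past the annotated item
    label: str  # the annotation itself (here a hypothetical POS label)


# The text is kept exactly as it is; nothing is inserted into it.
TEXT = "Speak low, if you speak love."

# Annotations live in a separate layer and can be replaced or discarded.
ANNOTATIONS = [
    Span(0, 5, "VERB"),
    Span(6, 9, "ADV"),
    Span(18, 23, "VERB"),
    Span(24, 28, "NOUN"),
]


def annotated_forms(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Pair each annotation label with the stretch of text it points to."""
    return [(text[s.start:s.end], s.label) for s in spans]


pairs = annotated_forms(TEXT, ANNOTATIONS)
```

Because the labels sit outside the text, a second annotation layer with a different tag-set or theory could be added alongside the first without altering either the text or the existing layer.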

Annotation standards?

Use of standards can help to ensure successful:
• interpretation,
• interchange,
• preservation,
• incorporation into other resources,
• processing by generic software.
And is a way of resolving tricky encoding decisions, and
of justifying and documenting your decisions.

Potential problems with annotation

1. Annotation is liable to be subjective and inconsistent
2. Annotation is sometimes intellectual and painstaking,
sometimes trivial and automatic
3. Annotation leads to digital silos
4. Annotation makes building a shared services
infrastructure difficult

Interoperability and sustainability
for digital textual scholarship

Well-known problems with digital resources in the humanities:
• fragmentation of communities, resources, tools;
• lack of connectedness and interoperability;
• sustainability of online services;
• lack of deployment of tools as reliable and available services
There is a potential solution in distributed, federated infrastructure
services.

The CLARIN Vision
A researcher in Darmstadt, from his desktop computer, can:
• do a single sign-on, with local authentication, and then:
• search for, find and obtain authorization to use corpora in Oxford,
Prague and Berlin
• select the precise dataset to work on, and save that selection
• run semantic analysis tools from Budapest and statistical tools from
Tübingen over the dataset
• use computational power from the local, national or other
computing centre where necessary
• obtain advice and support for carrying out all technical and
methodological procedures
• save the workflow and results of the analysis, and share those
results with collaborators in Paris, Vienna and Zagreb
• discuss and iteratively adapt and re-run the analyses with
collaborators

Silos or fishtanks??

Let's talk about fishtanks rather than silos...
There are lots of fishtanks out there, some very elaborate, big, pretty...
But they're all in different places and
unconnected.
And if I want to keep a fish I have to
build a fishtank (or put it in yours)...
And who's going to carry on feeding
the fish?
Let's not all make our own fishtanks.

Wouldn't it be better to have an ecosystem where we can all set our
fishes free?

You can access all of the riches of the deep and it's a lot easier to get
into fish research

CLARIN
http://www.clarin.eu/

Infrastructure services for
research in the humanities and
social sciences using language
resources and tools.
Services to include:
• Access and identity federation
• Network of service centres
• Concept and component metadata registries
• Federated resource discovery
• Federated search across resources
• SOA for connecting tools
• PID services

Bamboo
http://www.project-bamboo.org/

Project Bamboo is building
applications and shared
infrastructure for humanities
research, principally:
• Research environments for humanities scholars
• Infrastructure allowing librarians and technologists to support
humanities scholarship
• Evolution of shared applications for the curation and exploration of
widely distributed content collections
• Build a community for uptake, expansion and sustainability

DARIAH
http://www.dariah.eu

Enhance and support digitally-enabled research across the
humanities and arts.
DARIAH is working with
communities of practice to:
• Explore and apply ICT-based methods and tools
• Improve research opportunities and outcomes through linking
distributed digital source materials of many kinds
• Exchange knowledge, expertise, methodologies and practices
across domains and disciplines

Corpus Linguistics

Player One (a man)

Player Two (a woman)

[Enter two players]

What news, Borachio?
[Don John, Much Ado About Nothing, I, 3]

I came yonder from a great supper: I can give
you intelligence of an intended marriage.
[Borachio, Much Ado About Nothing, I, 3]

They say the lady is fair; 'tis a truth,
I can bear them witness; and virtuous;
'tis so, I cannot reprove it
[Benedick, Much Ado About Nothing, II, 3]

A married man! that's most intolerable.
[Earl of Warwick, Henry VI Part I, V, 4]

Yet hasty marriage seldom proveth well.
[Richard III, Henry VI Part III, IV, 1]

Is the single man therefore blessed?
No; as a wall'd town is more worthier than a
village, so is the forehead of a married man
more honourable than the bare brow of a
bachelor
[Touchstone, As You Like It, III, 3]

Many a good hanging prevents a bad
marriage
[Feste, Twelfth Night, I, 5]

By this marriage, All little jealousies, which
now seem great,
And all great fears, which now import their
dangers,
Would then be nothing
[Agrippa, Antony and Cleopatra, II, 2]

I may chance have some odd quirks and
remnants of wit broken on me, because I
have railed so long against marriage: but doth
not the appetite alter? a man loves the meat
in his youth that he cannot endure in his age.
[Benedick, Much Ado About Nothing, II, 3]

They are in the very wrath of love, and they
will together. Clubs cannot part them.
[Rosalind, As You Like It, V, 2]

Speak low, if you speak love.
[Don Pedro, Much Ado About Nothing, II, 1]

Data-intensive Humanities

Nature 474, 436-440 (2011) | doi:10.1038/474436a

"[There is] a monolithic conception of social space, according to which
it would suffice to have the right information to make the right decisions.
But in point of fact, information itself is far from homogenous and no
purely quantitative approach is satisfying. Having ever greater amounts
of information at our fingertips not only does not make us more
virtuous, as Rousseau already predicted, but it does not even make us
more knowledgeable."
[Tzvetan Todorov, In Defence of the Enlightenment, 2009]

The simple challenge then...

... to transform the Humanities by promoting shared digital services,
facilities, resources and tools, without destroying the justification and
arguments for the Humanities for the Humanities' sake, and thus
accidentally contributing to the decline and eventual destruction of
civilization

The 'take-home messages'
● in the era of the data deluge, web science and digital scholarship,
we need to rethink the case for the corpus today, and the case
for doing annotation
● we need an ecosystem, not separate 'fishtanks'
● annotation risks more fragmentation
● we need to follow the physical sciences in deciding priorities &
adopting standards, reducing complexity and variety, to promote
shared facilities and infrastructures
● but, at the same time, we need to avoid arguments for scientism
and instrumentalism, and to defend the humanities

Annotated Corpora for Research in the Humanities
