Do we need annotated corpora
in the era of the data deluge?

Martin Wynne
martin.wynne@oucs.ox.ac.uk
researchsupport@oucs.ox.ac.uk
Oxford e-Research Centre,
IT Services (formerly OUCS) &
Faculty of Linguistics, Philology and Phonetics,
University of Oxford

ACRH2
Lisbon
Thursday 29th November 2012
Problems with annotation

It can:
• lead to circular reasoning
• be incorrect
• be inconsistent
• follow a particular theory
• have a specific level of granularity
• use a particular tag-set
• introduce subjective interpretations
The data deluge
The case for the corpus today
(against “the web as corpus”)

The spoken corpus: spoken, and other non-computer-mediated data
The historical corpus: pre-internet data (beyond books)
The specialised corpus: with integrity, provenance and controlled
sampling and representativeness
The annotated corpus: adding and sharing linguistic annotation
The web corpus: filtering and organising the data deluge (aka "the web
for corpus")

The case for the corpus today
(against “the web as corpus”)

But we do need to go beyond the finite text corpus:
● speech
● video
● the language of the internet - new genres, new media, new modes
● capturing the context, especially other data streams
● engaging with the non-finite corpora (aka "the web as corpus")

Image by James Cridland from Flickr. Some rights reserved.
Annotation - Why?

• To perform identification, categorization and analysis
of features of the text
• It enables certain types of search and analysis,
especially beyond the word form (e.g. “search for all
inflected forms of cause as a verb”)
• It can be the foundation for further automatic analysis
of a corpus (e.g. POS tags can be used for parsing)
• Preserving the analysis, enabling replicability of
research, and reusability of the annotated corpus
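
As a minimal sketch of the second point, assume a corpus that has already been tokenised, lemmatised and POS-tagged. The toy tokens, the `Token` record and the tag names below are invented for illustration and do not come from any particular corpus or tag-set. A search for all inflected forms of cause as a verb could then look roughly like this:

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # dictionary headword (hypothetical annotation layer)
    pos: str    # part-of-speech tag (hypothetical, simplified tag-set)


# Toy annotated corpus, invented for illustration only.
CORPUS = [
    Token("The", "the", "DET"),
    Token("cause", "cause", "NOUN"),
    Token("was", "be", "VERB"),
    Token("unclear", "unclear", "ADJ"),
    Token("but", "but", "CONJ"),
    Token("smoking", "smoking", "NOUN"),
    Token("causes", "cause", "VERB"),
    Token("harm", "harm", "NOUN"),
    Token("and", "and", "CONJ"),
    Token("has", "have", "VERB"),
    Token("caused", "cause", "VERB"),
    Token("it", "it", "PRON"),
    Token("before", "before", "ADV"),
]


def find_lemma(corpus, lemma, pos):
    """Return (position, word form) for every token with this lemma and POS."""
    return [
        (i, tok.form)
        for i, tok in enumerate(corpus)
        if tok.lemma == lemma and tok.pos == pos
    ]


verb_hits = find_lemma(CORPUS, "cause", "VERB")
print(verb_hits)
```

A search on word forms alone cannot separate the noun cause from the verb; the lemma and POS layers are what make that distinction queryable.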

Annotation: less than the text?

“Annotation of a text is a procedure which loses
information. There is no point in arguing that the
information is in the computer's memory somewhere
- annotation is the substitution of a general
category for a specific item, and with respect to
that area of the classification, the item has lost its
uniqueness.”
(John Sinclair, personal communication, 2001)

Annotation: how?

• Annotations should be separable
• Detailed and explicit documentation should be provided
• Annotation practices should be linguistically consensual
• Annotation should observe standards
(Leech 2005)

http://www.ota.ox.ac.uk/documents/creating/dlc/
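
One way to read the "separable" recommendation is stand-off annotation: the text is stored untouched and each annotation points into it by character offsets. The sketch below only illustrates that reading; the record layout and the tag names are hypothetical and are not prescribed by Leech (2005).

```python
TEXT = "What news, Borachio?"

# Stand-off layer: (start, end, tag) character spans over TEXT.
# The tags are hypothetical; the text itself is never modified.
POS_LAYER = [
    (0, 4, "DET"),
    (5, 9, "NOUN"),
    (9, 10, "PUNCT"),
    (11, 19, "PROPN"),
    (19, 20, "PUNCT"),
]


def apply_layer(text, layer):
    """Pair each annotated span of the text with its tag."""
    return [(text[start:end], tag) for start, end, tag in layer]


tagged = apply_layer(TEXT, POS_LAYER)
print(tagged)
```

Because the layer lives apart from the text, it can be dropped, replaced or supplemented by a second layer under a different theory without touching the original.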

Annotation standards?

Use of standards can help to ensure successful:
• interpretation,
• interchange,
• preservation,
• incorporation into other resources,
• processing by generic software.
And is a way of resolving tricky encoding decisions, and
of justifying and documenting your decisions.

Potential problems with annotation

1. Annotation is liable to be subjective and inconsistent
2. Annotation is sometimes intellectual and painstaking,
sometimes trivial and automatic
3. Annotation leads to digital silos
4. Annotation makes building a shared services
infrastructure difficult
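
Point 1 can at least be measured. One common measure, which is not mentioned on the slide and is offered here only as an assumption about how consistency might be checked, is Cohen's kappa: the agreement between two annotators corrected for the agreement expected by chance. The two tag sequences below are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator tagged at random
    # with their own observed tag frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical tag throughout
    return (observed - expected) / (1 - expected)


# Invented tag sequences for the same ten tokens.
annotator_a = ["NOUN", "VERB", "NOUN", "ADJ", "VERB",
               "NOUN", "NOUN", "VERB", "ADJ", "NOUN"]
annotator_b = ["NOUN", "VERB", "VERB", "ADJ", "VERB",
               "NOUN", "ADJ", "VERB", "ADJ", "NOUN"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a low kappa is one concrete symptom of the inconsistency named above.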

Interoperability and sustainability
for digital textual scholarship

Well-known problems with digital resources in the humanities of:
• fragmentation of communities, resources, tools;
• lack of connectedness and interoperability;
• sustainability of online services;
• lack of deployment of tools as reliable and available services
There is a potential solution in distributed, federated infrastructure
services.

The CLARIN Vision
A researcher in Darmstadt, from his desktop computer, can:
• do a single sign-on, with local authentication, and then:
• search for, find and obtain authorization to use corpora in Oxford,
Prague and Berlin
• select the precise dataset to work on, and save that selection
• run semantic analysis tools from Budapest and statistical tools from
Tübingen over the dataset
• use computational power from the local, national or other
computing centre where necessary
• obtain advice and support for carrying out all technical and
methodological procedures
• save the workflow and results of the analysis, and share those
results with collaborators in Paris, Vienna and Zagreb
• discuss and iteratively adapt and re-run the analyses with
collaborators

Silos or fishtanks??

Let's talk about fishtanks rather than silos...
There are lots of fishtanks out there, some very elaborate, big, pretty...
But they're all in different places and
unconnected.
And if I want to keep a fish I have to
build a fishtank (or put it in yours)...
And who's going to carry on feeding
the fish?
Let's not all make our own fishtanks.

Wouldn't it be better to have an ecosystem where we can all set our
fishes free?

You can access all of the riches of the deep and it's a lot easier to get
into fish research

CLARIN
http://www.clarin.eu/

Infrastructure services for
research in the humanities and
social sciences using language
resources and tools.
Services to include:
• Access and identity federation
• Network of service centres
• Concept and component metadata registries
• Federated resource discovery
• Federated search across resources
• SOA for connecting tools
• PID services

Bamboo
http://www.project-bamboo.org/

Project Bamboo is building
applications and shared
infrastructure for humanities
research, principally:
• Research environments for humanities scholars
• Infrastructure allowing librarians and technologists to support
humanities scholarship
• Evolution of shared applications for the curation and exploration of
widely distributed content collections
• Build a community for uptake, expansion and sustainability

DARIAH
http://www.dariah.eu

Enhance and support digitally-enabled research across the
humanities and arts.
DARIAH is working with
communities of practice to:
• Explore and apply ICT-based methods and tools
• Improve research opportunities and outcomes through linking
distributed digital source materials of many kinds
• Exchange knowledge, expertise, methodologies and practices
across domains and disciplines

Corpus Linguistics

Player One (a man)

Player Two (a woman)

[Enter two players]

What news, Borachio?
[Don John, Much Ado About Nothing, I, 3]

I came yonder from a great supper: I can give
you intelligence of an intended marriage.
[Borachio, Much Ado About Nothing, I, 3]

They say the lady is fair; 'tis a truth,
I can bear them witness; and virtuous;
'tis so, I cannot reprove it
[Benedick, Much Ado About Nothing, II, 3]

A married man! that's most intolerable.
[Earl of Warwick, Henry VI Part I, V, 4]

Yet hasty marriage seldom proveth well.
[Richard III, Henry VI Part III, IV, 1]

Is the single man therefore blessed?
No; as a wall'd town is more worthier than a
village, so is the forehead of a married man
more honourable than the bare brow of a
bachelor
[Touchstone, As You Like It, III, 3]

Many a good hanging prevents a bad
marriage
[Feste, Twelfth Night, I, 5]

By this marriage, All little jealousies, which
now seem great,
And all great fears, which now import their
dangers,
Would then be nothing
[Agrippa, Antony and Cleopatra, II, 2]

I may chance have some odd quirks and
remnants of wit broken on me, because I
have railed so long against marriage: but doth
not the appetite alter? a man loves the meat
in his youth that he cannot endure in his age.
[Benedick, Much Ado About Nothing, II, 3]

They are in the very wrath of love, and they
will together. Clubs cannot part them.
[Rosalind, As You Like It, V, 2]

Speak low, if you speak love.
[Don Pedro, Much Ado About Nothing, II, 1]

Data-intensive Humanities

Nature 474, 436-440 (2011) | doi:10.1038/474436a
"[There is] a monolithic conception of social space, according to which
it would suffice to have the right information to make the right decisions.
But in point of fact, information itself is far from homogenous and no
purely quantitative approach is satisfying. Having ever greater amounts
of information at our fingertips not only does not make us more
virtuous, as Rousseau already predicted, but it does not even make us
more knowledgeable."
[Tzvetan Todorov, In Defence of the Enlightenment, 2009]

The simple challenge then...

... to transform the Humanities by promoting shared digital services,
facilities, resources and tools, without destroying the justification and
arguments for the Humanities for the Humanities' sake, and thus
accidentally contributing to the decline and eventual destruction of
civilization

The 'take-home messages'
● in the era of the data deluge, web science and digital scholarship,
we need to rethink the case for the corpus today, and the case
for doing annotation
● we need an ecosystem, not separate 'fishtanks'
● annotation risks more fragmentation
● we need to follow the physical sciences in deciding priorities &
adopting standards, reducing complexity and variety, to promote
shared facilities and infrastructures
● but, at the same time, we need to avoid arguments for scientism
and instrumentalism, and to defend the humanities

Annotated Corpora for Research in the Humanities

Hierarchy of management that covers different levels of managementmkooblal
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 

Recently uploaded (20)

Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 

Annotated Corpora for Research in the Humanities

  • 1. Do we need annotated corpora in the era of the data deluge?
      Martin Wynne (martin.wynne@oucs.ox.ac.uk, researchsupport@oucs.ox.ac.uk)
      Oxford e-Research Centre & IT Services (formerly OUCS) & Faculty of Linguistics, Philology and Phonetics, University of Oxford
      ACRH2, Lisbon, Thursday 29th November 2012
  • 3. Problems with annotation
      It can:
      • lead to circular reasoning
      • be incorrect
      • be inconsistent
      • follow a particular theory
      • have a specific level of granularity
      • use a particular tag-set
      • introduce subjective interpretations
  • 5. The case for the corpus today (against “the web as corpus”)
      • The spoken corpus: spoken, and other non-computer-mediated data
      • The historical corpus: pre-internet data (beyond books)
      • The specialised corpus: with integrity, provenance and controlled sampling and representativeness
      • The annotated corpus: adding and sharing linguistic annotation
      • The web corpus: filtering and organising the data deluge (aka "the web for corpus")
  • 6. The case for the corpus today (against “the web as corpus”)
      But we do need to go beyond the finite text corpus:
      • speech
      • video
      • the language of the internet: new genres, new media, new modes
      • capturing the context, especially other data streams
      • engaging with the non-finite corpora (aka "the web as corpus")
  • 9. Image by James Cridland from Flickr. Some rights reserved.
  • 10. Annotation - Why?
      • To perform identification, categorization and analysis of features of the text
      • It enables certain types of search and analysis, especially beyond the word form (e.g. “search for all inflected forms of cause as a verb”)
      • It can be the foundation for further automatic analysis of a corpus (e.g. POS tags can be used for parsing)
      • Preserving the analysis, enabling replicability of research, and reusability of the annotated corpus
  • 11. Annotation: less than the text?
      “Annotation of a text is a procedure which loses information. There is no point in arguing that the information is in the computer's memory somewhere - annotation is the substitution of a general category for a specific item, and with respect to that area of the classification, the item has lost its uniqueness.” (John Sinclair, personal communication, 2001)
  • 12. Annotation: how?
      • Annotations should be separable
      • Detailed and explicit documentation should be provided
      • Annotation practices should be linguistically consensual
      • Annotation should observe standards
      (Leech 2005) http://www.ota.ox.ac.uk/documents/creating/dlc/
  • 13. Annotation standards?
      Use of standards can help to ensure successful:
      • interpretation,
      • interchange,
      • preservation,
      • incorporation into other resources,
      • processing by generic software.
      It is also a way of resolving tricky encoding decisions, and of justifying and documenting your decisions.
  • 14. Potential problems with annotation
      1. Annotation is liable to be subjective and inconsistent
      2. Annotation is sometimes intellectual and painstaking, sometimes trivial and automatic
      3. Annotation leads to digital silos
      4. Annotation makes building a shared services infrastructure difficult
  • 15. Interoperability and sustainability for digital textual scholarship
      Well-known problems with digital resources in the humanities:
      • fragmentation of communities, resources, tools;
      • lack of connectedness and interoperability;
      • sustainability of online services;
      • lack of deployment of tools as reliable and available services.
      There is a potential solution in distributed, federated infrastructure services.
  • 18. The CLARIN Vision
      A researcher in Darmstadt, from his desktop computer, can do a single sign-on, with local authentication, and then:
      • search for, find and obtain authorization to use corpora in Oxford, Prague and Berlin
      • select the precise dataset to work on, and save that selection
      • run semantic analysis tools from Budapest and statistical tools from Tübingen over the dataset
      • use computational power from the local, national or other computing centre where necessary
      • obtain advice and support for carrying out all technical and methodological procedures
      • save the workflow and results of the analysis, and share those results with collaborators in Paris, Vienna and Zagreb
      • discuss and iteratively adapt and re-run the analyses with collaborators
  • 20. Silos or fishtanks??
      Let's talk about fishtanks rather than silos... There are lots of fishtanks out there, some very elaborate, big, pretty... But they're all in different places and unconnected. And if I want to keep a fish I have to build a fishtank (or put it in yours)... And who's going to carry on feeding the fish? Let's not all make our own fishtanks.
  • 21. Wouldn't it be better to have an ecosystem where we can all set our fishes free? You can access all of the riches of the deep, and it's a lot easier to get into fish research.
  • 25. CLARIN (http://www.clarin.eu/)
      Infrastructure services for research in the humanities and social sciences using language resources and tools. Services to include:
      • Access and identity federation
      • Network of service centres
      • Concept and component metadata registries
      • Federated resource discovery
      • Federated search across resources
      • SOA for connecting tools
      • PID services

      Bamboo (http://www.project-bamboo.org/)
      Project Bamboo is building applications and shared infrastructure for humanities research, principally:
      • Research environments for humanities scholars
      • Infrastructure allowing librarians and technologists to support humanities scholarship
      • Evolution of shared applications for the curation and exploration of widely distributed content collections
      • Build a community for uptake, expansion and sustainability

      DARIAH (http://www.dariah.eu)
      Enhance and support digitally-enabled research across the humanities and arts. DARIAH is working with communities of practice to:
      • Explore and apply ICT-based methods and tools
      • Improve research opportunities and outcomes through linking distributed digital source materials of many kinds
      • Exchange knowledge, expertise, methodologies and practices across domains and disciplines
  • 31. Player One (a man), Player Two (a woman)
      [Enter two players]
      • What news, Borachio? [Don John, Much Ado About Nothing, I, 3]
      • I came yonder from a great supper: I can give you intelligence of an intended marriage. [Borachio, Much Ado About Nothing, I, 3]
      • They say the lady is fair; 'tis a truth, I can bear them witness; and virtuous; 'tis so, I cannot reprove it [Benedick, Much Ado About Nothing, II, 3]
      • A married man! that's most intolerable. [Earl of Warwick, Henry VI Part I, V, 4]
      • Yet hasty marriage seldom proveth well. [Richard III, Henry VI Part III, IV, 1]
      • Is the single man therefore blessed? No; as a wall'd town is more worthier than a village, so is the forehead of a married man more honourable than the bare brow of a bachelor [Touchstone, As You Like It, III, 3]
      • Many a good hanging prevents a bad marriage [Feste, Twelfth Night, I, 5]
      • By this marriage, All little jealousies, which now seem great, And all great fears, which now import their dangers, Would then be nothing [Agrippa, Antony and Cleopatra, II, 2]
      • I may chance have some odd quirks and remnants of wit broken on me, because I have railed so long against marriage: but doth not the appetite alter? a man loves the meat in his youth that he cannot endure in his age. [Benedick, Much Ado About Nothing, II, 3]
      • They are in the very wrath of love, and they will together. Clubs cannot part them. [Rosalind, As You Like It, V, 2]
      • Speak low, if you speak love. [Don Pedro, Much Ado About Nothing, II, 1]
  • 38. Nature 474, 436-440 (2011) | doi:10.1038/474436a
  • 41. "[There is] a monolithic conception of social space, according to which it would suffice to have the right information to make the right decisions. But in point of fact, information itself is far from homogenous and no purely quantitative approach is satisfying. Having ever greater amounts of information at our fingertips not only does not make us more virtuous, as Rousseau already predicted, but it does not even make us more knowledgeable." [Tzvetan Todorov, In Defence of the Enlightenment, 2009]
  • 42. The simple challenge then...
      ... to transform the Humanities by promoting shared digital services, facilities, resources and tools, without destroying the justification and arguments for the Humanities for the Humanities' sake, and thus accidentally contributing to the decline and eventual destruction of civilization
  • 44. The 'take-home messages'
      • in the era of the data deluge, web science and digital scholarship, we need to rethink the case for the corpus today, and the case for doing annotation
      • we need an ecosystem, not separate 'fishtanks'
      • annotation risks more fragmentation
      • we need to follow the physical sciences in deciding priorities & adopting standards, reducing complexity and variety, to promote shared facilities and infrastructures
      • but, at the same time, we need to avoid arguments for scientism and instrumentalism, and to defend the humanities
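
Illustrative sketch (not part of the original slides): slide 10 gives “search for all inflected forms of cause as a verb” as a search that annotation enables, and slide 12 recommends that annotations be separable from the text. The Python below is a minimal sketch of both ideas under stated assumptions. The sample sentence, the `Annotation` record, the `find_forms` helper and the tag names `VERB`/`NOUN` are all hypothetical and do not come from any particular corpus, tag-set or tool.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One stand-off annotation record (hypothetical layout)."""
    index: int  # position of the annotated token in TOKENS
    lemma: str  # dictionary form of the word
    pos: str    # part-of-speech tag from an assumed, simplified tag-set


# The text itself is stored untouched ...
TOKENS = ["Smoking", "causes", "cancer", ",", "but", "the", "cause",
          "was", "disputed", ";", "what", "caused", "it", "?"]

# ... and the annotation lives in a separate layer that points into it,
# so it can be replaced or removed without altering the text.
ANNOTATIONS = [
    Annotation(1, "cause", "VERB"),
    Annotation(6, "cause", "NOUN"),
    Annotation(11, "cause", "VERB"),
]


def find_forms(tokens, annotations, lemma, pos):
    """Return the word forms annotated with the given lemma and POS tag."""
    return [tokens[a.index] for a in annotations
            if a.lemma == lemma and a.pos == pos]


# All inflected forms of "cause" used as a verb; the noun is excluded.
print(find_forms(TOKENS, ANNOTATIONS, "cause", "VERB"))
```

A plain string search for "cause" would also return the noun and would miss nothing only by luck of spelling; the lemma and POS layers are what make the verb-only query possible, which is also why the result depends on the tag-set and the annotator's decisions.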