Working with Historical Data
Universität des Saarlandes
Friday 8th September 2017
Forty years of the Oxford Text Archive:
reflections on repositories, corpora,
and research infrastructure
Martin Wynne
Martin.wynne@bodleian.ox.ac.uk
Bodleian Libraries &
Faculty of Linguistics, Philology and Phonetics,
University of Oxford
National Coordinator, CLARIN-UK
Oxford Text Archive, 40 years on and in a new home
[ota slide / demo, inc ota-qa!]
"The emergence of fast and high capacity networks, a deluge of
data, and web service APIs mean that it is increasingly possible
to imagine and build distributed architectures for scholarly
services, where data, tools, computing resources, and the outputs
of annotation and analysis live in different parts of the network
but can be brought together virtually in the user’s desktop
environment."
http://blogs.it.ox.ac.uk/martinw/2012/04/06/silos-or-fishtanks/.
How far down this road have we travelled so far?
0. Non-digital and dispersed
1. Digital but dispersed
2. Full text
3. Rich repositories
4. Texts in a corpus
Increasing availability
0) Texts non-digital and dispersed
1) Digital images on various sites
2) Full text
3) Many texts and images in one (virtual) place
4) Texts in a corpus!
But there’s still some way to go...
The ‘corpus’ is not complete for most research questions,
because:
● many texts not digitized yet
● different text types (letters, diaries, workbooks, etc.) found in
different repositories
● works outside the selection criteria (other date ranges, regions,
languages, etc.)
And there are few tools available for use on the corpus
(let alone the wider ecosystem of sources)
What are we aiming for?
Ways to combine close reading with big data approaches.
Close Reading
What do you do with a million books?
“There are only about 30,000 days in a human life -- at a book a
day, it would take 30 lifetimes to read a million books and our
research libraries contain more than ten times that number. Only
machines can read through the 400,000 books already publicly
available for free download from the Open Content Alliance.”
Gregory Crane, “What do you do with a million books?”
D-Lib Magazine, March 2006
And 5 million books?
We constructed a corpus of digitized texts containing about 4% of all books ever printed.
Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey
the vast terrain of “culturomics” focusing on linguistic and cultural phenomena that were
reflected in the English language between 1800 and 2000. We show how this approach
can provide insights about fields as diverse as lexicography, the evolution of grammar,
collective memory, the adoption of technology, the pursuit of fame, censorship, and
historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative
inquiry to a wide array of new phenomena spanning the social sciences and the
humanities.
www.sciencexpress.org / 16 December 2010
Culturomics…
Distant reading: where distance, let me repeat
it, is a condition of knowledge: it allows you to
focus on units that are much smaller or much
larger than the text: devices, themes, tropes—
or genres and systems. And if, between the
very small and the very large, the text itself
disappears, well, it is one of those cases when
one can justifiably say, less is more. If we want
to understand the system in its entirety, we
must accept losing something. We always pay a
price for theoretical knowledge: reality is
infinitely rich; concepts are abstract, are poor.
But it’s precisely this ‘poverty’ that makes it
possible to handle them, and therefore to know.
This is why less is actually more.
Franco Moretti, “Conjectures on World Literature”,
in Distant Reading, 2013.
Matt Jockers,
University of Nebraska-Lincoln
Macroanalysis: Digital Methods
and Literary History (UIUC
Press, 2013)
Matt Jockers, Macroanalysis (2013)
Simon Raper, “Graphing the history of philosophy”
Everything but the text…
Distant Reading has a long history, in
the Annales school, Book History, etc.
But it’s all about counting stuff, not
reading:
• After-death inventories
• Library holdings/circulation records
• Archives of publishers
• Vocabulary of titles
• Censorship records
Martin, Furet, Darnton, Chartier, etc…
[Thanks to Glenn Roe]
Robert Darnton, The
Forbidden Best-Sellers of
Pre-Revolutionary France
(New York, 1995), 189.
Back to ‘close reading’
"It is not easy to justify assertions about the alleged frequency or infrequency
of some particular belief or attitude in the past. How many examples does
one need to cite in order to prove the point? Lacking any satisfactory method
of quantifying these matters, all I can do is to record my impressions after
long immersion in the period."
Keith Thomas, The Ends of Life, Oxford University Press, 2010.
Close, distant, and scalable reading
DATA: digitally assisted text analysis
Martin Mueller, Northwestern
(At least) Two problems with the
digital revolution
1. Data is still not yet sufficiently available and connected
2. We don’t have the right tools yet for hermeneutically-informed
exploration and analysis (in distributed environments)
Distributed virtual infrastructure:
potential advantages
● potentially unlimited functionality, since developers can plug in
content and tools that they want to use, and which can interoperate
with other data, tools and infrastructure services, in complex
workflows;
● federated resource discovery and content search (i.e. across
collections in different repositories);
● ad hoc collections and virtual corpora;
● access to protected resources (e.g. works in copyright, sensitive
data) curated in situ yet still analysed online via secure web
applications.
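As a rough illustration of the second and third points above, federated content search and an ad hoc virtual corpus can be sketched in a few lines of Python. This is only a toy under stated assumptions: the three repositories are in-memory dictionaries with invented identifiers and invented text, and the function names (`search_repository`, `federated_search`, `virtual_corpus`) are hypothetical. In a real infrastructure each search would be a remote call to a separately hosted service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    repository: str  # which (hypothetical) repository holds the text
    text_id: str     # identifier of the text inside that repository
    snippet: str     # the matching passage


# Invented stand-ins for three separately hosted collections.
REPOSITORIES = {
    "oxford": {"ota-001": "The air pump was shown to the Society."},
    "prague": {"pr-17": "A letter concerning the weather and the harvest."},
    "berlin": {"be-42": "Experiments with the air pump continued."},
}


def search_repository(name, texts, term):
    """Search one repository (a remote call in any real deployment)."""
    term = term.lower()
    return [Hit(name, text_id, body)
            for text_id, body in texts.items()
            if term in body.lower()]


def federated_search(term):
    """Send the same query to every repository and merge the hits."""
    hits = []
    for name, texts in REPOSITORIES.items():
        hits.extend(search_repository(name, texts, term))
    return sorted(hits, key=lambda h: (h.repository, h.text_id))


def virtual_corpus(hits):
    """An ad hoc collection: references only, the texts stay in place."""
    return [(h.repository, h.text_id) for h in hits]


hits = federated_search("air pump")
corpus = virtual_corpus(hits)
print(corpus)
```

The virtual corpus here is deliberately nothing more than a saved list of references, which is what allows the texts themselves to remain curated in situ.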
Distributed virtual infrastructure:
potential disadvantages
Complications:
● federated identity management;
● persistent identifiers;
● monitoring of usage and accounting;
● monitoring of the availability of services - it might be possible to test the
status of individual components but not a complex workflow and the
interactions between components;
● difficulties with the visibility, acknowledgement, citations, and recognition of
certain services.
And because it’s complicated...
● scope creep: infrastructure projects tend to try to build complete ecosystems.
The CLARIN Vision
A researcher in the Saarland, from their desktop computer, will be able to:
● log in at their local institution,
● search for, find and obtain authorization to use resources in Oxford, Prague and
Berlin,
● select the precise dataset to work on, and save that selection,
● run semantic analysis tools from Budapest and statistical tools from Tübingen
over the dataset,
● use computational power from local, national or other computing centres (if and
when necessary),
● obtain advice and support for carrying out all technical and methodological
procedures,
● save the workflow and results of the analysis in a citable form,
● share the results with collaborators in Paris, Edinburgh and Zagreb,
● discuss online with collaborators,
● iteratively adapt and re-run the analyses.
How do we interpret the results? We need to ask these questions:
● What's in my dataset? What's missing?
● What did the sampling procedure miss?
● What population of texts in the world can I make claims about by searching this
dataset?
● What is the right tool for the job?
● Will I successfully retrieve all occurrences of the word forms which I am
looking for?
● How can I make my search term more sophisticated?
● What claims can I make about the significance of the frequencies?
● How can I improve the process and refine the results?
● Which reference corpus do I need to make comparisons with?
● What do I need to go on to investigate further?
● How can I share my results and methods?
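One common way to approach the questions above about the significance of frequencies and the choice of reference corpus is the log-likelihood (G2) statistic widely used in corpus linguistics. The sketch below is not taken from the talk; it is a minimal Python version of the two-corpus form of that statistic, and the counts in the example are invented purely for illustration.

```python
import math


def per_million(freq, size):
    """Normalised frequency, so corpora of different sizes can be compared."""
    return freq * 1_000_000 / size


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """G2 for one word form in a study corpus versus a reference corpus."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected counts if the word were equally frequent in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented counts: a word form in a 1M-token study corpus
# and in a 5M-token reference corpus.
g2 = log_likelihood(120, 1_000_000, 300, 5_000_000)
print(per_million(120, 1_000_000), per_million(300, 5_000_000), round(g2, 2))
```

A G2 value above 3.84 corresponds to p < 0.05 for one degree of freedom, but with large corpora almost any difference clears that threshold. The statistic says nothing about what the sampling procedure missed or whether the reference corpus is an appropriate comparison, so the other questions in the list still have to be answered by the researcher.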
The perils of interpretation, or,
why we need to think about methods
Am I substituting data for analysis and judgement, and using it to avoid
discussing significance, meaning, values and merit?
The perils of interpretation (2)
In Defence of the Enlightenment
"[There is] a monolithic conception of social space, according to which it would
suffice to have the right information to make the right decisions. But in point of
fact, information itself is far from homogenous and no purely quantitative
approach is satisfying. Having ever greater amounts of information at our
fingertips not only does not make us more virtuous, as Rousseau already
predicted, but it does not even make us more knowledgeable."
[Tzvetan Todorov, In Defence of the Enlightenment, 2009]
Three problems with the digital revolution
1. Silos: data is still not yet sufficiently available and connected
2. Infrastructure: we don’t have the right software tools yet for
hermeneutically-informed exploration and analysis (in distributed
environments)
3. Methods: we don’t yet have an understanding of the best ways in
which digital research should become part of our toolkit
Some simple and practical next steps
1. Make metadata available at open and persistent URIs
2. Use common controlled vocabularies for some key fields, e.g.
people, dates, places.
3. Provide a linked data portal (where you can search for ‘Boyle’ and
find Royal Society Journal texts, works in EEBO, manuscript
images, ODNB entry, portrait images, library catalogue data, etc.)
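The three steps above can be sketched as a toy Python example: metadata records from different collections each carry their own URI (step 1), they all refer to a person through the same controlled identifier (step 2), and a single lookup then gathers everything about that person across collections (step 3). Everything here is an assumption for illustration: the `example.org` URIs, the collection names, the titles and the `find_by_name` function are all hypothetical, and a real portal would more likely use RDF and an established authority file.

```python
from collections import defaultdict

# Placeholder identifiers from a hypothetical shared authority file.
PERSON_BOYLE = "https://example.org/authority/person/boyle"
PERSON_OTHER = "https://example.org/authority/person/someone-else"

# Invented records, each published at its own (placeholder) URI.
RECORDS = [
    {"uri": "https://example.org/journal/item/1",
     "collection": "journal texts", "title": "A journal article",
     "person": PERSON_BOYLE},
    {"uri": "https://example.org/books/item/77",
     "collection": "printed books", "title": "A printed treatise",
     "person": PERSON_BOYLE},
    {"uri": "https://example.org/images/item/5",
     "collection": "portrait images", "title": "A portrait",
     "person": PERSON_BOYLE},
    {"uri": "https://example.org/books/item/78",
     "collection": "printed books", "title": "Another treatise",
     "person": PERSON_OTHER},
]

# Name index of the authority file: search label -> controlled identifier.
NAME_INDEX = {"boyle": PERSON_BOYLE}


def find_by_name(name):
    """Resolve a name to its identifier, then group matching records."""
    person_uri = NAME_INDEX.get(name.lower())
    if person_uri is None:
        return {}
    grouped = defaultdict(list)
    for record in RECORDS:
        if record["person"] == person_uri:
            grouped[record["collection"]].append(record["uri"])
    return dict(grouped)


result = find_by_name("Boyle")
for collection, uris in result.items():
    print(collection, uris)
```

The lookup works only because every collection uses the same identifier for the person. Dates and places could be linked in the same way if they were drawn from common vocabularies.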
Links
http://ota.ox.ac.uk/
http://www.e-enlightenment.com/
http://digital.bodleian.ox.ac.uk/
http://www.clarin.eu/
https://cqpweb.lancs.ac.uk/
https://scalablereading.northwestern.edu/

GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 

Reflections on Historical Data Repositories and Research Infrastructure

  • 1. Working with Historical Data Universität des Saarlandes Friday 8th September 2017 Forty years of the Oxford Text Archive: reflections on repositories, corpora, and research infrastructure Martin Wynne Martin.wynne@bodleian.ox.ac.uk Bodleian Libraries & Faculty of Linguistics, Philology and Phonetics, University of Oxford National Coordinator, CLARIN-UK
  • 2. Oxford Text Archive, 40 years on and in a new home
  • 3. [ota slide / demo, inc ota-qa!]
  • 5. "The emergence of fast and high capacity networks, a deluge of data, and web service APIs mean that it is increasingly possible to imagine and build distributed architectures for scholarly services, where data, tools, computing resources, and the outputs of annotation and analysis live in different parts of the network but can be brought together virtually in the user’s desktop environment." http://blogs.it.ox.ac.uk/martinw/2012/04/06/silos-or-fishtanks/.
  • 6. How far down this road have we travelled so far?
  • 8. 1. Digital but dispersed
  • 14. 4. Texts in a corpus
  • 15. Increasing availability 0) Texts non-digital and dispersed 1) Digital images on various sites 2) Full text 3) Many texts and images in one (virtual) place 4) Texts in a corpus!
  • 16. But there’s still some way to go... The ‘corpus’ is not complete for most research questions, because: ● many texts not digitized yet ● different text types (letters, diaries, workbooks, etc.) found in different repositories ● works outside the selection criteria (other date ranges, regions, languages, etc.) And, there are few tools available for using on the corpus (let alone the wider ecosystem of sources)
  • 17. What are we aiming for? Ways to combine close reading with big data approaches.
  • 20. What do you do with a million books? “There are only about 30,000 days in a human life -- at a book a day, it would take 30 lifetimes to read a million books and our research libraries contain more than ten times that number. Only machines can read through the 400,000 books already publicly available for free download from the Open Content Alliance.” Gregory Crane, “What do you do with a million books?” D-Lib Magazine, March 2006
  • 21. And 5 million books? We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of “culturomics” focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities. www.sciencexpress.org / 16 December 2010
  • 23. Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes— or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this ‘poverty’ that makes it possible to handle them, and therefore to know. This is why less is actually more. Franco Moretti, “Conjectures on World Literature” Distant Reading, 2013.
  • 24. Matt Jockers, University of Nebraska-Lincoln Macroanalysis: Digital Methods and Literary History (UIUC Press, 2013)
  • 27. Simon Raper, “Graphing the history of philosophy”
  • 29. Everything but the text… Distant Reading has a long history, in the Annales school, Book History, etc. But it’s all about counting stuff, not reading: • After-death inventories • Library holdings/circulation records • Archives of publishers • Vocabulary of titles • Censorship records Martin, Furet, Darnton, Chartier, etc… [Thanks to Glenn Roe]
  • 30. Robert Darnton, The Forbidden Best-Sellers of Pre-Revolutionary France (New York, 1995), 189.
  • 31. Back to ‘close reading’ "It is not easy to justify assertions about the alleged frequency or infrequency of some particular belief or attitude in the past. How many examples does one need to cite in order to prove the point? Lacking any satisfactory method of quantifying these matters, all I can do is to record my impressions after long immersion in the period." Keith Thomas, The Ends of Life, Oxford University Press, 2010.
  • 33. (At least) Two problems with the digital revolution 1. Data is still not yet sufficiently available and connected 2. We don’t have the right tools yet for hermeneutically-informed exploration and analysis (in distributed environments)
  • 35. Distributed virtual infrastructure: potential advantages ● Potentially unlimited functionality, since developers can plug in the content and tools that they want to use, which can interoperate with other data, tools and infrastructure services in complex workflows; ● federated resource discovery and content search (i.e. across collections in different repositories); ● ad hoc collections and virtual corpora; ● access to protected resources (e.g. works in copyright, sensitive data) curated in situ yet still analysed online via secure web applications.
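The idea of federated content search on slide 35 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two "repositories" are hypothetical in-memory dictionaries with invented identifiers and texts, standing in for remote services, and the matching is plain substring search rather than any real search protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    repository: str
    text_id: str
    snippet: str


# Hypothetical stand-ins for remote repositories (names, ids and texts are invented).
REPOSITORIES = {
    "oxford": {
        "ota-001": "The air pump was shown to the Society.",
        "ota-002": "A letter on chymistry.",
    },
    "prague": {
        "pr-17": "Notes on the air and its spring.",
    },
}


def search_repository(name, texts, term):
    """Case-insensitive substring search within one repository."""
    term_lower = term.lower()
    hits = []
    for text_id, body in texts.items():
        pos = body.lower().find(term_lower)
        if pos != -1:
            start = max(0, pos - 15)
            end = pos + len(term) + 15
            hits.append(Hit(name, text_id, body[start:end]))
    return hits


def federated_search(term, repositories=REPOSITORIES):
    """Send the same query to every repository and merge the results."""
    results = []
    for name, texts in repositories.items():
        results.extend(search_repository(name, texts, term))
    return sorted(results, key=lambda h: (h.repository, h.text_id))


for hit in federated_search("air"):
    print(hit.repository, hit.text_id, repr(hit.snippet))
```

The point of the sketch is the shape of the workflow: one query, several independently curated collections, one merged result list that still records where each hit came from.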
  • 36. Distributed virtual infrastructure: potential disadvantages Complications: ● federated identity management; ● persistent identifiers; ● monitoring of usage and accounting; ● monitoring of the availability of services - it might be possible to test the status of individual components but not a complex workflow and the interactions between components; ● difficulties with the visibility, acknowledgement, citations, and recognition of certain services. And because it’s complicated... ● scope creep: infrastructure projects tend to try to build complete ecosystems.
  • 37. The CLARIN Vision A researcher in the Saarland, from their desktop computer, will be able to: ● log in locally at their local institution, ● search for, find and obtain authorization to use resources in Oxford, Prague and Berlin, ● select the precise dataset to work on, and save that selection, ● run semantic analysis tools from Budapest and statistical tools from Tübingen over the dataset, ● use computational power from local, national or other computing centres (if and when necessary), ● obtain advice and support for carrying out all technical and methodological procedures, ● save the workflow and results of the analysis in a citable form, ● share the results with collaborators in Paris, Edinburgh and Zagreb, ● discuss online with collaborators, ● iteratively adapt and re-run the analyses.
  • 39. The perils of interpretation, or, why we need to think about methods. How do we interpret the results? We need to ask the questions: ● What's in my dataset? What's missing? ● What did the sampling procedure miss? ● What population of texts in the world can I make claims about by searching this dataset? ● What is the right tool for the job? ● Will I successfully retrieve all occurrences of the word forms which I am looking for? ● How can I make my search term more sophisticated? ● What claims can I make about the significance of the frequencies? ● How can I improve the process and refine the results? ● Which reference corpus do I need to make comparisons with? ● What do I need to go on to investigate further? ● How can I share my results and methods?
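The questions on slide 39 about the significance of frequencies and the choice of reference corpus can be made concrete with a small worked example. The sketch below uses the log-likelihood (G2) statistic, which is one measure commonly used in corpus linguistics for this comparison; the slides do not name a particular statistic, so this choice is an assumption, and the counts are invented for illustration.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """G2 for one word: observed counts in two corpora against the counts
    expected if the word were equally frequent in both."""
    total_size = size_study + size_ref
    total_freq = freq_study + freq_ref
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    g2 = 0.0
    for observed, expected in (
        (freq_study, expected_study),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented counts: a word occurs 50 times in a 1,000-word study corpus
# and 10 times in a 1,000-word reference corpus.
score = log_likelihood(50, 1000, 10, 1000)

# 3.84 is the chi-square critical value for p < 0.05 with 1 degree of freedom.
print(round(score, 2), "exceeds 3.84:", score > 3.84)
```

A high score only says that the difference is unlikely to be chance given these two samples. It does not answer the earlier questions on the slide about what the dataset contains, what the sampling missed, or whether the reference corpus is an appropriate comparison.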
  • 40. The perils of interpretation (2) Am I substituting data for analysis and judgement, and doing so to avoid discussing significance, meaning, values and merit?
  • 41. In Defence of the Enlightenment "[There is] a monolithic conception of social space, according to which it would suffice to have the right information to make the right decisions. But in point of fact, information itself is far from homogenous and no purely quantitative approach is satisfying. Having ever greater amounts of information at our fingertips not only does not make us more virtuous, as Rousseau already predicted, but it does not even make us more knowledgeable." [Tzvetan Todorov, In Defence of the Enlightenment, 2009]
  • 42. Three problems with the digital revolution 1. Silos: data is still not yet sufficiently available and connected 2. Infrastructure: we don’t have the right software tools yet for hermeneutically-informed exploration and analysis (in distributed environments) 3. Methods: we don’t yet have an understanding of the best ways in which digital research should become part of our toolkit
  • 43. Some simple and practical next steps 1. Make metadata available at open and persistent URIs 2. Use common controlled vocabularies for some key fields, e.g. people, dates, places. 3. Provide a linked data portal (where you can search for ‘Boyle’ and find Royal Society Journal texts, works in EEBO, manuscript images, ODNB entry, portrait images, library catalogue data, etc.)
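The three steps on slide 43 fit together in a simple way: if every source describes a person with the same persistent URI, a portal can gather records from all of them with a plain lookup. The following is a minimal sketch of that idea, not a description of any existing portal. The URIs use the placeholder domain example.org, and the records are illustrative stand-ins for the kinds of sources the slide lists.

```python
from collections import defaultdict

# Hypothetical persistent URIs; a real portal would take these from a
# shared authority file or controlled vocabulary.
BOYLE_URI = "https://example.org/person/robert-boyle"
OTHER_URI = "https://example.org/person/someone-else"

# Illustrative metadata records from different sources, all using the
# same URI in the "person" field.
RECORDS = [
    {"source": "EEBO", "kind": "printed work", "person": BOYLE_URI},
    {"source": "ODNB", "kind": "biography entry", "person": BOYLE_URI},
    {"source": "Royal Society", "kind": "journal text", "person": BOYLE_URI},
    {"source": "Library catalogue", "kind": "catalogue record", "person": OTHER_URI},
]


def build_index(records):
    """Group records from all sources by the person URI they share."""
    index = defaultdict(list)
    for record in records:
        index[record["person"]].append(record)
    return dict(index)


def find_by_person(index, person_uri):
    """Return every record about one person, whatever its source."""
    return sorted(index.get(person_uri, []), key=lambda r: r["source"])


index = build_index(RECORDS)
for record in find_by_person(index, BOYLE_URI):
    print(record["source"], "-", record["kind"])
```

The lookup works only because the sources agree on the identifier. If each repository spelled the name its own way, the same search would need fuzzy matching and manual checking, which is the case for using common controlled vocabularies for key fields.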