Reflections on Historical Data Repositories and Research Infrastructure
1. Working with Historical Data
Universität des Saarlandes
Friday 8th September 2017
Forty years of the Oxford Text Archive:
reflections on repositories, corpora,
and research infrastructure
Martin Wynne
Martin.wynne@bodleian.ox.ac.uk
Bodleian Libraries &
Faculty of Linguistics, Philology and Phonetics,
University of Oxford
National Coordinator, CLARIN-UK
5. "The emergence of fast and high capacity networks, a deluge of
data, and web service APIs mean that it is increasingly possible
to imagine and build distributed architectures for scholarly
services, where data, tools, computing resources, and the outputs
of annotation and analysis live in different parts of the network
but can be brought together virtually in the user’s desktop
environment."
http://blogs.it.ox.ac.uk/martinw/2012/04/06/silos-or-fishtanks/
15. Increasing availability
0) Texts non-digital and dispersed
1) Digital images on various sites
2) Full text
3) Many texts and images in one (virtual) place
4) Texts in a corpus!
16. But there’s still some way to go...
The ‘corpus’ is not complete for most research questions,
because:
● many texts not digitized yet
● different text types (letters, diaries, workbooks, etc.) found in
different repositories
● works outside the selection criteria (other date ranges, regions,
languages, etc.)
And there are few tools available for use on the corpus (let alone on the wider ecosystem of sources).
17. What are we aiming for?
Ways to combine close reading with big data approaches.
20. What do you do with a million books?
“There are only about 30,000 days in a human life -- at a book a
day, it would take 30 lifetimes to read a million books and our
research libraries contain more than ten times that number. Only
machines can read through the 400,000 books already publicly
available for free download from the Open Content Alliance.”
Gregory Crane, “What do you do with a million books?”
D-Lib Magazine, March 2006
21. And 5 million books?
We constructed a corpus of digitized texts containing about 4% of all books ever printed.
Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey
the vast terrain of “culturomics” focusing on linguistic and cultural phenomena that were
reflected in the English language between 1800 and 2000. We show how this approach
can provide insights about fields as diverse as lexicography, the evolution of grammar,
collective memory, the adoption of technology, the pursuit of fame, censorship, and
historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative
inquiry to a wide array of new phenomena spanning the social sciences and the
humanities.
www.sciencexpress.org / 16 December 2010
23. Distant reading: where distance, let me repeat
it, is a condition of knowledge: it allows you to
focus on units that are much smaller or much
larger than the text: devices, themes, tropes—
or genres and systems. And if, between the
very small and the very large, the text itself
disappears, well, it is one of those cases when
one can justifiably say, less is more. If we want
to understand the system in its entirety, we
must accept losing something. We always pay a
price for theoretical knowledge: reality is
infinitely rich; concepts are abstract, are poor.
But it’s precisely this ‘poverty’ that makes it
possible to handle them, and therefore to know.
This is why less is actually more.
Franco Moretti, “Conjectures on World
Literature” Distant Reading, 2013.
24. Matt Jockers,
University of Nebraska-Lincoln
Macroanalysis: Digital Methods
and Literary History (UIUC
Press, 2013)
29. Everything but the text…
Distant Reading has a long history, in
the Annales school, Book History, etc.
But it’s all about counting stuff, not
reading:
• After-death inventories
• Library holdings/circulation records
• Archives of publishers
• Vocabulary of titles
• Censorship records
Martin, Furet, Darnton, Chartier, etc…
[Thanks to Glenn Roe]
31. Back to ‘close reading’
"It is not easy to justify assertions about the alleged frequency or infrequency
of some particular belief or attitude in the past. How many examples does
one need to cite in order to prove the point? Lacking any satisfactory method
of quantifying these matters, all I can do is to record my impressions after
long immersion in the period."
Keith Thomas, The Ends of Life, Oxford University Press, 2010.
33. (At least) Two problems with the
digital revolution
1. Data is still not sufficiently available and connected
2. We don’t have the right tools yet for hermeneutically-informed
exploration and analysis (in distributed environments)
35. Distributed virtual infrastructure:
potential advantages
● Potentially unlimited functionality, since developers can plug in content and tools that they want to use, and which can interoperate with other data, tools and infrastructure services, in complex workflows;
● federated resource discovery and content search (i.e. across collections in different repositories);
● ad hoc collections and virtual corpora;
● access to protected resources (e.g. works in copyright, sensitive data) curated in situ yet still analysed online via secure web applications.
36. Distributed virtual infrastructure:
potential disadvantages
Complications:
● federated identity management;
● persistent identifiers;
● monitoring of usage and accounting;
● monitoring of the availability of services - it might be possible to test the status of individual components but not a complex workflow and the interactions between components;
● difficulties with the visibility, acknowledgement, citations, and recognition of certain services.
And because it’s complicated...
● scope creep: infrastructure projects tend to try to build complete ecosystems.
37. The CLARIN Vision
A researcher in the Saarland, from their desktop computer, will be able to:
log in at their local institution,
search for, find and obtain authorization to use resources in Oxford, Prague and
Berlin,
select the precise dataset to work on, and save that selection,
run semantic analysis tools from Budapest and statistical tools from Tübingen
over the dataset,
use computational power from local, national or other computing centres (if and
when necessary),
obtain advice and support for carrying out all technical and methodological
procedures,
save the workflow and results of the analysis in a citable form,
share the results with collaborators in Paris, Edinburgh and Zagreb,
discuss online with collaborators,
iteratively adapt and re-run the analyses.
39. The perils of interpretation, or,
why we need to think about methods
How do we interpret the results? We need to ask these questions:
● What's in my dataset? What's missing?
● What did the sampling procedure miss?
● What population of texts in the world can I make claims about by searching this dataset?
● What is the right tool for the job?
● Will I successfully retrieve all occurrences of the word forms which I am looking for?
● How can I make my search term more sophisticated?
● What claims can I make about the significance of the frequencies?
● How can I improve the process and refine the results?
● Which reference corpus do I need to make comparisons with?
● What do I need to go on to investigate further?
● How can I share my results and methods?
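The questions above about frequencies, significance, and reference corpora are often operationalised with a keyness statistic. A minimal sketch, assuming Dunning's log-likelihood (G2) as the measure and toy word counts standing in for real corpora:

```python
import math
from collections import Counter

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning's log-likelihood (G2) keyness statistic for one word,
    comparing its frequency in corpus A against reference corpus B."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    g2 = 0.0
    if freq_a > 0:
        g2 += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        g2 += freq_b * math.log(freq_b / expected_b)
    return 2 * g2

# Toy data standing in for a study corpus and a reference corpus
study = Counter("the plague spread through the city the plague".split())
reference = Counter("the city grew and the markets thrived in the city".split())
n_study, n_ref = sum(study.values()), sum(reference.values())

for word in study:
    g2 = log_likelihood(study[word], n_study, reference.get(word, 0), n_ref)
    print(f"{word}: G2 = {g2:.2f}")
```

A high G2 suggests a word's frequency in the study corpus differs from the reference more than chance would predict; G2 is approximately chi-squared distributed with one degree of freedom, so 3.84 is the conventional threshold for p < 0.05. Note that the statistic answers none of the sampling and population questions above: it only quantifies a difference between the two datasets actually in hand.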
40. The perils of interpretation (2)
Am I substituting data for analysis and judgement, so as to avoid discussing significance, meaning, values and merit?
41. In Defence of the Enlightenment
"[There is] a monolithic conception of social space, according to which it would
suffice to have the right information to make the right decisions. But in point of
fact, information itself is far from homogenous and no purely quantitative
approach is satisfying. Having ever greater amounts of information at our
fingertips not only does not make us more virtuous, as Rousseau already
predicted, but it does not even make us more knowledgeable."
[Tzvetan Todorov, In Defence of the Enlightenment, 2009]
42. Three problems with the digital revolution
1. Silos: data is still not sufficiently available and connected
2. Infrastructure: we don’t have the right software tools yet for
hermeneutically-informed exploration and analysis (in distributed
environments)
3. Methods: we don’t yet have an understanding of the best ways in
which digital research should become part of our toolkit
43. Some simple and practical next steps
1. Make metadata available at open and persistent URIs
2. Use common controlled vocabularies for some key fields, e.g.
people, dates, places.
3. Provide a linked data portal (where you can search for ‘Boyle’ and
find Royal Society Journal texts, works in EEBO, manuscript
images, ODNB entry, portrait images, library catalogue data, etc.)
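Steps 1 and 2 are what make step 3 tractable: once every repository describes people with the same persistent identifier, a cross-collection search for 'Boyle' reduces to matching that identifier. A minimal sketch in Python; all repository URIs and person identifiers below are invented for illustration (in practice one might use VIAF or a similar authority file):

```python
# Hypothetical metadata records from different repositories, each keyed by
# a persistent URI and using a shared controlled vocabulary for creators.
# Every URI and identifier here is a made-up placeholder.
BOYLE = "https://vocab.example/person/boyle"
MILTON = "https://vocab.example/person/milton"

records = [
    {"id": "https://ota.example/text/5001",
     "title": "The Sceptical Chymist",
     "creator": BOYLE},
    {"id": "https://eebo.example/image/A28944",
     "title": "New Experiments Physico-Mechanicall",
     "creator": BOYLE},
    {"id": "https://ota.example/text/6002",
     "title": "Paradise Lost",
     "creator": MILTON},
]

def find_by_creator(records, creator_uri):
    """Cross-repository discovery reduces to matching a shared identifier."""
    return [r["id"] for r in records if r["creator"] == creator_uri]

print(find_by_creator(records, BOYLE))
```

The design point is that no string matching on names ("Boyle" vs "Robert Boyle" vs "Boyle, R.") is needed once records from different silos agree on one identifier per person; the portal only has to aggregate records that share a URI.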