Towards greater transparency in digital literary analysis
John Lavagnino, King’s College London
8 May 2014
Slides at http://goo.gl/dPGhPw
1 General reasons for doing digital
analysis, and some present-day trends
2 A recent study that went badly wrong
3 Open and closed techniques
4 Open and closed data
Things not in the plan
Lots of things that aren’t analysis are:
1 publication and rediscovery (as by the
Women Writers Project, Northeastern University)
2 discussion, argument, interaction
3 studies of digital culture
Why people do this
Above all, because you can: a byproduct of the
web and the widespread use of computers is a
wealth of textual data. Without books in
transcribed form much less would happen.
Yes, you can always transcribe some new stuff
yourself, but then you immediately need time
and money before doing anything at all.
You can also work with small amounts of
text, but it tends to get less notice.
What’s harder to do
Texts not in English are less widely available in
digital form and so get analyzed less.
Texts much later than the nineteenth century
are in copyright.
Texts before the nineteenth century pose OCR
problems and have more variable spelling.
It’s not an accident that there are so many
digital studies of nineteenth-century novels.
Why it’s worth doing
When there‟s too much to read
When a different kind of attention is
valuable (more systematic? or just very
different from normal reading?)
When it can locate or arrange material as the
basis for more traditional approaches
Matjaž Perc, “Evolution of the most common
English words and phrases over the
centuries”, Journal of the Royal Society
Interface, 7 December 2012.
Based on Google ngram data.
A surprising claim about English
Perc, in his abstract: “We find that the most
common words and phrases in any given
year had a much shorter popularity
lifespan in the sixteenth century than they
had in the twentieth century.”
Top 3-grams, 2007 and 2008
Top 3-grams, early 1520s
(Note that the 3-grams are case-sensitive.)
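The counting behind tables like these is simple to sketch. A minimal, illustrative version in Python, assuming whitespace tokenization over a toy string (the Google pipeline works at corpus scale with its own tokenizer); like the Google data, it is case-sensitive because it never lowercases:

```python
from collections import Counter

def top_trigrams(text, n=10):
    # Case-sensitive word 3-grams over a whitespace-tokenized string
    words = text.split()
    grams = zip(words, words[1:], words[2:])
    return Counter(" ".join(g) for g in grams).most_common(n)

print(top_trigrams("the cat sat on the mat and the cat sat still"))
# "the cat sat" occurs twice; every other 3-gram once
```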
From 1541’s top 3-grams
Birthdate of Sir Thomas Bodley: 2 March 1545
Some alternative conclusions
about this research
The world’s best mass OCR is bad for books
You should read what the providers of your
data say about it: Steven Levitt does
Interdisciplinary journals need to have
reviewers from many fields
Real 1520 trigrams
Perc’s data set contains no true 1520
imprints: his 1520 book is An Open Letter
to the Christian Nobility of the German
Nation, an early-twentieth-century
translation of a book by Martin Luther
published in German in 1520.
Perc’s publication of his data and an
interface for exploring it is praiseworthy:
this study is very transparent. It’s not just
that the Google data is readily available:
Perc constructed his own tables of the top
ngrams year-by-year and published them.
Some very rough numbers for 1520
STC titles published in 1520: 114
In English: 47
(And figures for both 1519 and 1521 are
considerably smaller, because 1520
includes many items dated c.1520.)
Limitations of knowledge
The kind of naïve statistical study Perc
performed assumes an entirely reliable
and consistent data set. The Google ngram
data isn’t like that, but while it can be done
far better, a data set of that kind for
early-sixteenth-century English is not even possible.
When is language unusual?
A man fires an arrow at a Neanderthal in William
Golding’s novel The Inheritors:
A stick rose upright and there was a lump of bone
in the middle. Lok peered at the stick and the
lump of bone and the small eyes in the bone
things over the face. Suddenly Lok understood
that the man was holding the stick out to him but
neither he nor Lok could reach across the river.
He would have laughed if it were not for the echo
of the screaming in his head. The stick began to
grow shorter at both ends. Then it shot out to
full length again.
An obvious but useful method
David Hoover, “The End of the Irrelevant Text:
Electronic Texts, Linguistics, and Literary
Theory”, Digital Humanities Quarterly 1:2
(2007), used Google to find other instances of
the oxymoronic phrase “grew shorter”.
When referring to physical objects (and not
lectures, distances, patience, …) it’s not about
sticks, it’s about fuses, candles, cigarettes…
(in use), and articles of clothing, hair… (over time).
Hoover: “Part of the power of ‘the stick
began to grow shorter at both ends’ is in
the shape of Lok’s incomprehension. For
Lok, the whole world is alive, so that a
stick that changes length is perfectly …”
Problems of technique
What forms do you look for? Hoover’s
investigation looked both at the words
Golding used and at the concept of objects that change length.
Searches can give very different results with
slight differences in query.
It really is true
Geoffrey Pullum, “The sparseness of
linguistic data”, Language Log, 7 April
2014: “it really is true that the probability
for most grammatical sequences of words
actually having turned up on the web
really is approximately zero, so
grammaticality cannot possibly be reduced
to probability of having actually occurred.”
Complex techniques: PCA
Larry L. Stewart, “Charles Brockden Brown:
Quantitative Analysis and Literary
Interpretation”, Literary and Linguistic
Computing, June 2003: among other
things, a study of Brown’s novels Wieland
and Carwin, and the distinctiveness of the
narrating voices of Clara and Carwin.
What is that graph based on?
PCA, or Principal Component
Analysis, takes as input numerous textual
features you choose, and tries to create
“components” that capture as much of the
variation in the texts as possible: reducing
the dozens of dimensions needed to show
all these things down to two that roll
together a lot of what’s going on (about
half of it, in this case).
This reduction is automatic, and is not really
a statistical analysis, only a rearrangement
of the data. But it does show us groupings
of the chapters based on part of the actual
data, with Clara’s narration in Wieland
having more exclamation points and
dashes and fewer instances of “our”;
combining these into one feature makes it
easier to see.
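The mechanics can be shown in miniature. This is a sketch of PCA via singular value decomposition, not Stewart’s actual analysis: the feature matrix is hypothetical (a few invented counts of exclamation points, dashes, and “our” per chapter, standing in for the dozens of features he used):

```python
import numpy as np

# Hypothetical feature matrix: rows = chapters, columns = counts of
# exclamation points, dashes, and the word "our" (invented numbers).
X = np.array([
    [12, 9, 1],   # chapters like Clara's narration: more ! and dashes
    [14, 8, 2],
    [11, 10, 1],
    [3, 2, 7],    # chapters like Carwin's narration: more "our"
    [2, 3, 8],
    [4, 2, 6],
], dtype=float)

# PCA via SVD of the mean-centered matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T          # each chapter as a point in 2D
explained = s**2 / (s**2).sum() # share of variance per component

print(scores.round(2))
print(explained.round(2))
```

In this toy case the first component alone separates the two groups of chapters, which is what a published PCA scatterplot is showing: a weighted combination of features, not any single one.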
Can we get back to the text?
Yes, in that Stewart tells us what goes into
the first principal component (though not the second).
No, in that he doesn‟t show any passages
and analyze them in these terms.
And no, in that a component is a complex
weighted combination of parts of features.
Graphs need analysis
It is still common to treat graphs and other
visualizations as results, not as texts that
themselves need interpretation. Yet they‟re
only of interest if they support substantial
discussion and analysis, and that ought to
appear in the article. Stewart has a
literary-critical discussion of the novels in
light of this analysis: but why not a few
pages first on the graph?
Graphs need interaction
You publish one or two or six graphs in an
article, not two hundred, because they take
up a lot of space. But if a graph’s worth
doing at all it’s worth doing
differently, and the best way to explore
this kind of study is to try out variations.
For all its flaws, this is one thing the Google
ngrams resource got right.
Big uncurated data
Ted Underwood, Michael L. Black, Loretta
Auvil, and Boris Capitanu, “Mapping
Mutable Genres in Structurally Complex
Volumes” (2013), at
http://arxiv.org/abs/1309.3323: the study
analyzes “a collection of 469,200 volumes
drawn from HathiTrust Digital Library”.
That‟s an open data collection provided by
libraries involved in Google Books.
How do you read 469,200 books?
You start by figuring out how to find the text
in them, by skipping things like bookplates
and tables of contents. (The bookplates are
a reason why Google Books and Google
ngrams studies of the word “library” run
into problems.) Without doing that first
you can’t go on to study (as they are) the
percentage of first-person novels over time.
But it’s not really transparent now
If you need to do that much to the books
before you can analyze them, others either
need to duplicate all of that preliminary
work or get the results of your preliminary work.
Much work on big data elsewhere is based
on data that is simpler in form than books
are, or has been prepared for use first (at some cost).
Curated rather than raw texts
These exist in the humanities, but not
necessarily where you want to work or in
the numbers you desire. Another C19-
novel study by Matthew Wilkens used
texts fixed up at Indiana University, with
fewer textual errors and clearly-defined
structure; but that meant he also had a lot
fewer of them.
Specially prepared data
Once it was more common for digital-
humanities work to involve creation of
new data for analysis: not just basic
texts, but also analysis or extraction of
features by hand as a basis for analysis.
For example, Brad Pasanek and D.
Sculley, “Mining millions of
metaphors”, Literary and Linguistic
Computing, September 2008.
See http://metaphors.lib.virginia.edu/ for
his Mind is a Metaphor
collection, assembled to support a study of
C18 thinking on the subject; a collection
based in the first instance on doing lots of
searches, extended over the course of
many years by several hands.
A little on how it’s done
Pasanek: “At present I still spend a fair amount of time
conducting proximity searches for two character strings. I search
one term from a set list ("mind," "heart," "soul," "thought,"
"idea," "imagination," "fancy," "reason," "passion," "head,"
"breast," "bosom," or "brain") against another word that I hope
will prove metaphorical. For example, I search for "mind" within
one hundred characters of "mint" and find the following couplet
in William Cowper's poetry:
“The mind and conduct mutually imprint
And stamp their image in each other's mint.””
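Pasanek describes these searches only in prose; one hedged way such a proximity search could work over plain text is sketched below, using the Cowper couplet from this slide as the test string. The function name and windowing logic are mine, not a description of his actual tools:

```python
import re

def proximity_hits(text, term_a, term_b, window=100):
    """Find places where term_a and term_b occur within `window`
    characters of each other (in either order)."""
    hits = []
    for m in re.finditer(re.escape(term_a), text, re.IGNORECASE):
        start = max(0, m.start() - window)
        end = m.end() + window
        if re.search(re.escape(term_b), text[start:end], re.IGNORECASE):
            hits.append(text[start:end])
    return hits

couplet = ("The mind and conduct mutually imprint / "
           "And stamp their image in each other's mint.")
print(proximity_hits(couplet, "mind", "mint"))
# one hit: "mint" falls within 100 characters of "mind"
```

As the slides note about technique, slight variations in such a query (window size, inflected forms, case handling) can give very different result sets.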
Creating data as a scholarly activity
The collection itself is a major effort (and not
everyone would have made it public in this
way prior to publishing their monograph).
Creation of this kind of resource is not yet
widely recognized as valuable scholarship:
the usual focus is on “uninterpreted” data.
And some data comes from sources that cannot
be made generally available (copyright restrictions).
Are we satisfied?
Over half the metaphors come from
searching Chadwyck-Healey collections of
texts; about a third from reading.
There’s transparency in that Pasanek
explains in detail how he assembled his
collection; but it would be a challenge to
assemble a rival corpus to compare with
this one. Such an effort shouldn’t really be
an individual one, but usually will be.
There’s a potential for openness in new
approaches but some challenges: new
forms of publication appropriate for new
kinds of work, balancing openness and
scholarly recognition, copyright.
We need to find out interesting things to
motivate the changes greater transparency requires.
Please contact me at
Slides: at http://goo.gl/dPGhPw