1. Are Digital Literary Studies even possible?
I do want to toss around
a question that I have
been thinking about for a
long time: Can you have
computational text
analysis and literary
criticism at the same
time?
(Ramsay 2012)
2. What am I looking for?
● Literary criticism disguised as text analysis
● Text analysis disguised as literary criticism
4. “The position of Digital Humanities as a discipline is very
peculiar, being at the same time a methodology and a
discipline in its own right, aimed at the creation of theories
and methods, tools and techniques that can be used for
research and inquiry.”
Definition of DH as an academic field
5. “The position of Statistics as a discipline is very peculiar,
being at the same time a methodology and a discipline in its
own right, aimed at the creation of theories and methods,
tools and techniques that can be used for research and
inquiry.”
(Franco Giusti, Introduzione alla statistica, p. 20)
Definition of DH as an academic field
6. Big Data and Statistics
“The growing digitization of our textual and literary heritage has convinced many
academics and observers of higher education that we are currently experiencing a
renaissance in the Humanities. Some scholars argue that this mass of data is
profoundly changing the methodological toolbox of a field whose scholarship
is traditionally based on close reading and interpretation of texts. Digitization has
rendered novels, plays, poems and historical texts open to forms of statistical
analysis and visualization methods previously unavailable to these
objects. As a result, this “digital turn” is creating a vivid debate within the
Humanities about the effects that the use of algorithms might have on the
interpretation, understanding and teaching of literature and history.”
(Digital Methods in Research – Textual Heritage and Literary Studies, March 27)
7. Jockers' Macroanalysis
This emerging field [...] was for a good many
decades not emerging at all [...] Technology has
certainly changed some things about the way
literary scholars go about their work, but until
recently change has been mostly at the level of
simple, even anecdotal, search. The humanities
computing/digital humanities revolution has now
begun, and big data have been a major catalyst.
The questions we may now ask were previously
inconceivable, and to answer these questions
requires a new methodology, a new way of
thinking about our object of study.
8. History of Statistics: 1600-1700
Girolamo Ghilini (1589-1668)
Ristretto della civile, politica, statistica e militare scienza (1666-68)
William Petty (1623-1687)
Several Essays in Political Arithmetick (1699)
Gottfried Achenwall (1719-1772)
Staatsverfassung der Europäischen Reiche im Grundrisse (1752)
9. History of Statistics: 1800
● Emergence of Modern Statistics
● Statistics applied to many fields besides
government
● Calculations became increasingly complicated
● Stronger need to build mechanical calculating
machines
10. The Art of Compiling Statistics
● Automation of the US 1890 census
● Hollerith founded the Tabulating Machine Company, which merged
into the Computing-Tabulating-Recording Company in 1911 and was
renamed IBM in 1924
● “Be it known that I, HERMAN HOLLERITH, of New York
city, county, and State, have invented a certain new
and useful Improvement in the Art of Compiling
Statistics; and I do hereby declare the following to be a
full, clear, and exact description of the same, reference
being had to the accompanying drawings, forming a part of
this specification, and to the figures and letters of reference
marked thereon.” (Patent US395782 A: Art of Compiling
Statistics - 1889)
12. 1920s: IBM and Columbia U.
● 1924-26: Columbia University Statistical Laboratory
● 1928-33: Columbia University Statistical Bureau
● Served as “Computer Center” for other academic departments and outside
organizations (Rockefeller and Carnegie Foundations, Yale, Harvard, Princeton)
13. New statistical machines with the mental power of 100 skilled
mathematicians in solving even highly complex algebraic problems
were demonstrated yesterday for the first time before a group of
psychologists, educational research workers and statisticians in the
laboratories of the Columbia University Statistical Bureau in
Hamilton Hall. One of the tabulators exhibited can work out and print
the results of as many as twelve difficult problems in just a single
rapid operation. It is designed to handle differences and reckon
powers of numbers up to the tenth, whereas such machines hiterto
[sic] have been able to compute only the second power of numbers.
Richard Warren and Robert M. Mendenhall, research workers at
Columbia and statistical consultants for the Carnegie Foundation for
the Advancement of Teaching, are responsible for most of the
inventions which were first announced at the educator's convention
in Atlantic City last week.
These new machines will be a tremendous boon to research, Dr.
Ben. D. Wood, Director of the Statistical Bureau, said yesterday,
through making statistical procedure more accurate, much faster
and less expensive. With the assistance of the new tabu-
1920: The first Super-computing machine?
14. Prof. Benjamin Wood
Pioneer in studies on learning technologies:
● an early study (1928) showing that students
taught with films learned more than those
taught with printed materials alone
● a study (1929-1931) showing that using
typewriters encouraged more and higher
quality writing in addition to more
cooperation in the classroom
● Consulting role in developing the first
commercial test scoring machine (the
IBM 805)
15. 1949: Watson meets Busa
Hollerith 1889
● first, preparing a standard or templet
indicating the relative position or order
in which each item or characteristic of
the individual or thing is to be
recorded;
● second, forming according to such a
standard or templet a separate record
for each individual
● third, actuating a series of circuit
controlling devices, corresponding in
number and position to the standard or
templet
Busa 1951
● Transcription of text, broken down into
phrases, on to separate cards;
● Multiplication of the cards (as many as
there are words on each);
● Indication on each of the resulting
cards the respective entry (lemma);
● Selection and alphabetization of all
cards purely by spelling;
● typographical composition of the pages
for publishing.
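Busa's steps correspond closely to what we would now call building a concordance. The following is only a minimal Python sketch of that idea, not a reconstruction of Busa's punched-card procedure: the tiny lemma table, the whitespace tokenization and the function name `build_concordance` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical lemma table for illustration; a real project needs a full lexicon.
LEMMAS = {"praesentia": "praesentia", "praesentiam": "praesentia", "in": "in"}

def build_concordance(phrases, lemmas=LEMMAS):
    """Return {lemma: [(phrase_number, phrase), ...]} with lemmas in alphabetical order."""
    cards = defaultdict(list)
    for number, phrase in enumerate(phrases, start=1):  # one "card" per phrase
        for word in phrase.lower().split():             # one copy of the card per word
            word = word.strip(".,;:!?")                 # drop surrounding punctuation
            if not word:
                continue
            lemma = lemmas.get(word, word)              # mark the card with its entry
            cards[lemma].append((number, phrase))
    return dict(sorted(cards.items()))                  # alphabetize purely by spelling
```

Calling `build_concordance(["Deus est in omnibus", "in praesentia Dei"])` groups both phrases under the entry `in`, which is the kind of function-word evidence Busa was after. Typesetting the result, Busa's last step, is left out of the sketch.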
16. 1950s: Competing Computers
IBM
“In the late nineteenth century,
many businesses adopted a practice
that organized work using [...] an
ensemble of three to six different
devices […] More relevant is the
‘‘architecture’’ of the entire room—
including the people in it - [...] it
was that room, not the
individual machines, that the
electronic computer eventually
replaced.”
(Ceruzzi: 16)
UNIVAC
“The flow of information through the
UNIVAC reflected Eckert and Mauchly’s
background in physics and engineering.
[…] the flow of instructions and data
in the UNIVAC mirrored the way humans
using mechanical calculators, books of
tables, and pencil and paper performed
scientific calculations […] a scientist or
engineer would not have found anything
unusual in the way a UNIVAC attacked a
problem.”
(Ceruzzi: 15)
17. Crunching Words before DH
● 1851: Augustus de Morgan
● 1887: T. C. Mendenhall, "The Characteristic Curves of Composition"
● 1888: C. Mascol, "Curves of Pauline and Pseudo-Pauline Style I"
● 1893: L. A. Sherman, Analytics of Literature: A Manual for the Objective
Study of English Prose and Poetry (Boston: Ginn)
● 1898: W. Lutosławski, Principes de stylométrie
● 1935: G.K. Zipf, The psycho-biology of language; an introduction to
dynamic philology (Boston: Houghton Mifflin Company)
● 1944: G. Udny Yule, The Statistical Study of Literary Vocabulary
(Cambridge UP)
18. The Statistical Study of Literary Vocabulary
These discussions left in my mind a sense of
inadequacy. They did not tell me what I wanted to
know. They dealt with such details as his use of
words and idioms […] mere details, details
certainly quite useful […] but they give no faintest
notion as to what his vocabulary is really like as a
whole […] What I felt I wanted in the first place,
prior to any detail, was some summary, some
picture of the vocabulary as a whole. (p.2)
19. The Statistical Study of Literary Vocabulary
I decided to confine myself to a single class of
words, viz. nouns. The concordance was worked
through page by page and every noun entered on
a card together with the number of times it was
used. From these cards it was easy to book up a
table, the 'frequency of distribution' to use the
statistical term, showing the number of nouns
used once, twice, thrice [...] (p.4)
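Yule's card procedure amounts to computing a frequency distribution of frequencies. A minimal Python sketch follows, assuming the nouns have already been extracted into a list; the sample words and the function name `frequency_distribution` are made up for illustration.

```python
from collections import Counter

def frequency_distribution(nouns):
    """Map each occurrence count to the number of distinct nouns used that many times."""
    occurrences = Counter(nouns)              # one "card" per noun, with its tally
    spectrum = Counter(occurrences.values())  # how many nouns occur once, twice, ...
    return dict(sorted(spectrum.items()))

# Toy input standing in for the nouns copied from a concordance.
sample = ["grace", "soul", "grace", "world", "soul", "grace", "cross"]
print(frequency_distribution(sample))  # → {1: 2, 2: 1, 3: 1}
```

Here two nouns are used once, one twice and one three times: a summary of the vocabulary as a whole rather than of any single word.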
20. Busa's project
Like all good projects, this one began with a question: What is the
metaphysics of presence in St. Thomas Aquinas? Combing for praesens
and praesentia, he realized that such words were peripheral, and, however
unfortunately, Saint Thomas's doctrine of presence is linked with the
preposition in!
Inquiring what St. Thomas meant by "presence," the young Roberto Busa
realized that we must also study the way function-words affect
meaning-words. To study the significant phrase "in the presence" he
needed the shades of "in". His dissertation, defended in 1946, was
essentially founded on a handmade Thomistic Concordance, essentially
complete, but with one entry.
He had made 10,000 hand-written cards.
(Thomas N. Winter 1999: 6)
21. Early DH and IBM
"The use of the latest data-processing tools developed primarily for science and commerce may
prove a significant factor in facilitating future literary and scholarly studies."
(Paul Tasman, 1957)
1964
Literary Data Processing Conference Proceedings, September 9, 10, 11, 1964. Department of
Scientific and Technical Information, International Business Machines Corp., Data Processing
Division: White Plains, N.Y., 1964
1966
First issue of Computers and the Humanities, published by Queens College of CUNY, with the
financial assistance of IBM corporation and U.S. Steel Foundation. The Academic editor was
Prof. Joseph Raben, Department of English, Queens College
22. Surprise Surprise
● Stylometry is a very popular approach in
Digital Literary studies and Text Analysis today
● The R project for Statistical Computing, a
strongly functional language and environment
to statistically explore data sets, is the most
used language in digital literary studies
23. Leech-Short, Style in Fiction
[...] literary stylistics has, implicitly or explicitly, the goal of
explaining the relation between language and artistic
function. The motivating questions are not so much what as
why and how. From the linguist’s angle, it is ‘Why does the
author here choose this form of expression?’ From the
literary critic’s viewpoint, it is ‘How is such-and-such an
aesthetic effect achieved through language?’
24. Louis T. Milic
● A Quantitative Approach to the Style of
Jonathan Swift. Studies in English Literature, v.
23. The Hague: Mouton, 1967.
● Style and Stylistics; an Analytical Bibliography.
New York: Free Press, 1967.
● Stylists on Style; a Handbook with Selections
for Analysis. New York: Scribner, 1969.
25. Poibeau 2014
[…] computational linguists try to study the mechanisms that make
the comprehension of languages possible. They try to build tools
that show the possibilities and the limits of learning with only the
help of real language data, without dictionaries and similar
resources. They try to understand to what extent we can avoid the
use of dictionaries or of other tools that provide meanings a priori
in order to define meaning exclusively out of a corpus, inferring it
from the way in which words are used in it […] it is clear in fact that
we acquire knowledge about language from what we hear and read.
26. Influence and Information Cascades
Within the field of observational learning, there exists a theory of
information cascades:
“An informational cascade occurs when it is optimal for an
individual, having observed the actions of those ahead of him, to
follow the behavior of the preceding individual without regard to
his own information” […]
In other words, once a cascade begins, it tends to continue and to
create a situation of mass imitation in which individuals repeatedly
avoid the road less taken. […] At the same time, the theory tells us
that cascades are fragile; the introduction of a disruptive force, a
new signal, can cause the cascade to collapse and move in an
entirely new direction. […] some mutant writer would take some
other road, and a new cascade would follow. As a way of modeling
literary influence and intertextuality at scale, information cascades
provide an attractive theoretical framework.
27. Macroanalysis' Genealogy
● in part a response to Franco Moretti’s (Moretti
2000, 56-58) discussion of the need for distant
reading in literary studies
● in part related to text analysis and humanities
computing
● in part indebted to stylometry and the use of
statistics to evaluate and analyze corpora of
texts
28. Distant Reading
● Close reading as a method for gathering
evidence is flawed, because interpretation is
subjective and biased
● big data render close reading totally
inappropriate as a method of studying literary
history
● massive digital-text collections demand a new
type of evidence gathering and meaning
making
29. Linguistics and Stylistics
In recent years we have seen the emergence of computational
methods, usually using statistics, whose main feature is to be
efficient in working with big data. In a way, being efficient
was more important than being meaningful. It is not
possible to compute thousands or millions of documents in a
few seconds with a deep and meaningful analysis, even if
computers are more and more powerful. Suddenly, the easy
way is counting (forms, words, patterns and collocations,
frequencies etc.)
(Poibeau 2014)
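The "easy way" of counting that Poibeau describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tokens are runs of letters, adjacent word pairs (bigrams) stand in as the crudest form of collocation, and the function name `count_forms_and_bigrams` is hypothetical.

```python
from collections import Counter
import re

def count_forms_and_bigrams(text):
    """Return (word-form counts, adjacent word-pair counts) for a text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, Unicode-aware
    forms = Counter(tokens)                          # frequency of each word form
    bigrams = Counter(zip(tokens, tokens[1:]))       # frequency of each adjacent pair
    return forms, bigrams
```

Nothing in these counts says what a word or a pair means, which is the trade-off between efficiency and meaningfulness named in the quotation.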
30. Statistics: Why?
● Statistics is a science of the aggregate
(Scienza del collettivo)
● The statistical method is the only one that
allows us to analyse big data
31. Statistics: Why not?
If you use a statistical method, the individual
items lose their materiality: they become
abstractions that carry only the characteristics
under investigation, erasing all the other features
that are not of interest to the research.
32. Aravamudan on Moretti
● Moretti's work on the long arc of the novel has
expanded our understanding of its scope and range
● European hegemony is exercised, even if he
encourages a cosmopolitan approach
● Moretti has no time for the critical interpretation of
individual fictions, except as exemplary of very
large trends that can be followed through their
tropological and formal analysis, and this is of a
piece with his grand narrative of intellectual
diffusion with Europe as the core.
33. Novel: Rise, Diffusion, Resistance
● The rise of the novel (Ian Watt)
● Enlarging the rise of the novel (Moretti &
Jockers)
● Resisting the Rise of the Novel (Aravamudan)
35. Event in the history of mediation
Enlightenment is not just a philosophical position-taking but an
institutional event in the history of mediation, a time and a
place, as well as a mode of interaction entailing the creation of a
new epistemological infrastructure when new genres and
formats for the presentation of knowledge were explored and new
associational practices developed for the collation of information.
New protocols came about, including the 'postal principle' by
which anyone can address anyone, public credit and copyright,
all of which saturated knowledge production.
36. Distance Transmission Absence
Or as John Guillory extends this argument, the mediations created
by the Enlightenment entailed an understanding of distance,
transmission and absence as operational between the poles of
communication, whether between individuals, objects of analysis,
or knowledge systems. Taking on this insight, we can propose that
genres are to be understood not just as containers for
information but rather as apparatuses of mediation that
traverse social distance, enable cultural transmission and
make absence productive of new forms and new media.
37. Consequences
● put into perspective the use of statistical
computing in literary studies
● taking seriously the meaning of digital computing
● digital support is not simply another support of the
same thing (text), but a transformation of the
(written) text itself into something else
● situate the literary system within the media system
(Fiormonte 2003: 31)
38. Semiotic Computing?
● Connecting the debate on digital representation with semiotics is perhaps
the only possible method that will attack the very core of the digital
production of symbols, highlighting both problems and possibilities
(Fiormonte 2009)
● Instead of R, for literary studies we could use a different programming
paradigm (event-driven (VS object-oriented) and declarative are the ones I
am WILLING TO TRY to understand now)
● P. B. Andersen, A Theory of Computer Semiotics. Semiotic Approaches to
Construction and Assessment of Computer Systems, Cambridge UP, 1997.
39. events in the history of mediation
A Companion to Digital Literary Studies is fundamentally a narrative of
what may be called the scene of "new media encounter" — in this
case, between the literary and the digital. The premise is that the
boundary between codex-based literature and digital information
has now been so breached by shared technological, communicational,
and computational protocols that we might best think in terms of an
encounter rather than a border. And "new media" is the concept that
helps organize our understanding of how to negotiate — which is to
say, mediate — the mixed protocols in the encounter zone. (Liu
2008)
40. Electronic Documents
● Even when we press it into a mould, the electronic
document is and remains a source in motion.
(Fiormonte 2003: 15)
● If reading consists in […] constructing a network of cross-
references within the text, associating it with other data,
integrating words and images within a personal memory
that is continuously being updated, then hypertext
mechanisms represent an objectivation,
exteriorization, and virtualization of the reading
process. (Lévy: 56-57)
41. Books VS hypertexts
If we define a hypertext as a space of possible
readings, a text would then represent a
particular reading of an hypertext […] Any
public text accessible through the Internet is
now a virtual component in an immense and
ever-expanding hypertext. (Lévy: 58-59)
42. Texts VS Events
● representation of texts (TEI, XML, object-
oriented)
● representation of events (performance,
readings, event-driven paradigm languages)
43. Case Study: The Council of Egypt
● Arab manuscript (14th century)
● Vella, Consiglio d'Egitto (18th century)
● Sciascia, Consiglio d'Egitto (20th century)
44. Corporate Orientalism
Taking the late eighteenth century as a very
roughly defined starting point, Orientalism can be
discussed and analyzed as the corporate
institution for dealing with the Orient – dealing
with it by making statements about it, authorizing
views of it, describing it, by teaching it, settling it,
ruling over it: in short, Orientalism as a Western
style for dominating, restructuring, and having
authority over the Orient
(Edward Said, Orientalism)
45. Enlightenment Orientalism
[…] imaginative fiction [...] defined European understandings of
cultures that were seemingly foreign but that shared the past in
ways that needed expert explanation. […] This imagination was
experimental, prospective, and antifoundationalist. […] The
experimentation came to an end, however, partly out of generic
exhaustion and partly as a result of a rising nationalist tide […]
Enlightenment Orientalism was very much an imaginative
Orientalism, circulating images of the East that were nine parts
invented and one part referential, but it would be anachronistic to
deem these images ideological, as they did not tend principally
towards domination of the East [...]
46. Side Projects
● Books are falling apart:
http://futuread.hypotheses.org/
● Leggere, scrivere e far di conto:
http://infouma.hypotheses.org/
● History of Humanities Computing pre-1994:
http://historyofhumanitiescomputing.wikispaces.com
● Bibliography of HC pre-1994:
https://www.zotero.org/groups/252168/
47. Definition of DH
● The attempt to create intelligent (reading)
machines
and
● to teach people how to be smarter than the
intelligent machines we created
Would you say that this is a good definition of Digital Humanities as a field?
Does it convey the position we are in as digital humanists, and give a sense of what we are doing?
The thing is that this is not a definition of digital humanities at all. I simply changed the name in a quote that describes statistics as a discipline.
So the question is not what digital humanities is; the question is whether this similarity is just a coincidence or whether there is a deeper meaning in it.
While I was working on this presentation I came across the announcement of a seminar on digital methods in research that emphasized precisely the importance of statistics.
This is not very far from Jockers' emphasis on big data. That need not be a problem in itself, unless it creates a biased approach hidden in the technology and the tools we are using.
I think Jockers' description is not immune from “technological determinism”: it places too much emphasis on technological progress as the cause of methodological change. Besides, I believe it is inaccurate to insist too much on the novelty of methods that are ultimately based on statistics.
To make my point I will need to quickly survey the history of statistics, in order to show the connections with computing and with literary studies, the two main areas we are trying to connect here.
The interesting thing about Hollerith's patent is that he is not patenting a machine, but the process by which statistics were compiled, using different machines (that were patented separately). In a way, this is pretty much the invention of the algorithm, with the different steps described in the application. It is more on the side of software than of hardware, and it involves the way in which data are collected and manipulated.
Hollerith was a graduate of Columbia University, and Columbia is where we find the first academic use of IBM machines: not in the hard sciences, as one might expect, but first in the Columbia University Statistical Laboratory (1924-26) and then in the Columbia University Statistical Bureau (1929-?), directed by Prof. Benjamin Wood, which served as a "computer center" for other academic departments and for outside organizations like the Rockefeller and Carnegie Foundations, Yale, Harvard, and Princeton.
So, IBM was working on statistical machines. But we should ask at this point why Busa decided to go to IBM. He was not the first to be interested in counting words in literary works. The story starts in 1851, when Augustus de Morgan, a mathematician who was also in contact with George Boole, wrote a letter to
The connection between this and Busa's work is in my opinion undeniable. And therefore new questions are piling up regarding his work with IBM:
- to what extent was Busa “giving in” to statistics?
- to what extent was he trying to bend this methodology towards more traditional literary research?
Also, we have to keep in mind that Busa was at first interested not in literary studies but in philosophy, and his approach to the text was most probably not the same.
One of the problems of stylometry is precisely what idea of style it is pursuing. Quite clearly, Leech and Short's definition of (literary) style cannot be applied to stylometry. The relationship between authorial intention and linguistic feature, which is the gist of Leech and Short's definition, is hardly applicable to the emphasis on “boring” words.
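To make the emphasis on “boring” words concrete, here is a minimal Python sketch of the kind of measurement involved: the relative frequency of a few function words in a text. The word list is my own illustrative assumption (stylometric studies usually take the most frequent words of the corpus itself), and the function name `function_word_profile` is hypothetical.

```python
from collections import Counter
import re

# Illustrative list only; not taken from any particular stylometric study.
FUNCTION_WORDS = ("the", "of", "and", "in", "to", "a")

def function_word_profile(text, words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters
    counts = Counter(tokens)
    total = len(tokens) or 1                         # avoid division by zero
    return {word: counts[word] / total for word in words}
```

Two texts can then be compared by the distance between their profiles. Nothing in such a profile refers to authorial choice or aesthetic effect, which is exactly why Leech and Short's definition does not fit it.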
So, what is stylometry about? My answer is that it should be linked to the question of the transmission of language, that is to say the way in which language is transferred from one generation to the next, from a writer to a reader, etc. By language I mean not only oral language, or even written language, but also the language of the novel, for instance. The point is that you don't learn to understand a novel through a theoretical discussion of structuralism; you learn what a novel is by reading one. And you learn a language by listening to it, not by reading a grammar book.
Stylometry is situated at this level, more than at the level of literary stylistics.
Jockers' emphasis on influence is itself connected to this kind of question.
In the case of literary studies, the doubt is even more radical, because we cannot equate the literary with the linguistic. Language is obviously important, but literature is more than that, and the relationship between the two disciplines is not as straightforward as one might imagine.
It is this new epistemological infrastructure that creates the epistemological productivity that will in turn cause information overload, the need for statistical machinery and big data.
Seeing things from this point of view affects the way we interpret the historical events I presented earlier, with a number of important consequences that we should take into account.
Liu presents a similar approach. The only difference is that, in focusing on what he calls “new media encounters”, he is actually talking about History with a capital H, whereas what I am interested in are lowercase events, such as when we read a book, go to the theatre, or watch a movie. These are the kinds of things that we should be able to represent digitally.