Sharing<br />Datta-Mine-ing<br />John Unsworth<br />(with contributions from Ted Underwood and the HTRC executive committee)<br />Graduate School of Library and Information Science<br />University of Illinois, Urbana-Champaign<br />June 2011<br />
Where we’re going<br />Our anxieties about quantitative methods in the humanities<br />Worse and better examples of arguments in the humanities using quantitative data<br />The real problem: data which exists, but to which we don’t have access<br />A solution to that problem, involving three-foot-long spoons. <br />
Julia Flanders, Digital Humanities and the Politics of Scholarly Work: <br />“Debates about quantification, numerical analysis, and the reductiveness of detail have figured significantly in discussions of scholarly method and the nature of literary study for over two centuries, and they also seemed to me to have important continuities with the problem of the aesthetic[….] consider the following brief episode from ItaloCalvino’s If on a Winter's Night a Traveller . In this vignette, a novelist named Calixto Bandera is considering a new spin on the problem of audience:”<br />http://dev.stg.brown.edu/staff/Julia_Flanders/pubs/flanders_dissertation.xhtml<br />
The Computer as Reader<br />“I asked Lotaria if she has already read some books of mine that I lent her. She said no, because here she doesn't have a computer at her disposal.” (186)<br />“She explained to me that a suitably programmed computer can read a novel in a few minutes and record the list of all the words contained in the text, in order of frequency [...]<br />
The Computer as Writer<br />Now, every time I write a word, I see it spun around by the electronic brain, ranked according to its frequency, next to other words whose identity I cannot know [….] Perhaps instead of a book I could write lists of words, in alphabetical order, an avalanche of isolated words which expresses that truth I still do not know, and from which the computer, reversing its program, could construct the book, my book.” (188)<br />
Arguing with Data<br />Data enables arguments based on quantitative and/or empirical data<br />Data still requires interpretation, and you can still make better and worse interpretations, and more or less compelling arguments<br />In addition to new kinds of arguments, you can make new kinds of mistakes, especially mistakes based on incomplete data or on an incomplete understanding of data<br />
Mistakes based on incomplete data<br />
Mistakes based on incomplete data<br />
New kinds of arguments<br />Ted Underwood is exploring the changing etymological basis of diction in English, over a 200-year period, especially the shift from words derived from German, to words derived from Latin, and back again.<br />
Etymology and StyleTed Underwood, 2011<br /><ul><li>English professors have a long, lively history of drawing specious conclusions from the “Latinate” or “Germanic” character of a particular writer’s style.
There is nevertheless good evidence that older words do predominate in informal, and especially spoken English. [Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.]
Can we use this fact to trace broad changes of register in the history of written English?</li></li></ul><li>The fundamental distinction is not Latinate/Germanic, but date of entry. French was the written language for 200 years; words that entered English before that point had to be used in the spoken language to survive. This includes “Latinate” words like “street” and “wall.”<br />http://bit.ly/h8cJem<br />
To understand the significance of the result, it needs to be broken down by genre. Initial results suggest that fiction and nonfiction prose both become more formal (less like speech) in the 18c. Drama and poetry change little, although older, less formal, “speechlike” words always predominate in drama.<br />
Datum = Something Given<br />So, Ted’s investigation concerns historical trends: as such, it is reasonable to think that it might be interesting to extend beyond 1900. <br />Can we do that? Only if we are given the data. <br />
Copyright Creep<br />http://en.wikipedia.org/wiki/Copyright_Term_Extension_Act<br />
A murine chronology (1928)<br />AprilProduction begins on the Mickey Mouse film Plane Crazy inspired by Charles Lindbergh's trans-atlanticflight <br />May 15 Work is completed on the film Plane Crazy.<br />Walt Disney's first silent film featuring Mickey Mouse, Plane Crazy premieres as a sneak preview at a theatre on Sunset Boulevard, in Los Angeles, California. It cost US$1772.89 to make. Minnie Mouse also debuts. <br />May 16 Walt Disney applies for a trademark for "Mickey Mouse", for use in motion pictures.<br />http://www.islandnet.com/~kpolsson/mmouse/<br />
Steamboat Willie<br />“Steamboat Willie has been close to entering the public domain in the United States several times. Each time, copyright protection in the United States has been extended. It could have entered public domain in 4 different years; first in 1956, renewed to 1984, then to 2003 by the Copyright Act of 1976, and finally to the current public domain date of 2023 by the Copyright Term Extension Act (also known pejoratively as the Mickey Mouse Protection Act) of 1998. The U.S. copyright on Steamboat Willie will be in effect through 2023 unless there is another change of the law.”<br />http://en.wikipedia.org/wiki/Steamboat_Willie<br />
The Waste Land<br />T.S. Eliot, by Wyndham Lewis, 1938<br />Original publication of the poem: 1922, in The Dial (an American literary magazine<br />
Copyright and The Waste Land<br />“The copyright was registered in the United States sometime in 1922. <br />The copyright gave 28 years of protection plus any additional time to cause it to expire after midnight on the last day of the year. Thus it was protected up to and throughout 1950 (1922 + 28). <br />In 1950 the copyright could be renewed for 28 more years meaning that it would enter the public domain in the United States after the end of 1978 (1950 + 28). <br />In the United States, the Copyright Act of 1976 extended the renewal from 28 years to 47 years giving The Waste Land protection for 19 more years or throughout 1997 (1950 + 28 + 19).”<br />
Copyright and The Waste Land<br />“On January 1, 1998, The Waste Land went into public domain in the United States. <br />On October 27, 1998 U.S. public law 105-298 extended renewal of copyrighted items (that were still under protection) by 20 years. <br />The Waste Land was, however, already in the public domain in the United States and thus remains in that state. <br />If The Waste Land was written in 1923 it would be protected for 95 years (28 + 28 + 19 + 20) plus the remainder of the last calendar year meaning that it would go into the public domain (in the US) January 1, 2019.” <br />
And in England…<br />“The Waste Land is still under copyright restrictions in the United Kingdom and most likely in the countries of the European Union, the Commonwealth of Nations and other countries. Copies of T.S. Eliot's poems, plays, essays and other of his works that are placed on computers for public access through the internet may be infringing on copyrights held by Faber and Faber, Mrs. T.S. Eliot and others.”<br />Copyright information about the Waste Land comes from R.A. Parker, “Exploring the Waste Land,” a hobbyist site at http://www.std.com/~raparker/exploring/thewasteland/excopy.html<br />
Give, sympathize, control<br />401. 'Datta, dayadhvam, damyata' (Give, sympathize, control). The fable of the meaning of the Thunder is found in the Brihadaranyaka--Upanishad, 5, 1. A translation is found in Deussen'sSechzig Upanishads des Veda, p. 489. <br />
The Waste Land<br /><ul><li>Is a great mashup of Western Culture that tries to grasp patterns through the juxtaposition of fragments. The Waste Land app is a nicely done mashup of stuff Faber owns, concerning that great mashup of Western Culture. Not the same kind of creative production, frankly.
Faber, in England, isn’t letting go of it their property yet, either. By continuing to produce new products based on it, they strengthen their claim to it. The battle for the Waste Land may be lost in the colonies, but not yet in the Kingdom.
The Waste Land marks the chronological beginning of a wasteland created by Datta-mine-ing. </li></li></ul><li>
1923 is about here<br /> |<br />Date range starts here, and<br />goes counter-clockwise.<br />
HathiTrust Research Center<br />
HathiTrust Digital Library History<br />To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.<br />Launched in October 2008<br />University of Michigan<br />Indiana University<br />Used Google Books Repository at Michigan as model<br />Expanded to include content from <br />CIC Member Libraries<br />UC System Libraries<br />University of Virginia<br />Now includes more than 50 partner institutions<br />
HathiTrust Research Center Today<br />HTRC is dedicated to the provision of access to a comprehensive body of published works for scholarship and education for computational research purposes.<br />Lightweight Organization<br />Executive Committee<br />Beth Plale, Indiana<br />Scott Pool, Illinois<br />Robert H. McDonald, Indiana<br />John Unsworth, Illinois<br />Advisory Board – TBD<br />HT Executive Committee Sponsor<br />Laine Farley, California Digital Library<br />
HathiTrust Research Center Will:<br />Maintain repository of text mining algorithms and retrieval tools available on-line for human and programmatic discovery. Also register derived data sets, indexes, and versions in registry repository. <br />Be a user-driven resource, with an active advisory board, and a community model that allows users to share algorithms and tools. <br />Support interoperability across collections and institutions, through use of inCommon SAML identity. <br />
Non-consumptive Research<br />“Research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book.” <br />
Non-consumptive Research<br />One of HTRC’s unique challenges is support for non-consumptive research. <br />This will entail bringing algorithms to data, and exporting results, and/or providing people with secure computational environments in which they can work with copyrighted materials without exporting them. <br />We are still going to need to persuade publishers that this not only doesn’t threaten their business, but actually enhances and expands their business. <br />
Starving at the Banquet<br />Variously attributed to Japanese, Chinese, and Jewish tradition; I think the story probably originates with a 13th-century Hassidic rabbi, MaharamMiRottenberg. A (man, woman) asks an (angel, monk) for a preview of heaven and hell. The angel takes the man to a beautiful place where a banquet has been set, and yet the people at the banquet are starving, because they each have three-foot-long spoons strapped to their arms, and they can’t get their food into their mouths. This is clearly hell. Then they go to visit heaven: same setting, same banquet, but everybody’s fat and happy, because they’re using their spoons to feed each other. <br />
Humanists have a long and uncomfortable relationship with quantitative methods, one that Julia Flanders explored in her dissertation on digital humanities and the politics of scholarly work (online, recommended). The passage she cites here, from If On a Winter’s Night a Traveller, sets out these anxieties from an author’s point of view—the fear of being read quantitatively—but (since that work succeeds in making us identify with the author) it’s a good opening to a discussion of our anxieties about reading computationally, as well.
c.f. “How Not To Read A Million Books” or “Distant Reading” or other such arguments.
Calvino’s last move here is to fear something he doesn’t at all believe, namely that the analytical program could reverse engineer a book and a writer’s truth out of a list of words. But there’s something closer to reality here as well: the writer can’t keep track of his use of language in the same mindless numerical way that a computer can—and there are dimensions of style, for example, that can be understood through quantitative analysis in ways the author himself might not understand them. Usually, though, that kind of analysis is not just performed within one book, but in some kind of comparative framework, where comparative frequencies are what’s interesting—and even if the author could keep track of his own word frequencies, he certainly couldn’t keep track of the frequency with which he uses a word compared to all other writing in his language. And finally, to give Calvino his due, it is incumbent on a reader to make some interesting observation or argument based on these frequencies: simply reporting them is neither interesting nor useful.
Data vs. Lore: something my son noticed when he was about five: “Oh, I get it: Data vs. Lore—facts vs. stories.” Never noticed that, myself. He’s a lawyer now, and I’m a lapsed English professor. “Digital computing doesn’t’ change the nature of argument, but it changes elements of argument, such as proof” (Robert Mitchell, Duke, cf TILTS). But I would also say that you tend to argue about different things when you argue with data, I’m going to give some examples that focus not just on computing, but on computing with data. But I’ll also talk about the fact that with these new kinds of arguments there also come new limitations and new kinds of error.
“incomplete data” or, more accurately in this case, mistakes based on an incomplete understanding of data.
Also, you need to understand something about the history of the language to understand the significance of certain quirks in data. And you might say “well, why doesn’t Google just treat its ngrams in a case-insensitive way?” That assumes that case isn’t of interest to some users (as it might be in the example here), and that Google can predict all transformations of the data that would be useful (which it can’t) and that even if they could, they’d have the time and money to perform them (which they don’t). So, they are absolutely literal about the data—and that’s probably the best choice, under the circumstances. But that’s why we need a computational environment in which we can make our own representations of this data, representations suitable to answer particular research questions.
Latin vs. German roots have been used anecdotally to describe, define, characterize, analyze the style of various writers of English, usually along some spectrum of formality to informality, or high to low culture, or artificiality to authenticity. A lot of that’s baloney, but there’s a kernel of truth in this: linguists have provided some data—some empirical evidence that informal—that is, spoken—language has a higher proportion of older words than written English.
Fate of the 500 most common words that entered the English lexicon before 1125, vs. the rest of the language. Why 1125? Really, it’s 1066: but 1125 is the point from which French really becomes the official language of England, for a couple of hundred years. That’s why it’s important to keep track of words that entered the language before that date: if they survive, they are some kind of ground truth about English, against which we could measure trends. The trendline we see here seems to tell the story of those words getting systematically pushed out of the lexicon until a bit before 1800, and then making a resurgence through the 19th century. If you think there’s a 19th-century movement to restore some kind of authentic diction to writing in English, then this trendline is interesting.
So, actual evidence (data) suggests that there’s an other important dimension to this problem, namely genre. The trends in the etymological basis of diction are different for writing than they are for speech, and they’re different for those genres of writing that imitate speech (poetry, drama) than they are for those that don’t (fiction, non-fiction). We don’t have a trendline for poetry and drama here, but if we did, it wouldn’t look the trends for non-fiction and fiction [click] it would be more of a straight line.
Datum, from Latin: something given. Or not given, when access to the data is prevented—for example, because of copyright.Legally, the Waste Land is available for data mining, at least in the US, and it would be an interesting text to look at, in terms of patterns of speech and poetry (and drama, bits of which are embedded throughout). It would no doubt be an outlier or a limit case for Ted’s investigation---but a very interesting outlier. The problem is that we can’t mine much beyond this work: it marks Consider the irony of having started a scholarly career studying 20th-century literature, and arriving at text-mining as your preferred method.
Note the lengthening reach-back segments in this graph. What’s not clear from the graph is that between 1923 and 1963, copyright wasn’t automatically renewed after the initial 28-year period. The University of Michigan (praise be) is working hard to expand the public domain by investigating the copyright status of works in the collection from that time-period, which includes a great deal of the amorphous and difficult category of orphan works (that is, works which could be in copyright, given their publication dates, but for which copyright might not have been renewed,or for which copyright holders might no longer exist). Ownership, as it turns out, is a messy business
Steamboat Willie, our secret overlord.
Every time we extend the term of copyright, we reach back in time to make sure Steamboat Willie is on board. Why? Disney spends a lot of money lobbying your senators and representatives, that’s why.
Note the publication date: 1922. Also note the scholarly mien of our author. Eliot acknowledged Jesse Weston's From Ritual to Romance as a source for the central idea of his poem, and Weston related the theme of the wasteland to the infertility or injury of the king, specifically the Fisher King. So, hold on tight here—for the purposes of the current metaphorical application of the Waste Land, the Fisher King is Mickey Mouse, and he can be healed only if the Scholar (let’s call him Percy) ask the right question about whom copyright is intended to serve—but having been warned he might be sued if he opens his mouth, Percy keeps silent, and the land remains barren. The Waste Land, then, is my boundary marker for the beginning of the new dark ages, under the dominion of Mickey the Merciless. We can mine it, but we can’t mine much beyond it.
A hobbyists notes on the copyright status of the Waste Land: correct as far as I can tell, and it gives you some idea of what a mess this all is.
Remember, it’s a mess like this for every single book published in, oh, the last hundred years or so.
And once you think you have it figured out for one book for one country, then you get into international copyright law, and start all over.
So, according to the upanishads, humans, demons, and celestial beings went to the creator and asked for direction. The creator said DA, but each category of being heard that differently, according to their spiritual needs. As Swami Krishnananda explains: “Human beings are greedy. They want to grab everything. Hoarding is their basic nature. "I want a lot of money"; "I have got a lot of land and property"; "I want to keep it with myself"; "I do not want to give anything to anybody". This is how they think. [….] So, to the human beings this was the instruction - Datta, give, because they are not prepared to give. They always want to keep. Greed is to be controlled by charity.” So the advice to humans: don’t be greedy: Give. The advice to demons was “sympathize,” because their nature is cruel; the advice to celestial beings was “control yourselves” because their nature was to do whatever they wanted to do.
I do believe that Eliot would have relished the kind of exploration of the cultural record that digitized texts and text-mining now make possible. But if we were interested in doing a mashup of 20th century culture, like what Eliot did for earlier periods, that would be impossible. Copyright exists to encourage creativity, first from the author, but then later, from others.
So, you guessed it: The Waste Land has an app. Actually, Apple’s app of the week, last week, when it was introduced. Produced by Faber & Faber (owner of most of the content) and TouchPress (an app developer that mostly concentrates on scientific educational apps). The app is nicely designed and produced, and it’s interesting to watch it dance around rights issues: it has newly commissioned readings and commentary, now owned by Faber, materials mined from the Faber archives (Faber was Eliot’s employer for much of his life), and one “buy-in-app” features (the reading by Sir Alec Guinness), which uses material Faber doesn’t own. Had this been made by an American company, it couldn’t have been sold on iTunes in England.
Google is basically on track to digitize everything in WorldCat (30M volumes of print material). Other smaller digitization efforts like the Open Content Alliance have produced significant amounts of material; cultural institutions are engaged in their own smaller scale efforts, but taken together those still amount to a significant amount of digital content. Much of this is coming from academic research libraries, and much of that is flowing back into the HathiTrust, which began as a shared digital repository for CIC (Big Ten) libraries who were contributing to (and getting OCR'd text and page images back from) the Google Books project. The HathiTrust, today, contains 8.7M digitized volumes, 4.7M titles (lots of titles are more than one volume, and some single-volume titles get digitized more than once), and about 3B pages. Of that, about 27% is in the public domain.
In its most restrictive sense, this means you can mine, but you're not allowed to read and understand. The idea is that with copyrighted material, we have to find a way to permit research, but this research can't include activities that would normally be benefits of having purchased the material being used for research--so, for example, if the material is a book, and one normally has to buy a book in order to read it (pace lending rights, etc. etc.) then you can't read. But of course, as any humanist will tell you, if you can't read, you can't understand. And as any machine learning expert will tell you, if you can't read, you supervise machine learning. So, in this wasteland, we can do research, but only with our eyes closed, only in the dark, and only if we promise not to understand our results.
Example: Underwood and OED’s etymological data—taking what was a component of a product and developing an understanding of how it might be used on its own, as a product, in the form of a web service, for example. Using the interest of scholars as a way of understanding how to evolve the business of academic publishing and research support.
In order to determine whether the HTRC represents heaven or hell, though, scholars are going to have to figure out how to work with publishers—how do we use our respective constraints to feed each other, rather than starving at a non-consumptive banquet?