Data versus Text: 30 years of confrontation


As presented at UCL Hidden Histories Symposium on 17 Sept 2011

Published in: Education, Technology

  1. 1. Data vs. Text: 40 years of confrontation Lou Burnard
  2. 2. Plan <ul><li>Text and data </li></ul><ul><li>The three ages </li></ul><ul><ul><li>Literary and Linguistic Computing </li></ul></ul><ul><ul><li>Humanities Computing </li></ul></ul><ul><ul><li>Digital Humanities </li></ul></ul><ul><li>What's that noise in the digital library? </li></ul>
  3. 3. Digital Text vs. Digital Data <ul><li>Data processing is about numbers, facts, statistical tendencies </li></ul><ul><li>Text processing is about words, writing, language </li></ul><ul><li>Information technologies have consistently opposed “Data” and “Text” </li></ul><ul><ul><li>By processing Text as if it were Data </li></ul></ul><ul><ul><li>By processing Data as if it were not Text </li></ul></ul>
  4. 4. Digital Text <ul><li>Digital texts are only metaphorically books... but the metaphor is now both so persuasive and so pervasive that we barely notice it... </li></ul>Is “going digital” only about producing cheaper or more accessible simulations of printed pages and finding new markets for their distribution?
  5. 5. Three simple truths <ul><li>There is no going back: the knowledge infrastructure is now irretrievably digital </li></ul><ul><li>The business models of the knowledge infrastructure have changed irrevocably </li></ul><ul><li>The quantitative changes facilitated by digital technologies approximate qualitative change </li></ul>
  6. 6. Irrevocable digitality <ul><li>The objects of Humanities scholarship are now digital, even if its methods are not </li></ul><ul><li>And its methods change in response to the technological environment surrounding us... </li></ul><ul><li>We are moving from a web of documents to a web of data </li></ul><ul><li>The political and cultural issues about ownership and access have not fundamentally changed, but are being played out in a different context </li></ul><ul><li>But first, a little history lesson... </li></ul>
  7. 7. Literary & Linguistic Computing
  8. 8. 1949-1980 <ul><li>The Heroic age... </li></ul><ul><ul><li>Father Busa and the Index Thomisticus </li></ul></ul><ul><ul><li>The Brown Corpus </li></ul></ul><ul><ul><li>Thesaurus Linguae Graecae </li></ul></ul><ul><ul><li>etc. </li></ul></ul><ul><li>Concordances, stylistic analysis, authorship studies, language corpora... </li></ul><ul><li>The mainframe computer, tended by serious people in white coats, ran batch jobs which were submitted to a queue for its attention, and then executed, producing output. </li></ul>
  9. 9. A few pioneers <ul><li>1949-56: Roberto Busa begins work on the Index Thomisticus, a collection of punched cards indexing the work of Aquinas, eventually published in 1992 as Thomae Aquinatis Opera Omnia cum hypertextibus in CD-ROM. </li></ul><ul><li>1959: Cornell University Press publishes (on paper) a number of literary concordances derived from computer printout </li></ul><ul><li>1968: Foundation of CETEDOC at Louvain-la-Neuve, leading to a massive database of patristic Latin texts </li></ul><ul><li>1968: Maurice Tournier and others at Saint-Cloud work on statistical analysis of political texts </li></ul><ul><li>1970: Trevor Howard-Hill produces concordances to Shakespearean folio and quarto texts as part of his DPhil on the typesetting of the First Folio </li></ul>
  10. 10. ALLC Conference No. 6 (1980)
  11. 11. Authorship issues: from A.Q. Morton, Paul, the Man and the Myth (1966)
  12. 12. In LLC, what counts is counting <ul><li>The primary tools are </li></ul><ul><ul><li>The concordance (seen as an end in itself) </li></ul></ul><ul><ul><li>Quite complex and sophisticated statistics </li></ul></ul><ul><li>There is an implicit claim that “style” or “register” is quantifiable in some way (though the statistical properties of language are barely understood) </li></ul><ul><li>In the US, “cliometrics” and Time on the Cross (1974) </li></ul>
  13. 13. LLC is also a journal, and an annual conference
  14. 14. LLC is alive and well and living in France <ul><li>Text as a statistical phenomenon </li></ul><ul><ul><li>e.g. Maurice Tournier Les mots de mai 68 </li></ul></ul><ul><li>Factor analysis and data mining </li></ul><ul><ul><li>marketing applications </li></ul></ul><ul><li>Textometrie </li></ul>
  15. 15. Humanities Computing
  16. 16. 1980-1994 <ul><li>Institutionalization </li></ul><ul><li>Meanwhile, the historians re-enter the field </li></ul><ul><li>Is Humanities Computing an Academic Discipline? </li></ul><ul><li>The “text encoding” project </li></ul>
  17. 17. The rise of the HC centre <ul><li>In the home, the eighties was a decade of boundless faith in technology that nearly worked </li></ul><ul><li>In academia, likewise, digital methods and resources, though perceived as alien and difficult, were also finding their place </li></ul><ul><li>In the UK </li></ul><ul><ul><li>Computers in Teaching Initiative </li></ul></ul><ul><ul><li>Arts and Humanities Data Service </li></ul></ul><ul><li>Something new, or something old done better? </li></ul>
  18. 18. Communities <ul><li>E-mail and e-mail lists: Humanist </li></ul><ul><li>Electronic Text paradigms </li></ul><ul><ul><li>Oxford Text Archive vs Project Gutenberg </li></ul></ul><ul><ul><li>Publishing on CD-ROM: e.g. the OED </li></ul></ul><ul><li>Natural Language Processing and Artificial Intelligence </li></ul><ul><li>Major public funding for infrastructural activities </li></ul><ul><ul><li>Both national and European </li></ul></ul><ul><li>Is there money to be made in e-publishing? In home computers? On the Internet? </li></ul>
  19. 19. Institutionalization <ul><li>As “new” institutional structures emerge, their cataloguing becomes their self-justification </li></ul>
  20. 20. The re-appearance of Quellenkritik <ul><li>In France, J-P Genet and other historians cautiously advance the idea that computerized studies of original source materials might actually be useful </li></ul><ul><li>Further systematized in Germany by Manfred Thaller with kleio, a textual DBMS avant la lettre </li></ul><ul><li>An international Association for History and Computing is founded in 1987 </li></ul>
  21. 21. The challenge for HC <ul><li>Does Humanities Computing have a Theory? </li></ul><ul><li>Is there anything intrinsically different about digital texts? </li></ul><ul><li>If, on the other hand, HC is purely a bundle of methods, what unifies them? </li></ul><ul><li>Traditions (“scholarly primitives”)* </li></ul><ul><ul><li>finding by means of external characteristics </li></ul></ul><ul><ul><li>analysing by means of internal features </li></ul></ul><ul><ul><li>associating by means of shared perceptions </li></ul></ul><ul><li>What tools and methods best help combine these approaches? </li></ul>* 1999:
  22. 22. It's all about modelling, stupid
  23. 23. Resources digital resources encoding analysis abstract model
  24. 24. The problem with digital resources (c. 1989) <ul><li>They don't always work </li></ul><ul><ul><li>(You need the right computer to make them work) </li></ul></ul><ul><li>You can't find them </li></ul><ul><ul><li>They're not in the library or the reviews </li></ul></ul><ul><ul><li>They are not consistently described (or at all) </li></ul></ul><ul><li>You don't know where they come from </li></ul><ul><ul><li>They're not authenticated like books </li></ul></ul><ul><ul><li>They're not always scholarly </li></ul></ul><ul><li>They don't keep still </li></ul><ul><ul><li>They don't stay in the same place </li></ul></ul><ul><ul><li>People change them unpredictably </li></ul></ul><ul><li>And above all... </li></ul>There are too many mutually incomprehensible markup formats
  25. 25. A classic example... Loomings “CALL me Ishmael. Some years ago --- never mind how long precisely --- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world”
  26. 26. A classic example... Loomings |chap1 <C 1> Loomings chapter chapter[1]{Loomings} :h1.1. Loomings MOBY001001LOOMINGS |C1 .chapter Loomings .cp;.sp 6 a;.ce .bd 1. Loomings ~xi Good news: there is software capable of translating amongst 400 different encoding formats. Bad news: there ARE 400 different encoding formats…
  27. 27. Transmitting our interpretations <ul><li>scholarship depends on continuity </li></ul><ul><li>it is not enough to preserve the bytes of an encoding </li></ul><ul><li>there must also be a continuity of comprehension: the encoding must be self-descriptive </li></ul>Hence the importance of the TEI (Text Encoding Initiative)
  28. 28. TEI: the main achievement of HC? <ul><li>Originally a response to the multiplicity of formats and lack of standards </li></ul><ul><li>The TEI emerged as a single, encyclopaedic model of the “significant particularities” of textual resources </li></ul><ul><li>And also an adaptable architecture able to respond to changing needs and priorities </li></ul>
  29. 29. For example: the tablets of Vindolanda
  30. 30. TEI in practice <ul><li>Collaborative scholarly transcription </li></ul><ul><li>Extension and formalization of the pre-existing Leiden conventions as the EpiDoc standard </li></ul>
  31. 31. Digital Humanities
  32. 32. 1995 - ? <ul><li>While we were talking about the theory.... </li></ul><ul><ul><li>The web happened </li></ul></ul><ul><ul><li>digital archives and libraries became necessary </li></ul></ul><ul><ul><li>mass digitization </li></ul></ul><ul><ul><li>Technologically: commodity computing, grid computing </li></ul></ul><ul><ul><li>Socially: folksonomies, cloud computing... </li></ul></ul><ul><li>Convergence and collaboration </li></ul><ul><ul><li>rethinking scholarly editing </li></ul></ul><ul><ul><li>redefining the discipline </li></ul></ul><ul><li>New infrastructures? </li></ul>
  33. 33. The rise of the digital library <ul><li>“Public good” digitization efforts </li></ul><ul><ul><li>From Gallica to the JISC Digitisation Programme </li></ul></ul><ul><li>The metadata challenge </li></ul><ul><ul><li>Authority vs. link-rot </li></ul></ul><ul><ul><li>Dublin Core and OAI-PMH … semanticizing the web </li></ul></ul><ul><ul><li>Can systems be self-organizing? </li></ul></ul><ul><ul><li>Is there wisdom in crowds? </li></ul></ul><ul><li>What is the right business model? </li></ul>
  34. 34. An alternative model <ul><li>Could what works for software also work equally well for digital resources? </li></ul>When programmers can read, redistribute, and modify the source code for a piece of software, the software evolves. People improve it, people adapt it, people fix bugs. When developers can access, redistribute, and enhance the digital resources underlying a digital application, new applications can evolve. People can add value, people can adapt it, people can fix bugs.
  35. 35. Open up the data warehouse!
  36. 36. Digital humanities manifesto 2.0 Digital Humanities is not a unified field but an array of convergent practices that explore a universe in which: a) print is no longer the exclusive or the normative medium in which knowledge is produced and/or disseminated; instead, print finds itself absorbed into new, multimedia configurations; and b) digital tools, techniques, and media have altered the production and dissemination of knowledge in the arts, human and social sciences.
  37. 37. Les manifestes, ça on les connaît... (Manifestos, we know all about those...)
  38. 38. ibid... Digital Humanities implies the multi-purposing and multiple channeling of humanistic knowledge: no channel excludes the other. Its economy is abundance based, not one based upon scarcity... though notions of humanistic research are everywhere under institutional pressure, there is (potentially) plenty for all. And, indeed, there is plenty to do.
  39. 39. The importance of not reading <ul><li>“What can you do with a million books?” (Greg Crane) </li></ul><ul><li>“Although there is still a need for close-reading... we never don't not read” (John Unsworth) </li></ul><ul><li>A new synergy of methods: </li></ul><ul><ul><li>Corpus linguistics </li></ul></ul><ul><ul><li>Pattern recognition </li></ul></ul><ul><ul><li>Data mining </li></ul></ul><ul><ul><li>Visualisation </li></ul></ul>
  40. 40. How to not read <ul><li>We need to find ways of cross-searching, decomposing, and re-composing </li></ul><ul><ul><li>rich XML documents </li></ul></ul><ul><ul><li>complex relational database structures </li></ul></ul><ul><ul><li>simple presentation-focussed websites </li></ul></ul><ul><ul><li>sound, image, video... </li></ul></ul><ul><li>The challenge is to do this in an open and standards-compliant manner </li></ul><ul><li>And on a massive scale </li></ul>
  41. 41. Escaping from the text <ul><li>From footnote to hypertext </li></ul>
  42. 43. conclusions
  43. 44. The changing environment
  44. 45. What's that noise in the digital library? <ul><li>We need to do more with a text than simply read it from beginning to end, or attach annotations to it for others to read, or link it to other digital texts </li></ul><ul><li>We are at last moving beyond the document, towards a distributed world in which “the books in the library can talk to each other” </li></ul>
  45. 46. What makes this possible? <ul><li>It's not rocket science! </li></ul><ul><li>XML markup with a shared semantics (TEI) </li></ul><ul><li>Appropriate use of new technologies (e.g. Unicode, JavaScript) </li></ul><ul><li>A willingness to open up our data </li></ul>
  46. 47. Rethinking the digital edition <ul><li>The insights of critical editing and philology need to be re-discovered and re-applied in a new environment of abundant documents and vanishing texts </li></ul><ul><li>We need a new synthesis of semiotics and hermeneutics to constitute the “uncritical edition” </li></ul><ul><li>We also need traditional virtues of skepticism and empiricism </li></ul>
  47. 48. Components of the “uncritical” edition <ul><li>Manuscript page images </li></ul><ul><li>Annotated transcriptions </li></ul><ul><li>Critical (synthetic) edition/s </li></ul><ul><li>Modern translation and summary </li></ul><ul><li>Paratextual notes, glossary, foreword, bibliography, etc. </li></ul><ul><li>Manuscript descriptions and metadata </li></ul><ul><li>“Factoids” about the real world </li></ul>
  48. 49. The textual trinity <ul><li>Textual descriptions tend to focus on one of: </li></ul><ul><ul><li>its linguistic nature (because texts are made of words used in particular ways) </li></ul></ul><ul><ul><li>its physical state (because texts are made up of glyphs arranged in particular ways) </li></ul></ul><ul><ul><li>its intentions (because texts are supposed to tell us something about the world) </li></ul></ul><ul><li>Likewise, software tends to distinguish </li></ul><ul><ul><li>document management and production systems </li></ul></ul><ul><ul><li>image management and production systems </li></ul></ul><ul><ul><li>database systems </li></ul></ul>
  49. 50. (cf Burnard, 1984)
  50. 51. Convergence <ul><li>But the digital agenda requires us to mash these things up: for example to combine </li></ul><ul><ul><li>a GIS database about places in the Aegean Sea </li></ul></ul><ul><ul><li>a historical gazetteer of place names in the same area </li></ul></ul><ul><ul><li>a corpus of texts mentioning those place names </li></ul></ul><ul><li>(This is why the TEI has recently expanded its scope to support “factoids” as well as their names) </li></ul><ul><li>To do this well requires philological expertise, which cannot readily be automated </li></ul>
  51. 52. A key role for the Humanities <ul><li>We know about textual objects </li></ul><ul><ul><li>how is this discourse represented? </li></ul></ul><ul><ul><li>what stories does it tell? </li></ul></ul><ul><li>We know about hermeneutics </li></ul><ul><ul><li>what does this discourse mean? </li></ul></ul><ul><ul><li>what does it say – aside from its denotational content? </li></ul></ul><ul><li>This is our contribution to the semantic web </li></ul>
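
The mash-up described in the Convergence slide (a corpus of texts, a gazetteer, and coordinates for the same places) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the slides: the XML fragment is a simplified, namespace-free imitation of TEI `placeName` markup, the gazetteer ids and coordinates are illustrative values, and `place_mentions` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-like fragment: no namespace, invented ids.
TEXT = """
<text>
  <p>We sailed from <placeName ref="#delos">Delos</placeName> to
     <placeName ref="#naxos">Naxos</placeName>, and later returned to
     <placeName ref="#delos">the sacred island</placeName>.</p>
</text>
"""

# Hypothetical gazetteer: id -> (preferred name, latitude, longitude).
# The coordinates are rough, illustrative values only.
GAZETTEER = {
    "delos": ("Delos", 37.39, 25.27),
    "naxos": ("Naxos", 37.10, 25.38),
}


def place_mentions(xml_source, gazetteer):
    """Return (surface form, gazetteer id, gazetteer entry or None) per mention."""
    root = ET.fromstring(xml_source)
    mentions = []
    for element in root.iter("placeName"):
        # The ref attribute points at a gazetteer id, written as "#id".
        key = element.get("ref", "").lstrip("#")
        surface = "".join(element.itertext()).strip()
        mentions.append((surface, key, gazetteer.get(key)))
    return mentions


if __name__ == "__main__":
    for surface, key, entry in place_mentions(TEXT, GAZETTEER):
        print(surface, "->", key, entry)
```

The join itself is trivial once the text carries explicit pointers. Deciding that "the sacred island" refers to Delos is the philological step the slide says cannot readily be automated, and here it is simply assumed to have been done by the encoder.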