Lecture 2: From Texts to eTexts: Thematic Research Collections and Text Encoding
Emma Clarke & Tomás Ó Murchú
Theory and Practice of Digital Humanities
MPhil Digital Humanities
PART 1: THEMATIC RESEARCH COLLECTIONS
Why Thematic Research Collections?
• Libraries as laboratories (Palmer). An exaggeration?
• Limitations of scattered content.
• Definition: digital aggregations of primary sources and related materials that support research on a theme (Palmer).
• TRCs are getting closer to the laboratory ideal: source material, tools and expertise brought together to advance the production of new knowledge.
THEMATIC RESEARCH COLLECTIONS
Many shapes and sizes…
May contain manuscripts, images, commentary, audio, letters, translations, versions, etc.
DIFFERENCES BETWEEN THEMATIC RESEARCH COLLECTIONS AND DIGITAL LIBRARIES AND ARCHIVES
Digital Libraries/Archives & TRCs
• Digital libraries and archives differ in mission and method.
• Library collections are amassed for preservation, dispensing, bibliographic, and symbolic purposes.
• Digital libraries have diverse collections.
• Perseus Collection: a digital archive.
• Bolles Collection on the History of London: a TRC within a digital archive (the Perseus Collection).
www.perseus.tufts.edu/ or perseus.mpiwg-berlin.mpg.de/
CHARACTERISTICS OF THEMATIC RESEARCH COLLECTIONS
John Unsworth (2000)
1. Necessarily electronic (because of the cost of items 2, 3 and 8)
2. Constituted of heterogeneous datatypes (multimedia)
3. Extensive but thematically coherent
4. Structured but open-ended
5. Designed to support research
6. Authored or multi-authored
7. Interdisciplinary
8. Collections of digital primary resources (and they are themselves second-generation digital resources)
CHARACTERISTICS OF THEMATIC RESEARCH COLLECTIONS
Palmer (2004)

                            Content          Function
Basic elements              Digital          Research support
                            Thematic
Variable characteristics    Coherent         Scholarly contribution
                            Heterogeneous    Contextual mass
                            Structured       Interdisciplinary platform
                            Open-ended       Activity support
CHARACTERISTICS OF THEMATIC RESEARCH COLLECTIONS
Two Basic Elements of a TRC
Digital: The collection is in digital format even though the sources may exist as manuscripts, images, etc.
Thematic: Contents are focused on particular research themes.
• Author-oriented: Walt Whitman Archive, Thomas MacGreevy Archive
• Historical event/period: Salem Witch Trials Archive, 1641 Depositions, September 11 Digital Archive
• Specific focused theme: Hamlet on the Ramparts
CHARACTERISTICS OF THEMATIC RESEARCH COLLECTIONS
Variable Characteristics
Coherent: A coherent set of primary resources that relate directly to the theme.
Heterogeneous: Manuscripts, letters, critical essays, reviews, biographies, bibliographies.
Structured: Permits searches and analysis. Interrelated groups are structured together: images together, letters together, etc.
Open-ended: Potential to grow and change. New sources are added and existing ones improved, along with annotations, links, etc. Example: the September 11 Digital Archive.
CONTENT DECISIONS IN TRCS
What goes into the TRC?
In both physical and digital libraries, materials are usually separated for reasons unimportant to a researcher. For example, primary texts may be part of a special collection, while secondary works may be in separate book and journal collections.
A TRC has a mix of heterogeneous but closely associated materials.
For example, the Digital Dante Archive: http://dante.ilt.columbia.edu/
CONTENT DECISIONS IN TRCS
The Interdisciplinary Nature of TRCs
TRCs usually contain resources from different fields within the humanities.
For example, the Thomas MacGreevy Archive aims to promote inquiry into the interconnections between literature, culture, history, and politics by blurring the boundaries that separate the different fields of study.
http://www.macgreevy.org
PROBLEMS FOR TRCS
1. TRCs contain their own digital primary resources rather than basing their work on digital primary resources produced by libraries or publishers. The reasons include issues with permissions and copyright, and with the ability to edit, intervene in, comment on, and contextualize materials produced and controlled by others.
2. Libraries are unwilling to collect scholars' "second-generation" digital publications so that they can become someone else's digital primary resources.
PROBLEMS FOR TRCS
3. "Do-it-yourselfism". Each scholar or team builds their own digital library (and acts as their own publisher). This leads to wasted and duplicated effort, loss of materials, and loss of confidence in digital scholarship because, most importantly, it produces a more or less immediate breakdown in referential integrity.
4. The marketing, design, and editorial skills and services of publishers are not connecting with born-digital scholarly publications: editorial standards are not always what they should be, documentation is sometimes sloppy, problems of rights and permissions are frequently ignored, etc.
PROBLEMS FOR TRCS
5. The genre of the thematic research collection is largely developing outside of publishing institutions. As a consequence, publishers seem of questionable relevance to it.
6. Publishers have historically been the conduit connecting authors to libraries, but that connection is not being made for thematic research collections. As a consequence, publications of this sort are not making their way into library collections.
TEXT ENCODING INITIATIVE
• TEI Consortium
• TEI Guidelines
• Website: TEI
WHY ENCODE?
An encoded text is more organised and searchable than a scan.
It contains more information than a transcript:
• page layout
• line breaks
• material qualities
• physical properties
• other metadata
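To illustrate the searchability point, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The sentence and the names in it are invented for illustration; persName and placeName are TEI element names, but the fragment is a bare snippet rather than a complete or validated TEI document.

```python
import xml.etree.ElementTree as ET

# Invented sentence; only the element names come from TEI.
ENCODED = (
    "<p><persName>Anna</persName> wrote to <persName>Ben</persName> "
    "from <placeName>Cork</placeName>.</p>"
)

paragraph = ET.fromstring(ENCODED)

# Because names are tagged, they can be retrieved by category
# instead of being guessed at with string matching.
people = [el.text for el in paragraph.iter("persName")]
places = [el.text for el in paragraph.iter("placeName")]

# Stripping the markup gives back what a plain transcript would hold.
plain_transcript = "".join(paragraph.itertext())

print(people)
print(places)
print(plain_transcript)
```

The plain transcript still contains the words, but it no longer records which of them are people and which are places; that distinction lives only in the markup.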
MARKUP LANGUAGES
By markup language we mean a set of markup conventions used together for encoding texts.
A markup language must specify:
• what markup is allowed,
• what markup is required,
• how markup is to be distinguished from text,
• and what the markup means.
"Markup is an act of interpretation" (Cummings)
Following examples from University of Michigan Library
WHY XML? WHY NOT HTML?
In the TEI's view, three characteristics distinguish XML from other markup languages:
• emphasis on descriptive rather than procedural markup;
• the document type concept;
• independence of any one hardware or software system.
Compared with HTML, XML has some other important characteristics:
• it is extensible (customisable): it does not contain a fixed set of tags;
• its documents must be well-formed according to a defined syntax, and may be formally validated;
• it focuses on the meaning of data, not its presentation.
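The well-formedness requirement can be seen with a small sketch, again assuming Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and both sample strings are invented for illustration. Note that this checks well-formedness only; validation against a schema is a separate step that the standard library parser does not perform.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML-style habit: <br> is opened but never closed.
html_style = "<p>First line<br>Second line</p>"

# XML style: the empty element is explicitly closed with "/>".
xml_style = "<p>First line<lb/>Second line</p>"

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```

A browser would tolerate the first string, whereas an XML parser rejects it outright. That strictness is part of what lets XML documents be processed reliably by different software.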
TEI GUIDELINES
Official title: Guidelines for Electronic Text Encoding and Interchange
• A continually revised set of proposals suggesting methods for text encoding.
• The Guidelines describe the principles that should be used when marking up texts.
• They will evolve and inevitably change, but overall they will stay true to the initial design goals:
INITIAL DESIGN GOALS OF TEI GUIDELINES
1. suffice to represent the textual features needed for research
2. be simple, clear, and concrete
3. be easy for researchers to use without special-purpose software
4. allow the rigorous definition and efficient processing of texts
5. provide for user-defined extensions
6. conform to existing and emergent standards
THE GUIDELINES
• Apply to texts in any natural language, of any date, in any literary genre or text type, without restriction on form or content.
• Are customisable.
Examples of document content (tags):
• Textual elements: titles, paragraphs, headings, dedications
• Non-textual elements: graphics, illustrations, cover, binding material, line breaks
• Metadata: publication dates, prices, page counts, history
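As a rough sketch of how the three kinds of content sit together in one file, the following Python snippet parses a tiny TEI-style document with the standard library. The document text is invented, and it has not been validated against the TEI schema; the element names (teiHeader, titleStmt, head, p, lb, date) and the TEI namespace are taken from the TEI Guidelines.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: metadata in the header, text in the body.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A hypothetical diary entry</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <head>Monday</head>
      <p>Written on <date when="1900-01-01">1 January 1900</date>,<lb/>in an invented hand.</p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Metadata: the title recorded in the header.
title = root.find(".//tei:titleStmt/tei:title", TEI_NS).text

# Textual elements: the heading and paragraphs of the body.
heading = root.find(".//tei:body/tei:head", TEI_NS).text
paragraphs = root.findall(".//tei:body/tei:p", TEI_NS)

# Non-textual feature: line breaks recorded as empty <lb/> elements.
line_breaks = root.findall(".//tei:lb", TEI_NS)

# A machine-readable date stored in an attribute.
date_when = root.find(".//tei:date", TEI_NS).get("when")

print(title, heading, len(paragraphs), len(line_breaks), date_when)
```

The when attribute shows why encoding can say more than a transcript: the date is readable by a person as "1 January 1900" and by software as 1900-01-01.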
CRITICISM (?) OF TEI
• If marking up texts is "an act of interpretation", then it is one person's, or one group's, interpretation of what counts as important information.
• By marking up documents and creating online scholarly editions, we are using historical texts and documents in ways their creators never intended.
• "Because (TEI) … treats the humanities corpus … as informational structures, it ipso facto violates some of the most basic reading practices of the humanities community, scholarly as well as popular." (McGann 2001: 139)
TEI PROJECTS
A Family At War: The Diary of Mary Martin, 1 January – 25 May 1916
Written in letter format to her son Charlie, who went missing in action during WW1, the diary chronicles the daily activities of Mary, her family, friends and relatives.
Diary of Mary Martin site
TEI PROJECTS
Autour d'une séquence et des notes du Cahier 46: enjeu du codage dans les brouillons de Proust
(Around a sequence and some notes of Notebook 46: encoding issues in Proust's drafts)
Proust Prototype