Bibliotheca Digitalis Summer school: Beyond the Page: enriching the digital library - Lou Burnard

Bibliotheca Digitalis. Reconstitution of Early Modern Cultural Networks. From Primary Source to Data.
DARIAH / Biblissima Summer School, 4-8 July 2017, Le Mans, France.

1st day, July 4th – Digital sources: theoretical fundamentals.

Beyond the Page: enriching the digital library.
Lou Burnard – Co-founder of TEI, Oxford.
Abstract: https://bvh.hypotheses.org/3294#conf-JYRamel

Published in: Data & Analytics

  1. 1. Bibliotheca Digitalis Reconstitution of Early Modern Cultural Networks From Primary Source to Data DARIAH / Biblissima Summer School Le Mans, 4-8 July 2017 Beyond the Page: enriching the digital library Lou Burnard 1st day, July 4th – Digital sources: theoretical fundamentals
  2. 2. Beyond the Page : enriching the digital library Lou Burnard 1/32
  3. 3. 2/32
  4. 4. The Textual Trinity A document can be described in terms of... its physical state (because texts are made up of glyphs arranged in particular ways) its linguistic nature (because texts are made of words used in particular ways) its intentions (because texts are supposed to tell us something about the world) (Burnard 1987, Burnard 1989, Burnard & Greenstein 1994) 3/32
  5. 5. (Or maybe it’s more than a trinity) 4/32
  6. 6. Software families Existing software systems tend to specialize ... document management and production systems image management and production systems linguistic analysis and management database systems 5/32
  7. 7. Convergence But convergence is now on everyone’s digital agenda. When you make a mashup combining a GIS database about places in the Aegean sea, a historical gazetteer of placenames in the same area, and a corpus of texts mentioning those placenames, you need to combine the strengths of a database with tools for linguistic analysis, and with tools for rendering spatial information. A few examples: https://pleiades.stoa.org/places/109236 http://www.mappingpaintings.org https://mapoflondon.uvic.ca/map.htm 6/32
  8. 8. The problem Today’s digital library applications still focus on serving up virtual pages for the reader: the metaphor of the book is so pervasive that we can barely see it. Self-evidently, digitization makes it possible to offer cheaper and more accessible simulations of printed or written pages. But this is not enough... digital texts should aim to go ‘beyond the page’ 7/32
  9. 9. What use is a digital text ? Digital applications enable us to do more with a text, and especially with a collection of texts! more than simply read it from beginning to end more than attach annotations to it for others to read, more than perform brute-force “text mining” on it. The content of the digital library must therefore be enriched, even if this requires the use of techniques which are not currently automatable. 8/32
  10. 10. What’s that noise in the digital library? A digital edition should capture the intentions and meaning of a text, not simply its appearance Otherwise, there can be no analysis beyond the documentary level, no ‘conversation between books’ 9/32
  11. 11. Enrichment or Representation? When we go from this... ... to this, what is happening? 10/32
  12. 12. Editing It’s customary to distinguish (at least) these types or levels of interpretation: paleographic level : identifying the characters and other graphemic components documentary or diplomatic level : determining what was originally written editorial or semantic level : determining how it ought to be read Digitization provides an opportunity to make each step explicit, complex, and reversible 11/32
  13. 13. The hermeneutic circle of digital enrichment 12/32
  14. 14. Enrichment Adding markup to a document determines how it can be processed. It can concern many different aspects : the presentation of the document – its use of writing styles or typefaces, its rendering and layout the rhetorical organization of a document – its sections and subsections, its paragraphs and lists and headings and footnotes metatextual aspects of the document – its corrections and additions and deletions and errors and lacunae linguistic properties of a document – its syntax and morphology and semantics the document as an object – information about its origins and custodial history, its transmission and reception, its social function and category... and many others. 13/32
  15. 15. Let’s focus on just one aspect: the treatment of names occurring in a document. 14/32
  16. 16. Some background theory Reference is a fundamental semiotic concept. Natural languages often distinguish words associated with abstract concepts from words associated with (concepts concerning) specific objects. Proper names, technical terms, etc. behave differently from other kinds of word and often have a different linguistic status: they do not appear in lexicons, and they are often ‘non-translatable’. What distinguishes them is chiefly their association with real (or fictive) entities. ‘king’ is a noun with no particular referent; ‘Martin Luther King’ refers to a specific person, as does (in context) ‘the king’. Likewise with places: ‘city’ refers to a type of place, not a particular one; ‘City of London’ refers to a particular place, as does (in context) ‘the city’. 15/32
  17. 17. Named entity recognition is a multi-stage operation: (1) decide which input strings reference named entities; (2) decide which particular entities are intended; (3) (optionally) assemble and associate other information about each referenced entity. Only the first of these is (more or less) automatable, despite decades of research. 16/32
  18. 18. The NLP (MUC) ‘Named Entity Recognition’ paradigm input strings are linguistically analysed (parsed, morphologically analysed, etc.) for candidate tokens candidates are resolved and disambiguated using a (pre-existing) ‘knowledge base’ such as Wikipedia data mining and language modelling systems work similarly, though the knowledge base may be less structured The real challenge is to build the knowledge base ... 17/32
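The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `KNOWLEDGE_BASE` and `ENTITY_INFO` tables, the identifiers in them, and the crude capitalisation-based candidate detector are all hypothetical stand-ins, not the method of any particular NLP toolkit.

```python
import re

# Hypothetical, hand-made knowledge base: surface form -> entity identifier.
KNOWLEDGE_BASE = {
    "Clara Schumann": "CS",
    "Clara Wieck": "CS",
    "Leipzig": "LEIPZIG",
}

# Hypothetical extra information about each entity (used in stage 3).
ENTITY_INFO = {
    "CS": {"type": "person", "birth": "1819-09-13"},
    "LEIPZIG": {"type": "place"},
}

# Stage 1: a crude candidate detector, matching runs of capitalised words.
CANDIDATE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def recognise(text):
    """Return (surface string, entity id or None, entity info or None) triples."""
    results = []
    for match in CANDIDATE.finditer(text):
        surface = match.group(0)
        entity_id = KNOWLEDGE_BASE.get(surface)  # stage 2: resolve the candidate
        info = ENTITY_INFO.get(entity_id)        # stage 3: attach known properties
        results.append((surface, entity_id, info))
    return results
```

Calling `recognise("Clara Wieck was born in Leipzig.")` yields two resolved candidates, whereas a string such as "The king" produces the candidate "The" with no entity attached. The detector is the easy part; everything useful depends on what the knowledge base contains.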
  19. 19. Kinds of entity persons, historical or fictional : ‘Lou Burnard’, ‘Harry Potter’, ‘Pseudo-Dionysius the Areopagite’ named places, of any kind ‘Le Mans’, ‘Atlantis’, ‘Prussia’, ‘the Eiffel Tower’ named groupings of people ‘The Drones’, ‘Gallimard’, ‘the Thracians’ Physical objects, works of art etc. ‘the Alfred Jewel’, ‘Excalibur’, ‘the Mona Lisa’ etc. (Are animals objects or people?) 18/32
  20. 20. Entity properties What might you want to know about an entity? Some things are obvious, but the list is in principle unbounded: the various names associated with them at different times their chronology (birth, death, creation etc.) their composition, dimensions, classifications, etc. their associations with other entities identifiers used for them in standard authority control lists The last is particularly important for work in the LOD paradigm. 19/32
  21. 21. Kinds of entity reference TEI provides several elements for the markup of names and nominal expressions: <rs> (‘referring string’) – any phrase which refers to a person or place, e.g. ‘the girl you mentioned’, ‘10 miles Northeast of Attica’ ... <name> – any lexical item recognized as a proper name, e.g. ‘Budleigh Salterton’, ‘Bouallebec’, ‘John Doe’ ... <persName>, <placeName>, <orgName>: specific types of name: ‘syntactic sugar’ for <name type="person"> etc. There is also a rich set of proposals for the components of such elements. A project must decide which approach best suits its needs. 20/32
  22. 22. Nominal expressions often have internal structure; are sometimes ambiguous (same referent, different target); and are often multiform (different referent, same target). TEI XML markup can help... 21/32
  23. 23. Components of personal names <persName xml:lang="de"> <forename type="first">Johann</forename> <forename type="middle">Sebastian</forename> <surname>Bach</surname> </persName> <persName xml:lang="fr"> <forename type="composé">Jean-Sébastien</forename> <surname>Bach</surname> </persName> Not to mention... <roleName> (‘Emperor’, ‘conseiller’), <genName> (‘the Elder’) <addName> (‘Hammer of the Scots’), <nameLink> (‘van der’) ... 22/32
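Once the components of a name are tagged, they can be processed separately. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; it assumes, as on the slide, that the elements carry no TEI namespace, and the helper names `name_parts` and `sort_key` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

SAMPLE = """<persName xml:lang="de">
  <forename type="first">Johann</forename>
  <forename type="middle">Sebastian</forename>
  <surname>Bach</surname>
</persName>"""


def name_parts(pers_name):
    """Return (language, list of forenames, surname) for one <persName>."""
    lang = pers_name.get(XML_NS + "lang")
    forenames = [f.text for f in pers_name.findall("forename")]
    surname = pers_name.findtext("surname")
    return lang, forenames, surname


def sort_key(pers_name):
    """Build an index-style form such as 'Surname, Forename Forename'."""
    _, forenames, surname = name_parts(pers_name)
    return f"{surname}, {' '.join(forenames)}"
```

For the German example, `sort_key(ET.fromstring(SAMPLE))` gives an inverted form suitable for an index of names, something that cannot be derived reliably from the untagged string alone.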
  24. 24. Components of place names names of a specific geo-political type (<district>, <settlement>, <region>, <country>, <bloc>): <placeName> <district>6ème arr.</district> <settlement type="city">Paris, </settlement> <country>France</country> </placeName> names of geographical features such as mountains or rivers, and terms for such features (<geogName> and <geogFeat>): <placeName> <geogFeat>Mont</geogFeat> <geogName>Blanc</geogName> </placeName> a relational expression: <rs type="place"> <measure>10 miles</measure> <offset>Northeast of</offset> <settlement>Attica</settlement> </rs> 23/32
  25. 25. Resolving referents Within a single language, in a single document, the same person is referred to in different ways: <persName>Clara Schumann</persName> .... <persName>Clara</persName> .... <persName>Frau Schumann</persName> The @ref attribute can be used to show that these are all references to the same person: <persName ref="#CS">Clara Schumann</persName> .... <persName ref="#CS">Clara</persName> .... <persName>Clara Wieck</persName> ... <persName ref="#CS">Frau Schumann</persName> 24/32
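With @ref in place, collecting every form under which a person appears becomes a simple grouping operation. A minimal sketch, assuming un-namespaced elements as on the slide and a hypothetical wrapping `<p>` element added only to make the fragment well-formed:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The <p> wrapper is an assumption; the slide shows only the <persName> elements.
SAMPLE = """<p>
  <persName ref="#CS">Clara Schumann</persName> ...
  <persName ref="#CS">Clara</persName> ...
  <persName>Clara Wieck</persName> ...
  <persName ref="#CS">Frau Schumann</persName>
</p>"""


def group_by_ref(root):
    """Map each @ref value (or None when absent) to the name forms using it."""
    groups = defaultdict(list)
    for name in root.iter("persName"):
        groups[name.get("ref")].append(name.text)
    return dict(groups)
```

Here the three references carrying `ref="#CS"` are grouped together, while the untagged "Clara Wieck" ends up under `None`: without @ref, nothing tells the software that she is the same person.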
  26. 26. Associating reference and entity the value of @ref can be any form of URI, pointing to a place where there is more information about this entity, provided locally or externally <persName ref="https://en.wikipedia.org/wiki/Clara_Schumann"> Clara Schumann</persName> <persName ref="#CS">Clara Schumann</persName> <persName ref="myBib:CS">Clara Schumann</persName> All we want to say about CS can be provided using a <person> element somewhere <person xml:id="CS"> <persName notAfter="1840-09-12">Clara Wieck</persName> <birth when="1819-09-13"> <placeName>Leipzig</placeName> </birth> <ref type="VIAF" target="http://viaf.org/viaf/44499359"/> <idno type="ISNI">ISN:0000000121305653</idno> <!--etc --> </person> 25/32
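Following a local pointer from a name to its `<person>` record might look like the sketch below. It handles only the `#id` form; external URIs and prefixed pointers such as `myBib:CS` are deliberately left unresolved. The `<doc>` wrapper and the function name `resolve` are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <doc> wrapper is hypothetical; the content follows the slide.
SAMPLE = """<doc>
  <p><persName ref="#CS">Clara Schumann</persName></p>
  <person xml:id="CS">
    <persName notAfter="1840-09-12">Clara Wieck</persName>
    <birth when="1819-09-13"><placeName>Leipzig</placeName></birth>
  </person>
</doc>"""


def resolve(root, ref):
    """Return the element targeted by a local '#id' pointer, else None."""
    if not ref.startswith("#"):
        return None  # external URI or prefixed pointer: not handled in this sketch
    target = ref[1:]
    for element in root.iter():
        if element.get(XML_ID) == target:
            return element
    return None
```

Given the reference `#CS`, `resolve` returns the `<person>` element, from which the birth date and birthplace can then be read; the name in the text and the data about the person stay separate but linked.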
  27. 27. Resolving ambiguity Person or place? <s>Jean likes <name>Nancy</name> </s> We could clarify this by using a more precise tag (<persName> or <placeName>) rather than <name>. Or we could resolve it by supplying the appropriate target for the @ref attribute on <name>: <s>Jean likes <name ref="#PLACE123">Nancy</name> </s> <!-- ... --> <person xml:id="PERS123"> <persName> <forename>Nancy</forename> <surname>Ide</surname> </persName> <!-- ... --> </person> <place xml:id="PLACE123"> <placeName notBefore="1400">Nancy</placeName> <placeName notAfter="0056">Nantium</placeName> <!-- ... --> </place> 26/32
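The second approach means software can decide "person or place?" by looking at what kind of element the pointer targets. A minimal sketch, again assuming un-namespaced elements, a hypothetical `<doc>` wrapper, and an illustrative function name `entity_kinds`:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <doc> wrapper is hypothetical; the content follows the slide.
SAMPLE = """<doc>
  <s>Jean likes <name ref="#PLACE123">Nancy</name></s>
  <person xml:id="PERS123">
    <persName><forename>Nancy</forename><surname>Ide</surname></persName>
  </person>
  <place xml:id="PLACE123">
    <placeName notBefore="1400">Nancy</placeName>
    <placeName notAfter="0056">Nantium</placeName>
  </place>
</doc>"""


def entity_kinds(root):
    """Pair each <name> with the tag of the element its @ref points to."""
    # Index every identified element by its xml:id.
    index = {el.get(XML_ID): el.tag for el in root.iter() if el.get(XML_ID)}
    return [
        (name.text, index.get(name.get("ref", "").lstrip("#")))
        for name in root.iter("name")
    ]
```

For the sample, the single `<name>` is reported as a place. Changing the pointer to `#PERS123` would make the same string a person, without touching the text itself.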
  28. 28. Data vs. Text TEI distinguishes names from things. The assumption is that names are found in source texts, whereas things exist in the real world, and are described by additional data. Data can take a semi-textual form structured in XML, though it need not do so. ‘Text is not a special type of data; data is a special type of text.’ 27/32
  29. 29. For example Extract from Histoire Chronologique de la Chancelerie de France..., p. 5: personal names (Odolric, Adalric, Gezon, Lothaire, Adaleron, Arnoul) ...; names of social positions (Grand Chancelier, Secretaire, Roi...); a nickname (‘dit Le Faineant’); titles of other sources (pour la donation de l’Abbaie de Bonneval, Antiquitez de Troyes); explicit quotation (‘Sinum Lotarii gloriosissimi Regis... ’). The formatting helps... but only a bit: we need to make these things explicit. 28/32
  30. 30. Another example: Paris, BnF, ms. français 16753 First page of Registres de permis d’imprimer... 29/32
  31. 31. One possible encoding... This seems to be text as data... 30/32
  32. 32. .... continued ... and this seems to be data as text... 31/32
  33. 33. Tentative conclusions, intended to provoke debate: reading a text involves identifying and understanding its data; reading many texts at a distance contributes to, but does not replace, an understanding of the data they represent; data is itself a kind of text, requiring the same nuanced interpretive judgment. 32/32
