Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Presentation of the paper "Cataloguing for a Billion Word Library of Greek and Latin" by Gregory Crane, Bridget Almas, Alison Babeu, Lisa Cerrato, Anna Krohn, Frederik Baumgardt, Monica Berti, Greta Franzini and Simona Stoyanova at DATeCH 2014. #digidays

  1. 1. How do you catalog a billion word library? Bridget Almas (1), Alison Babeu (1), Frederik Baumgardt (2), Lisa Cerrato (1), Gregory Crane (1, 2), Greta Franzini (2), Anna Krohn (1), Simona Stoyanova (2). Affiliations: (1) Perseus Digital Library, Tufts University; (2) Open Philology Project, University of Leipzig
  2. 2. Major points 1. We are interested in the logical structures within and across physical books: Text Groups (e.g., Author Y, Papyri from X); Works (e.g., Vergil’s Aeneid); individual words (e.g., Arma virumque cano).
  3. 3. Major points 2. From a pragmatic perspective, we only need one version of a logical work (e.g., Tacitus’ Annales). We can use that marked-up version as a query that we match against very large and very error-filled corpora.
  4. 4. Major points 3. A text collection can serve as a catalog, with all other versions of the texts in that collection (including translations, shorter quotations, and alternate editions) represented as annotations on that collection.
  5. 5. Adding markup for a citation scheme <div1 type="Book" n="1"> <milestone ed="p" n="1" unit="card"/> <l n="1">Arma virumque cano, Troiae qui primus ab oris</l> <l n="2">Italiam, fato profugus, Laviniaque venit</l> <l n="3">litora, multum ille et terris iactatus et alto</l>
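
As a rough illustration of how the markup on slide 5 yields a citation scheme, the sketch below reads the book number from `div1` and the line numbers from `l`, then builds "book.line" citations such as 1.1. It assumes the fragment is closed with `</div1>` (the slide shows only the opening lines), and `citation_index` is a hypothetical helper name, not part of any Perseus tooling.

```python
import xml.etree.ElementTree as ET

# The fragment from slide 5. The closing </div1> is an addition so that the
# snippet parses; the slide itself shows only the first three lines.
TEI_FRAGMENT = """
<div1 type="Book" n="1">
  <milestone ed="p" n="1" unit="card"/>
  <l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
  <l n="2">Italiam, fato profugus, Laviniaque venit</l>
  <l n="3">litora, multum ille et terris iactatus et alto</l>
</div1>
"""


def citation_index(xml_text):
    """Map 'book.line' citations to line text (hypothetical helper)."""
    book = ET.fromstring(xml_text)
    book_n = book.get("n")
    return {
        f"{book_n}.{line.get('n')}": "".join(line.itertext()).strip()
        for line in book.iter("l")
    }


index = citation_index(TEI_FRAGMENT)
print(index["1.1"])  # Arma virumque cano, Troiae qui primus ab oris
```

The `milestone` element is ignored here because it marks a display unit ("card") rather than a level of the book/line citation hierarchy.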
  6. 6. Our ability to align texts is what makes our approach possible -- a single version of Goethe’s Faust allows us to organize thousands of editions.
  7. 7. Canonical Text Services URNs urn:cts:greekLit:tlg0284.tlg052.perseus-grc1 urn:cts:latinLit:phi0474.phi052.opp-lat1 urn:cts:latinLit:stoa0255.stoa004
  8. 8. Canonical Text Services URNs These URNs allow us to represent any particular word in any version of any text -- they allow us to represent our textual data (including annotations) as a very large RDF graph. It's not a million book library but a billion word data set.
  9. 9. Canonical Text Services URNs urn:cts:greekLit:tlg0284.tlg052.perseus-grc1 urn:cts:latinLit:phi0474.phi052.opp-lat1 urn:cts:latinLit:stoa0255.stoa004 urn:cts = Canonical Text Services name space
  10. 10. Canonical Text Services URNs urn:cts:greekLit:tlg0284.tlg052.perseus-grc1 urn:cts:latinLit:phi0474.phi052.opp-lat1 urn:cts:latinLit:stoa0255.stoa004 greekLit = Greek literature; latinLit = Latin literature
  11. 11. Canonical Text Services URNs urn:cts:greekLit:tlg0284.tlg052.perseus-grc1 urn:cts:latinLit:phi0474.phi052.opp-lat1 urn:cts:latinLit:stoa0255.stoa004 TextGroup = tlg0284. Following the Thesaurus Linguae Graecae, we assign 284 to Aelius Aristides
  12. 12. Canonical Text Services URNs urn:cts:greekLit:tlg0284.tlg052.perseus-grc1 urn:cts:latinLit:phi0474.phi052.opp-lat1 urn:cts:latinLit:stoa0255.stoa004 A TextGroup can define any useful collection: * inscriptions from Ephesus * the Homeric Hymns
  13. 13. Canonical Text Services URNs urn:cts:greekLit:tlg0284.tlg052.perseus-grc1 urn:cts:latinLit:phi0474.phi052.opp-lat1 urn:cts:latinLit:stoa0255.stoa004 FRBR (Functional Requirements for Bibliographic Records) Works: tlg0284.tlg052 designates the Embassy of Achilles by Aelius Aristides
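
The URN anatomy walked through on slides 9 to 13 can be sketched as a small parser. This is a minimal illustration, not the CTS reference implementation: it assumes the colon-separated layout `urn:cts:<namespace>:<textgroup>.<work>.<version>`, handles only those three dotted components plus an optional trailing passage, and the names `CtsUrn` and `parse_cts_urn` are hypothetical.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Components of a CTS URN as labelled on the slides (hypothetical type)."""
    namespace: str           # e.g. greekLit, latinLit
    text_group: str          # e.g. tlg0284 (Aelius Aristides)
    work: Optional[str]      # e.g. tlg052
    version: Optional[str]   # e.g. perseus-grc1; absent for a bare work URN
    passage: Optional[str]   # optional trailing passage reference


def parse_cts_urn(urn):
    """Split a CTS URN into its parts; a sketch under the stated assumptions."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    hierarchy = parts[3].split(".")
    if not hierarchy[0] or len(hierarchy) > 3:
        raise ValueError(f"unsupported work hierarchy: {parts[3]!r}")
    # Pad so that missing work/version components become None.
    text_group, work, version = (hierarchy + [None, None])[:3]
    passage = parts[4] if len(parts) > 4 and parts[4] else None
    return CtsUrn(parts[2], text_group, work, version, passage)


for example in (
    "urn:cts:greekLit:tlg0284.tlg052.perseus-grc1",
    "urn:cts:latinLit:phi0474.phi052.opp-lat1",
    "urn:cts:latinLit:stoa0255.stoa004",
):
    print(parse_cts_urn(example))
```

The third example has no version component, so it identifies the work in the FRBR sense rather than one particular edition.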
  14. 14. Representing different versions
      Citation | OCT         | Operation | Loeb
      1.41     | confidere 1 | same      | confidere 1
      1.41     | propediem 1 | sub       | prope 1
      1.41     |             | insert    | diem 1
      1.41     | ipsum 1     | same      | ipsum 1
      1.41     | eos 1       | same      | eos 1
  15. 15. Representing different versions (OCT/Loeb table as on slide 14) We can pragmatically represent the differences between our reference text and all other versions
  16. 16. Representing different versions (OCT/Loeb table as on slide 14) The reference text does not have to be the best text -- it does not even have to be perfect. It organizes all other texts, even with noise.
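
One way to produce rows like those in the OCT/Loeb comparison is a word-level diff of the reference text against another version. The sketch below uses Python's standard `difflib.SequenceMatcher`, which is a stand-in technique and not necessarily the alignment method the authors use. The word lists are taken from the table rows in order; the "delete" label is an assumption added for completeness (the slides show only same, sub, and insert), and `word_diff` is a hypothetical name.

```python
from difflib import SequenceMatcher


def word_diff(reference, other):
    """Describe `other` as edits against `reference`, one row per word."""
    rows = []
    matcher = SequenceMatcher(a=reference, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_words, other_words = reference[i1:i2], other[j1:j2]
        if tag == "equal":
            rows += [(w, "same", w) for w in ref_words]
            continue
        # Pair words off as substitutions; any surplus on either side
        # becomes a delete (reference only) or an insert (other only).
        paired = min(len(ref_words), len(other_words))
        rows += [(r, "sub", o) for r, o in zip(ref_words, other_words)]
        rows += [(r, "delete", None) for r in ref_words[paired:]]
        rows += [(None, "insert", o) for o in other_words[paired:]]
    return rows


# Words from the 1.41 rows of the table, in row order.
oct_words = ["confidere", "propediem", "ipsum", "eos"]
loeb_words = ["confidere", "prope", "diem", "ipsum", "eos"]

for ref_word, operation, other_word in word_diff(oct_words, loeb_words):
    print(ref_word, operation, other_word)
```

Because every other version is stored only as edits against the reference, the reference text needs to be stable rather than perfect, which is the point made on slide 16.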
  17. 17. Conclusions We are developing the Perseus Corpus of Greek Texts (c. 20 million words of Greek and Latin) * Based on texts in Perseus * FRBR metadata from the Perseus Catalog * Revised XML brought in line with CTS and with the EpiDoc subset of TEI XML * Offers an extended “TEI by example”
  18. 18. Conclusions We are preparing for a Leipzig Corpus * This would be a superset of the Perseus Corpus * Ideally much larger * Initial work will include an additional 20 million words of primarily later Greek and Latin
