Presentation of the paper "Cataloguing for a Billion Word Library of Greek and Latin" by Gregory Crane, Bridget Almas, Alison Babeu, Lisa Cerrato, Anna Krohn, Frederik Baumgardt, Monica Berti, Greta Franzini and Simona Stoyanova at DATeCH 2014. #digidays
Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
1. How do you catalog a billion word library?
Bridget Almas (1), Alison Babeu (1), Frederik Baumgardt (2), Lisa Cerrato (1), Gregory Crane (1, 2), Greta Franzini (2), Anna Krohn (1), Simona Stoyanova (2)
(1) Perseus Digital Library, Tufts University
(2) Open Philology Project, University of Leipzig
2. Major points
1. We are interested in the logical structures within/across physical books:
Text Groups, e.g., Author Y, Papyri from X
Works, e.g., Vergil’s Aeneid
Individual words, e.g., Arma virumque cano
3. Major points
2. From a pragmatic perspective, we only need one version of a logical work
(e.g., Tacitus’ Annales). We can use that marked-up version as a query that we
match against very large and very error-filled corpora.
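One way to picture "reference text as query" is a sliding-window fuzzy search over noisy OCR tokens. The sketch below is a minimal illustration, not the method used by the authors: the function name `best_window`, the sample OCR string, and the choice of Python's `difflib` similarity ratio are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def best_window(reference, noisy):
    """Find the token window in `noisy` most similar to `reference`.

    Returns (start_token_index, similarity_score). Hypothetical helper:
    a brute-force scan that is only practical for short texts.
    """
    ref_tokens = reference.lower().split()
    noisy_tokens = noisy.lower().split()
    width = len(ref_tokens)
    target = " ".join(ref_tokens)

    best_start, best_score = -1, 0.0
    # Slide a window of the same token length across the noisy text.
    for start in range(max(1, len(noisy_tokens) - width + 1)):
        window = " ".join(noisy_tokens[start:start + width])
        score = SequenceMatcher(None, target, window, autojunk=False).ratio()
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Clean reference line and an invented, error-filled OCR rendering of it.
REFERENCE = "Arma virumque cano Troiae qui primus ab oris"
NOISY_OCR = "xx yy Arrna uirumque cano Troiae qvi primus ab oris zz"

start, score = best_window(REFERENCE, NOISY_OCR)
print(start, round(score, 2))
```

A corpus-scale system would need an index rather than a linear scan; the point here is only that one clean version is enough to locate its noisy counterparts.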
4. Major points
3. A text collection can serve as a catalog, with all other versions of the texts in
that collection (translations, shorter quotations, and alternate editions)
represented as annotations on that collection.
5.
6.
7.
8. Adding markup for a citation scheme
<div1 type="Book" n="1">
<milestone ed="p" n="1" unit="card"/>
<l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
<l n="2">Italiam, fato profugus, Laviniaque venit</l>
<l n="3">litora, multum ille et terris iactatus et alto</l>
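As a rough illustration of how such markup yields a citation scheme, the sketch below reads book and line numbers from the `n` attributes and combines them into references such as 1.1. It assumes the fragment is closed with `</div1>` so that it parses; the `citations` function is a hypothetical helper, not part of any Perseus tool.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, closed with </div1> so that it is well-formed.
SAMPLE = """<div1 type="Book" n="1">
<milestone ed="p" n="1" unit="card"/>
<l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
<l n="2">Italiam, fato profugus, Laviniaque venit</l>
<l n="3">litora, multum ille et terris iactatus et alto</l>
</div1>"""


def citations(xml_text):
    """Yield (book.line, text) pairs from div1/l markup."""
    root = ET.fromstring(xml_text)
    # Element.iter() also visits the root, so a bare <div1> works too.
    for book in root.iter("div1"):
        for line in book.iter("l"):
            yield f"{book.get('n')}.{line.get('n')}", line.text


for ref, text in citations(SAMPLE):
    print(ref, text)
```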
9.
10.
11.
12.
13.
14. Our ability to align texts is what makes our approach possible
-- a single version of Goethe’s Faust allows us to organize
thousands of editions.
15. Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
16. Canonical Text Services URNs
These URNs allow us to represent any particular word in any version of any text -- they
allow us to represent our textual data (including annotations) as a very large RDF
graph.
It's not a million book library but a billion word data set.
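To make the word-level addressing concrete, here is a small sketch that turns one cited line into per-word URNs and plain subject/predicate/object tuples. Several things are assumptions for illustration only: the version URN used for the Aeneid, the `@word[n]` subreference form as understood from the CTS URN scheme, and the `http://example.org/` predicate names, which are placeholders rather than a real vocabulary.

```python
from collections import Counter

# Assumed version URN for the Aeneid, used purely as an example value.
VERSION_URN = "urn:cts:latinLit:phi0690.phi003.perseus-lat2"


def word_triples(version_urn, passage, text):
    """Return (subject, predicate, object) tuples, one pair per word.

    Each word is addressed as <version>:<passage>@<word>[<occurrence>],
    where the occurrence index counts repeats within the passage.
    """
    seen = Counter()
    triples = []
    for token in text.split():
        word = token.strip(",.;:")
        seen[word] += 1
        subject = f"{version_urn}:{passage}@{word}[{seen[word]}]"
        # Placeholder predicates; a real graph would use a published ontology.
        triples.append((subject, "http://example.org/partOf", f"{version_urn}:{passage}"))
        triples.append((subject, "http://example.org/surfaceForm", word))
    return triples


for triple in word_triples(VERSION_URN, "1.1", "Arma virumque cano,"):
    print(triple)
```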
17. Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
Canonical Text Services name space
18. Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
Greek literature
Latin literature
19. Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
TextGroup = tlg0284
Following the Thesaurus Linguae Graecae, we assign 284 to Aelius Aristides
20. Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
A TextGroup can define any useful collection:
* inscriptions from Ephesus
* the Homeric Hymns
21. Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
FRBR (Functional Requirements for Bibliographic Records) Works
tlg0284.tlg052 designates the Embassy of Achilles by Aelius Aristides
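The components walked through on the preceding slides can be pulled apart mechanically. The following is a minimal sketch of a parser for the three example URNs; the `CtsUrn` class and `parse_cts_urn` function are hypothetical names, and the parser handles only the fields shown here (namespace, text group, work, version, optional passage), not the full CTS URN syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CtsUrn:
    namespace: str            # e.g. greekLit, latinLit
    textgroup: str            # e.g. tlg0284 (Aelius Aristides)
    work: Optional[str]       # e.g. tlg052
    version: Optional[str]    # e.g. perseus-grc1
    passage: Optional[str]    # e.g. 1.41, if present


def parse_cts_urn(urn):
    """Split a CTS URN into its colon- and dot-separated components."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    hierarchy = parts[3].split(".")
    return CtsUrn(
        namespace=parts[2],
        textgroup=hierarchy[0],
        work=hierarchy[1] if len(hierarchy) > 1 else None,
        version=hierarchy[2] if len(hierarchy) > 2 else None,
        passage=parts[4] if len(parts) > 4 and parts[4] else None,
    )


for example in (
    "urn:cts:greekLit:tlg0284.tlg052.perseus-grc1",
    "urn:cts:latinLit:phi0474.phi052.opp-lat1",
    "urn:cts:latinLit:stoa0255.stoa004",
):
    print(parse_cts_urn(example))
```

The third URN has no version component, so it names the work as a whole rather than a particular edition.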
22. Representing different versions
      OCT                    Loeb
1.41  confidere 1    same    confidere 1
1.41  propediem 1    sub     prope 1
1.41                 insert  diem 1
1.41  ipsum 1        same    ipsum 1
1.41  eos 1          same    eos 1
23. Representing different versions
      OCT                    Loeb
1.41  confidere 1    same    confidere 1
1.41  propediem 1    sub     prope 1
1.41                 insert  diem 1
1.41  ipsum 1        same    ipsum 1
1.41  eos 1          same    eos 1
We can pragmatically represent the differences between our reference text and all other versions
24. Representing different versions
      OCT                    Loeb
1.41  confidere 1    same    confidere 1
1.41  propediem 1    sub     prope 1
1.41                 insert  diem 1
1.41  ipsum 1        same    ipsum 1
1.41  eos 1          same    eos 1
The reference text does not have to be the best text -- it does not even have to be perfect. It organizes all other texts,
even with noise.
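The same/sub/insert rows shown on these slides can be approximated with a standard sequence alignment. The sketch below uses Python's `difflib` on the tokens visible on the slide; it is an assumed illustration rather than the project's alignment code, and the `delete` label for words missing from the other version is an addition that does not appear on the slides.

```python
from difflib import SequenceMatcher


def align(reference, other):
    """Return (reference_word, relation, other_word) rows for two token lists."""
    rows = []
    matcher = SequenceMatcher(a=reference, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_span, other_span = reference[i1:i2], other[j1:j2]
        if tag == "equal":
            rows += [(word, "same", word) for word in ref_span]
            continue
        # Pair words position by position; leftovers become delete/insert.
        shared = min(len(ref_span), len(other_span))
        rows += [(r, "sub", o) for r, o in zip(ref_span, other_span)]
        rows += [(r, "delete", None) for r in ref_span[shared:]]
        rows += [(None, "insert", o) for o in other_span[shared:]]
    return rows


# Tokens as they appear on the slide for passage 1.41.
OCT = ["confidere", "propediem", "ipsum", "eos"]
LOEB = ["confidere", "prope", "diem", "ipsum", "eos"]

for row in align(OCT, LOEB):
    print(row)
```

Because every other version is stored only as its differences from the reference text, an imperfect reference costs a few extra rows rather than breaking the scheme.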
25. Conclusions
We are developing the Perseus Corpus of Greek Texts (c. 20m words of Greek
and Latin)
* Based on texts in Perseus
* FRBR metadata from the Perseus Catalog
* Revised XML brought in line with CTS and with the EpiDoc subset of TEI
XML
* Offers an extended “TEI by example”
26. Conclusions
We are preparing for a Leipzig Corpus
* This would be a superset of the Perseus Corpus
* Ideally much larger
* Initial work will include an additional 20 million words of primarily later Greek
and Latin