Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

How do you catalog a billion word library?
Bridget Almas (1), Alison Babeu (1), Frederik Baumgardt (2), Lisa Cerrato (1), Gregory Crane (1, 2), Greta Franzini (2), Anna Krohn (1), Simona Stoyanova (2)
1. Perseus Digital Library, Tufts University
2. Open Philology Project, University of Leipzig

Major points
1. We are interested in the logical structures within/across physical books:
Text Groups, e.g., Author Y, Papyri from X
Works, e.g., Vergil’s Aeneid
Individual words, e.g., Arma virumque cano

Major points
2. From a pragmatic perspective, we only need one version of a logical work
(e.g., Tacitus’ Annales). We can use that marked up version as a query that we
match against very large and very error-filled corpora.
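Using a reference text as a query against noisy text can be sketched roughly as follows. This is a minimal illustration using Python's standard difflib, not the project's actual matching pipeline; the function name best_match and the noisy OCR string are invented for the example.

```python
from difflib import SequenceMatcher


def best_match(reference: str, noisy_corpus: str):
    """Slide a window as long as the reference over the noisy text.

    Returns (score, start_word_index, matched_text) for the best window.
    """
    ref_words = reference.split()
    corpus_words = noisy_corpus.split()
    window = len(ref_words)
    best = (0.0, -1, "")
    for start in range(max(1, len(corpus_words) - window + 1)):
        candidate = " ".join(corpus_words[start:start + window])
        # Character-level similarity tolerates OCR errors such as "rn" for "m".
        score = SequenceMatcher(None, reference.lower(), candidate.lower()).ratio()
        if score > best[0]:
            best = (score, start, candidate)
    return best


# Clean reference line (from the marked-up edition).
reference = "Arma virumque cano, Troiae qui primus ab oris"
# Hypothetical OCR output with typical errors; invented for this sketch.
ocr_page = "LIBER I Anna uirumque cano, Troiae qui prirnus ab oris Italiam fato"

score, start, text = best_match(reference, ocr_page)
print(round(score, 2), start, text)
```

A real corpus would need an index rather than a linear scan, but the principle is the same: the clean version locates its noisy counterparts.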

Major points
3. A text collection can serve as a catalog, with all other versions of the texts in
that collection (including translations as well as shorter quotations as well as
alternate editions) represented as annotations on that collection.

Adding markup for a citation scheme
<div1 type="Book" n="1">
<milestone ed="p" n="1" unit="card"/>
<l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
<l n="2">Italiam, fato profugus, Laviniaque venit</l>
<l n="3">litora, multum ille et terris iactatus et alto</l>
<!-- remaining lines of Book 1 omitted -->
</div1>
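The citation scheme in this markup can be read back out mechanically. The sketch below assumes the fragment is well-formed XML (straight quotes, closed div1) and uses only Python's standard library; the function name citation_index and the "book.line" key format are choices made for this example.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, made well-formed for parsing.
TEI_FRAGMENT = """
<div1 type="Book" n="1">
<milestone ed="p" n="1" unit="card"/>
<l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
<l n="2">Italiam, fato profugus, Laviniaque venit</l>
<l n="3">litora, multum ille et terris iactatus et alto</l>
</div1>
"""


def citation_index(xml_text: str) -> dict:
    """Map 'book.line' citations to the text of each line."""
    book = ET.fromstring(xml_text.strip())
    book_n = book.get("n")
    return {f"{book_n}.{line.get('n')}": (line.text or "") for line in book.iter("l")}


index = citation_index(TEI_FRAGMENT)
print(index["1.1"])
```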

Our ability to align texts is what makes our approach possible
-- a single version of Goethe’s Faust allows us to organize
thousands of editions.

Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004

These URNs allow us to represent any particular word in any version of any text -- they
allow us to represent our textual data (including annotations) as a very large RDF
graph.
It's not a million-book library but a billion-word data set.
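The structure of these URNs can be illustrated by splitting one into its parts. This is a simplified sketch, not a full implementation of the CTS URN specification: it assumes colon-separated fields, a dot-separated textgroup.work.version hierarchy, and an optional trailing passage component. The names CtsUrn and parse_cts_urn are hypothetical.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, textgroup, work, version, passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    hierarchy = parts[3].split(".")
    textgroup = hierarchy[0]
    work = hierarchy[1] if len(hierarchy) > 1 else None
    version = hierarchy[2] if len(hierarchy) > 2 else None
    passage = parts[4] if len(parts) == 5 else None
    return CtsUrn(namespace, textgroup, work, version, passage)


with_version = parse_cts_urn("urn:cts:greekLit:tlg0284.tlg052.perseus-grc1")
work_only = parse_cts_urn("urn:cts:latinLit:stoa0255.stoa004")
print(with_version)
print(work_only)
```

The third URN on the slide stops at the work level, so its version field comes back empty: the same identifier scheme addresses a work in the abstract or one specific edition of it.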

Canonical Text Services name space

greekLit = Greek literature
latinLit = Latin literature

TextGroup = tlg0284
Following the Thesaurus Linguae Graecae, we assign 284 to Aelius Aristides

A TextGroup can define any useful collection:
* inscriptions from Ephesus
* the Homeric Hymns

FRBR (Functional Requirements for Bibliographic Records) Works
tlg0284.tlg052 designates the Embassy of Achilles by Aelius Aristides

Representing different versions
Citation  OCT word   Occurrence  Operation  Loeb word  Occurrence
1.41      confidere  1           same       confidere  1
1.41      propediem  1           sub        prope      1
1.41                             insert     diem       1
1.41      ipsum      1           same       ipsum      1
1.41      eos        1           same       eos        1

OCT Loeb
1.41 insert diem 1
We can pragmatically represent the differences between our reference text and all other versions
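A word-level alignment of this kind can be approximated with a standard sequence diff. The sketch below uses Python's difflib and is not the authors' alignment method. The word lists are taken from the rows shown for 1.41 rather than from the full passage, and the function name align_words and the "delete" label are additions for this example.

```python
from difflib import SequenceMatcher


def align_words(reference, other):
    """Return (reference_word, operation, other_word) rows.

    Operations: 'same', 'sub', 'insert' (only in other), 'delete' (only in reference).
    """
    rows = []
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_span, other_span = reference[i1:i2], other[j1:j2]
        if tag == "equal":
            rows += [(w, "same", w) for w in ref_span]
        else:
            # Pair words position by position; leftovers become inserts or deletes.
            paired = min(len(ref_span), len(other_span))
            rows += [(ref_span[k], "sub", other_span[k]) for k in range(paired)]
            rows += [(w, "delete", None) for w in ref_span[paired:]]
            rows += [(None, "insert", w) for w in other_span[paired:]]
    return rows


oct_words = ["confidere", "propediem", "ipsum", "eos"]       # reference text
loeb_words = ["confidere", "prope", "diem", "ipsum", "eos"]  # other version

rows = align_words(oct_words, loeb_words)
for row in rows:
    print(row)
```

Only the rows that are not "same" need to be stored as annotations on the reference text, which keeps the record of each additional version small.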

OCT Loeb
1.41 insert diem 1
The reference text does not have to be the best text -- it does not even have to be perfect. It organizes all other texts,
even with noise.

Conclusions
We are developing the Perseus Corpus of Greek Texts (c. 20m words of Greek
and Latin)
* Based on texts in Perseus
* FRBR metadata from the Perseus Catalog
* Revised XML brought in line with CTS and with the EpiDoc subset of TEI
XML
* Offers an extended “TEI by example”

Conclusions
We are preparing for a Leipzig Corpus
* This would be a superset of the Perseus Corpus
* Ideally much larger
* Initial work will include an additional 20 million words of primarily later Greek
and Latin
