How do you catalog a billion word library?
Bridget Almas (1), Alison Babeu (1), Frederik Baumgardt (2), Lisa Cerrato (1), Gregory Crane (1, 2),
Greta Franzini (2), Anna Krohn (1), Simona Stoyanova (2)
1. Perseus Digital Library, Tufts University
2. Open Philology Project, University of Leipzig
Major points
1. We are interested in the logical structures within/across physical books:
* Text Groups, e.g., Author Y, Papyri from X
* Works, e.g., Vergil’s Aeneid
* Individual words, e.g., Arma virumque cano
2. From a pragmatic perspective, we only need one version of a logical work
(e.g., Tacitus’ Annales). We can use that marked-up version as a query that we
match against very large and very error-filled corpora.
3. A text collection can serve as a catalog, with all other versions of the texts in
that collection (translations, shorter quotations, and alternate editions)
represented as annotations on that collection.
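Point 2 can be illustrated with a minimal sketch: a clean reference text acts as the query, and a noisy line is assigned the citation of the reference line it most resembles. The similarity measure (Python's `difflib`), the `normalize` and `best_citation` helpers, and the OCR sample are all assumptions made for illustration; the deck does not say which matching method the project uses.

```python
from difflib import SequenceMatcher

# Reference text: citation -> line (the Aeneid lines quoted in this deck).
REFERENCE = {
    "1.1": "Arma virumque cano, Troiae qui primus ab oris",
    "1.2": "Italiam, fato profugus, Laviniaque venit",
    "1.3": "litora, multum ille et terris iactatus et alto",
}


def normalize(text):
    """Lowercase and keep only letters and spaces, so case and punctuation noise is ignored."""
    return "".join(ch for ch in text.lower() if ch.isalpha() or ch.isspace())


def best_citation(noisy_line, reference=REFERENCE):
    """Return (citation, similarity) for the reference line closest to noisy_line."""
    scored = [
        (SequenceMatcher(None, normalize(noisy_line), normalize(line)).ratio(), citation)
        for citation, line in reference.items()
    ]
    score, citation = max(scored)
    return citation, score


# Hypothetical OCR output with typical character confusions (u/n, m/rn).
ocr_line = "Arma virnmque cano, Troiae qui prirnus ab oris"
citation, score = best_citation(ocr_line)
print(citation, round(score, 2))
```

Comparing every noisy line with every reference line is only workable for small samples; a corpus-scale version would need an index, which this sketch leaves out.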
Adding markup for a citation scheme
<div1 type="Book" n="1">
<milestone ed="p" n="1" unit="card"/>
<l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
<l n="2">Italiam, fato profugus, Laviniaque venit</l>
<l n="3">litora, multum ille et terris iactatus et alto</l>
<!-- remaining lines of Book 1 omitted -->
</div1>
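The citation scheme in this markup can be read back out mechanically. The sketch below is an illustration only: it assumes the fragment is closed with `</div1>`, uses straight quotes, and has a single `div1` as its root; `citation_index` is a hypothetical helper name. It builds a `book.line` lookup from the `n` attributes.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, closed so that it parses as well-formed XML.
TEI_FRAGMENT = """
<div1 type="Book" n="1">
  <milestone ed="p" n="1" unit="card"/>
  <l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
  <l n="2">Italiam, fato profugus, Laviniaque venit</l>
  <l n="3">litora, multum ille et terris iactatus et alto</l>
</div1>
"""


def citation_index(xml_text):
    """Map 'book.line' citations to line text for a single <div1> root element."""
    root = ET.fromstring(xml_text.strip())
    book = root.get("n")
    return {
        f"{book}.{line.get('n')}": (line.text or "").strip()
        for line in root.iter("l")
    }


index = citation_index(TEI_FRAGMENT)
for citation, text in index.items():
    print(citation, text)
```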
Our ability to align texts is what makes our approach possible
-- a single version of Goethe’s Faust allows us to organize
thousands of editions.
Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
These URNs allow us to represent any particular word in any version of any text -- they
allow us to represent our textual data (including annotations) as a very large RDF
graph.
It's not a million-book library but a billion-word data set.
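A rough sketch of word-level addressing follows. It assumes the CTS URN convention in which a passage component follows the version and a word is picked out with an `@` subreference plus an occurrence index in square brackets. The helper name `word_urns` is hypothetical, and pairing this version URN with passage 1.41 is illustrative only: the deck does not say which work its 1.41 example comes from.

```python
from collections import Counter


def word_urns(version_urn, passage, text):
    """Return one URN per word of `text`, numbering repeated words within the passage."""
    seen = Counter()
    urns = []
    for token in text.split():
        word = token.strip(",.;:")  # drop punctuation attached to the word
        if not word:
            continue
        seen[word] += 1
        urns.append(f"{version_urn}:{passage}@{word}[{seen[word]}]")
    return urns


# Version URN from this deck; the passage and words are illustrative only.
VERSION = "urn:cts:latinLit:phi0474.phi052.opp-lat1"
for urn in word_urns(VERSION, "1.41", "confidere propediem ipsum eos"):
    print(urn)
```

Identifiers of this kind could serve as node names in a graph of words and annotations; the graph vocabulary itself is not shown here.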
urn:cts: is the Canonical Text Services name space
greekLit = Greek literature
latinLit = Latin literature
TextGroup = tlg0284
Following the Thesaurus Linguae Graecae, we assign 284 to Aelius Aristides
A TextGroup can define any useful collection:
* inscriptions from Ephesus
* the Homeric Hymns
FRBR (Functional Requirements for Bibliographic Records) Works
tlg0284.tlg052 designates the Embassy of Achilles by Aelius Aristides
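The URN components named above (name space, text group, work, version) can be pulled apart with a few lines of code. This is a minimal sketch, not a full CTS URN parser: `CtsUrn` and `parse_cts_urn` are hypothetical names, and passage references and anything below the version level are deliberately ignored.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str           # e.g. greekLit, latinLit
    text_group: str          # e.g. tlg0284 (Aelius Aristides)
    work: Optional[str]      # e.g. tlg052
    version: Optional[str]   # e.g. perseus-grc1


def parse_cts_urn(urn):
    """Split a CTS URN into its components; work and version may be absent."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    hierarchy = parts[3].split(".")
    if len(hierarchy) > 3 or not all(hierarchy):
        raise ValueError(f"unsupported work hierarchy: {parts[3]!r}")
    # Pad with None so that URNs without a work or version still unpack.
    text_group, work, version = (hierarchy + [None, None])[:3]
    return CtsUrn(parts[2], text_group, work, version)


for urn in (
    "urn:cts:greekLit:tlg0284.tlg052.perseus-grc1",
    "urn:cts:latinLit:phi0474.phi052.opp-lat1",
    "urn:cts:latinLit:stoa0255.stoa004",
):
    print(parse_cts_urn(urn))
```

The third URN has no version component, so it names the work as a whole rather than one edition of it.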
Representing different versions
Passage  OCT word   n   Relation  Loeb word  n
1.41     confidere  1   same      confidere  1
1.41     propediem  1   sub       prope      1
1.41                    insert    diem       1
1.41     ipsum      1   same      ipsum      1
1.41     eos        1   same      eos        1
We can pragmatically represent the differences between our reference text and all other versions
The reference text does not have to be the best text -- it does not even have to be perfect. It organizes all other texts,
even with noise.
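The same/sub/insert records in the OCT and Loeb comparison can be derived automatically from two token sequences. The sketch below uses Python's `difflib` as a stand-in aligner, which is an assumption; the deck does not name the alignment tool. The function name `diff_versions` and the extra `delete` operation, which does not appear in the slide's example, are also assumptions.

```python
from difflib import SequenceMatcher


def diff_versions(reference, other):
    """Describe `other` relative to `reference` as (operation, ref_word, other_word) records."""
    records = []
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_span, other_span = reference[i1:i2], other[j1:j2]
        if tag == "equal":
            records += [("same", r, o) for r, o in zip(ref_span, other_span)]
            continue
        # Pair words position by position as substitutions; leftover words
        # become deletions (reference only) or insertions (other only).
        paired = min(len(ref_span), len(other_span))
        records += [("sub", r, o) for r, o in zip(ref_span[:paired], other_span[:paired])]
        records += [("delete", r, None) for r in ref_span[paired:]]
        records += [("insert", None, o) for o in other_span[paired:]]
    return records


oct_words = ["confidere", "propediem", "ipsum", "eos"]
loeb_words = ["confidere", "prope", "diem", "ipsum", "eos"]
for record in diff_versions(oct_words, loeb_words):
    print(record)
```

Because every record is expressed relative to the reference text, the reference only has to be stable, not flawless, for other versions to be attached to it.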
Conclusions
We are developing the Perseus Corpus of Greek Texts (c. 20m words of Greek
and Latin)
* Based on texts in Perseus
* FRBR metadata from the Perseus Catalog
* Revised XML brought in line with CTS and with the EpiDoc subset of TEI XML
* Offers an extended “TEI by example”
We are preparing for a Leipzig Corpus
* This would be a superset of the Perseus Corpus
* Ideally much larger
* Initial work will include an additional 20 million words of primarily later Greek
and Latin

Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
