Structured and Unstructured: Extracting Information From Classics Scholarly Texts

Matteo Romanello
Centre for Computing in the Humanities
King’s College London

Graduate Colloquium - DHSI 2010
University of Victoria BC - 8th June 2010

Romanello CCH
Extracting Information From Scholarly Texts

The Project at a glance

Project started in October 2009;
Disciplines: Digital Humanities, Classics, Computer
Science;
co-supervised by:
Willard McCarty (KCL, Department of Digital Humanities)
Jonathan Ginzburg (KCL, Department of Computer
Science)
project supported by an AHRC (Arts and Humanities
Research Council) award

Goal

Devising an automatic system to improve semantic
information retrieval over a discipline-specific corpus of
unstructured texts
focus on secondary sources (e.g. journal papers) as
opposed to primary sources (i.e. ancient texts)
automatic -> scalable to large amounts of data
information retrieval -> the task of finding the documents or
passages relevant to a query
unstructured texts -> raw texts (e.g. .txt files) as opposed
to structured/encoded XML

Example
“Hom. Il. XII 1”: sequence of 14 characters meaning “first line
of the twelfth book of Homer’s Iliad”
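To make the contrast between the raw string and its meaning concrete, here is a minimal sketch of turning such a reference into structured data. The lookup tables, the CanonicalRef type and the parse_reference function are hypothetical names introduced for illustration, and the pattern only covers this one citation style; it is not the project's actual implementation.

```python
import re
from typing import NamedTuple

# Hypothetical lookup tables; a real system would draw these from a knowledge base.
AUTHORS = {"Hom.": "Homer"}
WORKS = {("Hom.", "Il."): "Iliad"}
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50}


class CanonicalRef(NamedTuple):
    author: str
    work: str
    book: int
    line: int


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral (I, V, X, L only) to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(numeral) and ROMAN[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def parse_reference(text: str) -> CanonicalRef:
    """Parse a reference of the form 'Author. Work. BOOK line'."""
    match = re.fullmatch(r"(\w+\.) (\w+\.) ([IVXL]+) (\d+)", text)
    if match is None:
        raise ValueError(f"unsupported reference format: {text!r}")
    author_abbr, work_abbr, book, line = match.groups()
    return CanonicalRef(
        author=AUTHORS[author_abbr],
        work=WORKS[(author_abbr, work_abbr)],
        book=roman_to_int(book),
        line=int(line),
    )


raw = "Hom. Il. XII 1"
print(len(raw))              # the raw string is just a sequence of characters
print(parse_reference(raw))  # the structured reading of the same string
```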

Semantic Information Retrieval

Semantic vs String Matching based IR
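The difference can be illustrated with a toy sketch. The three documents, the annotation table and the identifier "homer.iliad:12.1" are all invented for this example; the sketch assumes an earlier extraction step has already resolved each surface form to a shared identifier.

```python
# Invented sample documents citing the same passage in three different styles.
DOCUMENTS = {
    "doc1": "As noted at Hom. Il. XII 1, the narrative resumes.",
    "doc2": "See Iliad 12.1 for the opening of the book.",
    "doc3": "Cf. Homer, Il. 12, 1.",
}

# Hypothetical output of an extraction step: every surface form has been
# resolved to the same (made-up) identifier.
ANNOTATIONS = {
    "doc1": {"homer.iliad:12.1"},
    "doc2": {"homer.iliad:12.1"},
    "doc3": {"homer.iliad:12.1"},
}


def string_search(query: str) -> list[str]:
    """String-matching IR: return documents containing the exact query string."""
    return sorted(doc for doc, text in DOCUMENTS.items() if query in text)


def semantic_search(ref_id: str) -> list[str]:
    """Semantic IR: return documents annotated with the given reference."""
    return sorted(doc for doc, ids in ANNOTATIONS.items() if ref_id in ids)


print(string_search("Hom. Il. XII 1"))
print(semantic_search("homer.iliad:12.1"))
```

String matching finds only the document that happens to use the queried spelling, while the lookup by identifier finds every document that cites the passage.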

Named Entities as Entry Point to Information

Entities to be extracted:
1 Place Names (ancient and modern);
2 Relevant Person Names (mythological names, ancient authors,
modern scholars)
3 References to primary and secondary sources (canonical
texts and modern publications about them)

Work Phases

Corpus building

Getting materials
Crawling online archives

Extracting the text from collected documents
Tools for text extraction from PDF -> open issues with
Ancient Greek encoding
re-OCR documents, even the born-digital ones

Corpus Building II

Corpora
open access, multilingual
Princeton/Stanford Working Papers in Classics (PSWPC)
Lexis online
470 articles in 2 corpora

OCR
FineReader
OCRopus (layout analysis)
text extracted from PDFs (tools like pdftotext etc.)
Alignment of multiple OCR outputs
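One way to align two OCR outputs is a token-level sequence alignment, sketched here with Python's standard difflib module. The two sample strings are invented and the disagreements function is a hypothetical helper; the slides do not say which alignment method the project uses.

```python
from difflib import SequenceMatcher


def disagreements(ocr_a: str, ocr_b: str) -> list[tuple[list[str], list[str]]]:
    """Align two OCR outputs token by token and return the spans where they differ."""
    tokens_a, tokens_b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep both readings so a later step can choose between them.
            diffs.append((tokens_a[i1:i2], tokens_b[j1:j2]))
    return diffs


# Invented outputs of two OCR engines for the same line of text.
engine_a = "the first line of the twelfth book"
engine_b = "the flrst line of the twelfth bock"

for reading_a, reading_b in disagreements(engine_a, engine_b):
    print(reading_a, "vs", reading_b)
```

The spans where the engines agree can be accepted as they are, leaving only the disagreements to be resolved, for instance by a dictionary lookup or a vote among more than two engines.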

Building the Knowledge Base (KB)

Goal: integrate different data sources into a single KB
Why?
Information about the same entities spread over several
data sources
Data sources might use different output formats (raw text,
DBs, HTML, XML etc.)
partial overlaps but no interoperability

How?
Use of high-level ontologies to map records related to the
same entity
Result: KB containing semantic data
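The integration step can be sketched in a much simplified form, with a plain field mapping standing in for the ontology. The source names, field names and records below are invented for illustration, and matching entities on a lower-cased name is a naive placeholder for proper record linkage.

```python
from collections import defaultdict

# Hypothetical mapping from each source's field names to shared properties.
# This plays the role of the ontology in a very reduced form.
FIELD_MAPPING = {
    "catalogue": {"auth_name": "name", "abbr": "abbreviation"},
    "library": {"creator": "name", "title": "work"},
}

# Invented records from two sources that describe the same entity differently.
RECORDS = [
    ("catalogue", {"auth_name": "Homer", "abbr": "Hom."}),
    ("library", {"creator": "Homer", "title": "Iliad"}),
]


def build_kb(records):
    """Merge records from several sources into one entry per entity."""
    kb = defaultdict(dict)
    for source, record in records:
        mapping = FIELD_MAPPING[source]
        mapped = {mapping[field]: value for field, value in record.items()}
        # Naive entity key: the lower-cased name (an assumption of this sketch).
        kb[mapped["name"].lower()].update(mapped)
    return dict(kb)


print(build_kb(RECORDS))
```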

Corpus Processing

Tasks
1 sentence identification
2 entity extraction (named entity recognition +
disambiguation)
KB used to build up an entity's context
3 canonical reference extraction
KB provides training data
4 modern bibliographic reference extraction
KB provides lists of journals/place names/authors to improve
the performance of the tool
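As a small illustration of the last point, a list of journal names taken from the KB can be used as a lookup when processing a bibliographic reference. The journal list, the sample reference and the find_journals helper are invented for this sketch; a real tool would more likely feed such lookups to a trained tagger as features.

```python
# Hypothetical list of journal titles exported from the KB.
JOURNALS = {"Classical Quarterly", "Mnemosyne"}


def find_journals(reference: str, journals: set[str] = JOURNALS) -> list[str]:
    """Return the known journal titles that occur in a reference string."""
    return sorted(title for title in journals if title in reference)


# Invented bibliographic reference, for illustration only.
sample = "A. Example, An invented title, Mnemosyne 1 (2000) 1-10"
print(find_journals(sample))
```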

Canonical References

Canonical References Extraction

1 citations used specifically for primary sources (i.e. works of
ancient authors)
2 essential entry point to information: refer to the research
object, i.e. ancient texts
3 logical instead of physical citation scheme (e.g., chapter/paragraph
vs. page)
4 variation -> time, style, language (regexp insufficient!)

Example
Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
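The limits of a regular expression can be seen by running one naive pattern over the four examples above. The pattern is an assumption made for this sketch: it was written with only the first citation style in mind (abbreviated author, abbreviated work, Roman book number, Arabic line number).

```python
import re

# A naive pattern tailored to the style "Author. Work. BOOK line".
NAIVE_PATTERN = re.compile(r"\b[A-Z][a-z]+\. [A-Z][a-z]+\. [IVXLC]+ \d+")

EXAMPLES = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

# True where the pattern finds a reference, False where it finds nothing.
results = [NAIVE_PATTERN.search(example) is not None for example in EXAMPLES]

for example, found in zip(EXAMPLES, results):
    print(f"{found!s:5}  {example}")
```

Only the first example is recognised. Quoted work titles, line ranges, fragment numbers with editor abbreviations, and author names written in full in another language each need their own rule, which is why a hand-written pattern does not scale to the variation found in the corpus.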

So What?

New Possible Research Questions:
How has the citing of primary sources in Classics changed?
What are the characteristics of citation and co-citation
networks?
Are the traditional IR tools in Classics actually exhaustive?
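Once references have been extracted, a co-citation network can be built by counting how often two passages are cited in the same article. The sketch below assumes the extraction step yields one set of normalised references per article; the article names and their contents are invented.

```python
from collections import Counter
from itertools import combinations

# Invented extraction output: the canonical references found in each article.
CITED = {
    "article_1": {"Hom. Il. XII 1", "Hes. fr. 321 M.-W."},
    "article_2": {"Hom. Il. XII 1", "Hes. fr. 321 M.-W.", "Aesch. Sept. 565-67"},
    "article_3": {"Aesch. Sept. 565-67"},
}


def cocitation_counts(cited: dict[str, set[str]]) -> Counter:
    """Count, for every pair of references, the articles citing both."""
    counts = Counter()
    for refs in cited.values():
        # Sorting gives each pair a single canonical order.
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return counts


counts = cocitation_counts(CITED)
for pair, n in counts.most_common():
    print(n, pair)
```

Each counted pair is a weighted edge of the co-citation network, so the resulting table can be loaded into any graph tool for further analysis.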

Why a Digital Humanities project?

Better understanding of
the discipline's specificities
users’ needs
Writing code to develop a project means
formalizing the way a given result is obtained
creating a repeatable and thus refutable process
introducing a reasoning based on the analysis of
quantitative data into Classics
Being able to
apply the results of DH research to traditional scholarship

Thanks for your attention!
matteo.romanello@kcl.ac.uk
http://kcl.academia.edu/MatteoRomanello
