Structured Vs Unstructured:
Extracting Information From Scholarly Texts in
European Classical Studies

Matteo Romanello
Centre for Computing in the Humanities (CCH)

EIRI - CCH Symposium on the Digitization in the Humanities
Keio University - Tokyo 18th March 2010

Overview

1 Introduction

2 Motivations and Background

3 Methodology

4 Work Phases

5 Expected Results


Introduction

The Project at a glance

Project started in October 2009
Field of application: Digital Humanities, Classics (particularly Greek literature)
Co-supervised by the CCH and the Computer Science department at King’s
-> application of Computational Linguistics methods


Focus

Scholarly texts from the European scholarly tradition in Classical Studies
Secondary sources (e.g. journal papers), as opposed to primary sources (i.e. ancient texts)

Sets of texts considered so far:
    Princeton/Stanford Working Papers in Classics (PSWPC)
    LEXIS online: a Classics journal available online under an Open Access policy
goal -> information extraction


Goal

Devising an automatic system to improve semantic information retrieval over a discipline-specific corpus of unstructured texts
    focus on secondary sources
    automatic -> scales to very large amounts of data
    information retrieval -> finding the documents or passages relevant to an information need
    unstructured texts -> raw texts (e.g. .txt files), as opposed to structured/encoded XML


Motivations

The Million Book Library

archive.org, Google Books -> growing volume of information publicly available in electronic format
longer “shelf-life” of books in Classics/Humanities
need for effective tools to access information for research purposes


Information Extraction in Classics: challenges

lack of tools comparable to CiteseerX, GoPubMed, etc.
results of traditional search engines -> high recall but low precision
need to go beyond TOCs or string matching-based IR
still issues with encoding of Ancient Greek
no ad-hoc gold standards/training sets
lack of tools specifically tailored to Classics resources
electronically available text does not mean electronic text


Methodology

Named Entities as Access Point to Information

mentions of entities matter for Classicists -> importance of print indexes in Classics
disambiguation; different spellings or translations of names
relating different expressions to the same entity
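
Relating different spellings or translations of a name to one entity can be sketched as a lookup against an authority list. The following is only an illustration under stated assumptions, not the project's actual method: the `AUTHORITY` entries, the entity identifiers and the `resolve` helper are all hypothetical.

```python
import unicodedata

# Hypothetical authority list: entity id -> known name variants
# (different languages and spellings of the same ancient author).
AUTHORITY = {
    "author:homer": ["Homer", "Homerus", "Omero", "Homère"],
    "author:callimachus": ["Callimachus", "Callimaco", "Kallimachos"],
}


def normalise(name: str) -> str:
    """Casefold a name and strip diacritics so that variants compare equal."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


# Inverted index: normalised variant -> entity id.
VARIANT_INDEX = {
    normalise(variant): entity_id
    for entity_id, variants in AUTHORITY.items()
    for variant in variants
}


def resolve(mention: str):
    """Return the entity id for a mention, or None if it is not in the list."""
    return VARIANT_INDEX.get(normalise(mention))
```

A plain lookup like this covers spelling and translation variants only. Genuine disambiguation (one name, several possible entities) would additionally need the context in which the mention occurs.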


Entities to be extracted:
1 Place names (ancient and modern)
2 Relevant person names (mythological names, ancient authors, modern scholars)
3 References to primary and secondary sources (canonical texts and modern publications about them)


Reuse of Structured Information

Reuse of structured data sources (e.g. thesauri, authority lists) produced by scholars over the last two decades
-> to train machine-learning-based tools to mine unstructured texts
Related work:
    Research in the AI field -> Semantic Integration
    Use of Wikipedia/DBpedia in NLP
    Related projects: EROCS by IBM


Work Phases

Corpus building

Getting materials
    Crawling online archives

Extracting the text from collected documents
    Tools for text extraction from PDF -> open issues with Ancient Greek encoding
    Re-OCR documents, even the born-digital ones


Corpus Building II

Corpora
    open access, multilingual
    Princeton/Stanford Working Papers in Classics (PSWPC)
    Lexis online
    470 articles in 2 corpora

OCR
    FineReader
    OCRopus (layout analysis)
    text extracted from PDFs (tools like pdftotext etc.)
    Alignment of multiple OCR outputs
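
The slides do not say how the OCR outputs are aligned. As a minimal sketch, assuming a token-level alignment is sufficient, Python's standard `difflib.SequenceMatcher` can pair up two outputs and expose the spans where the engines disagree. The function names `align_tokens` and `disagreements` are hypothetical.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_a: str, ocr_b: str):
    """Align two OCR outputs token by token.

    Returns a list of (tag, tokens_a, tokens_b) tuples, where tag is one of
    'equal', 'replace', 'delete' or 'insert' as reported by difflib.
    """
    tokens_a, tokens_b = ocr_a.split(), ocr_b.split()
    # autojunk=False keeps frequent tokens from being ignored on long texts.
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tag, tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def disagreements(ocr_a: str, ocr_b: str):
    """Keep only the spans where the two engines disagree."""
    return [op for op in align_tokens(ocr_a, ocr_b) if op[0] != "equal"]
```

The disagreeing spans are the candidates for correction, for instance by voting when a third output is available or by checking each variant against a lexicon.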


Building the Knowledge Base (KB)

Goal: integrate different data sources into a single KB
Why?
    Information about the same entities is spread over several data sources
    Data sources might use different output formats (raw text, DBs, HTML, XML etc.)
    partial overlaps but no interoperability

How?
    Use of high-level ontologies to map records related to the same entity
Result: KB containing semantic data


Building the Knowledge Base (KB) II

Ontologies -> in CS, a formalism to model data
Integrating data sources:
    import each data source
    map it to high-level ontologies (e.g., CIDOC-CRM)
    find overlaps between data sources -> aligning the records
The resulting knowledge base will be used as support for all the text processing tasks
Implementation of the KB: RDF triple store with a SPARQL interface
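
The import, overlap-finding and alignment steps can be illustrated with a plain-Python stand-in for a triple store. This is a sketch under stated assumptions, not an RDF or SPARQL implementation: the `KnowledgeBase` class, the predicate names (`label`, `source`, `same_as`) and the source names are hypothetical, the mapping to a high-level ontology is not modelled, and records are aligned by the naive criterion of sharing a label.

```python
class KnowledgeBase:
    """Toy triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self.triples = set()

    def import_source(self, source_name, records):
        """Import records (dicts with 'id' and 'label') from one data source."""
        for record in records:
            subject = f"{source_name}:{record['id']}"
            self.triples.add((subject, "label", record["label"]))
            self.triples.add((subject, "source", source_name))

    def align_records(self):
        """Link subjects whose labels are equal, ignoring case."""
        by_label = {}
        for subject, predicate, obj in self.triples:
            if predicate == "label":
                by_label.setdefault(obj.casefold(), []).append(subject)
        for subjects in by_label.values():
            for a in subjects:
                for b in subjects:
                    if a != b:
                        self.triples.add((a, "same_as", b))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the sorted triples matching a pattern; None is a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )
```

The `match` method plays the role that a single SPARQL triple pattern would play against a real store. In an actual RDF setting the alignment links would typically be expressed with a standard property rather than the ad-hoc `same_as` used here.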


Corpus Processing

1 sentence identification
2 entity extraction (named entity recognition + disambiguation)
    KB used to build up an entity context
3 canonical references extraction
    KB provides training data
4 modern bibliographic references extraction
    KB provides lists of journals/place names/authors to improve the performance of the tool


Canonical References Extraction

1 citations used specifically for primary sources (i.e. works of ancient authors)
2 essential entry point to information: they refer to the research object, i.e. ancient texts
3 logical instead of physical citation scheme (e.g., chapter/paragraph vs. page)
4 variation -> time, style, language (regexp insufficient!)

Example
Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
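
To make the point about regular expressions concrete, here is a deliberately naive baseline run against the four examples above. It is a sketch, not the extractor the project proposes: the pattern `CANONICAL_REF` and the helper `find_refs` are hypothetical, and the pattern assumes the shape "author abbreviation, work abbreviation, Roman numeral, number".

```python
import re

# Naive baseline: capitalised author abbreviation, capitalised work
# abbreviation, then a locus made of a Roman numeral and a number,
# or of a single number.
CANONICAL_REF = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"
    r"(?P<work>[A-Z][a-z]+\.)\s+"
    r"(?P<locus>[IVXLC]+\s+\d+|\d+)"
)

# The four examples from the slide, copied verbatim.
EXAMPLES = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]


def find_refs(text):
    """Return one dict (author, work, locus) per match of the naive pattern."""
    return [m.groupdict() for m in CANONICAL_REF.finditer(text)]


if __name__ == "__main__":
    for example in EXAMPLES:
        print(example, "->", find_refs(example))
```

Only the first example is matched. The quoted work titles, the lowercase `fr.`, the edition sigla (`M.-W.`, `Pf.`) and the unabbreviated Italian author name all fall outside the pattern, and widening it to cover each style quickly becomes unmanageable.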


Expected Results

Results

Automatically provide multiple meaningful entry points to information
Enrich the corpus with links to resources (particularly primary sources)
Improve user access to the corpus
Demonstrate the scalability of the approach

Tools/Resources
    Knowledge Base for Classics
    Articles with improved text quality
    (improved) corpora to be released
    single tools for information extraction (e.g. CREX, Canonical References EXtractor)


Possible Applications

Solutions to problems peculiar to Classics might help to improve the performance of existing tools/algorithms

Collections of secondary sources as corpora:
    citation patterns
    citation and co-citation networks
    trends in citation practice in Classics
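
Once references have been extracted, a co-citation network reduces to counting how often two sources are cited in the same article. The sketch below assumes the extraction step yields a mapping from article identifier to the list of cited sources. The function name `cocitation_counts` and the sample data in the usage note are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations_by_article):
    """Count how often two sources are cited together in the same article.

    Keys of the returned Counter are alphabetically ordered pairs, so each
    unordered pair of sources has exactly one key.
    """
    counts = Counter()
    for cited in citations_by_article.values():
        # Deduplicate within an article, then sort for a canonical pair order.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts
```

For example, with the invented input `{"article-1": ["Hom. Il.", "Hes. Theog."], "article-2": ["Hom. Il.", "Hes. Theog."]}`, the pair `("Hes. Theog.", "Hom. Il.")` receives a count of 2. The resulting counts can serve as edge weights in a co-citation graph.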


Thanks for your attention!
matteo.romanello@kcl.ac.uk
http://uk.linkedin.com/in/matteoromanello

