Structured Vs Unstructured:
Extracting Information From Scholarly Texts in
         European Classical Studies

                           Matteo Romanello
                   Centre for Computing in the Humanities


    EIRI - CCH Symposium on the Digitization in the Humanities
             Keio University - Tokyo 18th March 2010




 Romanello (CCH)       Extracting Information From Scholarly Texts   EIRI - CCH Symposium   1 / 26
Overview



1   Introduction

2   Motivations and Background

3   Methodology

4   Work Phases

5   Expected Results




Introduction


The Project at a glance




    Project started in October 2009;
    Field of application: Digital Humanities, Classics (particularly
    Greek literature);
    co-supervision between the CCH and the CS department at King’s
    -> application of Computational Linguistics methods




Focus




   Scholarly Texts from the European Scholarly Tradition in Classical
   Studies
   Secondary sources, e.g. journal papers, as opposed to primary
   sources, i.e. Ancient Texts

   Sets of texts considered so far:
         Princeton - Stanford Working Papers in Classics (PSWPC)
         LEXIS online: a Classics journal available online under an Open
         Access policy
   goal -> information extraction




Goal




Devising an automatic system to improve semantic information
retrieval over a discipline-specific corpus of unstructured texts
    focus on secondary sources
    automatic -> scalable to huge amounts of data
    information retrieval -> the task of retrieving information
    unstructured texts -> raw texts (e.g. .txt files) as opposed to
    structured/encoded XML




Motivations


The Million Book Library




     archive.org, Google Books -> growth in the volume of information
     publicly available in electronic format
     longer “shelf-life” of books in Classics/Humanities
     need for effective tools to access information for research purposes




Information Extraction in Classics: challenges




    lack of tools comparable to CiteSeerX, GoPubMed, etc.
    results of traditional search engines -> high recall but low precision
    need to go beyond TOCs or string matching-based IR
    still issues with encoding of Ancient Greek
    no ad-hoc gold standards/training sets
    lack of tools specifically tailored to Classics resources
    electronically available text does not mean electronic text
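
The "high recall but low precision" point can be made concrete with a small calculation. The sketch below uses made-up document identifiers (an assumption for illustration, not data from the project): a plain string search returns ten documents, of which only two are the ones the reader needs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result: 10 documents returned, 2 of them relevant,
# and those 2 are the only relevant documents in the collection.
retrieved = [f"doc{i}" for i in range(1, 11)]
relevant = ["doc3", "doc7"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Everything relevant is found (recall 1.0), but most of what is returned is noise (precision 0.2).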




Methodology


Named Entities as Access Point to Information




    mentions of entities matter for Classicists -> importance of print
    indexes in Classics
    Disambiguation, different spellings or translations of names
    relating different expressions to the same entity
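
Relating different expressions to the same entity can be sketched as a lookup against an authority list. The list, the entity identifiers, and the normalization rule below are all hypothetical; the sketch only covers spelling and translation variants, not genuinely ambiguous names, which would need context.

```python
import unicodedata

# Hypothetical authority list: entity id -> known name variants.
AUTHORITY = {
    "aeschylus": ["Aeschylus", "Aesch.", "Eschilo", "Eschyle", "Aischylos"],
    "callimachus": ["Callimachus", "Callimaco", "Callim.", "Kallimachos"],
}


def normalize(name):
    """Lowercase, strip accents and trailing dots so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.lower().strip().rstrip(".")


# Invert the authority list: normalized variant -> entity id.
LOOKUP = {
    normalize(variant): entity_id
    for entity_id, variants in AUTHORITY.items()
    for variant in variants
}


def resolve(mention):
    """Return the entity id for a mention, or None if it is unknown."""
    return LOOKUP.get(normalize(mention))


for mention in ["Eschilo", "Aesch.", "Callimaco", "Homer"]:
    print(mention, "->", resolve(mention))
```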








Entities to be extracted:
  1   Place Names (ancient and modern);
  2   Relevant Person Names (mythological names, ancient authors,
      modern scholars)
  3   References to primary and secondary sources (canonical texts
      and modern publications about them)




Reuse of Structured Information




Reuse of structured data sources, e.g. thesauri, authority lists, etc.,
produced by scholars over the last two decades.
-> To train machine-learning based tools to mine unstructured texts.
Related work:
    Research in the AI field -> Semantic Integration
    Use of Wikipedia/DBpedia in NLP
    Related projects: EROCS by IBM




Work Phases


Corpus building




Getting materials
Crawling online archives

Extracting the text from collected documents
    Tools for text extraction from PDF -> open issues with Ancient
    Greek encoding
    re-OCR documents, even the born-digital ones




Corpus Building II


Corpora
    open access, multilingual
    Princeton/Stanford Working Papers in Classics (PSWPC)
    Lexis online
    470 articles in 2 corpora

OCR
    FineReader
    OCRopus (layout analysis)
    text extracted from PDFs (tools like pdftotext etc.)
    Alignment of multiple OCR outputs
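
One possible way to align two OCR outputs is a token-level diff. The sketch below uses Python's standard `difflib`; the sample strings are invented, and the slides do not say which alignment method the project actually uses.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_a, ocr_b):
    """Align two OCR outputs token by token and list where they disagree."""
    a, b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the differing token runs from both engines side by side.
            disagreements.append((tag, a[i1:i2], b[j1:j2]))
    return disagreements


# Invented outputs of two OCR engines for the same line of text.
engine_a = "the wrath of Achilles, son of Peleus"
engine_b = "the wrath of Achi11es, son of Peleus"

print(align_tokens(engine_a, engine_b))
```

The disagreeing spans are the places where a later step (voting, dictionary lookup, manual check) would have to choose between the engines.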


Building the Knowledge Base (KB)


Goal: integrate different data sources into a single KB
Why?
    Information about the same entities spread over several data
    sources
    Data sources might use different output formats (raw text, DBs,
    HTML, XML etc.)
    partial overlaps but no interoperability

How?
    Use of high-level ontologies to map records related to the same
    entity
    Result: KB containing semantic data


Building the Knowledge Base (KB) II




Ontologies -> in CS a formalism to model data
Integrating data sources:
    import each data source
    map it to high-level ontologies (e.g., CIDOC-CRM)
    find overlaps between data sources -> aligning the records
The resulting knowledge base will be used to support all the text
processing tasks
Implementation of the KB: RDF triple store with a SPARQL interface
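
The import / map / align steps can be illustrated with a toy in-memory set of triples. This is a stand-in for a real RDF triple store and SPARQL endpoint, not an implementation of one. The records, identifiers, and the label-based alignment rule are hypothetical, and the predicates are generic RDF/RDFS/OWL names rather than CIDOC-CRM properties.

```python
# Toy in-memory stand-in for an RDF triple store (no real SPARQL here).
# All identifiers, predicates, and records below are hypothetical.


def normalize(label):
    """Crude key for record alignment: lowercase letters only."""
    return "".join(ch for ch in label.lower() if ch.isalpha())


# Two data sources describing the same entity in different shapes.
source_a = [{"id": "a:1", "name": "Aeschylus", "type": "Person"}]
source_b = [{"key": "b:77", "label": "AESCHYLUS", "floruit": "5th c. BC"}]

triples = set()

# Step 1: import each data source and map its fields to shared predicates.
for rec in source_a:
    triples.add((rec["id"], "rdfs:label", rec["name"]))
    triples.add((rec["id"], "rdf:type", rec["type"]))
for rec in source_b:
    triples.add((rec["key"], "rdfs:label", rec["label"]))
    triples.add((rec["key"], "ex:floruit", rec["floruit"]))

# Step 2: align records whose normalized labels coincide.
by_label = {}
for s, p, o in sorted(triples):
    if p == "rdfs:label":
        by_label.setdefault(normalize(o), []).append(s)
for subjects in by_label.values():
    for other in subjects[1:]:
        triples.add((subjects[0], "owl:sameAs", other))


def match(pattern):
    """Return triples matching an (s, p, o) pattern; None is a wildcard."""
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )


print(match((None, "owl:sameAs", None)))
```

In a real triple store the `match` call would be a SPARQL query; the pattern-with-wildcards idea is the same.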




Corpus Processing




  1   sentence identification
  2   entity extraction (named entity recognition + disambiguation)
          KB used to build up an entity context
  3   canonical reference extraction
          KB provides training data
  4   modern bibliographic reference extraction
          KB provides lists of journals/place names/authors to improve the
          performance of the tool
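
Sentence identification is harder than it looks in this kind of text, because citation abbreviations end in full stops. The sketch below is a minimal rule-based splitter under stated assumptions: the abbreviation list is hypothetical (it could plausibly be drawn from the KB), and the sample text is invented.

```python
import re

# Hypothetical abbreviation list; such a list might come from the KB.
ABBREVIATIONS = {
    "hom.", "il.", "aesch.", "sept.", "ar.", "arch.", "hes.", "fr.",
    "cf.", "e.g.", "i.e.",
}


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital,
    unless the token before the break is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:m.start() + 1]
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation
        sentences.append(candidate.strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = ("The simile recalls Hom. Il. XII 1. "
        "See also Aesch. Sept. 565-67. It is discussed below.")
for sentence in split_sentences(text):
    print(sentence)
```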




Canonical References Extraction


  1   citations used specifically for primary sources (i.e. works of ancient
      authors)
  2   essential entry point to information: refer to the research object, i.e.
      ancient texts
  3   logical instead of physical citation scheme (e.g., chapter/paragraph
      vs. page)
  4   variation -> time, style, language (regexp insufficient!)

Example
Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
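
Why a regular expression alone falls short can be seen by running a naive pattern over the four examples above. The pattern is an illustrative baseline written for this sketch, not the extractor described in the talk.

```python
import re

# Naive baseline: "<Author abbr.> <Work abbr.> <Roman-numeral book> <line>".
# Illustrative assumption only; not the project's extractor.
NAIVE = re.compile(
    r"^(?P<author>[A-Z][a-z]+\.) (?P<work>[A-Z][a-z]+\.) "
    r"(?P<book>[IVXLC]+) (?P<line>\d+)$"
)

examples = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

for ref in examples:
    m = NAIVE.match(ref)
    print(f"{ref!r}: {m.groupdict() if m else 'no match'}")
```

Only the first reference matches. The others differ in quoting, multiple ranges, fragment numbering with editor sigla, or a non-abbreviated author name in another language, and each new pattern added to cover them makes the rule set harder to maintain.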



Expected Results


Results


    Automatically provide multiple meaningful entry points to
    information
    Enrich the corpus with links to resources (particularly primary
    sources)
    Improve user access to the corpus
    Demonstrate the scalability of the approach

Tools/Resources
    Knowledge Base for Classics
    Articles with improved text quality
    (improved) corpora to be released
    individual tools for information extraction (e.g. CREX, the Canonical
    References EXtractor)

Possible Applications




    Solutions to problems peculiar to Classics might help to improve
    the performance of existing tools/algorithms

Collections of secondary sources as corpora:
    citation patterns
    citation and co-citation networks
    trends in Classics citation practice
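
Once canonical references have been extracted, co-citation counts follow from simple pair counting. The sketch below assumes a hypothetical mapping from articles to the works they cite; the article names and citation sets are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: for each article, the set of canonical works it cites.
citations = {
    "article_1": {"Hom. Il.", "Aesch. Sept.", "Hes. Theog."},
    "article_2": {"Hom. Il.", "Aesch. Sept."},
    "article_3": {"Hom. Il.", "Hes. Theog."},
}


def cocitation_counts(citations):
    """Count how often each pair of works is cited in the same article."""
    counts = Counter()
    for cited in citations.values():
        # Sorting gives each pair one canonical order, e.g. (A, B) not (B, A).
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


counts = cocitation_counts(citations)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting pair counts are the edge weights of a co-citation network.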








Thanks for your attention!
matteo.romanello@kcl.ac.uk
http://uk.linkedin.com/in/matteoromanello




The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
BBR 2024 Summer Sessions Interview Training
BBR  2024 Summer Sessions Interview TrainingBBR  2024 Summer Sessions Interview Training
BBR 2024 Summer Sessions Interview Training
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
 

Romanello tokyo

  • 1. Structured Vs Unstructured: Extracting Information From Scholarly Texts in European Classical Studies Matteo Romanello1 1 Centre for Computing in the Humanities EIRI - CCH Symposium on the Digitization in the Humanities Keio University - Tokyo 18th March 2010 Romanello (CCH) Extracting Information From Scholarly Texts EIRI - CCH Symposium 1 / 26
  • 2. Overview: 1 Introduction; 2 Motivations and Background; 3 Methodology; 4 Work Phases; 5 Expected Results
  • 3. Introduction. Overview: 1 Introduction; 2 Motivations and Background; 3 Methodology; 4 Work Phases; 5 Expected Results
  • 4. Introduction. The Project at a glance: project started in October 2009; field of application: Digital Humanities, Classics (particularly Greek literature); co-supervision between the CCH and the CS department at King’s -> application of Computational Linguistics methods
  • 5. Introduction. Focus: scholarly texts from the European scholarly tradition in Classical Studies; secondary sources, e.g. journal papers, as opposed to primary sources, i.e. ancient texts. Sets of texts considered so far: Princeton - Stanford Working Papers in Classics (PSWPC); LEXIS online, a classics journal available online under an Open Access policy. Goal -> information extraction
  • 6. Introduction. Goal: devising an automatic system to improve semantic information retrieval over a discipline-specific corpus of unstructured texts. Focus on secondary sources; automatic -> scalable to huge amounts of data; information retrieval -> the task of retrieving information; unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML
  • 7. Motivations. Overview: 1 Introduction; 2 Motivations and Background; 3 Methodology; 4 Work Phases; 5 Expected Results
  • 8. Motivations. The Million Book Library: archive.org, Google Books -> growth in the volume of information publicly available in electronic format; longer “shelf-life” of books in Classics/Humanities; need for effective tools to access information for research purposes
  • 9. Motivations. Information Extraction in Classics, challenges: lack of tools comparable to CiteSeerX, GoPubMed, etc.; results of traditional search engines -> high recall but low precision; need to go beyond TOCs or string-matching-based IR; still issues with the encoding of Ancient Greek; no ad-hoc gold standards/training sets; lack of tools specifically tailored to Classics resources; electronically available text does not mean electronic text
  • 10. Methodology. Overview: 1 Introduction; 2 Motivations and Background; 3 Methodology; 4 Work Phases; 5 Expected Results
  • 11. Methodology. Named Entities as Access Point to Information: mentions of entities matter for Classicists -> importance of print indexes in Classics; disambiguation, different spellings or translations of names; relating different expressions to the same entity
  • 12. Methodology. Named Entities as Access Point to Information. Entities to be extracted: 1 place names (ancient and modern); 2 relevant person names (mythological names, ancient authors, modern scholars); 3 references to primary and secondary sources (canonical texts and modern publications about them)
  • 13. Methodology. Reuse of Structured Information: reuse of structured data sources, e.g. thesauri, authority lists, etc., produced by scholars over the last two decades -> to train machine-learning-based tools to mine unstructured texts. Related work: research in the AI field -> Semantic Integration; use of Wikipedia/DBpedia in NLP. Related projects: EROCS by IBM
  • 14. Work Phases. Overview: 1 Introduction; 2 Motivations and Background; 3 Methodology; 4 Work Phases; 5 Expected Results
  • 15. Work Phases
  • 16. Work Phases. Corpus building: getting materials; crawling online archives; extracting the text from collected documents; tools for text extraction from PDF -> open issues with Ancient Greek encoding; re-OCR documents, even the natively digital ones
  • 17. Work Phases. Corpus Building II. Corpora: open access, multilingual; Princeton/Stanford Working Papers in Classics (PSWPC); Lexis online; 470 articles in 2 corpora. OCR: FineReader; OCRopus (layout analysis); text extracted from PDFs (tools like pdftotext etc.). Alignment of multiple OCR outputs
  • 18. Work Phases. Building the Knowledge Base (KB). Goal: integrate different data sources into a single KB. Why? Information about the same entities is spread over several data sources; data sources might use different output formats (raw text, DBs, HTML, XML etc.); partial overlaps but no interoperability. How? Use of high-level ontologies to map records related to the same entity. Result: KB containing semantic data
  • 19. Work Phases. Building the Knowledge Base (KB) II. Ontologies -> in CS, a formalism to model data. Integrating data sources: import each data source; map it to high-level ontologies (e.g., CIDOC-CRM); find overlaps between data sources -> aligning the records. The resulting knowledge base will be used as support for all the text processing tasks. Implementation of the KB: RDF triple store with a SPARQL interface
  • 20. Work Phases. Corpus Processing: 1 sentence identification; 2 entity extraction (named entity recognition + disambiguation), with the KB used to build up an entity context; 3 canonical reference extraction, with the KB providing training data; 4 modern bibliographic reference extraction, with the KB providing lists of journals/place names/authors to improve the performance of the tool
  • 21. Work Phases
  • 22. Work Phases. Canonical References Extraction: 1 citations used specifically for primary sources (i.e. works of ancient authors); 2 essential entry point to information: they refer to the research object, i.e. ancient texts; 3 logical instead of physical citation scheme (e.g., chapter/paragraph vs. page); 4 variation -> time, style, language (regexp insufficient!). Examples: Hom. Il. XII 1 / Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803 / Hes. fr. 321 M.-W. / Callimaco, ’ep.’ 28 Pf., 5-6
  • 23. Expected Results. Overview: 1 Introduction; 2 Motivations and Background; 3 Methodology; 4 Work Phases; 5 Expected Results
  • 24. Expected Results. Results: automatically provide multiple meaningful entry points to information; enrich the corpus with links to resources (particularly primary sources); improve user access to the corpus; demonstrate the scalability of the approach. Tools/Resources: Knowledge Base for Classics; articles with improved text quality; (improved) corpora to be released; single tools for information extraction (e.g. CREX, Canonical References EXtractor)
  • 25. Expected Results. Possible Applications: solutions to problems peculiar to Classics might help to improve the performance of existing tools/algorithms. Collections of secondary sources as corpora: citation patterns; citation and co-citation networks; trends in citation practice in Classics
  • 26. Expected Results. Thanks for your attention! matteo.romanello@kcl.ac.uk http://uk.linkedin.com/in/matteoromanello
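
Slide 22 notes that regular expressions alone are insufficient for canonical references because of variation in time, style and language. The following is a minimal Python sketch of that point, not the project's extractor: `NAIVE_REF` and `naive_extract` are hypothetical names, and the pattern is a deliberately naive assumption (abbreviated author, optional abbreviated work title, numeric or Roman-numeral scope). It is run against four of the references shown on the slide.

```python
import re

# Hypothetical, deliberately naive pattern (an assumption for illustration):
# an abbreviated author ("Hom."), an optional abbreviated work title that may
# be wrapped in quotes ("Il.", ’Sept.’), then a scope such as "XII 1" or "565-67".
NAIVE_REF = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"
    r"(?:['’]?(?P<work>[A-Z][a-z]+\.)['’]?\s+)?"
    r"(?P<scope>[IVXLC]+\s+\d+|\d+(?:-\d+)?)"
)

# References taken from the slide.
examples = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]


def naive_extract(text):
    """Return the first match as a dict of named groups, or None."""
    match = NAIVE_REF.search(text)
    return match.groupdict() if match else None


for ref in examples:
    print(ref, "->", naive_extract(ref))
```

Under this pattern the Homer reference is parsed, the Aeschylus reference loses its second line range (628-30), and the fragment citation (Hes. fr. 321 M.-W.) and the Italian-language citation (Callimaco) are not matched at all. Each could be patched with another rule, but the rules multiply with every citation style and language, which is consistent with the slide's case for using training data from the KB.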