Structured and Unstructured:
                 Extracting Information From Classics
                            Scholarly Texts

                                              Matteo Romanello
                                     Centre for Computing in the Humanities
                                                 King’s College London


                                 Graduate Colloquium - DHSI 2010
                               University of Victoria BC - 8th June 2010



Romanello                                                                         CCH
Extracting Information From Scholarly Texts
The Project at a glance



               Project started in October 2009;
               Disciplines: Digital Humanities, Classics, Computer
               Science;
               co-supervised by:
                       Willard McCarty (KCL, Department of Digital Humanities)
                       Jonathan Ginzburg (KCL, Department of Computer
                       Science)
               project supported by an AHRC (Arts and Humanities
               Research Council) award



Goal

       Devising an automatic system to improve semantic
       information retrieval over a discipline-specific corpus of
       unstructured texts
               focus on secondary sources (e.g. journal papers) as
               opposed to primary sources (i.e. Ancient Texts)
               automatic -> scalable to large amounts of data
               information retrieval -> the task of retrieving information
               unstructured texts -> raw texts (e.g. .txt files) as opposed
               to structured/encoded XML

       Example
       “Hom. Il. XII 1”: sequence of 14 characters meaning “first line
       of the twelfth book of Homer’s Iliad”
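
A minimal sketch of what "semantic" means for this example: turning the 14-character string into a structured record. The lookup tables and function names are hypothetical; in the project such data would presumably come from the knowledge base.

```python
# Hypothetical lookup tables, invented for illustration only.
AUTHORS = {"Hom.": "Homer"}
WORKS = {("Hom.", "Il."): "Iliad"}

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XII' to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtract when a smaller value precedes a larger one (e.g. IV, IX).
        if i + 1 < len(numeral) and ROMAN[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def parse_reference(text):
    """Split an 'Author Work Book Line' citation into a structured record."""
    author, work, book, line = text.split()
    return {
        "author": AUTHORS[author],
        "work": WORKS[(author, work)],
        "book": roman_to_int(book),
        "line": int(line),
    }


ref = parse_reference("Hom. Il. XII 1")
print(ref)
```

This only handles the single citation shape shown on the slide; other shapes need a more flexible approach.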
Semantic Information Retrieval




                                 Semantic vs String Matching based IR
Named Entities as Entry Point to Information




       Entities to be extracted:
            1   Place Names (ancient and modern);
            2   Relevant Person Names (mythological names, ancient authors,
                modern scholars)
            3   References to primary and secondary sources (canonical
                texts and modern publications about them)
Work Phases




Corpus building




       Getting materials
       Crawling online archives

       Extracting the text from collected documents
               Tools for text extraction from PDF -> open issues with
               Ancient Greek encoding
               re-OCR documents, even the born-digital ones




Corpus Building II


       Corpora
               open access, multilingual
               Princeton/Stanford Working Papers in Classics (PSWPC)
               Lexis online
               470 articles in 2 corpora

       OCR
               FineReader
               Ocropus (layout analysis)
               text extracted from PDFs (tools like pdftotext etc.)
               Alignment of multiple OCR outputs
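
One possible way to align two OCR outputs, sketched with Python's standard difflib module. The sample strings are invented, and the slides do not say which alignment method the project actually uses.

```python
from difflib import SequenceMatcher


def align_tokens(a, b):
    """Align two token lists; return (tag, tokens_a, tokens_b) triples."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Invented sample outputs standing in for two different OCR engines.
ocr_a = "the first line of the twelfth book of the Iliad".split()
ocr_b = "the first 1ine of the twelfth book of the Iliad".split()

# Keep only the spans where the two outputs disagree.
disagreements = [
    (x, y) for tag, x, y in align_tokens(ocr_a, ocr_b) if tag != "equal"
]
print(disagreements)
```

The disagreeing spans are the places where a third output, a dictionary, or a human check could decide between the readings.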

Building the Knowledge Base (KB)

       Goal: integrate different data sources into a single KB
       Why?
               Information about the same entities spread over several
               data sources
               Data sources might use different output formats (raw text,
               DBs, HTML, XML etc.)
               partial overlaps but no interoperability

       How?
               Use of high-level ontologies to map records related to the
               same entity
               Result: KB containing semantic data
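
A toy sketch of the integration idea, assuming two invented sources that describe the same entity with different field names. The FIELD_MAP table is a hypothetical stand-in for an ontology mapping, and matching entities by label alone is a simplification.

```python
from collections import defaultdict

# Invented records from two hypothetical data sources.
source_a = [{"name": "Homer", "abbr": "Hom."}]
source_b = [{"label": "Homer", "works": ["Iliad", "Odyssey"]}]

# Hypothetical mapping from source-specific fields to shared property names.
FIELD_MAP = {
    "name": "label",
    "label": "label",
    "abbr": "abbreviation",
    "works": "works",
}


def normalise(record):
    """Rename a record's fields to the shared property names."""
    return {FIELD_MAP[key]: value for key, value in record.items()}


def build_kb(*sources):
    """Merge records that refer to the same entity (matched by label)."""
    kb = defaultdict(dict)
    for source in sources:
        for record in source:
            rec = normalise(record)
            kb[rec["label"]].update(rec)
    return dict(kb)


kb = build_kb(source_a, source_b)
print(kb)
```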

Corpus Processing



       Tasks
            1   sentence identification
            2   entity extraction (named entity recognition +
                disambiguation)
                       KB used to build up an entity context
            3   canonical references extraction
                    KB provides training data
            4   modern bibliographic references extraction
                   KB provides lists of journals/place names/authors to improve
                   the performance of the tool
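
A sketch of task 1 under one assumption: that a list of citation abbreviations is available, for example from the KB. The abbreviation set and sample text below are invented. Splitting on every full stop would cut a citation such as "Hom. Il. XII 1" into pieces, so known abbreviations are not treated as sentence ends.

```python
# Hypothetical abbreviation list, invented for illustration.
ABBREVIATIONS = {"Hom.", "Il.", "Aesch.", "Ar.", "Hes.", "fr.", "cf."}


def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that lacks final punctuation.
        sentences.append(" ".join(current))
    return sentences


text = "The simile recalls Hom. Il. XII 1. It is discussed below."
sentences = split_sentences(text)
print(sentences)
```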



Canonical References




Canonical References Extraction

            1   citations used specifically for primary sources (i.e. works of
                ancient authors)
            2   essential entry point to information: refer to the research
                object, i.e. ancient texts
            3   logical instead of physical citation scheme (e.g., chapter/paragraph
                vs. page)
            4   variation -> time, style, language (regexp insufficient!)

       Example
       Hom. Il. XII 1
       Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
       Hes. fr. 321 M.-W.
       Callimaco, ’ep.’ 28 Pf., 5-6
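
To illustrate the point about regular expressions, here is a sketch that tests one plausible pattern against the four examples above. The pattern is an assumption written for this illustration, not one taken from the project.

```python
import re

# Illustrative pattern for 'Author. Work. RomanBook Line', e.g. 'Hom. Il. XII 1'.
SIMPLE_REF = re.compile(r"[A-Z][a-z]+\. [A-Z][a-z]+\. [IVXLC]+ \d+")

REFERENCES = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

# True where the whole citation fits the pattern, False otherwise.
matches = [bool(SIMPLE_REF.fullmatch(ref)) for ref in REFERENCES]

for ref, matched in zip(REFERENCES, matches):
    print(f"{'match' if matched else 'miss':5}  {ref}")
```

Only the first citation fits. Each of the others breaks a different assumption: quoted work titles and line ranges, fragment numbers with editor sigla, and an unabbreviated author name in another language.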

So What?




       New Possible Research Questions:
               how has the citing of primary sources in Classics changed?
               what are the characteristics of citation and co-citation
               networks?
               are the traditional IR tools in Classics actually exhaustive?




Why a Digital Humanities project?



               Better understanding of
                       the discipline's specificities
                       users’ needs
               Writing code to develop a project means
                       formalizing the way a given result is obtained
                       creating a repeatable and thus refutable process
                       introducing a reasoning based on the analysis of
                       quantitative data into Classics
               Being able to
                       apply the results of DH research to traditional scholarship




Thanks for your attention!
       matteo.romanello@kcl.ac.uk
       http://kcl.academia.edu/MatteoRomanello



