Structured and Unstructured: Extracting Information From Classics Scholarly Texts

Slides of the talk given at the DHSI 2010 Graduate Colloquium at UVic (Canada).

Published in: Education, Technology

  1. Structured and Unstructured: Extracting Information From Classics Scholarly Texts
     Matteo Romanello (Centre for Computing in the Humanities, King’s College London)
     Graduate Colloquium - DHSI 2010, University of Victoria BC - 8th June 2010
     Romanello CCH Extracting Information From Scholarly Texts
  2. The Project at a glance
     • Project started in October 2009
     • Disciplines: Digital Humanities, Classics, Computer Science
     • Co-supervised by Willard McCarty (KCL, Department of Digital Humanities) and Jonathan Ginzburg (KCL, Department of Computer Science)
     • Project supported by an AHRC (Arts and Humanities Research Council) award
  3. Goal
     Devising an automatic system to improve semantic information retrieval over a discipline-specific corpus of unstructured texts
     • focus on secondary sources (e.g. journal papers) as opposed to primary sources (i.e. ancient texts)
     • automatic -> scalable to huge amounts of data
     • information retrieval -> the task of retrieving information
     • unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML
     Example: “Hom. Il. XII 1” is a sequence of 14 characters meaning “first line of the twelfth book of Homer’s Iliad”
  4. Semantic Information Retrieval
     Semantic vs string-matching-based IR
  5. Named Entities as Entry Point to Information
     Entities to be extracted:
       1. Place names (ancient and modern)
       2. Relevant person names (mythological names, ancient authors, modern scholars)
       3. References to primary and secondary sources (canonical texts and modern publications about them)
  6. Work Phases
  7. Corpus building
     • Getting materials
     • Crawling online archives
     • Extracting the text from collected documents
     • Tools for text extraction from PDF -> open issues with Ancient Greek encoding
     • Re-OCR documents, even the native digital ones
  8. Corpus Building II
     Corpora
     • open access, multilingual
     • Princeton/Stanford Working Papers in Classics (PSWPC)
     • Lexis online
     • 470 articles in 2 corpora
     OCR
     • FineReader
     • OCRopus (layout analysis)
     • text extracted from PDFs (tools like pdftotext etc.)
     Alignment of multiple OCR outputs
  9. Building the Knowledge Base (KB)
     Goal: integrate different data sources into a single KB
     Why?
     • Information about the same entities is spread over several data sources
     • Data sources might use different output formats (raw text, DBs, HTML, XML etc.)
     • Partial overlaps but no interoperability
     How? Use of high-level ontologies to map records related to the same entity
     Result: KB containing semantic data
  10. Corpus Processing
      Tasks:
        1. sentence identification
        2. entity extraction (named entity recognition + disambiguation): the KB is used to build up an entity context
        3. canonical reference extraction: the KB provides training data
        4. modern bibliographic reference extraction: the KB provides lists of journals, place names and authors to improve the performance of the tool
  11. Canonical References
  12. Canonical References Extraction
        1. citations used specifically for primary sources (i.e. works of ancient authors)
        2. essential entry point to information: they refer to the research object, i.e. ancient texts
        3. logical instead of physical citation scheme (e.g. chapter/paragraph vs. page)
        4. variation -> time, style, language (regexp insufficient!)
      Example:
        Hom. Il. XII 1
        Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
        Hes. fr. 321 M.-W.
        Callimaco, ’ep.’ 28 Pf., 5-6
  13. So What?
      New possible research questions:
      • How has the citing of primary sources in Classics changed?
      • What are the characteristics of citation and co-citation networks?
      • Are the traditional IR tools in Classics actually exhaustive?
  14. Why a Digital Humanities project?
      • Better understanding of
          • the discipline's specificities
          • users’ needs
      • Writing code to develop a project means
          • formalizing the way a given result is obtained
          • creating a repeatable and thus refutable process
          • introducing reasoning based on the analysis of quantitative data into Classics
      • Being able to apply the product of DH research to traditional scholarship
  15. Thanks for your attention!
      matteo.romanello@kcl.ac.uk
      http://kcl.academia.edu/MatteoRomanello
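
The "Hom. Il. XII 1" example on the Goal slide and the citation samples on the Canonical References Extraction slide can be made concrete with a short Python sketch. It is not the project's extraction method: the slides only say that the KB provides training data for that step. The lookup tables `AUTHORS` and `WORKS`, the pattern `NAIVE`, and the helpers `roman_to_int` and `parse_reference` are all hypothetical names introduced here for illustration. The sketch turns the 14-character string into structured fields, then applies the same pattern to the other samples to show why a single regular expression is described as insufficient. The combined "Aesch. ...; Ar. ..." sample is split at the semicolon into two strings.

```python
import re

# Hypothetical abbreviation tables; a real system would draw these from a KB.
AUTHORS = {"Hom.": "Homer"}
WORKS = {("Hom.", "Il."): "Iliad"}

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XII' to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(numeral) and value < ROMAN[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


# Naive pattern: author abbreviation, work abbreviation, Roman book, Arabic line.
NAIVE = re.compile(
    r"^(?P<author>[A-Z][a-z]+\.) (?P<work>[A-Z][a-z]+\.) "
    r"(?P<book>[IVXLC]+) (?P<line>\d+)$"
)


def parse_reference(text):
    """Return a dict of structured fields, or None if the naive pattern fails."""
    m = NAIVE.match(text)
    if m is None:
        return None
    author, work = m["author"], m["work"]
    return {
        "author": AUTHORS.get(author, author),
        "work": WORKS.get((author, work), work),
        "book": roman_to_int(m["book"]),
        "line": int(m["line"]),
    }


# Citation samples taken from the slides.
SAMPLES = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30",
    "Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(f"{sample!r} ({len(sample)} chars) -> {parse_reference(sample)}")
```

Only the first sample is parsed. The others use quoted work titles, line ranges, fragment numbers with edition sigla, or an unabbreviated Italian author name, so the naive pattern returns `None` for each of them.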