Structured vs. Unstructured: Extracting Information from Classics Scholarly Texts

PhD seminar presentation at CCH/KCL

Published in: Education, Technology

  1. Introduction Motivations Methodology WorkPhases ExpectedResults
     Structured vs. Unstructured: Extracting Information From Classics Scholarly Texts
     Matteo Romanello (Centre for Computing in the Humanities)
     PhD Seminar, London, 28/01/2010
     Extracting Information From Classics Scholarly Texts, CCH
  2. Overview
       • Introduction
       • Motivations and Background
       • Methodology
       • Work Phases
       • Expected Results
  4. The Project at a glance
       • Project started in October 2009
       • Field of application: Digital Humanities, Classics (particularly Greek literature)
       • Co-supervision between the CCH and the CS department at King’s -> application of Computational Linguistics methods
  5. Goal
     Devising an automatic system to improve information retrieval over a discipline-specific corpus of unstructured texts
       • focus on secondary sources
       • automatic -> scalable to huge amounts of data
       • information retrieval -> the task of retrieving information
       • unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML
  7. The Million Book Library
       • archive.org, Google Books -> growth in the volume of information available in electronic format
       • longer “shelf-life” of books in Classics/Humanities
       • results of traditional search engines -> high recall but low precision
       • need for effective tools to access information for research purposes
  8. Information extraction in Classics
       • lack of tools comparable to Citeseer, CiteseerX, GoPubMed for other disciplines
       • are JSTOR’s features/functionalities enough for scholarly purposes?
       • still issues with encoding of Ancient Greek (e.g., The +$%j& of Danaids)
  9. Access points to information
       • going beyond TOCs or string matching-based IR
       • access points meaningful for Classics scholars
     Contribution to research
       • problems peculiar to Classics can help to improve the performance of existing tools/algorithms
       • analysis of papers published in a Classics journal (or archive) as corpus
  10. Mining and information extraction from Classics texts
       • no ad-hoc gold standards/training sets
       • lack of tools specifically tailored to Classics resources
       • electronically available text does not mean electronic text
     Possible corpus analysis
       • citation patterns
       • citation and co-citation networks
       • trends in the Classics citation practice
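
As an illustration of the co-citation analysis mentioned on this slide, here is a minimal sketch that counts how often two works are cited by the same article. The article ids and the cited works are invented for the example, and the function name `cocitation_counts` is hypothetical; the slides do not specify how such networks would actually be built.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: for each article, the set of works it cites.
citations = {
    "article_1": {"Hom. Il.", "Aesch. Sept.", "Hes. Th."},
    "article_2": {"Hom. Il.", "Aesch. Sept."},
    "article_3": {"Hom. Il.", "Hes. Th."},
}

def cocitation_counts(citations):
    """Count how many articles cite each unordered pair of works together."""
    counts = Counter()
    for cited in citations.values():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

counts = cocitation_counts(citations)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting pair counts can be read as weighted edges of a co-citation network, with works as nodes.
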
  12. Finding Mentions of Realia
       • mentions of realia are information that matters -> importance of print indexes in Classics
       • using realia as access points to information
     Identifying mentions of realia
       • disambiguation, different spellings or translations of names
     Kinds of realia we are interested in extracting
       1. Place names (ancient and modern)
       2. Relevant person names (mythological names, ancient authors, modern scholars)
       3. References to primary and secondary sources (canonical texts and modern publications about them)
  13. Reuse of Structured Information
       • over the last few years scholars have produced several structured datasources
       • use of structured information to train machine-learning based tools to mine unstructured texts
     Related projects
       • EROCS by IBM
       • current practice: Wikipedia/DBpedia as datasource of structured information
       • what improvements by using a discipline-specific Knowledge Base?
  16. Corpus building
     Getting materials
       • crawling online archives
     Characteristics of considered corpora
       • Open Access -> publicly accessible
       • possibly multilingual
     Extracting the text from collected documents
       • tools for text extraction from PDF -> open issues with Ancient Greek encoding
       • re-OCR documents, even the natively digital ones
  17. Corpus Building II
     Corpora
       • Princeton/Stanford Working Papers in Classics (PSWPC)
       • Lexis
       • 300 articles in 2 corpora
     OCR
       • FineReader
       • OCRopus (layout analysis)
       • text extracted from PDFs (tools like pdftotext etc.)
  18. Structured datasources
       • information about the same entities (i.e. realia) can be spread over several datasources
           • partial overlaps
       • datasources can use different formats (text, DB, HTML, XML etc.)
           • no interoperability
  19. Structured datasources II
     To create a semantic knowledge base (KB):
       • import each datasource
       • map it to high-level ontologies (e.g., CIDOC-CRM)
       • find overlaps between datasources -> aligning the records
     The resulting knowledge base will be used as support for all the text processing tasks
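
The alignment step on this slide can be sketched as follows. This is only a minimal illustration under stated assumptions: the two datasources, their record ids, and the helper names `normalize` and `align` are all hypothetical, and matching on a normalized name is a stand-in for whatever alignment method the project would actually use.

```python
import unicodedata
from collections import defaultdict

def normalize(name):
    """Lowercase a name and strip diacritics so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower().strip()

# Two hypothetical datasources with invented records and ids.
source_a = [{"id": "a1", "name": "Aeschylus"}, {"id": "a2", "name": "Hesiod"}]
source_b = [{"id": "b7", "name": "AESCHYLUS"}, {"id": "b9", "name": "Callimachus"}]

def align(*sources):
    """Group record ids from all sources under their normalized name."""
    groups = defaultdict(list)
    for records in sources:
        for record in records:
            groups[normalize(record["name"])].append(record["id"])
    # Keep only names shared by more than one record: the overlapping entities.
    return {name: ids for name, ids in groups.items() if len(ids) > 1}

overlaps = align(source_a, source_b)
print(overlaps)
```

Exact matching on a normalized string would still miss translated forms of the same name (for instance an Italian and an English form of an author's name), which is part of why the slides treat disambiguation and alignment as open problems.
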
  20. Corpus Processing
       1. sentence identification
       2. entity extraction (named entity recognition + disambiguation)
            • KB employed to build up an entity context
       3. canonical reference extraction
            • KB provides training data
       4. modern bibliographic reference extraction
            • KB provides lists of journals/place names/authors to improve the performance of the tool
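
The first two processing steps can be illustrated with a toy sketch. Everything here is an assumption made for the example: the `KB` dictionary, its entity ids, and the helper names are invented, the sentence splitter is deliberately naive, and a plain dictionary lookup stands in for real named entity recognition and disambiguation.

```python
import re

# Hypothetical knowledge-base extract: surface form -> (entity id, entity type).
# Both the ids and the entries are invented for illustration.
KB = {
    "Homer": ("kb:author/homer", "person"),
    "Aeschylus": ("kb:author/aeschylus", "person"),
    "Athens": ("kb:place/athens", "place"),
}

def split_sentences(text):
    """Naive sentence identification: split after ., ! or ? followed by a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def find_entities(sentence):
    """Return every KB surface form that occurs as a whole word in the sentence."""
    hits = []
    for name, (entity_id, entity_type) in KB.items():
        if re.search(rf"\b{re.escape(name)}\b", sentence):
            hits.append((name, entity_id, entity_type))
    return hits

text = "Homer is cited throughout. The play was first staged in Athens."
annotated = [(s, find_entities(s)) for s in split_sentences(text)]
for sentence, entities in annotated:
    print(sentence, entities)
```

Note that this splitter would wrongly break an abbreviated citation such as "Hom. Il. XII 1" into separate sentences, which hints at why sentence identification is listed as a step of its own for this kind of corpus.
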
  21. Canonical References Extraction
       1. citations used specifically for primary sources (i.e. works of ancient authors)
       2. essential entry point to information: they refer to the research object, i.e. ancient texts
       3. logical instead of physical citation scheme (e.g., chapter/paragraph vs. page)
       4. variation -> time, style, language (regexp insufficient!)
     Example
       • Hom. Il. XII 1
       • Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
       • Hes. fr. 321 M.-W.
       • Callimaco, ’ep.’ 28 Pf., 5-6
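
To make the "regexp insufficient" point concrete, the sketch below applies one plausible hand-written pattern to the four examples from this slide. The pattern `SIMPLE_REF` is a hypothetical one written for this illustration, not the project's extractor: it expects an abbreviated author, an abbreviated work, a Roman-numeral book and an Arabic line number.

```python
import re

# Hypothetical pattern: "Author. Work. BOOK line", e.g. "Hom. Il. XII 1".
SIMPLE_REF = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"
    r"(?P<work>[A-Z][a-z]+\.)\s+"
    r"(?P<book>[IVXLC]+)\s+"
    r"(?P<line>\d+)"
)

# The four examples from the slide, quoted as they appear there.
examples = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

def matched(text):
    """Return the captured fields if the simple pattern matches, else None."""
    m = SIMPLE_REF.search(text)
    return m.groupdict() if m else None

results = [(text, matched(text)) for text in examples]
for text, fields in results:
    print(f"{text!r}: {fields}")
```

Only the first example is captured. The others differ in quoting style, in citing fragments by edition, or in the language of the author's name, so each new variation would need yet another pattern.
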
  23. Results
       • automatically provide multiple meaningful entry points to information
       • enrich the corpus with links to resources (particularly primary sources)
       • improve user access to the corpus
       • demonstrate the scalability of the approach
     Tools/Resources
       • Knowledge Base for Classics
       • articles with improved text quality
       • corpora released
       • single tools for information extraction (e.g. Canonical References Extractor)
