Romanello tokyo

Presentation of my research project, given at the EIRI – CCH Conference on the Digitization in the Humanities at Keio University (Tokyo)

Published in: Education

  1. Structured vs. Unstructured: Extracting Information From Scholarly Texts in European Classical Studies. Matteo Romanello, Centre for Computing in the Humanities. EIRI - CCH Symposium on the Digitization in the Humanities, Keio University, Tokyo, 18th March 2010. Romanello (CCH) Extracting Information From Scholarly Texts EIRI - CCH Symposium 1 / 26
  2. Overview: (1) Introduction; (2) Motivations and Background; (3) Methodology; (4) Work Phases; (5) Expected Results.
  3. Introduction
  4. Introduction: The Project at a Glance. Project started in October 2009; field of application: Digital Humanities and Classics (particularly Greek literature); co-supervision between the CCH and the CS department at King’s -> application of Computational Linguistics methods.
  5. Introduction: Focus. Scholarly texts from the European scholarly tradition in Classical Studies; secondary sources (e.g. journal papers) as opposed to primary sources (i.e. ancient texts). Sets of texts considered so far: Princeton - Stanford Working Papers in Classics (PSWPC); LEXIS online, a classics journal available online under an Open Access policy. Goal -> information extraction.
  6. Introduction: Goal. Devising an automatic system to improve semantic information retrieval over a discipline-specific corpus of unstructured texts. Focus on secondary sources; automatic -> scalable to huge amounts of data; information retrieval -> the task of retrieving information; unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML.
  7. Motivations and Background
  8. Motivations: The Million Book Library. archive.org, Google Books -> growth in the volume of information publicly available in electronic format; longer “shelf-life” of books in Classics/Humanities; need for effective tools to access information for research purposes.
  9. Motivations: Information Extraction in Classics: Challenges. Lack of tools comparable to CiteseerX, GoPubMed, etc.; results of traditional search engines -> high recall but low precision; need to go beyond TOCs or string matching-based IR; still issues with the encoding of Ancient Greek; no ad-hoc gold standards/training sets; lack of tools specifically tailored to Classics resources; electronically available text does not mean electronic text.
  10. Methodology
  11. Methodology: Named Entities as Access Point to Information. Mentions of entities matter for Classicists -> importance of print indexes in Classics. Disambiguation: different spellings or translations of names, relating different expressions to the same entity.
  12. Methodology: Named Entities as Access Point to Information. Entities to be extracted: (1) place names (ancient and modern); (2) relevant person names (mythological names, ancient authors, modern scholars); (3) references to primary and secondary sources (canonical texts and modern publications about them).
  13. Methodology: Reuse of Structured Information. Reuse of structured data sources (e.g. thesauri, authority lists) produced by scholars over the last two decades -> to train machine-learning-based tools to mine unstructured texts. Related work: research in the AI field -> Semantic Integration; use of Wikipedia/DBpedia in NLP. Related projects: EROCS by IBM.
  14. Work Phases
  15. Work Phases
  16. Work Phases: Corpus Building. Getting materials: crawling online archives. Extracting the text from collected documents: tools for text extraction from PDF -> open issues with Ancient Greek encoding; re-OCR documents, even the native digital ones.
  17. Work Phases: Corpus Building II. Corpora: open access, multilingual; Princeton/Stanford Working Papers in Classics (PSWPC); Lexis online; 470 articles in 2 corpora. OCR: Finereader; Ocropus (layout analysis); text extracted from PDFs (tools like pdftotext etc.). Alignment of multiple OCR outputs.
  18. Work Phases: Building the Knowledge Base (KB). Goal: integrate different data sources into a single KB. Why? Information about the same entities is spread over several data sources; data sources might use different output formats (raw text, DBs, HTML, XML etc.); partial overlaps but no interoperability. How? Use of high-level ontologies to map records related to the same entity. Result: a KB containing semantic data.
  19. Work Phases: Building the Knowledge Base (KB) II. Ontologies -> in CS, a formalism to model data. Integrating data sources: import each data source; map it to high-level ontologies (e.g., CIDOC-CRM); find overlaps between data sources -> aligning the records. The resulting knowledge base will be used as support for all the text processing tasks. Implementation of the KB: RDF triple store with a SPARQL interface.
  20. Work Phases: Corpus Processing. (1) Sentence identification; (2) entity extraction (named entity recognition + disambiguation): KB used to build up an entity context; (3) canonical reference extraction: KB provides training data; (4) modern bibliographic reference extraction: KB provides lists of journals/place names/authors to improve the performance of the tool.
  21. Work Phases
  22. Work Phases: Canonical References Extraction. (1) Citations used specifically for primary sources (i.e. works of ancient authors); (2) essential entry point to information: they refer to the research object, i.e. ancient texts; (3) logical instead of physical citation scheme (e.g., chapter/paragraph vs. page); (4) variation -> time, style, language (regexp insufficient!). Examples: “Hom. Il. XII 1”, “Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803”, “Hes. fr. 321 M.-W.”, “Callimaco, ’ep.’ 28 Pf., 5-6”.
  23. Expected Results
  24. Expected Results: Results. Automatically provide multiple meaningful entry points to information; enrich the corpus with links to resources (particularly primary sources); improve user access to the corpus; demonstrate the scalability of the approach. Tools/Resources: Knowledge Base for Classics; articles with improved text quality; (improved) corpora to be released; single tools for information extraction (e.g. CREX, Canonical References EXtractor).
  25. Expected Results: Possible Applications. Solutions to problems peculiar to Classics might help to improve the performance of existing tools/algorithms. Collections of secondary sources as corpora: citation patterns; citation and co-citation networks; trends in Classics citation practice.
  26. Thanks for your attention! matteo.romanello@kcl.ac.uk, http://uk.linkedin.com/in/matteoromanello
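
Slide 17 lists "alignment of multiple OCR outputs" without saying how it is done. One possible way to picture the step, assuming two OCR engines have produced token lists for the same page, is a token-level diff. The sketch below uses Python's standard `difflib`; the function names and the sample tokens are hypothetical and are not taken from the project.

```python
from difflib import SequenceMatcher


def align_ocr_outputs(tokens_a, tokens_b):
    """Align two token lists; return (tag, tokens_from_a, tokens_from_b) per region.

    tag is one of 'equal', 'replace', 'delete', 'insert' (difflib's opcodes).
    """
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tag, tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def disagreements(tokens_a, tokens_b):
    """Keep only the regions where the two OCR outputs differ."""
    return [
        (from_a, from_b)
        for tag, from_a, from_b in align_ocr_outputs(tokens_a, tokens_b)
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical outputs of two OCR engines for the same line of text.
    engine_1 = "sing the wrath of Achilles".split()
    engine_2 = "sing the wratb of Achilles".split()
    for from_a, from_b in disagreements(engine_1, engine_2):
        print(from_a, "vs", from_b)
```

The disagreeing regions are the places where a further decision (a dictionary lookup, a vote among several engines, or manual review) would be needed; which of these the project uses is not stated in the slides.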
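
Slides 18 and 19 describe importing several data sources, mapping them to a shared model, and aligning records that describe the same entity. The following is a toy illustration of that idea only: it stores (subject, predicate, object) triples in a plain Python set rather than in an RDF triple store, and it aligns records by shared name labels rather than through an ontology such as CIDOC-CRM. The source names, record identifiers, and the predicates `label` and `same_as` are all hypothetical.

```python
def normalise(label):
    """Lower-case and trim a name so that spelling variants in case compare equal."""
    return label.strip().lower()


def build_triples(source_name, records):
    """Turn records ({'id': ..., 'labels': [...]}) into (subject, 'label', name) triples."""
    triples = set()
    for record in records:
        subject = f"{source_name}:{record['id']}"
        for label in record["labels"]:
            triples.add((subject, "label", normalise(label)))
    return triples


def align(triples):
    """Add a 'same_as' triple for every pair of subjects that share a label."""
    subjects_by_label = {}
    for subject, predicate, obj in triples:
        if predicate == "label":
            subjects_by_label.setdefault(obj, set()).add(subject)
    links = set()
    for subjects in subjects_by_label.values():
        ordered = sorted(subjects)
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                links.add((left, "same_as", right))
    return triples | links


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Two hypothetical authority lists that both describe Homer.
source_a = [{"id": "1", "labels": ["Homer", "Homerus"]}]
source_b = [{"id": "42", "labels": ["Omero", "Homer"]}]

kb = align(build_triples("srcA", source_a) | build_triples("srcB", source_b))
print(match(kb, p="same_as"))
```

The `match` helper plays the role that a SPARQL triple pattern would play against a real store; exact label matching is a deliberately crude stand-in for the ontology-based mapping the slides mention.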
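
Slide 22 notes that regular expressions are insufficient for canonical references because of variation in time, style, and language. A small experiment makes the point concrete. The pattern below is a deliberately naive baseline written for this illustration (it is not CREX and not the author's method): it expects an abbreviated author, an abbreviated capitalised work title, and a numeric scope. The grouping of the slide's examples into four strings is an assumption about where the line breaks fell.

```python
import re

# Naive baseline: "Author. Work. [Book] line[-line]", e.g. "Hom. Il. XII 1".
NAIVE_REFERENCE = re.compile(
    r"\b(?P<author>[A-Z][a-z]+\.)\s+"            # abbreviated author, e.g. "Hom."
    r"(?P<work>[A-Z][a-z]+\.)\s+"                # abbreviated work, e.g. "Il."
    r"(?P<scope>(?:[IVXLC]+\s+)?\d+(?:-\d+)?)"   # optional Roman book number, then lines
)

# The example citations from slide 22.
EXAMPLES = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]


def find_references(text):
    """Return one dict (author, work, scope) per match of the naive pattern."""
    return [m.groupdict() for m in NAIVE_REFERENCE.finditer(text)]


if __name__ == "__main__":
    for example in EXAMPLES:
        print(example, "->", find_references(example))
```

Only the first example is captured. The others fail for different reasons: quoted work titles, a lower-case "fr." followed by an edition siglum, and an unabbreviated Italian author name. Each could be patched with another alternative in the pattern, but the list of special cases keeps growing, which is consistent with the slide's preference for a trained extractor.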
