Information Extraction on Noisy Texts for Historical Research



  1. Information Extraction on Noisy Texts for Historical Research
     Mike Bryant, Kepa Joseba Rodriguez, Tobias Blanke, Reto Speck
     19th July 2012
  2. Why EHRI?
     Fragmentation and dispersal of archival sources
     • Geographical scope of the Holocaust
     • Attempts to destroy the evidence
     • Migration of Holocaust survivors
     • Multiplicity of documentation projects after the war
  3. The Adler case
  4. The Adler Case: connecting collections
     1 - Jewish Museum Prague
     2 - ITS (International Tracing Service)
     3 - Yad Vashem
     4 - NIOD
     5 - King’s College
  5. Connecting Collections
     Collection-level metadata
     Enhance existing services
     • Build a virtual observatory
       – A digital infrastructure to unlock sources
     Develop new services
     • Build a virtual research environment
       – Problem-driven
       – User-driven
  6. Integrate multiple layers of metadata
     • Archival metadata (finding aids, thesaurus)
     • Machine-generated metadata (extracted entities)
     • User-generated metadata (annotations)
  7. Services for partner archives
     • OCR
       – Provide a general-purpose OCR service tailored to the needs of historical material
       – Allow attaching scanned paper finding aids to “bare-bones” collection descriptions and automatically storing/indexing OCR output
     • Named Entity Extraction (NEE)
       – Integrate NEE services to bootstrap the process of tagging collection descriptions
       – Integrate NEE with the EHRI thesaurus, to filter and validate NEE output
       – Build “candidate” search indexes, with crowd-sourced validation
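The idea of filtering NEE output against a thesaurus can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `THESAURUS` set is a hypothetical stand-in (not real EHRI thesaurus data), `validate_candidates` is a made-up helper name, and fuzzy string matching with the standard library's `difflib` is only one possible way to tolerate OCR noise.

```python
from difflib import get_close_matches

# Hypothetical stand-in for a controlled vocabulary; not real EHRI thesaurus data.
THESAURUS = {"Theresienstadt", "Westerbork", "International Tracing Service"}


def validate_candidates(candidates, thesaurus=THESAURUS, cutoff=0.85):
    """Split extracted entity strings into thesaurus-backed and unresolved ones.

    A candidate is accepted when it is close enough (per difflib's similarity
    ratio) to a thesaurus term, which tolerates small OCR errors.
    """
    lookup = {term.lower(): term for term in thesaurus}
    validated, unresolved = {}, []
    for cand in candidates:
        match = get_close_matches(cand.lower(), list(lookup), n=1, cutoff=cutoff)
        if match:
            validated[cand] = lookup[match[0]]  # map noisy form to the preferred term
        else:
            unresolved.append(cand)  # left for manual or crowd-sourced validation
    return validated, unresolved


validated, unresolved = validate_candidates(
    ["Theresienstadt", "Theresienstadl", "Ku Klux Klan"]
)
```

Candidates without a thesaurus match are kept rather than discarded, so they could still feed a "candidate" index awaiting validation.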
  8. Workflow Tools – the Ocropodium Project
     1. Workflow development
     2. Batch process
     3. Transcript correction
  9. NEE Experiment – Corpus data
     • Wiener Library: Holocaust survivor testimonies
       – 17 pages
       – ~93% OCR word accuracy
     • King’s College London: H.M.S. Kelly Newsletters
       – 33 pages
       – ~92.5% OCR word accuracy
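The slide does not say how OCR word accuracy was measured. As a rough sketch under that caveat, one common approach is to align the OCR output with a manually corrected transcript and count the share of reference words that survive intact. The function name `word_accuracy` and the sample sentences are illustrative assumptions.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference words that also appear, in order, in the OCR output."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        return 0.0
    # Align the two word sequences; autojunk is disabled so frequent words
    # are not ignored on long pages.
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# One misrecognised word out of six.
accuracy = word_accuracy("the ship left port at dawn", "the sbip left port at dawn")
```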
  10. NEE Experiment - Tools
      • Extracted entities
        – Person
        – Location
        – Organisation
      • Tools
        – Alchemy API
        – OpenCalais
        – Apache OpenNLP
        – Stanford NER
      • Manually annotated source data
        – Tokenized and POS tagged using TreeTagger
        – Imported into MMAX2 for manual entity tagging
      Example research queries:
      “Find all information about prisoners arriving in Terezín from the Netherlands in 1944”
      “Find all documentation from Hans Gunther Adler on SS guards in Terezín”
  11. NEE Experiment - Results
      Low performance of the tools in corrected and raw text

                     Raw                 Corrected
      Tool           P     R     F1      P     R     F1
      Alchemy        0.61  0.38  0.47    0.63  0.38  0.48
      OpenCalais     0.75  0.29  0.41    0.69  0.30  0.42
      OpenNLP        0.42  0.12  0.19    0.53  0.13  0.21
      Stanford       0.57  0.52  0.54    0.60  0.61  0.60
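F1 is the harmonic mean of precision (P) and recall (R). The short sketch below recomputes it for the raw-text scores reported on this slide; the function and variable names are illustrative. Because the slide's P and R values are already rounded, a recomputed F1 can differ from the reported one in the second decimal place.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) on raw OCR text, as reported on the slide.
raw_scores = {
    "Alchemy": (0.61, 0.38),
    "OpenCalais": (0.75, 0.29),
    "OpenNLP": (0.42, 0.12),
    "Stanford": (0.57, 0.52),
}

raw_f1 = {tool: f1_score(p, r) for tool, (p, r) in raw_scores.items()}
best_tool = max(raw_f1, key=raw_f1.get)

for tool, score in raw_f1.items():
    print(f"{tool:<12}{score:.2f}")
```

The harmonic mean penalises imbalance: OpenCalais has the highest precision on raw text, but its low recall pulls its F1 below that of Stanford NER.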
  12. LOC extraction most accurate, ORG least
      (Charts: WL F1-score; KCL F1-score)
  13. NEE Experiment – Personal names
      • Person names: commonly written in non-standard forms
      • Person and location names are used for other kinds of entities, e.g. warships
        – Warships frequently annotated as PER
  14. NEE Experiment - Organisations
      Performance of type ORG extraction is very low
      • Names of organizations appear in non-standard forms
        – Jargon and abbreviations abound, particularly in Kelly newsletters
      • Many organizations no longer exist
        – SS and other relevant Nazi organizations have not been detected
      • Spelling errors and typos in the original files:
        – OpenCalais used general knowledge to resolve this problem
        – Use of general knowledge may be problematic
        – “Klan, Walter” → “Ku Klux Klan”
  15. Relative performance
      • Stanford NER best performance across both datasets
        – Most effective on PER and LOC types
      • Alchemy API best results on ORG type
        – Biggest difference between raw OCR and manually corrected text
        – Not massively ahead of OpenCalais/Stanford
      • Apache OpenNLP worst performance on our data
        – But: most open of the tools and theoretically trainable
  16. Conclusions
      • Manual correction of OCR output does not significantly improve the performance (on our material)
        – Raw output is enough to obtain provisional candidates for N-gram indexing
      • Best results likely to come from combinations of tools
        – Specific workflows for specific material, no silver bullet
      • Focus in near term:
        – Identify most significant patterns of error
        – Implement pre-processing pipeline using simple heuristics and pattern matching tools
      • Focus in longer term:
        – Integrate EHRI thesaurus and other forms of knowledge to validate and correct the output of NE extraction tools
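The slides do not specify how tool outputs would be combined. One simple possibility, sketched here purely as an assumption, is majority voting: keep an entity only when at least `min_votes` tools agree on both its text and its type. The function name, the tool keys, and the sample entities are all illustrative.

```python
from collections import Counter


def combine_by_vote(tool_outputs, min_votes=2):
    """Keep (text, type) entities proposed by at least `min_votes` tools."""
    votes = Counter()
    for entities in tool_outputs.values():
        # A tool votes at most once per entity, even if it found it repeatedly.
        votes.update(set(entities))
    return {entity for entity, count in votes.items() if count >= min_votes}


# Illustrative outputs only; not results from the experiment.
outputs = {
    "stanford": [("Adler", "PER"), ("Prague", "LOC")],
    "opencalais": [("Adler", "PER")],
    "alchemy": [("Prague", "LOC"), ("Kelly", "PER")],
}

agreed = combine_by_vote(outputs)
```

Voting of this kind trades recall for precision, so the threshold would need tuning per collection, in line with the point that specific material needs specific workflows.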
  17. Thanks
      Any questions?
      Publications:
      • Tobias Blanke, Mike Bryant, Mark Hedges: Ocropodium: open source OCR for small-scale historical archives. Journal of Information Science, Vol. 38, No. 1.
      • Tobias Blanke, Michael Bryant, Mark Hedges: Open source OCR for Scientific Workflows in History. Journal of Documentation, forthcoming.