
Named Entity Recognition for Europeana Newspapers


Overview of Europeana Newspapers Named Entity Recognition, presented at the Oceanic Exchanges Workshop, Stuttgart, Germany, 8–9 May 2018



  1. NER for Europeana Newspapers – Clemens Neudecker (@cneudecker), Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
  2. Background
  3. Why Named Entity Recognition? • Analysis* of query log files from the National Library of Wales newspaper website: a vast majority of search queries contain either person or place names (* Paul Gooding, Exploring Usage of Digital Newspaper Archives through Web Log Analysis: A Case Study of Welsh Newspapers Online, presented at DH2014, Lausanne) • Improving Information Retrieval • Linking to authority files (Linked Data) • Historical Social Network Analysis (HNA/SNA)
  4. Languages • Dutch (1614–1900) • French (1814–1944) • German (1721–1949) • Together approx. 50% of the total collection
  5. Many challenges • Historical data (language) • Noisy data (OCR) • Multilingual data • Lack of extensive metadata • Lack of open resources (tagged corpora, gazetteers) • Lack of common annotation guidelines • Limitations of annotation tools
  6. Technology
  7. Reuse of existing NER tools • Simple evaluation of – Apache OpenNLP – Stanford CoreNLP – GATE • Choice of Stanford CoreNLP since – Java-based (thread-safe, scalable) – Good performance (F-measure) – Strong and active community – Rather robust against noisy input (CRF)
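For illustration only, here is a minimal Python sketch of sending OCRed text to a Stanford CoreNLP server over its HTTP API and collecting per-token NER labels. It assumes a CoreNLP server is already running on the default port 9000; the project's own pipeline is the Java-based ner-app described later, not this snippet.

```python
import json
import requests

# Assumes a Stanford CoreNLP server is running locally, e.g. started with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
CORENLP_URL = "http://localhost:9000"

def tag_entities(text, extra_props=None):
    """Send raw OCR text to CoreNLP and return (token, NER label) pairs."""
    properties = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
    if extra_props:  # e.g. settings for a German or French model
        properties.update(extra_props)
    response = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(properties)},
        data=text.encode("utf-8"),
    )
    response.raise_for_status()
    doc = response.json()
    return [(tok["word"], tok["ner"])
            for sentence in doc["sentences"]
            for tok in sentence["tokens"]]

if __name__ == "__main__":
    sample = "Reynolds arrived in Baltimore on Tuesday."
    for word, label in tag_entities(sample):
        print(f"{word}\t{label}")
```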
  8. Approach • Adaptation of Stanford CoreNLP by the KB National Library of the Netherlands to directly consume ENMAP (= Europeana Newspapers METS/ALTO profile) objects
  9. Approach • Export option: ALTO v3 with tags added
     <String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0" VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"></String>
     <String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0" VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"></String>
     …
     <Tags>
       <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
       <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
     </Tags>
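A rough sketch (not part of the Europeana Newspapers tooling itself) of how such a tagged ALTO file can be read back in Python, resolving each <String> element's TAGREFS attribute against the <NamedEntityTag> declarations. The namespace handling is deliberately simplified and the file name is a placeholder.

```python
import xml.etree.ElementTree as ET

def extract_entities(alto_path):
    """Yield (content, entity type) pairs from an ALTO file with NamedEntityTag refs."""
    tree = ET.parse(alto_path)
    root = tree.getroot()

    # Strip namespaces so element names match the plain tag names shown above.
    for elem in root.iter():
        if "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]

    # Build a lookup from tag ID to entity type, e.g. "Tag5" -> "Person".
    tag_types = {tag.get("ID"): tag.get("TYPE")
                 for tag in root.iter("NamedEntityTag")}

    for string_el in root.iter("String"):
        refs = string_el.get("TAGREFS")
        if not refs:
            continue
        for ref in refs.split():
            if ref in tag_types:
                yield string_el.get("CONTENT"), tag_types[ref]

if __name__ == "__main__":
    # "page.xml" is a placeholder path for an exported ALTO v3 file.
    for content, entity_type in extract_entities("page.xml"):
        print(f"{entity_type}\t{content}")
```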
  10. Annotation • Quick evaluation of annotation tools: – BRAT – WebAnno – INL Attestation Tool • Choice of INL Attestation Tool since: – Optimized for tagging speed – Supported by consortium partner (INL/IvdNT)
  11. Corpus creation • Selection of 100 pages per language • Processing of the OCRed texts with Stanford NER to obtain initial tagging results • Manual verification and annotation
  12. Corpus statistics
      Entity counts:
      Language   # tokens   # PER   # LOC   # ORG
      French      207,000   5,672   5,614   2,574
      Dutch       182,483   4,492   4,448   1,160
      German       96,735   7,914   6,143   2,784
      Relative to tokens:
      Language   # tokens   # PER   # LOC   # ORG
      French         100%   2,75%   2,71%   1,24%
      Dutch          100%   2,46%   2,44%   0,64%
      German         100%   8,18%   6,35%   2,88%
      OCR quality of the selected pages:
      Language   Word Error Rate (Bag of Words)   Reading Order Success Rate
      French     16,6%                            19,9%
      Dutch      17,6%                            23,2%
      German     15,9% / 21,9%                    13,6%
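The second table is simply the first table normalised by token count; a quick sketch of that calculation (small deviations from the slide may come from rounding in the reported counts):

```python
# Entity and token counts per language, taken from the corpus statistics above.
corpus = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch":  {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens":  96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}

for language, counts in corpus.items():
    shares = {label: counts[label] / counts["tokens"] * 100
              for label in ("PER", "LOC", "ORG")}
    print(language, " ".join(f"{label} {value:.2f}%" for label, value in shares.items()))
```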
  13. ner-app https://github.com/EuropeanaNewspapers/ner-app
  14. ner-corpora https://github.com/EuropeanaNewspapers/ner-corpora
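A small loader sketch for working with such corpora, assuming a CoNLL/BIO-style layout with one token per line, its label in the last column, and blank lines between sentences; the exact column format should be checked against the repository's documentation.

```python
def read_bio_file(path):
    """Read a whitespace-separated token/label file into a list of sentences.

    Assumes one token per line with its BIO label in the last column and
    blank lines separating sentences; verify against the actual ner-corpora files.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
                continue
            columns = line.split()
            token, label = columns[0], columns[-1]
            current.append((token, label))
    if current:
        sentences.append(current)
    return sentences
```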
  15. Evaluation: NL
  16. Evaluation: FR
  17. Evaluation: DE • A Named Entity Recognition Shootout for German. M. Riedl and S. Padó. Proceedings of ACL, Melbourne, Australia, 2018. To appear.
  18. NER vs OCR success rate [chart comparing NER and OCR success rates, scale 0.25–0.95]
  19. Future Plans
  20. Improving performance • Possible additional features – Distributional similarity (Clark 2003) – Semantic generalization (Faruqui & Padó 2010) – Word embeddings (Braune 2017) • Gazetteers – Person names, historical place names • Data cleanup and improvement – https://github.com/EuropeanaNewspapers/ner-corpora/wiki
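As a toy illustration of the gazetteer idea only: name lists can be turned into extra token-level features for a CRF tagger. The file paths below are hypothetical placeholders, not resources from the project.

```python
def load_gazetteer(path):
    """Load one entry per line; lower-cased for matching against OCRed tokens."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

# Hypothetical gazetteer files, e.g. exported from authority data.
persons = load_gazetteer("gazetteers/person_names.txt")
places = load_gazetteer("gazetteers/historical_place_names.txt")

def gazetteer_features(token):
    """Binary features a CRF tagger could use alongside its usual ones."""
    lowered = token.lower()
    return {
        "in_person_gazetteer": lowered in persons,
        "in_place_gazetteer": lowered in places,
    }
```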
  21. Trias NER • Combination and voting of different NER classifiers, e.g. – Stanford CoreNLP – spaCy – NLTK • Inspiration: https://github.com/KBNLresearch/Trias_NER
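A minimal sketch of the voting idea, assuming the individual classifiers have already been run and their per-token labels aligned to the same tokenisation; the actual Trias_NER code may combine outputs differently.

```python
from collections import Counter

def vote(label_sequences):
    """Majority vote over aligned per-token label sequences from several taggers.

    label_sequences: list of equally long label lists, one per classifier,
    e.g. [stanford_labels, spacy_labels, nltk_labels]. Ties fall back to 'O'.
    """
    voted = []
    for labels in zip(*label_sequences):
        (top_label, top_count), *rest = Counter(labels).most_common()
        if rest and rest[0][1] == top_count:   # no clear majority
            voted.append("O")
        else:
            voted.append(top_label)
    return voted

# Example: three classifiers disagree on the second token.
print(vote([["B-PER", "O", "B-LOC"],
            ["B-PER", "B-ORG", "B-LOC"],
            ["B-PER", "O", "O"]]))            # -> ['B-PER', 'O', 'B-LOC']
```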
  22. Disambiguation • Disambiguation of person and place names • Inspiration: https://github.com/KBNLresearch/europeananp-dbpedia-disambiguation
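One common way to approach this, sketched here with made-up candidate data rather than the actual KBNLresearch pipeline: score each DBpedia candidate by how well its description overlaps with the words around the mention in the newspaper article.

```python
def context_overlap_score(article_context, candidate_description):
    """Crude bag-of-words overlap between the mention's context and a candidate."""
    context_words = set(article_context.lower().split())
    description_words = set(candidate_description.lower().split())
    return len(context_words & description_words)

def disambiguate(article_context, candidates):
    """Pick the (uri, description) candidate whose description best fits the context."""
    return max(candidates,
               key=lambda cand: context_overlap_score(article_context, cand[1]))[0]

# Hypothetical candidates for the surface form "Baltimore".
candidates = [
    ("http://dbpedia.org/resource/Baltimore",
     "Baltimore is a city in the state of Maryland in the United States"),
    ("http://dbpedia.org/resource/Baltimore,_County_Cork",
     "Baltimore is a village in County Cork Ireland"),
]
context = "the steamer left the port of Baltimore in the United States for New York"
print(disambiguate(context, candidates))
```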
  23. Linking • Linking of recognised and disambiguated NEs to authority files (e.g. Wikidata, GND) • Inspiration: https://github.com/KBNLresearch/dac
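For Wikidata, candidate items for a disambiguated name can be retrieved through the public wbsearchentities API; a small sketch follows (the dac tool linked above is a separate, more elaborate implementation).

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_wikidata(name, language="en", limit=5):
    """Return candidate Wikidata items (QID, label, description) for a name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
        "limit": limit,
    }
    response = requests.get(WIKIDATA_API, params=params, timeout=10)
    response.raise_for_status()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in response.json().get("search", [])]

if __name__ == "__main__":
    for qid, label, description in search_wikidata("Baltimore"):
        print(qid, label, "-", description)
```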
