Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Large-scale refinement of digital historical 
newspapers with named entity recognition 
IFLA Newspaper Pre-Conference 
14 ...
Overview 
• Background 
• NER Introduction 
• Approach 
• Challenges 
• Scalability 
• First results 
• Outlook 
This proj...
Background 
• Europeana Newspapers 
EU Best Practice Network 
• 10 million newspaper pages 
with full-text from 12 librari...
Named entity recognition (I) 
1. Detect names of 
persons, places, 
organisations 
This project is partially funded under ...
Named entity recognition (II) 
2. Disambiguate entities 
This project is partially funded under the ICT Policy Support Pro...
Named entity recognition (III) 
3. Link to online resources 
This project is partially funded under the ICT Policy Support...
Approach (I) 
• Tackle content in 
Dutch, German, French 
(about 50% of the 10m pages) 
This project is partially funded u...
Approach (II) 
• Use a machine learning tool (open source) 
developed by Stanford University, adapted 
for Europeana Newsp...
Approach (III) 
• Create (and release) training 
material by manually annotating 
named entities on OCR‘d 
newspaper pages...
Challenges 
• OCR quality 
• Multiple (mixed) 
languages 
• Historical spelling 
This project is partially funded under th...
Scalability 
• Stanford NER software is multi-threaded 
e.g. 4 CPU cores – 4x throughput 
• Optimise the NER classifier by...
First results (Dutch) 
Persons Locations Organizations 
Precision 0.940 0.950 0.942 
Recall 0.588 0.760 0.559 
F-measure 0...
First results (French) 
Persons Locations 
Precision 0.529 0.548 
Recall 0.834 0.216 
F-measure 0.622 0.310 
This project ...
Outlook 
• Q3: Release of training data for Named Entity Recognition 
in Dutch, German, French 
• Q3: First results for Ge...
www.europeana-newspapers.eu/ 
www.theeuropeanlibrary.org/tel4/newspapers 
https://github.com/KBNLresearch/europeananp-ner ...
Upcoming SlideShare
Loading in …5
×

Large scale refinement of digital historical newspapers with named entities recognition

681 views

Published on

Large scale refinement of digital historical newspapers with named entities recognition
IFLA 2013 Satellite Meeting on Newspaper & Genloc Sections, 13-14 August 2014, Geneva, Switzerland.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Large scale refinement of digital historical newspapers with named entities recognition

  1. 1. Large-scale refinement of digital historical newspapers with named entity recognition IFLA Newspaper Pre-Conference 14 August 2014, Geneva Clemens Neudecker, SBB, @cneudecker
  2. 2. Overview • Background • NER Introduction • Approach • Challenges • Scalability • First results • Outlook This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  3. 3. Background • Europeana Newspapers EU Best Practice Network • 10 million newspaper pages with full-text from 12 libraries • 36 million newspaper pages with metadata for Europeana This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  4. 4. Named entity recognition (I) 1. Detect names of persons, places, organisations This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  5. 5. Named entity recognition (II) 2. Disambiguate entities This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  6. 6. Named entity recognition (III) 3. Link to online resources This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  7. 7. Approach (I) • Tackle content in Dutch, German, French (about 50% of the 10m pages) This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  8. 8. Approach (II) • Use a machine learning tool (open source) developed by Stanford University, adapted for Europeana Newspapers by KBNL https://github.com/KBNLresearch/europeananp-ner This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  9. 9. Approach (III) • Create (and release) training material by manually annotating named entities on OCR‘d newspaper pages This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  10. 10. Challenges • OCR quality • Multiple (mixed) languages • Historical spelling This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  11. 11. Scalability • Stanford NER software is multi-threaded e.g. 4 CPU cores – 4x throughput • Optimise the NER classifier by filtering noise and sentences without NE‘s marked • Robust proven Java technology This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  12. 12. First results (Dutch) Persons Locations Organizations Precision 0.940 0.950 0.942 Recall 0.588 0.760 0.559 F-measure 0.689 0.838 0.671 This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  13. 13. First results (French) Persons Locations Precision 0.529 0.548 Recall 0.834 0.216 F-measure 0.622 0.310 This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp * Score for organisations omitted since not enough present in the source material
  14. 14. Outlook • Q3: Release of training data for Named Entity Recognition in Dutch, German, French • Q3: First results for German (Austrian, Italian/South Tirol), final results for Dutch, French • Q4: Release of software (open source) for disambiguating and linking of NER results to DBPedia This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  15. 15. www.europeana-newspapers.eu/ www.theeuropeanlibrary.org/tel4/newspapers https://github.com/KBNLresearch/europeananp-ner Thank you for your attention! IFLA Newspaper Pre-Conference 14 August 2014, Geneva Clemens Neudecker, SBB, @cneudecker

×