1. Europeana Newspapers:
The Gateway to European Newspapers Online
IFLA 2013 SATELLITE MEETING ON
NEWSPAPER & GENLOC SECTIONS
Singapore, 14 August 2013
Clemens Neudecker
@cneudecker
2. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Overview
• Objectives
• Overview of Dataset
• Workflows & Technologies
• Questions & Answers
Image: Nationaal Archief The Netherlands
3. Objectives
• Refine 10 million pages with OCR, OLR and NER
• Ingest metadata for 18 million pages into Europeana
• Create a full text content browser for newspapers
• Create a unified METS/ALTO profile (ENMAP)
• Produce tools to ease the creation of ENMAP objects
• Share best practices and provide recommendations
4. Who
12 content providers
2 networking partners
4 technology providers
1 aggregator
5. Recently associated
6. The data
7. Europeana Newspapers Dataset (1)
8. Europeana Newspapers Dataset (2)
9. Europeana Newspapers Dataset (3)
10. Europeana Newspapers Dataset (4)
11. The workflow
12. OCR @ UIBK
• OCR = Optical Character Recognition
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
• METS/ALTO package containing images, metadata & full text
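To illustrate how the full text in such a package can be read back, here is a minimal Python sketch that collects the words of each text line from an ALTO file. The sample XML is a hypothetical, heavily trimmed fragment, and the helper names `local_name` and `alto_lines` are made up for this example; real project files carry coordinates, styles and many more attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO fragment used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Singapore"/><SP/><String CONTENT="2013"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the ALTO version does not matter."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the words it contains."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            # Each recognised word is a String element with a CONTENT attribute.
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

Matching on the local tag name rather than a fixed namespace is a design choice for this sketch, so that files written against different ALTO versions can be read the same way.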
13. OLR @ CCS
• OLR = Optical Layout Recognition
• Technologies: docWorks
• Separation of columns, articles, headlines, page classes
• METS/ALTO package containing images, metadata & full text
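Once articles have been separated, they are typically described in the logical structure map of the METS file. The following Python sketch lists article labels from such a map. The sample XML, its TYPE and LABEL values, and the function name `article_labels` are assumptions for illustration, not the ENMAP profile itself.

```python
import xml.etree.ElementTree as ET

METS_URI = "http://www.loc.gov/METS/"
METS_NS = {"mets": METS_URI}

# Hypothetical logical structure map; TYPE and LABEL values are assumptions.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="ISSUE">
      <mets:div TYPE="ARTICLE" LABEL="Harbour news"/>
      <mets:div TYPE="ADVERTISEMENT" LABEL="Soap"/>
      <mets:div TYPE="ARTICLE" LABEL="Weather report"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def article_labels(xml_text):
    """Return the LABEL of every div typed ARTICLE in the logical structMap."""
    root = ET.fromstring(xml_text)
    labels = []
    for struct_map in root.findall("mets:structMap", METS_NS):
        if struct_map.get("TYPE") != "LOGICAL":
            continue  # skip physical or other structure maps
        for div in struct_map.iter("{%s}div" % METS_URI):
            if div.get("TYPE") == "ARTICLE":
                labels.append(div.get("LABEL", ""))
    return labels


print(article_labels(SAMPLE_METS))
```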
14. NER @ KB
• NER = Named Entity Recognition
• Technologies: Stanford CRF-NER
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Detection of named entities: person, location, organization
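A sequence tagger of this kind labels each token, so consecutive tokens with the same label still have to be merged into entities. The Python sketch below shows one simple way to do that. It does not call the Stanford tagger; the hand-tagged sample, the label strings and the function name `group_entities` are assumptions for illustration.

```python
def group_entities(tagged_tokens, outside="O"):
    """Merge runs of identically labelled tokens into (text, label) pairs."""
    entities = []
    current_words, current_label = [], outside
    for word, label in tagged_tokens:
        if label != current_label:
            # A label change closes the entity collected so far, if any.
            if current_label != outside:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], label
        if label != outside:
            current_words.append(word)
    if current_label != outside:
        entities.append((" ".join(current_words), current_label))
    return entities


# Hypothetical token/label pairs, tagged by hand for this example.
TAGGED = [
    ("Clemens", "PERSON"), ("Neudecker", "PERSON"), ("spoke", "O"),
    ("in", "O"), ("Singapore", "LOCATION"), ("for", "O"),
    ("Europeana", "ORGANIZATION"),
]

print(group_entities(TAGGED))
```

Note that this simple grouping cannot tell apart two different entities of the same class that stand directly next to each other; they would be merged into one.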
15. QA @ PRImA
• Layout and OCR evaluation
• Technologies: Ground truth + Evaluation Tools (IMPACT)
• In-depth, scenario-driven evaluation using profiles with more than 600 metrics
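As a rough idea of what comparing OCR output against ground truth involves, the Python sketch below computes character accuracy from the edit distance between the two texts. This is a widely used OCR measure, shown here only as an illustration; it is not the evaluation tooling named on the slide, and the function names are made up for this example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_text):
    """Share of ground-truth characters reproduced correctly, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Hypothetical example: one misrecognised character in a seven-letter word.
print(edit_distance("Zeitung", "Zeitnng"))
print(character_accuracy("Zeitung", "Zeitung"))
```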
16. Full-text search @ TEL
Blog: www.europeana-newspapers.eu
Workshop: 16 Sept. 2013 (Amsterdam)
17. Thank you for your attention!
clemens.neudecker@kb.nl