4. Data
• 1618 – 2016
• 12 countries
• 40 languages
• 120 TB
• Ca. 1,000 titles
• 3.3M issues
5. Data
• Metadata for more than 20 million pages
• 12 million pages processed with OCR
• 2 million pages processed with OLR
• Most content licensed as Public Domain
• Metadata licensed CC0
• Copyright cut-off date
6. Data
• JPEG 2000 images for use with the IIPImage server (iipsrv)
• METS container with embedded MODS
for structural and bibliographic metadata
• ALTO for OCRed text
• EDM for Europeana
Europeana Newspapers METS/ALTO Profile (ENMAP)
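As an illustration of how the OCR text sits in ALTO, here is a minimal sketch that reads the `CONTENT` attribute of `String` elements, line by line. The sample fragment is hypothetical and heavily trimmed; real ENMAP files carry a namespace, coordinates and further attributes, and the helper names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO fragment (no namespace, no coordinates).
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Berliner"/><SP/><String CONTENT="Tageblatt"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgen-Ausgabe"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip an XML namespace such as '{http://...}String' down to 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining its String elements with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))  # ['Berliner Tageblatt', 'Morgen-Ausgabe']
```

Matching on the local tag name keeps the sketch independent of the ALTO schema version, since each version uses a different namespace URI.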
9. Preprocessing
• Preprocessing with adaptive binarization to
reduce overall image file size and processing
time (>90% reduction in data volume at the
cost of <1% lower OCR accuracy)
• Preprocessing to create tiled JPEG 2000 files
for zooming, using GraphicsMagick + Kakadu
• Created an easy-to-use set of preprocessing
tools that also validate and harmonize input
data for efficient OCR/OLR processing
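The slides do not say which adaptive binarization method was used. As a rough sketch of the general idea, the following pure-Python example thresholds each pixel against the mean of its local neighbourhood. The window size, offset and sample image are assumptions chosen for illustration only.

```python
def adaptive_binarize(image, window=3, offset=40):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes black (0) when it is darker than the mean of its
    window x window neighbourhood minus `offset`; otherwise white (255).
    This is a simple local-mean threshold, not the project's actual method.
    """
    height, width = len(image), len(image[0])
    radius = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(0 if image[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


# Hypothetical page fragment: bright paper on the left, darkened paper on
# the right, with one ink pixel in each half.
page = [
    [200, 200, 200, 120, 120],
    [200,  60, 200, 120,  30],
    [200, 200, 200, 120, 120],
]

binary = adaptive_binarize(page)
adaptive_black = sum(value == 0 for row in binary for value in row)
global_black = sum(value < 128 for row in page for value in row)
print(adaptive_black, global_black)  # 2 7
```

A fixed global threshold of 128 would turn the whole darkened area black, while the local threshold keeps only the two ink pixels. The resulting 1-bit image is also far smaller than the grayscale original, which is where the data volume saving comes from.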
10. OCR/OLR
• OCR: ABBYY FineReader Engine 11
– Gothic license per page (A4!)
– 4 servers with 8 cores each = 32 processing cores
– Average processing time of 5s per newspaper page
• OLR: CCS docWorks
– Article separation & page classification
– Possibility for post-correction/validation of results
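For a sense of scale, here is a back-of-envelope estimate of the total OCR runtime, combining the 5 s average with the 12 million OCRed pages given on the Data slide. The slides do not state whether the 5 s applies per core or to the cluster as a whole, so both readings are shown. These are assumptions for illustration, not reported project figures.

```python
PAGES = 12_000_000      # pages processed with OCR (Data slide)
SECONDS_PER_PAGE = 5    # average processing time per newspaper page
CORES = 4 * 8           # 4 servers with 8 cores each

SECONDS_PER_DAY = 24 * 60 * 60


def to_days(total_seconds):
    """Convert a duration in seconds to days."""
    return total_seconds / SECONDS_PER_DAY


# Reading A (assumption): one core needs 5 s per page and all 32 cores
# run in parallel without idle time.
days_if_per_core = to_days(PAGES * SECONDS_PER_PAGE / CORES)

# Reading B (assumption): 5 s per page is already the rate of the whole cluster.
days_if_aggregate = to_days(PAGES * SECONDS_PER_PAGE)

print(f"{days_if_per_core:.1f} days")   # 21.7 days
print(f"{days_if_aggregate:.1f} days")  # 694.4 days
```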
11. NER
• Stanford CoreNLP Named Entity Recognition
(Conditional Random Fields)
• Adapted for METS/ALTO processing
• Added ALTO v3 (tags) output
• https://github.com/EuropeanaNewspapers/ner-app
• Annotated training & evaluation data
• 100 pages each for (historical) German, French, Dutch
• https://github.com/EuropeanaNewspapers/ner-corpora
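ALTO v3 introduced a `Tags` section whose `NamedEntityTag` entries can be referenced from `String` elements through the `TAGREFS` attribute. The sketch below shows how such output might be read back. The sample fragment is hypothetical, and using `LABEL` to hold the entity category is an assumption; the actual ner-app output should be checked before relying on it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical ALTO v3 fragment (namespace and coordinates omitted).
SAMPLE_ALTO_V3 = """<alto>
  <Tags>
    <NamedEntityTag ID="Tag1" LABEL="Person"/>
    <NamedEntityTag ID="Tag2" LABEL="Location"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Otto" TAGREFS="Tag1"/>
    <String CONTENT="von" TAGREFS="Tag1"/>
    <String CONTENT="Bismarck" TAGREFS="Tag1"/>
    <String CONTENT="in"/>
    <String CONTENT="Berlin" TAGREFS="Tag2"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip an XML namespace such as '{http://...}String' down to 'String'."""
    return tag.rsplit("}", 1)[-1]


def tagged_entities(xml_text):
    """Return (label, text) pairs, one per NamedEntityTag that is referenced."""
    root = ET.fromstring(xml_text)
    labels = {}                 # tag ID -> label
    tokens = defaultdict(list)  # tag ID -> words that reference it
    for element in root.iter():
        name = local_name(element.tag)
        if name == "NamedEntityTag":
            labels[element.attrib["ID"]] = element.attrib.get("LABEL", "")
        elif name == "String":
            # TAGREFS may hold several space-separated tag IDs.
            for ref in element.attrib.get("TAGREFS", "").split():
                tokens[ref].append(element.attrib.get("CONTENT", ""))
    return [
        (labels[ref], " ".join(words))
        for ref, words in tokens.items()
        if ref in labels
    ]


print(tagged_entities(SAMPLE_ALTO_V3))
# [('Person', 'Otto von Bismarck'), ('Location', 'Berlin')]
```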
12. Evaluation
• Scenario-based performance evaluation of
OCR/OLR using PAGE ground truth
• Ground truth dataset:
http://primaresearch.org/datasets/ENP
• Performance Evaluation Report:
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
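The scenario-based methodology is described in the report itself. As a generic illustration of comparing OCR output against ground truth, the sketch below computes a character accuracy from the Levenshtein edit distance. It is a common textbook metric and is not claimed to be the measure used in the project.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), clamped to the range 0..1."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical example: ground truth versus an OCR result with one wrong letter.
print(edit_distance("Zeitung", "Zeitnng"))                 # 1
print(round(character_accuracy("Zeitung", "Zeitnng"), 3))  # 0.857
```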
14. IIIF
• International Image Interoperability
Framework (iiif.io) for online presentation
and aggregation
• Implementing Image API and Presentation API
• Europeana IIIF Task Force:
https://pro.europeana.eu/post/iiif-adoption-by-europeana-future-perspectives-for-the-network-1
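The IIIF Image API addresses images through a fixed URL pattern, `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below builds such URLs; the server base URL and identifier are hypothetical, and the helper function is an assumption made for this example. Note that `max` as the full-size keyword applies to Image API 2.1 and later, while older servers expect `full`.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    The identifier is percent-encoded so that any slashes inside it are
    not mistaken for path separators.
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, for illustration only.
BASE = "https://iiif.example.org/image"
PAGE_ID = "newspaper/issue-1/page-1"

# Whole page at the largest available size.
print(iiif_image_url(BASE, PAGE_ID))

# Top-left quarter of the page, scaled to 800 pixels wide, for zooming.
print(iiif_image_url(BASE, PAGE_ID, region="pct:0,0,50,50", size="800,"))
```

Because region and size are part of the URL, a viewer can request only the tiles it needs at the current zoom level.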
16. Future plans
• Migration of data from TEL (closed 12/2016)
to new Europeana Thematic Collections
http://europeana.eu/portal/
• Re-develop Newspapers API
• Re-develop search & browse interface
• Add new newspaper content
• Create virtual exhibitions & browse entry points
18. Future plans
• Automatic OCR error correction
• Improved newspaper layout analysis
• Named Entity Recognition, Disambiguation
and Linking (Wikidata)
• Extraction and classification of image content
• Deep semantic structuring of newspapers
• User corrections and annotations
19. Collaboration with Researchers
• Interviews with researchers
• Europeana Research
• CLARIN
• Viral Texts
• Oceanic Exchanges
• DDB
• impresso