Europeana Newspapers - Data, Tools & Future Plans

Europeana Newspapers
Data, Tools & Future Plans
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker

Europeana Newspapers
• EU CIP ICT-PSP project, 2012 – 2015
• www.europeana-newspapers.eu
• Main outcomes
– TEL Historic Newspapers Portal:
http://www.theeuropeanlibrary.org/tel4/newspapers
– Deliverables:
http://www.europeana-newspapers.eu/public-materials/deliverables/
– Tools:
http://www.europeana-newspapers.eu/public-materials/tools/
– Final Report:
http://europeananewspapers.github.io/

Data
• 1618 – 2016
• 12 countries
• 40 languages
• 120 TB
• Ca. 1,000 titles
• 3.3M issues

Data
• Metadata for more than 20 million pages
• 12 million pages processed with OCR
• 2 million pages processed with OLR
• Most content licensed as Public Domain
• Metadata licensed CC0
• Copyright cut-off date

Data
• JPEG 2000 (JP2) images for use with the IIPImage server (iipsrv)
• METS container with embedded MODS
for structural and bibliographic metadata
• ALTO for OCRed text
• EDM for Europeana
→ Europeana Newspapers METS/ALTO Profile (ENMAP)
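
To illustrate how the ALTO layer carries the OCRed text, here is a minimal sketch that reads the words of each text line from an ALTO file. The sample document is invented and heavily trimmed for illustration; real ENMAP files contain far more detail (coordinates, styles, METS wrapper), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed ALTO page used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner" WC="0.93"/>
            <SP/>
            <String CONTENT="Tageblatt" WC="0.88"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgen-Ausgabe" WC="0.71"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the OCR text of each TextLine as a list of strings."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            # Each word is a String element; its text is in the CONTENT attribute.
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on local element names rather than a fixed namespace is a design choice that lets the same sketch read files from different ALTO versions.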

Data
• Portals
– http://www.theeuropeanlibrary.org/tel4/newspapers
– http://europeana.eu/portal/search.html?query=europeana_collectionName%3A92*ewspapers*&rows=24&qt=false
• Downloads
– https://pro.europeana.eu/itemtype/newspapers
– http://test-solr-mongo.eanadev.org/europeana-research-newspapers-dump/

Preprocessing
• Preprocessing with adaptive binarization to
reduce overall image file size and processing
time (more than 90% less data volume at the
cost of less than 1% lower OCR accuracy)
• Preprocessing to create tiled JPEG 2000 files
for zooming, using GraphicsMagick + Kakadu
• Created an easy-to-use set of preprocessing tools
that also validate and harmonize input data
for efficient OCR/OLR processing
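
The slides do not say which adaptive binarization algorithm the project used. As a stand-in, the sketch below shows one common variant, local mean thresholding: a pixel turns black when it is clearly darker than the average of its neighbourhood. The function name, window size, and offset are assumptions for illustration, not the project's settings.

```python
def adaptive_binarize(gray, window=3, offset=10):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes black (0) when it is darker than the mean of its
    local window minus `offset`; otherwise it becomes white (255).
    """
    height, width = len(gray), len(gray[0])

    # Integral image with a zero border, so any window sum costs four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = window // 2
    result = []
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            out_row.append(0 if gray[y][x] < mean - offset else 255)
        result.append(out_row)
    return result


if __name__ == "__main__":
    page = [[200, 200, 200],
            [200, 50, 200],
            [200, 200, 200]]
    for row in adaptive_binarize(page):
        print(row)
```

A bitonal result needs one bit per pixel instead of eight or more, which is one plausible reason for the large reduction in data volume.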

OCR/OLR
• OCR: ABBYY FineReader Engine 11
– Gothic (Fraktur) license, charged per page (A4!)
– 4 servers with 8 cores each = 32 processing cores
– Average processing time of 5 s per newspaper page
• OLR: CCS docWorks
– Article separation & page classification
– Possibility for post-correction/validation of results

NER
• Stanford CoreNLP Named Entity Recognition
(Conditional Random Fields)
• Adapted for METS/ALTO processing
• Added ALTO v3 (tags) output
• https://github.com/EuropeanaNewspapers/ner-app
• Annotated training & evaluation data
• 100 pages each for (historical) German, French, Dutch
• https://github.com/EuropeanaNewspapers/ner-corpora
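
ALTO v3 introduced a Tags section whose entries can be referenced from words via TAGREFS. The sketch below shows how named entities stored that way could be read back. The sample XML is invented, and the exact attributes written by ner-app (for example how TYPE and LABEL are filled) are an assumption here, not taken from the tool's output.

```python
import xml.etree.ElementTree as ET

ALTO3 = "{http://www.loc.gov/standards/alto/ns-v3#}"

# Invented sample; attribute usage is an assumption, not real ner-app output.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Tags>
    <NamedEntityTag ID="Tag1" TYPE="Location" LABEL="Berlin"/>
    <NamedEntityTag ID="Tag2" TYPE="Person" LABEL="Otto von Bismarck"/>
  </Tags>
  <Layout><Page ID="P1"><PrintSpace><TextBlock ID="TB1"><TextLine>
    <String CONTENT="Otto" TAGREFS="Tag2"/>
    <String CONTENT="von" TAGREFS="Tag2"/>
    <String CONTENT="Bismarck" TAGREFS="Tag2"/>
    <String CONTENT="in"/>
    <String CONTENT="Berlin" TAGREFS="Tag1"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def entities(xml_text):
    """Return (type, label, words) for every named entity tag used on the page."""
    root = ET.fromstring(xml_text)
    # Map each tag ID to its type and label.
    tags = {tag.get("ID"): (tag.get("TYPE"), tag.get("LABEL"))
            for tag in root.iter(ALTO3 + "NamedEntityTag")}
    # Collect the words that point at each tag, in reading order.
    words_by_tag = {}
    for string in root.iter(ALTO3 + "String"):
        for ref in string.get("TAGREFS", "").split():
            if ref in tags:
                words_by_tag.setdefault(ref, []).append(string.get("CONTENT"))
    return [(tags[ref][0], tags[ref][1], words)
            for ref, words in words_by_tag.items()]


if __name__ == "__main__":
    for entity_type, label, words in entities(SAMPLE):
        print(entity_type, label, words)
```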

Evaluation
• Scenario-based performance evaluation of
OCR/OLR using PAGE ground truth
• Ground truth dataset:
http://primaresearch.org/datasets/ENP
• Performance Evaluation Report:
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
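
The evaluation report uses its own scenario-based measures, which are not reproduced here. As a much simpler stand-in, the sketch below shows how OCR output can be compared against ground truth text with a character accuracy based on edit distance. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_text):
    """Share of ground truth characters recognised correctly (0.0 to 1.0)."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    print(character_accuracy("Zeitung", "Zeitnng"))
```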

IIIF
• International Image Interoperability
Framework (iiif.io) for online presentation
and aggregation
• Implementing Image API and Presentation API
• Europeana IIIF Task Force:
https://pro.europeana.eu/post/iiif-adoption-by-europeana-future-perspectives-for-the-network-1
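
The IIIF Image API addresses images through a fixed URL pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below builds such URLs. The server address and identifier are placeholders, not a real Europeana endpoint, and the size keyword "max" assumes Image API 2.1 or later.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    The identifier is percent-encoded so that slashes inside it are not
    mistaken for path separators.
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


if __name__ == "__main__":
    # Placeholder server and identifier, for illustration only.
    base = "https://iiif.example.org/image"
    # Whole page at the largest size the server offers.
    print(iiif_image_url(base, "newspaper/page 1"))
    # A 1024 x 1024 pixel detail from the top-left corner, scaled to 512 px wide.
    print(iiif_image_url(base, "newspaper/page 1",
                         region="0,0,1024,1024", size="512,"))
```

Requesting regions at reduced sizes is what lets a viewer zoom into a large newspaper page without downloading the full image.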

Future plans
• Migration of data from TEL (closed in December 2016)
to new Europeana Thematic Collections
http://europeana.eu/portal/
• Re-develop Newspapers API
• Re-develop search & browse interface
• Add new newspaper content
• Create virtual exhibitions & browse entry points

Future plans
https://acceptance-npc.eanadev.org/portal/de/collections/newspapers

Future plans
• Automatic OCR error correction
• Improved newspaper layout analysis
• Named Entity Recognition, Disambiguation
and Linking (Wikidata)
• Extraction and classification of image content
• Deep semantic structuring of newspapers
• User corrections and annotations

Collaboration with Researchers
• Interviews with researchers
• Europeana Research
• CLARIN
• Viral Texts
• Oceanic Exchanges
• DDB
• impresso

Coding da Vinci
https://codingdavinci.de/

Thank you for your attention!
Questions?
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker
