Europeana Newspapers - Data, Tools & Future Plans

Europeana Newspapers
Data, Tools & Future Plans
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker
Europeana Newspapers
• EU FP7 ICT-PSP Project 2012 – 2015
• www.europeana-newspapers.eu
• Main outcomes
– TEL Historic Newspapers Portal:
http://www.theeuropeanlibrary.org/tel4/newspapers
– Deliverables:
http://www.europeana-newspapers.eu/public-materials/deliverables/
– Tools:
http://www.europeana-newspapers.eu/public-materials/tools/
– Final Report:
http://europeananewspapers.github.io/
Data
Data
• 1618 – 2016
• 12 countries
• 40 languages
• 120 TB
• Ca. 1,000 titles
• 3.3M issues
Data
• Metadata for more than 20 million pages
• 12 million pages processed with OCR
• 2 million pages processed with OLR
• Most content licensed as Public Domain
• Metadata licensed CC0
• Copyright cut-off date
Data
• JPEG 2000 images for use with IIPImage server (iipsrv)
• METS container with embedded MODS
for structural and bibliographic metadata
• ALTO for OCRed text
• EDM for Europeana
→ Europeana Newspapers METS/ALTO Profile (ENMAP)
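As a rough illustration of how ALTO carries the OCRed text, the sketch below pulls the text lines out of a small ALTO fragment. The sample XML is invented for this example (it is not taken from the project data), and the helper name `alto_lines` is hypothetical; real ENMAP files contain many more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Invented ALTO v2 fragment, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner" WC="0.93"/>
            <SP/>
            <String CONTENT="Tageblatt" WC="0.88"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgen-Ausgabe" WC="0.71"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


for line in alto_lines(SAMPLE_ALTO):
    print(line)
```

Matching on the local element name rather than a fixed namespace is a design choice made here so that the same helper could read ALTO v2 and v3 files alike.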
Data
• Portals
– http://www.theeuropeanlibrary.org/tel4/newspapers
– http://europeana.eu/portal/search.html?query=europeana_collectionName%3A92*ewspapers*&rows=24&qt=false
• Downloads
– https://pro.europeana.eu/itemtype/newspapers
– http://test-solr-mongo.eanadev.org/europeana-research-newspapers-dump/
Tools & Technologies
Preprocessing
• Adaptive binarization to reduce overall image
file size and processing time (>90% less data
volume at the cost of <1% lower OCR accuracy)
• Creation of tiled JPEG 2000 files for zooming,
using GraphicsMagick + Kakadu
• Created an easy-to-use set of preprocessing tools
that also validate and harmonize input data
for efficient OCR/OLR processing
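The slides do not say which binarization algorithm was used. As a minimal sketch of the general idea, the example below applies a simple local-mean threshold in pure Python: a pixel counts as ink when it is clearly darker than its neighbourhood. The function name `adaptive_binarize`, the `radius` and `offset` parameters, and the tiny sample page are all assumptions made for illustration.

```python
def adaptive_binarize(gray, radius=1, offset=20):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes 0 (ink) if it is darker than the mean of its
    (2*radius+1)^2 neighbourhood minus `offset`, otherwise 255 (background).
    """
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the window at the image borders.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


# Invented sample: bright paper on the left, darker (e.g. yellowed) paper on
# the right, with one ink pixel in each region.
page = [
    [200, 200, 200, 120, 120],
    [200,  60, 200, 120,  30],
    [200, 200, 200, 120, 120],
]

binary = adaptive_binarize(page)
for row in binary:
    print(row)
```

Because the threshold follows the local brightness, the darker right-hand background is still treated as paper, whereas a single global threshold of, say, 128 would turn that whole region into ink. A two-level image of this kind also compresses far better than grayscale, which is the effect behind the data reduction reported above.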
OCR/OLR
• OCR: ABBYY FineReader Engine 11
– Gothic license per page (A4!)
– 4 servers with 8 cores each = 32 processing cores
– Average processing time of 5s per newspaper page
• OLR: CCS docWorks
– Article separation & page classification
– Possibility for post-correction/validation of results
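To get a feel for what these numbers mean at project scale, the back-of-the-envelope sketch below estimates wall-clock time. It rests on assumptions the slides do not confirm: that the 5 s average applies per page on a single core, that the 12 million OCR pages from the Data slides are the workload, and that the 32 cores run in perfect parallel without idle time.

```python
SECONDS_PER_DAY = 86_400


def ocr_wall_clock_days(pages, seconds_per_page, parallel_workers):
    """Estimate wall-clock days, assuming perfectly parallel page processing."""
    total_seconds = pages * seconds_per_page / parallel_workers
    return total_seconds / SECONDS_PER_DAY


# Assumed workload: 12 million pages at 5 s each.
on_32_cores = ocr_wall_clock_days(12_000_000, 5, 32)
on_1_core = ocr_wall_clock_days(12_000_000, 5, 1)

print(f"32 cores: {on_32_cores:.1f} days")
print(f" 1 core : {on_1_core:.1f} days")
```

Under these assumptions the difference is roughly three weeks versus close to two years of pure compute time; actual production time would also include data transfer, validation and re-runs.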
NER
• Stanford CoreNLP Named Entity Recognition
(Conditional Random Fields)
• Adapted for METS/ALTO processing
• Added ALTO v3 (tags) output
• https://github.com/EuropeanaNewspapers/ner-app
• Annotated training & evaluation data
• 100 pages each for (historical) German, French, Dutch
• https://github.com/EuropeanaNewspapers/ner-corpora
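ALTO v3 adds a `Tags` section whose `NamedEntityTag` entries can be referenced from words via the `TAGREFS` attribute. The sketch below reads such references back out of a small fragment. The sample XML, the `TYPE` values and the helper name `tagged_entities` are invented for illustration; the exact output conventions of the ner-app may differ.

```python
import xml.etree.ElementTree as ET

ALTO3 = "http://www.loc.gov/standards/alto/ns-v3#"
NS = {"alto": ALTO3}

# Invented ALTO v3 fragment, for illustration only.
SAMPLE = f"""<alto xmlns="{ALTO3}">
  <Tags>
    <NamedEntityTag ID="Tag1" TYPE="Location" LABEL="Berlin"/>
    <NamedEntityTag ID="Tag2" TYPE="Person" LABEL="Otto von Bismarck"/>
  </Tags>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Otto" TAGREFS="Tag2"/>
            <String CONTENT="von" TAGREFS="Tag2"/>
            <String CONTENT="Bismarck" TAGREFS="Tag2"/>
            <String CONTENT="in"/>
            <String CONTENT="Berlin" TAGREFS="Tag1"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def tagged_entities(xml_text):
    """Return (type, label, words) for each named-entity tag used in the text."""
    root = ET.fromstring(xml_text)
    # Map tag ID -> (TYPE, LABEL) from the Tags section.
    tags = {
        tag.get("ID"): (tag.get("TYPE"), tag.get("LABEL"))
        for tag in root.findall("alto:Tags/alto:NamedEntityTag", NS)
    }
    # Collect the words that reference each tag, in reading order.
    words_by_tag = {}
    for string in root.iter(f"{{{ALTO3}}}String"):
        for ref in string.get("TAGREFS", "").split():
            if ref in tags:
                words_by_tag.setdefault(ref, []).append(string.get("CONTENT"))
    return [(tags[ref][0], tags[ref][1], words) for ref, words in words_by_tag.items()]


entities = tagged_entities(SAMPLE)
for entity_type, label, words in entities:
    print(entity_type, label, words)
```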
Evaluation
• Scenario-based performance evaluation of
OCR/OLR using PAGE ground truth
• Ground truth dataset:
http://primaresearch.org/datasets/ENP
• Performance Evaluation Report:
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
IIIF
• International Image Interoperability
Framework (iiif.io) for online presentation
and aggregation
• Implementing Image API and Presentation API
• Europeana IIIF Task Force:
https://pro.europeana.eu/post/iiif-adoption-by-europeana-future-perspectives-for-the-network-1
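The IIIF Image API addresses image regions through a fixed URL pattern: `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below assembles such URLs. The server address and the identifier are hypothetical placeholders, not actual Europeana endpoints, and the default size `max` assumes Image API 2.1 or later.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path components."""
    # Identifiers must be percent-encoded, including any slashes they contain.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and page identifier.
BASE = "https://iiif.example.org/image"
PAGE_ID = "newspaper/1890-01-01/page1"

full_page = iiif_image_url(BASE, PAGE_ID)
# Top-left 1024x1024 pixel region, scaled to 512 pixels wide.
detail = iiif_image_url(BASE, PAGE_ID, region="0,0,1024,1024", size="512,")

print(full_page)
print(detail)
```

A zooming viewer requests many such regions at different sizes, which is why tiled JPEG 2000 files served by an image server fit this API well.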
Future plans
Future plans
• Migration of data from TEL (closed 12/2016)
to new Europeana Thematic Collections
http://europeana.eu/portal/
• Re-develop Newspapers API
• Re-develop search & browse interface
• Add new newspaper content
• Create virtual exhibitions & browse entry points
Future Plans
https://acceptance-npc.eanadev.org/portal/de/collections/newspapers
Future plans
• Automatic OCR error correction
• Improved newspaper layout analysis
• Named Entity Recognition, Disambiguation
and Linking (Wikidata)
• Extraction and classification of image content
• Deep semantic structuring of newspapers
• User corrections and annotations
Collaboration with Researchers
• Interviews with researchers
• Europeana Research
• CLARIN
• Viral Texts
• Oceanic Exchanges
• DDB
• impresso
Coding da Vinci
https://codingdavinci.de/
Thank you for your attention!
Questions?
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker