Europeana Newspapers
Data, Tools & Future Plans
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker
Europeana Newspapers
• EU FP7 ICT-PSP Project 2012 – 2015
• www.europeana-newspapers.eu
• Main outcomes
– TEL Historic Newspapers Portal:
http://www.theeuropeanlibrary.org/tel4/newspapers
– Deliverables:
http://www.europeana-newspapers.eu/public-materials/deliverables/
– Tools:
http://www.europeana-newspapers.eu/public-materials/tools/
– Final Report:
http://europeananewspapers.github.io/
Data
Data
• 1618 – 2016
• 12 countries
• 40 languages
• 120 TB
• Ca. 1,000 titles
• 3.3M issues
Data
• Metadata for more than 20 million pages
• 12 million pages processed with OCR
• 2 million pages processed with OLR
• Most content licensed as Public Domain
• Metadata licensed CC0
• Copyright cut-off date
Data
• JPEG 2000 images for use with IIPsrv
• METS container with embedded MODS
for structural and bibliographic metadata
• ALTO for OCRed text
• EDM for Europeana
 Europeana Newspapers METS/ALTO Profile
(ENMAP)
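As a rough illustration of working with the ALTO files listed above, the following minimal sketch pulls plain text out of an ALTO document. It assumes only the standard ALTO layout (TextLine elements containing String elements with a CONTENT attribute); the sample fragment is hand-written and hypothetical, not taken from the project data, and hyphenation (HYP) elements are ignored.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml: str) -> list[str]:
    """Return one string per ALTO TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


# Hypothetical, hand-written ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Berliner"/><SP/><String CONTENT="Tageblatt"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Morgen-Ausgabe"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_text_lines(SAMPLE_ALTO))
```

Matching on local element names keeps the sketch independent of the ALTO schema version used in a given file.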
Data
• Portals
– http://www.theeuropeanlibrary.org/tel4/newspapers
– http://europeana.eu/portal/search.html?query=europeana_collectionName%3A92*ewspapers*&rows=24&qt=false
• Downloads
– https://pro.europeana.eu/itemtype/newspapers
– http://test-solr-mongo.eanadev.org/europeana-research-newspapers-dump/
Tools & Technologies
Preprocessing
• Preprocessing with adaptive binarization to
reduce overall image file size and processing
time (>90% reduction in data volume for
<1% lower OCR accuracy)
• Preprocessing to create tiled JPEG 2000 files
for zooming using GraphicsMagick + Kakadu
• Created easy-to-use set of preprocessing tools
that also validate and harmonize data input
for efficient OCR/OLR processing
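The slides do not say which adaptive binarization algorithm was used. As a generic illustration of the idea only, the sketch below thresholds each pixel against the mean of its local neighbourhood; the function name, window size and offset are assumptions chosen for the example, and a production pipeline would use an optimized image library rather than pure Python.

```python
def adaptive_binarize(gray, window=3, offset=10):
    """Binarize a grayscale image (list of rows, values 0-255) with a local mean threshold.

    A pixel becomes black (0) if it is darker than the mean of its
    window x window neighbourhood minus `offset`, otherwise white (255).
    """
    height, width = len(gray), len(gray[0])
    radius = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            neighbours = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            threshold = sum(neighbours) / len(neighbours) - offset
            row.append(0 if gray[y][x] < threshold else 255)
        result.append(row)
    return result


# Toy 3x3 "page": one dark ink pixel on a light background.
page = [
    [200, 200, 200],
    [200, 0, 200],
    [200, 200, 200],
]
for row in adaptive_binarize(page):
    print(row)
```

Because the threshold follows the local background, unevenly lit or yellowed paper is handled better than with a single global cut-off, and the resulting bitonal image compresses far better than the grayscale original.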
OCR/OLR
• OCR: ABBYY FineReader Engine 11
– Gothic (Fraktur) license per page (A4!)
– 4 servers with 8 cores = 32 processing cores
– Average processing time of 5s per newspaper page
• OLR: CCS docWorks
– Article separation & page classification
– Possibility for post-correction/validation of results
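For a sense of scale, the OCR figures above can be combined into a back-of-the-envelope estimate. This is a sketch under stated assumptions, not a figure from the project: it assumes the 5 s average applies per core, that all 32 cores run in perfect parallel, and that the 12 million OCRed pages from the Data slides are the workload. Real throughput would be lower once I/O, failures and reruns are included.

```python
PAGES_WITH_OCR = 12_000_000  # from the Data slide
SECONDS_PER_PAGE = 5         # average stated on the OCR/OLR slide
CORES = 32                   # 4 servers x 8 cores


def ocr_wall_clock_days(pages, seconds_per_page, cores):
    """Estimate wall-clock days, ASSUMING one page per core at a time and perfect parallelism."""
    total_core_seconds = pages * seconds_per_page
    return total_core_seconds / cores / 86_400  # 86,400 seconds per day


days = ocr_wall_clock_days(PAGES_WITH_OCR, SECONDS_PER_PAGE, CORES)
print(f"{days:.1f} days")  # prints "21.7 days"
```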
NER
• Stanford CoreNLP Named Entity Recognition
(Conditional Random Fields)
• Adapted for METS/ALTO processing
• Added ALTO v3 (tags) output
• https://github.com/EuropeanaNewspapers/ner-app
• Annotated training & evaluation data
• 100 pages each for (historical) German, French, Dutch
• https://github.com/EuropeanaNewspapers/ner-corpora
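To show what consuming the ALTO v3 tag output could look like, here is a minimal sketch that resolves TAGREFS on String elements against NamedEntityTag entries in the Tags section. It assumes the general ALTO v3 tagging mechanism (NamedEntityTag with ID, TYPE and LABEL attributes); the sample document and its attribute values are hypothetical and may differ from what ner-app actually writes.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def tagged_entities(alto_xml):
    """Return (word, entity type, label) triples for String elements carrying TAGREFS."""
    root = ET.fromstring(alto_xml)
    # Index all named entity tags by their ID.
    tags = {
        el.attrib["ID"]: (el.attrib.get("TYPE", ""), el.attrib.get("LABEL", ""))
        for el in root.iter()
        if local_name(el.tag) == "NamedEntityTag"
    }
    found = []
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        # TAGREFS may hold several space-separated tag IDs.
        for ref in el.attrib.get("TAGREFS", "").split():
            if ref in tags:
                entity_type, label = tags[ref]
                found.append((el.attrib.get("CONTENT", ""), entity_type, label))
    return found


# Hypothetical, hand-written ALTO v3 fragment for illustration only.
SAMPLE_ALTO_V3 = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Tags>
    <NamedEntityTag ID="Tag1" TYPE="Location" LABEL="Berlin"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="in"/><SP/><String CONTENT="Berlin" TAGREFS="Tag1"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(tagged_entities(SAMPLE_ALTO_V3))
```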
Evaluation
• Scenario-based performance evaluation of
OCR/OLR using PAGE ground truth
• Ground truth dataset:
http://primaresearch.org/datasets/ENP
• Performance Evaluation Report:
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
Evaluation
IIIF
• International Image Interoperability
Framework (iiif.io) for online presentation
and aggregation
• Implementing Image API and Presentation API
• Europeana IIIF Task Force:
https://pro.europeana.eu/post/iiif-adoption-by-europeana-future-perspectives-for-the-network-1
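For readers unfamiliar with the IIIF Image API, a request is simply a URL of the form {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below builds such a URL; the server address and page identifier are made-up placeholders, not Europeana endpoints, and the helper name is an assumption for this example.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL:
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    return "/".join([
        base.rstrip("/"),
        quote(identifier, safe=""),  # identifiers must be URL-encoded, including "/"
        region,
        size,
        rotation,
        f"{quality}.{fmt}",
    ])


# Hypothetical server and identifier: request the top-left 1024x1024 region,
# scaled to 512 pixels wide.
url = iiif_image_url(
    "https://iiif.example.org/image",
    "newspaper/page-0001",
    region="0,0,1024,1024",
    size="512,",
)
print(url)
```

The default size "full" follows Image API 2.x; version 3 of the API uses "max" instead. Region requests like this one are what make deep zoom on large newspaper pages possible without downloading the whole image.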
Future plans
Future plans
• Migration of data from TEL (closed 12/2016)
to new Europeana Thematic Collections
http://europeana.eu/portal/
• Re-develop Newspapers API
• Re-develop search & browse interface
• Add new newspaper content
• Create virtual exhibitions & browse entry points
Future plans
https://acceptance-npc.eanadev.org/portal/de/collections/newspapers
Future plans
• Automatic OCR error correction
• Improved newspaper layout analysis
• Named Entity Recognition, Disambiguation
and Linking (Wikidata)
• Extraction and classification of image content
• Deep semantic structuring of newspapers
• User corrections and annotations
Collaboration with Researchers
• Interviews with researchers
• Europeana Research
• CLARIN
• Viral Texts
• Oceanic Exchanges
• DDB
• impresso
Coding da Vinci
https://codingdavinci.de/
Thank you for your attention!
Questions?
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker
