SlideShare a Scribd company logo
Europeana Newspapers
Data, Tools & Future Plans
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker
Europeana Newspapers
• EU FP7 ICT-PSP Project 2012 – 2015
• www.europeana-newspapers.eu
• Main outcomes
– TEL Historic Newspapers Portal:
http://www.theeuropeanlibrary.org/tel4/newspapers
– Deliverables:
http://www.europeana-newspapers.eu/
public-materials/deliverables/
– Tools:
http://www.europeana-newspapers.eu/
public-materials/tools/
– Final Report:
http://europeananewspapers.github.io/
Data
Data
• 1618 – 2016
• 12 countries
• 40 languages
• 120 TB
• Ca. 1,000 titles
• 3,3M issues
Data
• Metadata for more than >20 million pages
• 12 million pages processed with OCR
• 2 million pages processed with OLR
• Most content licensed as Public Domain
• Metadata licensed CC0
• Copyright cut-off date
Data
• JP2000 images for use with IIPsrv
• METS container with embedded MODS
for structural and bibliographic metadata
• ALTO for OCRed text
• EDM for Europeana
 Europeana Newspapers METS/ALTO Profile
(ENMAP)
Data
• Portals
– http://www.theeuropeanlibrary.org/tel4/newspapers
– http://europeana.eu/portal/search.html?query=euro-
peana_collectionName%3A92*ewspapers*&rows=24
&qt=false
• Downloads
– https://pro.europeana.eu/itemtype/newspapers
– http://test-solr-mongo.eanadev.org/europeana-
research-newspapers-dump/
Tools & Technologies
Preprocessing
• Preprocessing with adaptive Binarization to
reduce overall image file size and processing
time (yielded >90% reduction of data volume
vs. <1% lower accuracy in OCR results)
• Preprocessing to create tiled JP2000 files
for zooming using graphicsmagick + kakadu
• Created easy-to-use set of preprocessing tools
that also validate and harmonize data input
for efficient OCR/OLR processing
OCR/OLR
• OCR: ABBYY FineReader Engine 11
– Gothic license per page (A4!)
– 4 servers with 8 cores = 32 processing cores
– Average processing time of 5s per newspaper page
• OLR: CCS docWorks
– Article separation & page classification
– Possibility for post-correction/validation of results
NER
• Stanford CoreNLP Named Entity Recognition
(Conditional Random Fields)
• Adapted for METS/ALTO processing
• Added ALTO v3 (tags) output
• https://github.com/EuropeanaNewspapers/ner-app
• Annotated training & evaluation data
• 100 pages each for (historical) German, French, Dutch
• https://github.com/EuropeanaNewspapers/ner-corpora
Evaluation
• Scenario-based performance evaluation of
OCR/OLR using PAGE ground truth
• Ground truth dataset:
http://primaresearch.org/datasets/ENP
• Performance Evaluation Report:
http://www.europeana-newspapers.eu/wp-
content/uploads/2015/05/D3.5_Performance_
Evaluation_Report_1.0.pdf
Evaluation
IIIF
• International Image Interoperability
Framework (iiif.io) for online presentation
and aggregation
• Implementing Image API and Presentation API
• Europeana IIIF Task Force:
https://pro.europeana.eu/post/iiif-adoption-
by-europeana-future-perspectives-for-the-
network-1
Future plans
Future plans
• Migration of data from TEL (closed 12/2016)
to new Europeana Thematic Collections
http://europeana.eu/portal/
• Re-develop Newspapers API
• Re-develop search & browse interface
• Add new newspaper content
• Create virtual exhibitions & browse entry points
Future Plans
https://acceptance-npc.eanadev.org/portal/de/collections/newspapers
Future plans
• Automatic OCR error correction
• Improved newspaper layout analysis
• Named Entity Recognition, Disambiguation
and Linking (Wikidata)
• Extraction and classification of image content
• Deep semantic structuring of newspapers
• User corrections and annotations
Collaboration with Researchers
• Interviews with researchers
• Europeana Research
• CLARIN
• Viral Texts
• Oceanic Exchanges
• DDB
• impresso
Coding da Vinci
https://codingdavinci.de/
Thank you for your attention!
Questions?
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker

More Related Content

What's hot

Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
BigData_Europe
 
SC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDESC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDE
BigData_Europe
 
Open Data at the Federal Level 2021
Open Data at the Federal Level 2021Open Data at the Federal Level 2021
Open Data at the Federal Level 2021
Bart Hanssens
 
National journey planner norway
National journey planner norwayNational journey planner norway
National journey planner norway
FabMob
 
Sound Archives and Musical Instrument Collections
Sound Archives and Musical Instrument CollectionsSound Archives and Musical Instrument Collections
Sound Archives and Musical Instrument Collections
Synapta
 
About company
About companyAbout company
About company
Ilya Klintsov
 
SC1 Workshop 2 Technical overview
SC1 Workshop 2 Technical overviewSC1 Workshop 2 Technical overview
SC1 Workshop 2 Technical overview
BigData_Europe
 
Open data hackathon jelgava - report
Open data hackathon   jelgava - reportOpen data hackathon   jelgava - report
Open data hackathon jelgava - report
WirelessInfo
 
Big data value policy context and public private partnership
Big data value policy context and public private partnershipBig data value policy context and public private partnership
Big data value policy context and public private partnership
BigData_Europe
 
APIdays 2018 BnF API projects
APIdays 2018 BnF API projectsAPIdays 2018 BnF API projects
APIdays 2018 BnF API projects
Isabelle REUSA
 
Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013
Lars G. Svensson
 
Advanced Topics in OpenAPI: Added Value Services and Protection in the OpenTr...
Advanced Topics in OpenAPI: Added Value Services and Protection in the OpenTr...Advanced Topics in OpenAPI: Added Value Services and Protection in the OpenTr...
Advanced Topics in OpenAPI: Added Value Services and Protection in the OpenTr...
🧑‍💻 Manuel Coppotelli
 
ADEQUATe and CommuniData
ADEQUATe and CommuniDataADEQUATe and CommuniData
ADEQUATe and CommuniData
Stadt Wien
 
Data Gathering and Analysis BoF- RipEstat
Data Gathering and Analysis BoF- RipEstatData Gathering and Analysis BoF- RipEstat
Data Gathering and Analysis BoF- RipEstat
APNIC
 
Big Data Europe SC6 WS #3: PILOT SC6: CITIZEN BUDGET ON MUNICIPAL LEVEL, Mart...
Big Data Europe SC6 WS #3: PILOT SC6: CITIZEN BUDGET ON MUNICIPAL LEVEL, Mart...Big Data Europe SC6 WS #3: PILOT SC6: CITIZEN BUDGET ON MUNICIPAL LEVEL, Mart...
Big Data Europe SC6 WS #3: PILOT SC6: CITIZEN BUDGET ON MUNICIPAL LEVEL, Mart...
BigData_Europe
 
Challenges in the Search of European Cultural Heritage
Challenges in the Search of European Cultural HeritageChallenges in the Search of European Cultural Heritage
Challenges in the Search of European Cultural Heritage
Mónica Marrero
 
Cityscope data features
Cityscope data featuresCityscope data features
Cityscope data features
Lorna Campbell
 
NeISS City Dashboard
NeISS City DashboardNeISS City Dashboard
NeISS City Dashboard
NeISSProject
 
BDE SC6 workshop - introduction 2016
BDE SC6 workshop - introduction 2016BDE SC6 workshop - introduction 2016
BDE SC6 workshop - introduction 2016
BigData_Europe
 
Inspire hack 2017-linked-data
Inspire hack 2017-linked-dataInspire hack 2017-linked-data
Inspire hack 2017-linked-data
Raul Palma
 

What's hot (20)

Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
 
SC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDESC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDE
 
Open Data at the Federal Level 2021
Open Data at the Federal Level 2021Open Data at the Federal Level 2021
Open Data at the Federal Level 2021
 
National journey planner norway
National journey planner norwayNational journey planner norway
National journey planner norway
 
Sound Archives and Musical Instrument Collections
Sound Archives and Musical Instrument CollectionsSound Archives and Musical Instrument Collections
Sound Archives and Musical Instrument Collections
 
About company
About companyAbout company
About company
 
SC1 Workshop 2 Technical overview
SC1 Workshop 2 Technical overviewSC1 Workshop 2 Technical overview
SC1 Workshop 2 Technical overview
 
Open data hackathon jelgava - report
Open data hackathon   jelgava - reportOpen data hackathon   jelgava - report
Open data hackathon jelgava - report
 
Big data value policy context and public private partnership
Big data value policy context and public private partnershipBig data value policy context and public private partnership
Big data value policy context and public private partnership
 
APIdays 2018 BnF API projects
APIdays 2018 BnF API projectsAPIdays 2018 BnF API projects
APIdays 2018 BnF API projects
 
Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013
 
Advanced Topics in OpenAPI: Added Value Services and Protection in the OpenTr...
Advanced Topics in OpenAPI: Added Value Services and Protection in the OpenTr...Advanced Topics in OpenAPI: Added Value Services and Protection in the OpenTr...
Advanced Topics in OpenAPI: Added Value Services and Protection in the OpenTr...
 
ADEQUATe and CommuniData
ADEQUATe and CommuniDataADEQUATe and CommuniData
ADEQUATe and CommuniData
 
Data Gathering and Analysis BoF- RipEstat
Data Gathering and Analysis BoF- RipEstatData Gathering and Analysis BoF- RipEstat
Data Gathering and Analysis BoF- RipEstat
 
Big Data Europe SC6 WS #3: PILOT SC6: CITIZEN BUDGET ON MUNICIPAL LEVEL, Mart...
Big Data Europe SC6 WS #3: PILOT SC6: CITIZEN BUDGET ON MUNICIPAL LEVEL, Mart...Big Data Europe SC6 WS #3: PILOT SC6: CITIZEN BUDGET ON MUNICIPAL LEVEL, Mart...
Big Data Europe SC6 WS #3: PILOT SC6: CITIZEN BUDGET ON MUNICIPAL LEVEL, Mart...
 
Challenges in the Search of European Cultural Heritage
Challenges in the Search of European Cultural HeritageChallenges in the Search of European Cultural Heritage
Challenges in the Search of European Cultural Heritage
 
Cityscope data features
Cityscope data featuresCityscope data features
Cityscope data features
 
NeISS City Dashboard
NeISS City DashboardNeISS City Dashboard
NeISS City Dashboard
 
BDE SC6 workshop - introduction 2016
BDE SC6 workshop - introduction 2016BDE SC6 workshop - introduction 2016
BDE SC6 workshop - introduction 2016
 
Inspire hack 2017-linked-data
Inspire hack 2017-linked-dataInspire hack 2017-linked-data
Inspire hack 2017-linked-data
 

Similar to Europeana Newspapers - Data, Tools & Future Plans

SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
Sven Schlarb
 
How to read a million books?
How to read a million books?How to read a million books?
How to read a million books?
cneudecker
 
The Europeana Newspapers Project at IMPACT Final Event
The Europeana Newspapers Project at IMPACT Final EventThe Europeana Newspapers Project at IMPACT Final Event
The Europeana Newspapers Project at IMPACT Final Event
Europeana Newspapers
 
IMPACT Final Event 26-06-2012 - Use of IMPACT tools in the Europeana Newspap...
IMPACT Final Event 26-06-2012  - Use of IMPACT tools in the Europeana Newspap...IMPACT Final Event 26-06-2012  - Use of IMPACT tools in the Europeana Newspap...
IMPACT Final Event 26-06-2012 - Use of IMPACT tools in the Europeana Newspap...
IMPACT Centre of Competence
 
The European(a) Newspapers Project
The European(a) Newspapers ProjectThe European(a) Newspapers Project
The European(a) Newspapers Project
Europeana Newspapers
 
Refinement of Digitised Newspapers
Refinement of Digitised NewspapersRefinement of Digitised Newspapers
Refinement of Digitised Newspapers
cneudecker
 
The Europeana Newspapers Presentation - Cyberspace 2012
The Europeana Newspapers Presentation - Cyberspace 2012The Europeana Newspapers Presentation - Cyberspace 2012
The Europeana Newspapers Presentation - Cyberspace 2012
Europeana Newspapers
 
All WP Meeting Athens - Europeana Inside - Gordon McKenna
All WP Meeting Athens - Europeana Inside - Gordon McKennaAll WP Meeting Athens - Europeana Inside - Gordon McKenna
All WP Meeting Athens - Europeana Inside - Gordon McKenna
Digitised Manuscripts to Europeana
 
Dag Hensten - Nasjonalmuseet collections online
Dag Hensten - Nasjonalmuseet collections onlineDag Hensten - Nasjonalmuseet collections online
Dag Hensten - Nasjonalmuseet collections online
lab_SNG
 
The Europeana Newspapers Project
The Europeana Newspapers ProjectThe Europeana Newspapers Project
The Europeana Newspapers Project
The European Library
 
Europeana Newspaper metadata LIBER2013
Europeana Newspaper metadata LIBER2013Europeana Newspaper metadata LIBER2013
Europeana Newspaper metadata LIBER2013
Europeana Newspapers
 
Metadata
MetadataMetadata
Data Mining Newspapers Metadata
Data Mining Newspapers MetadataData Mining Newspapers Metadata
Data Mining Newspapers Metadata
Jean-Philippe Moreux
 
Europeana Newspapers -
Europeana Newspapers - Europeana Newspapers -
Europeana Newspapers -
TU Delft, Netherlands
 
Neudecker who-cares-about-yesterday’s-news-–-use-cases-and-requirements-for-n...
Neudecker who-cares-about-yesterday’s-news-–-use-cases-and-requirements-for-n...Neudecker who-cares-about-yesterday’s-news-–-use-cases-and-requirements-for-n...
Neudecker who-cares-about-yesterday’s-news-–-use-cases-and-requirements-for-n...
cneudecker
 
ENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project Overview
Europeana Newspapers
 
ALIADA Project. AtCult
ALIADA Project. AtCultALIADA Project. AtCult
ALIADA Project. AtCult
aliada project
 
co:op-READ-Convention Marburg - Günter Mühlberger
co:op-READ-Convention Marburg - Günter Mühlbergerco:op-READ-Convention Marburg - Günter Mühlberger
co:op-READ-Convention Marburg - Günter Mühlberger
ICARUS - International Centre for Archival Research
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoop
cneudecker
 
Europeana Newspapers Aggregation Plan
Europeana Newspapers Aggregation PlanEuropeana Newspapers Aggregation Plan
Europeana Newspapers Aggregation Plan
Europeana Newspapers
 

Similar to Europeana Newspapers - Data, Tools & Future Plans (20)

SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
 
How to read a million books?
How to read a million books?How to read a million books?
How to read a million books?
 
The Europeana Newspapers Project at IMPACT Final Event
The Europeana Newspapers Project at IMPACT Final EventThe Europeana Newspapers Project at IMPACT Final Event
The Europeana Newspapers Project at IMPACT Final Event
 
IMPACT Final Event 26-06-2012 - Use of IMPACT tools in the Europeana Newspap...
IMPACT Final Event 26-06-2012  - Use of IMPACT tools in the Europeana Newspap...IMPACT Final Event 26-06-2012  - Use of IMPACT tools in the Europeana Newspap...
IMPACT Final Event 26-06-2012 - Use of IMPACT tools in the Europeana Newspap...
 
The European(a) Newspapers Project
The European(a) Newspapers ProjectThe European(a) Newspapers Project
The European(a) Newspapers Project
 
Refinement of Digitised Newspapers
Refinement of Digitised NewspapersRefinement of Digitised Newspapers
Refinement of Digitised Newspapers
 
The Europeana Newspapers Presentation - Cyberspace 2012
The Europeana Newspapers Presentation - Cyberspace 2012The Europeana Newspapers Presentation - Cyberspace 2012
The Europeana Newspapers Presentation - Cyberspace 2012
 
All WP Meeting Athens - Europeana Inside - Gordon McKenna
All WP Meeting Athens - Europeana Inside - Gordon McKennaAll WP Meeting Athens - Europeana Inside - Gordon McKenna
All WP Meeting Athens - Europeana Inside - Gordon McKenna
 
Dag Hensten - Nasjonalmuseet collections online
Dag Hensten - Nasjonalmuseet collections onlineDag Hensten - Nasjonalmuseet collections online
Dag Hensten - Nasjonalmuseet collections online
 
The Europeana Newspapers Project
The Europeana Newspapers ProjectThe Europeana Newspapers Project
The Europeana Newspapers Project
 
Europeana Newspaper metadata LIBER2013
Europeana Newspaper metadata LIBER2013Europeana Newspaper metadata LIBER2013
Europeana Newspaper metadata LIBER2013
 
Metadata
MetadataMetadata
Metadata
 
Data Mining Newspapers Metadata
Data Mining Newspapers MetadataData Mining Newspapers Metadata
Data Mining Newspapers Metadata
 
Europeana Newspapers -
Europeana Newspapers - Europeana Newspapers -
Europeana Newspapers -
 
Neudecker who-cares-about-yesterday’s-news-–-use-cases-and-requirements-for-n...
Neudecker who-cares-about-yesterday’s-news-–-use-cases-and-requirements-for-n...Neudecker who-cares-about-yesterday’s-news-–-use-cases-and-requirements-for-n...
Neudecker who-cares-about-yesterday’s-news-–-use-cases-and-requirements-for-n...
 
ENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project Overview
 
ALIADA Project. AtCult
ALIADA Project. AtCultALIADA Project. AtCult
ALIADA Project. AtCult
 
co:op-READ-Convention Marburg - Günter Mühlberger
co:op-READ-Convention Marburg - Günter Mühlbergerco:op-READ-Convention Marburg - Günter Mühlberger
co:op-READ-Convention Marburg - Günter Mühlberger
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoop
 
Europeana Newspapers Aggregation Plan
Europeana Newspapers Aggregation PlanEuropeana Newspapers Aggregation Plan
Europeana Newspapers Aggregation Plan
 

More from cneudecker

EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
cneudecker
 
ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltexte
cneudecker
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungen
cneudecker
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
cneudecker
 
AI for digitized cultural heritage
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritage
cneudecker
 
Kuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenz
cneudecker
 
Überblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-D
cneudecker
 
The many uses of digitized newspapers
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspapers
cneudecker
 
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
cneudecker
 
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
cneudecker
 
OCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentsOCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documents
cneudecker
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Mining
cneudecker
 
Formate für Volltexte
Formate für VolltexteFormate für Volltexte
Formate für Volltexte
cneudecker
 
Extrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europe
cneudecker
 
Reise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minuten
cneudecker
 
Europeana Newspapers in a Nutshell
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshell
cneudecker
 
lab.sbb.berlin
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlin
cneudecker
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspapers
cneudecker
 
What's up, Europeana Newspapers?
What's up, Europeana Newspapers?What's up, Europeana Newspapers?
What's up, Europeana Newspapers?
cneudecker
 
Active archives @SBB
Active archives @SBBActive archives @SBB
Active archives @SBB
cneudecker
 

More from cneudecker (20)

EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
 
ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltexte
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungen
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
 
AI for digitized cultural heritage
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritage
 
Kuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenz
 
Überblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-D
 
The many uses of digitized newspapers
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspapers
 
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
 
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
 
OCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentsOCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documents
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Mining
 
Formate für Volltexte
Formate für VolltexteFormate für Volltexte
Formate für Volltexte
 
Extrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europe
 
Reise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minuten
 
Europeana Newspapers in a Nutshell
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshell
 
lab.sbb.berlin
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlin
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspapers
 
What's up, Europeana Newspapers?
What's up, Europeana Newspapers?What's up, Europeana Newspapers?
What's up, Europeana Newspapers?
 
Active archives @SBB
Active archives @SBBActive archives @SBB
Active archives @SBB
 

Recently uploaded

Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 

Recently uploaded (20)

Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 

Europeana Newspapers - Data, Tools & Future Plans

  • 1. Europeana Newspapers Data, Tools & Future Plans Clemens Neudecker Staatsbibliothek zu Berlin @cneudecker
  • 2. Europeana Newspapers • EU FP7 ICT-PSP Project 2012 – 2015 • www.europeana-newspapers.eu • Main outcomes – TEL Historic Newspapers Portal: http://www.theeuropeanlibrary.org/tel4/newspapers – Deliverables: http://www.europeana-newspapers.eu/ public-materials/deliverables/ – Tools: http://www.europeana-newspapers.eu/ public-materials/tools/ – Final Report: http://europeananewspapers.github.io/
  • 4. Data • 1618 – 2016 • 12 countries • 40 languages • 120 TB • Ca. 1,000 titles • 3,3M issues
  • 5. Data • Metadata for more than >20 million pages • 12 million pages processed with OCR • 2 million pages processed with OLR • Most content licensed as Public Domain • Metadata licensed CC0 • Copyright cut-off date
  • 6. Data • JP2000 images for use with IIPsrv • METS container with embedded MODS for structural and bibliographic metadata • ALTO for OCRed text • EDM for Europeana  Europeana Newspapers METS/ALTO Profile (ENMAP)
  • 7. Data • Portals – http://www.theeuropeanlibrary.org/tel4/newspapers – http://europeana.eu/portal/search.html?query=euro- peana_collectionName%3A92*ewspapers*&rows=24 &qt=false • Downloads – https://pro.europeana.eu/itemtype/newspapers – http://test-solr-mongo.eanadev.org/europeana- research-newspapers-dump/
  • 9. Preprocessing • Preprocessing with adaptive Binarization to reduce overall image file size and processing time (yielded >90% reduction of data volume vs. <1% lower accuracy in OCR results) • Preprocessing to create tiled JP2000 files for zooming using graphicsmagick + kakadu • Created easy-to-use set of preprocessing tools that also validate and harmonize data input for efficient OCR/OLR processing
  • 10. OCR/OLR • OCR: ABBYY FineReader Engine 11 – Gothic license per page (A4!) – 4 servers with 8 cores = 32 processing cores – Average processing time of 5s per newspaper page • OLR: CCS docWorks – Article separation & page classification – Possibility for post-correction/validation of results
  • 11. NER • Stanford CoreNLP Named Entity Recognition (Conditional Random Fields) • Adapted for METS/ALTO processing • Added ALTO v3 (tags) output • https://github.com/EuropeanaNewspapers/ner-app • Annotated training & evaluation data • 100 pages each for (historical) German, French, Dutch • https://github.com/EuropeanaNewspapers/ner-corpora
  • 12. Evaluation • Scenario-based performance evaluation of OCR/OLR using PAGE ground truth • Ground truth dataset: http://primaresearch.org/datasets/ENP • Performance Evaluation Report: http://www.europeana-newspapers.eu/wp- content/uploads/2015/05/D3.5_Performance_ Evaluation_Report_1.0.pdf
  • 14. IIIF • International Image Interoperability Framework (iiif.io) for online presentation and aggregation • Implementing Image API and Presentation API • Europeana IIIF Task Force: https://pro.europeana.eu/post/iiif-adoption- by-europeana-future-perspectives-for-the- network-1
  • 16. Future plans • Migration of data from TEL (closed 12/2016) to new Europeana Thematic Collections http://europeana.eu/portal/ • Re-develop Newspapers API • Re-develop search & browse interface • Add new newspaper content • Create virtual exhibitions & browse entry points
  • 18. Future plans • Automatic OCR error correction • Improved newspaper layout analysis • Named Entity Recognition, Disambiguation and Linking (Wikidata) • Extraction and classification of image content • Deep semantic structuring of newspapers • User corrections and annotations
  • 19. Collaboration with Researchers • Interviews with researchers • Europeana Research • CLARIN • Viral Texts • Oceanic Exchanges • DDB • impresso
  • 21. Thank you for your attention! Questions? Clemens Neudecker Staatsbibliothek zu Berlin @cneudecker