Europeana Newspapers
Data, Tools & Future Plans
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker
Europeana Newspapers
• EU FP7 ICT-PSP Project 2012 – 2015
• www.europeana-newspapers.eu
• Main outcomes
– TEL Historic Newspapers Portal:
http://www.theeuropeanlibrary.org/tel4/newspapers
– Deliverables:
http://www.europeana-newspapers.eu/public-materials/deliverables/
– Tools:
http://www.europeana-newspapers.eu/public-materials/tools/
– Final Report:
http://europeananewspapers.github.io/
Data
Data
• 1618 – 2016
• 12 countries
• 40 languages
• 120 TB
• Ca. 1,000 titles
• 3.3M issues
Data
• Metadata for more than 20 million pages
• 12 million pages processed with OCR
• 2 million pages processed with OLR
• Most content licensed as Public Domain
• Metadata licensed CC0
• Copyright cut-off date
Data
• JPEG 2000 (JP2) images for use with the IIPImage server (iipsrv)
• METS container with embedded MODS
for structural and bibliographic metadata
• ALTO for OCRed text
• EDM for Europeana
→ Europeana Newspapers METS/ALTO Profile (ENMAP)
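As a rough illustration of how the ALTO layer can be consumed, the sketch below pulls the recognised text out of an ALTO document line by line. It relies only on the standard ALTO element and attribute names (`TextLine`, `String`, `CONTENT`) and matches on local names so that the ALTO namespace version does not matter. The sample XML and the helper names `local_name` and `alto_lines` are made up for this example and are not taken from ENMAP.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO fragment (illustrative, not project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Berliner"/><SP/><String CONTENT="Tageblatt"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Morgen-Ausgabe"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per ALTO TextLine, joining its String words."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only String children carry text; SP (space) elements are skipped.
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


for line in alto_lines(SAMPLE_ALTO):
    print(line)
```

In a METS/ALTO package the ALTO files would first be located through the METS file section; that lookup is left out here to keep the sketch short.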
Data
• Portals
– http://www.theeuropeanlibrary.org/tel4/newspapers
– http://europeana.eu/portal/search.html?query=europeana_collectionName%3A92*ewspapers*&rows=24&qt=false
• Downloads
– https://pro.europeana.eu/itemtype/newspapers
– http://test-solr-mongo.eanadev.org/europeana-research-newspapers-dump/
Tools & Technologies
Preprocessing
• Adaptive binarization to reduce overall image file size
and processing time (data volume reduced by >90%,
at the cost of <1% lower OCR accuracy)
• Creation of tiled JPEG 2000 files for zooming,
using GraphicsMagick + Kakadu
• Easy-to-use set of preprocessing tools that also
validate and harmonize input data for efficient
OCR/OLR processing
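To make the idea of adaptive binarization concrete, here is a minimal sketch using a local-mean threshold in plain Python. The slides do not say which binarization algorithm the project used, so this is a simple stand-in rather than the project's method; the function name `adaptive_binarize`, the `window` and `offset` parameters, and the toy image are all assumptions made for the example.

```python
def adaptive_binarize(gray, window=3, offset=30):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    Each pixel is compared with the mean of its window x window
    neighbourhood minus `offset`; darker pixels become 0 (ink),
    all others 255 (paper).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clamp the neighbourhood to the image borders.
            y0, y1 = max(0, y - half), min(height, y + half + 1)
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            values = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(values) / len(values)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


# Toy page with uneven lighting: bright left half, dim right half,
# one dark stroke in each half (120 on the left, 40 on the right).
page = [
    [200, 200, 200, 100, 100, 100],
    [200, 120, 200, 100,  40, 100],
    [200, 200, 200, 100, 100, 100],
]
binary = adaptive_binarize(page, window=3, offset=30)
for row in binary:
    print(row)
```

No single global threshold separates both strokes from their backgrounds in this toy page (it would have to lie above 120 and at or below 100 at the same time), whereas the local threshold marks exactly the two stroke pixels. The resulting two-level image is also what makes the large reduction in data volume possible, since bitonal images compress far better than grayscale or colour scans.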
OCR/OLR
• OCR: ABBYY FineReader Engine 11
– Gothic license per page (A4!)
– 4 servers with 8 cores = 32 processing cores
– Average processing time of 5s per newspaper page
• OLR: CCS docWorks
– Article separation & page classification
– Possibility for post-correction/validation of results
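The OCR figures above allow a quick back-of-the-envelope estimate of total processing time. The sketch below combines them with the 12 million OCR pages from the Data slides. It assumes that the 5 s average is per page per core and that the work parallelises perfectly across all 32 cores; the slides state neither, so the result is only an order-of-magnitude estimate.

```python
PAGES = 12_000_000      # pages processed with OCR (Data slide)
SECONDS_PER_PAGE = 5    # average processing time per newspaper page
CORES = 32              # 4 servers x 8 cores


def ocr_days(pages, seconds_per_page, parallel_workers):
    """Wall-clock days, assuming ideal parallelism and per-core timing."""
    total_seconds = pages * seconds_per_page / parallel_workers
    return total_seconds / 86_400  # seconds per day


single = ocr_days(PAGES, SECONDS_PER_PAGE, 1)
parallel = ocr_days(PAGES, SECONDS_PER_PAGE, CORES)
print(f"1 core:   {single:.0f} days")    # about 694 days
print(f"{CORES} cores: {parallel:.1f} days")  # about 21.7 days
```

If the 5 s figure is instead the throughput of the whole cluster, the single-core number (roughly 694 days of continuous processing) is the relevant one.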
NER
• Stanford CoreNLP Named Entity Recognition
(Conditional Random Fields)
• Adapted for METS/ALTO processing
• Added ALTO v3 (tags) output
• https://github.com/EuropeanaNewspapers/ner-app
• Annotated training & evaluation data
• 100 pages each for (historical) German, French, Dutch
• https://github.com/EuropeanaNewspapers/ner-corpora
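The following sketch shows one way per-token NER labels could be written back into ALTO using the v3 tagging mechanism (a `Tags` section with `NamedEntityTag` elements, referenced from `String` elements via `TAGREFS`). It is not the ner-app implementation: the function name `add_entity_tags`, the way `TYPE` and `LABEL` are filled, the tag ID scheme, and the sample sentence are assumptions for illustration. The sample omits the ALTO namespace for brevity, which real files would carry.

```python
import xml.etree.ElementTree as ET

ALTO = """<alto>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String ID="w1" CONTENT="Alexander"/>
    <String ID="w2" CONTENT="Humboldt"/>
    <String ID="w3" CONTENT="in"/>
    <String ID="w4" CONTENT="Berlin"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# Per-token labels as a CRF tagger might emit them ("O" = no entity).
LABELS = {"w1": "PERSON", "w2": "PERSON", "w3": "O", "w4": "LOCATION"}


def add_entity_tags(root, labels):
    """Attach ALTO v3 style NamedEntityTag elements to labelled tokens."""
    tags = ET.Element("Tags")
    root.insert(0, tags)  # Tags comes before Layout in ALTO v3
    previous_label, current_tag = "O", None
    # Materialise the token list first, since the tree is modified below.
    for string in list(root.iter("String")):
        label = labels.get(string.get("ID"), "O")
        if label == "O":
            previous_label, current_tag = "O", None
            continue
        if label != previous_label:
            # A new entity starts: create its tag.
            current_tag = ET.SubElement(
                tags, "NamedEntityTag",
                ID=f"Tag{len(tags) + 1}", TYPE=label,
                LABEL=string.get("CONTENT"),
            )
        else:
            # Same label as the previous token: extend the entity text.
            # (Adjacent but distinct entities of one type would be merged.)
            joined = current_tag.get("LABEL") + " " + string.get("CONTENT")
            current_tag.set("LABEL", joined)
        string.set("TAGREFS", current_tag.get("ID"))
        previous_label = label
    return root


tagged = add_entity_tags(ET.fromstring(ALTO), LABELS)
print(ET.tostring(tagged, encoding="unicode"))
```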
Evaluation
• Scenario-based performance evaluation of
OCR/OLR using PAGE ground truth
• Ground truth dataset:
http://primaresearch.org/datasets/ENP
• Performance Evaluation Report:
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
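For readers unfamiliar with ground-truth based OCR evaluation, the sketch below computes a character error rate (CER): the edit distance between OCR output and ground-truth text, divided by the ground-truth length. This is a common generic measure and is shown only as an illustration; the report linked above defines its own scenario-based measures, which are not reproduced here. The function names and the sample words are assumptions for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_text):
    """Edit distance normalised by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


# One misread character ("u" read as "n") in a seven-letter word.
cer = character_error_rate("Zeitung", "Zeitnng")
print(f"CER = {cer:.3f}")
```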
Evaluation
IIIF
• International Image Interoperability
Framework (iiif.io) for online presentation
and aggregation
• Implementing Image API and Presentation API
• Europeana IIIF Task Force:
https://pro.europeana.eu/post/iiif-adoption-by-europeana-future-perspectives-for-the-network-1
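The IIIF Image API addresses images through a fixed URL pattern, `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, which is what makes zooming and cropping of large newspaper pages possible without custom server logic. The sketch below builds such URLs. The server address and the page identifier are hypothetical, and the helper name `iiif_image_url` is made up for the example.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    The default size "full" follows Image API 2.x; version 3.0 uses "max".
    """
    # The identifier is a single path segment, so "/" must be escaped.
    safe_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{safe_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, for illustration only.
BASE = "https://iiif.example.org/image"
PAGE = "newspaper/1914-08-01/page-1"

full_page = iiif_image_url(BASE, PAGE)
thumbnail = iiif_image_url(BASE, PAGE, size="200,")  # 200 px wide
headline = iiif_image_url(BASE, PAGE, region="0,0,2000,600", size="800,")

for url in (full_page, thumbnail, headline):
    print(url)
```

The Presentation API then describes how such images are ordered into issues and pages (as a manifest), which is not covered by this sketch.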
Future plans
Future plans
• Migration of data from TEL (closed 12/2016)
to new Europeana Thematic Collections
http://europeana.eu/portal/
• Re-develop Newspapers API
• Re-develop search & browse interface
• Add new newspaper content
• Create virtual exhibitions & browse entry points
Future plans
https://acceptance-npc.eanadev.org/portal/de/collections/newspapers
Future plans
• Automatic OCR error correction
• Improved newspaper layout analysis
• Named Entity Recognition, Disambiguation
and Linking (Wikidata)
• Extraction and classification of image content
• Deep semantic structuring of newspapers
• User corrections and annotations
Collaboration with Researchers
• Interviews with researchers
• Europeana Research
• CLARIN
• Viral Texts
• Oceanic Exchanges
• DDB
• impresso
Coding da Vinci
https://codingdavinci.de/
Thank you for your attention!
Questions?
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker

 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptxMAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
 
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
 
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and CitiesThe Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
 
Gen AI: Privacy Risks of Large Language Models (LLMs)
Gen AI: Privacy Risks of Large Language Models (LLMs)Gen AI: Privacy Risks of Large Language Models (LLMs)
Gen AI: Privacy Risks of Large Language Models (LLMs)
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
 
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
 
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
 
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdfAcumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
 

Europeana Newspapers - Data, Tools & Future Plans

  • 1. Europeana Newspapers Data, Tools & Future Plans Clemens Neudecker Staatsbibliothek zu Berlin @cneudecker
  • 2. Europeana Newspapers • EU FP7 ICT-PSP Project 2012 – 2015 • www.europeana-newspapers.eu • Main outcomes – TEL Historic Newspapers Portal: http://www.theeuropeanlibrary.org/tel4/newspapers – Deliverables: http://www.europeana-newspapers.eu/public-materials/deliverables/ – Tools: http://www.europeana-newspapers.eu/public-materials/tools/ – Final Report: http://europeananewspapers.github.io/
  • 4. Data • 1618 – 2016 • 12 countries • 40 languages • 120 TB • Ca. 1,000 titles • 3.3M issues
  • 5. Data • Metadata for more than 20 million pages • 12 million pages processed with OCR • 2 million pages processed with OLR • Most content licensed as Public Domain • Metadata licensed CC0 • Copyright cut-off date
  • 6. Data • JPEG 2000 images for use with IIPsrv • METS container with embedded MODS for structural and bibliographic metadata • ALTO for OCRed text • EDM for Europeana → Europeana Newspapers METS/ALTO Profile (ENMAP)
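Slide 6 names ALTO as the container for the OCRed text. As a minimal sketch of what reading such a file can look like, the Python below pulls the plain text out of an ALTO document using only the standard library. The sample XML is a hand-written fragment for illustration, not project data, and the function names are hypothetical. The sketch assumes the usual ALTO layout in which each `TextLine` holds `String` elements whose `CONTENT` attribute carries one word; it ignores hyphenation (`HYP`) elements.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO fragment (illustrative only, not project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Berliner"/><SP/><String CONTENT="Tageblatt"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Morgen-Ausgabe"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the ALTO schema version does not matter."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in elem
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


print(alto_to_lines(SAMPLE_ALTO))  # ['Berliner Tageblatt', 'Morgen-Ausgabe']
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since collections often mix ALTO schema versions.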
  • 7. Data • Portals – http://www.theeuropeanlibrary.org/tel4/newspapers – http://europeana.eu/portal/search.html?query=europeana_collectionName%3A92*ewspapers*&rows=24&qt=false • Downloads – https://pro.europeana.eu/itemtype/newspapers – http://test-solr-mongo.eanadev.org/europeana-research-newspapers-dump/
  • 9. Preprocessing • Preprocessing with adaptive binarization to reduce overall image file size and processing time (yielded >90% reduction of data volume vs. <1% lower accuracy in OCR results) • Preprocessing to create tiled JPEG 2000 files for zooming using GraphicsMagick + Kakadu • Created an easy-to-use set of preprocessing tools that also validate and harmonize data input for efficient OCR/OLR processing
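Slide 9 does not say which adaptive binarization algorithm the project used. Purely as an illustration of the general idea, the sketch below implements a simple local-mean threshold in plain Python: each pixel is compared with the mean of its neighbourhood instead of one global cut-off. The function name, the window `radius`, and the `offset` are assumptions made for this example, and the tiny test image is invented.

```python
def adaptive_binarize(gray, radius=1, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes 0 (ink) when it is darker than the mean of its
    (2*radius+1)-wide neighbourhood minus `offset`, otherwise 255 (paper).
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the window at the image borders.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


# Invented page strip: paper darkens from left (220) to right (130),
# with one ink pixel on each side (145 on the left, 85 on the right).
page = [
    [220, 205, 190, 175, 160, 145, 130],
    [220, 145, 190, 175, 160,  85, 130],
    [220, 205, 190, 175, 160, 145, 130],
]
for row in adaptive_binarize(page):
    print(row)
```

In this toy image the left ink pixel (145) is brighter than the right-hand paper (130), so no single global threshold separates ink from paper, while the local comparison marks only the two ink pixels as 0.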
  • 10. OCR/OLR • OCR: ABBYY FineReader Engine 11 – Gothic license per page (A4!) – 4 servers with 8 cores = 32 processing cores – Average processing time of 5s per newspaper page • OLR: CCS docWorks – Article separation & page classification – Possibility for post-correction/validation of results
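The figures on slides 5 and 10 (12 million OCRed pages, 32 cores, 5 s per page) allow a rough estimate of total OCR time. The slide does not state whether 5 s is the time per core or the effective time across the whole cluster, so the sketch below computes both readings. Perfect parallel scaling and round-the-clock operation are assumptions of this example, not claims from the source.

```python
PAGES = 12_000_000        # pages processed with OCR (slide 5)
SECONDS_PER_PAGE = 5      # average processing time (slide 10)
CORES = 32                # 4 servers x 8 cores (slide 10)
SECONDS_PER_DAY = 86_400


def processing_days(pages, seconds_per_page, parallel_workers):
    """Days of continuous processing, assuming perfect parallel scaling."""
    return pages * seconds_per_page / parallel_workers / SECONDS_PER_DAY


# Reading 1: 5 s is the time one core needs for one page.
per_core_days = processing_days(PAGES, SECONDS_PER_PAGE, CORES)
# Reading 2: 5 s is already the effective cluster-wide time per page.
cluster_days = processing_days(PAGES, SECONDS_PER_PAGE, 1)

print(f"5 s per core:    {per_core_days:.1f} days")   # 21.7 days
print(f"5 s cluster-wide: {cluster_days:.1f} days")   # 694.4 days
```

The two readings differ by the factor 32, which is why the interpretation of the per-page figure matters when planning a similar production run.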
  • 11. NER • Stanford CoreNLP Named Entity Recognition (Conditional Random Fields) • Adapted for METS/ALTO processing • Added ALTO v3 (tags) output • https://github.com/EuropeanaNewspapers/ner-app • Annotated training & evaluation data • 100 pages each for (historical) German, French, Dutch • https://github.com/EuropeanaNewspapers/ner-corpora
  • 12. Evaluation • Scenario-based performance evaluation of OCR/OLR using PAGE ground truth • Ground truth dataset: http://primaresearch.org/datasets/ENP • Performance Evaluation Report: http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
  • 14. IIIF • International Image Interoperability Framework (iiif.io) for online presentation and aggregation • Implementing Image API and Presentation API • Europeana IIIF Task Force: https://pro.europeana.eu/post/iiif-adoption-by-europeana-future-perspectives-for-the-network-1
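For readers unfamiliar with the IIIF Image API mentioned on slide 14, the sketch below builds request URLs following the API's URI template `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The server address and the identifier are placeholders invented for this example and do not refer to any Europeana endpoint. The default `size="full"` follows version 2 of the Image API; version 3 uses `max` for the same purpose.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL.

    The identifier is percent-encoded so that slashes inside it are not
    mistaken for path separators.
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Placeholder server and identifier, for illustration only.
BASE = "https://iiif.example.org/image"
PAGE_ID = "newspaper/1901-01-01/page1"

# Whole page at full size.
print(iiif_image_url(BASE, PAGE_ID))
# A 1024x1024 pixel region from the top-left corner, scaled to 512 px wide.
print(iiif_image_url(BASE, PAGE_ID, region="0,0,1024,1024", size="512,"))
```

Region and size requests like the second call are what a zooming viewer issues to fetch tiles from a server that holds tiled images.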
  • 16. Future plans • Migration of data from TEL (closed 12/2016) to new Europeana Thematic Collections http://europeana.eu/portal/ • Re-develop Newspapers API • Re-develop search & browse interface • Add new newspaper content • Create virtual exhibitions & browse entry points
  • 18. Future plans • Automatic OCR error correction • Improved newspaper layout analysis • Named Entity Recognition, Disambiguation and Linking (Wikidata) • Extraction and classification of image content • Deep semantic structuring of newspapers • User corrections and annotations
  • 19. Collaboration with Researchers • Interviews with researchers • Europeana Research • CLARIN • Viral Texts • Oceanic Exchanges • DDB • impresso
  • 21. Thank you for your attention! Questions? Clemens Neudecker Staatsbibliothek zu Berlin @cneudecker