Evaluation and post-correction of OCR of digitised historical newspapers
A research project
Lotte Wilms (KB) & Janneke van der Zwaan (NL eScience Center)
@lottewilms @jvdzwa
Delpher newspaper corpus
• Digitised Dutch newspapers
• 1618-1995
• Images + metadata + text
• Now: 11 million pages (in 1,351,123 issues)
• Prognosis 2020: 20 million pages
• Full text searchable on: www.delpher.nl
Crowdsourced
corrections
The data
Type/format  | Level                  | Comments
PDF          | issue                  | Searchable text + scan
JPEG-2000    | page                   | Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm). Master: JPEG 2000 part 1, lossless compression, greyscale or colour
Dublin Core  | issue / page / article | Descriptive metadata
OCR          | article                | XML
ALTO         | page                   |
mpeg21-didl  | issue                  | Structural metadata
Aims of project
• Insight into quality of our OCR
• Insight into automated methods of post-correction
• Reprocessing images
• Machine learning approach
• Other?
Output of the project
• Representative sample set of digitised newspapers, with ground truth
• Report on quality of OCR of Delpher’s digitised newspapers
• Report on post-correction possibilities of OCR using automatic techniques
• Impact analysis of most likely method of improvement
• Prototype for OCR post-correction and evaluation using deep learning
Sample set
• 2000 pages
• Representative of the whole collection, taking into account:
  • Date of publication
  • Date of production
  • Software used
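One way to draw a sample that reflects these three dimensions is proportional stratified sampling. The following is a minimal sketch, not the project's actual procedure: the record fields `pub_year`, `prod_year` and `software` are hypothetical names, and grouping publication dates by decade is an illustrative choice.

```python
import random
from collections import defaultdict


def stratified_sample(pages, n_total, seed=0):
    """Draw roughly n_total pages, proportionally per stratum.

    A stratum is (publication decade, production year, OCR software).
    The field names are hypothetical, not the Delpher metadata schema.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for page in pages:
        key = (page["pub_year"] // 10 * 10, page["prod_year"], page["software"])
        strata[key].append(page)

    total = len(pages)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional share, but at least one page so rare strata are represented.
        k = max(1, round(n_total * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```

Because of rounding and the one-page minimum, the result can deviate slightly from `n_total`; for a target such as 2000 pages the allocation would need a final adjustment step.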
Production information
<OCRProcessing>
Database with production information
• Extracted from the metadata:
  • Issue identifier
  • Newspaper title
  • Publication date
  • Producer
  • Production date
  • Software used
→ Some issues processed twice with ABBYY 8.1 & 9.0
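A database with these six fields makes it straightforward to find the issues that were processed more than once. The sketch below uses an in-memory SQLite table; the table name, column names and the example records are all invented for illustration and do not reflect the actual KB database.

```python
import sqlite3


def build_production_db(records):
    """Create an in-memory table holding one row per processing run of an issue."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE production ("
        "issue_id TEXT, title TEXT, publication_date TEXT, "
        "producer TEXT, production_date TEXT, software TEXT)"
    )
    conn.executemany("INSERT INTO production VALUES (?, ?, ?, ?, ?, ?)", records)
    return conn


def issues_processed_twice(conn):
    """Return issue ids that occur with more than one distinct software version."""
    cur = conn.execute(
        "SELECT issue_id FROM production "
        "GROUP BY issue_id HAVING COUNT(DISTINCT software) > 1 "
        "ORDER BY issue_id"
    )
    return [row[0] for row in cur]


# Placeholder records, invented purely for illustration.
records = [
    ("issue-001", "Example Courant", "1900-01-01", "Producer X", "2010-05-01", "ABBYY 8.1"),
    ("issue-001", "Example Courant", "1900-01-01", "Producer X", "2012-03-01", "ABBYY 9.0"),
    ("issue-002", "Example Courant", "1900-01-02", "Producer X", "2012-03-01", "ABBYY 9.0"),
]
conn = build_production_db(records)
print(issues_processed_twice(conn))
```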
Ground truth
Post-correction methods
• Deep learning (by Janneke van der Zwaan from the Netherlands eScience Center)
• PICCL by Martin Reynaert (UvT & RU)
  • https://github.com/LanguageMachines/PICCL
• Proprietary software from a startup?
• More?
Improve OCR using Deep Learning
• Character-level language models
• Long Short-Term Memory (LSTM)
→ Also applicable for OCR evaluation without GT!
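The idea behind evaluation without ground truth is that a character-level language model trained on clean text is more "surprised" by text containing OCR errors. The sketch below illustrates this with a deliberately simpler stand-in: a character trigram model with add-one smoothing rather than an LSTM. It is not the project's model, and the class name and training text are assumptions made for the example.

```python
import math
from collections import Counter


class CharTrigramModel:
    """Character trigram model with add-one smoothing (a stand-in for an LSTM)."""

    def __init__(self, training_text):
        padded = "  " + training_text  # two spaces of left context for the first character
        self.trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
        # Count each bigram once per trigram it starts, so the counts are consistent.
        self.bigrams = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
        self.vocab_size = len(set(padded)) + 1  # +1 reserves mass for unseen characters

    def perplexity(self, text):
        """Per-character perplexity; higher means less like the training text."""
        padded = "  " + text
        log_prob = 0.0
        for i in range(len(text)):
            context, trigram = padded[i:i + 2], padded[i:i + 3]
            p = (self.trigrams[trigram] + 1) / (self.bigrams[context] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / max(len(text), 1))


model = CharTrigramModel("de kat zit op de mat en de hond ligt in de mand " * 5)
clean = model.perplexity("de kat zit op de mat")
noisy = model.perplexity("dc kaf z1t 0p dc rnat")
print(clean < noisy)
```

A score of this kind only ranks pages by how plausible their text looks; turning it into an estimate of OCR quality would still require calibration against a sample with ground truth.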
Reprocessing images
• Service provider
• Access or master images?
  • Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm)
  • Master: JPEG 2000 part 1, lossless compression, greyscale or colour
• Standard software or new solutions?
Evaluation
• Focused on word-based searching
• Bag of words?
• Use existing tools or create our own?
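A bag-of-words comparison ignores word order and asks only which words from the ground truth survive in the OCR. The following is a minimal sketch of such a measure; the function names and the simple tokenisation rule are assumptions, and existing evaluation tools may define these scores differently.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; the tokenisation rule is a simplifying assumption."""
    return Counter(re.findall(r"\w+", text.lower()))


def bow_scores(ocr_text, gt_text):
    """Word-level precision, recall and F1, ignoring word order."""
    ocr, gt = bag_of_words(ocr_text), bag_of_words(gt_text)
    matched = sum((ocr & gt).values())  # multiset intersection
    precision = matched / max(sum(ocr.values()), 1)
    recall = matched / max(sum(gt.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


precision, recall, f1 = bow_scores("de kat z1t op de rnat", "de kat zit op de mat")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For word-based searching, recall is probably the most relevant of the three: a ground-truth word that is missing from the OCR cannot be found by a search for that word.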
Impact analysis
• Most likely scenario
• Impact on the organisation
• Percentage of improvement on OCR
• Effort needed to implement method
• How to handle different versions of OCR
Any tips or questions?
