Improving OCR of historical newspapers and journals published in Finland
Senka Drobac, Pekka Kauppinen, and Krister Lindén
Motivation
•Corpus of historical newspapers and magazines digitized by the National Library of Finland
•OCR was done with the commercial software ABBYY FineReader
•Character accuracy rate (CAR): ~90-91%
Figure from: Vesanto, Aleksi, et al. "A system for identifying and exploring text repetition in
large historical document corpora." Proceedings of the 21st Nordic Conference on
Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. No. 131.
Linköping University Electronic Press, 2017.
Ocropy
•Decided to train models with Ocropy
•Ocropy:
• Open source, uses LSTM, line based
• Tools for preprocessing, segmentation, training, recognition,
evaluation
• Above 98.5% CAR on 19th- and 20th-century German texts
OCR workflow (Ocropy)
Image → Binarization (pre-processing) → Binarized image → Line segmentation → Line images → OCR (with trained model) → Text (lines) → Post-processing → Text (lines)
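The workflow above can be sketched as a sequence of Ocropy command-line calls. This is a minimal sketch, assuming the standard Ocropy tools `ocropus-nlbin`, `ocropus-gpageseg`, and `ocropus-rpred` are installed. The function name, paths, model file name, and file-name patterns are illustrative assumptions, not taken from the slides, and the post-processing step is not shown.

```python
from pathlib import Path


def ocropy_commands(page_image, workdir, model):
    """Build the command lines for one page, in workflow order.

    All arguments are hypothetical example paths; nothing is executed here.
    """
    workdir = Path(workdir)
    return [
        # 1. Binarization (pre-processing): page image -> binarized image
        ["ocropus-nlbin", str(page_image), "-o", str(workdir)],
        # 2. Line segmentation: binarized image -> line images
        ["ocropus-gpageseg", str(workdir / "????.bin.png")],
        # 3. OCR with a trained model: line images -> text lines
        ["ocropus-rpred", "-m", str(model),
         str(workdir / "????" / "??????.bin.png")],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even where Ocropy is not installed.
    for cmd in ocropy_commands("page.png", "work", "fin-swe.pyrnn.gz"):
        print(" ".join(cmd))
```

To actually run a step, each list could be passed to `subprocess.run(cmd, check=True)`; the patterns are passed unexpanded, on the assumption that the Ocropy tools expand them themselves.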
Data
1771 - 1939
Languages: Finnish and Swedish
Typefaces: Fraktur and Antiqua
• Good quality
• Finnish Fraktur
• One column
• Good quality
• Swedish Antiqua
• Two columns
• Binarized image
• Difficult segmentation
• Binarized image
• Challenging segmentation
• Many different fonts on one page
• Both Finnish and Swedish on the same page
• Poor quality
Line examples - Fraktur
☛ För billigt pris: En kursläde i garden
Sananlennätinkonttori awoinna joka päiwä
-— Salama iski tiistai yönä klo
pitänyt tarpeellisena warata jonkunlaisen
Line examples – Antiqua
osakkaat kutsutaan täten varsinaiseen yhtiö-
nuksia määräämälleen rautatiease-
m stammanträda i nämnde kontors loka
Heines poetische Werke. I två band. 17 m.
Goal
•Train a single model that can recognize all of the material: both languages (Finnish and Swedish) and both typefaces (Fraktur and Antiqua)
Previous work
•Ocropy + post-correction with finite-state transducers (FST)
•Finnish data sets:
• CAR: 93.5% - 94.83%
• After post-processing CAR: 93.68% - 95.21%
•It is better to randomly sample 10 000 lines from the entire corpus than to train on all lines from 250 pages
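The difference between the two ways of selecting training lines can be illustrated with a toy corpus. This is a minimal sketch: the corpus layout (a dict from page id to line ids), the function names, and the toy sizes are assumptions made for illustration, not the authors' tooling.

```python
import random


def sample_lines(corpus, n_lines, seed=0):
    """Randomly sample n_lines (page, line) pairs from the whole corpus."""
    all_lines = [(page, line)
                 for page, lines in corpus.items()
                 for line in lines]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(all_lines, min(n_lines, len(all_lines)))


def lines_from_pages(corpus, pages):
    """Take every line from a fixed set of pages (the page-based alternative)."""
    return [(page, line) for page in pages for line in corpus[page]]


# Toy corpus: 1000 pages with 40 lines each (hypothetical sizes).
toy_corpus = {page: list(range(40)) for page in range(1000)}

random_sample = sample_lines(toy_corpus, 400)
page_sample = lines_from_pages(toy_corpus, range(10))  # also 400 lines

# Same number of lines, but the random sample touches far more pages,
# and so more of the fonts, layouts, and print qualities in the corpus.
pages_in_random = len({page for page, _ in random_sample})
pages_in_page_based = len({page for page, _ in page_sample})
```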
• Lots of Swedish material → add Swedish training data
Finnish: ~10 000 training lines (randomly picked), ~75% Fraktur, ~25% Antiqua
Swedish: ~6 000 training lines (randomly picked), ~50% Fraktur, ~50% Antiqua
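As a quick arithmetic check, the approximate percentages above translate into the following line counts per language and typeface. The figures are only as precise as the rounded values on the slide, and the variable and function names are illustrative.

```python
# Approximate training-set sizes and typeface shares from the slide.
training_sets = {
    "Finnish": {"lines": 10_000, "Fraktur": 0.75, "Antiqua": 0.25},
    "Swedish": {"lines": 6_000, "Fraktur": 0.50, "Antiqua": 0.50},
}


def typeface_counts(sets):
    """Return approximate line counts keyed by (language, typeface)."""
    counts = {}
    for language, spec in sets.items():
        for typeface in ("Fraktur", "Antiqua"):
            counts[(language, typeface)] = round(spec["lines"] * spec[typeface])
    return counts


counts = typeface_counts(training_sets)
# Finnish: ~7500 Fraktur, ~2500 Antiqua
# Swedish: ~3000 Fraktur, ~3000 Antiqua
# Total:   ~16000 lines
total_lines = sum(counts.values())
```

Finnish Antiqua is the smallest of the four groups, at roughly 2 500 lines.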
Experiments
model    fin-test
fin-3k   94.0 / 76
fin-4k   94.1 / 77
fin-5k   93.2 / 72
fin-6k   94.4 / 77
fin-7k   94.5 / 78
fin-8k   94.3 / 77
fin-9k   94.0 / 76
fin      93.9 / 75
Results show CAR (%) / WAR (%), i.e. character accuracy rate / word accuracy rate
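For reference, CAR and WAR are commonly computed from the edit distance between the OCR output and the ground truth. The slides do not state the exact formula used, so the sketch below assumes the usual definition, 100 × (1 − edit distance / reference length), over characters for CAR and over whitespace-separated words for WAR. All function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """Accuracy in percent, relative to the length of the reference."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * (1 - edit_distance(ref, hyp) / len(ref))


def car(ref_text, hyp_text):
    """Character accuracy rate (%)."""
    return accuracy_rate(ref_text, hyp_text)


def war(ref_text, hyp_text):
    """Word accuracy rate (%), splitting on whitespace."""
    return accuracy_rate(ref_text.split(), hyp_text.split())


# One wrong character out of 19 also makes one of the three words wrong,
# which is why WAR is much lower than CAR for the same output.
reference = "Salama iski tiistai"
ocr_output = "Salama iski tiiftai"
example_car = car(reference, ocr_output)
example_war = war(reference, ocr_output)
```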
Experiments
Not enough Finnish Antiqua in training
Finnish results improve with additional Swedish data
Results show CAR (%) / WAR (%)
Future work
•Need more Finnish Antiqua data
•1D LSTM has too little memory → use deep neural networks (Calamari-ocr)
Thank you!
senka.drobac@helsinki.fi
