Slides for the paper "Improving OCR of historical newspapers and journals published in Finland" by Senka Drobac, Pekka Kauppinen and Krister Lindén, presented at the 3rd edition of the DATeCH International Conference (DATeCH2019)
1. Senka Drobac, Pekka Kauppinen and Krister Lindén
Improving OCR of historical
newspapers and journals
published in Finland
2. Motivation
•Corpus of historical newspapers and magazines digitized by the National Library of Finland
•OCR was done with the commercial software ABBYY FineReader
•Character accuracy rate (CAR): ~90-91%
3. Figure from: Vesanto, Aleksi, et al. "A system for identifying and exploring text repetition in
large historical document corpora." Proceedings of the 21st Nordic Conference on
Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. No. 131.
Linköping University Electronic Press, 2017.
4. Ocropy
•Decided to train models with Ocropy
•Ocropy:
• Open source, uses LSTM, line based
• Tools for preprocessing, segmentation, training, recognition,
evaluation
• Above 98.5% CAR on German 19th and 20th century
13. Line examples - Fraktur
☛ För billigt pris: En kursläde i garden
Sananlennätinkonttori awoinna joka päiwä
-— Salama iski tiistai yönä klo
pitänyt tarpeellisena warata jonkunlaisen
14. Line examples – Antiqua
osakkaat kutsutaan täten varsinaiseen yhtiö-
nuksia määräämälleen rautatiease-
m stammanträda i nämnde kontors loka
Heines poetische Werke. I två band. 17 m.
16. Previous work
•Ocropy + post-correction (FST)
•Finnish data sets:
• CAR: 93.5% - 94.83%
• After post-processing CAR: 93.68% - 95.21%
•Randomly sampling 10 000 lines from the entire corpus works better than training on all lines from 250 pages
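The sampling strategy above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function name `sample_training_lines`, the fixed seed, and the toy corpus of (page, line) pairs are all assumptions made for the example.

```python
import random


def sample_training_lines(corpus_lines, n_lines=10_000, seed=0):
    """Draw n_lines at random, without replacement, from the whole corpus."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n = min(n_lines, len(corpus_lines))
    return rng.sample(corpus_lines, n)


# Hypothetical corpus: (page_id, line_id) pairs for 2 000 pages of 60 lines each.
corpus = [(page, line) for page in range(2000) for line in range(60)]

sampled = sample_training_lines(corpus)
pages_covered = len({page for page, _ in sampled})
print(len(sampled), "lines drawn from", pages_covered, "different pages")
```

Because the lines are drawn from the whole corpus, the sample touches far more pages (and so more fonts, layouts and print qualities) than the same effort spent on transcribing 250 complete pages.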
17. • Lots of Swedish material → add Swedish training data
• Finnish: ~10 000 training lines (randomly picked); ~75% Fraktur, ~25% Antiqua
• Swedish: ~6 000 training lines (randomly picked); ~50% Fraktur, ~50% Antiqua
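As a rough worked calculation, the font balance of a combined training set follows from the approximate figures above. The sketch assumes the two sets are simply concatenated, which the slide does not state explicitly; the variable names are illustrative.

```python
# Approximate figures from the slide: (number of lines, share of Fraktur).
datasets = {
    "Finnish": (10_000, 0.75),
    "Swedish": (6_000, 0.50),
}

total_lines = sum(n for n, _ in datasets.values())
fraktur_lines = sum(n * share for n, share in datasets.values())
antiqua_lines = total_lines - fraktur_lines

print(f"total lines: {total_lines}")
print(f"Fraktur: ~{fraktur_lines:.0f} ({fraktur_lines / total_lines:.1%})")
print(f"Antiqua: ~{antiqua_lines:.0f} ({antiqua_lines / total_lines:.1%})")
```

Under that assumption the combined set has about 16 000 lines, of which about 10 500 are Fraktur and 5 500 Antiqua, so adding the Swedish data also shifts the mix towards Antiqua.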
18. Experiments
model     fin-test
fin-3k    94.0 / 76
fin-4k    94.1 / 77
fin-5k    93.2 / 72
fin-6k    94.4 / 77
fin-7k    94.5 / 78
fin-8k    94.3 / 77
fin-9k    94.0 / 76
fin       93.9 / 75
Results show CAR (%) / WAR (%)
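For readers unfamiliar with the two metrics, here is a minimal sketch of how character and word accuracy rates can be computed. It assumes the common definition 100 × (1 − edit distance / reference length); the slides do not spell out the exact formula or evaluation tool used. The reference line is taken from the Fraktur examples, while the OCR output is invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """Accuracy in percent: 100 * (1 - edit_distance / reference length)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * (1 - edit_distance(ref, hyp) / len(ref))


reference = "Sananlennätinkonttori awoinna joka päiwä"
ocr_output = "Sananlennatinkonttori avoinna joka päiwä"  # hypothetical OCR result

car = accuracy_rate(reference, ocr_output)                  # character level
war = accuracy_rate(reference.split(), ocr_output.split())  # word level
print(f"CAR: {car:.1f}%  WAR: {war:.1f}%")
```

In this example two character errors fall in two different words, so the word-level score drops much further than the character-level score. The same effect explains why the WAR values in the table are far below the CAR values.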