1. AI for digitized
cultural heritage
Clemens Neudecker (@cneudecker)
Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
#QURATOR2020 – Conference on Digital Curation Technologies
20 January 2020, Fraunhofer FOKUS, Berlin
qurator@sbb.spk-berlin.de
2. Table of contents
● Introduction
● Challenges & Goals
● Document Layout Analysis
● Optical Character Recognition
● Named Entity Recognition
3. Background
• Staatsbibliothek zu Berlin –
Preußischer Kulturbesitz
(Berlin State Library, SBB)
• Established 1661
• Largest research library in Germany
• Over 12m volumes, 23m objects total
• Legal deposit since 1699
• https://staatsbibliothek-berlin.de/en/
4. Digitization @ SBB
• Since 2007: in-house Digitization Center
• Approx. 1.7M images annual production
• Up to 80 concurrent digitization projects
• >20 diverse bookscanners, scanrobots, etc.
• Operation in two shifts with 24 operators
• Digitization-on-demand service
• KITODO open source workflow
management software
5. Data
• Digitized Collections
• https://digital.staatsbibliothek-berlin.de/
• ca. 165,000 documents
• ca. 5M pages with OCR fulltext
• Digitized Newspapers (ZEFYS)
• http://zefys.staatsbibliothek-berlin.de/
• ca. 7M pages digitized
• ca. 3M pages with OCR fulltext
• Special subject databases, catalogues,
datasets etc.
• Public Domain license up to 1920
(exceptions apply)
• ca. 2.5 petabytes
6. Qurator @ SBB
• Topic: "Automated curation technologies for digitized cultural heritage"
• Team:
• 3x data scientist = 108 PM
• 2x manager = 12 PM
• ML server:
• 2x Nvidia Tesla V100 32GB
• 2x 18-core Intel Xeon 2.7 GHz
• 192GB DDR4 RAM
• Open Source development
• https://github.com/qurator-spk
• Open datasets
• https://zenodo.org/communities/stabi
• Trained models
• https://qurator-data.de/
https://xkcd.com/1838/
9. Historical language
• Spelling variation
• Special characters
• Long s ſ
• Umlauts
• Ligatures æ, st, fi, …
• Hyphens ⸗
• Special chars ↄ, …
• Symbols ☞, ❧, ∴, …
10. Users want this (and more)
• Keyword search in digitized collections
• Filters to in-/exclude document regions (e.g. running titles, footnotes)
• Query expansion for historical spelling variants ("Teil" → "Theyl")
• Search for named entities in digitized collections
• Linking of named entities to authority files, geocoordinates
• Digital Humanities
• Text- and data mining (Oceanic Exchanges)
• Extraction of historical social networks from documents (SoNAR-IDH)
• Query by image, image similarity search
12. Layout analysis
• Image pre-processing
• Deskewing, Dewarping, Binarization
• Pixelwise segmentation
• Text vs. non-text regions
(Images, Tables, Separators etc.)
• Textline detection
13. Binarization
• Implemented as pixel labeling
• Ground truth: results from previous ICDAR binarization competitions (DIBCO)
• Combination of 4 models
• Optimized for printed text
• No denoising/despeckling (yet)
[Figure: source image and binarized image]
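A minimal sketch of how several pixel-labeling models could be combined into one binarized image. It assumes each model outputs a per-pixel foreground probability and that the outputs are averaged and then thresholded at 0.5. The slides do not say how the 4 models are actually combined, so the averaging, the threshold, and the name `combine_binarizations` are all assumptions for illustration.

```python
def combine_binarizations(prob_maps, threshold=0.5):
    """Average per-pixel foreground probabilities from several models.

    prob_maps: list of 2D lists (one per model), values in [0, 1].
    Returns a 2D list with 1 for foreground (ink) and 0 for background.
    The averaging rule and threshold are assumptions, not the SBB method.
    """
    n_models = len(prob_maps)
    height = len(prob_maps[0])
    width = len(prob_maps[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Mean foreground probability over all models for this pixel
            mean = sum(model[y][x] for model in prob_maps) / n_models
            row.append(1 if mean >= threshold else 0)
        result.append(row)
    return result
```

Averaging is only one option; majority voting on each model's hard decision would be an equally plausible reading of "combination".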
14. Segmentation
• Training a CNN (ResNet50/U-Net) for pixel labeling
• Distinguish up to 16 different classes
• columns, paragraphs, separators
• headlines, footnotes, marginalia
• tables, graphics, formulas
• etc.
[Figure: source image and segmentation result]
16. Reading order
• Current approach: purely based on segmentation (optical features)
• Future plans: hybrid approach combining optical features with language features (apply a transformer, e.g. BERT, to determine the correct sequence of regions by semantics)
18. A modern OCR workflow
Binarization → Textline segmentation → OCR → Postcorrection
20
–
rath mit einer Pœna fiſcali angeſehen worden,
und ſolche durch des Hon. Graffen von
Königsfeld Vor–
ſpruch, nur aus Gnaden nachgelaſſen erhalten.
Sondern man hat auich dieſen 4. Wochen lang
alle Abend bey der Jnquißtin gantz allein
gelaſſen
Binnen welcher gantzer Zeit der Schreiber
Bredekam beſtändig bey Jhme geweſen, und
ſich in
der am 13ten Octobt. a.c. in Judicio gegen
ſeinen geweſenen Hrn. introducirter Appellation
deſſen Bey-
raths bedienet hat;
33) Dabenehenſt iſt der Schreiber binnen dieſer
gantzen Zeit auf freyem Fuß geblieben, und
hat nicht nur durch ſeinen Conlulenten, ſondern
auch, weilen del lnquilti ſelbſten in Jhtem
Gefängnüß
ſo viele Freyheit gelaſſen worden, daß ſie
frembden Beſuch von Jhren Anberwandten
ohngehindert en–
pfangen können, durch andere Perſonen ſich
mit ihr über alles, Was Er oder ſie dereinſten zu
ſagen hat–
ten· vereinigen können, immaſſen der Hofrath
[...]
20
rath mit einer Pœna fiſcali angeſehen worden,
und ſolche durch des Hrn. Graffen von
Königsfeld Vor–
ſpruch, nur aus Gnaden nachgelaſſen erhalten.
Sondern man hat auch dieſen 4. Wochen lang
alle Abend bey der Jnquisitin gantz allein
gelaſſen.
Binnen welcher gantzer Zeit der Schreiber
Bredekaw beſtändig bey Jhme geweſen, und
ſich in
der am 13 ten Octobr. a.c. in Judicio gegen
ſeinen geweſenen Hrn. introducirter Appellation
deſſen Bey-
raths bedienet hat;
33) Dabenebenſt iſt der Schreiber binnen dieſer
gantzen Zeit auf freyem Fuß geblieben, und
hat nicht nur durch ſeinen Conſulenten, ſondern
auch, weilen der Inquiſitin ſelbſten in Jhrem
Gefängnüß
ſo viele Freyheit gelaſſen worden, daß ſie
frembden Beſuch von Jhren Anverwandten
ohngehindert em–
pfangen können, durch andere Perſonen ſich
mit ihr über alles, Was Er oder ſie dereinſten zu
ſagen hat–
ten, vereinigen können, immaſſen der Hofrath
[...]
Acten-mäßiger Verlauff, Des Fameusen
Processus sich verhaltende ... (1749)
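The two transcriptions above differ in a handful of characters (for example "Hon." versus "Hrn."). One common way to quantify such differences is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The slides do not state which metric is used, so the sketch below is only an illustration, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(ocr_text, reference) / len(reference)


# One substituted character in a 12-character reference
print(character_error_rate("Hon. Graffen", "Hrn. Graffen"))
```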
19. • Convolutional Layer (Pixel → Feature Maps): learns features such as curves, edges, shapes etc.
• Recurrent Layer (Feature Maps → Probability Matrix): learns characters in sliding windows + context
• Connectionist Temporal Classification Layer (Probability Matrix → Labels): learns most probable text output
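The last step, turning a probability matrix into labels, can be illustrated with greedy (best-path) CTC decoding: take the most probable label at each time step, collapse repeated labels, then drop the blank symbol. This is a minimal sketch under the assumption that index 0 is the blank. OCR engines may use other decoding strategies, and `ctc_greedy_decode` is a hypothetical name.

```python
def ctc_greedy_decode(prob_matrix, alphabet, blank=0):
    """Greedy CTC decoding.

    prob_matrix: list of time steps, each a list of label probabilities.
    alphabet: maps label index to character (entry at `blank` is unused).
    """
    # Most probable label index at every time step
    best_path = [
        max(range(len(step)), key=step.__getitem__) for step in prob_matrix
    ]
    decoded = []
    previous = None
    for label in best_path:
        # Keep a label only if it starts a new run and is not the blank
        if label != previous and label != blank:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


alphabet = ["", "a", "b"]  # index 0 is the blank (assumption)
matrix = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.6, 0.3],    # a (new run after the blank)
    [0.1, 0.2, 0.7],    # b
    [0.3, 0.1, 0.6],    # b (repeat, collapsed)
]
print(ctc_greedy_decode(matrix, alphabet))  # aab
```

The blank between the second and third "a" is what lets the output contain a doubled character.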
20. Optical Character Recognition Models
• Standard models in Tesseract OCR
• Not reproducible
• Encoding issues
• ch- and ck-ligatures as <, >
• no long s (ſ) for Antiqua
• no superscript e (aᵉ, uᵉ, etc.)
• Our model for Calamari OCR
• Reproducible
• Based on GT4HistOCR dataset¹
• Incunabula, Fraktur, early Antiqua
• 300,000 text lines
• 1 week training on Nvidia RTX 2080
¹GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin – Springmann et al.
21. Voting of multiple OCR models
• Instead of a single model, k equally strong models are trained
• k-fold cross-validation
• Models vote and agree on a common recognition result
• Sum of model confidences
Example: confidences of three models for the uncertain character in "Beyſp?el."
Model 1: i: 0.8 l: 0.2 j: 0.0
Model 2: i: 0.4 l: 0.5 j: 0.1
Model 3: i: 0.3 l: 0.4 j: 0.3
Σ: i: 1.5 l: 1.1 j: 0.4
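The confidence voting for a single character position can be sketched as follows, using the values from the slide. The sketch assumes the outputs of the models are already aligned position by position, which a real voter has to establish first; `vote` is a hypothetical name.

```python
def vote(model_outputs):
    """Sum per-character confidences over all models and pick the winner.

    model_outputs: list of dicts mapping candidate character -> confidence,
    one dict per model, all for the same (already aligned) position.
    """
    totals = {}
    for confidences in model_outputs:
        for char, confidence in confidences.items():
            totals[char] = totals.get(char, 0.0) + confidence
    winner = max(totals, key=totals.get)
    return winner, totals


# Confidences of three models for the uncertain character
outputs = [
    {"i": 0.8, "l": 0.2, "j": 0.0},
    {"i": 0.4, "l": 0.5, "j": 0.1},
    {"i": 0.3, "l": 0.4, "j": 0.3},
]
winner, totals = vote(outputs)
print(winner)  # i
```

Two of the three models prefer "l" here, yet "i" wins because the sum of confidences (about 1.5 versus 1.1) outweighs a simple majority.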
23. Named Entity Recognition
● Information extraction from a given text
● Identification and classification of named entities such as:
● Persons
● Locations
● Organisations
● Products
● Events
● etc.
Vorwort von Alexander v. Humboldt zu den "Erinnerungen der Reise nach Indien von S. K. H.
dem Prinzen Waldemar von Preussen" : [Berlin, den 18 December 1854]
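NER models are commonly trained to emit one tag per token, and the entities are then assembled from runs of tags. The sketch below assumes a BIO tagging scheme (B- begins an entity, I- continues it, O is outside), which the slides do not specify. The tags in the example are assigned by hand for illustration and are not model output.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if any
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["Vorwort", "von", "Alexander", "v.", "Humboldt", "Berlin"]
tags = ["O", "O", "B-PER", "I-PER", "I-PER", "B-LOC"]  # hand-assigned
print(bio_to_entities(tokens, tags))
```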
24. BERT - Pretraining
Google:
● BERT-base: 110M parameters
● 100 languages
● 100 largest Wikipedias
● 16x Google
Tensor Processing Units
with 64GB VRAM each
● Processing time ca. 4 days
SBB:
● Starting from Google Model
● 2,333,647 German language
pages (OCR) from the SBB
digitized collections
● 1x NVIDIA V100 GPU
with 32GB VRAM
● 10 epochs
● Processing time ca. 2 weeks
25. NER Training - Ground Truth
● CoNLL 2003 corpus (ca. 200,000 tokens)
● GermEval Konvens 2014 corpus (ca. 450,000 tokens)
● Historical newspapers (Europeana Newspapers):
○ Newspapers from 1926 (Landesbibliothek Dr. Friedrich Teßmann, ca. 70,000 tokens, LFT)
○ Newspapers from 1710 - 1873 (Austrian National Library, ca. 30,000 tokens, ONB)
○ Newspapers from 1872 - 1930 (Staatsbibliothek zu Berlin, ca. 50,000 tokens, SBB)
F1 score: 84.3% ± 1.1%
(5-fold cross-validation)
Kai Labusch, Clemens Neudecker and David Zellhöfer:
BERT for Named Entity Recognition in Contemporary
and Historic German, KONVENS 2019.
26. Named Entity Disambiguation and Linking
• Disambiguation and linking of named entities to an authority file/knowledge base (Wikidata, GND, Geonames)
• Initial approach using embeddings (fastText & Flair & BERT) with nearest neighbour search
CC BY-SA 4.0 Aparravi
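Linking by nearest neighbour search can be sketched as follows: embed the entity mention, compare it with the embeddings of the knowledge base entries, and return the closest entry. The sketch assumes cosine similarity as the measure and uses toy two-dimensional vectors. The identifiers, the vectors, and the name `link_entity` are hypothetical; the slides do not specify the similarity measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_entity(mention_vector, knowledge_base):
    """Return the id of the knowledge base entry closest to the mention.

    knowledge_base: dict mapping entry id -> embedding vector.
    """
    return max(
        knowledge_base,
        key=lambda entry_id: cosine_similarity(
            mention_vector, knowledge_base[entry_id]
        ),
    )


# Toy embeddings; real ones would come from fastText, Flair or BERT
kb = {"kb:place": [1.0, 0.0], "kb:person": [0.0, 1.0]}
print(link_entity([0.9, 0.2], kb))  # kb:place
```

A linear scan like this only works for small candidate sets; for a knowledge base the size of Wikidata an approximate nearest neighbour index would be needed.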
27. Thank you for your attention!
Questions please?