AI for digitized cultural heritage

Clemens Neudecker (@cneudecker)
Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
#QURATOR2020 – Conference on Digital Curation Technologies
20 January 2020, Fraunhofer FOKUS, Berlin
qurator@sbb.spk-berlin.de
Table of contents
● Introduction
● Challenges & Goals
● Document Layout Analysis
● Optical Character Recognition
● Named Entity Recognition
Background
• Staatsbibliothek zu Berlin –
Preußischer Kulturbesitz
(Berlin State Library, SBB)
• Established 1661
• Largest research library in Germany
• Over 12m volumes, 23m objects total
• Legal deposit since 1699
• https://staatsbibliothek-berlin.de/en/
Digitization @ SBB
• Since 2007: in-house Digitization Center
• Approx. 1.7M images annual production
• Up to 80 concurrent digitization projects
• >20 different book scanners, scanning robots, etc.
• Operation in two shifts with 24 operators
• Digitisation-on-demand service
• KITODO open source workflow
management software
Data
• Digitized Collections
• https://digital.staatsbibliothek-berlin.de/
• ca. 165,000 documents
• ca. 5M pages with OCR fulltext
• Digitized Newspapers (ZEFYS)
• http://zefys.staatsbibliothek-berlin.de/
• ca. 7M pages digitized
• ca. 3M pages with OCR fulltext
• Special subject databases, catalogues,
datasets etc.
• Public Domain license up to 1920
(exceptions apply)
• ca. 2.5 petabytes
Qurator @ SBB
• Topic: "Automated curation technologies for digitized cultural heritage"
• Team:
• 3x data scientist = 108 PM
• 2x manager = 12 PM
• ML server:
• 2x Nvidia Tesla V100 32GB
• 2x 18-core Intel Xeon 2.7 GHz
• 192GB DDR4 RAM
• Open Source development
• https://github.com/qurator-spk
• Open datasets
• https://zenodo.org/communities/stabi
• Trained models
• https://qurator-data.de/
https://xkcd.com/1838/
Challenges & Goals
Historical documents
Historical language
• Spelling variation
• Special characters
• Long s ſ
• Umlauts
• Ligatures æ, st, fi, …
• Hyphens ⸗
• Special chars ↄ, …
• Symbols ☞, ❧, ∴, …
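A minimal sketch of how such characters might be folded to modern equivalents for search indexing. The mapping table is hypothetical and deliberately small; it is not the SBB's actual normalization, and a real table would be curated per collection and period.

```python
import unicodedata

# Hypothetical, illustrative mapping from historical glyphs to modern
# search-friendly equivalents. Order matters only for multi-character keys.
SEARCH_MAP = [
    ("\u017f", "s"),            # long s (ſ)
    ("\u2e17", "-"),            # double oblique hyphen (⸗)
    ("\u00e6", "ae"),           # ligature æ
    ("\u0153", "oe"),           # ligature œ
    ("a\u0364", "\u00e4"),      # a + combining small e above -> ä
    ("o\u0364", "\u00f6"),      # o + combining small e above -> ö
    ("u\u0364", "\u00fc"),      # u + combining small e above -> ü
]


def normalize_for_search(text: str) -> str:
    """Fold historical characters so that modern queries can match them."""
    # NFC keeps ſ and the combining e intact, so the explicit table applies.
    text = unicodedata.normalize("NFC", text)
    for old, new in SEARCH_MAP:
        text = text.replace(old, new)
    return text


sample = normalize_for_search("P\u0153na fi\u017fcali ange\u017fehen")
```

Normalization of this kind is lossy, so it would typically be applied to a search index only, with the diplomatic transcription kept as the stored text.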
Users want this (and more)
• Keyword search in digitized collections
• Filters to include/exclude document regions (e.g. running titles, footnotes)
• Query expansion for historical spelling variants ("Teil" → "Theyl")
• Search for named entities in digitized collections
• Linking of named entities to authority files, geocoordinates
• Digital Humanities
• Text and data mining (Oceanic Exchanges)
• Extraction of historical social networks from documents (SoNAR-IDH)
• Query by image, image similarity search
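The query expansion item above can be illustrated with a small rule-based sketch. The two rewrite rules are assumptions chosen only to reproduce the "Teil" → "Theyl" example; a production system would more likely learn or curate a much larger rule set.

```python
# Hypothetical rewrite rules (modern spelling -> historical spelling).
RULES = [("T", "Th"), ("ei", "ey")]


def expand_query(term: str) -> set:
    """Return the term plus every variant reachable by applying the rules."""
    variants = {term}
    for old, new in RULES:
        # Apply the rule to every variant collected so far, keeping the originals.
        variants |= {variant.replace(old, new) for variant in variants}
    return variants


variants = expand_query("Teil")
```

A search backend could then OR the variants together, so that a query for the modern form also retrieves pages using the historical spellings.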
Document Layout Analysis
Layout analysis
• Image pre-processing
• Deskewing, Dewarping, Binarization
• Pixelwise segmentation
• Text vs. non-text regions
(Images, Tables, Separators etc.)
• Textline detection
Binarization
• Implemented as pixel-labeling
• Ground Truth: use results from previous ICDAR binarization competitions (DIBCO)
• Combination of 4 models
• Optimized for printed text
• No denoising/despeckling (yet)
[Figure: source image and binarized image]
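The slide does not say how the four binarization models are combined. One plausible reading, sketched here purely as an assumption, is to average the per-pixel foreground probabilities and apply a threshold. The function name, the 0.5 threshold, and the toy probability maps are all illustrative.

```python
def combine_binarizations(prob_maps, threshold=0.5):
    """Average per-pixel foreground probabilities from several models.

    prob_maps: list of 2D lists (one per model), all with the same shape.
    Returns a 2D list of 0/1 labels (1 = foreground/ink).
    """
    n_models = len(prob_maps)
    height, width = len(prob_maps[0]), len(prob_maps[0][0])
    return [
        [
            1 if sum(m[y][x] for m in prob_maps) / n_models >= threshold else 0
            for x in range(width)
        ]
        for y in range(height)
    ]


# Four toy 1x3 probability maps standing in for four model outputs.
model_outputs = [
    [[0.9, 0.2, 0.6]],
    [[0.8, 0.1, 0.4]],
    [[0.7, 0.3, 0.5]],
    [[0.9, 0.2, 0.3]],
]
binary = combine_binarizations(model_outputs)
```

The third pixel shows why averaging can differ from any single model: two of the four models rate it at 0.5 or above, but the mean of 0.45 falls below the threshold.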
Segmentation
[Figure: source image and segmentation result]
• Training a CNN (ResNet50/U-Net) for pixel labeling
• Distinguish up to 16 different classes
• columns, paragraphs, separators
• headlines, footnotes, marginalia
• tables, graphics, formulas
• etc.
Textline detection
• Detect all textlines in the image and extract their bounding boxes
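A minimal sketch of the box-extraction step, assuming the segmentation model has already produced a binary mask in which textline pixels are 1. It uses plain 4-connected component labeling; the actual pipeline may use a different post-processing method.

```python
from collections import deque


def textline_boxes(mask):
    """Return (x_min, y_min, x_max, y_max) for each 4-connected blob of 1s."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if mask[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first flood fill over one connected component.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy mask with two separate textlines.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
boxes = textline_boxes(mask)
```

Boxes are returned in scan order (top to bottom, then left to right by first pixel encountered), which is not necessarily the reading order of the page.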
Reading order
• Current approach: purely based on segmentation (optical features)
• Future plans: hybrid approach combining optical features with language features (apply a transformer, e.g. BERT, to determine the correct sequence of regions by semantics)
Optical Character Recognition
A modern OCR workflow
Binarization → Textline segmentation → OCR → Postcorrection
20
–
rath mit einer Pœna fiſcali angeſehen worden,
und ſolche durch des Hon. Graffen von
Königsfeld Vor–
ſpruch, nur aus Gnaden nachgelaſſen erhalten.
Sondern man hat auich dieſen 4. Wochen lang
alle Abend bey der Jnquißtin gantz allein
gelaſſen
Binnen welcher gantzer Zeit der Schreiber
Bredekam beſtändig bey Jhme geweſen, und
ſich in
der am 13ten Octobt. a.c. in Judicio gegen
ſeinen geweſenen Hrn. introducirter Appellation
deſſen Bey-
raths bedienet hat;
33) Dabenehenſt iſt der Schreiber binnen dieſer
gantzen Zeit auf freyem Fuß geblieben, und
hat nicht nur durch ſeinen Conlulenten, ſondern
auch, weilen del lnquilti ſelbſten in Jhtem
Gefängnüß
ſo viele Freyheit gelaſſen worden, daß ſie
frembden Beſuch von Jhren Anberwandten
ohngehindert en–
pfangen können, durch andere Perſonen ſich
mit ihr über alles, Was Er oder ſie dereinſten zu
ſagen hat–
ten· vereinigen können, immaſſen der Hofrath
[...]
20
rath mit einer Pœna fiſcali angeſehen worden,
und ſolche durch des Hrn. Graffen von
Königsfeld Vor–
ſpruch, nur aus Gnaden nachgelaſſen erhalten.
Sondern man hat auch dieſen 4. Wochen lang
alle Abend bey der Jnquisitin gantz allein
gelaſſen.
Binnen welcher gantzer Zeit der Schreiber
Bredekaw beſtändig bey Jhme geweſen, und
ſich in
der am 13 ten Octobr. a.c. in Judicio gegen
ſeinen geweſenen Hrn. introducirter Appellation
deſſen Bey-
raths bedienet hat;
33) Dabenebenſt iſt der Schreiber binnen dieſer
gantzen Zeit auf freyem Fuß geblieben, und
hat nicht nur durch ſeinen Conſulenten, ſondern
auch, weilen der Inquiſitin ſelbſten in Jhrem
Gefängnüß
ſo viele Freyheit gelaſſen worden, daß ſie
frembden Beſuch von Jhren Anverwandten
ohngehindert em–
pfangen können, durch andere Perſonen ſich
mit ihr über alles, Was Er oder ſie dereinſten zu
ſagen hat–
ten, vereinigen können, immaſſen der Hofrath
[...]
Acten-mäßiger Verlauff, Des Fameusen
Processus sich verhaltende ... (1749)
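The difference between the two transcriptions above can be quantified with a character error rate (CER). The sketch below assumes that the second transcription is the corrected reference and the first is raw OCR output; the slide itself does not label them. CER is computed here as Levenshtein edit distance divided by reference length.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    return levenshtein(hypothesis, reference) / len(reference)


# One line from each transcription above ("Hon." vs "Hrn.").
ocr_line = "und \u017folche durch des Hon. Graffen von"
ref_line = "und \u017folche durch des Hrn. Graffen von"
distance = levenshtein(ocr_line, ref_line)
line_cer = cer(ocr_line, ref_line)
```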
• Convolutional Layer: Pixel → Feature Maps (learns features: curves, edges, shapes etc.)
• Recurrent Layer: Feature Maps → Probability Matrix (learns characters in sliding windows + context)
• Connectionist Temporal Classification Layer: Probability Matrix → Labels (learns most probable text output)
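To make the final "Probability Matrix → Labels" step concrete, here is a sketch of greedy (best-path) CTC decoding: take the most probable label per time step, collapse repeats, and drop blanks. This is one simple decoding strategy, shown as an assumption; OCR engines may use other decoders. The alphabet and probabilities are toy values.

```python
def ctc_greedy_decode(prob_matrix, alphabet, blank=0):
    """Decode a CTC probability matrix by best path.

    prob_matrix: list of time steps, each a list of label probabilities.
    alphabet: list mapping label index -> character (blank index is unused).
    """
    # Most probable label index at every time step.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in prob_matrix]
    decoded = []
    previous = None
    for label in best_path:
        # Emit a character only when the label changes and is not the blank.
        if label != previous and label != blank:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


alphabet = ["", "e", "l"]  # index 0 is the CTC blank
prob_matrix = [
    [0.10, 0.80, 0.10],  # e
    [0.20, 0.70, 0.10],  # e (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # l
    [0.10, 0.20, 0.70],  # l (repeat, collapsed)
    [0.80, 0.10, 0.10],  # blank separates the two l's
    [0.20, 0.10, 0.70],  # l
]
text = ctc_greedy_decode(prob_matrix, alphabet)
```

The blank between the last two "l" steps is what allows a genuine double letter to survive the collapsing of repeats.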
Optical Character Recognition Models
• Standard models in Tesseract OCR
• Not reproducible
• Encoding issues
• ch and ck ligatures encoded as <, >
• no long s (ſ) for Antiqua
• no superscript e (aᵉ, uᵉ, etc.)
• Our model for Calamari OCR
• Reproducible
• Based on the GT4HistOCR dataset¹
• Incunabula, Fraktur, early Antiqua
• 300,000 text lines
• 1 week of training on an Nvidia RTX 2080
¹ GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin – Springmann et al.
Voting of multiple OCR models
• Instead of a single model, k equally strong models are trained
• k-fold Cross Validation
• Models vote – agree on a common
recognition result
• Sum of model confidences
Example: the ambiguous character in "Beyſp?el.", as scored by three models:
Model 1: i: 0.8 l: 0.2 j: 0.0
Model 2: i: 0.4 l: 0.5 j: 0.1
Model 3: i: 0.3 l: 0.4 j: 0.3
Σ (i): 1.5
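The confidence-sum vote from this example can be reproduced in a few lines. The per-character confidences are the ones shown on the slide; the function name and data layout are illustrative.

```python
def vote(confidences_per_model):
    """Sum each candidate's confidence across models and pick the largest sum."""
    totals = {}
    for confidences in confidences_per_model:
        for char, score in confidences.items():
            totals[char] = totals.get(char, 0.0) + score
    winner = max(totals, key=totals.get)
    return winner, totals


# Confidences of three models for the ambiguous character in "Beyſp?el."
models = [
    {"i": 0.8, "l": 0.2, "j": 0.0},
    {"i": 0.4, "l": 0.5, "j": 0.1},
    {"i": 0.3, "l": 0.4, "j": 0.3},
]
winner, totals = vote(models)
```

Summing confidences matters here: a plain majority vote over each model's top choice would pick "l" (preferred by two of the three models), whereas the summed confidences favour "i".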
Named Entity Recognition
Named Entity Recognition
● Information extraction from a given text
● Identification and classification of named entities such as e.g.:
● Persons
● Locations
● Organisations
● Products
● Events
● etc.
Vorwort von Alexander v. Humboldt zu den "Erinnerungen der Reise nach Indien von S. K. H.
dem Prinzen Waldemar von Preussen" : [Berlin, den 18 December 1854]
BERT - Pretraining
Google:
● BERT-base: 110M parameters
● 100 languages
● 100 largest Wikipedias
● 16x Google
Tensor Processing Units
with 64GB VRAM each
● Processing time ca. 4 days
SBB:
● Starting from Google Model
● 2,333,647 German-language pages (OCR) from the SBB digitized collections
● 1x NVIDIA V100 GPU
with 32GB VRAM
● 10 epochs
● Processing time ca. 2 weeks
NER Training - Ground Truth
● CoNLL 2003 corpus (ca. 200,000 tokens)
● GermEval Konvens 2014 corpus (ca. 450,000 tokens)
● Historical newspapers (Europeana Newspapers):
○ Newspapers from 1926 (Landesbibliothek Dr. Friedrich Teßmann, ca. 70,000 tokens, LFT)
○ Newspapers from 1710 - 1873 (Austrian National Library, ca. 30,000 tokens, ONB)
○ Newspapers from 1872 - 1930 (Staatsbibliothek zu Berlin, ca. 50,000 tokens, SBB)
F1 score: 84.3% ± 1.1% (5-fold cross validation)
Kai Labusch, Clemens Neudecker and David Zellhöfer:
BERT for Named Entity Recognition in Contemporary
and Historic German, KONVENS 2019.
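For readers unfamiliar with the metric, the sketch below shows how an entity-level F1 score can be computed from BIO-tagged sequences. It assumes exact-match span scoring in the CoNLL style; whether the cited paper scores in exactly this way is not stated on the slide. The tag sequences are toy data, not results from the paper.

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO sequence; end is exclusive."""
    spans, start, entity_type = set(), None, None
    # A trailing "O" guarantees that an entity at the end gets closed.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and entity_type == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, entity_type))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return spans


def f1_score(gold_tags, predicted_tags):
    """Exact-match entity-level F1 between gold and predicted BIO tags."""
    gold = extract_entities(gold_tags)
    predicted = extract_entities(predicted_tags)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example for the tokens: Alexander / v. / Humboldt / , / Berlin
gold = ["B-PER", "I-PER", "I-PER", "O", "B-LOC"]
predicted = ["B-PER", "I-PER", "O", "O", "B-LOC"]
score = f1_score(gold, predicted)
```

The truncated person span counts as a miss under exact matching, so only the location is a true positive and both precision and recall are 0.5.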
Named Entity Disambiguation and Linking
• Disambiguation and linking of named entities to an authority file/knowledge base (Wikidata, GND, Geonames)
• Initial approach using embeddings (Fasttext & Flair & BERT) with nearest neighbour search
[Image credit: CC BY-SA 4.0 Aparravi]
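A minimal sketch of nearest neighbour linking over embeddings, using cosine similarity. The knowledge base identifiers and the three-dimensional vectors are made up for illustration; real embeddings from Fasttext, Flair or BERT have hundreds of dimensions, and a real system would use an approximate nearest neighbour index rather than a linear scan.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_entity(mention_vector, knowledge_base):
    """Return the knowledge base id whose embedding is closest to the mention."""
    return max(
        knowledge_base,
        key=lambda entry_id: cosine_similarity(mention_vector, knowledge_base[entry_id]),
    )


# Hypothetical entries and toy embeddings (not real authority file ids).
knowledge_base = {
    "kb:humboldt_alexander": [0.9, 0.1, 0.3],
    "kb:humboldt_wilhelm": [0.2, 0.9, 0.1],
}
mention = [0.8, 0.2, 0.4]  # toy embedding of a mention "Humboldt" in context
best_match = link_entity(mention, knowledge_base)
```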
Thank you for your attention!
Questions please?