EuropeanaTech x AI: Qurator.ai @ Berlin State Library
May 26, 2021
The EuropeanaTech Community and Europeana Foundation are delighted to introduce a new webinar series to explore the opportunities and challenges of working with Artificial Intelligence in the cultural heritage and arts sector.
1. AI for digitized cultural heritage
Qurator.ai @ Berlin State Library
Clemens Neudecker (@cneudecker)
EuropeanaTech x AI webinar
21 May 2021
2. Berlin State Library (SBB)
● Established 1661 in Berlin (Brandenburg-Prussia)
● Largest research library in Germany
(25M media objects, 2.5 PB of digital data storage)
● Forms part of the larger LAM legal entity
Prussian Cultural Heritage Foundation (SPK)
● https://staatsbibliothek-berlin.de/
● In-house Digitization Center since 2007
○ ~80 concurrent digitization projects
○ ~2M scanned images annual production
● Digital collections give access to ~185k digitized documents
(mostly Public Domain)
● https://digital.staatsbibliothek-berlin.de/
3. Qurator.ai @ SBB
● SBB responsible for sub-project 10: “AI for digitized cultural heritage”
● Main goal: improve the quality and efficiency of (document) digitization
● Full recognition and enrichment
pipeline for digitized documents
● Development of open source tools
https://github.com/qurator-spk
● Publication of open datasets
https://zenodo.org/communities/stabi
● Releases of trained models
https://qurator-data.de/
● Showcases (only available in German)
https://qurator.ai/innovationlab/staatsbibliothek-zu-berlin/
4. Image Preprocessing: Binarization
● Binarization (i.e. the conversion of colour/greyscale images to black and white pixels) can be used to
increase the contrast between background (paper) and foreground (ink) and to remove defects, noise,
etc., which improves subsequent processing steps
● OCR engines require binarized images for recognition
● Training of autoencoder model for document image binarization
https://github.com/qurator-spk/sbb_binarization
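To make the idea of binarization concrete, here is a minimal sketch using Otsu's classic global threshold. This is a deliberately simple stand-in and not the trained autoencoder model from sbb_binarization; the function names and the list-of-lists image representation are assumptions made for illustration only.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 8-bit grey values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of grey values at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no valid split here
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image):
    """Binarize a 2D list of grey values: 255 above the threshold, else 0."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


# Toy "page": dark ink (30) on bright paper (220).
page = [[30, 30, 220], [220, 30, 220]]
print(binarize(page))  # [[0, 0, 255], [255, 0, 255]]
```

A global threshold like this tends to struggle with stains, bleed-through and uneven lighting, which is one motivation for learned, pixelwise approaches.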
5. Document Image Analysis
● High-quality analysis of document layout is key for all subsequent tasks
● Training of multiple ResNet50-U-Net models for pixelwise segmentation
● 1st iteration (“pure” ML)
○ some problems with headings,
drop capitals, reading order
● 2nd iteration (“hybrid”)
○ additional heuristics deliver
improvements for textlines
and reading order detection
https://github.com/qurator-spk/eynollah
(Figure panels: text regions, text lines)
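As an illustration of what a reading-order heuristic on top of detected text regions could look like, the sketch below groups region bounding boxes into columns and then reads each column top to bottom. It is a hypothetical example under simple assumptions (axis-aligned boxes, left-to-right columns); it is not the heuristic implemented in eynollah.

```python
def reading_order(boxes):
    """Order (x0, y0, x1, y1) boxes column by column, top to bottom.

    A box joins an existing column when more than half of its width
    overlaps that column's horizontal extent; otherwise it opens a new one.
    """
    columns = []
    for box in sorted(boxes, key=lambda b: b[0]):
        for col in columns:
            col_x0 = min(b[0] for b in col)
            col_x1 = max(b[2] for b in col)
            overlap = min(box[2], col_x1) - max(box[0], col_x0)
            if overlap > 0.5 * (box[2] - box[0]):
                col.append(box)
                break
        else:
            columns.append([box])
    # Columns left to right, boxes within a column top to bottom.
    columns.sort(key=lambda col: min(b[0] for b in col))
    return [b for col in columns for b in sorted(col, key=lambda b: b[1])]


# Two-column toy page, regions given in arbitrary order.
regions = [(120, 90, 220, 150), (0, 60, 100, 120), (120, 0, 220, 80), (0, 0, 100, 50)]
for region in reading_order(regions):
    print(region)
```

Headings that span several columns or drop capitals would break this simple rule, which hints at why such cases are hard for layout analysis.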
6. Image (Similarity) Search
● Document layout analysis provides (pixel coordinate) information about image content contained in
the digitized documents
● Extraction (and release) of ~600k graphical elements from document images
● Training an image classification
model on the basis of ImageNet
● ROI within image using YOLO v3
● Approximate nearest neighbour
search for similar images
● Alternative search and browse
entry to digitised collections
https://github.com/qurator-spk/sbb_images
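The similarity search step can be sketched as a nearest-neighbour lookup over image feature vectors. The example below uses exact brute-force cosine similarity, not an approximate index as mentioned on the slide, and the three-dimensional vectors and image names are made up for illustration; real feature vectors would come from the classification model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, index, k=3):
    """Return the names of the k indexed vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical feature vectors for three extracted illustrations.
index = {
    "portrait_a": [0.9, 0.1, 0.0],
    "portrait_b": [0.8, 0.2, 0.1],
    "map": [0.0, 0.1, 0.9],
}
print(most_similar([1.0, 0.1, 0.0], index, k=2))  # ['portrait_a', 'portrait_b']
```

Brute force compares the query against every vector, so for hundreds of thousands of images an approximate nearest-neighbour index trades a little accuracy for much faster lookups.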
7. OCR / Text Recognition
● Traditionally, OCR for historical documents is hard
(Fraktur fonts, complex layouts, defects and
damages, historical spelling)
● Thanks to deep learning for OCR (Calamari) and
public ground truth (GT) datasets (GT4HistOCR),
nearly error-free OCR is now possible!
● A single (language-independent) OCR model can be
applied to both Fraktur and Antiqua (also mixed)
● Initial evaluations show reductions of
Character-Error-Rate from ~20% to ~2%
https://github.com/qurator-spk/ocrd_calamari
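For readers unfamiliar with the metric, the Character Error Rate (CER) is commonly computed as the edit distance between OCR output and ground truth, divided by the length of the ground truth. The sketch below follows that common definition; the evaluation tooling actually used may differ in details such as text normalization.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of an OCR hypothesis against its ground truth."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character out of 16.
print(cer("Staatsbibliothek", "Staatsbibliothel"))  # 0.0625
```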
8. OCR Postcorrection
● Even with highly accurate OCR, there remain a few recognition errors
● Idea: train a machine translation model to “translate” OCR errors to correct words
● Challenges:
○ retain historical spelling variants
○ avoid introducing new errors
● Two-step model (seq2seq LSTM):
○ First, detect the parts of text with errors
(this helps artificially increase the error
density in the input for step two)
○ Second, translate (i.e. correct) the errors in the OCR text
● Relative OCR accuracy improvement: 18%
https://github.com/qurator-spk/sbb_ocr_postcorrection
9. Named Entity Recognition
● Named Entity Recognition (NER) is used to identify proper names of persons, locations,
organizations in unstructured text (here: OCR results)
● Unsupervised pre-training of a BERT model on the digitized historical documents
● Supervised training of the BERT model for NER with labeled data for German NER
● Results are state of the art, with an F1 score of 85.6%
https://github.com/qurator-spk/sbb_ner
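Token-level NER models usually emit one tag per token, which then has to be merged into entity spans. The sketch below assumes the widespread BIO tagging scheme (B-PER, I-PER, O, ...); whether sbb_ner uses exactly this scheme is an assumption here, and the example sentence and tags are made up.

```python
def bio_to_entities(tokens, tags):
    """Merge per-token BIO tags into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the open entity
        else:
            flush()                       # "O" closes any open entity
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["Clemens", "Neudecker", "works", "in", "Berlin"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_entities(tokens, tags))
# [('Clemens Neudecker', 'PER'), ('Berlin', 'LOC')]
```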
10. Named Entity Disambiguation and Linking
● Entities recognized by NER can be ambiguous
● Example: “Paris is in France”
○ Paris the city or Paris (Hilton) the person?
● Necessary to determine the correct entity by context
● Establishing a knowledge base for comparison based on Wikidata/Wikipedia
(harvesting of all articles for the corresponding categories)
● Training of a “context-comparison” BERT embeddings model that decides for a given entity
in the OCR text whether it is similar to a Wikipedia lemma
● Enrichment of the OCR text with links to Wikidata IDs and geo-coordinates for toponyms
https://github.com/qurator-spk/sbb_ned
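The principle of disambiguation by context can be illustrated with a toy example: score each knowledge-base candidate by how much its description overlaps with the sentence around the mention. Word overlap (Jaccard similarity) is used here purely as a simple stand-in for the BERT-based context comparison, and the candidate descriptions are invented for the example rather than taken from Wikipedia.

```python
def tokenize(text):
    """Lower-cased set of words with surrounding punctuation stripped."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return words - {""}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(context, candidates):
    """Pick the candidate whose description best matches the context."""
    ctx = tokenize(context)
    return max(candidates, key=lambda name: jaccard(ctx, tokenize(candidates[name])))


# Invented one-line descriptions standing in for knowledge-base entries.
candidates = {
    "Paris (city)": "Paris is the capital and largest city of France",
    "Paris Hilton (person)": "Paris Hilton is an American media personality and businesswoman",
}
print(disambiguate("Paris is in France", candidates))  # Paris (city)
```

In a real pipeline the winning candidate would then be mapped to its Wikidata ID, which is what allows links and geo-coordinates to be attached to the OCR text.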
11. Data Annotation
● neat (named entity annotation tool) for data annotation (and OCR correction)
● Simple, browser-based JavaScript tool
(no installation or rights required)
● TSV (tab-separated-values)
as internal working format
● Embeds image snippets
via IIIF Image API to aid with annotation
● Due to popular demand (i.e. Covid-19),
neat can now also be used for OCR correction
or transcription (e.g. to create GT)
https://github.com/qurator-spk/neat
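To show how a TSV working format and IIIF image snippets fit together, the sketch below reads token rows with pixel coordinates and builds an IIIF Image API request URL for each token's region. The URL pattern (identifier/region/size/rotation/quality.format) follows the IIIF Image API; the column names, the base URL and the image identifier are hypothetical and do not necessarily match the files neat actually uses.

```python
import csv
import io
from urllib.parse import quote


def iiif_region_url(base, identifier, x, y, w, h, size="max"):
    """Build an IIIF Image API URL for a pixel region of an image.

    size="max" is the IIIF Image API 3.0 keyword; 2.x servers use "full".
    """
    ident = quote(identifier, safe="")  # identifiers must be URL-encoded
    return f"{base}/{ident}/{x},{y},{w},{h}/{size}/0/default.jpg"


def snippet_urls(tsv_text, base, identifier):
    """Return (token, snippet URL) pairs for every row of the TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["TOKEN"], iiif_region_url(
            base, identifier,
            int(row["left"]), int(row["top"]),
            int(row["width"]), int(row["height"]),
        ))
        for row in reader
    ]


# Hypothetical TSV content: one token with its tag and bounding box.
SAMPLE_TSV = (
    "TOKEN\tNE-TAG\tleft\ttop\twidth\theight\n"
    "Berlin\tB-LOC\t120\t340\t210\t48\n"
)
for token, url in snippet_urls(SAMPLE_TSV, "https://example.org/iiif", "page-0001"):
    print(token, url)
# Berlin https://example.org/iiif/page-0001/120,340,210,48/max/0/default.jpg
```

Because the snippet is fetched by URL, the annotation tool itself does not need to store or crop any images.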
12. Future Work
● Processing all the digitized documents in SBB with the Qurator pipeline would yield greatly
improved data for extending this work and for training better models
● But AI/ML is computationally demanding: with our current server (36 CPU cores, 2x V100,
192 GiB RAM) this would take years. What can we do to increase throughput without sacrificing
performance?
● Methods that combine computer vision
(document image analysis) and natural
language processing (OCR text content)
features promise further improvements
● Extending current developments to other
languages and scripts (esp. Asian) and layouts (e.g. right-to-left, vertical)
● Provision of interactive demos in our SBB LAB https://lab.sbb.berlin/