EuropeanaTech x AI: Qurator.ai @ Berlin State Library

The EuropeanaTech Community and Europeana Foundation are delighted to introduce a new webinar series to explore the opportunities and challenges of working with Artificial Intelligence in the cultural heritage and arts sector.


  1. AI for digitized cultural heritage: Qurator.ai @ Berlin State Library. Clemens Neudecker (@cneudecker). EuropeanaTech x AI webinar, 21 May 2021
  2. Berlin State Library (SBB)
     ● Established 1661 in Berlin (Kingdom of Prussia)
     ● Largest research library in Germany (25M media objects, 2.5 petabytes of digital data storage)
     ● Forms part of the larger LAM legal entity Prussian Cultural Heritage Foundation (SPK)
     ● https://staatsbibliothek-berlin.de/
     ● In-house Digitization Center since 2007
       ○ ~80 concurrent digitization projects
       ○ ~2M scanned images annual production
     ● Digital collections give access to ~185k digitized documents (mostly Public Domain)
     ● https://digital.staatsbibliothek-berlin.de/
  3. Qurator.ai @ SBB
     ● SBB responsible for sub-project 10: “AI for digitized cultural heritage”
     ● Main goal: improve the quality and efficiency of (document) digitization
     ● Full recognition and enrichment pipeline for digitized documents
     ● Development of open source tools https://github.com/qurator-spk
     ● Publication of open datasets https://zenodo.org/communities/stabi
     ● Releases of trained models https://qurator-data.de/
     ● Showcases (only available in German) https://qurator.ai/innovationlab/staatsbibliothek-zu-berlin/
  4. Image Preprocessing: Binarization
     ● Binarization (i.e. the conversion of colour/greyscale images to black or white pixels) can be used to increase the contrast between background (paper) and foreground (ink) and to remove defects, noise etc., which improves subsequent processes
     ● OCR engines require binarized images for recognition
     ● Training of autoencoder model for document image binarization
     ● https://github.com/qurator-spk/sbb_binarization
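The learned autoencoder in sbb_binarization is what the project actually uses; purely to illustrate what binarization means, a classical global Otsu threshold can be sketched in numpy. This baseline is not the Qurator model and copes far worse with stains, bleed-through and uneven lighting:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Global threshold maximizing between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                    # pixel count below each level
    cum_mu = np.cumsum(hist * np.arange(256))  # intensity mass below each level
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = cum_w[t - 1], total - cum_w[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mu[t - 1] / w0
        mu1 = (cum_mu[-1] - cum_mu[t - 1]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(gray: np.ndarray) -> np.ndarray:
    """Map every pixel to pure ink (0) or pure paper (255)."""
    return np.where(gray >= otsu_threshold(gray), 255, 0).astype(np.uint8)
```

A single global threshold fails exactly on the degraded historical material the slides describe, which is why a trained model is worth the effort.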
  5. Document Image Analysis
     ● High-quality analysis of document layout is key for all subsequent tasks
     ● Training of multiple ResNet50-U-Net models for pixelwise segmentation
     ● 1st iteration (“pure” ML)
       ○ some problems with headings, drop capitals, reading order
     ● 2nd iteration (“hybrid”)
       ○ additional heuristics deliver improvements for text line and reading order detection
     ● https://github.com/qurator-spk/eynollah
     (Slide figures: detected text regions and text lines)
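eynollah's heuristics are considerably more involved; as a minimal sketch of the kind of post-processing that turns a pixelwise class mask into discrete regions, connected-component grouping over a binary mask could look as follows. The 4-connectivity and the bounding-box output format are illustrative choices here, not eynollah internals:

```python
import numpy as np

def connected_regions(mask: np.ndarray):
    """Group 4-connected foreground pixels of a class mask into region
    bounding boxes (y0, x0, y1, x1), via an explicit-stack flood fill."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                stack = [(y, x)]
                seen[y, x] = True
                y0 = y1 = y
                x0 = x1 = x
                while stack:
                    cy, cx = stack.pop()
                    y0, y1 = min(y0, cy), max(y1, cy)
                    x0, x1 = min(x0, cx), max(x1, cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                boxes.append((y0, x0, y1, x1))
    return boxes
```

Running one such pass per predicted class (text region, heading, image, separator) yields candidate regions that the hybrid heuristics can then order and refine.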
  6. Image (Similarity) Search
     ● Document layout analysis provides (pixel coordinate) information about image content contained in the digitized documents
     ● Extraction (and release) of ~600k graphical elements from document images
     ● Training an image classification model on the basis of ImageNet
     ● ROI within image using YOLO v3
     ● Approximate nearest neighbour search for similar images
     ● Alternative search and browse entry to digitised collections
     ● https://github.com/qurator-spk/sbb_images
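The underlying idea of the similarity search can be shown with an exact brute-force cosine search over a small feature matrix; at the ~600k-vector scale mentioned above, approximate nearest-neighbour indexes trade this exactness for speed. A sketch, not the sbb_images implementation:

```python
import numpy as np

def nearest_images(query: np.ndarray, index: np.ndarray, k: int = 3):
    """Return indices of the k feature vectors most similar to the query
    by cosine similarity (exact, brute-force)."""
    q = query / np.linalg.norm(query)
    db = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every row
    return np.argsort(-sims)[:k]      # best-first indices
```

The feature vectors would come from the trained classification model's penultimate layer; any fixed-size embedding works for the sketch.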
  7. OCR / Text Recognition
     ● Traditionally, OCR for historical documents is hard (Fraktur fonts, complex layouts, defects and damages, historical spelling)
     ● Thanks to deep learning for OCR (Calamari) and public GT datasets (GT4HistOCR), nearly error-free OCR is now possible!
     ● A single (language-independent) OCR model can be applied to both Fraktur and Antiqua (also mixed)
     ● Initial evaluations show reductions of the Character Error Rate from ~20% to ~2%
     ● https://github.com/qurator-spk/ocrd_calamari
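The Character Error Rate quoted above is the edit distance between OCR output and ground truth, normalized by the ground-truth length. A self-contained implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(ocr: str, gt: str) -> float:
    """Character Error Rate: edit distance normalized by ground-truth length."""
    return levenshtein(ocr, gt) / max(len(gt), 1)
```

For example, cer("Bcrlin", "Berlin") is one substitution over six characters, roughly 0.17, i.e. ~17% CER.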
  8. OCR Postcorrection
     ● Even with highly accurate OCR, there remain a few recognition errors
     ● Idea: train a machine translation model to “translate” OCR errors to correct words
     ● Challenges:
       ○ retain historical spelling variants
       ○ avoid introducing new errors
     ● Two-step model (seq2seq LSTM):
       ○ First, detect the parts of text with errors (this helps artificially increase the error density in the input for step two)
       ○ Translate (i.e. correct) errors in the OCR text
     ● Relative OCR accuracy improvement: 18%
     ● https://github.com/qurator-spk/sbb_ocr_postcorrection
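The actual post-correction is the two-step seq2seq LSTM described above; as a toy illustration of the "translate systematic confusions back" idea, and of the guard against introducing new errors, a lexicon-checked substitution table can be sketched. The confusion pairs and lexicon below are invented for the example and are not from the Qurator model:

```python
# Hypothetical OCR confusion pairs (wrong reading -> intended characters).
CONFUSIONS = {"rn": "m", "I1": "ll", "0": "o"}

def postcorrect(text: str, lexicon: set) -> str:
    """Apply a confusion substitution only when it produces a word found in
    the lexicon -- a crude guard against introducing new errors, echoing
    the 'avoid new errors' challenge on the slide."""
    words = []
    for w in text.split():
        if w not in lexicon:
            for wrong, right in CONFUSIONS.items():
                cand = w.replace(wrong, right)
                if cand in lexicon:
                    w = cand
                    break
        words.append(w)
    return " ".join(words)
```

A lookup table obviously cannot preserve historical spelling variants in context, which is precisely why a learned, context-aware translation model is used instead.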
  9. Named Entity Recognition
     ● Named Entity Recognition (NER) is used to identify proper names of persons, locations, organizations in unstructured text (here: OCR results)
     ● Unsupervised pre-training of BERT model on the digitized historical documents
     ● Supervised training of BERT model for NER with labeled data for German NER
     ● Results are state of the art with f1 score of 85.6%
     ● https://github.com/qurator-spk/sbb_ner
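The tagger emits token-level labels; assembling these into entity spans is conventionally done by decoding the BIO scheme. A sketch, assuming B-/I-/O prefixed tags as in standard German NER datasets (the example tokens below are invented):

```python
def decode_bio(tokens, tags):
    """Turn BIO tags (B-PER, I-PER, O, ...) into (entity_text, type) spans."""
    entities, cur_toks, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):               # a new entity begins
            if cur_toks:
                entities.append((" ".join(cur_toks), cur_type))
            cur_toks, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_toks.append(tok)               # entity continues
        else:                                  # O tag or inconsistent I- tag
            if cur_toks:
                entities.append((" ".join(cur_toks), cur_type))
            cur_toks, cur_type = [], None
    if cur_toks:
        entities.append((" ".join(cur_toks), cur_type))
    return entities
```

For instance, the tokens "Wilhelm von Humboldt besuchte Berlin" tagged B-PER I-PER I-PER O B-LOC decode to one person and one location span.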
  10. Named Entity Disambiguation and Linking
     ● Entities recognized by NER can be ambiguous
     ● Example: “Paris is in France” - Paris the city or Paris (Hilton) the person?
     ● Necessary to determine the correct entity by context
     ● Establishing a knowledge base for comparison based on Wikidata/Wikipedia (harvesting of all articles for the corresponding categories)
     ● Training of a “context-comparison” BERT embeddings model that decides for a given entity in the OCR text whether it is similar to a Wikipedia lemma
     ● Enrichment of the OCR text with links to Wikidata IDs and geo-coordinates for toponyms
     ● https://github.com/qurator-spk/sbb_ned
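The context-comparison step can be pictured as choosing, among the candidate knowledge-base entries for a mention, the one whose context representation is closest to the mention's context. Here with plain cosine similarity over toy vectors; the real model compares BERT embeddings, and the candidate keys below are made-up placeholders, not Wikidata IDs:

```python
import numpy as np

def disambiguate(mention_ctx: np.ndarray, candidates: dict) -> str:
    """Return the candidate key whose context vector has the highest
    cosine similarity to the mention's context vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda key: cos(mention_ctx, candidates[key]))
```

Once the winning candidate is known, the OCR text can be enriched with its Wikidata ID and, for toponyms, geo-coordinates.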
  11. Data Annotation
     ● neat (named entity annotation tool) for data annotation (and OCR correction)
     ● Simple, browser-based JavaScript tool (no installation or rights required)
     ● TSV (tab-separated values) as internal working format
     ● Embeds image snippets via IIIF Image API to aid with annotation
     ● Due to popular demand (i.e. Covid-19), neat can now also be used for OCR correction or transcription (e.g. to create GT)
     ● https://github.com/qurator-spk/neat
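A minimal reader for such a token-per-line TSV working file can be sketched as follows. The two-column (TOKEN, TAG) layout is a simplification for this sketch; neat's actual format carries further columns (entity links, image coordinates for the IIIF snippets):

```python
import csv
import io

def read_annotations(tsv_text: str):
    """Parse a simplified token-per-line TSV into (token, tag) pairs."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [(token, tag) for token, tag in rows]
```

Because the format is plain TSV, annotated files remain trivially diffable and editable outside the tool as well.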
  12. Future Work
     ● Processing all the digitized documents in SBB with the Qurator pipeline would give us some greatly improved data to extend this work, and for training better models
     ● But AI/ML is quite demanding on computation: with our current server (36 CPU cores, 2x V100, 192 GiB RAM) this would take years. What can we do to increase throughput without sacrificing performance?
     ● Methods that combine computer vision (document image analysis) and natural language processing (OCR text content) features promise further improvements
     ● Extending current developments to other languages and scripts (esp. Asian) and layouts (e.g. right-to-left, vertical)
     ● Provision of interactive demos in our SBB LAB https://lab.sbb.berlin/
  13. Thank you for your attention! Questions?
