AI for digitized cultural heritage

AI-based Digital Curation Technologies for Cultural Heritage;
Qurator2020 Conference, 20 January 2020, Berlin, Germany.

  1. 1. AI for digitized cultural heritage Clemens Neudecker (@cneudecker) Staatsbibliothek zu Berlin – Preußischer Kulturbesitz #QURATOR2020 – Conference on Digital Curation Technologies 20 January 2020, Fraunhofer FOKUS, Berlin qurator@sbb.spk-berlin.de
  2. 2. Table of contents ● Introduction ● Challenges & Goals ● Document Layout Analysis ● Optical Character Recognition ● Named Entity Recognition
  3. 3. Background • Staatsbibliothek zu Berlin – Preußischer Kulturbesitz (Berlin State Library, SBB) • Established 1661 • Largest research library in Germany • Over 12m volumes, 23m objects total • Legal deposit since 1699 • https://staatsbibliothek-berlin.de/en/
  4. 4. Digitization @ SBB • Since 2007: in-house Digitization Center • Approx. 1.7M images annual production • Up to 80 concurrent digitization projects • >20 diverse bookscanners, scanrobots, etc. • Operation in two shifts with 24 operators • Digitisation-on-demand service • KITODO open source workflow management software
  5. 5. Data • Digitized Collections • https://digital.staatsbibliothek-berlin.de/ • ca. 165,000 documents • ca. 5M pages with OCR fulltext • Digitized Newspapers (ZEFYS) • http://zefys.staatsbibliothek-berlin.de/ • ca. 7M pages digitized • ca. 3M pages with OCR fulltext • Special subject databases, catalogues, datasets etc. • Public Domain license up to 1920 (exceptions apply) • ca. 2.5 petabytes
  6. 6. Qurator @ SBB • Topic: „Automated curation technologies for digitized cultural heritage“ • Team: • 3x data scientist = 108 person-months (PM) • 2x manager = 12 PM • ML server: • 2x Nvidia Tesla V100 32GB • 2x 18-core Intel Xeon 2.7 GHz • 192GB DDR4 RAM • Open Source development • https://github.com/qurator-spk • Open datasets • https://zenodo.org/communities/stabi • Trained models • https://qurator-data.de/ (Image: https://xkcd.com/1838/)
  7. 7. Challenges & Goals
  8. 8. Historical documents
  9. 9. Historical language • Spelling variation • Special characters • Long s ſ • Umlauts • Ligatures æ, st, fi, … • Hyphens ⸗ • Special chars ↄ, … • Symbols ☞, ❧, ∴, …
  10. 10. Users want this (and more) • Keyword search in digitized collections • Filters to in-/exclude document regions (e.g. running titles, footnotes) • Query expansion for historical spelling variants („Teil“ → „Theyl“) • Search for named entities in digitized collections • Linking of named entities to authority files, geocoordinates • Digital Humanities • Text- and data mining (Oceanic Exchanges) • Extraction of historical social networks from documents (SoNAR-IDH) • Query by image, image similarity search
  11. 11. Document Layout Analysis
  12. 12. Layout analysis • Image pre-processing • Deskewing, Dewarping, Binarization • Pixelwise segmentation • Text vs. non-text regions (Images, Tables, Separators etc.) • Textline detection
  13. 13. Binarization • Implemented as pixel-labeling • Ground Truth: use results from previous ICDAR binarization competitions (DIBCO) • Combination of 4 models • Optimized for printed text • No denoising/despeckling (yet) (Figure: source image, binarized image)
  14. 14. Segmentation • Training a CNN (ResNet50/U-Net) for pixel labeling • Distinguish up to 16 different classes • columns, paragraphs, separators • headlines, footnotes, marginalia • tables, graphics, formulas • etc. (Figure: source image, segmentation result)
  15. 15. Textline detection • Detect all textlines in the image and extract their bounding boxes
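The slide does not say how the bounding boxes are computed. As a minimal sketch, assuming the segmentation step already yields a pixel mask in which each textline carries its own integer label (an assumption, not something the slides state), the boxes can be read off as follows. The function name `bounding_boxes` and the toy mask are illustrative only.

```python
def bounding_boxes(label_mask):
    """Return {label: (x_min, y_min, x_max, y_max)} with inclusive coordinates.

    label_mask is a 2D list of ints: 0 is background, n > 0 marks
    pixels that belong to textline n (hypothetical input format).
    """
    boxes = {}
    for y, row in enumerate(label_mask):
        for x, label in enumerate(row):
            if label == 0:
                continue  # background pixel
            if label not in boxes:
                boxes[label] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[label]
                boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


# Toy mask with two textlines (labels 1 and 2).
mask = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [2, 2, 2, 2, 0],
]
boxes = bounding_boxes(mask)
print(boxes)
```

On real page images a library routine for connected components would likely replace the nested loops; the plain-Python version only illustrates the idea.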
  16. 16. Reading order • Current approach: purely based on segmentation (optical features) • Future plans: hybrid approach combining optical features with language features (apply a transformer, e.g. BERT, to determine the correct sequence of regions by semantics)
  17. 17. Optical Character Recognition
  18. 18. A modern OCR-Workflow • Binarization → Textline segmentation → OCR → Postcorrection • Example: Acten-mäßiger Verlauff, Des Fameusen Processus sich verhaltende ... (1749)
      OCR output: 20 – rath mit einer Pœna fiſcali angeſehen worden, und ſolche durch des Hon. Graffen von Königsfeld Vor– ſpruch, nur aus Gnaden nachgelaſſen erhalten. Sondern man hat auich dieſen 4. Wochen lang alle Abend bey der Jnquißtin gantz allein gelaſſen Binnen welcher gantzer Zeit der Schreiber Bredekam beſtändig bey Jhme geweſen, und ſich in der am 13ten Octobt. a.c. in Judicio gegen ſeinen geweſenen Hrn. introducirter Appellation deſſen Bey- raths bedienet hat; 33) Dabenehenſt iſt der Schreiber binnen dieſer gantzen Zeit auf freyem Fuß geblieben, und hat nicht nur durch ſeinen Conlulenten, ſondern auch, weilen del lnquilti ſelbſten in Jhtem Gefängnüß ſo viele Freyheit gelaſſen worden, daß ſie frembden Beſuch von Jhren Anberwandten ohngehindert en– pfangen können, durch andere Perſonen ſich mit ihr über alles, Was Er oder ſie dereinſten zu ſagen hat– ten· vereinigen können, immaſſen der Hofrath [...]
      After post-correction: 20 rath mit einer Pœna fiſcali angeſehen worden, und ſolche durch des Hrn. Graffen von Königsfeld Vor– ſpruch, nur aus Gnaden nachgelaſſen erhalten. Sondern man hat auch dieſen 4. Wochen lang alle Abend bey der Jnquisitin gantz allein gelaſſen. Binnen welcher gantzer Zeit der Schreiber Bredekaw beſtändig bey Jhme geweſen, und ſich in der am 13 ten Octobr. a.c. in Judicio gegen ſeinen geweſenen Hrn. introducirter Appellation deſſen Bey- raths bedienet hat; 33) Dabenebenſt iſt der Schreiber binnen dieſer gantzen Zeit auf freyem Fuß geblieben, und hat nicht nur durch ſeinen Conſulenten, ſondern auch, weilen der Inquiſitin ſelbſten in Jhrem Gefängnüß ſo viele Freyheit gelaſſen worden, daß ſie frembden Beſuch von Jhren Anverwandten ohngehindert em– pfangen können, durch andere Perſonen ſich mit ihr über alles, Was Er oder ſie dereinſten zu ſagen hat– ten, vereinigen können, immaſſen der Hofrath [...]
  19. 19. • Convolutional Layer (Pixel → Feature Maps): learns features: curves, edges, shapes etc. • Recurrent Layer (Feature Maps → Probability Matrix): learns characters in sliding windows + context • Connectionist Temporal Classification Layer (Probability Matrix → Labels): learns most probable text output
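To make the last step (Probability Matrix → Labels) concrete, here is a minimal sketch of greedy CTC decoding: take the most probable class per time step, collapse consecutive repeats, and drop the blank symbol. This is one common, simple decoding strategy and is not necessarily what the OCR engine in the talk uses. The function name, the two-character alphabet and the probabilities are made up for illustration.

```python
def ctc_greedy_decode(prob_matrix, alphabet, blank=0):
    """Greedy CTC decoding.

    prob_matrix: list of per-time-step probability lists (one entry per class).
    alphabet: dict mapping class index -> character; `blank` is the CTC blank index.
    """
    # 1) Most probable class at every time step.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in prob_matrix]

    # 2) Collapse repeats, 3) remove blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Hypothetical toy input: index 0 = blank, 1 = "a", 2 = "b".
alphabet = {1: "a", 2: "b"}
probs = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a again, collapsed with the previous step
    [0.90, 0.05, 0.05],  # blank separates two genuine "a"s
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
]
decoded_text = ctc_greedy_decode(probs, alphabet)
print(decoded_text)  # aab
```

The blank symbol is what allows a doubled letter such as "aa" to survive the collapsing of repeats.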
  20. 20. Optical Character Recognition Models • Standard models in Tesseract OCR • Not reproducible • Encoding issues • ch- and ck-ligatures encoded as <, > • no long s (ſ) for Antiqua • no superscript e (aᵉ, uᵉ, etc.) • Our model for Calamari OCR • Reproducible • Based on the GT4HistOCR dataset¹ • Incunabula, Fraktur, early Antiqua • 300,000 textlines • 1 week training on Nvidia RTX 2080 ¹GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin – Springmann et al.
  21. 21. Voting of multiple OCR models • Instead of a single model, k equally strong models are trained • k-fold cross-validation • Models vote and agree on a common recognition result • Sum of model confidences • Example, uncertain character in „Beyſp?el.“: model 1: i 0.8, l 0.2, j 0.0; model 2: i 0.4, l 0.5, j 0.1; model 3: i 0.3, l 0.4, j 0.3; Σ for „i“ = 1.5, the highest sum, so „i“ is chosen
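The confidence-sum vote from this slide can be sketched in a few lines. The numbers are the ones shown on the slide; the function name `vote` and the per-character data layout are assumptions for illustration, and a real engine would first have to align the outputs of the individual models before voting per position.

```python
from collections import defaultdict


def vote(per_model_confidences):
    """Sum each candidate's confidence over all models and return the winner."""
    totals = defaultdict(float)
    for confidences in per_model_confidences:
        for char, confidence in confidences.items():
            totals[char] += confidence
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


# Confidences of three models for the uncertain character in "Beyſp?el."
models = [
    {"i": 0.8, "l": 0.2, "j": 0.0},
    {"i": 0.4, "l": 0.5, "j": 0.1},
    {"i": 0.3, "l": 0.4, "j": 0.3},
]
winner, totals = vote(models)
print(winner, round(totals[winner], 2))  # i 1.5
```

Note that a simple majority vote over each model's top choice would pick "l" here (two of three models prefer it), whereas summing confidences picks "i".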
  22. 22. Named Entity Recognition
  23. 23. Named Entity Recognition ● Information extraction from a given text ● Identification and classification of named entities such as: ● Persons ● Locations ● Organisations ● Products ● Events ● etc. Example document: Vorwort von Alexander v. Humboldt zu den "Erinnerungen der Reise nach Indien von S. K. H. dem Prinzen Waldemar von Preussen" : [Berlin, den 18 December 1854]
  24. 24. BERT - Pretraining Google: ● BERT-base: 110M parameters ● 100 languages ● 100 largest Wikipedias ● 16x Google Tensor Processing Units with 64GB VRAM each ● Processing time ca. 4 days SBB: ● Starting from Google Model ● 2,333,647 German-language pages (OCR) from the SBB digitized collections ● 1x NVIDIA V100 GPU with 32GB VRAM ● 10 epochs ● Processing time ca. 2 weeks
  25. 25. NER Training - Ground Truth ● CoNLL 2003 corpus (ca. 200,000 tokens) ● GermEval Konvens 2014 corpus (ca. 450,000 tokens) ● Historical newspapers (Europeana Newspapers): ○ Newspapers from 1926 (Landesbibliothek Dr. Friedrich Teßmann, ca. 70,000 tokens, LFT) ○ Newspapers from 1710 - 1873 (Austrian National Library, ca. 30,000 tokens, ONB) ○ Newspapers from 1872 - 1930 (Staatsbibliothek zu Berlin, ca. 50,000 tokens, SBB) → F1 score: 84.3% ± 1.1% (5-fold cross-validation) Kai Labusch, Clemens Neudecker and David Zellhöfer: BERT for Named Entity Recognition in Contemporary and Historic German, KONVENS 2019.
  26. 26. Named Entity Disambiguation and Linking • Disambiguation and Linking of named entities to an authority file/knowledge base (Wikidata, GND, Geonames) • Initial approach using embeddings (Fasttext & Flair & BERT) with nearest neighbour search (Image: CC BY-SA 4.0 Aparravi)
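The nearest neighbour search mentioned on this slide can be illustrated with a small sketch using cosine similarity. The slide does not specify the distance measure or index structure, so cosine similarity, the function names, the three-dimensional toy vectors and the placeholder entity ids below are all assumptions; real embeddings from Fasttext, Flair or BERT have hundreds of dimensions and would normally be searched with an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_candidates(mention_vector, knowledge_base, k=1):
    """Return the ids of the k knowledge-base entries most similar to the mention."""
    ranked = sorted(
        knowledge_base.items(),
        key=lambda item: cosine_similarity(mention_vector, item[1]),
        reverse=True,
    )
    return [entity_id for entity_id, _ in ranked[:k]]


# Hypothetical toy embeddings; the ids are placeholders, not real authority records.
knowledge_base = {
    "entity_A": [1.0, 0.0, 0.0],
    "entity_B": [0.0, 1.0, 0.0],
    "entity_C": [0.7, 0.7, 0.0],
}
mention = [0.9, 0.1, 0.0]
candidates = nearest_candidates(mention, knowledge_base, k=2)
print(candidates)
```

The ranked candidates would then be passed to a disambiguation step that picks the final link.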
  27. 27. Thank you for your attention! Questions please? Clemens Neudecker (@cneudecker) Staatsbibliothek zu Berlin – Preußischer Kulturbesitz #QURATOR2020 – Conference on Digital Curation Technologies 20 January 2020, Fraunhofer FOKUS, Berlin qurator@sbb.spk-berlin.de
