
Multimodal Perspectives for Digitised Historical Newspapers

What's Past is Prologue: The NewsEye International Conference, 16-17 March 2021


  1. 1. Multimodal Perspectives for Digitised Historical Newspapers • Clemens Neudecker (@cneudecker), Berlin State Library • What's Past is Prologue: The NewsEye International Conference, 16-17 March 2021
  2. 2. Introduction • about:me • Research Advisor („Forschungsreferent“) • Managing a team of 5 devs (project based) • ~ 20 years of applied R&D in document analysis • motivation: all cultural heritage available digitally, online and free to use for everyone • content of this talk • key research & development projects • main challenges with digitised newspapers • outlook / vision
  3. 3. SBB-PK • Staatsbibliothek zu Berlin – Stiftung Preußischer Kulturbesitz (SBB-PK) https://staatsbibliothek-berlin.de • Largest research library in Germany • A library at two sites: (former) East & West Berlin • Part of the larger cultural heritage organization SPK • In-house digitization center, Kitodo workflows • Digital collections https://digital.staatsbibliothek-berlin.de • ZEFYS https://zefys.staatsbibliothek-berlin.de • SBB LAB https://lab.sbb.berlin
  4. 4. DDB Newspaper Portal • Users want uniform access and UI for digitised newspaper collections • Key features of a digital newspaper portal • Title list • Calendar • Keyword search • „Advanced features“ • Citation & Persistence • Named Entities • Corpus Building • https://pro.deutsche-digitale-bibliothek.de/deutsches-zeitungsportal
  5. 5. Europeana Newspapers • Originally established in the EU project Europeana Newspapers (2012-2015) as a service of The European Library (TEL) • Approx. 12m newspaper pages with OCR from 12 libraries in > 20 languages • TEL has been discontinued since 2017 • Migration of content to Europeana, work continues to (re-)implement the TEL feature set • https://newspapers.europeana.eu • http://www.europeana-newspapers.eu
  6. 6. OCR-D • Provide the technical and organisational framework for the OCR processing of the German VD digitization initiatives (all documents printed in Germany from 1600 – 1900) • Open and transparent development process: • Specifications & GT Guidelines https://ocr-d.de/en/dev • Open source tools https://github.com/OCR-D • Community https://gitter.im/OCR-D/Lobby • 3 phases: • Phase I (2015 – 2018): Requirements analysis • Phase II (2018 – 2020): Development of prototypes • Phase III (2021 – 2024): Implementation in production • https://ocr-d.de
  7. 7. Qurator – Curation Technologies • Leverage state-of-the-art AI/ML for data and content curation across various domains • Our use case: digitized cultural heritage • Development of a complete pipeline: • Binarization • Layout analysis • OCR • Postcorrection • Named Entity Recognition and Linking • Image Similarity and Search • https://qurator.ai • https://github.com/qurator-spk
  8. 8. SoNAR (IDH) • Examine and evaluate approaches for an advanced research technology environment supporting Historical Network Analysis based on metadata & digitised newspapers • Extracting person names and relations from special databases & digitised newspaper full text • Transforming entities with relations into a historical social network graph • Creating intuitive and innovative visualizations and interfaces for querying and analysing the social network graph • https://sonar.fh-potsdam.de
  9. 9. OCR • Example OCR output: Stolp, Pomm. [56000] Jn unſerem Genoſſenſchaftsregiſter iſt heute unter Nr. 113 die ,,Ländliche Spar⸗ und Darlehnskaſſe Schmaatz, eingetragene Genofſenſchaft mit be⸗ ſchränkter Haftpflicht in Schmaatz“, eingetragen worden. Gegenſtand des Unternehmens iſt die Gewährung von Darlehen an die Mitglieder für ihren Geſchäfts⸗ und Wirtſchaftsbetrieb, Er⸗ leichterung der Geldanlage und Förderung des Sparſinns, nebenbei gemeinſchaftliche Beſchaffung landwirtſchaftlicher Betriebs⸗ mittel. Die Haftſumme beträgt 20 M, die Höchſtzahl der Geſchäftsanteile 100. Vorſtandsmitglieder ſind: der Hofbeſitzer Albert Timreck als Vorſitzender, der Lehrer Auguſt Völz und der Hofbeſitzer Paul Selk, ſämtlich in Schmaatz. Das Statut iſt vom 25. Juli 1920. Das Geſchäftsjahr läuft vom 1. April bis 31. März. Die Bekanntmachungen er⸗ folgen unter der Firma der Genoſſenſchaft im Pommerſchen Genoſſenſchaftsblatt, beim Eingehen dieſes Blattes bis auf weiteres im Deutſchen Reichsanzeiger. Die Willenserklärungen des Vorſtands erfolgen durch zwei Vorſtandsmitglieder. Die Zeichnung geſchieht derart, daß die Zeich- nenden zu der Firma ihre Namensunter⸗ ſchrift beifügen. Die Einficht in die Liſte der Genoſſen iſt während der Geſchäfts⸗ ſtunden des Gerichts jedermann geſtattet. Stolp, den 11. Auguſt 1920. Das Amtsgericht. • Nearly error-free OCR results are possible using ocrd_calamari with a model trained on the GT4HistOCR dataset • Deep learning enables recognition of both Antiqua and Fraktur fonts with a single, language-independent model • However, state-of-the-art OCR engines still require regions and text lines to be segmented beforehand
  10. 10. Layout Analysis • Training an AI/ML model (CNN) for pixel-wise segmentation using GT data (with augmentation) • 1st iteration („pure ML“): good textline segmentation, but problems with headlines and reading order • 2nd iteration („hybrid“): heuristics improve textline segmentation and reading order • But even with more GT data for training, not all cases can be covered • [Figure: segmentation results, 1st iteration vs. 2nd iteration]
  11. 11. Challenges
  12. 12. Reading Order
  13. 13. Reading Order
  14. 14. Outlook • From Multimodality to Interdisciplinarity: what do we need to succeed? • Datasets of historic newspaper GT that are a) sufficiently granular, large and with representative coverage b) openly available and free-to-use/reuse • Methods and models for document layout analysis that combine a) computer vision with natural language processing and b) machine learning with heuristics and domain knowledge • Community standards and best practices for a) metadata (content and structure) and b) use case driven evaluation → Shared Task HIPE „Identifying Historical People, Places and other Entities“ → Dagstuhl Seminar 22361 „Computational Approaches for Digitized Historical Newspapers“ → Interdisciplinary projects like NewsEye, impresso, Oceanic Exchanges and conferences like HIP'21, DATeCH → Atlas of Digitised Newspapers https://www.digitisednewspapers.net
  15. 15. Thank you for your attention! • Clemens Neudecker (@cneudecker), Berlin State Library • What's Past is Prologue: The NewsEye International Conference, 16-17 March 2021
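
Slide 10 notes that heuristics improved reading order in the second iteration of layout analysis. As a rough illustration of what such a heuristic could look like, the Python sketch below groups region bounding boxes into columns by their left edge and then reads each column top to bottom. The `Region` class, the `reading_order` function and the `column_gap` threshold are hypothetical names chosen for this example; this is not the method used in the projects presented in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned bounding box of a text region (hypothetical structure)."""
    name: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def reading_order(regions, column_gap=50):
    """Order regions column by column, top to bottom within each column.

    A region joins the current column if its left edge lies within
    `column_gap` pixels of the left edge of the column's first region.
    """
    columns = []  # each entry: [left edge of first region, list of regions]
    for region in sorted(regions, key=lambda r: r.x):
        if columns and region.x - columns[-1][0] <= column_gap:
            columns[-1][1].append(region)
        else:
            columns.append([region.x, [region]])

    ordered = []
    for _, members in columns:
        # Within a column, read from the top of the page downwards.
        ordered.extend(sorted(members, key=lambda r: r.y))
    return ordered


# Assumed toy page: two columns with two regions each, given out of order.
page = [
    Region("col2-top", 520, 100, 480, 400),
    Region("col1-bottom", 20, 600, 480, 400),
    Region("col1-top", 10, 100, 480, 400),
    Region("col2-bottom", 510, 600, 480, 400),
]
print([r.name for r in reading_order(page)])
# prints ['col1-top', 'col1-bottom', 'col2-top', 'col2-bottom']
```

A rule this simple fails as soon as a headline spans several columns, which is one reason the outlook slide calls for combining machine learning with heuristics and domain knowledge.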
