Multimodal Perspectives for Digitised Historical Newspapers
1. Multimodal Perspectives for Digitised Historical Newspapers
Clemens Neudecker (@cneudecker), Berlin State Library
What’s Past is Prologue: The NewsEye International Conference
16-17 March 2021
2. Introduction
• about:me
• Research Advisor ("Forschungsreferent")
• Managing a team of 5 devs (project based)
• ~ 20 years of applied R&D in document analysis
• motivation: all cultural heritage available digitally,
online and free to use for everyone
• content of this talk
• key research & development projects
• main challenges with digitised newspapers
• outlook / vision
3. SBB-PK
• Staatsbibliothek zu Berlin –
Stiftung Preußischer Kulturbesitz (SBB-PK)
https://staatsbibliothek-berlin.de
• Largest research library in Germany
• A library on two sites: (former) East and West Berlin
• Part of the larger cultural heritage organization SPK
• In-house digitization center, Kitodo workflows
• Digital collections https://digital.staatsbibliothek-berlin.de
• ZEFYS https://zefys.staatsbibliothek-berlin.de
• SBB LAB https://lab.sbb.berlin
4. DDB Newspaper Portal
• Users want uniform access and UI for
digitised newspaper collections
• Key features of a digital newspaper portal
• Title list
• Calendar
• Keyword search
• "Advanced features"
• Citation & Persistence
• Named Entities
• Corpus Building
• https://pro.deutsche-digitale-bibliothek.de/deutsches-zeitungsportal
5. Europeana Newspapers
• Originally established in the EU project
Europeana Newspapers (2012-2015) as a
service of The European Library (TEL)
• Approx. 12m newspaper pages with OCR
from 12 libraries in > 20 languages
• TEL discontinued from 2017
• Migration of content to Europeana, work
continues to (re-)implement TEL feature set
• https://newspapers.europeana.eu
• http://www.europeana-newspapers.eu
6. OCR-D
• Provide the technical and organisational framework for the
OCR processing of the German VD digitization initiatives
(all documents printed in Germany from 1600 – 1900)
• Open and transparent development process:
• Specifications & GT Guidelines https://ocr-d.de/en/dev
• Open source tools https://github.com/OCR-D
• Community https://gitter.im/OCR-D/Lobby
• 3 phases:
• Phase I (2015–2018): Requirements analysis
• Phase II (2018–2020): Development of prototypes
• Phase III (2021–2024): Implementation in production
• https://ocr-d.de
7. Qurator – Curation Technologies
• Leverage state-of-the-art AI/ML for data and
content curation across various domains
• Our use case: digitized cultural heritage
• Development of a complete pipeline:
• Binarization
• Layout analysis
• OCR
• Post-correction
• Named Entity Recognition and Linking
• Image Similarity and Search
• https://qurator.ai
• https://github.com/qurator-spk
8. SoNAR (IDH)
• Examine and evaluate approaches for an
advanced research technology environment
supporting Historical Network Analysis based on
metadata & digitised newspapers
• Extracting person names and relations from special
databases & digitised newspaper full text
• Transforming entities with relations into a historical
social network graph
• Creating intuitive and innovative visualizations and
interfaces for querying and analysing the social
network graph
• https://sonar.fh-potsdam.de
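The step from extracted entities to a network graph can be sketched as follows. This is a minimal illustration and not the SoNAR implementation: it assumes that two persons are related whenever they are named in the same text (simple co-occurrence), and the function names and input data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Return a Counter mapping unordered person pairs to co-occurrence counts.

    `documents` is an iterable of lists of person names, one list per text.
    Assumption: co-occurrence in one text stands in for a "relation".
    """
    edges = Counter()
    for persons in documents:
        # Deduplicate and sort so that every pair has one canonical key.
        unique = sorted(set(persons))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


def neighbours(edges, person):
    """List the persons directly connected to `person`, sorted by name."""
    found = set()
    for a, b in edges:
        if a == person:
            found.add(b)
        elif b == person:
            found.add(a)
    return sorted(found)


# Hypothetical input: names as an NER step might return them for three texts.
documents = [
    ["Albert Timreck", "August Völz", "Paul Selk"],
    ["Albert Timreck", "Paul Selk"],
    ["August Völz"],
]
graph = build_cooccurrence_graph(documents)
```

The edge weights give a first, rough measure of how strongly two persons are connected. A real system would also need entity linking so that spelling variants of one name map to the same node.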
9. OCR
Example OCR output:
Stolp, Pomm. [56000]
Jn unſerem Genoſſenſchaftsregiſter iſt
heute unter Nr. 113 die ,,Ländliche
Spar⸗ und Darlehnskaſſe Schmaatz,
eingetragene Genofſenſchaft mit be⸗
ſchränkter Haftpflicht in Schmaatz“,
eingetragen worden. Gegenſtand des
Unternehmens iſt die Gewährung von
Darlehen an die Mitglieder für ihren
Geſchäfts⸗ und Wirtſchaftsbetrieb, Er⸗
leichterung der Geldanlage und Förderung
des Sparſinns, nebenbei gemeinſchaftliche
Beſchaffung landwirtſchaftlicher Betriebs⸗
mittel. Die Haftſumme beträgt 20 M,
die Höchſtzahl der Geſchäftsanteile 100.
Vorſtandsmitglieder ſind: der Hofbeſitzer
Albert Timreck als Vorſitzender, der
Lehrer Auguſt Völz und der Hofbeſitzer
Paul Selk, ſämtlich in Schmaatz. Das
Statut iſt vom 25. Juli 1920. Das
Geſchäftsjahr läuft vom 1. April bis
31. März. Die Bekanntmachungen er⸗
folgen unter der Firma der Genoſſenſchaft
im Pommerſchen Genoſſenſchaftsblatt, beim
Eingehen dieſes Blattes bis auf weiteres
im Deutſchen Reichsanzeiger. Die
Willenserklärungen des Vorſtands erfolgen
durch zwei Vorſtandsmitglieder. Die
Zeichnung geſchieht derart, daß die Zeich-
nenden zu der Firma ihre Namensunter⸗
ſchrift beifügen. Die Einficht in die Liſte
der Genoſſen iſt während der Geſchäfts⸗
ſtunden des Gerichts jedermann geſtattet.
Stolp, den 11. Auguſt 1920. Das
Amtsgericht.
• Nearly error-free OCR results are possible
using ocrd_calamari with a model trained
on the GT4HistOCR dataset!
• Deep learning enables recognition of
both Antiqua and Fraktur fonts with a
single, language-independent model
• Alas, state-of-the-art OCR engines
require regions and textlines that are
already segmented…
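"Nearly error-free" is commonly quantified as the character error rate (CER): the edit distance between OCR output and ground truth, divided by the length of the ground truth. The sketch below is a plain-Python illustration, not the evaluation tooling used in the project. It takes the word "Genofſenſchaft" from the sample output above and compares it with the spelling "Genoſſenſchaft", which is assumed to be the ground truth because the talk does not include one.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


reference = "Genoſſenſchaft"   # assumed ground truth (not given in the talk)
hypothesis = "Genofſenſchaft"  # as it appears in the sample output
cer = character_error_rate(reference, hypothesis)
```

Here a single long s (ſ) has been read as f, so one of the fourteen reference characters is wrong. Confusions between ſ and f are a typical source of the remaining errors in Fraktur OCR.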
10. Layout Analysis
• Training an AI/ML model (CNN) for pixel-wise
segmentation using GT data (with augmentation)
• 1st iteration ("pure ML"): good textline segmentation,
but problems with headlines and reading order
• 2nd iteration ("hybrid"): heuristics improve
textline segmentation and reading order
• But even with more GT data for training, not all
cases can be covered
[Figure: layout analysis results, 1st iteration vs. 2nd iteration]
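To illustrate what a reading-order heuristic can look like, the sketch below orders detected regions column by column and top to bottom within each column. It is an assumption made for illustration, not the heuristic used in the 2nd iteration; the `Region` type, the `column_tolerance` parameter and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def reading_order(regions, column_tolerance=50):
    """Order regions column by column, top to bottom within each column.

    A region joins the current column if its left edge lies within
    `column_tolerance` pixels of that column's first region.
    """
    columns = []
    for region in sorted(regions, key=lambda r: r.x):
        if columns and region.x - columns[-1][0].x <= column_tolerance:
            columns[-1].append(region)
        else:
            columns.append([region])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda r: r.y))
    return ordered


# Hypothetical two-column page with two text regions per column.
regions = [
    Region("col2-top", 520, 100, 480, 400),
    Region("col1-bottom", 10, 620, 480, 300),
    Region("col1-top", 20, 100, 480, 500),
    Region("col2-bottom", 510, 520, 480, 400),
]
order = [region.name for region in reading_order(regions)]
```

A rule like this fails as soon as a headline spans several columns or an advertisement interrupts a column, which is one reason why heuristics alone cannot cover every case either.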
14. Outlook
• From Multimodality to Interdisciplinarity: what do we need to succeed?
• Datasets of historic newspaper GT that are
a) sufficiently granular, large and with representative coverage
b) openly available and free-to-use/reuse
• Methods and models for document layout analysis that combine
a) computer vision with natural language processing and
b) machine learning with heuristics and domain knowledge
• Community standards and best practices for
a) metadata (content and structure) and
b) use case driven evaluation
→ Shared Task HIPE "Identifying Historical People, Places and other Entities"
→ Dagstuhl Seminar 22361 "Computational Approaches for Digitized Historical Newspapers"
→ Interdisciplinary projects like NewsEye, impresso, Oceanic Exchanges and conferences like HIP’21, DATeCH
→ Atlas of Digitised Newspapers https://www.digitisednewspapers.net
15. Thank you for your attention!
Clemens Neudecker (@cneudecker), Berlin State Library
What’s Past is Prologue: The NewsEye International Conference
16-17 March 2021