Curation Technologies for
Cultural Heritage Archives
Analysing and transforming a heterogeneous
data set into an interactive curation workbench
Georg Rehm (DFKI GmbH, Berlin, Germany)
Martin Lee (Freie Universität Berlin, Berlin, Germany)
Julian Moreno Schneider (DFKI GmbH, Berlin, Germany)
Peter Bourgonje (DFKI GmbH, Berlin, Germany)
Corresponding author: georg.rehm@dfki.de
DATeCH 2019 – Brussels, Belgium – 09/10 May 2019
Outline
• Curation Technologies:
Background – Domains – Examples
• Original Project and Data Set
(= the Cultural Heritage Archive)
• Processing Pipeline and Components:
OCR – NER and Linking – Clustering
• Interactive Curation Workbench:
Graphical User Interface
Georg Rehm, Julian Moreno Schneider, Peter Bourgonje, Ankit Srivastava, Jan Nehring, Armin Berger, Luca König, Sören Räuchle, and Jens Gerth. “Event Detection and Semantic
Storytelling: Generating a Travelogue from a large Collection of Personal Letters”. In Tommaso Caselli, Ben Miller, Marieke van Erp, Piek Vossen, Martha Palmer, Eduard Hovy, and Teruko
Mitamura, editors, Proceedings of the Events and Stories in the News Workshop, Vancouver, Canada, August 2017. Association for Computational Linguistics. Co-located with ACL 2017.
Sector: Public Archives
Georg Rehm, Julián Moreno Schneider, Peter Bourgonje, Ankit Srivastava, Rolf Fricke, Jan Thomsen, Jing He, Joachim Quantz, Armin Berger, Luca König, Sören Räuchle, Jens Gerth, and
David Wabnitz. “Different Types of Automated and Semi-Automated Semantic Storytelling: Curation Technologies for Different Sectors”. In Georg Rehm and Thierry Declerck, editors,
Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, Berlin, Germany, September 13-14, 2017, Proceedings, number 10713 in Lecture
Notes in Artificial Intelligence (LNAI), pages 232-247, Cham, Switzerland, January 2018. Gesellschaft für Sprachtechnologie und Computerlinguistik e.V., Springer. 13/14 September 2017.
Sector: Museums, exhibitions
Georg Rehm, Jing He, Julian Moreno Schneider, Jan Nehring, and Joachim Quantz. Designing User Interfaces
for Curation Technologies. In Sakae Yamamoto, editor, Human Interface and the Management of Information:
Information, Knowledge and Interaction Design, 19th International Conference, HCI International 2017, number
10273 in Lecture Notes in Computer Science (LNCS), pages 388-406, Vancouver, Canada, July 2017. Springer.
1 Nov. 2018 – 31 Oct. 2021
http://qurator.ai
Objectives
• Creation of an ecosystem for Curation
Technologies – we coined this term in
2014 and have been using and promoting
it since 2015.
• Make Berlin-Brandenburg region a hub
of excellence for the development and
industrial use of curation technologies.
Sector-specific solutions
• Medical Content Curator
• Smart Exhibits
• Media Curator
• Intelligent Navigation
• Risk Monitoring
• Next Reality Storytelling
QURATOR at a glance
• 12 subprojects, one joint project
• Six thematic focus areas on three levels, which
correspond to the phases of curation
• The technology platform follows a modular building-block principle
• Methods developed in the subprojects can be used
individually; they form the flexibly combinable
components and services of the QURATOR platform
WK-Initiative QURATOR – Assessment Center – 11 December 2017
Technology showcases
• Content Curation Engine
• Corporate Smart Insights
• Speech-to-Text API
Video Action API
• Semantic Enrichment
QURATOR – Scientific Coordinator:
Georg Rehm (DFKI, georg.rehm@dfki.de)
Curation Tech – Domains – Projects
• Media (television, radio, web tv)
• Journalism
• Public archives
• Museums and exhibitions
• Digital libraries
• Health and life science
• Legal and compliance
• Cinema and film
• CRM
• Digital Humanities
• Cultural Heritage
And a few additional
smaller projects.
http://lynx-project.eu
http://qurator.ai
http://digitale-kuratierung.de
Original Project
and Data Set
Original Research Project
• “Sharing German government’s documents on unification and
Integration, and Building a database on German unification”
– Institute of Korean Studies, FU Berlin, 2010–2016
– Funded by the Ministry of Unification, Republic of Korea
• Collected official political documents regarding the German
reunification process to make them available for research and
planning processes in the context of a potential future
reunification process of Korea
• Documents were collected, intellectually curated, analysed,
interpreted, partially translated and published.
• https://www.geschkult.fu-berlin.de/e/tongilbu/index.html
Data Set
• 51 volumes, mostly PDF files. Largest, most comprehensive
manually curated data set on the German unification.
• Collection: transcripts of debates in the German Parliament,
minutes of committee meetings, reports, proceedings etc.
• All primary documents written in German. The researchers
added summaries and analyses in both German and Korean.
• >138k pages in both German and Korean
– DE: 96k pages, 38m words
– KO: 41k pages, 15m words
• File types:
– Vast majority in PDF
– Auxiliary documents in tables,
Excel sheets, Word documents etc.
• 10 Gigabytes, approx. 200 files
Our own Objective
• State of play: Options for navigating the data set (PDF, some
HTML) and accessing specific information are very limited.
• Origin: Can we do better? Can we offer – with limited effort –
a richer, interactive, more accessible interface to scholars who
want to work on and with this specific data set?
• Objective: develop a prototype of an interactive platform for the
semantic analysis, enrichment and visualisation of the data set in
a way that enables human experts to interact with the
collection in ways that go beyond PDF documents.
• Focus: temporal, geographical, entity, thematic contexts.
• Challenge: make use of an existing portfolio of services and
tools with minimal additional development.
Processing Pipeline
and Components
Processing Pipeline
[Figure: processing pipeline]
Dataset (PDF) → OCR (DE, KO) → NER (DE, KO planned) → Linking (DE) → Clustering (DE) → Enriched dataset → Interactive dashboard
• External ontologies: contextualise the contents of the dataset
using various Linked Data sources
• Enriched dataset: plain text contents and NIF-based JSON annotations
• Interactive dashboard: link back to original PDF files (in progress)
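The slide mentions NIF-based JSON annotations but does not show their layout. The following is a minimal sketch of what one such annotation record could look like, assuming a JSON-LD-like dictionary that uses the public NIF 2.0 core and ITS 2.0 RDF property names. The function name `annotate`, the document URI and the sample sentence are hypothetical and only serve as illustration; the project's actual schema may differ.

```python
import json

# Compact prefixes; the namespaces are those of NIF 2.0 core and ITS 2.0 RDF.
CONTEXT = {
    "nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#",
    "itsrdf": "http://www.w3.org/2005/11/its/rdf#",
}


def annotate(doc_uri, text, surface, entity_uri, entity_class):
    """Build a NIF-style record for the first occurrence of `surface` in `text`.

    Hypothetical helper: offsets are character offsets into the plain text.
    """
    begin = text.index(surface)  # raises ValueError if the surface form is absent
    end = begin + len(surface)
    return {
        "@context": CONTEXT,
        "@id": f"{doc_uri}#char={begin},{end}",
        "nif:referenceContext": f"{doc_uri}#char=0,{len(text)}",
        "nif:beginIndex": begin,
        "nif:endIndex": end,
        "nif:anchorOf": surface,
        "itsrdf:taIdentRef": entity_uri,    # link to the knowledge base entry
        "itsrdf:taClassRef": entity_class,  # entity type
    }


TEXT = "Helmut Kohl sprach in Berlin."
record = annotate(
    "http://example.org/doc1",            # hypothetical document URI
    TEXT,
    "Berlin",
    "http://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/ontology/Location",
)
print(json.dumps(record, indent=2, ensure_ascii=False))
```

Keeping character offsets next to the entity URI is what makes a later link back to the original page possible, since the offsets identify where in the OCR output the mention occurred.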
OCR: Dataset
• Input PDFs are high quality (mostly)
• they include typewriter fonts,
• with the occasional table,
• typically multi-column layout,
• and mixed German/Korean
OCR: Evaluation
• Tesseract (currently without any additional
training) to convert PDF to plain text
– 4.0 LSTM (“best” model) for DE
– 3.5 for KO (Version 4.0 had segmentation issues)
• Evaluation of OCR quality using the four most frequent,
representative content types.
• Ground Truth transcribed using Transkribus:
– DE: 2,870 words
– KO: 2,483 words
• Evaluated using ocrevalUAtion
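The character error rate (CER) used in this kind of evaluation is the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. A minimal sketch in plain Python follows; it illustrates the metric only and does not reproduce the exact normalisation or options of ocrevalUAtion. The function names and the sample strings are chosen for illustration.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_output):
    """Character error rate: edit distance divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


# Illustrative strings: the OCR output misreads "ni" as "m".
truth = "Wiedervereinigung"
ocr = "Wiedervereimgung"
print(f"CER = {cer(truth, ocr):.3f}")
```

Because the distance is normalised by the ground-truth length, a CER above 1.0 is possible when the OCR output contains many spurious characters, which can happen with misread tables or multi-column pages.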
OCR: Evaluation
• Results vary greatly depending on content type.
• For one-column texts and scans, CER is comparatively low (only DE).
• Main problem: formatting (see two columns and table formats).
• Improving OCR results for these formats would be beneficial but for
our downstream applications, the impact remains relatively limited:
– Majority of our data set’s content is single column.
– Pipeline relies on entities, which is why it will not be that much
affected by the incorrect interpretation of individual rows in a table.
• One possible solution: instead of using text as output format of
Tesseract, use XML or hOCR to preserve structure information.
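To illustrate why a structured output format helps, here is a minimal sketch that reads word-level bounding boxes from hOCR markup using only the Python standard library. In hOCR, each recognised word is a `span` with class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` field. The class name `HocrWords` and the sample markup are made up for illustration; real Tesseract output contains more elements and attributes.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word span currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or "").split():
            # The title holds ';'-separated fields, e.g. "bbox 40 92 180 118; x_wconf 96".
            for field in (a.get("title") or "").split(";"):
                parts = field.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(p) for p in parts[1:5])

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


# Hand-written sample in hOCR style (not actual output from the data set).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
 <span class='ocr_line' title='bbox 40 90 420 120'>
  <span class='ocrx_word' title='bbox 40 92 180 118; x_wconf 96'>Deutscher</span>
  <span class='ocrx_word' title='bbox 190 92 330 118; x_wconf 93'>Bundestag</span>
 </span>
</div>
"""

parser = HocrWords()
parser.feed(SAMPLE)
for text, bbox in parser.words:
    print(text, bbox)
```

With the horizontal coordinates available, words could be grouped into columns or table cells before the text is passed on to entity recognition, instead of relying on the reading order of the plain text output.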
NER and Entity Linking
• Recognising persons, locations and organisations in the
plain text output generated by the OCR step
• OpenNLP NER engine for entity spotting
– Trained with WikiNER (1.1m articles, 400m tokens)
– No training data available for our specific domain
• DBpedia SPARQL endpoint for entity linking
– Easy access to additional information on entities, provided
by DBpedia or other LOD knowledge bases
– To be replaced with GND (Gemeinsame Normdatei),
designed for German, much higher coverage expected.
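The slides do not show the query used for linking. As a rough sketch, a spotted surface form could be looked up by its German label on the public DBpedia endpoint with a query like the one built below. The function names, the exact-label matching strategy and the result limit are assumptions for illustration; the code only constructs the request and does not contact the endpoint.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_lookup_query(surface_form, lang="de", limit=5):
    """Build a SPARQL query that finds resources whose label equals the surface form.

    Hypothetical strategy: exact rdfs:label match with a language tag.
    """
    escaped = surface_form.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?entity WHERE {\n"
        f'  ?entity rdfs:label "{escaped}"@{lang} .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(surface_form, lang="de"):
    """Return a GET URL for the query, asking for JSON results."""
    params = {
        "query": build_lookup_query(surface_form, lang),
        "format": "application/sparql-results+json",
    }
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


query = build_lookup_query("Helmut Kohl")
print(query)
print(build_request_url("Helmut Kohl"))
```

Exact label matching is simple but brittle with OCR noise and inflected forms, which is one reason why coverage depends heavily on the knowledge base and why an authority file designed for German may link more mentions.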
Document Clustering
• We cluster documents based on the URIs of spotted and
linked entities
• WEKA (Expectation Maximization algorithm)
• Allows theme-based exploration of the data set (starting
off with a certain region, cluster of people, etc.)
• Systematic evaluation pending (as part of a user study)
• Document clustering to be complemented with a new
component for topic detection
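To make the idea of clustering on entity URIs concrete, here is a small sketch in plain Python. It represents each document as a binary vector over the linked entity URIs and fits a mixture of Bernoulli distributions with Expectation Maximization. This is a swapped-in, simplified stand-in and not WEKA's EM implementation; the document names, URIs, seed choice and smoothing constant are assumptions for illustration.

```python
import math


def build_matrix(docs):
    """Turn {doc_name: set_of_entity_uris} into sorted names, vocabulary and 0/1 rows."""
    vocab = sorted({uri for uris in docs.values() for uri in uris})
    names = sorted(docs)
    rows = [[1.0 if uri in docs[n] else 0.0 for uri in vocab] for n in names]
    return names, vocab, rows


def em_bernoulli(rows, k, seeds, iters=30, eps=1e-3):
    """Cluster binary rows with EM for a Bernoulli mixture.

    seeds: k row indices used to initialise the components (deterministic).
    Returns the most probable component index for each row.
    """
    n, d = len(rows), len(rows[0])
    weights = [1.0 / k] * k
    # Initialise each component from one seed document, kept away from 0 and 1.
    theta = [[0.9 * rows[s][j] + 0.05 for j in range(d)] for s in seeds]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each document.
        resp = []
        for x in rows:
            logp = [math.log(weights[c]) + sum(
                        math.log(theta[c][j] if x[j] else 1.0 - theta[c][j])
                        for j in range(d))
                    for c in range(k)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate mixing weights and per-entity probabilities (smoothed).
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = (nc + eps) / (n + k * eps)
            theta[c] = [(sum(r[c] * x[j] for r, x in zip(resp, rows)) + eps)
                        / (nc + 2 * eps) for j in range(d)]
    return [max(range(k), key=lambda c: r[c]) for r in resp]


R = "http://dbpedia.org/resource/"
DOCS = {  # hypothetical documents with their linked entity URIs
    "doc_a": {R + "Berlin", R + "Helmut_Kohl"},
    "doc_b": {R + "Berlin", R + "Helmut_Kohl", R + "Bonn"},
    "doc_c": {R + "Leipzig", R + "Treuhandanstalt"},
    "doc_d": {R + "Leipzig", R + "Treuhandanstalt", R + "Dresden"},
}
names, vocab, rows = build_matrix(DOCS)
clusters = dict(zip(names, em_bernoulli(rows, k=2, seeds=[0, 2])))
print(clusters)
```

Using URIs rather than surface strings as features means that different spellings of the same entity fall into the same dimension, so documents are grouped by what they mention rather than by how it is written.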
Korean Processing
• Language Technology support for Korean is limited
• We’ve been unable to find a readily available NER
package or data for Korean
• Idea: use annotation projection as a means of enriching
Korean text with entities.
• Using Machine Translation (OpenNMT) to translate
Korean texts into German, we aim to train an NER
system for Korean by projecting the entities found in the
German translation onto the Korean sentences (work in progress).
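Annotation projection can be sketched as follows: entities are recognised on the translated sentence, and a word alignment between translation and source is used to carry the labels back to the Korean tokens. The snippet below assumes that token-level alignments are already available; the function name, the example sentence pair and the hand-made alignment are hypothetical and not taken from the project.

```python
def project_entities(entities, alignment):
    """Project entity spans from the translated sentence onto the source sentence.

    entities:  list of (start, end, label) token spans on the translation, end exclusive
    alignment: iterable of (translation_index, source_index) token pairs
    Returns (start, end, label) spans over source tokens; unaligned entities are dropped.
    """
    projected = []
    for start, end, label in entities:
        source_ids = sorted({s for t, s in alignment if start <= t < end})
        if source_ids:
            # Cover the smallest contiguous source span that contains all aligned tokens.
            projected.append((source_ids[0], source_ids[-1] + 1, label))
    return projected


# Illustrative sentence pair with a hand-made alignment.
korean = ["콜", "총리는", "베를린에서", "연설했다"]
german = ["Kanzler", "Kohl", "hielt", "in", "Berlin", "eine", "Rede"]
alignment = [(0, 1), (1, 0), (2, 3), (3, 2), (4, 2), (6, 3)]
german_entities = [(1, 2, "PER"), (4, 5, "LOC")]  # "Kohl", "Berlin"

korean_entities = project_entities(german_entities, alignment)
for start, end, label in korean_entities:
    print(label, korean[start:end])
```

The projected spans are token-level, so a Korean token that fuses a name with a particle (as in "베를린에서") is labelled as a whole; a real system would need to decide whether to split such tokens before training.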
Curation Workbench
Summary and
Future Work
Summary
• Dashboard for the curation of cultural heritage data.
• Goal: intuitive analysis, exploration, visualisation of data.
• Add value by combining existing semantic components
• To what extent can off-the-shelf tools be used for this?
• Current machine learning techniques rely on very large
amounts of representative, high-quality training data.
• Use of tools trained on general domain data is ambitious.
• But: generic processing results are good enough to
develop a useful curation dashboard that helps users
interested in further exploring a large data set.
Future Work
• Pipeline and dashboard are
work in progress
• User study to verify usefulness of tool
• Show metadata created by pipeline
• Link GUI with the original PDF files/pages
• Half of the 51 editions have been translated into Korean
– we’ll explore if we can create a parallel corpus for MT
• Furthermore, related: development of additional
services, workflow manager, containerisation of services
(especially in QURATOR and European Language Grid)
Thank you very much!
Questions?