Curation Technologies for
Cultural Heritage Archives
Analysing and transforming a heterogeneous
data set into an interactive curation workbench
Georg Rehm (DFKI GmbH, Berlin, Germany)
Martin Lee (Freie Universität Berlin, Berlin, Germany)
Julian Moreno Schneider (DFKI GmbH, Berlin, Germany)
Peter Bourgonje (DFKI GmbH, Berlin, Germany)
Corresponding author: georg.rehm@dfki.de
DATeCH 2019 – Brussels, Belgium – 09/10 May 2019
Outline
• Curation Technologies:
Background – Domains – Examples
• Original Project and Data Set
(= the Cultural Heritage Archive)
• Processing Pipeline and Components:
OCR – NER and Linking – Clustering
• Interactive Curation Workbench:
Graphical User Interface
Curation Technologies for Cultural Heritage Archives 2
Georg Rehm, Julian Moreno Schneider, Peter Bourgonje, Ankit Srivastava, Jan Nehring, Armin Berger, Luca König, Sören Räuchle, and Jens Gerth. “Event Detection and Semantic
Storytelling: Generating a Travelogue from a large Collection of Personal Letters”. In Tommaso Caselli, Ben Miller, Marieke van Erp, Piek Vossen, Martha Palmer, Eduard Hovy, and Teruko
Mitamura, editors, Proceedings of the Events and Stories in the News Workshop, Vancouver, Canada, August 2017. Association for Computational Linguistics. Co-located with ACL 2017.
Sector: Public Archives
Georg Rehm, Julián Moreno Schneider, Peter Bourgonje, Ankit Srivastava, Rolf Fricke, Jan Thomsen, Jing He, Joachim Quantz, Armin Berger, Luca König, Sören Räuchle, Jens Gerth, and
David Wabnitz. “Different Types of Automated and Semi-Automated Semantic Storytelling: Curation Technologies for Different Sectors”. In Georg Rehm and Thierry Declerck, editors,
Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, Berlin, Germany, September 13-14, 2017, Proceedings, number 10713 in Lecture
Notes in Artificial Intelligence (LNAI), pages 232-247, Cham, Switzerland, January 2018. Gesellschaft für Sprachtechnologie und Computerlinguistik e.V., Springer. 13/14 September 2017.
Sector: Museums,
exhibitions
Georg Rehm, Jing He, Julian Moreno Schneider, Jan Nehring, and Joachim Quantz. Designing User Interfaces
for Curation Technologies. In Sakae Yamamoto, editor, Human Interface and the Management of Information:
Information, Knowledge and Interaction Design, 19th International Conference, HCI International 2017, number
10273 in Lecture Notes in Computer Science (LNCS), pages 388-406, Vancouver, Canada, July 2017. Springer.
1 Nov. 2018 – 31 Oct. 2021
http://qurator.ai
Objectives
• Creation of an ecosystem for Curation
Technologies – we coined this term in
2014 and have been using and promoting
it since 2015.
• Make Berlin-Brandenburg region a hub
of excellence for the development and
industrial use of curation technologies.
Sector-specific solutions
• Medical Content Curator
• Smart Exhibits
• Media Curator
• Intelligent Navigation
• Risk Monitoring
• Next Reality Storytelling
• 12 subprojects – one joint project
• Six thematic focus areas on three levels – these
correspond to the phases of curation
• The technology platform follows a modular building-block principle
• Methods developed in the subprojects can be used individually
– they form the flexibly combinable
components and services of the QURATOR platform
WK initiative QURATOR – Assessment Center – 11 December 2017
QURATOR at a glance
Technology showcases
• Content Curation Engine
• Corporate Smart Insights
• Speech-to-Text API
Video Action API
• Semantic Enrichment
QURATOR – Scientific Coordinator:
Georg Rehm (DFKI, georg.rehm@dfki.de)
Curation Tech – Domains – Projects
• Media (television, radio, web tv)
• Journalism
• Public archives
• Museums and exhibitions
• Digital libraries
• Health and life science
• Legal and compliance
• Cinema and film
• CRM
• Digital Humanities
• Cultural Heritage
And a few additional
smaller projects.
http://lynx-project.eu
http://qurator.ai
http://digitale-kuratierung.de
Original Project
and Data Set
Original Research Project
• “Sharing German government’s documents on unification and
Integration, and Building a data-base on German unification”
– Institute of Korean Studies, FU Berlin, 2010–2016
– Funded by the Ministry of Unification, Republic of Korea
• Collected official political documents regarding the German
reunification process to make them available for research and
planning processes in the context of a potential future
reunification process of Korea
• Documents were collected, intellectually curated, analysed,
interpreted, partially translated and published.
• https://www.geschkult.fu-berlin.de/e/tongilbu/index.html
Data Set
• 51 volumes, mostly PDF files. Largest, most comprehensive
manually curated data set on the German unification.
• Collection: transcripts of debates in the German Parliament,
minutes of committee meetings, reports, proceedings etc.
• All primary documents written in German. The researchers
added summaries and analyses in both German and Korean.
• >138k pages in both German and Korean
– DE: 96k pages, 38m words
– KO: 41k pages, 15m words
• File types:
– Vast majority in PDF
– Auxiliary documents in tables,
Excel sheets, Word documents etc.
• 10 Gigabytes, approx. 200 files
Our own Objective
• State of play: options for navigating the data set (PDF, some
HTML) and accessing specific information are very limited.
• Origin: Can we do better? Can we offer – with limited effort –
a richer, interactive, more accessible interface to scholars who
want to work on and with this specific data set?
• Objective: develop a prototype of an interactive platform for the
semantic analysis, enrichment and visualisation of the data set
that lets human experts interact with the
collection in ways that go beyond PDF documents.
• Focus: temporal, geographical, entity, thematic contexts.
• Challenge: make use of an existing portfolio of services and
tools with minimal additional development.
Processing Pipeline
and Components
Processing Pipeline
Dataset (PDF) → OCR → NER → Linking → Clustering → Enriched dataset → Interactive dashboard
• OCR: DE, KO
• NER: DE (KO planned)
• Linking: DE
• Clustering: DE
• External ontologies: contextualise the contents of the dataset
using various Linked Data sources
• Enriched dataset: plain text contents and NIF-based JSON annotations
• Interactive dashboard: link back to original PDF files (in progress)
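To make the "NIF-based JSON annotations" more concrete, here is a minimal sketch of what one entity annotation could look like. The property names come from the NIF 2.0 core ontology and the ITS 2.0 RDF vocabulary; the exact JSON layout used in the pipeline is not shown on the slides, so the dictionary structure, the helper name nif_annotation and the example.org document URI are assumptions for illustration only.

```python
import json

# Namespace prefixes of the NIF 2.0 core ontology and the ITS 2.0 RDF vocabulary.
CONTEXT = {
    "nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#",
    "itsrdf": "http://www.w3.org/2005/11/its/rdf#",
}


def nif_annotation(doc_uri, text, mention, entity_uri, entity_class):
    """Describe the first occurrence of `mention` in `text` as a NIF-style dict.

    Hypothetical helper: the real pipeline output may be structured differently.
    """
    begin = text.index(mention)  # raises ValueError if the mention is absent
    end = begin + len(mention)
    return {
        "@context": CONTEXT,
        "@id": f"{doc_uri}#char={begin},{end}",
        "nif:referenceContext": f"{doc_uri}#char=0,{len(text)}",
        "nif:beginIndex": begin,
        "nif:endIndex": end,
        "nif:anchorOf": mention,
        "itsrdf:taIdentRef": entity_uri,
        "itsrdf:taClassRef": entity_class,
    }


text = "Helmut Kohl sprach in Dresden."
annotation = nif_annotation(
    "http://example.org/doc/1",  # hypothetical document URI
    text,
    "Dresden",
    "http://dbpedia.org/resource/Dresden",
    "http://dbpedia.org/ontology/Place",
)
print(json.dumps(annotation, indent=2, ensure_ascii=False))
```

Character offsets tie each annotation to the plain text produced by OCR, which is also what a later link back to the original PDF page would need.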
OCR: Dataset
• Input PDFs are high quality (mostly)
• they include typewriter fonts,
• with the occasional table,
• typically multi-column layout,
• and mixed German/Korean
OCR: Evaluation
• Tesseract (currently without any additional
training) to convert PDF to plain text
– 4.0 LSTM (“best” model) for DE
– 3.05 for KO (version 4.0 had segmentation issues)
• Evaluation of OCR quality using the four most frequent,
representative content types.
• Ground Truth transcribed using Transkribus:
– DE: 2,870 words
– KO: 2,483 words
• Evaluated using ocrevalUAtion
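The character error rate (CER) used in this kind of evaluation is the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below shows that calculation in plain Python. It is not the ocrevalUAtion implementation, which may normalise whitespace, case or Unicode differently; the function names are chosen here for illustration.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


ground_truth = "Wiedervereinigung"
ocr_output = "Wiedervereinigumg"  # one misrecognised character
print(character_error_rate(ground_truth, ocr_output))
```

With one wrong character in a 17-character word the CER is 1/17, which illustrates why a few misread table rows barely move the score of a long single-column page.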
OCR: Evaluation
• Results vary greatly depending on content type.
• For one-column texts and scans, CER is comparatively low (only DE).
• Main problem: formatting (see two columns and table formats).
• Improving OCR results for these formats would be beneficial but for
our downstream applications, the impact remains relatively limited:
– Majority of our data set’s content is single column.
– Pipeline relies on entities, which is why it will not be that much
affected by the incorrect interpretation of individual rows in a table.
• One possible solution: instead of using text as output format of
Tesseract, use XML or hOCR to preserve structure information.
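To illustrate what structure information hOCR preserves, the sketch below reads word text together with bounding boxes from a small hand-written hOCR fragment. The fragment is made up for this example (it is not output from the data set), and the class name HocrWords is hypothetical. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment in the shape Tesseract emits for words and lines.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <span class='ocr_line' title='bbox 50 100 600 130'>
    <span class='ocrx_word' title='bbox 50 100 180 130; x_wconf 96'>Deutscher</span>
    <span class='ocrx_word' title='bbox 190 100 320 130; x_wconf 93'>Bundestag</span>
  </span>
</div>
"""


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word element just opened

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            # The title attribute looks like "bbox x0 y0 x1 y1; x_wconf 96".
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(value) for value in parts[1:5])

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


parser = HocrWords()
parser.feed(SAMPLE_HOCR)
for word, bbox in parser.words:
    print(word, bbox)
```

With coordinates available, words could be grouped by their horizontal position before the text is passed on, which is one way to keep two-column pages and table cells from being interleaved.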
NER and Entity Linking
• Recognising persons, locations and organisations in the
plain text output generated by the OCR step
• Apache OpenNLP NER engine for entity spotting
– Trained with WikiNER (1.1m articles, 400m tokens)
– No training data available for our specific domain
• DBpedia SPARQL endpoint for entity linking
– Easy access to additional information on entities, provided
by DBpedia or other LOD knowledge bases
– To be replaced with GND (Gemeinsame Normdatei),
designed for German, much higher coverage expected.
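As an illustration of entity linking against a SPARQL endpoint, the sketch below builds a label lookup request for the public DBpedia endpoint and reads entity URIs from a response in the SPARQL 1.1 JSON results format. The slides do not show the query that was actually used, so the exact-label lookup, the function names and the canned sample response are assumptions; no network request is made here.

```python
import json
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_lookup_url(label, lang="de", limit=5):
    """Build a GET URL that asks for resources carrying exactly this label."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?entity WHERE {\n"
        f'  ?entity rdfs:label "{escaped}"@{lang} .\n'
        f"}} LIMIT {int(limit)}"
    )
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


def entity_uris(response_text):
    """Extract the ?entity URIs from a SPARQL JSON results document."""
    bindings = json.loads(response_text)["results"]["bindings"]
    return [row["entity"]["value"] for row in bindings]


# Canned response in the standard results format, written by hand for this example.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["entity"]},
    "results": {"bindings": [
        {"entity": {"type": "uri",
                    "value": "http://dbpedia.org/resource/Helmut_Kohl"}},
    ]},
})

url = build_lookup_url("Helmut Kohl")
print(url)
print(entity_uris(SAMPLE_RESPONSE))
```

Because only the endpoint URL and the query text are specific to DBpedia, a switch to another knowledge base such as GND would mainly affect those two pieces.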
Document Clustering
• We cluster documents based on the URIs of spotted and
linked entities
• WEKA (Expectation Maximization algorithm)
• Allows theme-based exploration of the data set (starting
off with a certain region, cluster of people, etc.)
• Systematic evaluation pending (as part of a user study)
• Document clustering to be complemented with a new
component for topic detection
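The idea of clustering documents by the URIs of their linked entities can be sketched as follows. WEKA is a Java toolkit and its EM clusterer is not reproduced here; instead, this is a small pure-Python EM for a mixture of Bernoulli distributions over binary "entity present / absent" vectors. The function name, the smoothing constant alpha, the seeding strategy and the four toy documents are all assumptions made for this example.

```python
import math


def em_bernoulli(docs, k=2, iterations=20, alpha=0.5):
    """Cluster documents (sets of entity URIs) with EM for a Bernoulli mixture."""
    vocab = sorted(set().union(*docs))
    X = [[1 if uri in doc else 0 for uri in vocab] for doc in docs]
    n = len(X)
    # Deterministic start: seed each cluster from one evenly spaced document.
    seeds = [round(i * (n - 1) / max(k - 1, 1)) for i in range(k)]
    mu = [[0.25 + 0.5 * x for x in X[s]] for s in seeds]
    pi = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each cluster for each document.
        resp = []
        for row in X:
            logs = [
                math.log(pi[c])
                + sum(math.log(m if x else 1.0 - m) for x, m in zip(row, mu[c]))
                for c in range(k)
            ]
            top = max(logs)
            weights = [math.exp(value - top) for value in logs]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate mixing weights and per-entity probabilities,
        # smoothed with alpha so that no probability reaches 0 or 1.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            pi[c] = (mass + alpha) / (n + k * alpha)
            mu[c] = [
                (sum(r[c] * row[j] for r, row in zip(resp, X)) + alpha)
                / (mass + 2 * alpha)
                for j in range(len(vocab))
            ]
    # Hard assignment: the most probable cluster per document.
    return [max(range(k), key=lambda c: r[c]) for r in resp]


docs = [
    {"dbr:Berlin", "dbr:Bonn"},
    {"dbr:Berlin", "dbr:Bonn", "dbr:Leipzig"},
    {"dbr:Helmut_Kohl", "dbr:Hans-Dietrich_Genscher", "dbr:Leipzig"},
    {"dbr:Helmut_Kohl", "dbr:Hans-Dietrich_Genscher"},
]
labels = em_bernoulli(docs, k=2)
print(labels)
```

In this toy input the two place-centred documents end up in one cluster and the two person-centred documents in the other, which mirrors the theme-based exploration described above.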
Korean Processing
• Language Technology support for Korean is limited
• We’ve been unable to find a readily available NER
package or data for Korean
• Idea: use annotation projection as a means of enriching
Korean text with entities.
• Using Machine Translation (OpenNMT) to translate
Korean texts into German, we aim to train an NER
system on Korean by projecting the entities found in the
translated text onto the Korean sentences (work in progress).
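Annotation projection can be sketched in a few lines: entities are recognised in the translation, and a word alignment carries their labels back to the Korean tokens, which then serve as training data. In the example below the tokenisation, the alignment pairs and the entity spans are written by hand for illustration; in practice they would come from the MT system or a separate aligner and from the NER component. The function name is hypothetical.

```python
def project_entities(entities, alignment, source_length):
    """Project entity spans from translated tokens back onto source tokens.

    entities: list of (start, end, label) over target tokens, end exclusive.
    alignment: list of (source_index, target_index) pairs.
    Returns one BIO tag per source token.
    """
    tags = ["O"] * source_length
    for start, end, label in entities:
        linked = sorted({s for s, t in alignment if start <= t < end})
        if not linked:
            continue  # no aligned source token, so nothing to project
        # Tag the contiguous source range covered by the aligned tokens.
        for position in range(linked[0], linked[-1] + 1):
            prefix = "B" if position == linked[0] else "I"
            tags[position] = f"{prefix}-{label}"
    return tags


# "Chancellor Helmut Kohl visited Berlin." (illustrative sentence pair)
korean = ["헬무트", "콜", "총리는", "베를린을", "방문했다"]
german = ["Bundeskanzler", "Helmut", "Kohl", "besuchte", "Berlin"]
alignment = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 3)]  # (Korean, German) indices
entities = [(1, 3, "PER"), (4, 5, "LOC")]  # spans over the German tokens

tags = project_entities(entities, alignment, len(korean))
for token, tag in zip(korean, tags):
    print(token, tag)
```

The quality of the projected labels depends on both the translation and the alignment, so projected data of this kind is usually treated as noisy training material.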
Curation Workbench
Summary and
Future Work
Summary
• Dashboard for the curation of cultural heritage data.
• Goal: intuitive analysis, exploration, visualisation of data.
• Add value by combining existing semantic components
• To what extent can off-the-shelf tools be used for this?
• Current machine learning techniques rely on very large
amounts of representative, high-quality training data.
• Use of tools trained on general domain data is ambitious.
• But: generic processing results are good enough to
develop a useful curation dashboard that helps users
interested in further exploring a large data set.
Future Work
• Pipeline and dashboard are
work in progress
• User study to verify usefulness of tool
• Show metadata created by pipeline
• Link GUI with the original PDF files/pages
• Half of the 51 editions have been translated into Korean
– we’ll explore if we can create a parallel corpus for MT
• Furthermore, related: development of additional
services, workflow manager, containerisation of services
(especially in QURATOR and European Language Grid)
Thank you very much!
Questions?
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 

Session5 03.george rehm

  • 1. Curation Technologies for Cultural Heritage Archives: Analysing and transforming a heterogeneous data set into an interactive curation workbench. Georg Rehm (DFKI GmbH, Berlin, Germany), Martin Lee (Freie Universität Berlin, Berlin, Germany), Julian Moreno Schneider (DFKI GmbH, Berlin, Germany), Peter Bourgonje (DFKI GmbH, Berlin, Germany). Corresponding author: georg.rehm@dfki.de. DATeCH 2019 – Brussels, Belgium – 09/10 May 2019
  • 2. Outline • Curation Technologies: Background – Domains – Examples • Original Project and Data Set (= the Cultural Heritage Archive) • Processing Pipeline and Components: OCR – NER and Linking – Clustering • Interactive Curation Workbench: Graphical User Interface
  • 3. Sector: Public Archives • Georg Rehm, Julian Moreno Schneider, Peter Bourgonje, Ankit Srivastava, Jan Nehring, Armin Berger, Luca König, Sören Räuchle, and Jens Gerth. “Event Detection and Semantic Storytelling: Generating a Travelogue from a large Collection of Personal Letters”. In Tommaso Caselli, Ben Miller, Marieke van Erp, Piek Vossen, Martha Palmer, Eduard Hovy, and Teruko Mitamura, editors, Proceedings of the Events and Stories in the News Workshop, Vancouver, Canada, August 2017. Association for Computational Linguistics. Co-located with ACL 2017. • Georg Rehm, Julián Moreno Schneider, Peter Bourgonje, Ankit Srivastava, Rolf Fricke, Jan Thomsen, Jing He, Joachim Quantz, Armin Berger, Luca König, Sören Räuchle, Jens Gerth, and David Wabnitz. “Different Types of Automated and Semi-Automated Semantic Storytelling: Curation Technologies for Different Sectors”. In Georg Rehm and Thierry Declerck, editors, Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, Berlin, Germany, September 13-14, 2017, Proceedings, number 10713 in Lecture Notes in Artificial Intelligence (LNAI), pages 232-247, Cham, Switzerland, January 2018. Gesellschaft für Sprachtechnologie und Computerlinguistik e.V., Springer. 13/14 September 2017.
  • 4. Sector: Museums, exhibitions • Georg Rehm, Jing He, Julian Moreno Schneider, Jan Nehring, and Joachim Quantz. “Designing User Interfaces for Curation Technologies”. In Sakae Yamamoto, editor, Human Interface and the Management of Information: Information, Knowledge and Interaction Design, 19th International Conference, HCI International 2017, number 10273 in Lecture Notes in Computer Science (LNCS), pages 388-406, Vancouver, Canada, July 2017. Springer.
  • 5. QURATOR: Curation Technologies, 1 Nov. 2018 – 31 Oct. 2021, http://qurator.ai Objectives • Creation of an ecosystem for Curation Technologies; we coined this term in 2014 and have been using and promoting it since 2015. • Make the Berlin-Brandenburg region a hub of excellence for the development and industrial use of curation technologies. Sector-specific solutions • Medical Content Curator • Smart Exhibits • Media Curator • Intelligent Navigation • Risk Monitoring • Next Reality Storytelling QURATOR at a glance (from: WK-Initiative QURATOR – Assessment Center – 11 December 2017) • 12 sub-projects, one joint project • Six thematic focus areas on three levels; these correspond to the phases of curation • The technology platform follows a modular building-block principle • Methods developed in the sub-projects can be used individually; they form the flexibly combinable components and services of the QURATOR platform Technology showcases • Content Curation Engine • Corporate Smart Insights • Speech-to-Text API Video Action API • Semantic Enrichment QURATOR – Scientific Coordinator: Georg Rehm (DFKI, georg.rehm@dfki.de)
  • 6. Curation Tech – Domains – Projects • Media (television, radio, web tv) • Journalism • Public archives • Museums and exhibitions • Digital libraries • Health and life science • Legal and compliance • Cinema and film • CRM • Digital Humanities • Cultural Heritage Projects: http://lynx-project.eu, http://qurator.ai, http://digitale-kuratierung.de, and a few additional smaller projects.
  • 7. Original Project and Data Set
  • 8. Original Research Project • “Sharing German government’s documents on unification and Integration, and Building a data-base on German unification” – Institute of Korean Studies, FU Berlin, 2010–2016 – Funded by the Ministry of Unification, Republic of Korea • The project collected official political documents on the German reunification process to make them available for research and planning in the context of a potential future reunification of Korea. • Documents were collected, intellectually curated, analysed, interpreted, partially translated and published. • https://www.geschkult.fu-berlin.de/e/tongilbu/index.html
  • 9. Data Set • 51 volumes, mostly PDF files. Largest, most comprehensive manually curated data set on the German unification. • Collection: transcripts of debates in the German Parliament, minutes of committee meetings, reports, proceedings etc. • All primary documents are written in German. The researchers added summaries and analyses in both German and Korean. • >138k pages in total across German and Korean – DE: 96k pages, 38m words – KO: 41k pages, 15m words • File types: – Vast majority in PDF – Auxiliary documents as tables, Excel sheets, Word documents etc. • 10 gigabytes, approx. 200 files
  • 10. Our own Objective • State of play: options for navigating the data set (PDF, some HTML) and accessing specific information are very limited. • Origin: Can we do better? Can we offer, with limited effort, a richer, interactive, more accessible interface to scholars who want to work on and with this specific data set? • Objective: develop a prototype of an interactive platform for the semantic analysis, enrichment and visualisation of the data set that enables human experts to interact with the collection in ways that go beyond PDF documents. • Focus: temporal, geographical, entity and thematic contexts. • Challenge: make use of an existing portfolio of services and tools with minimal additional development.
  • 11. Processing Pipeline and Components
  • 12. Processing Pipeline • Flow: Dataset (PDF) → OCR (DE, KO) → NER (DE; KO planned) → Linking (DE) → Clustering (DE) → Enriched dataset → Interactive dashboard • External ontologies: contextualise the contents of the dataset using various Linked Data sources • Enriched dataset: plain text contents and NIF-based JSON annotations • Link back to original PDF files (in progress)
  • 13. OCR: Dataset • Input PDFs are (mostly) high quality • They include typewriter fonts, • the occasional table, • typically a multi-column layout, • and mixed German/Korean text
  • 14. OCR: Evaluation • Tesseract (currently without any additional training) to convert PDF to plain text – 4.0 LSTM (“best” model) for DE – 3.5 for KO (version 4.0 had segmentation issues) • Evaluation of OCR quality using the four most frequent, representative content types. • Ground truth transcribed using Transkribus: – DE: 2,870 words – KO: 2,483 words • Evaluated using ocrevalUAtion
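A minimal sketch of how the OCR step on this slide might be invoked from Python. It only assembles the Tesseract command line; the helper name and file names are hypothetical, and the slides do not say how the PDFs were rasterised (Tesseract reads page images rather than PDF files, so a PDF-to-image step is assumed to have happened beforehand).

```python
from typing import List


def tesseract_command(image_path: str, output_base: str,
                      lang: str = "deu", hocr: bool = False) -> List[str]:
    """Build a Tesseract CLI call for one page image (hypothetical helper).

    lang is a Tesseract language code, e.g. "deu" for German or "kor" for Korean.
    With hocr=False Tesseract writes <output_base>.txt; with hocr=True it
    writes <output_base>.hocr, which keeps layout information.
    """
    command = ["tesseract", image_path, output_base, "-l", lang]
    if hocr:
        command.append("hocr")  # name of Tesseract's hOCR output config
    return command


# Illustrative file names only; pass the list to subprocess.run(..., check=True)
# on a machine where Tesseract and the language data are installed.
german_page = tesseract_command("page_0001.png", "page_0001", lang="deu")
korean_page = tesseract_command("page_0002.png", "page_0002", lang="kor", hocr=True)
```

Building the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.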
  • 15. OCR: Evaluation • Results vary greatly depending on content type. • For one-column texts and scans, the character error rate (CER) is comparatively low (DE only). • Main problem: formatting (see the two-column and table formats). • Improving OCR results for these formats would be beneficial, but for our downstream applications the impact remains relatively limited: – The majority of our data set’s content is single column. – The pipeline relies on entities, so it is not much affected by the incorrect interpretation of individual rows in a table. • One possible solution: instead of using plain text as the output format of Tesseract, use XML or hOCR to preserve structure information.
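For readers unfamiliar with the metric, the character error rate is commonly computed as the character-level edit distance between OCR output and ground truth, divided by the length of the ground truth. The sketch below assumes that standard definition; the exact normalisation used by ocrevalUAtion may differ, and the function names are illustrative.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # delete ref_char
                               current[j - 1] + 1,       # insert hyp_char
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def cer(ground_truth: str, ocr_output: str) -> float:
    """Character error rate relative to the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


# One substituted character in a seven-character word.
example = cer("Einheit", "Einhelt")
```

Because the distance is normalised by the ground-truth length, a CER above 1.0 is possible when the OCR output contains many spurious characters, which is one way column and table mix-ups show up in the numbers.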
  • 16. NER and Entity Linking • Recognising persons, locations and organisations in the plain text output generated by the OCR step • OpenNLP NER engine for entity spotting – Trained with WikiNER (1.1m articles, 400m tokens) – No training data available for our specific domain • DBpedia SPARQL endpoint for entity linking – Easy access to additional information on entities, provided by DBpedia or other LOD knowledge bases – To be replaced with GND (Gemeinsame Normdatei), which is designed for German; much higher coverage expected.
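The linking step can be pictured as a label lookup against the DBpedia SPARQL endpoint. The following is a sketch under assumptions, not the project's actual query: the exact-label match, the function names, the result limit and the use of the public endpoint at https://dbpedia.org/sparql with a JSON `format` parameter are all illustrative choices.

```python
import json
import urllib.parse
import urllib.request
from typing import List

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"  # public endpoint, assumed here


def build_label_query(surface_form: str, lang: str = "de", limit: int = 5) -> str:
    """SPARQL query for resources whose rdfs:label equals the spotted name."""
    escaped = surface_form.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?entity WHERE {\n"
        f'  ?entity rdfs:label "{escaped}"@{lang} .\n'
        f"}} LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as an HTTP GET request asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urllib.parse.urlencode(params)


def lookup(surface_form: str) -> List[str]:
    """Return candidate URIs for a spotted entity (performs a network call)."""
    url = build_request_url(build_label_query(surface_form))
    with urllib.request.urlopen(url, timeout=30) as response:
        data = json.load(response)
    return [binding["entity"]["value"] for binding in data["results"]["bindings"]]


query = build_label_query("Helmut Kohl")
```

An exact label match is the simplest possible linking strategy; OCR errors in a name will make it miss, which is one reason the entity spotting quality discussed above matters for the downstream steps.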
  • 17. Document Clustering • We cluster documents based on the URIs of the spotted and linked entities. • Clustering is done with WEKA (Expectation Maximization algorithm). • This allows theme-based exploration of the data set (starting off with a certain region, cluster of people, etc.). • Systematic evaluation is pending (as part of a user study). • Document clustering is to be complemented with a new component for topic detection.
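Clustering on entity URIs presupposes that each document is turned into a feature vector over the linked entities. The slide does not state how the documents were encoded for WEKA, so the binary document-by-URI matrix below is an assumption; the function name and the example documents are hypothetical. The resulting rows could then be handed to an EM clusterer.

```python
from typing import Dict, List, Set, Tuple


def build_entity_matrix(
    doc_entities: Dict[str, Set[str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Encode each document as a 0/1 vector over all linked entity URIs."""
    doc_ids = sorted(doc_entities)
    vocabulary = sorted(set().union(*doc_entities.values())) if doc_entities else []
    column = {uri: index for index, uri in enumerate(vocabulary)}
    matrix = []
    for doc_id in doc_ids:
        row = [0] * len(vocabulary)
        for uri in doc_entities[doc_id]:
            row[column[uri]] = 1  # entity occurs at least once in the document
        matrix.append(row)
    return doc_ids, vocabulary, matrix


# Hypothetical documents with their linked DBpedia URIs.
docs = {
    "doc_a": {"http://dbpedia.org/resource/Berlin",
              "http://dbpedia.org/resource/Helmut_Kohl"},
    "doc_b": {"http://dbpedia.org/resource/Berlin"},
}
doc_ids, vocabulary, matrix = build_entity_matrix(docs)
```

Sorting document ids and URIs keeps the matrix layout reproducible between runs, which makes cluster assignments easier to compare once the evaluation is carried out.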
  • 18. Korean Processing • Language technology support for Korean is limited. • We have been unable to find a readily available NER package or data for Korean. • Idea: use annotation projection as a means of enriching Korean text with entities. • We use machine translation (OpenNMT) to translate the Korean texts into German, run NER on the translation, and project the entities found in the translated text onto the Korean sentences; the projected annotations are intended as training data for a Korean NER system (work in progress).
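Annotation projection can be sketched as copying entity labels across a word alignment. The example below is illustrative only: it assumes a token-level alignment between each Korean sentence and its translation is available (the slides do not say how alignments are obtained), and the function name, label set and example sentence are hypothetical.

```python
from typing import List, Tuple


def project_labels(translated_labels: List[str],
                   alignment: List[Tuple[int, int]],
                   n_korean_tokens: int) -> List[str]:
    """Copy entity labels from translated tokens onto aligned Korean tokens.

    alignment holds (korean_index, translated_index) pairs; Korean tokens
    without an aligned entity token keep the outside label "O".
    """
    projected = ["O"] * n_korean_tokens
    for korean_index, translated_index in alignment:
        label = translated_labels[translated_index]
        if label != "O":
            projected[korean_index] = label
    return projected


# Hypothetical sentence pair: "Helmut Kohl went to Berlin."
korean_tokens = ["헬무트", "콜은", "베를린에", "갔다"]
german_tokens = ["Helmut", "Kohl", "ging", "nach", "Berlin"]
german_labels = ["PER", "PER", "O", "O", "LOC"]      # output of NER on the translation
alignment = [(0, 0), (1, 1), (2, 4), (3, 2)]          # assumed word alignment

korean_labels = project_labels(german_labels, alignment, len(korean_tokens))
```

The quality of the projected labels depends on both the translation and the alignment, so noisy projections would typically need filtering before being used as training data.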
  • 19. Curation Workbench
  • 20. Curation Workbench
  • 21. Summary and Future Work
  • 22. Summary • Dashboard for the curation of cultural heritage data. • Goal: intuitive analysis, exploration and visualisation of data. • Add value by combining existing semantic components. • To what extent can off-the-shelf tools be used for this? • Current machine learning techniques rely on very large amounts of representative, high-quality training data. • Using tools trained on general-domain data is ambitious. • But: generic processing results are good enough to develop a useful curation dashboard that helps users interested in further exploring a large data set.
  • 23. Future Work • Pipeline and dashboard are work in progress. • User study to verify the usefulness of the tool. • Show metadata created by the pipeline. • Link the GUI with the original PDF files/pages. • Half of the 51 volumes have been translated into Korean; we will explore whether we can create a parallel corpus for MT. • Related work: development of additional services, a workflow manager, and containerisation of services (especially in QURATOR and European Language Grid).
  • 24. Thank you very much! Questions?