Climbing the Tower of Babel
Challenges and Opportunities in Multilingual
Data for the Digital Humanities
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker
7th LIDER Roadmapping Workshop
Linked Data for Digital Humanities and Linguistics
20 October 2015, Madrid
La búsqueda de la lengua perfecta
• Umberto Eco, 1994
• „I certainly will never
advise to follow the
bizarre thought
presented here and
dream of a universal
language“
How many languages are there?
• The Holy Bible, Genesis 10: 72 (70)
• Max Planck Institute for Evolutionary
Anthropology: 6500 – 7000
• ISO 639-3: 7704 (ISO 639-2: 450)
• Google Translate supported: 90
• Europeana content: currently 50
Metadata
• To enjoy a painting or music on Europeana,
no special language skills are required?
• Wrong!
– Cultural objects are described using metadata
– Metadata comes in different languages
(country of origin of the data provider)
– Most often metadata does not have language
information
– How to still find what you are looking for?
Problem: Metadata
• Example: Subject „Philosophy“
– Philosophie
– Filosofía
– Filosofie
– Filosofija
– Heimspeki
– Филозофија
– Etc.
Metadata: Option 1
• Indicate the language of the metadata
• This supports the use of translation or
mapping tools to find the correct term
in other languages/controlled vocabularies
• Example:
<subject language="English">Philosophy</subject>
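A minimal sketch of how a consumer could read such a language tag, using Python's standard `xml.etree.ElementTree`. Only the `<subject language="...">` shape comes from the slide; the wrapping `<record>` element, the sample values and the helper name `subjects_by_language` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: only the <subject language="..."> shape is from the slide.
RECORD = """
<record>
  <subject language="English">Philosophy</subject>
  <subject language="German">Philosophie</subject>
  <subject>Filosofía</subject>
</record>
"""

def subjects_by_language(xml_text):
    """Group subject terms by declared language ('unknown' if untagged)."""
    grouped = {}
    for subject in ET.fromstring(xml_text).iter("subject"):
        language = subject.get("language", "unknown")
        grouped.setdefault(language, []).append(subject.text)
    return grouped

print(subjects_by_language(RECORD))
```

Terms that end up under `unknown` are the ones a translation or mapping tool cannot handle reliably, which is the case the slide warns about.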
Europeana Query Translation
Europeana Query Translation
• How it works:
http://www.europeana.eu/portal/search.html?query=Philosophy
Europeana Query Translation
http://[language].wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&titles=[query term]
{"lang":"de", "*":"Philosophie"},
{"lang":"es", "*":"Filosofía"}
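The lookup on this slide could be sketched in Python roughly as follows. The URL is built from the slide's template; to stay runnable offline, the response is a hard-coded sample rather than a live call. The sample assumes the MediaWiki legacy JSON layout, in which each language link carries the translated title under the `"*"` key; the page id `"1"` and both function names are hypothetical.

```python
import json
from urllib.parse import urlencode

def langlinks_url(language, term):
    """Build the Wikipedia API URL from the slide's template."""
    params = {"action": "query", "prop": "langlinks", "format": "json", "titles": term}
    return f"http://{language}.wikipedia.org/w/api.php?{urlencode(params)}"

def extract_translations(response_text):
    """Return {language code: translated title} from a langlinks response."""
    pages = json.loads(response_text)["query"]["pages"]
    return {
        link["lang"]: link["*"]
        for page in pages.values()
        for link in page.get("langlinks", [])
    }

# Illustrative sample response (assumed shape, hypothetical page id).
SAMPLE = '''{"query": {"pages": {"1": {"title": "Philosophy", "langlinks": [
  {"lang": "de", "*": "Philosophie"}, {"lang": "es", "*": "Filosofía"}]}}}}'''

print(langlinks_url("en", "Philosophy"))
print(extract_translations(SAMPLE))
```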
Europeana Query Translation
http://www.europeana.eu/portal/search.html?query=Philosophy&Philosophie&Filosofía
(simplified for illustration purposes –
above query does not really work,
as the query expansion is done internally)
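Since the real expansion happens inside Europeana and its syntax is not shown here, the following is only a sketch of the idea: combine the original term with its translations into one query string. Joining quoted variants with `OR` is an assumption, as is the function name `expand_query`.

```python
def expand_query(term, translations):
    """OR together the original term and its translations, skipping duplicates."""
    variants = [term]
    for translated in translations.values():
        if translated not in variants:
            variants.append(translated)
    return " OR ".join(f'"{variant}"' for variant in variants)

print(expand_query("Philosophy", {"de": "Philosophie", "es": "Filosofía"}))
```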
Europeana Query Translation
• Read more:
– Query Translation in Europeana:
http://journal.code4lib.org/articles/10285
– Improving Europeana Multilingual Search:
http://blog.europeana.eu/2014/08/improving-search-across-languages/
El idioma analítico de John Wilkins
• Jorge Luis Borges,
Otras Inquisiciones
• „Theoretically, it is
not impossible to
think of a language
where the name of
each thing says all
the details of its destiny, past and future“
Metadata: Option 2
• Even better: Use a language-independent
identifier for subject classification
(e.g. Library of Congress, WikiData, DDC)
• Example:
<subject id="loc">sh85100849</subject>
<subject id="wikidata">Q5891</subject>
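To illustrate why an identifier helps, here is a small sketch in which the labels from the "Problem: Metadata" slide all resolve to the Wikidata identifier given on this slide. The lookup table and the helper `same_subject` are hypothetical; a real system would obtain the mapping from the vocabulary itself.

```python
# Labels from the "Problem: Metadata" slide, all mapped to the identifier
# shown on this slide. The table itself is illustrative.
LABEL_TO_ID = {
    label.casefold(): "Q5891"
    for label in [
        "Philosophy", "Philosophie", "Filosofía", "Filosofie",
        "Filosofija", "Heimspeki", "Филозофија",
    ]
}

def same_subject(label_a, label_b):
    """True if both labels are known and resolve to the same identifier."""
    id_a = LABEL_TO_ID.get(label_a.casefold())
    id_b = LABEL_TO_ID.get(label_b.casefold())
    return id_a is not None and id_a == id_b

print(same_subject("Philosophie", "Heimspeki"))
print(same_subject("Philosophy", "History"))
```

Once records carry the identifier instead of a label, no per-language comparison is needed at all.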
Two examples
• Europeana 1914 – 1918
http://www.europeana1914-1918.eu/
• Europeana Newspapers
http://www.europeana-newspapers.eu/
Europeana 1914 - 1918
• In fact, three projects:
– Europeana Collections 1914-1918
400,000 digitised items from World War I
– Europeana 1914-1918
User generated content from World War I
– European Film Gateway 1914
740 hours of film related to World War I
• How to present these as a uniform collection?
Europeana 1914 - 1918
• Analysis of subject classifications available at
content holding institutions, e.g. catalogues
Europeana 1914 - 1918
• Ranking of most frequent subjects
Subject Heading                                   Count
World War, 1914-1918--Campaigns                   4307
World War, 1914-1918--Trench warfare              2990
World War, 1914-1918--Transportation              2171
World War, 1914-1918--Caricatures and cartoons    2013
World War, 1914-1918--Serbia                      1755
…                                                 …
Europeana 1914 - 1918
• Mapping subjects to LoC identifiers
Subject Heading                                   LoC identifier
World War, 1914-1918--Campaigns                   sh85148240
World War, 1914-1918--Trench warfare              sh2008113804
World War, 1914-1918--Transportation              sh2008113817
World War, 1914-1918--Caricatures and cartoons    sh2010119466
World War, 1914-1918--Serbia                      sh2008113856
…                                                 …
Europeana 1914 - 1918
• Enrichment of metadata with LCSH identifiers
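The enrichment step might look roughly like this. The heading-to-identifier pairs are the ones listed on the mapping slide (identifiers written in lowercase); the record layout with `subjects` and `subject_ids` fields and the function name `enrich` are assumptions, not the project's actual data model.

```python
# Heading -> LoC identifier, as listed on the mapping slide.
LCSH_IDS = {
    "World War, 1914-1918--Campaigns": "sh85148240",
    "World War, 1914-1918--Trench warfare": "sh2008113804",
    "World War, 1914-1918--Transportation": "sh2008113817",
    "World War, 1914-1918--Caricatures and cartoons": "sh2010119466",
    "World War, 1914-1918--Serbia": "sh2008113856",
}

def enrich(record):
    """Return a copy of the record with identifiers for all known subjects."""
    enriched = dict(record)
    enriched["subject_ids"] = [
        LCSH_IDS[subject]
        for subject in record.get("subjects", [])
        if subject in LCSH_IDS
    ]
    return enriched

# Hypothetical record for illustration.
record = {"title": "Postcard", "subjects": ["World War, 1914-1918--Serbia", "Unmapped heading"]}
print(enrich(record))
```

Headings without a known identifier are left out of `subject_ids` rather than guessed, so they can be reviewed separately.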
Europeana 1914 - 1918
• Translation of all subjects
Europeana Newspapers
• Full text collection of 12 million digitised
newspaper pages from 23 European libraries
• Around 40 different languages overall
• Newspapers from 1618 - 1990 → historical
spelling variants!
• www.theeuropeanlibrary.org/tel4/newspapers
Europeana Newspapers
• Content in Europeana Newspapers
Europeana Newspapers
• 12 million newspaper pages =
approximately 102,000,000,000 words!
• Impossible to translate everything to
multiple languages
• But there are alternatives…
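As a quick check of the scale, the two figures on this slide imply an average page length. The calculation below uses only those two numbers; the variable names are illustrative.

```python
pages = 12_000_000          # newspaper pages, from the slide
words = 102_000_000_000     # approximate word count, from the slide

words_per_page = words / pages
print(f"{words_per_page:,.0f} words per page on average")
```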
Europeana Newspapers
• What if it were possible to search for persons,
locations, events, across languages?
– e.g. Siege of Przemyśl
Europeana Newspapers
• Named Entity Recognition
• Stanford NER toolkit (Stanford University)
Europeana Newspapers
• Named Entity Disambiguation
„Jordan“
• Comparison of context
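Comparison of context can be sketched as a simple word-overlap score between the text around a mention and a profile for each candidate. This is a toy illustration, not the method used in the project: the two candidates, their keyword profiles and the function name `disambiguate` are all hypothetical.

```python
import re

# Hypothetical candidate profiles; a real system would derive them
# from a knowledge base rather than hard-code them.
CANDIDATES = {
    "Jordan (country)": {"amman", "kingdom", "river", "border", "king"},
    "Michael Jordan (basketball player)": {"basketball", "bulls", "nba", "chicago", "player"},
}

def disambiguate(mention_context):
    """Pick the candidate whose profile shares the most words with the context."""
    words = set(re.findall(r"\w+", mention_context.lower()))
    scores = {name: len(words & profile) for name, profile in CANDIDATES.items()}
    return max(scores, key=scores.get)

print(disambiguate("The king of Jordan returned to Amman"))
print(disambiguate("Jordan scored 40 points for the Chicago Bulls"))
```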
Europeana Newspapers
• Named Entity Linking
wikidata.org/wiki/Q41421
freebase.com/m/054c1
lccn.loc.gov/n92121379
What if…
• All metadata in Europeana 1914-1918
had language-independent identifiers
• All entities in Europeana Newspapers
had language-independent identifiers
• It should be possible to link the two
distinct collections!
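If both collections carried such identifiers, linking them reduces to a join on shared identifiers. The sketch below uses invented records; `ID-1` and the other values are placeholders for language-independent identifiers, and the field names are assumptions.

```python
from collections import defaultdict

# Hypothetical records; "ID-1" etc. stand in for language-independent identifiers.
COLLECTION_ITEMS = [
    {"item": "photo-17", "ids": {"ID-1", "ID-2"}},
    {"item": "diary-03", "ids": {"ID-3"}},
]
NEWSPAPER_ARTICLES = [
    {"article": "page-1915-03-23", "ids": {"ID-1"}},
    {"article": "page-1916-07-01", "ids": {"ID-4"}},
]

def link_collections(items, articles):
    """Return (item, article, identifier) for every identifier both sides share."""
    articles_by_id = defaultdict(list)
    for article in articles:
        for identifier in article["ids"]:
            articles_by_id[identifier].append(article["article"])
    links = []
    for item in items:
        for identifier in sorted(item["ids"]):
            for article in articles_by_id.get(identifier, []):
                links.append((item["item"], article, identifier))
    return links

print(link_collections(COLLECTION_ITEMS, NEWSPAPER_ARTICLES))
```

Because the join key is an identifier rather than a label, it works regardless of the language of either the metadata or the newspaper text.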
Research Questions
• This would allow for some very interesting
digital humanities research questions, e.g.
– How were World War I events covered in
newspapers of different nations across Europe?
– What were the relations between persons,
places and events during World War I?
The Republic of Letters
• http://stanford.edu/group/toolingup/rplviz/rplviz.swf
Global Database of Events,
Language and Tone
• http://www.gdeltproject.org/
Conclusion
• We need know-how and technologies for
multilingual linking of objects across cultural
heritage organisations and digital collections
• We need guidelines and standards that
support the creation and provision of
metadata in cultural heritage objects
as multilingual linked data
To follow up
• Europeana White Paper on Best Practices for
Multilingual Access to Digital Libraries
• W3C Community Group Best Practices for
Multilingual Linked Open Data
• Europeana Connect - Multilinguality
Tractatus Logico-Philosophicus
• Ludwig Wittgenstein,
1922
• Proposition 7:
„Whereof one cannot
speak, thereof one
must be silent“
Thank you for your attention!
Clemens Neudecker
Staatsbibliothek zu Berlin
@cneudecker
7th LIDER Roadmapping Workshop
Linked Data for Digital Humanities and Linguistics
20 October 2015, Madrid

More Related Content

What's hot

Judaica Europeana Dov Winer
Judaica Europeana Dov WinerJudaica Europeana Dov Winer
Judaica Europeana Dov WinerDov Winer
 
TheEuropeanLibrary.org - a (non technical) case study. Olaf Janssen lecturing...
TheEuropeanLibrary.org - a (non technical) case study. Olaf Janssen lecturing...TheEuropeanLibrary.org - a (non technical) case study. Olaf Janssen lecturing...
TheEuropeanLibrary.org - a (non technical) case study. Olaf Janssen lecturing...Olaf Janssen
 
Hopkin digitising the first world war (dh seminar, june 2014)
Hopkin   digitising the first world war (dh seminar, june 2014)Hopkin   digitising the first world war (dh seminar, june 2014)
Hopkin digitising the first world war (dh seminar, june 2014)Digital History
 
Exploiting the version history of SKOS files: skos-history (SWIB13 Lightning ...
Exploiting the version history of SKOS files: skos-history (SWIB13 Lightning ...Exploiting the version history of SKOS files: skos-history (SWIB13 Lightning ...
Exploiting the version history of SKOS files: skos-history (SWIB13 Lightning ...Joachim Neubert
 
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...Arts and Humanities Research Council (AHRC)
 
Multilingual challenges in Europeana
Multilingual challenges in EuropeanaMultilingual challenges in Europeana
Multilingual challenges in EuropeanaAntoine Isaac
 
SEA CHANGE @ DM2Efinal conference, Pisa, Dec 11
SEA CHANGE @ DM2Efinal conference, Pisa, Dec 11SEA CHANGE @ DM2Efinal conference, Pisa, Dec 11
SEA CHANGE @ DM2Efinal conference, Pisa, Dec 11aboutgeo
 
Tales of Things and Electronic Memory and other stories
Tales of Things and Electronic Memory  and other storiesTales of Things and Electronic Memory  and other stories
Tales of Things and Electronic Memory and other storiesREKasbohm
 

What's hot (11)

Judaica Europeana Dov Winer
Judaica Europeana Dov WinerJudaica Europeana Dov Winer
Judaica Europeana Dov Winer
 
08b final event_experimente
08b final event_experimente08b final event_experimente
08b final event_experimente
 
Keynote csws2013
Keynote csws2013Keynote csws2013
Keynote csws2013
 
Library of Congress Gemma Mitchell
Library of Congress Gemma MitchellLibrary of Congress Gemma Mitchell
Library of Congress Gemma Mitchell
 
TheEuropeanLibrary.org - a (non technical) case study. Olaf Janssen lecturing...
TheEuropeanLibrary.org - a (non technical) case study. Olaf Janssen lecturing...TheEuropeanLibrary.org - a (non technical) case study. Olaf Janssen lecturing...
TheEuropeanLibrary.org - a (non technical) case study. Olaf Janssen lecturing...
 
Hopkin digitising the first world war (dh seminar, june 2014)
Hopkin   digitising the first world war (dh seminar, june 2014)Hopkin   digitising the first world war (dh seminar, june 2014)
Hopkin digitising the first world war (dh seminar, june 2014)
 
Exploiting the version history of SKOS files: skos-history (SWIB13 Lightning ...
Exploiting the version history of SKOS files: skos-history (SWIB13 Lightning ...Exploiting the version history of SKOS files: skos-history (SWIB13 Lightning ...
Exploiting the version history of SKOS files: skos-history (SWIB13 Lightning ...
 
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
 
Multilingual challenges in Europeana
Multilingual challenges in EuropeanaMultilingual challenges in Europeana
Multilingual challenges in Europeana
 
SEA CHANGE @ DM2Efinal conference, Pisa, Dec 11
SEA CHANGE @ DM2Efinal conference, Pisa, Dec 11SEA CHANGE @ DM2Efinal conference, Pisa, Dec 11
SEA CHANGE @ DM2Efinal conference, Pisa, Dec 11
 
Tales of Things and Electronic Memory and other stories
Tales of Things and Electronic Memory  and other storiesTales of Things and Electronic Memory  and other stories
Tales of Things and Electronic Memory and other stories
 

Similar to Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Data for the Digital Humanities

Europeana 1914-1918, User-Generated Content and Linked Open Data
Europeana 1914-1918, User-Generated Content and Linked Open DataEuropeana 1914-1918, User-Generated Content and Linked Open Data
Europeana 1914-1918, User-Generated Content and Linked Open DataValentine Charles
 
Judaica europeana dovwinerjudaicalibrarians
Judaica europeana dovwinerjudaicalibrariansJudaica europeana dovwinerjudaicalibrarians
Judaica europeana dovwinerjudaicalibrariansDov Winer
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...Europeana
 
What's the (European) story - Alexander Badenoch
What's the (European) story - Alexander BadenochWhat's the (European) story - Alexander Badenoch
What's the (European) story - Alexander BadenochEUscreen
 
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...wnradmin
 
02 europeana collections 1914 1918
02 europeana collections 1914 191802 europeana collections 1914 1918
02 europeana collections 1914 1918Europeana
 
53 million objects! Now what?
53 million objects! Now what?53 million objects! Now what?
53 million objects! Now what?David Haskiya
 
Moving from Niche to Mainstream: the Evolution of the UCD Digital Library
Moving from Niche to Mainstream: the Evolution of the UCD Digital LibraryMoving from Niche to Mainstream: the Evolution of the UCD Digital Library
Moving from Niche to Mainstream: the Evolution of the UCD Digital LibraryUCD Library
 
Jcdl2016_keynote-zemankova
Jcdl2016_keynote-zemankovaJcdl2016_keynote-zemankova
Jcdl2016_keynote-zemankovaAlexander Nwala
 
VIII Encuentros de Centros de Documentación de Arte Contemporáneo en Artium -...
VIII Encuentros de Centros de Documentación de Arte Contemporáneo en Artium -...VIII Encuentros de Centros de Documentación de Arte Contemporáneo en Artium -...
VIII Encuentros de Centros de Documentación de Arte Contemporáneo en Artium -...Artium Vitoria
 
Oxford’s Digital Projects: Rethinking the First World War (or 'can technolog...
Oxford’s Digital Projects: Rethinking the First World War (or 'can technolog...Oxford’s Digital Projects: Rethinking the First World War (or 'can technolog...
Oxford’s Digital Projects: Rethinking the First World War (or 'can technolog...Kate Lindsay
 
Converging on the Universal Library: From Memex to Googolplex
Converging on the Universal Library: From Memex to GoogolplexConverging on the Universal Library: From Memex to Googolplex
Converging on the Universal Library: From Memex to GoogolplexMartin Kalfatovic
 
Digital contemporary history: sources, tools, methods, issues
Digital contemporary history: sources, tools, methods, issuesDigital contemporary history: sources, tools, methods, issues
Digital contemporary history: sources, tools, methods, issuesPeter Webster
 
Digital contemporary history: sources, tools, methods, issues
Digital contemporary history: sources, tools, methods, issuesDigital contemporary history: sources, tools, methods, issues
Digital contemporary history: sources, tools, methods, issuesPeter Webster
 
DHI2018 - a comparative study of Chinese and English publications
DHI2018 - a comparative study of Chinese and English publicationsDHI2018 - a comparative study of Chinese and English publications
DHI2018 - a comparative study of Chinese and English publicationsJin Gao
 
Cultural heritage: Tradition, Museums and Wikis
Cultural heritage: Tradition, Museums and WikisCultural heritage: Tradition, Museums and Wikis
Cultural heritage: Tradition, Museums and WikisThomas Tunsch
 
Future Library Unconference 2013 - Ad polle
Future Library Unconference 2013 - Ad polleFuture Library Unconference 2013 - Ad polle
Future Library Unconference 2013 - Ad polleDimitris Protopsaltou
 
Building The European Digital Library - An Insider’s Point of View
Building The European Digital Library - An Insider’s Point of View Building The European Digital Library - An Insider’s Point of View
Building The European Digital Library - An Insider’s Point of View Olaf Janssen
 
CARARE workshop: Europeana research
CARARE workshop: Europeana researchCARARE workshop: Europeana research
CARARE workshop: Europeana researchEuropeana
 

Similar to Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Data for the Digital Humanities (20)

Europeana 1914-1918, User-Generated Content and Linked Open Data
Europeana 1914-1918, User-Generated Content and Linked Open DataEuropeana 1914-1918, User-Generated Content and Linked Open Data
Europeana 1914-1918, User-Generated Content and Linked Open Data
 
Judaica europeana dovwinerjudaicalibrarians
Judaica europeana dovwinerjudaicalibrariansJudaica europeana dovwinerjudaicalibrarians
Judaica europeana dovwinerjudaicalibrarians
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...
 
What's the (European) story - Alexander Badenoch
What's the (European) story - Alexander BadenochWhat's the (European) story - Alexander Badenoch
What's the (European) story - Alexander Badenoch
 
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
WNR.sg - Keynote Address by Mr John van Oudenaren, Director, World Digital Li...
 
02 europeana collections 1914 1918
02 europeana collections 1914 191802 europeana collections 1914 1918
02 europeana collections 1914 1918
 
53 million objects! Now what?
53 million objects! Now what?53 million objects! Now what?
53 million objects! Now what?
 
Moving from Niche to Mainstream: the Evolution of the UCD Digital Library
Moving from Niche to Mainstream: the Evolution of the UCD Digital LibraryMoving from Niche to Mainstream: the Evolution of the UCD Digital Library
Moving from Niche to Mainstream: the Evolution of the UCD Digital Library
 
Jcdl2016_keynote-zemankova
Jcdl2016_keynote-zemankovaJcdl2016_keynote-zemankova
Jcdl2016_keynote-zemankova
 
VIII Encuentros de Centros de Documentación de Arte Contemporáneo en Artium -...
VIII Encuentros de Centros de Documentación de Arte Contemporáneo en Artium -...VIII Encuentros de Centros de Documentación de Arte Contemporáneo en Artium -...
VIII Encuentros de Centros de Documentación de Arte Contemporáneo en Artium -...
 
Oxford’s Digital Projects: Rethinking the First World War (or 'can technolog...
Oxford’s Digital Projects: Rethinking the First World War (or 'can technolog...Oxford’s Digital Projects: Rethinking the First World War (or 'can technolog...
Oxford’s Digital Projects: Rethinking the First World War (or 'can technolog...
 
Converging on the Universal Library: From Memex to Googolplex
Converging on the Universal Library: From Memex to GoogolplexConverging on the Universal Library: From Memex to Googolplex
Converging on the Universal Library: From Memex to Googolplex
 
Digital contemporary history: sources, tools, methods, issues
Digital contemporary history: sources, tools, methods, issuesDigital contemporary history: sources, tools, methods, issues
Digital contemporary history: sources, tools, methods, issues
 
Digital contemporary history: sources, tools, methods, issues
Digital contemporary history: sources, tools, methods, issuesDigital contemporary history: sources, tools, methods, issues
Digital contemporary history: sources, tools, methods, issues
 
DHI2018 - a comparative study of Chinese and English publications
DHI2018 - a comparative study of Chinese and English publicationsDHI2018 - a comparative study of Chinese and English publications
DHI2018 - a comparative study of Chinese and English publications
 
Cultural heritage: Tradition, Museums and Wikis
Cultural heritage: Tradition, Museums and WikisCultural heritage: Tradition, Museums and Wikis
Cultural heritage: Tradition, Museums and Wikis
 
Future Library Unconference 2013 - Ad polle
Future Library Unconference 2013 - Ad polleFuture Library Unconference 2013 - Ad polle
Future Library Unconference 2013 - Ad polle
 
case EHRI Veerle Van den Daelen en Tim Veken
case EHRI Veerle Van den Daelen en Tim Vekencase EHRI Veerle Van den Daelen en Tim Veken
case EHRI Veerle Van den Daelen en Tim Veken
 
Building The European Digital Library - An Insider’s Point of View
Building The European Digital Library - An Insider’s Point of View Building The European Digital Library - An Insider’s Point of View
Building The European Digital Library - An Insider’s Point of View
 
CARARE workshop: Europeana research
CARARE workshop: Europeana researchCARARE workshop: Europeana research
CARARE workshop: Europeana research
 

More from cneudecker

EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Librarycneudecker
 
ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltextecneudecker
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungencneudecker
 
Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?cneudecker
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspaperscneudecker
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...cneudecker
 
AI for digitized cultural heritage
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritagecneudecker
 
Kuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenzcneudecker
 
Überblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-Dcneudecker
 
The many uses of digitized newspapers
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspaperscneudecker
 
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...cneudecker
 
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...cneudecker
 
OCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentsOCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentscneudecker
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Miningcneudecker
 
Formate für Volltexte
Formate für VolltexteFormate für Volltexte
Formate für Volltextecneudecker
 
Extrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europecneudecker
 
Reise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minutencneudecker
 
Europeana Newspapers in a Nutshell
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshellcneudecker
 
lab.sbb.berlin
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlincneudecker
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspaperscneudecker
 

More from cneudecker (20)

EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
 
ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltexte
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungen
 
Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspapers
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
 
AI for digitized cultural heritage
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritage
 
Kuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenz
 
Überblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-D
 
The many uses of digitized newspapers
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspapers
 
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
 
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
 
OCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentsOCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documents
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Mining
 
Formate für Volltexte
Formate für VolltexteFormate für Volltexte
Formate für Volltexte
 
Extrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europe
 
Reise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minuten
 
Europeana Newspapers in a Nutshell
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshell
 
lab.sbb.berlin
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlin
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspapers
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Data for the Digital Humanities

  • 1. Climbing the Tower of Babel Challenges and Opportunities in Multilingual Data for the Digital Humanities Clemens Neudecker Staatsbibliothek zu Berlin @cneudecker 7th LIDER Roadmapping Workshop Linked Data for Digital Humanities and Linguistics 20 October 2015, Madrid
  • 2. La búsqueda de la lengua perfecta • Umberto Eco, 1994 • „I certainly will never advise to follow the bizarre thought presented here and dream of a universal language“
  • 3. How many languages are there? • The Holy Bible, Genesis 10: 72 (70) • Max Planck Institute for Evolutionary Anthropology: 6500 – 7000 • ISO 639-3: 7704 (ISO 639-2: 450) • Google Translate supported: 90 • Europeana content: currently 50
  • 4. Metadata • To enjoy a painting or music on Europeana, no special language skills are required? • Wrong! – Cultural objects are described using metadata – Metadata comes in different languages (country of origin of the data provider) – Most often metadata does not have language information – How to still find what you are looking for?
  • 5. Problem: Metadata • Example: Subject „Philosophy“ – Philosophie – Filosofía – Filosofie – Filosofija – Heimspeki – Филозофија – Etc.
  • 6. Metadata: Option 1 • Indicate the language of the metadata • This supports the use of translation or mapping tools to find the correct term in other languages/controlled vocabularies • Example: <subject language="English">Philosophy</subject>
  • 8. Europeana Query Translation • How it works: http://www.europeana.eu/portal/search.html?query=Philosophy
  • 10. Europeana Query Translation http://www.europeana.eu/portal/search.html?query=Philosophy&Philosophie&Filosofía (simplified for illustration purposes: the above query does not really work, as the query expansion is done internally)
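The lookup-and-expand idea behind the query translation slides can be sketched in Python. This is only an illustration under stated assumptions, not Europeana's internal implementation: the function names are hypothetical, the sample response is hand-written in the shape the MediaWiki API returns for `prop=langlinks` (legacy JSON format, where the translated title sits under the `"*"` key), the page id in it is a placeholder, and joining the terms with `OR` is an assumed query syntax.

```python
from urllib.parse import urlencode


def langlinks_url(term, source_lang="en"):
    """Build the Wikipedia API URL asking for interlanguage links of `term`."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "format": "json",
        "titles": term,
    }
    return f"https://{source_lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def translations_from_response(response, wanted_langs):
    """Extract {language code: translated title} from a langlinks response."""
    found = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            if link["lang"] in wanted_langs:
                found[link["lang"]] = link["*"]
    return found


def expand_query(term, translations):
    """Combine the original term and its translations into one query string.

    The OR syntax is an assumption made for this sketch.
    """
    terms = [term] + [t for t in translations.values() if t != term]
    return " OR ".join(f'"{t}"' for t in terms)


# Hand-written sample in the shape of a langlinks response (page id is a placeholder).
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "12345": {
                "title": "Philosophy",
                "langlinks": [
                    {"lang": "de", "*": "Philosophie"},
                    {"lang": "es", "*": "Filosofía"},
                ],
            }
        }
    }
}

translations = translations_from_response(SAMPLE_RESPONSE, {"de", "es"})
expanded = expand_query("Philosophy", translations)
```

Keeping the URL construction separate from the response parsing means the parsing can be tried out on a saved response without any network access.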
  • 11. Europeana Query Translation • Read more: – Query Translation in Europeana: http://journal.code4lib.org/articles/10285 – Improving Europeana Multilingual Search: http://blog.europeana.eu/2014/08/improving-search-across-languages/
  • 12. El idioma analítico de John Wilkins • Jorge Luis Borges, Otras Inquisiciones • „Theoretically, it is not impossible to think of a language where the name of each thing says all the details of its destiny, past and future“
  • 13. Metadata: Option 2 • Even better: Use a language-independent identifier for subject classification (e.g. Library of Congress, WikiData, DDC) • Example: <subject id="loc">sh85100849</subject> <subject id="wikidata">Q5891</subject>
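A small Python sketch of why a language-independent identifier helps: records whose subject labels differ by language fall into the same group once they carry the same identifier. The two identifiers are the ones given on the slide; the record layout, the titles, and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records from three providers; only the identifiers come from the slide.
RECORDS = [
    {"title": "Record from a German library", "label": "Philosophie",
     "ids": {"wikidata": "Q5891", "loc": "sh85100849"}},
    {"title": "Record from a Spanish library", "label": "Filosofía",
     "ids": {"wikidata": "Q5891"}},
    {"title": "Record from an Icelandic library", "label": "Heimspeki",
     "ids": {"loc": "sh85100849"}},
]


def group_by_identifier(records, scheme):
    """Group subject labels by their identifier in the given scheme."""
    groups = defaultdict(list)
    for record in records:
        identifier = record["ids"].get(scheme)
        if identifier is not None:
            groups[identifier].append(record["label"])
    return dict(groups)


by_wikidata = group_by_identifier(RECORDS, "wikidata")
by_loc = group_by_identifier(RECORDS, "loc")
```

A search for the identifier then retrieves all three records regardless of the language of the label, without any translation step.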
  • 14. Two examples • Europeana 1914 – 1918 http://www.europeana1914-1918.eu/ • Europeana Newspapers http://www.europeana-newspapers.eu/
  • 15. Europeana 1914 - 1918 • In fact, three projects: – Europeana Collections 1914-1918: 400,000 digitised items from World War I – Europeana 1914-1918: user-generated content from World War I – European Film Gateway 1914: 740 hours of film related to World War I • How to present these as a uniform collection?
  • 16. Europeana 1914 - 1918 • Analysis of subject classifications available at content holding institutions, e.g. catalogues
  • 17. Europeana 1914 - 1918 • Ranking of most frequent subjects
      Subject Heading | Count
      World War, 1914-1918--Campaigns | 4307
      World War, 1914-1918--Trench warfare | 2990
      World War, 1914-1918--Transportation | 2171
      World War, 1914-1918--Caricatures and cartoons | 2013
      World War, 1914-1918--Serbia | 1755
      … | …
  • 18. Europeana 1914 - 1918 • Mapping subjects to LoC identifiers
      Subject Heading | LoC identifier
      World War, 1914-1918--Campaigns | sh85148240
      World War, 1914-1918--Trench warfare | sh2008113804
      World War, 1914-1918--Transportation | sh2008113817
      World War, 1914-1918--Caricatures and cartoons | sh2010119466
      World War, 1914-1918--Serbia | sh2008113856
      … | …
  • 19. Europeana 1914 - 1918 • Enrichment of metadata with LCSH identifiers
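The mapping and enrichment steps on the preceding slides could look roughly like the following Python sketch. The heading-to-identifier pairs are taken from the mapping slide; the record structure and the names `enrich` and `lcsh_uri` are assumptions made for illustration, and the project's actual enrichment workflow is not described in the slides.

```python
# Heading-to-identifier pairs as listed on the mapping slide.
LCSH_IDS = {
    "World War, 1914-1918--Campaigns": "sh85148240",
    "World War, 1914-1918--Trench warfare": "sh2008113804",
    "World War, 1914-1918--Transportation": "sh2008113817",
    "World War, 1914-1918--Caricatures and cartoons": "sh2010119466",
}


def lcsh_uri(identifier):
    """Turn an LCSH identifier into its id.loc.gov URI."""
    return f"http://id.loc.gov/authorities/subjects/{identifier}"


def enrich(record):
    """Return a copy of `record` with identifiers added for every known heading.

    Headings without a mapping are kept as plain text and get no identifier.
    """
    ids = [LCSH_IDS[s] for s in record.get("subjects", []) if s in LCSH_IDS]
    return {**record, "subject_ids": ids, "subject_uris": [lcsh_uri(i) for i in ids]}


# Hypothetical catalogue record used only to exercise the sketch.
record = {
    "title": "Photograph of a trench",
    "subjects": ["World War, 1914-1918--Trench warfare", "Unmapped local heading"],
}
enriched = enrich(record)
```

Returning a new dictionary rather than modifying the input keeps the original catalogue data untouched, which makes it easy to rerun the enrichment when the mapping table grows.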
  • 20. Europeana 1914 - 1918 • Translation of all subjects
  • 21.
  • 22. Europeana Newspapers • Full text collection of 12 million digitised newspaper pages from 23 European libraries • Around 40 different languages overall • Newspapers from 1618 - 1990 → historical spelling variants! • www.theeuropeanlibrary.org/tel4/newspapers
  • 23. Europeana Newspapers • Content in Europeana Newspapers
  • 24. Europeana Newspapers • 12 million newspaper pages = approximately 102,000,000,000 words! • Impossible to translate everything to multiple languages • But there are alternatives…
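As a quick check of scale, the two figures on this slide imply an average page length, worked out below:

```latex
\frac{102{,}000{,}000{,}000 \text{ words}}{12{,}000{,}000 \text{ pages}} = 8{,}500 \text{ words per page}
```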
  • 25. Europeana Newspapers • What if it were possible to search for persons, locations, events, across languages? Siege of Przemyśl
  • 26. Europeana Newspapers • Named Entity Recognition • Stanford NER toolkit (Stanford University)
  • 27. Europeana Newspapers • Named Entity Disambiguation „Jordan“ • Comparison of context
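"Comparison of context" can be illustrated with a deliberately simple Python sketch: score each candidate meaning of "Jordan" by how many words its description shares with the text around the mention. This word-overlap measure is a stand-in chosen for brevity, not the method used in Europeana Newspapers; the candidate labels, their descriptions, and the example sentence are all invented for illustration.

```python
import re


def tokens(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def disambiguate(context, candidates):
    """Pick the candidate whose description shares the most words with the context.

    Returns the best candidate label and the overlap score of every candidate.
    """
    context_tokens = tokens(context)
    scores = {
        label: len(context_tokens & tokens(description))
        for label, description in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidate descriptions for the ambiguous mention "Jordan".
CANDIDATES = {
    "Michael Jordan (basketball player)":
        "American basketball player who won championships with the Chicago Bulls",
    "Jordan (country)": "country in the Middle East, capital Amman",
    "Jordan (river)": "river in the Middle East flowing into the Dead Sea",
}

context = "Jordan scored 40 points as the Bulls won the basketball game"
best, scores = disambiguate(context, CANDIDATES)
```

A realistic system would at least remove stop words and weight rare terms more heavily, since common words such as "the" contribute to every candidate's score here.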
  • 28. Europeana Newspapers • Named Entity Linking wikidata.org/wiki/Q41421 freebase.com/m/054c1 lccn.loc.gov/n92121379
  • 29. What if… • All metadata in Europeana 1914-1918 had language-independent identifiers • All entities in Europeana Newspapers had language-independent identifiers • It should be possible to link the two distinct collections!
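The linking proposed on this slide amounts to a join on shared identifiers, sketched below in Python. Everything concrete here is hypothetical: the record and page layouts, the function name, and the identifier `ID_SIEGE_OF_PRZEMYSL`, which is a placeholder rather than a real Wikidata or LoC identifier.

```python
from collections import defaultdict


def link_collections(metadata_records, newspaper_pages):
    """Map each metadata record to the newspaper pages sharing an identifier with it."""
    # Index newspaper pages by the entity identifiers found in their text.
    pages_by_entity = defaultdict(list)
    for page in newspaper_pages:
        for entity_id in page["entity_ids"]:
            pages_by_entity[entity_id].append(page["page_id"])

    # For each record, collect all pages that mention one of its subject identifiers.
    links = {}
    for record in metadata_records:
        matched = sorted({
            page_id
            for subject_id in record["subject_ids"]
            for page_id in pages_by_entity.get(subject_id, [])
        })
        if matched:
            links[record["record_id"]] = matched
    return links


# Hypothetical data; ID_SIEGE_OF_PRZEMYSL is a placeholder, not a real identifier.
metadata_records = [
    {"record_id": "ww1-0001", "subject_ids": ["ID_SIEGE_OF_PRZEMYSL"]},
    {"record_id": "ww1-0002", "subject_ids": ["ID_UNRELATED"]},
]
newspaper_pages = [
    {"page_id": "paper-de-17", "entity_ids": ["ID_SIEGE_OF_PRZEMYSL"]},
    {"page_id": "paper-pl-03", "entity_ids": ["ID_SIEGE_OF_PRZEMYSL", "ID_OTHER"]},
]

links = link_collections(metadata_records, newspaper_pages)
```

Because the join key is an identifier rather than a label, the German and Polish pages in the example are linked to the same record without any translation.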
  • 30. Research Questions • This would allow for some very interesting digital humanities research questions, e.g. – How were World War I events covered in newspapers of different nations across Europe? – What were the relations between persons, places and events during World War I?
  • 31. The Republic of Letters • http://stanford.edu/group/toolingup/rplviz/rplviz.swf
  • 32. Global Database of Events, Language and Tone • http://www.gdeltproject.org/
  • 33. Conclusion • We need know-how and technologies for multilingual linking of objects across cultural heritage organisations and digital collections • We need guidelines and standards that support the creation and provision of metadata in cultural heritage objects as multilingual linked data
  • 34. To follow up • Europeana White Paper on Best Practices for Multilingual Access to Digital Libraries • W3C Community Group Best Practices for Multilingual Linked Open Data • Europeana Connect - Multilinguality
  • 35. Tractatus Logico-Philosophicus • Ludwig Wittgenstein, 1922 • Proposition 7: „Whereof one cannot speak, thereof one must be silent“
  • 36. Thank you for your attention! Clemens Neudecker Staatsbibliothek zu Berlin @cneudecker 7th LIDER Roadmapping Workshop Linked Data for Digital Humanities and Linguistics 20 October 2015, Madrid