Information retrieval challenges we face at Europeana. Presented at the Search Engines Amsterdam Meet-up on the 22nd February 2019 (https://www.meetup.com/SEA-Search-Engines-Amsterdam/events/qvfxgpyzdbdc/)
Challenges in Searching Europe's Cultural Heritage
1. Challenges in the
Search of European
Cultural Heritage
Mónica Marrero
Search Engines Amsterdam, 22 February 2019
2. What is Europeana?
● Europeana is the European Commission's digital platform for cultural heritage.
● Europeana aggregates digital collections from libraries, museums and archives
around Europe, offering that information through its digital platform
● Through Europeana, citizens and the Cultural and Creative Industries can access
European culture for the widest possible variety of purposes.
3. Europeana Timeline
● 2005 idea: “Virtual European library, to make Europe's cultural heritage
accessible for all”, Jacques Chirac
● 2008 prototype: European Digital Library Network (EDLnet)
○ 4.5M objects
● 2009 Europeana v1.0
● 2019 Europeana
○ ~ 58M objects
○ ~ 3700 institutions
○ ~ 40 languages (although the EU has only 24 official languages)
4. Europeana Contents
● Collections: Books, newspapers, journals, letters, diaries, archival papers,
paintings, maps, drawings, photographs, music, spoken word, radio broadcasts,
films, newsreels, television, fashion, sculpture, 3D objects, etc.
● Aggregation:
○ As a rule, content is served by the institutions themselves
○ Digital objects are not stored: only metadata and thumbnails (exceptions:
user-generated content in the World War I collection and the digitised
newspaper collection)
○ Digital objects are accessed directly on the data provider’s platform
8. Diversity
● Different types of objects: image, video, audio, text
● Different topics: fashion, art, maps, manuscripts, etc.
● Different ways in which institutions describe those objects
How to make these data work together?
The Europeana Data Model (EDM)
9. Europeana Data Model
● Follows Linked Open Data principles:
○ Open RDF-based model:
https://pro.europeana.eu/page/edm-documentation
○ Reuse of existing vocabularies
○ Linked to external data
● Supports the representation of metadata about:
○ The object
○ Its digital representations
○ Its provider
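As a rough illustration of the three parts listed above, the sketch below models one record as plain subject-predicate-object triples. The class and property names (edm:ProvidedCHO, edm:WebResource, ore:Aggregation, edm:aggregatedCHO, edm:isShownBy, edm:dataProvider) are taken from the EDM documentation; the example.org identifiers are made up, and a real record would be serialized as RDF rather than as Python tuples.

```python
from collections import defaultdict

# Hypothetical identifiers; real EDM records use the provider's own URIs.
CHO = "http://example.org/item/night-watch"
WEB = "http://example.org/images/night-watch.jpg"
AGG = "http://example.org/aggregation/night-watch"

# Each triple is (subject, predicate, object).
triples = [
    # The object itself
    (CHO, "rdf:type", "edm:ProvidedCHO"),
    (CHO, "dc:creator", "Rembrandt van Rijn"),
    (CHO, "dc:date", "1642"),
    # Its digital representation
    (WEB, "rdf:type", "edm:WebResource"),
    # The aggregation ties object, representation and provider together
    (AGG, "rdf:type", "ore:Aggregation"),
    (AGG, "edm:aggregatedCHO", CHO),
    (AGG, "edm:isShownBy", WEB),
    (AGG, "edm:dataProvider", "Rijksmuseum"),
]


def describe(subject, triples):
    """Group the predicate/object pairs of one subject into a dict of lists."""
    out = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            out[p].append(o)
    return dict(out)


print(describe(AGG, triples))
```

Keeping the provider and the digital representation on the aggregation, rather than on the object, is what lets the same object description be reused when several representations exist.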
12. Challenges I: Enrichment
Image: "Tournoi royal de motos à Londres : changement d'une roue de side-car en marche", Agence de presse Mondial Photo-Presse, 1932. National Library of France, France, Public Domain.
13. From the metadata provided...
Provided Object
● title: Schutters van wijk II onder leiding van kapitein Frans Banninck Cocq, bekend als de ‘Nachtwacht’
● provider: Rijksmuseum
● author: Rembrandt van Rijn
● date: 1642
● type: Schilderij
14. ...to new metadata
Provided Object
● title: Schutters van wijk II onder leiding van kapitein Frans Banninck Cocq, bekend als de ‘Nachtwacht’
● provider: Rijksmuseum
● author: Rembrandt
● date: 1642
● type: Schilderij
New metadata
● Amsterdam [coord]
● second quarter 17th century
● Rembrandt van Rijn [date birth]
● painting
● Schutters van wijk II led by Captain Frans Banninck Cocq, known as the 'Night Watch'
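The step from provided to enriched metadata can be sketched as a lookup of field values against a controlled vocabulary. This is a minimal sketch under stated assumptions, not Europeana's actual pipeline: the `VOCABULARY` table, the example.org URIs and the `enrich` function are hypothetical, and only labels shown on the slide are used.

```python
# Hypothetical vocabulary: (field, provided literal) -> entity with extra data.
VOCABULARY = {
    ("author", "Rembrandt"): {
        "uri": "http://example.org/agent/rembrandt",  # made-up URI
        "label": "Rembrandt van Rijn",
    },
    ("type", "Schilderij"): {
        "uri": "http://example.org/concept/painting",  # made-up URI
        "label": "painting",
    },
}


def enrich(record):
    """Return a copy of the record with matched entities under 'enrichments'.

    The provided values are left untouched; fields without a vocabulary
    match are simply not enriched.
    """
    enrichments = {}
    for field, value in record.items():
        entity = VOCABULARY.get((field, value))
        if entity is not None:
            enrichments[field] = entity
    return {**record, "enrichments": enrichments}


provided = {
    "provider": "Rijksmuseum",
    "author": "Rembrandt",
    "date": "1642",
    "type": "Schilderij",
}
enriched = enrich(provided)
print(enriched["enrichments"]["type"]["label"])
```

Storing enrichments next to the provided values, instead of overwriting them, keeps the institution's original description intact if an enrichment later turns out to be wrong.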
18. Benefits
● Enhance the user's retrieval experience
○ More data to retrieve from
○ Multilinguality
○ Less ambiguity: helps users contextualize objects
○ Entity Pages increase browsing options: users can jump from one object to
others sharing common entities
19. Issues: Source and Rules
● Metadata is missing, wrong, lacks a clear format, or includes properties that
mislead automatic processing
○ E.g. dc:creator: Rembrandt, painter, born in July 15, 1606
○ E.g. Non-standardized date formats
○ E.g. dc:coverage: Wien, 20th century (could be provided as more precise
values with dcterms:spatial and dcterms:temporal)
● Ambiguity: two entities with the same mention
○ E.g. Córdoba: city in Spain or Argentina? Madrid: city or province?
● Cross-lingual ambiguity: wrong enrichments if there is no language tag
○ E.g. Inde (French): India; Inde (Latvian): poison
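The cross-lingual case can be made concrete with a small sketch: the same mention resolves to different entities depending on the language tag, so an ambiguous value without a tag is safer left unenriched. The `LABELS` table, the entity identifiers and the `resolve` function are hypothetical names used only for illustration.

```python
# Hypothetical label index: mention -> {language code: entity id}.
LABELS = {
    "inde": {"fr": "place:India", "lv": "concept:poison"},
    "wien": {"de": "place:Vienna"},
}


def resolve(mention, lang=None):
    """Resolve a mention to an entity id, or None if it cannot be decided.

    With a language tag, only that language's label is considered.
    Without one, the mention is resolved only if all languages agree.
    """
    candidates = LABELS.get(mention.strip().lower(), {})
    if lang is not None:
        return candidates.get(lang)
    distinct = set(candidates.values())
    return distinct.pop() if len(distinct) == 1 else None


print(resolve("Inde", "fr"))  # place:India
print(resolve("Inde", "lv"))  # concept:poison
print(resolve("Inde"))        # None: ambiguous without a language tag
```

Returning no enrichment for the untagged case trades coverage for precision; whether that trade is right depends on how costly a wrong enrichment is for the collection.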
20. Issues: Target Resources
● Coverage and quality of target resources
○ E.g. many more resources in English than in Albanian...
○ E.g. Germania [18th century] is not in GeoNames
● Domain and granularity selection
○ E.g. paper in cultural heritage is not the same as in environmental science
○ E.g. enriching with the concept culture may not help...
● Synchronization with target resources
22. Current approach
keywords → search → results
Doc ranked 1: French
Doc ranked 2: Spanish
Doc ranked 3: Polish
Doc ranked 4: Polish
Doc ranked 5: Dutch
Doc ranked 5: English
Search THE SAME KEYWORDS in all languages
23. Issues
User question: if I search for ‘India’ in French, why do I get documents in Latvian?
keywords → search → results
Doc ranked 1: French
Doc ranked 2: Spanish
Doc ranked 3: Polish
Doc ranked 4: Polish
Doc ranked 5: Dutch
Doc ranked 5: English
24. Issues II
User question: what keywords should I enter to find paintings? I don’t know the language they are described in … and I don’t care!
keywords → search → results
Doc ranked 1: French
Doc ranked 2: Spanish
Doc ranked 3: Polish
Doc ranked 4: Polish
Doc ranked 5: Dutch
Doc ranked 5: English
25. Towards Cross-Lingual IR
keywords → search → results
Doc ranked 1: French
Doc ranked 2: Spanish
● Input
○ Query translation
● Metadata Search
○ Enough multilingual data
○ Language tags
○ Analysis by language
● Output
○ Translation of results
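The input and search stages on this slide (query translation, then language-aware metadata search) can be sketched as follows. The `TRANSLATIONS` dictionary and the `DOCS` collection, including its titles, are hypothetical stand-ins for a real translation service and index. Matching is plain token equality rather than a real ranking function, and the output stage (translation of results) is omitted.

```python
import re

# Hypothetical query translations: keyword -> {language code: keyword}.
TRANSLATIONS = {
    "india": {"en": "india", "fr": "inde", "es": "india"},
}

# Hypothetical collection: each title carries a language tag.
DOCS = [
    {"id": 1, "lang": "fr", "title": "Carte de l'Inde"},
    {"id": 2, "lang": "lv", "title": "Inde un pretinde"},
    {"id": 3, "lang": "es", "title": "Mapa de la India"},
]


def tokens(text):
    """Lowercase and split into word tokens (a stand-in for per-language analysis)."""
    return re.findall(r"\w+", text.lower())


def search(keyword):
    """Match each document only against the keyword translated into its language."""
    by_lang = TRANSLATIONS.get(keyword.lower(), {})
    hits = []
    for doc in DOCS:
        translated = by_lang.get(doc["lang"])
        if translated is not None and translated in tokens(doc["title"]):
            hits.append(doc["id"])
    return hits


print(search("India"))
```

Because each document is compared only with the translation for its own language tag, the Latvian title containing "Inde" is not returned for a query about India, which is the failure of searching the same keywords in all languages.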
27. Main Battles
● Quality of (meta)data
● Quality of enrichment
● Cross-lingual approach
● Content retrieval
○ Challenge from a search perspective
○ Challenge from a Human Interaction perspective
● Evaluation!