Transforming Access to Culture
& History with Connected Data
The case of Europeana
Netherlands, Public Domain
1660 - 1625, Rijksmuseum
Anonymous
Arrival of a Portuguese ship
Netherlands, Public Domain
1615, Rijksmuseum
Anonymous
Elegant Party on a Terrace of a Venetian-inspired Setting
Who we are
Europeana: Transforming the World with Culture
Europeana: Cultural Heritage Metadata
and Content from across Europe
• We aggregate data from Cultural Heritage organisations across
Europe
• predominantly, but not only, EU member states
• In most cases we only harvest metadata
• increasingly, however, we are also hosting content as well as
metadata
• Make it available through our portal site:
https://www.europeana.eu/portal/en
• ultimately linking back to the originating institution
Transforming Access to Culture & History with Connected Data
CC BY-SA
Europeana in numbers
• 53 million+ items
• 30+ languages
• 4500+ GLAM (Galleries, Libraries, Archives, and Museums)
institutions
Europeana as ‘Big Data’
• Volume: relatively low, by Big Data standards (< 2TB metadata)
• Velocity: continuous updating, flushed to datastore every 15 minutes
• Veracity: significant issues of data quality
• Variety: immense
• multiple languages
• multiple formats
• different institutions
• etc … extremely heterogeneous
Analysed as the four ‘V’s …
Norway, CC BY-SA
1921, Oslo Museum
Ernest Rude
Ernest Marini - dancer in a costume
Who they are
Users: what they want and what they do
Who are they?
• "Culture Vultures"
• Academic researchers
• Teachers and students
• Visual artists
• Graphic designers
• Amateurs (in the original sense of the word)
• "Culture snackers"
• casual browsers looking for entertainment
What are they looking for?
• Query pattern is extremely flat
• analysis of logs shows no search term shared by > 6 users
• further analysis needed here
• “serendipity search” is important: users are trying to surprise
themselves
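The "flat query pattern" observation can be checked by counting distinct users per search term. The following is a minimal sketch, assuming a log of (user_id, query) pairs; the log format, the function names, and the sample data are hypothetical and do not reflect Europeana's actual logging schema.

```python
from collections import defaultdict


def users_per_term(log):
    """Map each normalised search term to the set of distinct users who issued it."""
    users = defaultdict(set)
    for user_id, query in log:
        # Collapse case and whitespace so trivial variants count as one term.
        term = " ".join(query.lower().split())
        if term:
            users[term].add(user_id)
    return users


def max_shared(log):
    """Largest number of distinct users sharing any single search term."""
    counts = [len(user_set) for user_set in users_per_term(log).values()]
    return max(counts, default=0)


# Hypothetical sample log: (user_id, raw query string)
sample_log = [
    ("u1", "Rembrandt"),
    ("u2", "rembrandt "),
    ("u3", "Oliver Twist"),
    ("u1", "Rembrandt"),
]
```

On a real log, a small value of `max_shared` relative to the number of users would be consistent with the flat distribution described above.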
It seems literally impossible to say ….
What are they like?
• Culture vultures
• engagement is extremely high
• mean rank of clicked items: 82 (!)
• session length once an item is clicked in the SERP can stretch into
hours
• Culture snackers
• bounce rate difficult to estimate, but high (> 85%)
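The two engagement figures above could be derived from session data along the following lines. This is only a sketch: the session record layout is hypothetical, and "bounce" is assumed here to mean a session with at most one page view, which may differ from the definition behind the slide's estimate.

```python
from statistics import mean


def mean_click_rank(sessions):
    """Mean result-list position of all clicked items, or None if nothing was clicked."""
    ranks = [rank for s in sessions for rank in s["clicked_ranks"]]
    return mean(ranks) if ranks else None


def bounce_rate(sessions):
    """Share of sessions with at most one page view (assumed definition of a bounce)."""
    if not sessions:
        return 0.0
    bounces = sum(1 for s in sessions if s["page_views"] <= 1)
    return bounces / len(sessions)


# Hypothetical sessions: page views and the result ranks that were clicked
sample_sessions = [
    {"page_views": 1, "clicked_ranks": []},
    {"page_views": 5, "clicked_ranks": [3, 120]},
    {"page_views": 1, "clicked_ranks": []},
    {"page_views": 8, "clicked_ranks": [60]},
]
```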
User engagement
What are they doing?
• school reports
• university essays
• presentations
• exhibitions
• research papers
• new artworks
Making new stuff!
United Kingdom, CC BY
The Wellcome Library
Luigi Garzi
The birth of Adonis and
the transformation of Myrrha
Where we’ve been, where
we’re going
Visions for cultural heritage and connected data, past
and present
Original Vision: as Linked Open Data
provider
Linking Open Data cloud diagram 2011, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
The original vision, today
• Ontological modelling
• Europeana Data Model (EDM)
• expressed in RDF for data-model mediation
• internationally shared (DPLA, BBC, etc.)
• Served on our SPARQL endpoint
• … but more frequently as JSON-LD over our APIs
• plug plug: received API World award this year for best Data API
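As an illustration of querying EDM data over SPARQL, the sketch below builds a query and a request URL without sending anything over the network. The endpoint URL, the use of `ore:proxyFor` with `dc:creator` and `dc:title`, and the literal creator string are assumptions made for illustration; the current Europeana documentation should be checked before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed endpoint URL; verify against current Europeana documentation.
ENDPOINT = "http://sparql.europeana.eu/"


def build_query(creator, limit=10):
    """Build a SPARQL query for items whose proxy carries the given creator literal."""
    # Escape backslashes and quotes so the literal cannot break out of the string.
    safe = creator.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
SELECT ?item ?title WHERE {{
  ?proxy ore:proxyFor ?item ;
         dc:creator "{safe}" ;
         dc:title ?title .
}} LIMIT {int(limit)}"""


def request_url(query):
    """Encode the query as a GET request URL, as in the SPARQL protocol."""
    return ENDPOINT + "?" + urlencode({"query": query})
```

A caller would pass `request_url(build_query("Rembrandt"))` to an HTTP client and ask for a JSON or JSON-LD response via the `Accept` header.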
Continued contributions
LOD: New Directions
• “Entity-fication”
• 70%-80% of our searches are for named entities
• People
• Places
• Concepts (subject headings)
• Information on these can be harvested from:
• DBpedia
• Wikidata
• GeoNames
• …
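Harvesting entity information typically starts with multilingual labels. The sketch below pulls labels out of a document shaped like Wikidata's entity JSON (`entities` → id → `labels` → language → `value`). The entity id `Q_EXAMPLE` and the sample labels are placeholders for illustration, not harvested data.

```python
def extract_labels(entity_doc, entity_id, languages):
    """Return {language: label} for the requested languages that are present."""
    entity = entity_doc.get("entities", {}).get(entity_id, {})
    labels = entity.get("labels", {})
    return {
        lang: labels[lang]["value"]
        for lang in languages
        if lang in labels and "value" in labels[lang]
    }


# Placeholder document in the shape of a Wikidata entity response
sample_doc = {
    "entities": {
        "Q_EXAMPLE": {
            "labels": {
                "en": {"language": "en", "value": "Rembrandt"},
                "nl": {"language": "nl", "value": "Rembrandt van Rijn"},
            }
        }
    }
}
```

Languages missing from the source are skipped without raising an error, which matters when coverage across 30+ languages is uneven.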
Structuring content through LOD harvesting (i)
LOD: New Directions
• “Workification” (FRBR data model)
• creating abstract artistic or intellectual entities from numerous
instantiations
• for example, the novel “Oliver Twist” from its many printed editions
and translations
• Harvested (or at least seeded) from OCLC and VIAF
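A naive first pass at "workification" is to cluster edition records under a normalised creator/title key. The sketch below uses hypothetical record fields and ids. It deliberately fails to merge a translated title, which is why seeding from external authority data such as VIAF is attractive.

```python
import re
import unicodedata
from collections import defaultdict


def _norm(text):
    """Lowercase, strip diacritics, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", plain.lower()).strip()


def work_key(title, creator):
    """Candidate key identifying the abstract work behind an edition."""
    return (_norm(creator), _norm(title))


def cluster_editions(records):
    """Group edition ids by work key."""
    works = defaultdict(list)
    for rec in records:
        works[work_key(rec["title"], rec["creator"])].append(rec["id"])
    return dict(works)


# Hypothetical edition records
sample_records = [
    {"id": "ed1", "title": "Oliver Twist", "creator": "Dickens, Charles"},
    {"id": "ed2", "title": "OLIVER TWIST.", "creator": "Dickens, Charles"},
    {"id": "ed3", "title": "Olivier Twist", "creator": "Dickens, Charles"},
]
```

Here `ed1` and `ed2` fall into one cluster, while the translated title `ed3` remains separate and would need an authority link to be attached to the same work.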
Structuring content through LOD harvesting (ii)
LOD: New Directions
• Knowledge Graphs linking …
• authors to works
• artists to their paintings, and other artists
• concepts to concepts
• …
• Obvious applications
• educational
• research
• “serendipity”
• improved “snacker” engagement
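Once such links exist, "serendipity" features reduce to path-finding over the graph. The following is a minimal sketch using breadth-first search over an undirected graph built from triples. The relation names ("created", "depicts") and the triples themselves are illustrative assumptions, not Europeana data.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Build an undirected adjacency map from (subject, predicate, object) triples."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour in seen:
                continue
            if neighbour == goal:
                return path + [neighbour]
            seen.add(neighbour)
            queue.append(path + [neighbour])
    return None


# Illustrative triples; predicates are assumed names
sample_triples = [
    ("Rembrandt van Rijn", "created", "The Great-Mughal Jahangir"),
    ("The Great-Mughal Jahangir", "depicts", "Jahangir"),
    ("Prince Salim, the future Jahangir, Enthroned", "depicts", "Jahangir"),
]
```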
Structuring content through LOD harvesting (iii)
Case Study
Linking Rembrandt to Jahangir
• https://www.thetimes.co.uk/article/from-rhinos-to-rembrandt-how-india-inspired-the-world-hdsr8kls5
“Self-portrait” (Rembrandt van Rijn), “The Great-Mughal Jahangir” (Rembrandt van Rijn),
and “Prince Salim, the future Jahangir, Enthroned” (Anonymous), all in the public domain.
How we do it
Technical stack
France, Public Domain
1914, National Library of France
Agence de presse Meurisse
Concours de cycles nautiques sur le lac
d’Enghien : Berregent piloté par Austerling
The webapp stack
• Data ingestion: Java + XSLT behemoth
• Data enrichment: Java
• Source-of-truth datastore: MongoDB
• Information retrieval: Solr + Neo4j
• API: Swagger with Java
• UI: JS, variety of libraries
• SPARQL endpoint: Virtuoso
France, Public Domain
1588, Bibliothèque municipale de Lyon
Hendrik Goltzius
Le dragon dévorant les compagnons
de Cadmus
Reality check
Where we are and how fast we can go
Dirty Data (i)
• Getting from strings to things is a non-trivial process
• Named Entity Recognition technology relatively unhelpful in this
domain
• exact-string matching only: precision good, but recall poor
• multilinguality strong
• Limited number of tools to help with cleaning, enhancing, validating this
data
• OpenRefine potentially helpful
• ShEx, SHACL not yet fully mature
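The precision/recall trade-off of exact-string matching can be seen in a small comparison with a normalised match. This is a sketch under assumptions: the label index and entity id are hypothetical, and the normalisation shown (case folding, diacritic stripping, whitespace collapsing) is one possible choice rather than the approach used in production.

```python
import unicodedata


def normalise(label):
    """Case-fold, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(plain.casefold().split())


def match_exact(value, index):
    """Exact-string lookup: high precision, misses any spelling variant."""
    return index.get(value)


def match_normalised(value, index):
    """Lookup on normalised labels: better recall, some risk of false matches."""
    normalised_index = {normalise(label): ident for label, ident in index.items()}
    return normalised_index.get(normalise(value))


# Hypothetical label index mapping labels to entity identifiers
sample_index = {"Bibliothèque nationale de France": "entity:bnf"}
```

For the input `"bibliotheque  nationale de france"`, the exact lookup returns nothing while the normalised lookup finds the entity. Loosening the match further increases recall at the cost of precision.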
Source data
Dirty Data (ii)
• Irregular data models
• Large number of Wikidata, DBpedia properties applied irregularly
• “defensive querying”
• Incorrect data
• more often questions of structure than inaccurate field values
• e.g. GeoNames hierarchies
• Uncurated or aggregated data
• e.g., many variants provided by VIAF
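"Defensive querying" here means never assuming that a given property is present or well-formed. The following is a minimal sketch: it tries a ranked list of candidate properties and skips empty values. The property list and the sample record are illustrative assumptions about how a birth date might be expressed across sources.

```python
# Candidate properties in order of preference (illustrative list)
BIRTH_PROPS = ["wdt:P569", "dbo:birthDate", "dbp:dateOfBirth"]


def first_value(entity, candidate_props):
    """Return the first non-empty value found under any candidate property."""
    for prop in candidate_props:
        values = entity.get(prop)
        if values is None:
            continue
        # Sources may give a single value or a list; treat both the same way.
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            if value not in (None, ""):
                return value
    return None


# Hypothetical harvested record with a missing and a partly empty property
sample_entity = {"dbo:birthDate": None, "dbp:dateOfBirth": ["", "1606-07-15"]}
```

The same idea carries over to SPARQL, where `OPTIONAL` patterns play the role of the fallback list.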
Linked Data resources
Directions forward
• Manual or at least heavily-supervised curation a requirement for the
foreseeable future
• Tools to aid NER and entity-matching are the focus of two US efforts:
• Institute of Museum and Library Services (IMLS) Local Authority Files
project
• Linked Data for Libraries Reconciliation Service Group
• Work division
• devolution to partners
• crowdsourcing
Dealing with dirty data
16 November 2017
PANEL: LINKED OPEN DATA - IS IT FAILING
OR JUST GETTING OUT OF THE BLOCKS?
Tweet your questions via Direct Message to @Connected_Data or #ConnectedData
MODERATOR
James Phare
Connected Data London
PANELIST
Chris Taggart
CEO
OpenCorporates
@CountCulture
PANELIST
Chris Gutteridge
Linked Open Data
Architect
University of Southampton
PANELIST
Leigh Dodds
Data Infrastructure
Programme Lead
Open Data Institute
@ldodds
PANELIST
Sebastian Hellmann
Executive Director and
Board Member
DBpedia

Tim Hill
