New approaches for data acquisition at Europeana: IIIF, Sitemaps and Schema.org (DANS seminar, 2017)

Presentation on experiments at Europeana regarding new methods of aggregating metadata.
Presented at the seminar Linked Data in Research and Cultural Heritage on 1 May 2017.

Published in: Technology

  1. 1. New approaches for data acquisition at Europeana: IIIF, Sitemaps and Schema.org. Valentine Charles and Nuno Freire. Seminar Linked Data in Research and Cultural Heritage, 1 May 2017
  2. 2. Europeana: The Platform for Europe’s Digital Cultural Heritage ● We aggregate (and make available) metadata: • From all EU countries • From ~3,500 galleries, libraries, archives and museums • Under a CC0 licence • For more than 54M objects • In about 50 languages “We transform the world with culture! We want to build on Europe’s rich heritage and make it easier for people to use, whether for work, for learning or just for fun.” New approaches for data acquisition at Europeana CC BY-SA
  3. 3. Re-thinking data aggregation in Europeana. Image: Coloured etchings, Vojtech Preissig, 1887, Uměleckoprůmyslové museum v Praze, Czech Republic, PD
  4. 4. A centralised approach to data aggregation ● An organisational rationale • Data providers, aggregators and Europeana have defined roles ● A technical rationale • Federated search had shown its limits in previous projects • OAI-PMH was chosen as the core technological solution ● A data rationale • Data aggregation focused on metadata, with cultural objects as the main entity
  5. 5. How to go from...
  6. 6. ...to. Figure: Europeana aggregation infrastructure (Europeana, CC BY-SA)
  7. 7. Which technologies are we considering? ● What are the successors of OAI-PMH? ● Technologies widely used by CH organizations for other purposes • Search engine optimization • Linked data • Social web technologies
  8. 8. Investigated technologies: IIIF and Sitemaps. Image: Cristallisation ou Mouvement du temps, René Bord, 1987, Bibliothèque Municipale De Lyon, public domain
  9. 9. International Image Interoperability Framework (IIIF) ● Why IIIF? • Immediate access to full, high-resolution imagery and multi-page documents is something all users want (whether casual or professional) • Some users have specific needs and pain points • It supports Europeana in shifting its focus towards content as well • Storing and serving digital media on behalf of partners is a major step towards an updated value proposition for both partners and users
  10. 10. International Image Interoperability Framework (IIIF) ● How do we support IIIF? • We have joined the IIIF community as a founding member • Within the IIIF community we are engaged in the Newspapers special interest group and in prototyping the use of IIIF in web discovery and metadata harvesting • We work with the Europeana Network to encourage the use of IIIF • We have updated our Europeana Data Model and its documentation to include instructions on how to provide IIIF images and manifests • We support the idea of extending IIIF to other types of media, especially audio-visual
  11. 11. Investigated technologies: IIIF and Sitemaps. Image: Cristallisation ou Mouvement du temps, René Bord, 1987, Bibliothèque Municipale De Lyon, public domain
  12. 12. Sitemaps ● Sitemaps allow webmasters to inform search engines about pages on their sites that are available for crawling ● They are supported by • all major search engines • many content management systems • many Europeana data providers ● They provide a simple technological solution with a very low implementation barrier ● They can support a wide range of resource types • There are Sitemaps extensions for images and videos (by Google)
  13. 13. Sitemaps and Schema.org ● Sitemaps can be combined with structured data markup such as Schema.org microdata ● Europeana has already developed EDM mappings to Schema.org ● We have also worked on a series of recommendations • The URI for an object (http://data.europeana.eu/...) should differ from the URL of the page(s) that display information about that object (http://www.europeana.eu/portal/...). • A sitemap should also include a reference to the publisher of the data (http://europeana.eu) and to provider pages that Europeana could publish in the future. See more in the Code4Lib article "Recommendations for the application of Schema.org to aggregated Cultural Heritage metadata to increase relevance and visibility to search engines: the case of Europeana"
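
The first recommendation on this slide can be illustrated with a minimal sketch. It is not Europeana's actual mapping: the JSON-LD serialisation, the `CreativeWork` type, the function name and the `example.org` identifiers are all assumptions made for illustration. The only point it shows is that the object identifier (`@id`) and the web page address (`url`) are kept apart, and that the publisher is referenced.

```python
import json


def schema_org_record(object_uri, page_url, name, publisher_url):
    """Build a Schema.org description that keeps object URI and page URL apart.

    All argument values are supplied by the caller; nothing here reflects
    Europeana's real EDM-to-Schema.org mapping.
    """
    if object_uri == page_url:
        raise ValueError("the object URI must differ from the URL of its web page")
    return {
        "@context": "http://schema.org",
        "@type": "CreativeWork",      # assumed type, chosen for illustration
        "@id": object_uri,            # identifies the object itself
        "name": name,
        "url": page_url,              # identifies the page that displays it
        "publisher": {"@type": "Organization", "url": publisher_url},
    }


# Hypothetical identifiers; only the publisher URL comes from the slide.
record = schema_org_record(
    object_uri="http://data.example.org/item/123",
    page_url="http://www.example.org/portal/record/123.html",
    name="A hypothetical object",
    publisher_url="http://europeana.eu",
)
print(json.dumps(record, indent=2))
```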
  14. 14. Case studies. Image: Tak met vier magnolia’s, Anonymous, 1910-1925, Rijksmuseum, Netherlands, Public Domain
  15. 15. Partners Europeana & IIIF CC BY-SA ● To study the feasibility of performing metadata aggregation via IIIF/Sitemaps, we have undertaken several case studies, in cooperation with data providers of the Europeana Network • National Library of Wales • Very active in the IIIF community • Very advanced in IIIF implementation • Expertise in full-text content (over IIIF) • University College Dublin • Very advanced in IIIF implementation • Expertise in search engine optimization (Sitemaps and its media specific extensions)
  16. 16. Brief introduction to the IIIF APIs: How can IIIF be used for metadata aggregation?
  17. 17. IIIF: Two Core APIs • Image API: “get pixels” via a simple, RESTful web service • Presentation API: just enough metadata to drive a remote viewing experience. Slide credits (slides 17 to 21): Ben Albritton (@bla222), Mike Appleby (@mikeapps), Tom Cramer (@tcramer), Jon Stroop (@jpstroop), Rob Sanderson (@azaroth42), Stu Snydman (@stusnydman), Simeon Warner (@zimeon), IIIF.io (@iiif_io)
  18. 18. Image Delivery API http://iiif.io/api/image/2.0/
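
As a rough illustration of what "get pixels" means in practice, the sketch below builds request URLs following the URI template of the IIIF Image API 2.0, `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The function names, the base URL and the identifier are hypothetical; only the template and the `info.json` suffix come from the specification.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Return an Image API 2.0 request URL for one image.

    The identifier is percent-encoded (including any slash), as the
    specification requires for identifiers used inside the path.
    """
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(base, identifier):
    """Return the URL of the image information document (info.json)."""
    return f"{base.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Hypothetical server and identifier, for illustration only.
full_image = iiif_image_url("https://example.org/iiif", "book1/page 1")
thumbnail = iiif_image_url("https://example.org/iiif", "book1/page 1", size="200,")
print(full_image)
print(thumbnail)
```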
  19. 19. Object = Image + Presentation
  20. 20. Object = Image + Presentation • Image API: image data • Presentation API: • Descriptive: label, description • Rights: license, attribution (continued on the next slide)
  21. 21. Presentation API (continued) • Structure • Collections of objects • Manifests organizing Items, Sequences and Parts together with their metadata • Linking • service: additional service endpoint • related: resource to display to the user • seeAlso: semantic metadata resource
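
For metadata aggregation, the descriptive, rights and linking properties listed on these two slides are the ones a harvester would read. The following is a minimal sketch, assuming a Presentation API 2.x manifest; the sample manifest, its URLs and the function names are hypothetical. It normalises `seeAlso`, which may be a plain URI, an object or a list of either.

```python
import json

# Hypothetical manifest using Presentation API 2.x property names.
MANIFEST_JSON = """
{
  "@context": "http://iiif.io/api/presentation/2/context.json",
  "@id": "https://example.org/iiif/book1/manifest",
  "@type": "sc:Manifest",
  "label": "Book 1",
  "description": "A hypothetical digitised book",
  "license": "http://creativecommons.org/publicdomain/mark/1.0/",
  "attribution": "Example Library",
  "seeAlso": {
    "@id": "https://example.org/metadata/book1.xml",
    "format": "application/xml"
  }
}
"""


def as_list(value):
    """Wrap a single value in a list; map a missing value to an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def see_also_links(manifest):
    """Return seeAlso entries as dicts with url, format and profile keys."""
    links = []
    for entry in as_list(manifest.get("seeAlso")):
        if isinstance(entry, str):
            links.append({"url": entry, "format": None, "profile": None})
        else:
            links.append({"url": entry.get("@id"),
                          "format": entry.get("format"),
                          "profile": entry.get("profile")})
    return links


def harvest_summary(manifest):
    """Collect the properties a metadata harvester is most likely to need."""
    return {
        "id": manifest.get("@id"),
        "label": manifest.get("label"),
        "license": manifest.get("license"),      # optional in IIIF
        "see_also": see_also_links(manifest),    # optional in IIIF
    }


summary = harvest_summary(json.loads(MANIFEST_JSON))
print(summary)
```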
  22. 22. Case study 1: Crawling services across the IIIF universe • Questions addressed: • Can Europeana find the available IIIF services through IIIF Service Registries? • Is the output of IIIF crawlable? Can robots follow links in IIIF output and reach all resources? • How mature and uniform are existing IIIF implementations? • Is metadata available? • Are machine-readable licenses available?
  23. 23. Case study 1: Crawling services across the IIIF universe • Findings: • Can Europeana find the available IIIF services through IIIF Service Registries? Registries are available and machine readable, but coverage was only partial. • Is the output of IIIF crawlable? Can robots follow links in IIIF output and reach all resources? IIIF provides all that is necessary, but some features are optional (e.g. IIIF Collections). • How mature and uniform are existing IIIF implementations? Only minor compliance problems were found, due to the immaturity of the implementations. • Is metadata available? IIIF provides a way to link to metadata, but it is optional (and often not used). • Are machine-readable licenses available? IIIF provides licensing information, but it is optional (and often not used).
  24. 24. Case study 2: Crawling IIIF services via IIIF Collections. IIIF offers a Collection construct to represent groups of objects • By making a IIIF Collection available to Europeana, all the resources it references can be crawled and their metadata harvested • Often available or simple to implement. The two data providers already had IIIF services in operation, but with • No collection • No metadata. Result: a IIIF Collection was easily implemented in both cases. We identified an additional issue for metadata aggregation: IIIF Collections do not provide the modification timestamp of resources. To overcome this, other technologies may be used in conjunction with IIIF.
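
The crawl described in this case study can be sketched as a walk over a Collection and its sub-collections. This is an illustration, not the crawler used in the case studies: it assumes Presentation API 2.x property names (`manifests`, `collections`, and `members` from 2.1), and `fetch_json` and `iter_manifest_uris` are hypothetical helper names. The fetch function is passed in, so the walk can be tried on in-memory documents.

```python
import json
import urllib.request


def fetch_json(url):
    """Download and decode one IIIF JSON document (plain HTTP GET)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_manifest_uris(collection_uri, fetch=fetch_json, seen=None):
    """Yield the URI of every manifest reachable from a IIIF Collection.

    Sub-collections are followed recursively; the 'seen' set protects
    against cycles and against reporting the same manifest twice.
    """
    seen = set() if seen is None else seen
    if collection_uri in seen:
        return
    seen.add(collection_uri)
    doc = fetch(collection_uri)
    children = (doc.get("manifests", []) + doc.get("collections", [])
                + doc.get("members", []))
    for child in children:
        uri, kind = child.get("@id"), child.get("@type")
        if kind == "sc:Manifest":
            if uri not in seen:
                seen.add(uri)
                yield uri
        elif kind == "sc:Collection":
            yield from iter_manifest_uris(uri, fetch, seen)
```

Because a Collection carries no modification timestamps, a crawler built this way has to request every manifest on every run, which is the inefficiency noted on the slide.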
  25. 25. Case study 3: Crawling IIIF services via Sitemaps • Aggregation using Sitemaps can be more efficient • Resource timestamps can be included in a Sitemap • Three possible ways of using Sitemaps were tested: • Standard Sitemaps • Sitemaps extended with elements used in IIIF specifications • Sitemaps extended with elements from the ResourceSync namespace
  26. 26. Case study 3: Crawling IIIF services via Sitemaps. Example of URL data in a Sitemap from University College Dublin. The loc element references a IIIF Manifest.
      <url>
        <loc>https://data.ucd.ie/api/img/collection/ivrla:3573</loc>
        <lastmod>2014-08-24T04:09:09.716Z</lastmod>
      </url>
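
The efficiency gain of a standard Sitemap comes from `lastmod`: a harvester can skip everything that has not changed since its previous run. A minimal sketch of that filter follows, assuming the standard Sitemaps 0.9 namespace. The first entry reuses the example above; the second entry, the cut-off date and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://data.ucd.ie/api/img/collection/ivrla:3573</loc>
    <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  </url>
  <url>
    <loc>https://data.example.org/iiif/manifest/2</loc>
    <lastmod>2014-11-08</lastmod>
  </url>
</urlset>"""


def parse_lastmod(text):
    """Parse a W3C datetime (date only or full timestamp) as an aware UTC value."""
    parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def changed_since(sitemap_xml, last_crawl):
    """Return the loc of every URL modified after last_crawl.

    Entries without lastmod are always returned, because their state
    cannot be decided from the Sitemap alone.
    """
    changed = []
    for url in ET.fromstring(sitemap_xml).iter(SM + "url"):
        loc = url.findtext(SM + "loc")
        lastmod = url.findtext(SM + "lastmod")
        if loc and (lastmod is None or parse_lastmod(lastmod) > last_crawl):
            changed.append(loc.strip())
    return changed


previous_crawl = datetime(2014, 9, 1, tzinfo=timezone.utc)  # hypothetical
print(changed_since(SITEMAP, previous_crawl))
```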
  27. 27. Case study 3: Crawling IIIF services via Sitemaps. Example of URL data in a Sitemap from the National Library of Wales, with references to the webpage of the resource, the IIIF Manifest and its IIIF Collection.
      <url>
        <loc>http://newspapers.library.wales/view/3679651</loc>
        <iiif:Manifest xmlns:iiif="http://iiif.io/api/presentation/2/">http://dams.llgc.org.uk/iiif/newspaper/issue/3679651/manifest.json</iiif:Manifest>
        <dcterms:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3679650.json</dcterms:isPartOf>
        <lastmod>2014-11-08</lastmod>
      </url>
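
Reading the IIIF-extended variant only requires the extra namespaces. The sketch below wraps the example above in a `urlset` element; that wrapper, the declaration of the `dcterms` prefix on it (using the standard Dublin Core terms namespace) and the function name are assumptions, since the slide shows only the `url` element.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "iiif": "http://iiif.io/api/presentation/2/",
    "dcterms": "http://purl.org/dc/terms/",
}

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <url>
    <loc>http://newspapers.library.wales/view/3679651</loc>
    <iiif:Manifest xmlns:iiif="http://iiif.io/api/presentation/2/">http://dams.llgc.org.uk/iiif/newspaper/issue/3679651/manifest.json</iiif:Manifest>
    <dcterms:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3679650.json</dcterms:isPartOf>
    <lastmod>2014-11-08</lastmod>
  </url>
</urlset>"""


def iiif_entries(sitemap_xml):
    """Return page, manifest, collection and lastmod for each url element."""
    entries = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        entries.append({
            "page": url.findtext("sm:loc", namespaces=NS),
            "manifest": url.findtext("iiif:Manifest", namespaces=NS),
            "collection": url.findtext("dcterms:isPartOf", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
        })
    return entries


entries = iiif_entries(SITEMAP)
print(entries[0]["manifest"])
```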
  28. 28. Case study 3: Crawling IIIF services via Sitemaps. Example of URL data in a Sitemap from University College Dublin, with references to the webpage of the resource, the IIIF Manifest and its IIIF Collection, and an indication of the IIIF API version in use.
      <url>
        <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
        <rs:ln rel="alternate" href="https://data.ucd.ie/api/img/manifests/ucdlib:38491" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
        <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
        <lastmod>2014-08-24T04:09:09.716Z</lastmod>
      </url>
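
In the ResourceSync variant, the IIIF resources are carried by `rs:ln` link elements and distinguished by their `rel` attribute. A sketch of how a harvester might read them follows, assuming the ResourceSync namespace `http://www.openarchives.org/rs/terms/` and the Dublin Core terms namespace; the `urlset` wrapper and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.openarchives.org/rs/terms/}"
DCTERMS = "{http://purl.org/dc/terms/}"

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <url>
    <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
    <rs:ln rel="alternate" href="https://data.ucd.ie/api/img/manifests/ucdlib:38491"
           type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488"
           type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  </url>
</urlset>"""


def links_by_rel(url_element):
    """Index the rs:ln children of one url element by their rel value."""
    links = {}
    for ln in url_element.findall(RS + "ln"):
        links[ln.get("rel")] = {
            "href": ln.get("href"),
            "type": ln.get("type"),
            "conforms_to": ln.get(DCTERMS + "conformsTo"),
        }
    return links


def iiif_targets(sitemap_xml):
    """Return page URL, manifest URL, collection URL and API version per entry."""
    targets = []
    for url in ET.fromstring(sitemap_xml).findall(SM + "url"):
        links = links_by_rel(url)
        manifest = links.get("alternate", {})
        targets.append({
            "page": url.findtext(SM + "loc"),
            "manifest": manifest.get("href"),
            "collection": links.get("collection", {}).get("href"),
            "api": manifest.get("conforms_to"),
        })
    return targets


targets = iiif_targets(SITEMAP)
print(targets[0])
```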
  29. 29. Case study 4: Crawling IIIF services via IIIF Collections and HTTP cache headers • Addresses the inefficiency of crawling IIIF Collections by using HTTP cache control • The IIIF service must implement the relevant HTTP cache headers for the URLs that give access to the IIIF resources • When resources have not changed, the IIIF service saves time and processing • Crawling time was reduced by ~50% ➢ The IIIF crawler includes the HTTP header If-Modified-Since, with the timestamp of the last crawl, in every request for a IIIF manifest. ➢ The IIIF service only needs to send the IIIF manifest if it has been updated since then. ➢ In case of deletion, the IIIF service returns a response with the HTTP status code 404 Not Found.
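
The exchange described on this slide is a standard HTTP conditional GET, which might be sketched as follows. This is not the crawler used in the case study; the function names and the outcome labels are hypothetical. The slide does not name the response for an unchanged resource; the sketch assumes the standard `304 Not Modified`.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(last_crawl):
    """Build the If-Modified-Since header from an aware datetime."""
    stamp = format_datetime(last_crawl.astimezone(timezone.utc), usegmt=True)
    return {"If-Modified-Since": stamp}


def interpret_status(status):
    """Map an HTTP status code to a harvesting outcome."""
    if status == 200:
        return "updated"      # manifest body was sent
    if status == 304:
        return "unchanged"    # assumed standard reply, no body sent
    if status == 404:
        return "deleted"      # as described on the slide
    return "error"


def fetch_if_modified(manifest_url, last_crawl):
    """Request a manifest only if it changed since last_crawl.

    Returns (outcome, manifest), where manifest is None unless updated.
    urllib reports 304 and 404 as HTTPError, hence the except branch.
    """
    request = urllib.request.Request(manifest_url,
                                     headers=conditional_headers(last_crawl))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return interpret_status(response.status), json.load(response)
    except urllib.error.HTTPError as error:
        return interpret_status(error.code), None


print(conditional_headers(datetime(2014, 8, 24, 4, 9, 9, tzinfo=timezone.utc)))
```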
  30. 30. Case study 5: Crawling resources and metadata referenced by the Sitemaps Video and Image extensions • Google has defined Sitemaps extensions for the retrieval of images and videos • Just like search engines, Europeana may reuse the media-specific metadata, however: • From Europeana’s metadata aggregation perspective, the main issue is that this metadata does not fulfil Europeana's data quality requirements • The solution adopted with University College Dublin was to further extend the Video Sitemaps with elements from ResourceSync that allow EDM metadata to be associated with each entry
  31. 31. Example of URL data using the Sitemaps Video extension from University College Dublin. The Sitemap was extended for association of EDM metadata.
      <url>
        <loc>https://digital.ucd.ie/view/ucdlib:38509</loc>
        <rs:ln rel="describedby" href="https://data.ucd.ie/api/edm/v1/ucdlib:38509" dcterms:conformsTo="http://www.europeana.eu/schemas/edm/"/>
        <rs:ln rel="collection" href="https://data.ucd.ie/api/img/collection/ucdlib:38488"/>
        <video:video>
          <video:thumbnail_loc>https://digital.ucd.ie/get/ucdlib:38509/thumbnail</video:thumbnail_loc>
          <video:description>Irish poet Catherine Ann Cullen reads her poem 'Meeting at the Chester Beatty' in UCD Library's Special Collections.</video:description>
          <video:player_loc allow_embed="yes">https://player.vimeo.com/video/111413587</video:player_loc>
          <video:duration>00:02:51.04</video:duration>
          <video:family_friendly>yes</video:family_friendly>
          <video:live>no</video:live>
        </video:video>
        <lastmod>2015-09-10T17:14:26.523Z</lastmod>
      </url>
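
A harvester reading this entry would take the media details from the video extension and follow the `describedby` link to obtain the EDM record. The sketch below shows one way to separate the two. It assumes the Google video Sitemap namespace `http://www.google.com/schemas/sitemap-video/1.1` and the ResourceSync namespace; the `urlset` wrapper, the shortened sample and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.openarchives.org/rs/terms/}"
VIDEO = "{http://www.google.com/schemas/sitemap-video/1.1}"

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://digital.ucd.ie/view/ucdlib:38509</loc>
    <rs:ln rel="describedby" href="https://data.ucd.ie/api/edm/v1/ucdlib:38509"
           dcterms:conformsTo="http://www.europeana.eu/schemas/edm/"/>
    <video:video>
      <video:thumbnail_loc>https://digital.ucd.ie/get/ucdlib:38509/thumbnail</video:thumbnail_loc>
      <video:player_loc allow_embed="yes">https://player.vimeo.com/video/111413587</video:player_loc>
      <video:duration>00:02:51.04</video:duration>
    </video:video>
    <lastmod>2015-09-10T17:14:26.523Z</lastmod>
  </url>
</urlset>"""


def video_entries(sitemap_xml):
    """Return the page URL, EDM metadata URL and video fields of each entry."""
    entries = []
    for url in ET.fromstring(sitemap_xml).findall(SM + "url"):
        edm_url = None
        for ln in url.findall(RS + "ln"):
            if ln.get("rel") == "describedby":
                edm_url = ln.get("href")
        video = {}
        block = url.find(VIDEO + "video")
        if block is not None:
            for child in block:
                # Strip the namespace so keys read like 'duration'.
                video[child.tag.split("}", 1)[1]] = (child.text or "").strip()
        entries.append({"page": url.findtext(SM + "loc"),
                        "edm": edm_url, "video": video})
    return entries


entries = video_entries(SITEMAP)
print(entries[0]["edm"])
```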
  32. 32. Main conclusions from the case studies • Applying these technologies was straightforward for the providers • In-house knowledge is a great advantage • None of the case studies presented serious technological obstacles • Very simple technological solutions are available • Only very large collections may require additional complexity • The main challenge is to choose among the several possibilities and to establish a standard (or best practice) within the community(ies): • Europeana is working with the IIIF community in the context of the IIIF Discovery Technical Specification group • Europeana will prepare recommendations targeted at its own partner network.
  33. 33. Future work. Image: Chat "regardant" à travers une longue-vue et autre chat perché dessus, Agence Rol. Agence photographique, Bibliothèque nationale de France, France, Public Domain
  34. 34. Future work • More case studies in preparation: • Crawling websites/LOD in search of resources represented in Schema.org • ResourceSync: one case study in preparation with a collection containing over 600,000 resources • Continue monitoring and investigating technology trends in our domain: • Continue work on IIIF and Sitemaps • The Linked Data Platform • Notification-based solutions: • Linked Data Notifications • Webmention