ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, and Synchronization of Resources on the Web

DPLAfest 2017 presentation by:
Martin Klein
Gretchen Gueguen
Mark Matienzo
Petr Knoth

Published in: Internet

  1. 1. An overview of capabilities and real-world use cases for discovery, harvesting, and synchronization of resources on the web http://www.openarchives.org/rs #resourcesync ResourceSync ANSI/NISO Z39.99-2017 Martin Klein Gretchen Gueguen Mark Matienzo Petr Knoth
  2. 2. ResourceSync was funded by the Sloan Foundation & JISC Martin Klein Los Alamos National Laboratory @mart1nkle1n http://www.openarchives.org/rs #resourcesync ResourceSync ANSI/NISO Z39.99-2017
  3. 3. ResourceSync - @mart1nkle1n DPLAfest, Chicago, April 20 2017
      Background - OAI-PMH
      •  Recurrent metadata exchange from a Data Provider to Service Providers
      •  XML metadata only
      •  Repository centric
      •  Devised 1999-2002, prior to REST, prior to dominance of web search engines
  4. 4. Revisit the Problem Domain - ResourceSync
      •  Synchronization of resources from a Source to Destinations
      •  Web resources: anything with an HTTP URI and a representation
      •  Resource centric
      •  Devised 2012-2013; leverages key ingredients of web interoperability and existing specifications
      •  Updated in 2017 to v1.1
  5. 5. One to One Synchronization
  6. 6. One to Many – Master Copy
  7. 7. Many to One - Aggregator
  8. 8. Selective Synchronization
  9. 9. Metadata Harvesting
  10. 10. ResourceSync Capabilities
      •  Resource List
         •  Inventory, baseline synchronization
      •  Change List
         •  Resource change events that occurred in a temporal interval, incremental synchronization
      •  Resource Dump
      •  Change Dump
      •  Notifications (separate specification)
      •  Archives (beta draft)
  11. 11. Sitemap
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2017-01-02T13:00:00Z</lastmod>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2017-01-02T14:00:00Z</lastmod>
          <changefreq>daily</changefreq>
        </url>
        …
      </urlset>
  12. 12. Resource List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist" at="2017-01-03T09:00:00Z" />
        <url>
          <loc>http://example.com/res1</loc>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" type="application/pdf" />
          <rs:ln rel="describedby" href="http://example.com/res1_dublin_core_md.xml" type="application/xml" />
        </url>
        <url> … </url>
      </urlset>
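A Destination reading the Resource List on this slide might process it roughly as follows. This is a minimal sketch using only the Python standard library, not the API of the resync or py-resourcesync tools. The function names `parse_resource_list` and `verify_md5` are hypothetical, the sample document repeats the slide's single entry, and the sketch assumes one `md5:<hex>` value per `hash` attribute.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace URIs as they appear on the slide.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" type="application/pdf"/>
    <rs:ln rel="describedby" href="http://example.com/res1_dublin_core_md.xml" type="application/xml"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return list-level metadata plus one dict per listed resource."""
    root = ET.fromstring(xml_text)
    list_md = root.find("rs:md", NS)
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        # Map each link relation (e.g. "describedby") to its target URI.
        links = {ln.get("rel"): ln.get("href") for ln in url.findall("rs:ln", NS)}
        resources.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "hash": md.get("hash") if md is not None else None,
            "type": md.get("type") if md is not None else None,
            "links": links,
        })
    return {
        "capability": list_md.get("capability"),
        "at": list_md.get("at"),
        "resources": resources,
    }


def verify_md5(content, hash_attr):
    """Check downloaded bytes against a single 'md5:<hex>' value (assumption)."""
    algo, _, expected = hash_attr.partition(":")
    if algo != "md5":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return hashlib.md5(content).hexdigest() == expected.lower()


parsed = parse_resource_list(RESOURCE_LIST)
for res in parsed["resources"]:
    print(res["uri"], res["hash"], res["links"].get("describedby"))
```

After fetching each URI, the Destination could pass the response body and the recorded `hash` value to `verify_md5` to confirm the copy matches what the Source advertised.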
  13. 13. Change List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changelist" from="2017-01-02T09:00:00Z" until="2017-01-03T09:00:00Z" />
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md change="created" datetime="2017-01-02T13:00Z" />
        </url>
        <url>
          <loc>http://example.com/res3</loc>
          <rs:md change="updated" datetime="2017-01-02T15:00Z" />
        </url>
      </urlset>
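Incremental synchronization from a Change List can be sketched as turning the listed events into a fetch/delete plan. The example below is an illustration under stated assumptions, not the behavior of any existing client: `plan_sync` is a hypothetical name, entries are assumed to appear in chronological order so the last event per URI wins, and the `deleted` entry for `res1` is added to the slide's sample purely to show the delete path.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2017-01-02T09:00:00Z" until="2017-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res2</loc>
    <rs:md change="created" datetime="2017-01-02T13:00Z"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <rs:md change="updated" datetime="2017-01-02T15:00Z"/>
  </url>
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md change="deleted" datetime="2017-01-02T16:00Z"/>
  </url>
</urlset>"""


def plan_sync(change_list_xml):
    """Return (uris_to_fetch, uris_to_delete) from a Change List document."""
    root = ET.fromstring(change_list_xml)
    latest = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        if loc is None or md is None:
            continue  # skip entries without a URI or change metadata
        # Later entries overwrite earlier ones for the same URI.
        latest[loc.strip()] = md.get("change")
    to_fetch = sorted(u for u, c in latest.items() if c in ("created", "updated"))
    to_delete = sorted(u for u, c in latest.items() if c == "deleted")
    return to_fetch, to_delete


to_fetch, to_delete = plan_sync(CHANGE_LIST)
print("fetch:", to_fetch)
print("delete:", to_delete)
```

Collapsing to the latest event per URI means a resource that was created and then deleted inside the same interval is never downloaded.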
  14. 14. ResourceSync Change Notifications
      •  Notifications about change events to resources
      •  Source notifies subscribed Destinations (cf. recurrent pull)
      •  Push-based approach via WebSub
      •  Similar, sitemap-based payload
      •  Decrease synchronization latency between Source and Destination
      •  Change Notification Specification v1.0
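To make the push model concrete, the sketch below builds the form body a Destination would POST to a WebSub hub to subscribe. The `hub.mode`, `hub.topic`, `hub.callback`, and `hub.lease_seconds` parameter names come from the WebSub specification; the topic and callback URLs and the function name `subscription_form` are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode, parse_qs


def subscription_form(topic_url, callback_url, lease_seconds=None):
    """Build the application/x-www-form-urlencoded body of a WebSub subscribe request."""
    params = {
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the resource the Source publishes changes for
        "hub.callback": callback_url,  # where the hub delivers notifications
    }
    if lease_seconds is not None:
        params["hub.lease_seconds"] = str(lease_seconds)
    return urlencode(params)


# Hypothetical URLs, for illustration only.
body = subscription_form(
    "http://example.com/changelist.xml",
    "https://destination.example.org/websub/callback",
    lease_seconds=86400,
)
print(body)
```

Once subscribed, the Destination receives change events at its callback instead of polling the Change List on a schedule, which is where the latency reduction comes from.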
  15. 15. EHRI Use Case
      •  Aggregation of information about Holocaust collections
         •  held by 1,800+ organizations worldwide
         •  into a central service
      •  EAD as exchange format
      •  Diversity of data sources and locations
         •  databases, spreadsheets (“home collections”)
      https://ehri-project.eu/ http://portal.ehri-project.eu https://twitter.com/EHRIproject
  16. 16. EHRI Use Case
      •  Special ResourceSync implementation
         •  Bridges gap between local systems and ResourceSync capability documents on a web server
         •  Filters local resources by subject, time period, etc.
         •  Set up by EHRI technical staff, run by contributing party
      •  Baseline synchronization: Resource Lists
      •  Incremental synchronization: Change Lists
      •  Together with EAD files moved from local system to web server
         •  Dropbox, FTP, USB stick
      •  Service: partners expose EADs, server collects and offers value-added services, e.g., graph database
  17. 17. CLARIAH Use Case
      •  Various institutions host evolving collections
      •  Make collection items uniformly available via RDF graph
      •  Central registry holds description of all collections
      •  Researchers use Virtual Research Environment to
         •  Discover collections (via registry)
         •  Collect graphs from respective institution
         •  Keep graphs up to date
      https://www.clariah.nl/ https://twitter.com/CLARIAH_NL
  18. 18. CLARIAH Use Case
      •  Baseline synchronization
         •  Download graph from DB
         •  Serialized as one or more files, one RDF triple per line (+ s p o graph_name)
            •  “+” stands for “add”
         •  URIs of files listed in Resource List
      •  Incremental synchronization
         •  Changes logged in one or more files, one change per line (+/- s p o graph_name)
            •  “+” stands for “add”, “-” for “delete”
         •  URIs of files listed in Change List
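The +/- line format described for CLARIAH can be replayed against a local copy of a graph in a few lines. This is only a sketch: the slide does not give the exact serialization, so the code assumes each line is a sign, a single space, and then the statement, and it treats the statement as an opaque string rather than parsing subject, predicate, object, and graph name. The function name `apply_changes` and the sample statements are hypothetical.

```python
def apply_changes(graph, change_lines):
    """Apply '+ statement' / '- statement' lines to a set of statements in place."""
    for raw in change_lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        sign, _, statement = line.partition(" ")
        statement = statement.strip()
        if sign == "+":
            graph.add(statement)
        elif sign == "-":
            graph.discard(statement)  # deleting an absent statement is a no-op
        else:
            raise ValueError(f"unknown change marker: {sign!r}")
    return graph


# Baseline file: every line is an addition.
baseline = [
    "+ <http://example.org/a> <http://example.org/p> <http://example.org/b> <http://example.org/g1>",
    "+ <http://example.org/a> <http://example.org/p> <http://example.org/c> <http://example.org/g1>",
]
# Incremental file: one deletion, one addition.
changes = [
    "- <http://example.org/a> <http://example.org/p> <http://example.org/b> <http://example.org/g1>",
    "+ <http://example.org/a> <http://example.org/p> <http://example.org/d> <http://example.org/g1>",
]

graph = apply_changes(set(), baseline)
apply_changes(graph, changes)
print(sorted(graph))
```

Because the baseline uses the same line format as the change files, one routine covers both the initial load and every later update.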
  19. 19. ResourceSync Tools
      •  Source implementation
         •  Python
         •  DANS & LANL & CORE
         •  Connectors to file system, Solr index
         •  OAI-PMH converter (planned)
         •  https://github.com/resourcesync/py-resourcesync
      •  Client implementation
         •  Python
         •  https://github.com/resync/resync
      •  Notification implementation
         •  PubSubHubbub
         •  https://github.com/resync/resourcesync_push
  20. 20. Hyku & DPLA ResourceSync Implementations Gretchen Gueguen, Data Services Coordinator Digital Public Library of America, gretchen@dp.la
  21. 21. Project Background ● IMLS National Leadership Grant (30 months) ● Foster a national digital platform through community-based repository infrastructure ● Leverage & contribute to Hydra, both in code and community
  22. 22. Primary Project Goals 1. Develop turnkey (“easy to install, easy to maintain”) Hydra-based application that leverages and improves on core code components 2. Develop metadata aggregation & enrichment tools 3. Work toward a hosted service in the cloud
  23. 23. Metadata Aggregation @DPLA
  24. 24. Metadata Aggregation @DPLA
      Methods for Data Aggregation:
      ● OAI-PMH (21 providers)
      ● Custom APIs/other (8 providers)
      ● Direct file transfer (3 providers)
      Biggest Drawbacks:
      ● Re-synchronizing entire data sets
      ● Relying on HTTP requests
  25. 25. ResourceSync and Hyku
      ● ResourceSync publishing support built into MVP
      ● Test application with 50,000 records to start
        ○ Limit for a single list. To add more, we would need to make a list of lists.
      ● Resource lists and change lists are supported
      ● Resource or change dumps not currently supported
      ● Content negotiation for JSON-LD, N-Triples, and Turtle
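The "list of lists" mentioned on this slide corresponds to a sitemap index that points at several Resource Lists of at most 50,000 entries each. The following is a rough sketch of that split, not Hyku's implementation: the helper names `chunk` and `build_index`, the example URLs, and the file naming are all hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"
MAX_PER_LIST = 50_000  # per-list limit cited on the slide


def chunk(uris, size=MAX_PER_LIST):
    """Split a list of URIs into consecutive slices of at most `size` items."""
    return [uris[i:i + size] for i in range(0, len(uris), size)]


def build_index(list_urls, at):
    """Serialize a <sitemapindex> that references each component Resource List."""
    ET.register_namespace("", SM)
    ET.register_namespace("rs", RS)
    root = ET.Element(f"{{{SM}}}sitemapindex")
    ET.SubElement(root, f"{{{RS}}}md", {"capability": "resourcelist", "at": at})
    for list_url in list_urls:
        sitemap = ET.SubElement(root, f"{{{SM}}}sitemap")
        ET.SubElement(sitemap, f"{{{SM}}}loc").text = list_url
    return ET.tostring(root, encoding="unicode")


# Hypothetical repository with 120,001 records.
uris = [f"http://example.org/records/{i}" for i in range(120_001)]
parts = chunk(uris)
list_urls = [f"http://example.org/resourcelist_{n}.xml" for n in range(len(parts))]
index_xml = build_index(list_urls, at="2017-04-20T00:00:00Z")
print(len(parts), [len(p) for p in parts])
```

A harvester would read the index first and then fetch each component list, so growing past 50,000 records adds one extra request per additional list rather than changing the client logic.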
  26. 26. ResourceSync and DPLA
      Harvester developed for Hyku endpoint
      ● Development for this specific endpoint means that it’s not a full test of all ResourceSync capabilities
      ● We suspect that we will prefer the Dump to the List
        ○ Using the List means making an HTTP call for each item in order to do the content negotiation
        ○ Dump allows us to download just what we need
        ○ We will still be downloading records that weren’t updated, but given the typical size of the diff for each provider, this single download may still be preferable to 100,000 HTTP requests
      ● Future implementations may require us to build on this initial harvester if the specifics are different
  27. 27. Next Steps Hyku: ● Possibly support Dump ● Increase test set over 50K DPLA: ● Harvest from 3 DPLA providers implementing ResourceSync by end of year
  28. 28. IIIF & ResourceSync: Supporting discovery Mark A. Matienzo, Stanford University Libraries @anarchivist / https://orcid.org/0000-0003-3270-1306 DPLAFest — Chicago, Illinois — April 20, 2017
  29. 29. International Image Interoperability Framework A community that develops Shared APIs implements them in Software and exposes interoperable Content http://iiif.io/
  30. 30. IIIF Community http://iiif.io/community ● IIIF Consortium ○ Currently 38 state/national libraries, universities, museums, tech firms ○ Provides sustainability and steering for the initiative ● Wider community ○ 80+ CH institutions, companies, and projects using IIIF standards ○ iiif-discuss list = 670+ members ○ IIIF Slack = 300+ members ● Community & Technical Specification Groups
  31. 31. Shared APIs http://iiif.io/api/ ● Image API ○ Transfer image pixels, regions, etc. ○ Image manipulation ● Presentation API ○ Presentation of an object (pixels + navigation and metadata) ○ Easily share and re-use, mix and match content ○ Annotate content ● Search API ○ Search annotations ● Authentication API ○ Provide interoperability for access-restricted content
  32. 32. Software Implementations https://github.com/IIIF/awesome-iiif
  33. 33. IIIF Content All kinds of image resources: artworks, photographs, manuscripts, newspapers Investigating AV and 3D
  34. 34. “Discovery” in IIIF Finding interoperable resources Two main concerns: ● How can users find IIIF resources? ● How can users then get those resources into an environment where they can use them?
  35. 35. Scoping the problem What resources can be discovered? Types of resources in IIIF: ● Content (Image API) ● Description (Presentation API) The Image API does not provide description of image content, just technical and rights metadata. Discovery requires Description resources to provide information about Content resources.
  36. 36. Presentation API A Manifest provides just enough metadata (descriptive, structural, etc.) to drive a viewer. A Collection groups Manifests or other Collections. http://iiif.io/api/presentation/2.1/
  37. 37. Community work IIIF Discovery Technical Specification Group iiif.io/community/groups/discovery/ IIIF Discovery TSG scope: ● Crawling and harvesting ● Content indexing ● Change notification ● Import to viewers
  38. 38. Presentation API constraints Informing decisions The Presentation API does not include semantic descriptions, but can reference them using seeAlso. IIIF (including the Presentation API) has a resource-centric view of the web, not a service-centric view (cf Sitemaps/ResourceSync vs OAI-PMH).
  39. 39. Examples
  40. 40. Basic Sitemaps at NC State ● Example demonstrates use of simple sitemaps without any extensions, not even ResourceSync ● Intended to expand upon existing practice of publishing sitemaps from digital collections
  41. 41. Sitemap entry for manifests
      <url>
        <loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004/manifest</loc>
        <lastmod>2016-12-13T15:38:19Z</lastmod>
      </url>
      Sitemap entry for landing page
      <url>
        <loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004</loc>
        <lastmod>2017-03-27T19:33:52Z</lastmod>
      </url>
      Sample of NCSU Sitemaps. Courtesy Jason Ronallo, North Carolina State University
  42. 42. Prototyping at Europeana Exploring Sitemaps and extensions for discovery of IIIF resources for harvesting ● Partnership with University College Dublin and National Library of Wales ● ResourceSync satisfied key needs identified within requirements ● ResourceSync accommodated additional metadata prototyped in an IIIF Sitemap Extension ● Follows several synchronization paradigms
  43. 43. Uses Sitemaps and IIIF Extension
      <url>
        <loc>http://newspapers.library.wales/view/3320640</loc>
        <iiif:Manifest xmlns:iiif="http://iiif.io/api/presentation/2/">http://dams.llgc.org.uk/iiif/newspaper/issue/3320640/manifest.json</iiif:Manifest>
        <dct:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3320639.json</dct:isPartOf>
        <lastmod>2014-11-08</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
      Example of NLW Sitemap Entry. Courtesy Nuno Freire, Europeana
  44. 44. Uses Sitemaps and ResourceSync and DCMES as Extensions
      <url>
        <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
        <rs:ln rel="alternate" href="https://digital.ucd.ie/view/ucdlib:38491" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
        <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
        <lastmod>2014-08-24T04:09:09.716Z</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
      Example of UCD Resource List Entry. Courtesy Nuno Freire, Europeana
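A harvester looking for IIIF resources in an entry like the UCD one could filter `rs:ln` links by their `dcterms:conformsTo` value. The sketch below is illustrative only. The slide shows just the `<url>` fragment, so the enclosing `<urlset>` and its namespace declarations are assumptions (with `dcterms` bound to the standard DCMI Terms namespace), and `iiif_links` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}
DCTERMS = "http://purl.org/dc/terms/"  # assumed binding for the dcterms prefix
IIIF_PRESENTATION_PREFIX = "http://iiif.io/api/presentation/"

# The <urlset> wrapper is added here so the slide's fragment parses on its own.
ENTRY = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <url>
    <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
    <rs:ln rel="alternate" href="https://digital.ucd.ie/view/ucdlib:38491" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  </url>
</urlset>"""


def iiif_links(xml_text):
    """Return (resource URI, link relation, link target) for IIIF Presentation links."""
    root = ET.fromstring(xml_text)
    found = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        for ln in url.findall("rs:ln", NS):
            # Namespaced attributes are looked up with the {uri}name form.
            conforms = ln.get(f"{{{DCTERMS}}}conformsTo", "")
            if conforms.startswith(IIIF_PRESENTATION_PREFIX):
                found.append((loc, ln.get("rel"), ln.get("href")))
    return found


links = iiif_links(ENTRY)
for loc, rel, href in links:
    print(loc, rel, href)
```

Matching on the `conformsTo` prefix rather than an exact version string keeps the filter working for entries that declare a different Presentation API version.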
  45. 45. Uses Sitemaps, ResourceSync, and Sitemap Image Extension Sample of UCD Resource List Courtesy John Howard, University College Dublin
  46. 46. Conclusions
      Strengths
      ● ResourceSync addresses core requirements for exposing IIIF resources for harvesting
      ● Can build on publication of existing sitemaps easily
      ● Leverages Many-to-One, Selective Synchronization, and Metadata Harvesting paradigms
      ● Can adopt additional extensions to implement needed features
      ● Plenty of opportunity to contribute; need more prototypes
      Challenges
      ● IIIF community’s needs for discovery are not necessarily what other sitemap consumers want (e.g. Google)
      ● Identifying the primary resource influences structure
      ● Unclear whether search engines support custom extensions, and what ranking impact would be
  47. 47. Thank You! Mark A. Matienzo, Stanford University Libraries @anarchivist / https://orcid.org/0000-0003-3270-1306 DPLAFest — Chicago, Illinois — April 20, 2017
  48. 48. Seamless access to the world’s open access research papers via ResourceSync Petr Knoth
  49. 49. Use Case 1: ResourceSync as a seamless layer over heterogeneous APIs
  50. 50. Use Case 1: What is CORE? OA Repositories OA Journals Mostly OAI-PMH CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals.
  51. 51. Use Case 1: What is CORE? OA Repositories OA Journals Mostly OAI-PMH CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals. » Enrichment and harmonisation of aggregated data » Products/services: › Portal › API › Data dumps › Recommendation system for libraries › Repository dashboard › B2B and analytical services
  52. 52. Use Case 1: What is CORE? OA Repositories OA Journals Mostly OAI-PMH CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals. » 70 million+ metadata records » Over 6 million full texts hosted on CORE » ~1.5 million monthly active users » Aggregating from 2,500 repositories and 10k OA journals
  53. 53. Use Case 1: Key issue Key players do not provide interoperability for machine access to metadata and content of research papers. [Figure: six pie charts of survey results. Panel titles: “Accessing full-text by harvesting the website”; “Restricting access to full-text”; “Reference of an article’s full-text on metadata”; “Accessing content standards”; “Files format”; “Automated downloads of OA full-text”.]
  54. 54. Use Case 1: Approach OA Repositories OA Journals Key publishers (OA + hybrid OA) Publisher connector Mostly OAI-PMH A range of bespoke APIs + many others Provide seamless access over non-standardised APIs. What protocol?
  55. 55. Use Case 1: Approach OA Repositories OA Journals Key publishers (OA + hybrid OA) Publisher connector Mostly OAI-PMH A range of bespoke APIs + many others Provide seamless access over non-standardised APIs. What protocol? » Why not OAI-PMH? › slow and very inefficient for big repositories. › Standardised for metadata transfer but not for content transfer. ›  Very difficult to represent the richness of metadata from a broad range of data providers.
  56. 56. Use Case 1: ResourceSync as a seamless access layer » Very scalable implementation on both the server and client side » Interpretation of metadata happens using existing pipeline at the aggregator. » 1.5 million OA publications from Elsevier, Springer and others already exposed. » Available at: https://publisher-connector.core.ac.uk/resourcesync OA Repositories OA Journals Key publishers (OA + hybrid OA) Publisher connector Mostly OAI-PMH A range of bespoke APIs + many others ResourceSync
  57. 57. Use Case 2: Exposing enriched data for Text and Data Mining (TDM) via ResourceSync
  58. 58. Use Case 2: Subscribing to ResourceSync OA Repositories OA Journals Key publishers (OA + hybrid OA) Publisher connector Mostly OAI-PMH A range of bespoke APIs ResourceSync + many others » Other aggregators can subscribe to the Publisher connector to make use of their ingestion pipelines and enrichment technologies
  59. 59. Use Case 2: Content ingestion in OpenMinTeD OA Repositories OA Journals Key publishers (OA + hybrid OA) Publisher connector ResourceSync Mostly OAI-PMH OMTD-SHARE (over REST) A range of bespoke APIs + many others » CORE and OpenAIRE are content sources in the OpenMinTeD TDM platform (EU infrastructure project) being developed to enable the mining of scholarly literature.
  60. 60. Use Case 2: Exposing enriched data for TDM OA Repositories OA Journals Key publishers (OA + hybrid OA) Publisher connector ResourceSync Mostly OAI-PMH A range of bespoke APIs + many others ResourceSync » But others want similar solutions … typically, they want to be able to sync and host the data.
  61. 61. Use Case 3: Make repositories and journals adopt ResourceSync
  62. 62. Use Case 3: Replace OAI-PMH with ResourceSync OA Repositories OA Journals Key publishers (OA + hybrid OA) Publisher connector ResourceSync Mostly OAI-PMH OMTD-SHARE (over REST) A range of bespoke APIs + many others ResourceSync ResourceSync » Will be a game changer … » Advocated by COAR Next Generation Repositories WG
  63. 63. Key contributions and considerations
  64. 64. What’s new about our implementation of ResourceSync? » Scales to many millions of resources as required by aggregators (as opposed to existing implementations for repositories that are scalable for tens of thousands of resources) » Real-time updating of ResourceLists and ChangeLists (avoiding unnecessary batch processes). » Combination of real-time updates and scalability
  65. 65. Architectural choices » Based on the principle of changes being communicated to a controller as they happen (rather than having to be detected prior to ResourceList/ChangeList updates) » Uses Elasticsearch as a database » Hashing mechanism to distribute size of each ResourceList link and a clever mechanism for iterative updating of ResourceLists
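The slide does not describe the hashing mechanism in detail, so the following is only one plausible reading, not CORE's implementation: each resource URI is hashed to pick one of a fixed number of Resource Lists, which keeps the lists similar in size and means a change to one resource touches exactly one list. The names `bucket_for` and `partition`, the choice of SHA-1, the bucket count, and the example URIs are all assumptions.

```python
import hashlib
from collections import defaultdict


def bucket_for(uri, n_buckets):
    """Map a URI to a stable bucket number in the range [0, n_buckets)."""
    digest = hashlib.sha1(uri.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % n_buckets


def partition(uris, n_buckets):
    """Group URIs by bucket; each bucket would back one Resource List."""
    buckets = defaultdict(list)
    for uri in uris:
        buckets[bucket_for(uri, n_buckets)].append(uri)
    return buckets


# Hypothetical corpus of 10,000 papers spread over 8 lists.
uris = [f"http://example.org/paper/{i}" for i in range(10_000)]
buckets = partition(uris, 8)
print({b: len(items) for b, items in sorted(buckets.items())})
```

With this scheme, an update event for a single paper only requires rewriting the list at index `bucket_for(uri, n_buckets)`, which fits the slide's point about updating lists iteratively instead of regenerating everything in a batch.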
  66. 66. Conclusions
      » ResourceSync:
        › broad range of uses in scholarly communication.
        › solves problems with aggregating content over OAI-PMH; faster and more efficient aggregation => fresher data in aggregators compared to OAI-PMH
      » We used ResourceSync to “liberate” over 1.5 million OA papers (and growing) from key publishers
      » CORE soon to provide access to over 8 million OA full texts via ResourceSync.
      » CORE actively contributes to the adoption of ResourceSync in the repositories community (as part of OpenMinTeD and COAR NGR)
  67. 67. An overview of capabilities and real-world use cases for discovery, harvesting, and synchronization of resources on the web http://www.openarchives.org/rs #resourcesync ResourceSync ANSI/NISO Z39.99-2017 @mart1nkle1n @G_AmSpinnrade @anarchivist @petrknoth
