This document provides an overview of ResourceSync, which is a framework for synchronizing web resources between systems. Some key points:
- ResourceSync was created to address limitations of existing protocols like OAI-PMH by allowing synchronization of any web resource and enabling both one-time and ongoing synchronization.
- It supports various capabilities for synchronization like resource lists, change lists, and notifications. These can be used for initial synchronization or incremental updates.
- Real-world examples are described where ResourceSync has been implemented for projects involving aggregation of digital collections, like Europeana and CLARIAH. It facilitates synchronization between diverse data sources.
- Related presentations describe how ResourceSync could also be useful in adjacent contexts, such as next-generation repository harvesting and discovery of IIIF resources.
Mind the gap! Reflections on the state of repository data harvestingSimeon Warner
A 24x7 presentation at Open Repositories 2017 in Brisbane, Australia.
I start with an opinionated history of the evolution of repository data harvesting since the late 1990's to the present. A conclusion is that we are currently in danger of creating a repository environment with fewer cross-repository services than before, with the potential to reinforce the silos we hope to open. I suggest that the community needs to agree upon a new solution, and further suggest that solution should be ResourceSync.
Maintaining scholarly standards in the digital age: Publishing historical gaz...Humphrey Southall
This presentation: (1) Discusses why providing detailed attributions of individual contributions is essential to large scale sharing of historical research data; (2) Provides a short introduction to Open Linked Data; (3) Introduces the PastPlace Gazetteer API (Applications Programming Interface), explaining components of the RDF it generates using the example of Oxford, UK; (4) Notes that most open data projects use the Creative Commons -- Must Acknowledge license (CC-BY) while not actually acknowledging contributors within their RDF, then shows how we do it; (5) Introduces the separate PastPlace Datafeed API, which implements the W3C Datacube Vocabulary.
This presentation introduces ResourceSync, a specification aimed to enable web-based synchronization of resources. The specification is the result of a collaboration between NISO and the Open Archives Initiative funded by the Sloan Foundation and JISC. The proposed resource synchronization approach is based on several existing specifications (e.g. Sitemaps, PubSubHubbub, well-known URI) and is aligned with common architectural principles (e.g. REST, follow your nose).
A 15 minute video version of these slides is available at https://www.youtube.com/watch?v=ASQ4jMYytsA
Linked Data: from Library Entities to the Web of DataRichard Wallis
Presentation to the ALCTS session "International Developments in Library Linked Data: Think Globally" at the American Library Association Conference in Las Vegas - June 2014
This talk was provided by Ursula Pieper of the National Agricultural Library for the NISO Virtual Conference, Using Open Source in Your Institution, held on Feb 17, 2016
Slides from my workshop at Open Repositories 2016 about DSpace's Linked Data support. The slides include a short introduction into the Semantic Web and Linked Data, the main ideas behind the Linked Data support of DSpace, information on how to configure this feature and some examples about how to query DSpace installations for Linked Data.
CLARIAH Toogdag 2018: A distributed network of digital heritage informationEnno Meijers
Slides of my keynote at the CLARIAH Toogdag 2018 on 9 March at the National Library of the Netherlands. The main topics were the development of the distributed digital heritage network and the alignment to and cooperation with the CLARIAH infrastructure and data. It also points at some of the current limitations of the semantic web technology.
DBpedia - An Interlinking Hub in the Web of DataChris Bizer
Gives an overview of the DBpedia project and the role of DBpedia in the Web of Data, and outlines the next steps for the DBpedia project as well as ideas for using DBpedia data within the BBC.
Better together: building services for public good on top of content from the...petrknoth
CORE hosts the world’s largest collection of open access full texts, offering seamless, unrestricted access to research for citizens, researchers, libraries, software developers, funders and others. CORE’s aggregated content comes from thousands of institutional and subject repositories as well as journals and covers all research disciplines. In January 2019, CORE hit the mark of 10 million monthly active users (10.41 million). In September 2019, core.ac.uk made it into the top 5,000 websites globally by user engagement as measured by the independent Alexa Rank, making it clearly one of the world’s most widely used Open Access services.
In this talk, Petr and Nancy will explain the role of CORE in the open science ecosystem. They will introduce the solutions CORE offers for improving the delivery of research literature, including tools for discovering freely available copies of papers that might be behind publishers’ paywalls as well as a recommender system for open access literature. The use of CORE data to monitor compliance with open access policies has also recently received attention. The presenters will then reflect on the challenges in the sector and share their experience of building value-added services for the society on top of open content offered by libraries and their affiliated institutional repositories and open access journals.
Linked Data Notifications Distributed Update Notification and Propagation on ...Aksw Group
Distributed Update Notification and Propagation on the Web of Data with: Linked Data Notifications, PubSubHubbub, Semantic Pingback and Structured Feedback
An introduction deck for the Web of Data to my team, including basic semantic web, Linked Open Data, primer, and then DBpedia, Linked Data Integration Framework (LDIF), Common Crawl Database, Web Data Commons.
This presentation addresses the main issues of Linked Data and scalability. In particular, it provides details on approaches and technologies for clustering, distributing, sharing, and caching data. Furthermore, it addresses the means for publishing data through cloud deployment and the relationship between Big Data and Linked Data, exploring how some of the solutions can be transferred to the context of Linked Data.
Security and Data Ownership in the Cloud
Andrew K. Pace, Executive Director, Networked Library Services, OCLC; Councilor-at-large, American Library Association
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ...Robert Meusel
Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.
Presented at OR2017 - Brisbane
Panel Discussion: COAR Next Generation Repositories: Results and Recommendations
The presentation focuses on the recommended technologies to implement in repository platforms.
The nearly ubiquitous deployment of repository systems in higher education and research institutions provides the foundation for a distributed, globally networked infrastructure for scholarly communication. However, repository platforms are still using technologies and protocols designed almost twenty years ago, before the boom of the Web and the dominance of Google, social networking, semantic web and ubiquitous mobile devices.
To that end, in April 2016, COAR launched a working group to identify the technologies and architectures of the next generation of repositories. There are two threads to our work: (1) increase the exposure by repositories of uniform behaviors that can be used by machine agents to fuel novel scholarly applications that reach beyond the scope of a single repository and that enable repository content to be smoothly embedded into mainstream web applications; (2) integrate with existing scholarly infrastructures, specifically those aimed at identification, as a means to solidly embed repositories in the overall scholarly communication landscape.
This panel will present the results of the COAR Next Generation Repositories Working Group including our vision, design assumptions, use cases, architectural and technical recommendations, and next steps. The session will also include time for audience discussion and feedback.
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Artefactual Systems - AtoM
These slides accompanied a June 4th, 2016 presentation made by Dan Gillean of Artefactual Systems at the Association of Canadian Archivists' 2016 Conference in Montreal, QC, Canada.
This presentation aims to examine several existing or emerging computing paradigms, with specific examples, to imagine how they might inform next-generation archival systems to support digital preservation, description, and access. Topics covered include:
- Distributed Version Control and git
- P2P architectures and the BitTorrent protocol
- Linked Open Data and RDF
- Blockchain technology
The session is part of an attempt by the ACA to create interactive "working sessions" at its conferences. Accompanying notes can be found at: http://bit.ly/tech-Proche
Participants were also asked to use the Twitter hashtag of #techProche for online interaction during the session.
Talk given at Open Knowledge Foundation 'Opening Up Metadata: Challenges, Standards and Tools' Workshop, Queen Mary University of London, 13th June 2012.
Info on the event at http://openglam.org/2012/05/31/last-places-left-for-opening-up-metadata-challenges-standards-and-tools/
As part of the final BETTER Hackathon, project partners prepared 4 hackathon exercises. Fraunhofer IAIS organised this exercise in conjunction with external partner MKLab ITI-CERTH (EOPEN project). This step-by-step exercise featured the setup of local Docker images on Linux OS featuring Docker Compose and (pre-installed) Python, SANSA, Hadoop, Apache Spark and Apache Zeppelin. It featured semantic transformation and the use of SANSA (Scalable Semantic Analytics Stack - http://sansa-stack.net/) libraries on a sample of tweets ahead of geo-clustering.
Project website (Hackathon information): https://www.ec-better.eu/pages/2nd-hackathon
Github repository: https://github.com/ec-better/hackathon-2020-semanticgeoclustering
Next Steps for IMLS's National Digital PlatformTrevor Owens
This keynote, at the Upper Midwest Digital Collections Conference, provides an update on the National Digital Platform and 20 projects supported to enhance it. The national digital platform is a way of thinking about and approaching the digital capability and capacity of libraries across the US. In this sense, it is the combination of software applications, social and technical infrastructure, and staff expertise that provide library content and services to all users in the US. As libraries increasingly use digital infrastructure to provide access to digital content and resources, there are more and more opportunities for collaboration around the tools and services that they use to meet their users’ needs. It is possible for each library in the country to leverage and benefit from the work of other libraries in shared digital services, systems, and infrastructure.
We need to bridge gaps between disparate pieces of the existing digital infrastructure, for increased efficiencies, cost savings, access, and services. To this end, IMLS is focusing on the national digital platform as an area of priority in the National Leadership Grants to Libraries program and the Laura Bush 21st Century Librarian program. We are eager to explore how this way of thinking and approaching infrastructure development can help states make the best use of the funds they receive through the Grants to States program. We’re also eager to work with other foundations and funders to maximize the impact of our federal investment.
Project update: A collaborative approach to "filling the digital preservation...Jenny Mitcham
A presentation given by Julie Allinson at the UK Archivematica group meeting on 6th November 2015 in Leeds. It describes work underway in the "Filling the Digital Preservation Gap" project using Archivematica to preserve research data
Slides accompanying a brief talk given as part of the Archivematica User Group meeting at #SAA2016, the Society of American Archivists 2016 conference in Atlanta, GA. The user group meeting was held on August 3rd Room 309/310 in the Hilton Atlanta.
These slides offer Archivematica users a brief update on the features included in the current 1.5 release and what's on the roadmap for future releases, as well as discussion of related events and resources such as the first ArchivematiCamp in August, screencasts, and more.
A summary of DBpedia's History and a detailed analysis of challenges and solutions.
We show how the Linked Data Cloud evolved around DBpedia and also what problems we and other data projects encountered. We included a section on the new solutions that will lead DBpedia into a bright future.
This slide deck has been prepared for a workshop on Linked Data Publishing and Semantic Processing using the Redlink platform (http://redlink.co). The workshop delivered at the Department of Information Engineering, Computer Science and Mathematics at Università degli Studi dell'Aquila aimed at providing a general understanding of Semantic Web Technologies and how these can be used in real world use cases such as Salzburgerland Tourismus.
A brief introduction has been also included on MICO (Media in Context) a European Union part-funded research project to provide cross-media analysis solutions for online multimedia producers.
Slides from our tutorial on Linked Data generation in the energy domain, presented at the Sustainable Places 2014 conference on October 2nd in Nice, France
Similar to ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, and Synchronization of Resources on the Web
Memento Tracer: An Innovative Approach Towards Balancing Scale and Fidelity for Web Archiving - Martin Klein
Presentation at RESAW, The Web That Was, Amsterdam, NL, June 20, 2019
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, and Synchronization of Resources on the Web
1. An overview of capabilities and real-world use cases for discovery, harvesting, and synchronization of resources on the web
http://www.openarchives.org/rs #resourcesync
ResourceSync - ANSI/NISO Z39.99-2017
Martin Klein, Gretchen Gueguen, Mark Matienzo, Petr Knoth
2. ResourceSync was funded by the Sloan Foundation & JISC
Martin Klein, Los Alamos National Laboratory, @mart1nkle1n
3. ResourceSync - @mart1nkle1n
DPLAfest, Chicago, April 20 2017
Background - OAI-PMH
• Recurrent metadata exchange from a Data Provider to Service Providers
• XML metadata only
• Repository centric
• Devised 1999-2002, prior to REST, prior to dominance of web search engines
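An OAI-PMH harvest of the kind described above is driven entirely by URL query parameters. As a minimal sketch, the function below builds a ListRecords request, optionally with a `from` date for selective (incremental) harvesting; the repository base URL is hypothetical.

```python
from urllib.parse import urlencode

def oai_listrecords_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL for (incremental) harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:  # selective harvesting: only records changed since this date
        params["from"] = from_date
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint:
print(oai_listrecords_url("http://repo.example.org/oai", from_date="2017-01-01"))
# http://repo.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2017-01-01
```

Note how the protocol can only express "records changed since a date" per repository, which is part of why re-synchronizing entire data sets becomes the fallback in practice.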
4. Revisit the Problem Domain - ResourceSync
• Synchronization of resources from a Source to Destinations
• Web resources, anything with an HTTP URI & representation
• Resource centric
• Devised 2012-2013, leverages key ingredients of web interoperability, existing specifications
• Updated in 2017 to v1.1
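Because ResourceSync capability documents reuse the Sitemap XML format, a Destination can read a Resource List with a plain XML parser. The sketch below extracts resource URIs and last-modified timestamps from a minimal Resource List; the example resource URI is hypothetical, and a real client (e.g. the `resync` library) would handle far more of the specification.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"   # Sitemap namespace
RS = "{http://www.openarchives.org/rs/terms/}"         # ResourceSync namespace

def parse_resource_list(xml_text):
    """Extract (uri, lastmod) pairs from a ResourceSync Resource List."""
    root = ET.fromstring(xml_text)
    md = root.find(RS + "md")
    if md is None or md.get("capability") != "resourcelist":
        raise ValueError("not a Resource List capability document")
    resources = []
    for url in root.findall(SM + "url"):
        resources.append((url.findtext(SM + "loc"), url.findtext(SM + "lastmod")))
    return resources

example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"/>
  <url><loc>http://example.com/res1</loc><lastmod>2017-01-02T00:00:00Z</lastmod></url>
</urlset>"""
print(parse_resource_list(example))
# [('http://example.com/res1', '2017-01-02T00:00:00Z')]
```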
14. ResourceSync Change Notifications
• Notifications about change events to resources
• Source notifies subscribed Destinations (cf. recurrent pull)
• Push-based approach via WebSub
• Similar, sitemap-based payload
• Decrease synchronization latency between Source and Destination
• Change Notification Specification v1.0
15. EHRI Use Case
• Aggregation of information about Holocaust collections held by 1,800+ organizations worldwide into a central service
• EAD as exchange format
• Diversity of data sources and locations: databases, spreadsheets (“home collections”)
https://ehri-project.eu/
http://portal.ehri-project.eu
https://twitter.com/EHRIproject
16. EHRI Use Case
• Special ResourceSync implementation
• Bridges gap between local systems and ResourceSync capability documents on a web server
• Filters local resources by subject, time period, etc.
• Set up by EHRI technical staff, run by contributing party
• Baseline synchronization: Resource Lists
• Incremental synchronization: Change Lists
• Together with EAD files moved from local system to web server (Dropbox, FTP, USB stick)
• Service: partners expose EADs, server collects and offers value-added services, e.g., graph database
17. CLARIAH Use Case
• Various institutions host evolving collections
• Make collection items uniformly available via RDF graph
• Central registry holds description of all collections
• Researchers use Virtual Research Environment to
• Discover collections (via registry)
• Collect graphs from respective institution
• Keep graphs up to date
https://www.clariah.nl/
https://twitter.com/CLARIAH_NL
18. CLARIAH Use Case
• Baseline synchronization
• Download graph from DB
• Serialized as one or more files, one RDF triple per line (+ s p o graph_name); + stands for “add”
• URIs of files listed in Resource List
• Incremental synchronization
• Changes logged in one or more files, one change per line (+/- s p o graph_name); + stands for “add”, - for “delete”
• URIs of files listed in Change List
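The line-oriented change format described above lends itself to a very small parser. The following sketch assumes plain whitespace-separated fields as shown on the slide (it ignores quoting and escaping, which a real RDF serialization would need) and applies a change log to a set of quads:

```python
def apply_change_log(lines, graph=None):
    """Apply '+'/'-' change lines to a set of (s, p, o, graph_name) quads."""
    graph = set() if graph is None else graph
    for line in lines:
        op, s, p, o, g = line.split()   # e.g. "+ ex:s1 ex:p ex:o1 ex:g"
        quad = (s, p, o, g)
        if op == "+":        # "+" stands for add
            graph.add(quad)
        elif op == "-":      # "-" stands for delete
            graph.discard(quad)
        else:
            raise ValueError("unknown operation: " + op)
    return graph

log = [
    "+ ex:s1 ex:p ex:o1 ex:g",
    "+ ex:s2 ex:p ex:o2 ex:g",
    "- ex:s1 ex:p ex:o1 ex:g",
]
print(apply_change_log(log))  # {('ex:s2', 'ex:p', 'ex:o2', 'ex:g')}
```

Replaying the baseline files first and then each change-log file in order reconstructs the current state of the graph at the Destination.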
20. Hyku & DPLA ResourceSync Implementations
Gretchen Gueguen, Data Services Coordinator
Digital Public Library of America, gretchen@dp.la
21. Project Background
● IMLS National Leadership Grant (30 months)
● Foster a national digital platform through community-based repository infrastructure
● Leverage & contribute to Hydra, both in code and community
22. Primary Project Goals
1. Develop turnkey (“easy to install, easy to maintain”) Hydra-based application that leverages and improves on core code components
2. Develop metadata aggregation & enrichment tools
3. Work toward a hosted service in the cloud
24. Metadata Aggregation @DPLA
Methods for Data Aggregation:
● OAI-PMH (21 providers)
● Custom APIs/other (8 providers)
● Direct file transfer (3 providers)
Biggest Drawbacks:
● Re-synchronizing entire data sets
● Relying on HTTP requests
25. ResourceSync and Hyku
● ResourceSync publishing support built into MVP
● Test application with 50,000 records to start
○ Limit for a single list. To add more, we would need to make a list of lists.
● Resource Lists and Change Lists are supported
● Resource or Change Dumps not currently supported
● Content negotiation for JSON-LD, N-Triples, and Turtle
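Content negotiation here means the harvester selects a serialization by sending the matching MIME type in the HTTP Accept header. A minimal sketch using Python's urllib; the record URL is hypothetical:

```python
import urllib.request

# MIME types for the three serializations the endpoint supports
FORMATS = {
    "jsonld": "application/ld+json",
    "ntriples": "application/n-triples",
    "turtle": "text/turtle",
}

def negotiated_request(url, fmt="turtle"):
    """Build a GET request asking for a specific RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": FORMATS[fmt]})

req = negotiated_request("http://hyku.example.org/records/123", "ntriples")
print(req.get_header("Accept"))  # application/n-triples
```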
26. ResourceSync and DPLA
Harvester developed for Hyku endpoint
● Development for this specific endpoint means that it’s
not a full test of all ResourceSync capabilities
● We suspect that we will prefer the Dump to the List
○ Using the List means making HTTP calls for each item in order to do
the content negotiation
○ Dump allows us to just download specifically what we need
○ We will still download records that weren't updated, but given the
typical size of the diff for each provider, this single download
may still be preferable to 100,000 HTTP requests
● Future implementations may require us to build on this
initial harvester if the specifics are different
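The dump-versus-list trade-off above comes down to packaging: a Resource Dump is one packaged file (ResourceSync packages dumps as ZIP) holding many records, so a single transfer replaces one HTTP request per record. A minimal sketch with an in-memory ZIP standing in for a real dump; the member names are illustrative:

```python
# Hedged sketch: unpacking a dump-style ZIP package of records.
# An in-memory ZIP simulates the download; filenames are illustrative.
import io
import zipfile

def read_dump(dump_bytes):
    """Return {member name: content} for every record in a dump package."""
    records = {}
    with zipfile.ZipFile(io.BytesIO(dump_bytes)) as zf:
        for name in zf.namelist():
            records[name] = zf.read(name).decode("utf-8")
    return records

# Simulate a tiny dump: two records arrive in one transfer.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("records/1.nt", "<s1> <p1> <o1> .")
    zf.writestr("records/2.nt", "<s2> <p2> <o2> .")
records = read_dump(buf.getvalue())
```

Note that a real ResourceSync dump also carries a manifest describing its members, which a fuller harvester would consult instead of iterating the archive blindly.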
27. Next Steps
Hyku:
● Possibly support Dump
● Increase test set over
50K
DPLA:
● Harvest from 3 DPLA
providers implementing
ResourceSync by end of
year
28. IIIF & ResourceSync:
Supporting discovery
Mark A. Matienzo, Stanford University Libraries
@anarchivist / https://orcid.org/0000-0003-3270-1306
DPLAFest — Chicago, Illinois — April 20, 2017
29. International Image Interoperability Framework
A community
that develops Shared APIs
implements them in Software
and exposes interoperable Content
http://iiif.io/
30. IIIF Community
http://iiif.io/community
● IIIF Consortium
○ Currently 38 state/national
libraries, universities, museums,
tech firms
○ Provides sustainability and steering
for the initiative
● Wider community
○ 80+ cultural heritage institutions,
companies, and projects using IIIF standards
○ iiif-discuss list = 670+ members
○ IIIF Slack = 300+ members
● Community & Technical
Specification Groups
31. Shared APIs
http://iiif.io/api/
● Image API
○ Transfer image pixels, regions, etc.
○ Image manipulation
● Presentation API
○ Presentation of an object (pixels +
navigation and metadata)
○ Easily share and re-use, mix and
match content
○ Annotate content
● Search API
○ Search annotations
● Authentication API
○ Provide interoperability for
access-restricted content
33. IIIF Content
All kinds of image resources:
artworks, photographs,
manuscripts, newspapers
Investigating AV and 3D
34. “Discovery”
in IIIF
Finding interoperable resources
Two main concerns:
● How can users find IIIF
resources?
● How can users then get those
resources into an environment
where they can use them?
35. Scoping the
problem
What resources
can be discovered?
Types of resources in IIIF:
● Content (Image API)
● Description (Presentation API)
The Image API does not provide
description of image content, just
technical and rights metadata.
Discovery requires Description
resources to provide information
about Content resources.
36. Presentation API
A Manifest provides
just enough metadata
(descriptive, structural,
etc.) to drive a viewer.
A Collection groups
Manifests or other
Collections.
http://iiif.io/api/presentation/2.1/
38. Presentation
API constraints
Informing decisions
The Presentation API does not
include semantic descriptions, but
can reference them using seeAlso.
IIIF (including the Presentation
API) has a resource-centric view of
the web, not a service-centric view
(cf Sitemaps/ResourceSync vs
OAI-PMH).
40. Basic Sitemaps
at NC State
● Example demonstrates use of
plain sitemaps without any
extensions (not even
ResourceSync)
● Intended to expand upon
existing practice of publishing
sitemaps from digital collections
41. Sitemap entry for manifests
<url>
<loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004/manifest</loc>
<lastmod>2016-12-13T15:38:19Z</lastmod>
</url>
Sitemap entry for landing page
<url>
<loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004</loc>
<lastmod>2017-03-27T19:33:52Z</lastmod>
</url>
Sample of NCSU Sitemaps
Courtesy Jason Ronallo, North Carolina State University
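Entries like the ones above already support selective harvesting: a consumer compares each `<lastmod>` against the time of its last harvest and fetches only newer URLs. A minimal sketch, assuming the inline sample mirrors the NCSU entries and a hypothetical cutoff date:

```python
# Sketch of selective harvesting over plain sitemap entries: keep only
# <loc> values whose <lastmod> is newer than the last harvest. The sample
# XML mirrors the NCSU entries above; the cutoff is illustrative.
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004/manifest</loc>
    <lastmod>2016-12-13T15:38:19Z</lastmod>
  </url>
  <url>
    <loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004</loc>
    <lastmod>2017-03-27T19:33:52Z</lastmod>
  </url>
</urlset>"""

def changed_since(sitemap_xml, cutoff):
    """Return <loc> values whose <lastmod> is after the cutoff.
    ISO 8601 UTC timestamps compare correctly as strings."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall(f"{SM}url"):
        loc = url.findtext(f"{SM}loc")
        lastmod = url.findtext(f"{SM}lastmod")
        if lastmod and lastmod > cutoff:
            urls.append(loc)
    return urls

to_harvest = changed_since(SITEMAP, "2017-01-01T00:00:00Z")
```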
42. Prototyping at
Europeana
Exploring Sitemaps and
extensions for discovery of
IIIF resources for harvesting
● Partnership with University
College Dublin and National
Library of Wales
● ResourceSync satisfied key
needs identified within
requirements
● ResourceSync accommodated
additional metadata prototyped
in an IIIF Sitemap Extension
● Follows several synchronization
paradigms
43. Uses Sitemaps and IIIF Extension
<url>
<loc>http://newspapers.library.wales/view/3320640</loc>
<iiif:Manifest xmlns:iiif="http://iiif.io/api/presentation/2/">
http://dams.llgc.org.uk/iiif/newspaper/issue/3320640/manifest.json
</iiif:Manifest>
<dct:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3320639.json</dct:isPartOf>
<lastmod>2014-11-08</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
Example of NLW Sitemap Entry
Courtesy Nuno Freire, Europeana
44. Uses Sitemaps and ResourceSync and DCMES as Extensions
<url>
<loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
<rs:ln rel="alternate" href="https://digital.ucd.ie/view/ucdlib:38491"
type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
<rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488"
type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
<lastmod>2014-08-24T04:09:09.716Z</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
Example of UCD Resource List Entry
Courtesy Nuno Freire, Europeana
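A consumer of entries like the UCD example can discover IIIF resources by following `rs:ln` links whose `dcterms:conformsTo` points at the Presentation API. A minimal sketch; the inline XML mirrors the example above (quotes normalized) with namespaces declared for standalone parsing:

```python
# Sketch of consuming a Resource List entry: collect hrefs of rs:ln
# links that declare conformance to the IIIF Presentation API.
# The inline XML mirrors the UCD example above.
import xml.etree.ElementTree as ET

ENTRY = """<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:rs="http://www.openarchives.org/rs/terms/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
  <rs:ln rel="alternate" href="https://digital.ucd.ie/view/ucdlib:38491"
      type="application/json"
      dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
</url>"""

RS = "{http://www.openarchives.org/rs/terms/}"
DCT = "{http://purl.org/dc/terms/}"

def iiif_links(entry_xml):
    """Return hrefs of rs:ln links that declare the IIIF Presentation API."""
    root = ET.fromstring(entry_xml)
    return [
        ln.get("href")
        for ln in root.findall(f"{RS}ln")
        if ln.get(f"{DCT}conformsTo", "").startswith(
            "http://iiif.io/api/presentation/")
    ]

links = iiif_links(ENTRY)
```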
45. Uses Sitemaps, ResourceSync, and Sitemap Image Extension
Sample of UCD Resource List
Courtesy John Howard, University College Dublin
46. Conclusions
Strengths
● ResourceSync addresses core requirements
for exposing IIIF resources for harvesting
● Can build on publication of existing
sitemaps easily
● Leverages Many-to-One, Selective
Synchronization, and Metadata Harvesting
paradigms
● Can adopt additional extensions to
implement needed features
● Plenty of opportunity to contribute; need
more prototypes
Challenges
● IIIF community’s needs for discovery are
not necessarily what other sitemap
consumers want (e.g. Google)
● Identifying the primary resource influences
structure
● Unclear whether search engines support
custom extensions, and what ranking
impact would be
47. Thank You!
Mark A. Matienzo, Stanford University Libraries
50. Use Case 1: What is CORE?
OA Repositories OA Journals
Mostly OAI-PMH
CORE aggregates and
provides free access to
millions of research
articles aggregated
from thousands of OA
repositories and
journals.
51. Use Case 1: What is CORE?
» Enrichment and
harmonisation of
aggregated data
» Products/services:
› Portal
› API
› Data dumps
› Recommendation
system for libraries
› Repository dashboard
› B2B and analytical
services
52. Use Case 1: What is CORE?
» 70 million+
metadata records
» Over 6 million full
texts hosted on
CORE
» ~1.5 million
monthly active
users
» Aggregating from
2,500 repositories
and 10k OA
journals
54. Use Case 1: Approach
OA Repositories OA Journals
Key publishers
(OA + hybrid OA)
Publisher connector
Mostly OAI-PMH
A range of bespoke APIs
+ many others
Provide seamless access over non-standardised APIs.
What protocol?
55. Use Case 1: Approach
What protocol? » Why not OAI-PMH?
› slow and very inefficient
for big repositories.
› Standardised for
metadata transfer but
not for content transfer.
› Very difficult to
represent the richness of
metadata from a broad
range of data providers.
58. Use Case 2: Subscribing to ResourceSync
OA Repositories OA Journals
Key publishers
(OA + hybrid OA)
Publisher connector
Mostly OAI-PMH
A range of bespoke APIs
ResourceSync
+ many others
» Other aggregators can
subscribe to the Publisher
connector to make use of their
ingestion pipelines and
enrichment technologies
59. Use Case 2: Content ingestion in OpenMinTeD
OA Repositories OA Journals
Key publishers
(OA + hybrid OA)
Publisher connector
ResourceSync
Mostly OAI-PMH
OMTD-SHARE
(over REST)
A range of bespoke APIs
+ many others
» CORE and OpenAIRE are content sources in the OpenMinTeD
TDM platform (EU infrastructure project) being developed to
enable the mining of scholarly literature.
60. Use Case 2: Exposing enriched data for TDM
OA Repositories OA Journals
Key publishers
(OA + hybrid OA)
Publisher connector
ResourceSync
Mostly OAI-PMH
A range of bespoke APIs
+ many others
ResourceSync
» But others want similar solutions … typically, they want to be
able to sync and host the data.
62. Use Case 3: Replace OAI-PMH with ResourceSync
OA Repositories OA Journals
Key publishers
(OA + hybrid OA)
Publisher connector
ResourceSync
Mostly OAI-PMH
OMTD-SHARE
(over REST)
A range of bespoke APIs
+ many others
ResourceSync
ResourceSync
» Will be a game changer …
» Advocated by COAR Next
Generation Repositories WG
67. An overview of capabilities and real-world use cases for discovery,
harvesting, and synchronization of resources on the web
http://www.openarchives.org/rs #resourcesync
ResourceSync
ANSI/NISO Z39.99-2017
@mart1nkle1n @G_AmSpinnrade @anarchivist @petrknoth