These slides were presented at the LITA Forum,
Louisville, Kentucky, November 10 2013
The most recent version of the slides is available at
http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial
ResourceSync Tutorial
DANS, January 21 2014, Den Haag, Netherlands
ResourceSync Tutorial History
• First outing: OAI8, June 2013
• Second run: Open Repositories, July 2013
• Third run: JCDL, July 2013
• Fourth run: TPDL 2013, September 2013
• Fifth run: LITA Forum, November 2013
• Sixth run: SWIB 2013, November 2013
Presenter
Herbert Van de Sompel
Los Alamos National Laboratory
<hvdsomp@gmail.com>
@hvdsomp
ResourceSync Tutorial Contributors
Martin Klein
Los Alamos National Laboratory
<martinklein0815@gmail.com>
@mart1nkle1n
Herbert Van de Sompel
Los Alamos National Laboratory
<hvdsomp@gmail.com>
@hvdsomp
Robert Sanderson
Los Alamos National Laboratory
<azaroth24@gmail.com>
@azaroth24
Simeon Warner
Cornell University
<simeon.warner@cornell.edu>
@zimeon
Michael L. Nelson
Old Dominion University
<mln@cs.odu.edu>
@phonedude_mln
Richard Jones
Cottage Labs
<richard@cottagelabs.com>
@cottagelabs
OAI
Herbert Van de Sompel
Martin Klein
Robert Sanderson
(Los Alamos National Laboratory)
Simeon Warner
(Cornell University)
NISO
Todd Carpenter
Nettie Lagace
University of Oxford
Graham Klyne
Bernhard Haslhofer
(University of Vienna)
Michael L. Nelson
(Old Dominion University)
Lyrasis
Peter Murray
Carl Lagoze
(University of Michigan)
ResourceSync Technical Group
David Rosenthal (LOCKSS)
Shlomo Sanders (Ex Libris Inc.)
Paul Walk, Richard Jones, Graham Klyne, Stuart Lewis (JISC)
Christian Sadilek (RedHat)
Jeff Young (OCLC)
Kevin Ford (Library of Congress)
Timeline, Status of Specification(s)
• August 2013
o Release of ResourceSync framework Core specification - Version 0.9.1
o Public draft of ResourceSync Archives specification released
• September 2013
o Core specification on its way to become an ANSI standard
• November 2013
o Internal draft of ResourceSync Notification specification
• January 2014
o Public draft of ResourceSync Notification specification
• Mid 2014
o Core specification becomes ANSI/NISO standard
Papers
• Klein, M., and Van de Sompel, H. (2013) Extending Sitemaps for ResourceSync. http://arxiv.org/abs/1305.4890 ACM/IEEE JCDL 2013
• Haslhofer, B., Warner, S., Lagoze, C., Klein, M., Sanderson, R., Nelson, M.L., and Van de Sompel, H. (2013) ResourceSync: Leveraging Sitemaps for Resource Synchronization. http://arxiv.org/abs/1305.1476 WWW 2013 Developer Track
• Klein, M., Sanderson, R., Van de Sompel, H., Warner, S., Haslhofer, B., Lagoze, C., and Nelson, M.L. (2013) A Technical Framework for Resource Synchronization. http://dx.doi.org/10.1045/january2013-klein D-Lib Magazine.
• Van de Sompel, H., Sanderson, R., Klein, M., Nelson, M.L., Haslhofer, B., Warner, S., and Lagoze, C. (2012) A Perspective on Resource Synchronization. http://dx.doi.org/10.1045/september2012-vandesompel D-Lib Magazine.
ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual
Approach
2. Motivation & Use Cases
3. Framework Walkthrough
4. Framework (Technical) Details
5. Implementation
6. Q&A
ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual
Approach
Synchronize What?
• Web resources
o things with a URI that can be dereferenced
• Focus on needs of research communication and cultural heritage
organizations
o but aim for generality
Synchronize What?
• Small websites/repositories (a few resources) to large
repositories/datasets/linked data collections (many millions of
resources)
Synchronize What?
• Low change frequency (weeks/months) to high change
frequency (seconds)
Synchronize What?
• Synchronization latency and accuracy needs may vary
Why?
… because lots of projects and services are doing synchronization
but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this
• Experience with OAI-PMH: widely used in repos but
o XML metadata only
o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?
ResourceSync Problem
• Consideration:
• Source (server) A has resources that change over time: they
get created, modified, deleted
• Destination (servers) X, Y, and Z leverage (some)
resources of Source A.
• Problem:
• Destinations want to keep in step with the resource changes
at Source A: resource synchronization.
• Goal:
• Design an approach for resource synchronization aligned
with the Web Architecture that has a fair chance of adoption
by different communities.
• The approach must scale better than recurrent HTTP
HEAD/GET on resources.
Source: Core Synchronization Capabilities
PULL
1. Describing content – publish a list of resources available for
synchronization to enable Destinations to perform an initial load
or catch-up with a Source
2. Packaging content – bundle resources to enable bulk download
by destinations
3. Describing changes – publish a list of resource changes to
enable destinations to stay synchronized and decrease latency
4. Packaging changes – bundle resource changes for bulk
download by destinations
Source: Notifications Capabilities
To reduce synchronization latency and to optimize the synchronization
process the Source can support:
PUSH
1. Change Notification
• Notifies about changes to particular resources
• e.g., resource A has been updated | created | deleted
2. Framework Notification
• Notifies about changes to capabilities i.e., their documents
• e.g., a Change List has been updated | created | deleted
Source: Archival Capabilities
ARCHIVES
The Source may hold on to historical data, for example, to allow
Destinations to catch up with events they missed or revisit prior
resource states. To this end, the Source can publish archives, i.e.
documents that enumerate historical capability documents
1. Resource List Archive
2. Resource Dump Archive
3. Change List Archive
4. Change Dump Archive
Source: Synchronization Features
1. Discovery of capabilities – support Destinations in discovering
all offered capabilities
o Applies to PULL, PUSH, ARCHIVES capabilities
2. Linking to related resources – provide links from resources
subject to synchronization to related resources
o Applies to PULL, PUSH capabilities
Destination: Synchronization Needs
1. Baseline synchronization – A destination must be able to
perform an initial load or catch-up with a source
- avoid out-of-band setup
2. Incremental synchronization – A destination must have some
way to keep up-to-date with changes at a source
- subject to some latency; minimal: create/update/delete
- allow catch-up after the destination has been offline
3. Audit – A destination should be able to determine whether it is
synchronized with a source
- regarding coverage and accuracy
ResourceSync - Agenda
2. Motivation & Use Cases
Use Cases – The Basics
[Figure: use case diagrams a) and b)]
Use Cases – The Basics
[Figure: use case diagrams c) and d)]
Use Cases – The not-so-Basics
[Figure: use case diagrams e) and f)]
Use Case 1: arXiv Mirroring and Data Sharing
• Repository of scholarly articles in
physics, mathematics, computer
science, etc.
• > 850k articles
• approx. 1.5 revisions per article on
average
• approx. 75k new articles per year
• Each article has full-text and separate
metadata record
• approx. 3.8M resources
Use Case 1: arXiv Mirroring and Data Sharing
• 2,700 updates daily
o at 8pm EST
o Currently using homebrew mirroring
solution (running with minor
modifications since 1994!)
o occasional rsync (file system-specific, auth issues)
Use Case 1: arXiv Mirroring
• GOAL: Keep mirror sites synchronized with daily changes
• WANT:
o high consistency
o moderate latency
o robustness to global network outages (low admin effort)
o ability to verify sync status in case of questions
Use Case 1: arXiv Data Sharing
• GOAL: Make resources and update information publicly available so that
any other service may synchronize at the frequency it needs, e.g.
o Math Front at UC Davis
o EprintWeb from IOP in UK
o Data for bibliometric and scientometric analysis
• WANT:
o low admin effort (i.e. standard approach, standard tools)
o reasonable consistency, latency, efficiency
Use Case 2: DBpedia Live Duplication
• Average of 2 updates per second
• Low latency desirable => need for a push technology
Use Case 2: DBpedia Live Duplication
• Daily traffic:
o 99% updates
o 0.6% deletions
o 0.03% creations
Use Case 2: DBpedia Live Duplication
• Number of content transfer events in two 8-hour intervals
• Max. queue size of remote duplication process
ResourceSync - Agenda
3. Framework Walkthrough
Source Capability 1: Describing Content
In order to advertise the resources that a source wants destinations
to know about, it may describe them:
o Publish a Resource List, a list of resource URIs and possibly
associated metadata
- Destination GETs the Resource List
- Destination GETs listed resources by their URI
o A Resource List describes the state of a set of resources at
one point in time (snapshot)
Source Capability 2: Packaging Content
By default, content is transferred in response to a GET issued by a
destination against a URI of a source’s resource. But a source may
support additional mechanisms:
o Publish a Resource Dump, a document that points to
packages of resource representations and necessary
metadata
- Destination GETs the package
- Destination unpacks the package
- ZIP format supported
o A Resource Dump and the packages it points to reflect the
state of a set of resources at one point in time (snapshot)
Source Capability 3: Describing Changes
In order to achieve lower latency and/or greater efficiency, a source
may communicate about changes to its resources:
o Publish a Change List, a list of recent change events
(created, updated, deleted resource)
- Destination acts upon change events, e.g. GETs
created/updated resources, removes deleted resources.
o A Change List pertains to resources that changed in a
temporal interval with a start- and an end-date
- If a resource changed more than once, it will be listed
more than once
Source Capability 4: Packaging Changes
In order to reduce the number of requests to obtain resource
changes, a source may provide packaged bitstreams for changed
resources:
o Publish a Change Dump, a document that points to
packages containing bitstreams of recently changed
resources and necessary metadata
- Destination GETs the package
- Destination unpacks the package
- ZIP format supported
o A Change Dump and its packages pertain to resources that
changed in a temporal interval with a start- and an end-date
- If a resource changed more than once, it will be included
more than once
A Framework Based on Sitemaps
• Modular framework allowing selective deployment
• Sitemap is the core format throughout the framework
o Introduce extension elements and attributes:
- In ResourceSync namespace (rs:) to
accommodate synchronization needs
o Reuse Sitemap format for all capability documents:
Resource List, Resource Dump, Change
List, Change Dump, as well as for manifest in
Dumps
o Utilize Sitemap index format where
needed/allowed
Resource Metadata Summary
Element/Attribute | Description | Defined by
<loc> | Resource URI (identity) | sitemaps
<lastmod> | Timestamp of last change | sitemaps
<changefreq> | Expected update frequency | sitemaps
<rs:md> | | ResourceSync
change | Change type (Change List & Change Dump Manifest only) | ResourceSync
encoding | HTTP Content-Encoding header value | RFC2616
hash | One or more content digests (md5, sha-1, sha-256) | Atom Link Ext.
length | HTTP Content-Length header value | RFC4287
path | Path in ZIP package (Dump Manifests only) | ResourceSync
type | HTTP Content-Type header value | RFC4287
Related Resource Metadata Summary
• Attributes of the <rs:ln> element; c.f. resource metadata + pri
Element/Attribute | Description | Defined by
<rs:ln> | | ResourceSync
encoding | HTTP Content-Encoding header value | RFC2616
hash | One or more content digests (md5, sha-1, sha-256) | Atom Link Ext.
href | Related resource URI (identity) | RFC4287
length | HTTP Content-Length header value | RFC4287
modified | Timestamp of last change (c.f. <lastmod>) | Atom Link Ext.
path | Path in ZIP package (Dump Manifests only) | ResourceSync
pri | Priority of link | RFC6249
rel | Relation - IANA registered or URI | RFC4287
type | HTTP Content-Type header value | RFC4287
Link Relation Summary
Relation | Use in ResourceSync | Defined in
rel="alternate" | Link from generic to specific URI | HTML 5
rel="canonical" | Link from specific to generic URI | RFC6596
rel="collection" | Resource is member of collection | RFC6573
rel="contents" | Link from dump to manifest | HTML4
rel="describedby" | Has metadata | Protocol for Web Description Resources (POWDER): Description Resources
rel="describes" | Is metadata for | The 'describes' Link Relation Type
rel="duplicate" | Mirror or alternative copy | RFC6249
rel=".../rs/terms/patch" | A patch -- efficient change information | This specification
rel="memento" | Link to time-specific URI | Memento Internet Draft
rel="timegate" | Link to timegate | Memento Internet Draft
rel="via" | Provenance chain, came from | RFC4287
ResourceSync Sitemap Validation
• All ResourceSync capability documents are valid according to
the Sitemap XML Schema
o http://www.sitemaps.org/schemas/sitemap/0.9
• For a more thorough validation use the ResourceSync XML
Schema
o http://www.openarchives.org/rs/0.9.1/resourcesync.xsd
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Core synchronization capabilities (PULL)
3. Discovery
4. Linking to related resources
5. Notification Capabilities (PUSH)
6. Archival capabilities (ARCHIVES)
http://www.openarchives.org/rs/resourcesync
Describing Content: Resource List
http://www.openarchives.org/rs/resourcesync#DescResources
Resource List
• Describe Source’s resources that are subject to synchronization
• At one point in time (snapshot)
• Creation can take some time – duration can be conveyed
• Typical Destination use: Baseline Synchronization, Audit
• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @at to determine freshness
• [@at, @completed] – interval of uncertainty
• Destination issues GETs against URIs to obtain resources
• Very similar to current Sitemaps
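A minimal sketch of how a Destination might read a Resource List for baseline synchronization. The sample document, its URIs, and the function name parse_resource_list are illustrative assumptions, not taken from the specification; only the Sitemap and rs: namespaces and the @at attribute come from the slides.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

# Hypothetical Resource List, modelled on the Sitemap-based format above.
RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return (at, [(uri, lastmod), ...]) for a Resource List document."""
    root = ET.fromstring(xml_text)
    md = root.find(f"{{{RS}}}md")          # document-level rs:md
    at = md.get("at") if md is not None else None
    entries = []
    for url in root.findall(f"{{{SM}}}url"):
        loc = url.findtext(f"{{{SM}}}loc")
        lastmod = url.findtext(f"{{{SM}}}lastmod")
        entries.append((loc, lastmod))
    return at, entries


at, entries = parse_resource_list(RESOURCE_LIST)
```

A Destination would then issue a GET against each listed URI and use the @at value to judge how fresh the snapshot is.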
What if I have a million resources?
• Current sitemap limit is 50k resources (or maximum document
size of 50MB)
• Break complete list of resources into 50k-resource chunks, each
on a Resource List document
• Create a Resource List Index document to group them:
o Based on <sitemapindex>
o May have up to 50k component Resource Lists
o Extends capacity to 2,500,000,000 resources within current
community practices
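The arithmetic behind these limits can be sketched as follows. The constant and function names are illustrative assumptions; the 50k limits are the ones stated on the slide.

```python
MAX_URIS_PER_LIST = 50_000      # Sitemap limit per document
MAX_LISTS_PER_INDEX = 50_000    # limit on entries in a <sitemapindex>


def lists_needed(resource_count):
    """Number of Resource List documents needed for resource_count URIs."""
    return (resource_count + MAX_URIS_PER_LIST - 1) // MAX_URIS_PER_LIST


def chunk_uris(uris, size=MAX_URIS_PER_LIST):
    """Split a full URI list into chunks, one per Resource List document."""
    return [uris[i:i + size] for i in range(0, len(uris), size)]


# Capacity of one Resource List Index under these limits.
index_capacity = MAX_URIS_PER_LIST * MAX_LISTS_PER_INDEX
```

For a million resources, lists_needed(1_000_000) gives 20 Resource Lists grouped under a single Resource List Index.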
Resource List Index <resourcelist_index.xml>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcelist"
at="2013-01-02T09:00:02Z"/>
<sitemap>
<loc>http://example.com/resourcelist1.xml</loc>
<rs:md type="application/xml"/>
</sitemap>
<sitemap>
<loc>http://example.com/resourcelist2.xml</loc>
<rs:md type="application/xml"/>
</sitemap>
</sitemapindex>
Resource Dump
• A Resource Dump points to packages (ZIP files) that contain
representations of the Source’s resources
• At one point in time (snapshot)
• Resource Dump is mandatory, even if there is only one ZIP file
• ZIP package contains manifest, listing contained bitstreams
• Typical Destination use: Baseline Synchronization, bulk
download
• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @at to determine freshness
• [@at, @completed] – interval of uncertainty
• GETs against individual URIs from Resource List achieve the
same result (ignoring varying freshness)
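A rough sketch of unpacking a ZIP package by following its manifest. It assumes the manifest is stored as manifest.xml at the top level of the package and that each entry's rs:md@path names the bitstream inside the ZIP; the package contents and function names are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

# Hypothetical manifest for a one-resource package.
MANIFEST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" at="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md path="/resources/res1" type="text/html"/>
  </url>
</urlset>"""


def build_example_package():
    """Create an in-memory ZIP standing in for a downloaded package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("manifest.xml", MANIFEST)
        z.writestr("resources/res1", "<html>res1</html>")
    return buf.getvalue()


def unpack(package_bytes):
    """Map each resource URI in the manifest to its bitstream."""
    resources = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        root = ET.fromstring(z.read("manifest.xml"))
        for url in root.findall(f"{{{SM}}}url"):
            loc = url.findtext(f"{{{SM}}}loc")
            md = url.find(f"{{{RS}}}md")
            path = md.get("path") if md is not None else None
            if path:
                # ZIP member names carry no leading slash.
                resources[loc] = z.read(path.lstrip("/"))
    return resources


resources = unpack(build_example_package())
```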
Describing Changes: Change List
http://www.openarchives.org/rs/resourcesync#DesChanges
Open Change List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-02T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
</urlset>
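The hash and length attributes let a Destination check what it fetched. A minimal sketch, assuming the hex-encoded md5 form shown in the example above; the function name is illustrative and only md5 digests are checked here.

```python
import hashlib


def matches_metadata(content, hash_attr, length_attr):
    """Check fetched bytes against rs:md@length and an md5 entry in rs:md@hash."""
    if length_attr is not None and len(content) != int(length_attr):
        return False
    # The hash attribute may carry several space-separated digests.
    for digest in (hash_attr or "").split():
        algorithm, _, expected = digest.partition(":")
        if algorithm == "md5":
            return hashlib.md5(content).hexdigest() == expected.lower()
    # No md5 digest present: nothing further to verify in this sketch.
    return True
```

For the entry above, a Destination would call this with the body of http://example.com/res1 and the values hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" and length="8876".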
Change List
• A Change List pertains to a Source’s resources that changed
• Changes that occurred during a temporal interval with start- and end-date
• Typical Destination use: Incremental Synchronization, Audit
• Changes are listed in chronological order
• Multiple changes to one resource result in the resource being
listed multiple times, once per change
• Source determines duration of temporal interval
• Destinations use @from and @until to determine freshness
• Destinations issue GETs against URIs to obtain changed
resources
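Incremental synchronization from a Change List can be sketched like this. The sample document and the dictionary that stands in for the Destination's local copy are assumptions; a real Destination would GET created/updated resources rather than only record their timestamps.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

# Hypothetical Change List; res1 changes twice and so is listed twice.
CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z" until="2013-01-02T17:00:00Z"/>
  <url><loc>http://example.com/res1</loc>
       <lastmod>2013-01-02T13:00:00Z</lastmod><rs:md change="updated"/></url>
  <url><loc>http://example.com/res2</loc>
       <lastmod>2013-01-02T14:00:00Z</lastmod><rs:md change="created"/></url>
  <url><loc>http://example.com/res1</loc>
       <lastmod>2013-01-02T15:00:00Z</lastmod><rs:md change="updated"/></url>
  <url><loc>http://example.com/res3</loc>
       <lastmod>2013-01-02T16:00:00Z</lastmod><rs:md change="deleted"/></url>
</urlset>"""


def apply_change_list(xml_text, state):
    """Apply changes in document (chronological) order to a URI -> lastmod map."""
    new_state = dict(state)
    root = ET.fromstring(xml_text)
    for url in root.findall(f"{{{SM}}}url"):
        loc = url.findtext(f"{{{SM}}}loc")
        md = url.find(f"{{{RS}}}md")
        change = md.get("change") if md is not None else None
        if change in ("created", "updated"):
            new_state[loc] = url.findtext(f"{{{SM}}}lastmod")
        elif change == "deleted":
            new_state.pop(loc, None)
    return new_state


before = {
    "http://example.com/res1": "2013-01-01T00:00:00Z",
    "http://example.com/res3": "2013-01-01T00:00:00Z",
}
after = apply_change_list(CHANGE_LIST, before)
```

Because entries are processed in order, the later update to res1 wins over the earlier one.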
Change Dump
• A Change Dump points at packages (ZIP files) that contain
bitstreams of the Source’s resources that changed
• Changes that occurred during a temporal interval with start- and end-date
• Change Dump is mandatory, even if there is only one ZIP file
• ZIP package contains manifest, listing contained bitstreams
• Typical Destination use: Incremental Synchronization, bulk
download of changes
• Changes in Change Dump Manifest listed in chronological order
• Same URI can be listed multiple times
• Might be expensive to generate
• Destinations use @from and @until to determine freshness
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Core synchronization capabilities (PULL)
3. Discovery
4. Linking to related resources
5. Notification Capabilities (PUSH)
6. Archival capabilities (ARCHIVES)
http://www.openarchives.org/rs/resourcesync#Discovery
Discovery of Capabilities
Requirements:
• Need to discover capabilities, i.e. Resource List, Resource
Dump, Change List, Change Dump, Archives, Notification
channels
• Need to know the type of capability each document
represents.
Approach:
• The Source publishes a Capability List that enumerates the
capabilities it supports.
• By pointing at Resource List, Change List, Resource Dump,
etc. using appropriate relation types, e.g. “resourcelist”,
“changelist”, “resourcedump” etc.
http://www.openarchives.org/rs/resourcesync#CapabilityList
Discovery of Capability Lists
Requirements:
• Need to discover a Capability List
Approaches:
• Introduce a link in the HTTP Link header of a resource that is
subject to synchronization, pointing at the Capability List with the
relation type “resourcesync”
• Introduce a link from an HTML document that is subject to
synchronization (<head> section), pointing at the Capability List
with the relation type “resourcesync”
• Link from a Resource List, etc. to the Capability List with the
relation type “up”
Link header on example.com/res1.pdf
Link: <http://example.com/dataset1/capabilitylist.xml>; rel="resourcesync"
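A small sketch of how a client could pick the Capability List URI out of a Link header value. The parser is deliberately simplified (it does not handle commas inside quoted parameter values), and the function name is an assumption.

```python
import re

# One link-value: <target> followed by zero or more ;param=value pairs.
LINK_VALUE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def find_capability_list(link_header):
    """Return the target of the first link with rel="resourcesync", or None."""
    for target, params in LINK_VALUE.findall(link_header):
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            rels = value.strip().strip('"').split()
            if name.strip().lower() == "rel" and "resourcesync" in rels:
                return target
    return None


header = ('<http://example.com/about>; rel="describedby", '
          '<http://example.com/dataset1/capabilitylist.xml>;rel="resourcesync"')
capability_list_uri = find_capability_list(header)
```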
Discovery: Source Description
Requirements:
• Support for multiple Capability Lists, one per “set of
resources”
• Need to discover these Capability Lists
• Need descriptive information about each set of resources
that a Capability List pertains to
• Useful to have descriptive information about the Source itself
Approach:
• The Source Description document meets these requirements.
• It should be at a particular location to avoid having registries:
http://(hostname)/.well-known/resourcesync
• It can be linked to from the Capability Lists as well.
http://www.openarchives.org/rs/resourcesync#SourceDesc
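Because the Source Description sits at a fixed location, a client can derive its URI from any resource URI on the same host. A minimal sketch; the function name is illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

WELL_KNOWN_PATH = "/.well-known/resourcesync"


def source_description_uri(resource_uri):
    """Build the well-known Source Description URI for a resource's host."""
    parts = urlsplit(resource_uri)
    # Keep scheme and host, drop the original path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, WELL_KNOWN_PATH, "", ""))


uri = source_description_uri("http://example.com/dataset1/res1.pdf?download=1")
```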
Discovery via robots.txt
• Resource Lists are (enhanced) Sitemaps
• Sitemaps can be discovered via robots.txt
• Ergo, Resource Lists should be discoverable via robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
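A sketch of the robots.txt route from the client side: collect the URIs on Sitemap lines, which may then turn out to be Resource Lists. The function name is an assumption and the parsing is intentionally simple.

```python
ROBOTS_TXT = """User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
"""


def sitemap_uris(robots_txt):
    """Return the URIs given on Sitemap lines of a robots.txt file."""
    uris = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments
        field, sep, value = line.partition(":")   # split at the first colon only
        if sep and field.strip().lower() == "sitemap":
            uris.append(value.strip())
    return uris


candidates = sitemap_uris(ROBOTS_TXT)
```

A client would still need to fetch each candidate and inspect its rs:md capability attribute to confirm it is a Resource List.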
ResourceSync - Agenda
4. Framework (Technical) Details
4. Linking to related resources
http://www.openarchives.org/rs/resourcesync#LinkRelRes
Supported Linking Use Cases
Provide links to related resources to address specific resource
synchronization needs.
1. Mirrored content with multiple download locations
2. Alternate representations of the same content
3. Patching content rather than replacing it
4. Resources and metadata about resources
5. Prior versions of resources
6. Collection membership of resources
7. Republishing synchronized resources
All cases are handled with a <rs:ln> element referring to the linked
resource
Notes about Linked Resources
Some important things to keep in mind about linked resources:
• They may also be subject to synchronization
• They may be updated on a very different schedule than the
resources that link to them
• Therefore, it is recommended to convey metadata about the
linked resource too
• Links can be bi-directional – the linked resource can link back to
the linking resource
Linking #1 - Mirror
1. Content with multiple download locations
This may be of interest for:
• Content distribution networks
• Mirror sites
• Backup locations
• Load balancing
http://www.openarchives.org/rs/0.9.1/resourcesync#MirCon
Linking #2 – Alternate Representations
2. Alternate representations of the same content
This may be of interest for:
• Resources subject to HTTP content negotiation
• Format migration for preservation reasons
• Different clients wanting different formats
• Multiple languages of the content
http://www.openarchives.org/rs/resourcesync#AltRep
Linking #3 – Patching Content
3. Patching content rather than replacing it
This may be of interest when:
• Resources are very large and server wishes to conserve
bandwidth where possible
• Changes are frequent and small
• Changes are managed in a CMS that tracks differences
Need:
• Machine processable format to describe a change in a
manner that allows patching a representation
• Existing or newly defined by communities
http://www.openarchives.org/rs/resourcesync#PatchCon
Linking #4 – Metadata about Resources
4. Resources and metadata about resources
This may be of interest when:
• Resources have associated descriptive metadata records,
which are useful for understanding the resource
• Such as cultural heritage images, audio, video
• Resources that have associated technical, administrative,
rights metadata
http://www.openarchives.org/rs/resourcesync#ResMDLinking
Linking #5 – Prior Versions of Resources
This may be of interest when:
• A Destination needs to have a copy of all versions of a
resource
http://www.openarchives.org/rs/resourcesync#ResVers
URI for Original, URI for Version
Web Archive
URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/
URI-R - http://www.cnn.com/
URI for Original, URI for Version
CMS
URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
URI-R - http://en.wikipedia.org/wiki/September_11_attacks
Memento Time Travel extension for Chrome
Download extension at http://bit.ly/memento-for-chrome
Linking #6 – Collection Membership
6. Collection membership of resources
This may be of interest when:
• Resources are part of OAI-ORE aggregations
• Resources are part of OAI-PMH sets
• To indicate any other type of collections of resources
Collections are named with URIs and can then be linked to with
rel=“collection”
• Nice if the collection URI resolves to a useful description
http://www.openarchives.org/rs/resourcesync#ColMem
Linking #7 – Republishing Resources
7. Republishing synchronized resources
This may be of interest when:
• Aggregator systems harvest resources from Sources and
then republish them at new URIs
Examples include Blog republishing, content distribution networks,
mirrored or combined collections
Hypothetical scenario: Lots of little museums with small collections,
and a large European/American aggregating digital library system
that wants to provide fast, combined access to the content (with
permission)
http://www.openarchives.org/rs/resourcesync#RePub
Linking #7 – Republishing Resources #1
• Original Source publishes information about a changed resource
via a Change List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-03T00:00:00Z"/>
<url>
<loc>http://original.example.com/res1</loc>
<lastmod>2013-01-03T07:00:00Z</lastmod>
<rs:md change="updated"/>
</url>
</urlset>
Linking #7 – Republishing Resources #2
• Aggregator 1 republishes information about the changed
resource with reference to the original Source
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-03T11:00:00Z"/>
<url>
<loc>http://aggregator1.example.com/res1</loc>
<lastmod>2013-01-03T20:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="via"
modified="2013-01-03T07:00:00Z"
href="http://original.example.com/res1"/>
</url>
</urlset>
Linking #7 – Republishing Resources #3
• Aggregator 2 ditto
• Caution when republishing links, need to make sure they are still
appropriate from an aggregator’s perspective
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="changelist"
from="2013-01-03T12:00:00Z"/>
<url>
<loc>http://aggregator2.example.com/res1</loc>
<lastmod>2013-01-04T09:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="via"
modified="2013-01-03T07:00:00Z"
href="http://original.example.com/res1"/>
</url>
</urlset>
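A Destination harvesting from an aggregator can trace each republished resource back to its origin by reading the rel="via" links. A minimal sketch; the sample document mirrors the aggregator example above, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

AGGREGATOR_CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-03T12:00:00Z"/>
  <url>
    <loc>http://aggregator2.example.com/res1</loc>
    <lastmod>2013-01-04T09:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="via"
           modified="2013-01-03T07:00:00Z"
           href="http://original.example.com/res1"/>
  </url>
</urlset>"""


def via_links(xml_text):
    """Map each listed URI to (origin URI, origin modification time)."""
    origins = {}
    root = ET.fromstring(xml_text)
    for url in root.findall(f"{{{SM}}}url"):
        loc = url.findtext(f"{{{SM}}}loc")
        for ln in url.findall(f"{{{RS}}}ln"):
            if ln.get("rel") == "via":
                origins[loc] = (ln.get("href"), ln.get("modified"))
    return origins


origins = via_links(AGGREGATOR_CHANGE_LIST)
```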
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Core synchronization capabilities (PULL)
3. Discovery
4. Linking to related resources
5. Notification Capabilities (PUSH)
6. Archival capabilities (ARCHIVES)
http://www.openarchives.org/rs/notification
Motivation for Notifications
• Reduce synchronization latency by having the Source push out
resource change information
o To avoid continuous pull of Change Lists by Destinations
• Share information about changes to the Source’s
ResourceSync implementation, e.g. announcement of new
Resource List, new Capability List, etc.
o To avoid continuous polling of e.g. Resource Lists,
ResourceSync Description
Source: Notifications Capabilities
PUSH
1. Change Notification
• Notifies about changes to particular resources
• e.g., resource A has been updated | created | deleted
2. Framework Notification
• Notifies about changes to capabilities i.e., their documents
• e.g., a Change List has been updated | created | deleted
• Also for Capability Lists and Source Description
Notifications Channels
• Notification sent via channels
o Resource Notification: one channel per set of resources
o Framework Notification: one channel per set of resources
o Sent on level of capability document, not on index-level
o Notifications about changes to Source Description sent on all
Framework Notification channels
• Payload for notifications: <urlset> documents
• Transport protocol for notifications:
o PubSubHubbub - https://pubsubhubbub.googlecode.com/git/pubsubhubbub-core-0.4.html - current choice
o WebSockets - http://tools.ietf.org/html/rfc6455 – may be added
later
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Core synchronization capabilities (PULL)
3. Discovery
4. Linking to related resources
5. Notification Capabilities (PUSH)
6. Archival capabilities (ARCHIVES)
http://www.openarchives.org/rs/archives
Source: Archival Capabilities (ARCHIVES)
The Source may hold on to historical data, for example, to allow
Destinations to catch up with events they missed or revisit prior
resource states. To this end, the Source can publish archives, i.e.
documents that enumerate historical capability documents
1. Resource List Archive
2. Resource Dump Archive
3. Change List Archive
4. Change Dump Archive
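Sketch: catching up from a Change List Archive
A hedged Python sketch of how a Destination might use an archive to catch up on missed changes. It assumes the archive is a <sitemapindex> whose <sitemap> entries point at historical Change Lists and carry from/until attributes on rs:md, and that all timestamps are UTC in the same format so they compare as strings. The sample document, function name and example.com URIs are illustrative only.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical archive document; the structure is an assumption.
SAMPLE_ARCHIVE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist-archive"/>
  <sitemap>
    <loc>http://example.com/changelist1.xml</loc>
    <rs:md capability="changelist" from="2013-01-01T00:00:00Z" until="2013-01-02T00:00:00Z"/>
  </sitemap>
  <sitemap>
    <loc>http://example.com/changelist2.xml</loc>
    <rs:md capability="changelist" from="2013-01-02T00:00:00Z" until="2013-01-03T00:00:00Z"/>
  </sitemap>
  <sitemap>
    <loc>http://example.com/changelist3.xml</loc>
    <rs:md capability="changelist" from="2013-01-03T00:00:00Z" until="2013-01-04T00:00:00Z"/>
  </sitemap>
</sitemapindex>"""


def change_lists_to_catch_up(archive_xml, last_sync):
    """Return URIs of archived Change Lists that may hold changes after last_sync."""
    root = ET.fromstring(archive_xml)
    needed = []
    for entry in root.findall("sm:sitemap", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        md = entry.find("rs:md", NS)
        until = md.get("until") if md is not None else None
        # Keep lists that end after the last sync, or that have no end time.
        if until is None or until > last_sync:
            needed.append(loc)
    return needed


print(change_lists_to_catch_up(SAMPLE_ARCHIVE, "2013-01-02T12:00:00Z"))
```

A Destination that last synchronized at noon on January 2 would skip the first Change List and process the remaining two in order.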
The Metadata Harvesting Use Case
1. Identification of metadata records within a service
2. Use of standards in metadata formats
3. Incremental updates
4. Create, Update, Delete
5. Sets
The Metadata Harvesting Use Case
1. Identification of metadata records within a service
ResourceSync does not specifically care about metadata records, only
resources. It is up to the server to identify which of those resources
are metadata.
2. Use of standards in metadata formats
We are free to annotate a resource's entry with appropriate metadata
to indicate the format.
The Metadata Harvesting Use Case
3. Incremental updates
ResourceSync publishes changes as static documents. The client is
then free to walk up and down the change lists provided by the server.
4. Create, Update, Delete
All resources that can be obtained from a change list will be annotated
with the kind of change that happened to them.
5. Sets
ResourceSync allows the server to publish lists of resources and
changes, and indexes of those lists, all annotated with metadata.
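Sketch: applying a Change List on the client
A minimal Python sketch of points 3 and 4 above: a client reads a Change List and applies each entry to its local inventory according to the annotated kind of change. The sample document, the example.com URIs, the function name and the dict-based inventory are illustrative assumptions; a real client would also fetch the created or updated resources.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Change List with one entry per kind of change.
SAMPLE_CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"/>
  <url>
    <loc>http://example.com/record/1</loc>
    <lastmod>2014-01-20T10:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>http://example.com/record/2</loc>
    <lastmod>2014-01-20T11:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/record/3</loc>
    <lastmod>2014-01-20T12:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>"""


def apply_change_list(change_list_xml, inventory):
    """Update inventory (dict of uri -> lastmod) in place and return it."""
    root = ET.fromstring(change_list_xml)
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if change in ("created", "updated"):
            # A real client would GET the resource here before recording it.
            inventory[uri] = lastmod
        elif change == "deleted":
            inventory.pop(uri, None)
    return inventory


local = {
    "http://example.com/record/2": "2014-01-01T00:00:00Z",
    "http://example.com/record/3": "2014-01-01T00:00:00Z",
}
print(apply_change_list(SAMPLE_CHANGE_LIST, local))
```

Because Change Lists are static documents, the same function can be run over several of them in chronological order to replay a longer period.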
Generating Documents
1. Initialise
Creates initial Capability List and Resource List documents
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -i
2. Update
Creates a new Change List which covers the period since the last Change List
was created
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -u
3. Rebase
A combination of both Initialise and Update.
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -r
Usage of Resources by clients
URLs
• Stable identifiers for archived items
• Stable identifiers for unarchived items
• Stable identifiers for metadata resources (in their various formats)
• Stable identifiers for previous versions?
Provenance
• History of changes to an item/bitstream
• Item/bitstream deletions (vs withdraw)
• Bitstream create/update dates
• Item create/update dates
Versioning
• Access of previous versions of both metadata and bitstreams?
• Stable identifiers for previous versions of both metadata and bitstreams?
Metadata Resources
• Metadata in a variety of formats
• Metadata as file/bitstream
Admin Files
• ResourceSync documents (Resource Lists, Change Lists, etc.)
• ResourceSync exports - Resource Dumps, Change Dumps
• Metadata exports in a number of formats
Scheduled Tasks
• Regular generation of RS documents
Complex Objects
• Item/bitstream relationships
• Collections of content
Get the software!
DSpace module:
https://github.com/CottageLabs/DSpaceResourceSync
depends on the common Java library:
https://github.com/CottageLabs/ResourceSyncJava
PHP client:
https://github.com/stuartlewis/resync-php
depends on the SWORDv2 client library:
https://github.com/swordapp/swordappv2-php-library/
ResourceSync @ arXiv
• Use ResourceSync for both mirroring and public data access
o efficient updates
o ability to do periodic audits
o public synchronization capability
o reduce admin burden
• Likely start with metadata + source for mirroring use case (doing
experiments now)
• Open access use cases also require processed PDFs
• Some concerns about likely use/load…
Alternate download location
• Likely want to separate machine accesses from human accesses to preserve response time on main server
=> Use Mirrored Content part of spec
o <loc> specifies canonical URI
  - e.g. http://arxiv.org/pdf/1306.1073v1.pdf
o <rs:ln rel="duplicate"> specifies preferred download location
  - e.g. http://export.arxiv.org/pdf/1306.1073v1.pdf
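Sketch: choosing the download location
A minimal Python sketch of how a Destination might honor the mirrored-content link above: keep <loc> as the canonical identity of the resource, but download from the rel="duplicate" link when one is present. The entry reuses the two arXiv URIs from this slide; the function name and the fallback to <loc> are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# A single <url> entry using the example URIs from the slide.
ENTRY = """<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
     xmlns:rs="http://www.openarchives.org/rs/terms/">
  <loc>http://arxiv.org/pdf/1306.1073v1.pdf</loc>
  <rs:ln rel="duplicate" href="http://export.arxiv.org/pdf/1306.1073v1.pdf"/>
</url>"""


def locations(entry_xml):
    """Return (canonical_uri, download_uri) for one <url> entry."""
    url = ET.fromstring(entry_xml)
    canonical = url.findtext("sm:loc", namespaces=NS)
    download = canonical  # fall back to the canonical URI
    for ln in url.findall("rs:ln", NS):
        if ln.get("rel") == "duplicate" and ln.get("href"):
            download = ln.get("href")
            break
    return canonical, download


canonical, download = locations(ENTRY)
print("identify as:  ", canonical)
print("download from:", download)
```

Storing the copy under the canonical URI while fetching from the duplicate keeps machine traffic off the main server without changing how the resource is identified.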
Getting a copy of arXiv
It might be as easy as:
(of course, you probably have to wait a while but it is nice to know ResourceSync is
stateless so one can efficiently restart)
Python Library and Client
• Aim to provide library code implementing all ResourceSync
facilities for use in both source and destination implementations
o Designed for Python 2.6 (RHEL6) and 2.7
o Will not work with Python <= 2.5
• Client (resync) supports many destination operations, inspired
by the common Unix rsync program
• Client also supports some operations that might be useful in a
source, such as generation of static Resource Lists, or periodic
Change Lists (used in arXiv experiments)
• Explorer (resync-explorer) intended to allow easy inspection
of a source’s resource sets and capabilities
• Developed since ResourceSync v0.5, updated for v0.9
http://github.com/resync/resync
ResourceSync Source Simulator
• Python code using Tornado server
• Provides random set of resources of different sizes updated at a
particular rate
• Very useful for testing Destination code
http://github.com/resync/simulator
LANL Memento Aggregator of IIPC; Europeana does metadata via OAI-PMH but anticipate content also; arXiv – mirroring and data sharing; Linked data @ BBC; DBpedia, journal data at LANL
REST not about in 1999
XML <-> OAI-PMH
large data begs diff question
protected mostly about existing HTTP auth methods, stats -> just inventory
Switching to a standardized resource-centric framework could
Semantic web version of wikipedia; want mirror to provide reliable basis for local services
Top line – just metadata about resources, destination uses GET to get them (duh)
Bottom line – packaged content => fewer round trips
Rsync etc. just reference; push vs pull -> both; many other parts
Add: rel="contents", rel="archives"
They have in common: versions exist at different URIs. Because only the representation of a single state of a resource is available from a URI.
Pattern exists in e.g.: Wikipedia, W3C specs, Dryad
Not sure whether DOI in general follows this paradigm.
Now the question is “How do we access those versions?”
- Can interlink them. There are RFCs that describe how to do that.
- But that URI-R is special. It is what typically is being bookmarked, put in email. Want to leverage the fact that this URI-R is always there. Use it as the entry point.
Memento addresses the problem in a resource-centric way: Resource, URI, state, representation, link, content negotiation
Test site, has subsets of arXiv and even complete source plus metadata (at present not up to date with 0.9)
No way around the difficulty of transferring 1TB initially but then a daily or weekly sync is efficient, and it still works even after some arbitrary time.