NISO ResourceSync Training Session

NISO Training
http://www.niso.org/workrooms/onixpl-encoding/

NISO Training
ResourceSync: A Web-Based
Resource Synchronization
Framework
December 3, 2013
Speakers:
Bernhard Haslhofer - Postdoc Research Associate
Department of Computer Science, University of Vienna
Simeon Warner - Information Science, Cornell University

ResourceSync:
A Web-Based
Framework

#resourcesync

ResourceSync is funded by
The Sloan Foundation & JISC
ResourceSync Webinar
December 3 2013

2

This is a short version of the complete ResourceSync tutorial,
which is available at
http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial

December 3 2013

3

ResourceSync Tutorial History
• OAI8, June 2013 – Open Repositories, July 2013 –
JCDL, July 2013 – TPDL 2013, September 2013 –LITA
Forum, November 2013, SWIB November 2013, …

Presenters

Simeon Warner
Cornell University

Bernhard Haslhofer
University of Vienna
December 3 2013
4

ResourceSync Tutorial Contributors

Martin Klein
Herbert Van de Sompel
Robert Sanderson
Los Alamos National Laboratory Los Alamos National Laboratory Los Alamos National Laboratory
<martinklein0815@gmail.com>
<hvdsomp@gmail.com>
<azaroth24@gmail.com>
@mart1nkle1n
@hvdsomp
@azaroth24

Simeon Warner
Cornell University
<simeon.warner@cornell.edu>
@zimeon

Michael L. Nelson
Old Dominion University
<mln@cs.odu.edu>
@phonedude_mln

Richard Jones
Cottage Labs
<richard@cottagelabs.com>
@cottagelabs
December 3 2013
5

OAI
Herbert Van de Sompel
Martin Klein
Robert Sanderson
(Los Alamos National Laboratory)
Simeon Warner
(Cornell University)

NISO
Todd Carpenter
Nettie Lagace
University of Oxford
Graham Klyne

Bernhard Haslhofer
(University of Vienna)
Michael L. Nelson
(Old Dominion University)

Lyrasis
Peter Murray

Carl Lagoze
(University of Michigan)

December 3 2013

6

ResourceSync Technical Group
LOCKSS
Ex Libris Inc.
Shlomo Sanders

David Rosenthal

JISC
Richard Jones

Paul Walk

Stuart Lewis

RedHat
OCLC
Christian Sadilek

Library of Congress

Jeff Young

Kevin Ford

December 3 2013

7

Timeline, Status of Specification(s)
• August 2013
o

o

Release of ResourceSync framework Core specification
- Version 0.9.1
Public draft of ResourceSync Archives specification released

• September 2013
o

Core specification on its way to become an ANSI standard

• November 2013
o

Internal draft of ResourceSync Notification specification

• January 2014
o

Public draft of ResourceSync Notification specification

• Mid 2014
o

Core specification becomes ANSI/NISO standard

December 3 2013

8

Pointers
• Specification
http://www.openarchives.org/rs/
http://www.openarchives.org/rs/resourcesync
http://www.openarchives.org/rs/archives

• List for public comment
https://groups.google.com/d/forum/resourcesync
• Client and simulator code
http://github.org/resync/resync
http://github.org/resync/simulator

December 3 2013

9

ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual
Approach

2. Motivation & Use Cases
3. Framework Walkthrough

4. Framework (Technical) Details
5. Implementation
6. Q&A
December 3 2013

10

1. ResourceSync: Problem Perspective & Conceptual
Approach

December 3 2013

11

Synchronize What?

• Web resources
o things with a URI that can be dereferenced
• Focus on needs of research communication and cultural heritage
organizations
o but aim for generality

December 3 2013

12

Synchronize What?
• Small websites/repositories (a few resources) to large
repositories/datasets/linked data collections (many millions of
resources)

sync

sync

December 3 2013

13

Synchronize What?
• Low change frequency (weeks/months) to high change
frequency (seconds)
• Synchronization latency and accuracy needs may vary
sync

sync

sync

December 3 2013

14

Why?
… because lots of projects and services are doing synchronization
but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this

• Experience with OAI-PMH: widely used in repos but
o XML metadata only
o Attempts at synchronizing actual content via OAI-PMH
(complex object formats, dc:identifier) not successful.
o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?

December 3 2013

15

ResourceSync Problem
• Consideration:
• Source (server) A has resources that change over time: they
get created, modified, deleted
• Destination (servers) X, Y, and Z leverage (some)
resources of Source A.
• Problem:
• Destinations want to keep in step with the resource changes
at Source A: resource synchronization.
• Goal:
• Design an approach for resource synchronization aligned
with the Web Architecture that has a fair chance of adoption
by different communities.
• The approach must scale better than recurrent HTTP
HEAD/GET on resources.

December 3 2013

16

Source: Core Synchronization Capabilities

P
U
L
L

1. Describing content – publish a list of resources available for
synchronization to enable Destinations to perform an initial load
or catch-up with a Source
2. Packaging content – bundle resources to enable bulk download
by destinations
3. Describing changes – publish a list of resource changes to
enable destinations to stay synchronized and decrease latency
4. Packaging changes – bundle resource changes for bulk
download by destinations

December 3 2013

17

Source: Notifications Capabilities
To reduce synchronization latency and to optimize the synchronization
process the Source can support:

P
•
U
S
•
H

1. Change Notification
• Notifies about changes to particular resources
• e.g., resource A has been updated | created | deleted
2. Framework Notification
• Notifies about changes to capabilities i.e., their documents
• e.g., a Change List has been updated | created | deleted

December 3 2013

18

Source: Synchronization Features
1. Discovery of capabilities – support Destinations in discovering
all offered capabilities
o

Applies to PULL, PUSH, capabilities

1. Linking to related resources – provide links from resources
subject to synchronization to related resources
o

Applies to PULL, PUSH capabilities

December 3 2013

19

Destination: Synchronization Needs
1. Baseline synchronization – A destination must be able to
perform an initial load or catch-up with a source
- avoid out-of-band setup
2. Incremental synchronization – A destination must have some
way to keep up-to-date with changes at a source
- subject to some latency; minimal: create/update/delete
- allow to catch-up after destination has been offline
3. Audit – A destination should be able to determine whether it is
synchronized with a source
- regarding coverage and accuracy

December 3 2013

20


2. Motivation & Use Cases

December 3 2013

21

Use Case 1: arXiv Mirroring and Data Sharing
• Repository of scholarly articles in
physics, mathematics, computer
science, etc.
• > 850k articles
• approx. 1.5 revisions per article on
average
• approx. 75k new articles per year
• Each article has full-text and separate
metadata record
• approx. 3.8M resources

December 3 2013

22

Use Case 1: arXiv Mirroring and Data Sharing
• 2,700 updates daily
o at 8pm EST
o Currently using homebrew mirroring
solution (running with minor
modifications since 1994!)
o occasional rsync (file systemspecific, auth issues)

December 3 2013

23

Use Case 1: arXiv
Mirroring / Data Sharing
• GOAL: Keep mirror sites synchronized with daily
changes
• WANT:
o
o
o
o
o

o

high consistency
moderate latency
robustness to global network outages (low admin effort)
ability to verify sync status in case of questions
low admin effort (i.e. standard approach, standard tools)
reasonable consistency, latency, efficiency

December 3 2013

24

Use Case 2: DBpedia Live Duplication
• Average of 2 updates per second
• Low latency desirable => need for a push technology

December 3 2013

25


3. Framework Walkthrough

December 3 2013

26

Source Capability 1: Describing Content
In order to advertise the resources that a source wants destinations
to know about, it may describe them:
o

o

Publish a Resource List, a list of resource URIs and possibly
associated metadata
- Destination GETs the Resource List
- Destination GETs listed resources by their URI
A Resource List describes the state of a set of resources at
one point in time (snapshot)

December 3 2013

27

Source Capability 2: Packaging Content
By default, content is transferred in response to a GET issued by a
destination against a URI of a source’s resource. But a source may
support additional mechanisms:
o

o

Publish a Resource Dump, a document that points to
packages of resource representations and necessary
metadata
- Destination GETs the package
- Destination unpacks the package
- ZIP format supported
A Resource Dump and the packages it points to reflect the
state of a set of resources at one point in time (snapshot)

December 3 2013

30

Source Capability 3: Describing Changes
In order to achieve lower latency and/or greater efficiency, a source
may communicate about changes to its resources:
o

o

Publish a Change List, a list of recent change events
(created, updated, deleted resource)
- Destination acts upon change events, e.g. GETs
created/updated resources, removes deleted resources.
A Change List pertains to resources that changed in a
temporal interval with a start- and an end-date
- If a resource changed more than once, it will be listed
more than once

December 3 2013

33

Source Capability 4: Packaging Changes
In order to reduce the number of requests to obtain resource
changes, a source may provide packaged bitstreams for changed
resources:
o

o

Publish a Change Dump, a document that points to
packages containing bitstreams of recently changed
resource and necessary metadata
- Destination GETs the package
- Destination unpacks the package
- ZIP format supported
A Change Dump and its packages pertain to resources that
changed in a temporal interval with a start- and an end-date
- If a resource changed more than once, it will be included
more than once
December 3 2013

37

Destination: Key Processes

December 3 2013

39


4. Framework (Technical) Details

December 3 2013

40

So Many Choices
Push

DSNotify
OAI-PMH
rsync

Crawl

Pull
OAI-ORE

RDFsync

WebDAV Col. Syn.

XMPP
Atom

SWORD
Sitemap

SPARQLpush

SDShare

AtomPub

RSS
PubSubHubbub

XMPP
December 3 2013

41

So Many Choices
Push

DSNotify
OAI-PMH
rsync

Crawl

Pull
OAI-ORE

RDFsync

WebDAV Col. Syn.

XMPP
Atom

SWORD
Sitemap

SPARQLpush

SDShare

AtomPub

RSS
PubSubHubbub

XMPP
December 3 2013

42

December 3 2013

43

Sitemap Format

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9”>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
</url>

<url>
</url>
…
</urlset>

December 3 2013

44

ResourceSync Sitemap Extensions
<urlset xmlns=http://www.sitemaps.org/schemas/sitemap/0.9
xmlns:rs="http://www.openarchives.org/rs/terms/”>
<rs:ln …/>
<rs:md …/>
<url>
<rs:ln …/>
<rs:md …/>
</url>
<url>
…
</url>
</urlset>

December 3 2013

45

Related Resource Metadata Summary
• Attributes of the <rs:ln> element; c.f. resource metadata + pri
Element/Attribute Description

Defined by

<rs:ln>

ResourceSync

encoding

HTTP Content-Encoding header value

RFC2616

hash

One or more content digests (md5, sha-1, sha-256)

Atom Link Ext.

href

Related resource URI (identity)

RFC4287

length

HTTP Content-Length header value

RFC4287

modified

Timestamp of last change (c.f. <lastmod>)

Atom Link Ext.

path

Path in ZIP package (Dump Manifests only)

ResourceSync

pri

Priority of link

RFC6249

rel

Relation - IANA registered or URI

RFC4287

type

HTTP Content-Type header value

RFC4287
December 3 2013

Resource Metadata Summary
Element/Attribute
<loc>
<lastmod>

Description
Resource URI (identity)
Timestamp of last change

Defined by
sitemaps
sitemaps

<changefreq>

Expected update frequency

sitemaps

<rs:md>
change
encoding

hash
length
path
type

ResourceSync
Change type (Change List & Change
Dump Manifest only)

ResourceSync

HTTP Content-Encoding header value

RFC2616

One or more content digests (md5, sha-1, Atom Link Ext.
sha-256)

HTTP Content-Length header value

RFC4287

Path in ZIP package (Dump Manifests
only)
HTTP Content-Type header value

ResourceSync

RFC4287

December 3 2013

Link Relation Summary
Relation

Use in ResourceSync

Defined in

rel="alternate"

Link from generic to specific URI

HTML 5

rel="canonical"

Link from specific to generic URI

RFC6596

rel="collection"

Resource is member of collection

RFC6573

rel="contents"

Link from dump to manifest

rel="describedby"

Has metadata

HTML4
Protocol for Web Description Resources
(POWDER): Description Resources

rel="describes"

Is metadata for

The 'describes' Link Relation Type

rel="duplicate"

RFC6249

rel=".../rs/terms/patch"

Mirror or alternative copy
A patch -- efficient change
information

rel="memento"

Link to time-specific URI

Memento Internet Draft

rel="timegate"

Link to timegate

Memento Internet Draft

rel="via"

Provenance chain, came from

RFC4287

This specification

December 3 2013

Resource List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:md capability="resourcelist"
at="2013-01-03T09:00:00Z”
completed="2013-01-03T09:01:00Z” />
<url>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
<url>
…
</url>
</urlset>

December 3 2013

49

Resource List
• Describe Source’s resources that are subject to synchronization
• At one point in time (snapshot)
• Creation can take some time – duration can be conveyed
• Typical Destination use: Baseline Synchronization, Audit

• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @at to determine freshness
• [@at, @completed] – interval of uncertainty
• Destination issues GETs against URIs to obtain resources
• Very similar to current Sitemaps

December 3 2013

50

Resource Dump
<rs:md capability=”resourcedump"
at="2013-01-02T09:00:00Z”/>
<url>
<loc>http://example.com/resourcedump_part1.zip</loc>
<rs:md length=”97553"
type=”application/zip"/>
<rs:ln rel=”contents”
href="http://example.com/resourcedump_manifest-part1.xml"
type=”application/xml"/>
</url>
<url>
<loc>http://example.com/resourcedump_part2.zip</loc>
</url>
</urlset>
December 3 2013

51

Resource Dump Manifest
<rs:md capability=”resourcedump-manifest"
at="2013-01-02T09:00:00Z”/>
<url>
<rs:md type="text/html"
path=”/resources/res1"/>
</url>
<url>
<rs:md type=”application/pdf”
path=”/resources/res2"/>
</url>
</urlset>

December 3 2013

52

Resource Dump
• A Resource Dump points to packages (ZIP files) that contain
representations of the Source’s resources
• At one point in time (snapshot)
• Resource Dump is mandatory, even if there is only one ZIP file
• ZIP package contains manifest, listing contained bitstreams
• Typical Destination use: Baseline Synchronization, bulk
download

• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @at to determine freshness
• [@at, @completed] – interval of uncertainty
• GETs against individual URIs from Resource List achieves the
same result (ignoring varying freshness)
December 3 2013

53

Change List
<rs:md capability=”changelist"
from="2013-01-02T09:00:00Z”
until="2013-01-03T09:00:00Z”/>
<url>
<rs:md change=”updated"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
<url>
…
</url>
</urlset>

December 3 2013

54

Change List
• A Change List pertains to a Source’s resources that changed
• Changes that occurred during a temporal interval with startand end-date
• Typical Destination use: Incremental Synchronization, Audit
• Changes are listed in chronological order
• Multiple changes to one resource results in the resource being
listed multiple times, once per change
• Source determines duration of temporal interval
• Destinations use @from and @until to determine freshness
• Destinations issue GETs against URIs to obtain changed
resources

December 3 2013

55

Discovery of Capabilities
Requirements:
• Need to discover capabilities, i.e. Resource List, Resource
Dump, Change List, Change Dump, Archives, Notification
channels
• Need to know the type of capability each document
represents.
Approach:
• The Source publishes a Capability List that enumerates the
capabilities it supports.
• By pointing at Resource List, Change List, Resource Dump,
etc. using appropriate relation types, e.g. “resourcelist”,
“changelist”, “resourcedump” etc.
http://www.openarchives.org/rs/resourcesync#CapabilityList
December 3 2013

56

Discovery of Capabilities

December 3 2013

57

Capability List
<rs:md capability=”capabilitylist”/>
<url>
<loc>http://example.com/dataset1/resourcelist.xml</loc>
<rs:md capability=”resourcelist”/>
</url>
<url>
<loc>http://example.com/dataset1/changelist.xml</loc>
<rs:md capability=”changelist”/>
</url>
<url>
<loc>http://example.com/dataset1/resourcedump.xml</loc>
<rs:md capability=”resourcedump”/>
</url>
</urlset>

December 3 2013

58

Discovery of Capability Lists
Requirements:
• Need to discover a Capability List
Approaches:
• Introduce a link in the HTTP Link header of a resources that is
subject to synchronization, pointing at the Capability List with the
relation type “resourcesync”
• Introduce a link from an HTML document that is subject to
synchronization (<head> section), pointing at the Capability List
with the relation type “resourcesync”
• Link from a Resource List, etc. to the Capability List with the
relation type “up”
Link header on example.com/res1.pdf
Link: <example.com/dataset1/capabilitylist.xml>;rel=“resourcesync”
December 3 2013

59

Discovery via robots.txt
• Resource Lists are (enhanced) Sitemaps
• Sitemaps can be discovered via robots.txt
• Ergo, Resource Lists should be discoverable via robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml

December 3 2013

60

Framework Structure

December 3 2013

61

Motivation for Notifications
•

Reduce synchronization latency by having the Source push out
resource change information
• To avoid continuous pull of Change Lists by Destinations

•

Share information about changes to the Source’s
ResourceSync implementation, e.g. announcement of new
Resource List, new Capability List, etc.
• To avoid continuous polling of e.g. Resource Lists,
ResourceSync Description

December 3 2013

62

Source: Notification Capabilities
•

P
U
•
S
H

1. Change Notification
• Notifies about changes to particular resources
• e.g., resource A has been updated | created | deleted
2. Framework Notification
• Notifies about changes to capabilities i.e., their documents
• e.g., a Change List has been updated | created | deleted
• Also for Capability Lists and Source Description

December 3 2013

63

Notification Channels
•

Notification sent via channels
• Resource Notification: one channel per set of resources
• Framework Notification: one channel per set of resources
• Sent on level of capability document, not on index-level
• Notifications about changes to Source Description sent on all
Framework Notification channels

•

Payload for notifications: <urlset> documents

•

Transport protocol for notifications under discussion:
• PubSubHubbub https://pubsubhubbub.googlecode.com/git/pubsubhubbub-core0.4.html - current choice
• WebSockets -http://tools.ietf.org/html/rfc6455 – may be added
later
December 3 2013

64

Change Notification Payload
<rs:ln rel="up"
href="http://example.com/dataset1/capabilitylist.xml"/>
<url>
<rs:md change=”updated"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
<url>
…
</url>
</urlset>

December 3 2013

65


5. Implementation

December 3 2013

66

DSpace support for
metadata harvesting use case

DSpace Module:
https://github.com/CottageLabs/DSpaceResourceSync
PHP client:
https://github.com/stuartlewis/resync-php
http://mydspace.edu/dspace-rs/resource/123456789/7/qdc
ResourceSync webapp

Item handle

Metadata Format
December 3 2013

67

ResourceSync @ arXiv
• Use ResourceSync for both
mirroring and public data access
o efficient updates
o ability to do periodic audits
o public synchronization capability
o reduce admin burden
• Start with metadata + source for
mirroring use case (doing
experiments now)
• Open Access use cases require
processed PDF also
December 3 2013

68

Getting a copy of arXiv
It might be as easy as:

(of course, you probably have to wait a while but it is nice to know ResourceSync is
stateless so one can efficiently restart)

December 3 2013

69

Python Library and Client
• Aim to provide library code implementing all ResourceSync
facilities for use in both source and destination implementations
o

Designed for python 2.6 (RHEL6) and 2.7

• Client (resync) supports many destination operations, inspired
by the common Unix rsync program
• Client also supports some operations that might be useful in a
source, such as generation of static Resource Lists, or periodic
Change Lists (used in arXiv experiments)
• Explorer (resync-explorer) intended to allow easy inspection
of a source’s resource sets and capabilities
• Developed since ResourceSync v0.5, updated for v0.9.1
http://github.org/resync/resync
On pypi: “easy_install resync”

December 3 2013

ResourceSync Source Simulator
• Python code using Tornado server
• Provides random set of resources of different sizes updated at a
particular rate
• Very useful for testing Destination code

http://github.com/resync/simulator

December 3 2013


6. Q&A
December 3 2013

72

ResourceSync:
A Web-Based
Framework

#resourcesync

ResourceSync is funded by
The Sloan Foundation & JISC
December 3 2013

73

THANK YOU
We look forward to seeing you at a
future NISO training event.

NISO ResourceSync Training Session

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to NISO ResourceSync Training Session

Similar to NISO ResourceSync Training Session (20)

More from National Information Standards Organization (NISO)

More from National Information Standards Organization (NISO) (20)

Recently uploaded

Recently uploaded (20)

NISO ResourceSync Training Session

Editor's Notes