ResourceSync: Web-Based Resource Synchronization

ResourceSync:
Web-Based
Resource
Synchronization
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp

ResourceSync is funded by
The Sloan Foundation & JISC

ResourceSync – Herbert Van de Sompel
TICER Summer School, August 22 2012, Tilburg, The Netherlands

ResourceSync Core Team – NISO & OAI

Cornell University & OAI:
Berhard Haslhofer, Carl Lagoze, Simeon Warner

Old Dominion University & OAI:
Michael L. Nelson

Los Alamos National Laboratory & OAI:
Martin Klein, Robert Sanderson, Herbert Van de Sompel

NISO:
Todd Carpenter, Nettie Lagace, Peter Murray


ResourceSync Technical Group

•  Manuel Bernhardt, Delving B.V.
•  Kevin Ford, Library of Congress
•  Richard Jones, JISC
•  Graham Klyne, JISC
•  Stuart Lewis, JISC
•  David Rosenthal, LOCKSS
•  Christian Sadilek, Red Hat
•  Shlomo Sanders, Ex Libris, Inc.
•  Sjoerd Siebinga, Delving B.V.
•  Ed Summers, Library of Congress
•  Jeff Young, OCLC Online Computer Library Center


ResourceSync

ResourceSync: What & Why?

Problem Perspective & Conceptual Approach

Technical Details

Q&A


Synchronize What?

•  Web resources – things with a URI that can be dereferenced and
are cache-able (no dependency on underlying OS, technologies
etc.)

•  Small websites/repositories (a few resources) to large
repositories/datasets/linked data collections (many millions of
resources)

•  That change slowly (weeks/months) or quickly (seconds), and
where latency needs may vary

•  Focus on needs of research communication and cultural heritage
organizations, but aim for generality


Why?

… because lots of projects and services are doing synchronization
but have to resort to ad-hoc, case by case, approaches!

•  Project team involved with projects that need this

•  Experience with OAI-PMH: widely used in repos but
o  XML metadata only

o  Attempts at synchronizing actual content via OAI-PMH

(complex object formats, dc:identifier) not successful.
o  Web technology has moved on since 1999

•  Devise a shared solution for data, metadata, linked data?


Use Cases – The Basics


Use Cases - More


Out Of Scope (For Now)

•  Bidirectional synchronization

•  Destination-defined selective synchronization (query)

•  Bulk URI migration


Use Case: arXiv Mirroring

•  1M article versions, ~800/day created or
updated at 8 PM US Eastern Time

•  Metadata and full-text for each article

•  Accuracy important

•  Want low barrier for others to use

•  Look for more general solution than current
homebrew mirroring (running with minor
modifications since 1994!) and occasional rsync
(filesystem layout specific, auth issues)


Use Case: DBpedia Live Duplication

•  Average of 2 updates per second
•  Want low latency => need a push technology


ResourceSync Problem

•  Consideration:
•  Source (server) A has resources that change over time: they
get created, modified, deleted
•  Destination (servers) X, Y, and Z leverage (some) resources
of Source A.
•  Problem:
•  Destinations want to keep in step with the resource changes
at Source A: resource synchronization.
•  Goal:
•  Design an approach for resource synchronization aligned
with the Web Architecture that has a fair chance of adoption
by different communities.
•  The approach must scale better than recurrent HTTP
HEAD/GET on resources.


Destination: 3 Basic Synchronization Needs

1.  Baseline synchronization – A destination must be able to
perform an initial load or catch-up with a source
-  avoid out-of-band setup

2.  Incremental synchronization – A destination must have some
way to keep up-to-date with changes at a source
-  subject to some latency; minimal: create/update/delete
-  allow to catch-up after destination has been offline

3.  Audit – A destination should be able to determine whether it is
synchronized with a source
-  subject to some latency


Source Capability 1: Describing Content

In order to advertise the resources that a source wants destinations
to know about, it may describe them:

o  Publish an inventory of resource URIs and possibly
associated metadata
-  Destination GETs the Content Description
-  Destination GETs listed resources by their URI


Source Capability 2: Communicating Change Events

In order to achieve lower latency, a source may communicate about
changes to its resources:

o  2.1. Change Set: Publish a list of recent change events
(create, update, delete resource)
-  Destination acts upon change events, e.g. GETs created/
updated resources, removes deleted resources.


Source Capability 2: Communicating Change Events

In order to achieve lower latency, a source may communicate about
changes to its resources:

o  2.1. Change Set: Publish a list of recent change events
(create, update, delete resource)

o  2.2. Push Change Set: Push a list of recent change events
(create, update, delete resource) towards (a) destination(s)


Source Capability 3: Providing Access to Versions

In order to allow a destination to catch up with missed changes, a
source may support:

o  3.1. Historical Change Sets: Provide access to change events that
occurred prior to the ones listed in the current Change Set


Source Capability 3: Providing Access to Versions

In order to allow a destination to catch up with missed changes, a
source may support:

o  3.1. Historical Change Sets: Provide access to change events that
occurred prior to the ones listed in the current Change Set

o  3.2. Historical Content: Provide access to prior resource versions


Source Capability 4: Transferring Content

By default, content is transferred in response to a GET issued by a
destination against a URI of a source’s resource. But a source may
support additional mechanisms:

o  4.1. Dump: Publish a package of resource representations
and necessary metadata
-  Destination GETs the Dump
-  Destination unpacks the Dump


Source Capability 4: Transferring Content

By default, content is transferred in response to a GET issued by a
destination against a URI of a source’s resource. But a source may
support additional mechanisms:

o  4.1. Dump: Publish a package of resource representations
and necessary metadata
-  Destination GETs the Dump
-  Destination unpacks the Dump

o  4.2. Alternate Content Transfer: Support alternative
mechanisms to optimize getting content, e.g. content via a
mirror site, only changes not the entire changed resource.


Source: Advertise Capabilities

A source needs to advertise the capabilities it supports to allow a
destination to discover them

•  Some capabilities may be provided by a third party, not the
source itself
o  e.g. Historical Change Sets, Historical Content
o  But the source should still make those third party capabilities
discoverable - trust


So Many Choices

Push
DSNotify
OAI-PMH Pull
rsync
Crawl
OAI-ORE
RDFsync
WebDAV Col. Syn.
XMPP
Atom SWORD AtomPub
Sitemap RSS

SPARQLpush PubSubHubbub
SDShare XMPP


A Framework Based on Sitemaps

•  Modular framework allowing selective deployment

•  Sitemap is the core component throughout the
framework

o  Introduce extension elements and attributes:
-  In ResourceSync namespace (rs:) to
accommodate synchronization needs
-  In XHTML namespace (xhtml:) mainly to
accommodate discovery needs
o  Reuse Sitemap format for Change Sets (both
current and historical) and for manifest in Dump


Source Capabilities – Destination Needs


Sitemap with Added Datetime


Change Types: Extend lastmod, Use expires
!


Sitemap with lastmod and expires
!


Sitemap Discovery via robots.txt
!


Change Set: An rs Typed Sitemap


More rs Extension Elements


Change Set with rs and xhtml Extensions


Change Set Discovery via Sitemap


Pushing Change Sets via XMPP PubSub

XMPP Publish-Subscribe: Client to Subscription Service,
Subscription Service to Client(s) communication

•  One of the XMPP (Extensible Messaging and Presence Protocol)
extensions http://xmpp.org/extensions/xep-0060.html
•  Apple Notifications based on XMPP PubSub
•  Available tools, see http://xmpp.org/about-xmpp/
technology-overview/pubsub/#impl-client
o  XMPP Servers with PubSub support:

-  ejabberd , OpenFire , Tigase , SleekXMPP
o  XMPP libraries with PubSub support:

-  Strophe (C, JavaScript), XMPP4R (Ruby), SleekXMPP
(Python), PubSub Client (Python)


Pushing Change Sets via XMPP PubSub


Change Set via XMPP


Push Change Set Discovery via Sitemap


Discovering a Historical Change Set via a Current Change Set


Discovering Historical Content – Link to Version Resource


Memento Intermezzo

http://www.mementoweb.org/


Original Resources and Mementos


Bridge from Present to Past


Bridge from Past to Present


Memento Framework


Discovering Historical Content – Link to Memento TimeGate


Dump

•  Two formats currently under discussion:

o  Format based on ZIP:
-  Package content
-  Add manifest (manifest.xml) expressed in
Sitemap format
-  ZIP it up

o  WARC files as used by the web archiving
community


Mapping URI to File Path with rs:path
!


Manifest (manifest.xml) Expressed in Sitemap Format


Dump Discovery via Sitemap


Alternate Location


Alternate Protocol, e.g. Obtain Changes Only


Timeline
•  August 2012
o  First draft spec shared for feedback with ResourceSync team

•  September 2012
o  In-person meeting of ResourceSync Team

o  Revise spec, conduct experiments

o  Solicit broad feedback

o  Paper in D-Lib Magazine

•  December 2012 – Finalize specification (?)


Pointers
•  First draft spec:
http://www.openarchives.org/rs/0.1/resourcesync!

•  Simulator code on github
http://github.org/resync/simulator!

•  NISO workspace
http://www.niso.org/workrooms/resourcesync/!
!
•  List for public comment coming soon


ResourceSync: Web-Based Resource Synchronization

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to ResourceSync: Web-Based Resource Synchronization

Similar to ResourceSync: Web-Based Resource Synchronization (20)

More from Herbert Van de Sompel

More from Herbert Van de Sompel (20)

Recently uploaded

Recently uploaded (20)

ResourceSync: Web-Based Resource Synchronization