ResourceSync: Web-Based Resource Synchronization

Presentation about the NISO/OAI ResourceSync effort used at TICER 2012 Summer School.

Published in: Technology, Education

  1. ResourceSync: Web-Based Resource Synchronization
      Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp
      ResourceSync is funded by The Sloan Foundation & JISC
      ResourceSync – Herbert Van de Sompel, TICER Summer School, August 22 2012, Tilburg, The Netherlands
  2. ResourceSync Core Team – NISO & OAI
      Cornell University & OAI: Bernhard Haslhofer, Carl Lagoze, Simeon Warner
      Old Dominion University & OAI: Michael L. Nelson
      Los Alamos National Laboratory & OAI: Martin Klein, Robert Sanderson, Herbert Van de Sompel
      NISO: Todd Carpenter, Nettie Lagace, Peter Murray
  3. ResourceSync Technical Group
      • Manuel Bernhardt, Delving B.V.
      • Kevin Ford, Library of Congress
      • Richard Jones, JISC
      • Graham Klyne, JISC
      • Stuart Lewis, JISC
      • David Rosenthal, LOCKSS
      • Christian Sadilek, Red Hat
      • Shlomo Sanders, Ex Libris, Inc.
      • Sjoerd Siebinga, Delving B.V.
      • Ed Summers, Library of Congress
      • Jeff Young, OCLC Online Computer Library Center
  4. ResourceSync
      ResourceSync: What & Why?
      Problem Perspective & Conceptual Approach
      Technical Details
      Q&A
  5. ResourceSync
      ResourceSync: What & Why?
      Problem Perspective & Conceptual Approach
      Technical Details
      Q&A
  6. Synchronize What?
      • Web resources – things with a URI that can be dereferenced and are cache-able (no dependency on underlying OS, technologies etc.)
      • Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
      • That change slowly (weeks/months) or quickly (seconds), and where latency needs may vary
      • Focus on needs of research communication and cultural heritage organizations, but aim for generality
  7. Why?
      … because lots of projects and services are doing synchronization but have to resort to ad-hoc, case by case, approaches!
      • Project team involved with projects that need this
      • Experience with OAI-PMH: widely used in repos but
        o XML metadata only
        o Attempts at synchronizing actual content via OAI-PMH (complex object formats, dc:identifier) not successful.
        o Web technology has moved on since 1999
      • Devise a shared solution for data, metadata, linked data?
  8. Use Cases – The Basics
  9. Use Cases - More
  10. Out Of Scope (For Now)
      • Bidirectional synchronization
      • Destination-defined selective synchronization (query)
      • Bulk URI migration
  11. Use Case: arXiv Mirroring
      • 1M article versions, ~800/day created or updated at 8 PM US Eastern Time
      • Metadata and full-text for each article
      • Accuracy important
      • Want low barrier for others to use
      • Look for more general solution than current homebrew mirroring (running with minor modifications since 1994!) and occasional rsync (filesystem layout specific, auth issues)
  12. Use Case: DBpedia Live Duplication
      • Average of 2 updates per second
      • Want low latency => need a push technology
  13. ResourceSync
      ResourceSync: What & Why?
      Problem Perspective & Conceptual Approach
      Technical Details
      Q&A
  14. ResourceSync Problem
      • Consideration:
        • Source (server) A has resources that change over time: they get created, modified, deleted
        • Destination (servers) X, Y, and Z leverage (some) resources of Source A.
      • Problem:
        • Destinations want to keep in step with the resource changes at Source A: resource synchronization.
      • Goal:
        • Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities.
        • The approach must scale better than recurrent HTTP HEAD/GET on resources.
  15. Destination: 3 Basic Synchronization Needs
      1. Baseline synchronization – A destination must be able to perform an initial load or catch-up with a source
         - avoid out-of-band setup
      2. Incremental synchronization – A destination must have some way to keep up-to-date with changes at a source
         - subject to some latency; minimal: create/update/delete
         - allow to catch-up after destination has been offline
      3. Audit – A destination should be able to determine whether it is synchronized with a source
         - subject to some latency
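
The audit need on slide 15 can be sketched roughly as follows. This is only an illustration, not part of the ResourceSync draft: it assumes the source inventory and the destination's local state are both available as mappings from URI to a last-modification string, and the names `audit` and `in_sync` as well as the sample URIs are hypothetical.

```python
def audit(source_state, destination_state):
    """Compare two {uri: lastmod} mappings and report how they differ."""
    # Resources the source lists but the destination does not hold.
    missing = sorted(u for u in source_state if u not in destination_state)
    # Resources held by both, but with differing modification stamps.
    stale = sorted(
        u for u in source_state
        if u in destination_state and destination_state[u] != source_state[u]
    )
    # Resources the destination holds that the source no longer lists.
    extra = sorted(u for u in destination_state if u not in source_state)
    return {"missing": missing, "stale": stale, "extra": extra}


def in_sync(report):
    """A destination is synchronized when every difference list is empty."""
    return not any(report.values())


# Hypothetical sample data.
source = {
    "http://example.com/res1": "2012-08-20T10:00:00Z",
    "http://example.com/res2": "2012-08-21T10:00:00Z",
    "http://example.com/res3": "2012-08-22T10:00:00Z",
}
destination = {
    "http://example.com/res1": "2012-08-20T10:00:00Z",
    "http://example.com/res2": "2012-08-19T10:00:00Z",
    "http://example.com/gone": "2012-08-01T10:00:00Z",
}
report = audit(source, destination)
print(report, in_sync(report))
```

A non-empty `missing` or `stale` list suggests resources to GET again, and `extra` suggests resources to remove, which mirrors the create/update/delete minimum named on the slide.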
  16. Source Capability 1: Describing Content
      In order to advertise the resources that a source wants destinations to know about, it may describe them:
        o Publish an inventory of resource URIs and possibly associated metadata
          - Destination GETs the Content Description
          - Destination GETs listed resources by their URI
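
As a rough sketch of the destination side of slide 16, the snippet below reads an inventory expressed as a plain Sitemap and extracts the listed URIs with their `lastmod` values. It assumes the inventory uses only the standard Sitemap elements (`urlset`, `url`, `loc`, `lastmod`); the sample document, the function name `read_inventory`, and the example.com URIs are made up for illustration, and the ResourceSync extension elements are not shown.

```python
import xml.etree.ElementTree as ET

# Namespace of the standard Sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical inventory; a destination would normally obtain this via HTTP GET.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/res1</loc><lastmod>2012-08-08T08:15:00Z</lastmod></url>
  <url><loc>http://example.com/res2</loc></url>
</urlset>"""


def read_inventory(xml_text):
    """Return {uri: lastmod or None} for every <url> entry in a Sitemap."""
    root = ET.fromstring(xml_text)
    inventory = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            inventory[loc.strip()] = lastmod
    return inventory


inventory = read_inventory(SAMPLE)
for uri, lastmod in inventory.items():
    # A real destination would now GET each listed URI.
    print(uri, lastmod)
```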
  17. Source Capability 2: Communicating Change Events
      In order to achieve lower latency, a source may communicate about changes to its resources:
        o 2.1. Change Set: Publish a list of recent change events (create, update, delete resource)
          - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
  18. Source Capability 2: Communicating Change Events
      In order to achieve lower latency, a source may communicate about changes to its resources:
        o 2.1. Change Set: Publish a list of recent change events (create, update, delete resource)
          - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
        o 2.2. Push Change Set: Push a list of recent change events (create, update, delete resource) towards (a) destination(s)
          - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
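
How a destination might act upon change events can be sketched as below. The event shape (a dict with `uri` and `type` keys), the function name `apply_change_events`, and the `fetch` callable standing in for an HTTP GET are all assumptions made for this illustration; the draft specification defines its own serialization.

```python
def apply_change_events(mirror, events, fetch):
    """Apply create/update/delete events to a local {uri: content} mirror."""
    for event in events:
        uri, change = event["uri"], event["type"]
        if change in ("create", "update"):
            # Created or updated resources are fetched again from the source.
            mirror[uri] = fetch(uri)
        elif change == "delete":
            # Deleted resources are dropped; a missing entry is not an error.
            mirror.pop(uri, None)
        else:
            raise ValueError("unknown change type: %r" % change)
    return mirror


def fake_fetch(uri):
    """Stand-in for an HTTP GET so the sketch runs without network access."""
    return ("content of " + uri).encode("utf-8")


mirror = {"http://example.com/old": b"old"}
events = [
    {"uri": "http://example.com/res1", "type": "create"},
    {"uri": "http://example.com/old", "type": "delete"},
]
apply_change_events(mirror, events, fake_fetch)
print(sorted(mirror))
```

The same function would serve both the pull case (2.1) and the push case (2.2), since the two differ in how the events arrive rather than in how the destination acts on them.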
  19. Source Capability 3: Providing Access to Versions
      In order to allow a destination to catch up with missed changes, a source may support:
        o 3.1. Historical Change Sets: Provide access to change events that occurred prior to the ones listed in the current Change Set
  20. Source Capability 3: Providing Access to Versions
      In order to allow a destination to catch up with missed changes, a source may support:
        o 3.1. Historical Change Sets: Provide access to change events that occurred prior to the ones listed in the current Change Set
        o 3.2. Historical Content: Provide access to prior resource versions
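
One possible way for a destination to catch up through Historical Change Sets is sketched below. The data model is entirely hypothetical: each change set is a dict with an `events` list and a `previous` key naming the next older set, and timestamps are ISO 8601 UTC strings in one fixed format so that string comparison matches time order. The draft may link change sets differently.

```python
def missed_events(change_sets, current_id, last_synced):
    """Walk back from the current change set and collect events newer than last_synced."""
    collected = []
    set_id = current_id
    while set_id is not None:
        change_set = change_sets[set_id]
        newer = [e for e in change_set["events"] if e["time"] > last_synced]
        collected.extend(newer)
        if len(newer) < len(change_set["events"]):
            # This set reaches back past the last sync; older sets are not needed.
            break
        set_id = change_set.get("previous")
    # Replay in chronological order.
    return sorted(collected, key=lambda e: e["time"])


# Hypothetical chain of change sets: cs3 (current) -> cs2 -> cs1.
change_sets = {
    "cs3": {"previous": "cs2", "events": [
        {"time": "2012-08-22T10:00:00Z", "uri": "a", "type": "update"},
    ]},
    "cs2": {"previous": "cs1", "events": [
        {"time": "2012-08-21T09:00:00Z", "uri": "b", "type": "create"},
        {"time": "2012-08-21T18:00:00Z", "uri": "c", "type": "delete"},
    ]},
    "cs1": {"previous": None, "events": [
        {"time": "2012-08-20T12:00:00Z", "uri": "a", "type": "create"},
    ]},
}
missed = missed_events(change_sets, "cs3", "2012-08-21T12:00:00Z")
print([(e["uri"], e["type"]) for e in missed])
```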
  21. Source Capability 4: Transferring Content
      By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:
        o 4.1. Dump: Publish a package of resource representations and necessary metadata
          - Destination GETs the Dump
          - Destination unpacks the Dump
  22. Source Capability 4: Transferring Content
      By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:
        o 4.1. Dump: Publish a package of resource representations and necessary metadata
          - Destination GETs the Dump
          - Destination unpacks the Dump
        o 4.2. Alternate Content Transfer: Support alternative mechanisms to optimize getting content, e.g. content via a mirror site, only changes not the entire changed resource.
  23. Source: Advertise Capabilities
      A source needs to advertise the capabilities it supports to allow a destination to discover them
      • Some capabilities may be provided by a third party, not the source itself
        o e.g. Historical Change Sets, Historical Content
        o But the source should still make those third party capabilities discoverable - trust
  24. ResourceSync
      ResourceSync: What & Why?
      Problem Perspective & Conceptual Approach
      Technical Details
      Q&A
  25. So Many Choices
      Diagram labels: Push, DSNotify, OAI-PMH, Pull, rsync, Crawl, OAI-ORE, RDFsync, WebDAV Col. Syn., XMPP, Atom, SWORD, AtomPub, Sitemap, RSS, SPARQLpush, PubSubHubbub, SDShare, XMPP
  26. [No slide text in transcript]
  27. [No slide text in transcript]
  28. A Framework Based on Sitemaps
      • Modular framework allowing selective deployment
      • Sitemap is the core component throughout the framework
        o Introduce extension elements and attributes:
          - In ResourceSync namespace (rs:) to accommodate synchronization needs
          - In XHTML namespace (xhtml:) mainly to accommodate discovery needs
        o Reuse Sitemap format for Change Sets (both current and historical) and for manifest in Dump
  29. Source Capabilities – Destination Needs
  30. Source Capabilities – Destination Needs
  31. Sitemap with Added Datetime
  32. Change Types: Extend lastmod, Use expires
  33. Sitemap with lastmod and expires
  34. Sitemap Discovery via robots.txt
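
The slide content for discovery via robots.txt did not survive in this transcript, so the following is only a sketch of the general Sitemap convention: a `Sitemap:` line in robots.txt points to a Sitemap URL. The function name `sitemap_urls` and the sample robots.txt are hypothetical, and the text is parsed from a string rather than fetched over HTTP.

```python
def sitemap_urls(robots_txt):
    """Return the URLs named by 'Sitemap:' lines in a robots.txt document."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split only on the first colon; the URL itself contains colons.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content.
ROBOTS = """User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
# a comment line
sitemap: http://example.com/changeset.xml
"""
found = sitemap_urls(ROBOTS)
print(found)
```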
  35. Source Capabilities – Destination Needs
  36. Change Set: An rs Typed Sitemap
  37. More rs Extension Elements
  38. Change Set with rs and xhtml Extensions
  39. Change Set Discovery via Sitemap
  40. Pushing Change Sets via XMPP PubSub
      XMPP Publish-Subscribe: Client to Subscription Service, Subscription Service to Client(s) communication
      • One of the XMPP (Extensible Messaging and Presence Protocol) extensions http://xmpp.org/extensions/xep-0060.html
      • Apple Notifications based on XMPP PubSub
      • Available tools, see http://xmpp.org/about-xmpp/technology-overview/pubsub/#impl-client
        o XMPP Servers with PubSub support:
          - ejabberd, OpenFire, Tigase, SleekXMPP
        o XMPP libraries with PubSub support:
          - Strophe (C, JavaScript), XMPP4R (Ruby), SleekXMPP (Python), PubSub Client (Python)
  41. Pushing Change Sets via XMPP PubSub
  42. Change Set via XMPP
  43. Push Change Set Discovery via Sitemap
  44. Source Capabilities – Destination Needs
  45. Discovering a Historical Change Set via a Current Change Set
  46. Source Capabilities – Destination Needs
  47. Discovering Historical Content – Link to Version Resource
  48. Memento Intermezzo http://www.mementoweb.org/
  49. Original Resources and Mementos
  50. Bridge from Present to Past
  51. Bridge from Past to Present
  52. Memento Framework
  53. Discovering Historical Content – Link to Memento TimeGate
  54. Source Capabilities – Destination Needs
  55. Dump
      • Two formats currently under discussion:
        o Format based on ZIP:
          - Package content
          - Add manifest (manifest.xml) expressed in Sitemap format
          - ZIP it up
        o WARC files as used by the web archiving community
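
The ZIP-based Dump steps on slide 55 (package content, add manifest.xml, ZIP it up) and the destination's unpacking could look roughly like this. Several details are assumptions made for the sketch: `rs:path` is modeled as a child element of `url`, the `rs:` namespace URI is a placeholder rather than the one in the draft, and the function names and file paths are hypothetical. The archive is built in memory so the sketch runs without touching the filesystem.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://example.org/rs-placeholder"  # placeholder, NOT the real rs: namespace

# Hypothetical manifest mapping resource URIs to file paths inside the ZIP.
MANIFEST = f"""<urlset xmlns="{SM}" xmlns:rs="{RS}">
  <url><loc>http://example.com/res1</loc><rs:path>resources/res1.txt</rs:path></url>
  <url><loc>http://example.com/res2</loc><rs:path>resources/res2.txt</rs:path></url>
</urlset>"""


def build_dump(manifest_xml, files):
    """Source side: package the content plus manifest.xml into ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.xml", manifest_xml)
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


def unpack_dump(dump_bytes):
    """Destination side: use the manifest to map each URI to its packaged content."""
    with zipfile.ZipFile(io.BytesIO(dump_bytes)) as archive:
        root = ET.fromstring(archive.read("manifest.xml"))
        content = {}
        for url in root.findall(f"{{{SM}}}url"):
            loc = url.findtext(f"{{{SM}}}loc")
            path = url.findtext(f"{{{RS}}}path")
            content[loc] = archive.read(path)
    return content


dump = build_dump(MANIFEST, {
    "resources/res1.txt": b"one",
    "resources/res2.txt": b"two",
})
content = unpack_dump(dump)
print(sorted(content))
```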
  56. Mapping URI to File Path with rs:path
  57. Manifest (manifest.xml) Expressed in Sitemap Format
  58. Dump Discovery via Sitemap
  59. Source Capabilities – Destination Needs
  60. Alternate Location
  61. Alternate Protocol, e.g. Obtain Changes Only
  62. Timeline
      • August 2012
        o First draft spec shared for feedback with ResourceSync team
      • September 2012
        o In-person meeting of ResourceSync Team
        o Revise spec, conduct experiments
        o Solicit broad feedback
        o Paper in D-Lib Magazine
      • December 2012 – Finalize specification (?)
  63. Pointers
      • First draft spec: http://www.openarchives.org/rs/0.1/resourcesync
      • Simulator code on github http://github.org/resync/simulator
      • NISO workspace http://www.niso.org/workrooms/resourcesync/
      • List for public comment coming soon
  64. ResourceSync: Web-Based Resource Synchronization
      Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp
      ResourceSync is funded by The Sloan Foundation & JISC
