NISO Forum, Denver, September 24, 2012: ResourceSync: Web-Based Resource Synchronization

ResourceSync: Web-Based Resource Synchronization. Also for Data.
Herbert Van de Sompel, Digital Library Researcher, Los Alamos National Laboratory, and Co-chair of NISO’s ResourceSync Working Group

Web applications frequently leverage resources made available by remote Web servers. As resources are created, updated, or deleted these applications face challenges to remain in lockstep with the server’s change dynamics. Several approaches exist to help meet this challenge for use cases where “good enough” synchronization is acceptable. But when strict resource coverage or low synchronization latency is required, commonly accepted Web-based solutions remain elusive. Motivated by the need to synchronize resources for applications in the realm of cultural heritage and research communication, the National Information Standards Organization (NISO) and the Open Archives Initiative (OAI) have launched the ResourceSync project that aims at designing an approach for resource synchronization that is aligned with the web architecture and that has a fair chance of adoption by different communities. The presentation will discuss some motivating use cases and will provide a perspective on the resource synchronization problem that results from ResourceSync project discussions. It will provide an overview of the ongoing thinking regarding an approach to address the challenges and will pay special attention to aspects that are relevant for the synchronization of data.

  1. ResourceSync: Web-Based Resource Synchronization
     Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp, #resourcesync
     ResourceSync is funded by The Sloan Foundation & JISC
     ResourceSync – Herbert Van de Sompel, NISO Forum, September 24 2012, Denver, CO
  2. ResourceSync Core Team – NISO & OAI
     • Los Alamos National Laboratory & OAI: Martin Klein, Robert Sanderson, Herbert Van de Sompel
     • Cornell University & OAI: Bernhard Haslhofer, Simeon Warner
     • Old Dominion University & OAI: Michael L. Nelson
     • University of Michigan & OAI: Carl Lagoze
     • NISO: Todd Carpenter, Nettie Lagace, Peter Murray
  3. ResourceSync Technical Group
     • Manuel Bernhardt, Delving B.V.
     • Kevin Ford, Library of Congress
     • Richard Jones, JISC
     • Graham Klyne, JISC
     • Stuart Lewis, JISC
     • David Rosenthal, LOCKSS
     • Christian Sadilek, Red Hat
     • Shlomo Sanders, Ex Libris, Inc.
     • Sjoerd Siebinga, Delving B.V.
     • Ed Summers, Library of Congress
     • Jeff Young, OCLC Online Computer Library Center
  4. ResourceSync
     • ResourceSync: What & Why?
     • Problem Perspective & Conceptual Approach
     • Possible Technical Choices
     • Q&A
  6. Synchronize What?
     • Web resources: things with a URI that can be dereferenced and are cache-able (no dependency on underlying OS, technologies, etc.)
     • Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
     • That change slowly (weeks/months) or quickly (seconds), and where latency needs may vary
     • Focus on needs of research communication and cultural heritage organizations, but aim for generality
  7. Why?
     … because lots of projects and services are doing synchronization but have to resort to ad hoc, case-by-case approaches!
     • Project team involved with projects that need this
     • Experience with OAI-PMH: widely used in repos but
       o XML metadata only
       o Attempts at synchronizing actual content via OAI-PMH (complex object formats, dc:identifier) not successful
       o Web technology has moved on since 1999
     • Devise a shared solution for data, metadata, linked data?
  8. Use Cases – The Basics
  9. Use Cases – More
  10. Out Of Scope (For Now)
      • Bidirectional synchronization
      • Destination-defined selective synchronization (query)
      • Bulk URI migration
  11. Use Case: arXiv Mirroring
      • 1M article versions, ~800/day created or updated at 8 PM US Eastern Time
      • Metadata and full-text for each article
      • Accuracy important
      • Want low barrier for others to use
      • Look for a more general solution than the current homebrew mirroring (running with minor modifications since 1994!) and occasional rsync (filesystem layout specific, auth issues)
  12. Use Case: DBpedia Live Duplication
      • Average of 2 updates per second
      • Want low latency => need a push technology
  13. ResourceSync
      • ResourceSync: What & Why?
      • Problem Perspective & Conceptual Approach
      • Possible Technical Choices
      • Q&A
  14. ResourceSync Problem
      • Consideration:
        o Source (server) A has resources that change over time: they get created, modified, deleted
        o Destination (servers) X, Y, and Z leverage (some) resources of Source A.
      • Problem:
        o Destinations want to keep in step with the resource changes at Source A: resource synchronization.
      • Goal:
        o Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities.
        o The approach must scale better than recurrent HTTP HEAD/GET on resources.
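To make the scaling goal concrete, the rough sketch below compares naive polling with a change-list approach, using the arXiv figures quoted in the use case (about 1M article versions, about 800 created or updated per day). The once-a-day polling interval and the function names are assumptions made for illustration; they do not come from the slides or from any ResourceSync draft.

```python
def polling_requests_per_day(resource_count: int, polls_per_day: int = 1) -> int:
    """Requests issued if a destination sends one HEAD per resource per poll."""
    return resource_count * polls_per_day


def change_list_requests_per_day(changes_per_day: int, list_fetches_per_day: int = 1) -> int:
    """Requests issued if a destination fetches a change list and GETs only changed resources."""
    return list_fetches_per_day + changes_per_day


if __name__ == "__main__":
    # Figures from the arXiv mirroring use case; the daily interval is assumed.
    print("recurrent HEAD polling:", polling_requests_per_day(1_000_000))
    print("change list + GETs:   ", change_list_requests_per_day(800))
```

Under these assumptions the destination's request count tracks the number of changes rather than the size of the collection, which is the property the goal asks for.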
  15. Destination: 3 Basic Synchronization Needs
      1. Baseline synchronization: A destination must be able to perform an initial load or catch-up with a source
         - avoid out-of-band setup
      2. Incremental synchronization: A destination must have some way to keep up-to-date with changes at a source
         - subject to some latency; minimal: create/update/delete
         - allow to catch up after destination has been offline
      3. Audit: A destination should be able to determine whether it is synchronized with a source
         - subject to some latency
  16. Source Capability 1: Describing Content
      In order to advertise the resources that a source wants destinations to know about, it may describe them:
      o Publish an inventory of resource URIs and possibly associated metadata
        - Destination GETs the Content Description
        - Destination GETs listed resources by their URI
  18. Source Capability 2: Communicating Change Events
      In order to achieve lower latency, a source may communicate about changes to its resources:
      o 2.1. Change Set: Publish a list of recent change events (create, update, delete resource)
        - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
      o 2.2. Push Change Set: Push a list of recent change events (create, update, delete resource) towards (a) destination(s)
        - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
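How a destination might act upon change events can be sketched as follows. This is a minimal illustration, not a ResourceSync implementation: the tuple layout of an event, the `apply_change_events` name, and the injected `fetch` callable (standing in for an HTTP GET) are all assumptions.

```python
from typing import Callable, Dict, Iterable, Tuple

# (event_type, uri) pairs; the event names follow the slide: create, update, delete.
ChangeEvent = Tuple[str, str]


def apply_change_events(
    local_copy: Dict[str, bytes],
    events: Iterable[ChangeEvent],
    fetch: Callable[[str], bytes],
) -> None:
    """Bring local_copy (URI -> content) in step with the listed events, in order."""
    for event_type, uri in events:
        if event_type in ("create", "update"):
            # GET the created or updated resource from the source.
            local_copy[uri] = fetch(uri)
        elif event_type == "delete":
            # Tolerate deletes for resources that were never mirrored.
            local_copy.pop(uri, None)
        else:
            raise ValueError(f"unknown event type: {event_type}")
```

The same function serves both the pull case (events read from a published Change Set) and the push case (events delivered to the destination), since only the way the events arrive differs.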
  20. Source Capability 3: Providing Access to Versions
      In order to allow a destination to catch up with missed changes, a source may support:
      o 3.1. Historical Change Sets: Provide access to change events that occurred prior to the ones listed in the current Change Set
      o 3.2. Historical Content: Provide access to prior resource versions
  22. Source Capability 4: Transferring Content
      By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:
      o 4.1. Dump: Publish a package of resource representations and necessary metadata
        - Destination GETs the Dump
        - Destination unpacks the Dump
      o 4.2. Alternate Content Transfer: Support alternative mechanisms to optimize getting content (see later)
  23. Source: Advertise Capabilities
      A source needs to advertise the capabilities it supports to allow a destination to discover them
      • Some capabilities may be provided by a third party, not the source itself
        o e.g. Historical Change Sets, Historical Content
        o But the source should still make those third-party capabilities discoverable (trust)
  24. ResourceSync
      • ResourceSync: What & Why?
      • Problem Perspective & Conceptual Approach
      • Possible Technical Choices
      • Q&A
  25. ResourceSync: A Framework of Capabilities
      • Modular framework allowing selective deployment of capabilities
      • A Source selects which capabilities to support in order to meet local and community needs
      • A Source’s Capabilities can be discovered via capability descriptions
  27. By reference / By value (figure labels)
  30. Sitemap
      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2012-08-08T08:15:00Z</lastmod>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2012-08-08T13:22:00Z</lastmod>
        </url>
      </urlset>
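A destination could read such a Sitemap with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on the document from the slide; the function name `parse_sitemap` and the choice to return plain `(loc, lastmod)` tuples are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# The Sitemap shown on the slide, as bytes so the encoding declaration is honored.
SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
  </url>
</urlset>"""


def parse_sitemap(xml_bytes: bytes) -> List[Tuple[str, str]]:
    """Return (resource URI, last modification datetime) pairs in document order."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod.strip()))
    return entries


if __name__ == "__main__":
    for uri, modified in parse_sitemap(SITEMAP_XML):
        print(uri, modified)
```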
  31. Baseline Matching – Sitemap
      • Periodic publication of up-to-date Sitemap, which is a “by reference” inventory of a Source’s resources
      • Use “as is” with resource location and last modification date as core elements
      • Introduce extension elements aimed at supporting audit: e.g. MD5 hash of content
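The audit idea, comparing a published hash per resource with what the destination holds, might look like the sketch below. How the hash would actually be carried in the Sitemap is not specified on the slide, so the inventory is modelled here as a plain URI-to-MD5 mapping; that representation and the function names are assumptions.

```python
import hashlib
from typing import Dict, List


def md5_hex(content: bytes) -> str:
    """Hex MD5 digest of a resource representation."""
    return hashlib.md5(content).hexdigest()


def audit(inventory: Dict[str, str], local_copy: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Compare a source inventory (URI -> MD5 hex) with a local copy (URI -> content).

    Returns URIs that are missing locally, stale (hash mismatch),
    or extra (held locally but no longer listed by the source).
    """
    missing = sorted(uri for uri in inventory if uri not in local_copy)
    stale = sorted(
        uri
        for uri, digest in inventory.items()
        if uri in local_copy and md5_hex(local_copy[uri]) != digest
    )
    extra = sorted(uri for uri in local_copy if uri not in inventory)
    return {"missing": missing, "stale": stale, "extra": extra}
```

A destination is in sync, subject to the latency of the published inventory, when all three lists come back empty.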
  32. robots.txt discovery
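The Sitemap protocol lets a site list its Sitemaps in robots.txt through `Sitemap:` lines, which is presumably the discovery mechanism the slide's figure refers to. The sketch below extracts those lines from robots.txt text; the function name and the sample URLs are hypothetical.

```python
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs named in 'Sitemap:' lines of a robots.txt document."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments; a URL containing '#' would be cut short here.
        line = line.split("#", 1)[0].strip()
        field, separator, value = line.partition(":")
        # The field name is matched case-insensitively.
        if separator and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls
```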
  33. Baseline Matching – Dump
      • A Dump is a “by-value” inventory of a Source’s resources
      • Periodic publication of an up-to-date Dump
      • Possible technology: ZIP file consisting of:
        o Special-purpose Sitemap that acts as a manifest for resources contained in the ZIP file
        o Introduce an element to express correspondence between resource URI and filename in the ZIP file
        o Resource bitstreams
      • Possible technology: WARC file
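The ZIP option can be sketched with Python's `zipfile` module. Note the simplification: the slide proposes a special-purpose Sitemap as the manifest, whereas this sketch uses a JSON file as a stand-in that records the same URI-to-filename correspondence. The manifest name, the filename scheme, and both function names are assumptions.

```python
import io
import json
import zipfile
from typing import Dict

MANIFEST_NAME = "manifest.json"  # hypothetical; stands in for the Sitemap-style manifest


def build_dump(resources: Dict[str, bytes]) -> bytes:
    """Package resource bitstreams plus a URI -> filename manifest into a ZIP archive."""
    manifest = {}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, (uri, content) in enumerate(sorted(resources.items())):
            filename = f"resources/{index:06d}.bin"
            archive.writestr(filename, content)
            manifest[uri] = filename
        archive.writestr(MANIFEST_NAME, json.dumps(manifest, indent=2))
    return buffer.getvalue()


def unpack_dump(dump: bytes) -> Dict[str, bytes]:
    """Rebuild the URI -> content mapping from a Dump produced by build_dump."""
    with zipfile.ZipFile(io.BytesIO(dump)) as archive:
        manifest = json.loads(archive.read(MANIFEST_NAME))
        return {uri: archive.read(filename) for uri, filename in manifest.items()}
```

Keeping filenames independent of the URIs avoids path and character-set problems inside the archive; the manifest alone carries the correspondence.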
  35. Change Communication – Pull Change Sets
      • Periodic publication of a Change Set that describes recent changes
      • A Change Set is a Sitemap-style document, enhanced to express change events rather than inventory. Per change event, convey:
        o About the event:
          - datetime
          - event type: create/update/delete (maybe move/copy)
        o About the changed resource:
          - URI
          - Information relevant for audit, e.g. fixity, size, MIME type
          - Further information to aid in accessing the resource (see later)
  36. Change Set, Based on Sitemap
      <?xml version="1.0" encoding="UTF-8"?>
      <urlset rs:type="changeset"
              xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod rs:type="created">2012-08-08T10:22:00Z</lastmod>
        </url>
      </urlset>
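Reading the Sitemap-based Change Set differs from reading a plain Sitemap only in the namespaced `rs:type` attributes. The sketch below parses the slide's example (with straight quotes) using `xml.etree.ElementTree`; the function name and the returned tuple layout are assumptions, and the format itself is a draft idea from the slide rather than a finalized specification.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

CHANGESET_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod rs:type="created">2012-08-08T10:22:00Z</lastmod>
  </url>
</urlset>"""


def parse_change_set(xml_bytes: bytes) -> List[Tuple[str, str, str]]:
    """Return (event type, resource URI, event datetime) tuples in document order."""
    root = ET.fromstring(xml_bytes)
    # ElementTree spells namespaced attributes as {namespace}localname.
    if root.get(f"{{{RS}}}type") != "changeset":
        raise ValueError("document is not marked as a change set")
    events = []
    for url in root.findall(f"{{{SM}}}url"):
        loc = url.findtext(f"{{{SM}}}loc", default="").strip()
        lastmod = url.find(f"{{{SM}}}lastmod")
        if lastmod is None:
            continue  # no event information for this entry
        event_type = lastmod.get(f"{{{RS}}}type", "")
        events.append((event_type, loc, (lastmod.text or "").strip()))
    return events
```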
  37. Change Set, from Scratch
      <?xml version="1.0" encoding="UTF-8"?>
      <changeset xmlns="http://www.openarchives.org/rs/changeset">
        <change>
          <link rel="created" length="1234" type="text/html" href="http://example.com/res1.html"/>
          <date>2012-09-25T09:00:00Z</date>
          <fixity>ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx</fixity>
        </change>
      </changeset>
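For comparison, the from-scratch format could be read as sketched below. The `Change` dataclass, its field names, and the helper `parse_ni` are illustrative assumptions; the fixity value is split on the `ni:///algorithm;value` pattern exactly as it appears on the slide, without checking digest length.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple

NS = "http://www.openarchives.org/rs/changeset"

CHANGE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<changeset xmlns="http://www.openarchives.org/rs/changeset">
  <change>
    <link rel="created" length="1234" type="text/html"
          href="http://example.com/res1.html"/>
    <date>2012-09-25T09:00:00Z</date>
    <fixity>ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx</fixity>
  </change>
</changeset>"""


@dataclass
class Change:
    event: str
    uri: str
    length: int
    media_type: str
    date: str
    fixity_algorithm: str
    fixity_value: str


def parse_ni(ni_uri: str) -> Tuple[str, str]:
    """Split an 'ni:///algorithm;value' URI into (algorithm, value)."""
    prefix = "ni:///"
    if not ni_uri.startswith(prefix):
        raise ValueError(f"not an ni URI: {ni_uri}")
    algorithm, _, value = ni_uri[len(prefix):].partition(";")
    return algorithm, value


def parse_changes(xml_bytes: bytes) -> List[Change]:
    """Return one Change per <change> element that carries a <link>."""
    changes = []
    for change in ET.fromstring(xml_bytes).findall(f"{{{NS}}}change"):
        link = change.find(f"{{{NS}}}link")
        if link is None:
            continue
        fixity = change.findtext(f"{{{NS}}}fixity", default="").strip()
        algorithm, value = parse_ni(fixity) if fixity else ("", "")
        changes.append(Change(
            event=link.get("rel", ""),
            uri=link.get("href", ""),
            length=int(link.get("length", "0")),
            media_type=link.get("type", ""),
            date=change.findtext(f"{{{NS}}}date", default="").strip(),
            fixity_algorithm=algorithm,
            fixity_value=value,
        ))
    return changes
```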
  39. Change Communication – Push Change Sets
      • Use a push technology to convey changes
      • Express changes using same Sitemap-style document
        o A Change Set in this case might convey only one change event
      • Possible technology: XMPP PubSub
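The push pattern can be illustrated without any XMPP machinery. The sketch below is an in-process publish-subscribe stand-in, not XMPP PubSub and not a real network service: the `ChangeNotifier` class, its methods, and the node name are all hypothetical, and only the flow (source publishes a Change Set to a node, subscribed destinations receive it) mirrors the slide.

```python
from typing import Callable, Dict, List

ChangeSetHandler = Callable[[str], None]


class ChangeNotifier:
    """In-process stand-in for a pub-sub service; subscribers are plain callables."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[ChangeSetHandler]] = {}

    def subscribe(self, node: str, handler: ChangeSetHandler) -> None:
        """Register a destination's handler for Change Sets published to a node."""
        self._subscribers.setdefault(node, []).append(handler)

    def publish(self, node: str, change_set_xml: str) -> int:
        """Deliver a Change Set to every subscriber of the node; return the delivery count."""
        handlers = list(self._subscribers.get(node, []))
        for handler in handlers:
            handler(change_set_xml)
        return len(handlers)
```

In a real deployment the handler would parse the pushed document and act upon its events, and delivery would go through a messaging service rather than direct function calls.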
  40. <XMPP PubSub Intermezzo>
      XMPP Publish-Subscribe: Client to Subscription Service, Subscription Service to Client(s) communication
      • One of the XMPP (Extensible Messaging and Presence Protocol) extensions: http://xmpp.org/extensions/xep-0060.html
      • Apple Notifications based on XMPP PubSub
      • Both client and server tools widely available
  41. </XMPP PubSub Intermezzo>
      Source, PubSub Server, Destination (figure labels)
  43. Change Communication Memory
      • Publication of one or more Change Sets that convey historical (rather than recent) changes
      • All historical Change Sets use same Sitemap-style document
      • Same approach irrespective of whether pull or push is used for Change Communication
  46. Resource Transfer
      • Resources are obtained in bulk by obtaining a Dump
      • An individual resource is, by default, obtained by dereferencing a resource’s URI listed in:
        o Sitemap
        o Change Set
      • Alternative access mechanisms are introduced to obtain an individual resource:
        o From a mirror site
        o Access to diff with previous version instead of access to the entire changed resource
        o Resource version
  48. Resource Memory
      • Requires a (short or long term) archive of resource versions
      • Access to specific version can be expressed as an alternative access mechanism in e.g. Change Set.
        o Via a link to a version resource that is the result of the change expressed in the Change Set
        o Via a link to a Memento TimeGate that supports access to all available prior versions
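In the Memento framework a client asks a TimeGate for the version of a resource that was current at a given moment by sending an `Accept-Datetime` request header. The sketch below only builds such a request with Python's standard library and does not send it; the TimeGate URI pattern and the function name are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(timegate_uri: str, when: datetime) -> Request:
    """Build (but do not send) a GET request asking a TimeGate for the version as of `when`."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    # Accept-Datetime carries an HTTP-style date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(timegate_uri, headers={"Accept-Datetime": http_date}, method="GET")


if __name__ == "__main__":
    request = timegate_request(
        "http://example.com/timegate/http://example.com/res1",  # hypothetical TimeGate URI
        datetime(2012, 8, 8, 8, 15, tzinfo=timezone.utc),
    )
    # urllib stores header names capitalized as 'Accept-datetime';
    # HTTP header names are case-insensitive on the wire.
    print(request.get_header("Accept-datetime"))
```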
  49. <Memento Intermezzo> http://www.mementoweb.org/
  50. Original Resources and Mementos
  51. Bridge from Present to Past
  52. Bridge from Past to Present
  53. Memento Framework
  61. Memento Framework
      Original Resource: http://lanlsource.lanl.gov/pics/picoftheday.png
  62. Time Travel across Versions of a Picture of the Day
      Movie at: http://www.mementoweb.org/demo/picoftheday.mov
  63. Memento Framework
      Original Resource: http://dbpedia.org/resource/France
  64. Time-Series Analysis across DBpedia Versions
      Data collected through HTTP Navigation
      Paper at http://arxiv.org/abs/1003.3661
  65. </Memento Intermezzo> http://www.mementoweb.org/
  67. ResourceSync Timeline
      • August 2012
        o First draft spec shared for feedback with ResourceSync team
      • September 2012
        o Problem Statement paper in D-Lib Magazine
        o In-person meeting of ResourceSync Team
      • October 2012
        o Revise spec, conduct experiments
        o Solicit broad feedback
      • December 2012
        o Finalize specification (?)
  68. Pointers
      • First ResourceSync draft spec (do not implement!): http://www.openarchives.org/rs/0.1/resourcesync
      • ResourceSync Simulator code on GitHub: http://github.org/resync/simulator
      • NISO ResourceSync workspace: http://www.niso.org/workrooms/resourcesync/
      • Memento: http://mementoweb.org
  69. ResourceSync: Get the Sticker!
      Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp
      ResourceSync is funded by The Sloan Foundation & JISC
