ResourceSync: Web-based Resource Synchronization

Slides from Open Repositories 2012 in Edinburgh, 11 July 2012 (http://or2012.ed.ac.uk/)

Published in: Technology, Education

  1. 1. ResourceSync: Web-based Resource Synchronization. Simeon Warner (Cornell). Open Repositories 2012, Edinburgh, 11 July 2012
  2. 2. Core team – Todd Carpenter (NISO), Bernhard Haslhofer (Cornell University), Martin Klein (Los Alamos National Laboratory), Nettie Lagace (NISO), Carl Lagoze (Cornell University), Peter Murray (NISO), Michael L. Nelson (Old Dominion University), Robert Sanderson (Los Alamos National Laboratory), Herbert Van de Sompel (Los Alamos National Laboratory), Simeon Warner (Cornell University)
       Team members – Richard Jones (JISC/Cottage Labs), Stuart Lewis (JISC/Cottage Labs), Graham Klyne (JISC), Shlomo Sanders (Ex Libris), Kevin Ford (LoC), Ed Summers (LoC), Jeff Young (OCLC), David Rosenthal (Stanford)
       Funding – The Sloan Foundation (core team) and the JISC (UK participation)
       Thanks for slides from – Stuart Lewis, Herbert Van de Sompel
  3. 3. Synchronize what?
       •  Web resources – things with a URI that can be dereferenced and are cache-able (no dependency on underlying OS, technologies etc.)
       •  Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
       •  That change slowly (weeks/months) or quickly (seconds), and where latency needs may vary
       •  Focus on needs of research communication and cultural heritage organizations, but aim for generality
  4. 4. Why? … because lots of projects and services are doing synchronization but have to roll their own on a case-by-case basis!
       •  Project team is involved with projects that need this
       •  Experience with OAI-PMH: widely used in repositories, but
            o  XML metadata only
            o  Web technology has moved on since 1999
       •  Data / Metadata / Linked Data – Shared solution?
  5. 5. Use cases – the basics JISC
  6. 6. More use cases JISC
  7. 7. Out-of-scope (for now)
       •  Bidirectional synchronization
       •  Destination-defined selective synchronization (query)
       •  Special understanding of complex objects
       •  Bulk URI migration
       •  Diffs (hooks?)
       •  Intra-application event notification
       •  Content tracking
  8. 8. Use case: DBpedia Live duplication
       •  20M entries, updated at about 1 per second, though sporadically
       •  Want low latency => need a push technology
  9. 9. Use case: arXiv mirroring
       •  1M article versions, ~800/day created or updated at 8pm US eastern time
       •  Metadata and full-text for each article
       •  Accuracy important
       •  Want low barrier for others to use
       •  Look for more general solution than current homebrew mirroring (running with minor modifications since 1994!) and occasional rsync (filesystem layout specific, auth issues)
  10. 10. Terminology
       •  Resource: an object to be synchronized, a web resource
       •  Source: system with the original or master resources
       •  Destination: system to which resources from the source will be copied and kept in synchronization
       •  Pull: process to get information from source to destination, initiated by the destination
       •  Push: process to get information from source to destination, initiated by the source (and some subscription mechanism)
       •  Metadata: information about resources such as URI, modification time, checksum, etc. (Not to be confused with resources that may themselves be metadata records)
  11. 11. Three basic needs
       1.  Baseline synchronization – A destination must be able to perform an initial load or catch-up with a source
            -  avoid out-of-band setup; provide discovery
       2.  Incremental synchronization – A destination must have some way to keep up-to-date with changes at a source
            -  subject to some latency; minimal: create/update/delete
       3.  Audit – It should be possible to determine whether a destination is synchronized with a source
            -  subject to some latency; want greater efficiency than HTTP HEAD requests
  12. 12. Baseline synchronization
       Either
       •  Get inventory of resources and then copy them one-by-one using HTTP GET
            o  simplest, inventory is list of resources plus perhaps metadata
            o  inventory format?
       or
       •  Get dump of resources and all necessary metadata
            o  more efficient: reduce number of round trips
            o  dump format?
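A minimal sketch of the inventory-based option, assuming the inventory is a plain Sitemap document. The `fetch` callable stands in for an HTTP GET, and the function names and example URIs are hypothetical, not part of any ResourceSync draft.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical two-resource inventory in plain Sitemap form.
EXAMPLE_INVENTORY = f"""<urlset xmlns="{SITEMAP_NS}">
  <url><loc>http://example.org/res1</loc></url>
  <url><loc>http://example.org/res2</loc></url>
</urlset>"""


def inventory_uris(sitemap_xml):
    """Return the resource URIs listed in a Sitemap-style inventory."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]


def baseline_sync(sitemap_xml, fetch):
    """Copy every listed resource one by one.

    `fetch` is any callable taking a URI and returning its content;
    in a real destination it would issue an HTTP GET.
    """
    return {uri: fetch(uri) for uri in inventory_uris(sitemap_xml)}


if __name__ == "__main__":
    # An in-memory dict plays the role of the source for this sketch.
    source = {
        "http://example.org/res1": b"first resource",
        "http://example.org/res2": b"second resource",
    }
    copy = baseline_sync(EXAMPLE_INVENTORY, source.__getitem__)
    print(sorted(copy))
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client.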
  13. 13. Audit
       Could do new Baseline synchronization and compare … but likely very inefficient! Optimize by adding:
       •  Get inventory and compare with copy at destination
            o  use timestamp, digest or other metadata in inventory to check content (effort ↔ accuracy tradeoff)
            o  latency depends on freshness of inventory and time to copy and check (easier to cope with if modification times included in metadata)
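The inventory comparison could be sketched as follows, assuming the inventory has already been reduced to a mapping from URI to digest. SHA-256 is chosen here only for illustration; the slides do not name a digest algorithm, and the function names are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Hex digest of a resource's content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def audit(inventory, local_copy):
    """Compare a source inventory with the copy held at the destination.

    inventory:  {uri: digest} as published by the source
    local_copy: {uri: bytes} as held by the destination
    Returns three sorted lists: (missing, stale, extra).
    """
    missing = sorted(uri for uri in inventory if uri not in local_copy)
    stale = sorted(
        uri
        for uri, expected in inventory.items()
        if uri in local_copy and digest(local_copy[uri]) != expected
    )
    extra = sorted(uri for uri in local_copy if uri not in inventory)
    return missing, stale, extra
```

Comparing digests is the accurate but more expensive end of the tradeoff; comparing only modification times would be cheaper and less certain.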
  14. 14. Incremental synchronization
       Simplest method is Audit and then copy of all new/updated resources, plus removal of deleted resources. Optimize by adding:
       •  Change Communication – Exchange ChangeSet listing only updates
            -  How to understand sequence, schedule?
       •  Resource Transfer – Exchange dumps for ChangeSets or even diffs appropriate to resource type
       Change Memory necessary to record sequence or intermediate states.
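A sketch of how a destination might apply a sequence of change events, assuming each event is a (type, URI) pair delivered in source order. The event representation and the `fetch` callable are hypothetical.

```python
def apply_changes(local_copy, events, fetch):
    """Apply create/update/delete events to the destination's copy.

    local_copy: {uri: bytes}, modified in place and returned
    events:     iterable of (event_type, uri) in source order
    fetch:      callable returning current content for a URI
                (stands in for HTTP GET against the source)
    """
    for event_type, uri in events:
        if event_type in ("create", "update"):
            local_copy[uri] = fetch(uri)
        elif event_type == "delete":
            # Tolerate deletes for resources that were never copied.
            local_copy.pop(uri, None)
        else:
            raise ValueError(f"unknown event type: {event_type!r}")
    return local_copy
```

Because content is fetched at apply time, intermediate states between two updates of the same resource are not recovered; that is the gap Change Memory is meant to fill.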
  15. 15. Template to map approaches
  16. 16. Approaches and technologies (diagram mapping technologies between Push and Pull): DSNotify, OAI-PMH, rsync, Crawl, OAI-ORE, RDFsync, WebDAV Col. Syn., XMPP, Atom, SWORD, AtomPub, Sitemap, RSS, SPARQLpush, PubSubHubbub, SDShare. JISC
  17. 17. A framework based on Sitemaps
       •  Modular framework allowing selective deployment
       •  Sitemap is the most basic component of the framework
       •  Reuse Sitemap form for changesets and notifications (same <url> element describing resource)
       •  Selective synchronization via tagging
       •  Discovery of capabilities via <atom:link>
       •  Further extension possible
  18. 18. Baseline Sync with Inventory
  19. 19. Level zero → Publish a Sitemap
       •  Periodic publication of an up-to-date Sitemap is base level implementation
       •  Use Sitemap <url> as is with <loc> and <lastmod> as core elements for each Resource
            o  Introduce optional extra elements to convey fixity information, size, tags for selective synchronization, etc.
       •  Extend to:
            o  Convey Source capabilities, discovery information, locations of dumps, locations of changesets, change memory, etc.
            o  Provide timestamp and/or additional metadata for the Sitemap
  20. 20. Two resources, with lastmod times
  21. 21. Two resources, with lastmod times, sizes and digests; the second also has a tag
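A sketch of how such a Sitemap might be generated. The `<loc>` and `<lastmod>` elements and the Sitemap namespace are standard; the `rs` namespace URI and the `size`, `digest` and `tag` element names are placeholders invented for this example, since the transcript does not give the draft's actual names.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Placeholder namespace: NOT the real ResourceSync namespace.
RS_NS = "http://example.org/hypothetical-resourcesync/"

ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)


def build_sitemap(resources):
    """Serialize resource descriptions to Sitemap XML.

    Each resource is a dict with required keys "loc" and "lastmod"
    and optional keys "size", "digest" and "tag".
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for res in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = res["loc"]
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = res["lastmod"]
        for name in ("size", "digest", "tag"):
            if name in res:
                # Hypothetical extension elements in the placeholder namespace.
                ET.SubElement(url, f"{{{RS_NS}}}{name}").text = str(res[name])
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://example.org/res1",
         "lastmod": "2012-07-11T10:00:00Z", "size": 120, "digest": "abc"},
        {"loc": "http://example.org/res2",
         "lastmod": "2012-07-11T11:00:00Z", "size": 42, "digest": "def",
         "tag": "papers"},
    ]))
```

A consumer that only understands plain Sitemaps can still read `<loc>` and `<lastmod>` and ignore the extension elements.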
  22. 22. Sitemap details & issues
       •  Sitemap XML format designed to allow extension
       •  ResourceSync additions:
            o  Additional core elements in ResourceSync namespace (digest, size, update information)
            o  Discovery information using <atom:link> elements
       •  Use existing Sitemap Index scheme for large sets of resources (handles up to 2.5 billion resources before further extension required)
       •  Provide mapping to RDF semantics but keep XML simple
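The 2.5 billion figure follows from the limits in the Sitemap protocol: at most 50,000 URLs per Sitemap and at most 50,000 Sitemaps per Sitemap Index. A small worked check, with hypothetical helper names:

```python
MAX_URLS_PER_SITEMAP = 50_000    # Sitemap protocol limit
MAX_SITEMAPS_PER_INDEX = 50_000  # Sitemap Index limit


def index_capacity():
    """Resources describable by one Sitemap Index of full Sitemaps."""
    return MAX_URLS_PER_SITEMAP * MAX_SITEMAPS_PER_INDEX


def sitemaps_needed(resource_count):
    """Number of Sitemaps needed for resource_count resources (ceiling division)."""
    return -(-resource_count // MAX_URLS_PER_SITEMAP)


def fits_in_one_index(resource_count):
    """True if a single Sitemap Index is enough for resource_count resources."""
    return sitemaps_needed(resource_count) <= MAX_SITEMAPS_PER_INDEX


if __name__ == "__main__":
    print(index_capacity())            # 50,000 x 50,000
    print(sitemaps_needed(2_300_000))  # a collection of 2.3M resources
```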
  23. 23. Incremental Sync with ChangeSet
  24. 24. ChangeSet
       •  Reuse Sitemap format but include information only for change events over a certain period:
            •  One <url> element per change event
            •  The <url> element uses <loc> and <lastmod> as is and is extended with:
                 •  an event type to express create/update/delete
                 •  an optional event id to provide a unique identifier for the event
                 •  can further extend to include fixity, tag info, Memento TimeGate link, special-purpose access-point, etc.
       •  Introduce minimal <urlset>-level extensions to support:
            •  Navigation between ChangeSets via <atom:link>
            •  Timestamping the ChangeSet
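A sketch of reading such a ChangeSet. The transcript does not show how the event type and event id are encoded, so the `rs:event` and `rs:eventid` child elements and the `rs` namespace URI below are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Placeholder namespace: NOT the real ResourceSync namespace.
RS_NS = "http://example.org/hypothetical-resourcesync/"
NS = {"sm": SITEMAP_NS, "rs": RS_NS}

# Hypothetical ChangeSet with one <url> element per change event.
EXAMPLE_CHANGESET = f"""<urlset xmlns="{SITEMAP_NS}" xmlns:rs="{RS_NS}">
  <url>
    <loc>http://example.org/res3</loc>
    <lastmod>2012-07-11T20:00:00Z</lastmod>
    <rs:event>create</rs:event>
  </url>
  <url>
    <loc>http://example.org/res1</loc>
    <lastmod>2012-07-11T20:00:05Z</lastmod>
    <rs:event>update</rs:event>
    <rs:eventid>evt-2</rs:eventid>
  </url>
  <url>
    <loc>http://example.org/res2</loc>
    <lastmod>2012-07-11T20:00:09Z</lastmod>
    <rs:event>delete</rs:event>
  </url>
</urlset>"""


def parse_changeset(xml_text):
    """Return one dict per change event, in document order."""
    events = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        events.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "event": url.findtext("rs:event", namespaces=NS),
            # The event id is optional, so this may be None.
            "event_id": url.findtext("rs:eventid", namespaces=NS),
        })
    return events
```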
  25. 25. Expt: arXiv – Inventory and ChangeSet
       •  Baseline synchronization and Audit (Inventory):
            o  2.3M resources (300GB content)
            o  46 sitemaps and 1 sitemapindex (50k resources/sitemap)
            o  sitemaps ~9.3MB each -> 430MB total uncompressed; 1.7MB each -> 78MB total if gzipped (<0.03% content size)
       •  Incremental synchronization (ChangeSet):
            o  arXiv has updates daily @ 8pm so create daily ChangeSet
            o  ~1k additions and 700 updates per day
            o  1 sitemap ~300kB or 20kB gzipped, can be generated and served statically
            o  keep chain of ChangeSets, link with <atom:link>
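The inventory figures can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 1000 MB), which the slide does not specify; the constants are the per-sitemap sizes quoted on the slide.

```python
RESOURCES = 2_300_000
RESOURCES_PER_SITEMAP = 50_000
SITEMAP_MB = 9.3        # approximate uncompressed size per sitemap
SITEMAP_GZ_MB = 1.7     # approximate gzipped size per sitemap
CONTENT_GB = 300


def inventory_overhead():
    """Return (sitemap count, total MB, total gzipped MB, gzipped fraction of content)."""
    sitemaps = -(-RESOURCES // RESOURCES_PER_SITEMAP)  # ceiling division
    total_mb = sitemaps * SITEMAP_MB
    total_gz_mb = sitemaps * SITEMAP_GZ_MB
    gz_fraction = total_gz_mb / (CONTENT_GB * 1000)    # assumes 1 GB = 1000 MB
    return sitemaps, total_mb, total_gz_mb, gz_fraction


if __name__ == "__main__":
    sitemaps, total_mb, total_gz_mb, gz_fraction = inventory_overhead()
    print(f"{sitemaps} sitemaps, {total_mb:.1f} MB, "
          f"{total_gz_mb:.1f} MB gzipped, {gz_fraction:.4%} of content")
```

This gives 46 sitemaps, about 428 MB uncompressed (rounded to 430MB on the slide), about 78 MB gzipped, and a gzipped inventory of roughly 0.026% of the content size, consistent with the stated bound of 0.03%.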
  26. 26. Incremental Sync with Push via XMPP
  27. 27. Change Communication: Push via XMPP
       •  Rapid notification of change events via XMPP PubSub node; one notification per event
       •  Each change event is conveyed using a Sitemap <url> element contained in a dedicated XMPP <item> wrapper
       •  Use same resource metadata (e.g. <loc>, <lastmod>) and same extensions as with changesets
       •  Multiple change events can be grouped into a single XMPP message (using <items>)
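A sketch of the wrapping only, built with the standard library rather than an XMPP client. The `<event>`/`<items>`/`<item>` structure and its namespace follow the XMPP PubSub extension (XEP-0060); the node name and helper functions are hypothetical, and nothing is actually sent.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
PUBSUB_EVENT_NS = "http://jabber.org/protocol/pubsub#event"


def change_url(loc, lastmod):
    """Build a Sitemap <url> element describing one change event."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return url


def wrap_notifications(node, urls):
    """Wrap each <url> in its own <item>, grouped under one <items> element."""
    event = ET.Element(f"{{{PUBSUB_EVENT_NS}}}event")
    items = ET.SubElement(event, f"{{{PUBSUB_EVENT_NS}}}items", {"node": node})
    for url in urls:
        item = ET.SubElement(items, f"{{{PUBSUB_EVENT_NS}}}item")
        item.append(url)
    return event


if __name__ == "__main__":
    payload = wrap_notifications("hypothetical-changes-node", [
        change_url("http://example.org/res1", "2012-07-11T20:00:05Z"),
        change_url("http://example.org/res2", "2012-07-11T20:00:09Z"),
    ])
    print(ET.tostring(payload, encoding="unicode"))
```

Passing a single `<url>` gives the one-notification-per-event case; passing several gives the grouped form.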
  28. 28. Expt: LiveDBpedia with XMPP Push
       •  LANL Research Library ran a significant-scale experiment in synchronization of the LiveDBpedia database from Los Alamos to two remote sites using XMPP to push change notifications
            o  Push for change communication only, content then obtained with HTTP GET
       •  Destination sites were able to keep in close synchronization with sources
            o  Maximum queued updates <400 over 6 runs with 100k updates; and bursty updates averaging ~1/s
            o  Small number of errors suggests use for audit in many real-life situations
  29. 29. Dumps
       Optimization over making repeated HTTP GET requests for multiple resources. Use for baseline and changeset. Options:
       1.  ZIP+Sitemap
            o  simple and ZIP very widely used
            o  consistent inventory/change/set format
            o  con: “custom”
       2.  WARC
            o  designed for exactly this purpose
            o  con: little used outside web archiving community
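A sketch of the ZIP+Sitemap option, assuming a dump is a ZIP file holding the resources plus a Sitemap-style manifest that maps each URI to its path inside the archive. The manifest name, the member naming scheme and the `rs:path` element are invented for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Placeholder namespace: NOT the real ResourceSync namespace.
RS_NS = "http://example.org/hypothetical-resourcesync/"


def build_dump(resources, manifest_name="manifest.xml"):
    """Pack {uri: bytes} into ZIP bytes with a Sitemap-style manifest."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i, (uri, content) in enumerate(sorted(resources.items())):
            member = f"resources/{i:05d}"
            zf.writestr(member, content)
            url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
            # Hypothetical element recording where the content sits in the ZIP.
            ET.SubElement(url, f"{{{RS_NS}}}path").text = member
        zf.writestr(manifest_name, ET.tostring(urlset, encoding="unicode"))
    return buf.getvalue()


def read_dump(data, manifest_name="manifest.xml"):
    """Unpack ZIP bytes produced by build_dump back into {uri: bytes}."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read(manifest_name))
        return {
            url.findtext(f"{{{SITEMAP_NS}}}loc"):
                zf.read(url.findtext(f"{{{RS_NS}}}path"))
            for url in root
        }
```

One transfer of the archive replaces one HTTP GET per resource, and the manifest reuses the same `<url>` description as the inventory.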
  30. 30. Sitemaps + XMPP + Dumps
  31. 31. Timeline and input
       •  July 2012 – First draft of sitemap-based spec (SOON)
       •  August 2012 – Publicize and solicit feedback (will be NISO email list)
       •  September 2012 – Revise, more experiments, more feedback
       •  December 2012 – Finalize specification (?)
       •  NISO webspace
       •  Code on github: http://github.org/resync/simulator
