ResourceSync: Leveraging Sitemaps
for Resource Synchronization
WWW 2013, Rio de Janeiro, May 17th
Bernhard Haslhofer | University of Vienna
Simeon Warner | Cornell University
Carl Lagoze | University of Michigan
Martin Klein, Robert Sanderson | Los Alamos National Labs
Michael L. Nelson | Old Dominion University
Herbert van de Sompel | Los Alamos National Labs
http://www.openarchives.org/rs/
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics
• Demos
• Status and Next Steps
What?
• A framework for synchronizing Web
resources from a Source to a Destination
[Diagram: Source and Destination syncing over the Web]
$ resync http://example.com
Why?
• rsync: filesystem sync, but not Web
• OAI-PMH: metadata, but not resources
• WebDAV: extends HTTP, requires server
installation at source
• ...
… because lots of projects and services are doing
synchronization but rely on ad-hoc solutions!
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics
• Demos
• Status and Next Steps
arxiv.org mirroring
• 2.4M resources (PDF,
metadata, LaTeX source)
• ~800/day created or
updated
• uses homebrew
mirroring since 1994 (!)
• looking for a more general
solution to support
independent destinations
Wikipedia
• 1.4 updates / sec
• many dependent
services reusing
Wikipedia content (e.g.,
DBpedia, Freebase)
• harvest articles via
OAI-PMH, retrieve changes
via IRC, download
dumps
data.europeana.eu
• aggregates metadata
from >200 data
providers in Europe
• 10 largest providers
contribute 80%
• >190 providers
contribute 20%
Design Guidelines
• Sync small websites / repositories (few
resources) but also large data collections
(millions of resources)
• Support low change frequency (weeks /
months) to high change frequency
(seconds) sources
• Low adoption barrier!
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics
• Demos
• Status and Next Steps
Resource List
Destination
Source
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
  </url>
</urlset>
$ resync -b http://example.com
XML Sitemap
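A Source could produce a minimal Resource List like the one above with a few lines of Python. This is only a sketch using the standard library; `build_resource_list` is a hypothetical helper name, not part of the resync client or library.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"

# Serialize Sitemap elements without a prefix and ResourceSync elements as rs:...
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)


def build_resource_list(uris, modified):
    """Return a Resource List (XML Sitemap) as a string.

    uris     -- iterable of resource URIs to advertise
    modified -- timestamp of the list itself, e.g. "2013-01-03T09:00:00Z"
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    # The rs:md element marks this Sitemap as a Resource List.
    ET.SubElement(
        urlset,
        f"{{{RS_NS}}}md",
        {"capability": "resourcelist", "modified": modified},
    )
    for uri in uris:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_resource_list(
        ["http://example.com/res1", "http://example.com/res2"],
        "2013-01-03T09:00:00Z",
    ))
```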
Resource List
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"/>
  </url>
</urlset>
Source
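On the Destination side, the `lastmod` and `hash` values in a Resource List can be read with the standard library and used to check downloaded content. The sketch below is illustrative only: `parse_resource_list` and `matches_hash` are hypothetical names, and only the `md5:` prefix shown on the slide is handled.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_resource_list(xml_text):
    """Return one dict per <url> entry with its URI, lastmod and hash."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)  # optional per-resource metadata
        entries.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "hash": md.get("hash") if md is not None else None,
        })
    return entries


def matches_hash(content, hash_attr):
    """Check bytes against a hash attribute such as 'md5:1584abdf...'."""
    algorithm, _, expected = hash_attr.partition(":")
    if algorithm != "md5":
        # Assumption: other digest types are out of scope for this sketch.
        raise ValueError(f"unsupported hash type: {algorithm}")
    return hashlib.md5(content).hexdigest() == expected
```

A Destination would call `matches_hash` on each downloaded body and re-fetch any resource whose digest does not match.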
Change List
Destination
Source
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         modified="2013-01-03T11:00:00Z"/>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod>2013-01-02T18:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>
$ resync -b http://example.com
$ resync -i http://example.com
XML Sitemap
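An incremental sync can be thought of as replaying a Change List against the Destination's local state. The following is a rough sketch, not the resync client's implementation: `apply_change_list` is a hypothetical name, local state is simplified to a dict of URI to last-modified time, and the `created` value is assumed to be handled like `updated`.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def apply_change_list(xml_text, local):
    """Update `local` (dict: URI -> lastmod) in place.

    Returns the list of URIs whose content must be (re)fetched.
    """
    root = ET.fromstring(xml_text)
    to_fetch = []
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if change == "deleted":
            # Drop the local copy; ignore resources we never had.
            local.pop(uri, None)
        elif change in ("created", "updated"):
            local[uri] = url.findtext("sm:lastmod", namespaces=NS)
            to_fetch.append(uri)
    return to_fetch
```

With the Change List above, `res2` would be queued for download and `res3` removed from the local state.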
Resource Dump
Source
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/resourcedump.zip</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </url>
</urlset>
XML Sitemap
Resource Dump
http://example.com/resourcedump.zip
|- manifest.xml
|- resources
   |- res1
   |- res2
Resource Dump Manifest
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-03T03:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-03T04:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"
           path="/resources/res2"/>
  </url>
</urlset>
manifest.xml (XML Sitemap)
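The manifest ties each file in the ZIP back to its original URI, so a Destination can unpack and verify a dump in one pass. A minimal sketch under stated assumptions: `verify_dump` is a hypothetical helper, the manifest is assumed to sit at `manifest.xml` in the archive root as on the previous slide, and only `md5:` hashes are checked.

```python
import hashlib
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def verify_dump(zip_bytes, manifest_name="manifest.xml"):
    """Map each resource URI in the manifest to True/False (hash matches)."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        manifest = ET.fromstring(archive.read(manifest_name))
        for url in manifest.findall("sm:url", NS):
            uri = url.findtext("sm:loc", namespaces=NS)
            md = url.find("rs:md", NS)
            # "/resources/res1" in the manifest -> "resources/res1" in the ZIP
            member = md.get("path").lstrip("/")
            expected = md.get("hash").partition(":")[2]
            actual = hashlib.md5(archive.read(member)).hexdigest()
            results[uri] = actual == expected
    return results
```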
Capability List
Destination
Source
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln href="http://example.com/info-about-source.xml"
         rel="describedby"
         type="application/xml"/>
  <rs:md capability="capabilitylist"
         modified="2013-01-02T14:00:00Z"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>
$ resync -x http://example.com
XML Sitemap
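A Destination can use the Capability List as its entry point: read it once, then follow the URL of whichever capability it needs. The sketch below shows one possible way to do that lookup; `read_capabilities` is a hypothetical name, not a function of the resync library.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def read_capabilities(xml_text):
    """Return a dict mapping capability name -> document URL."""
    root = ET.fromstring(xml_text)
    capabilities = {}
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is None or md.get("capability") is None:
            continue  # skip entries that do not declare a capability
        capabilities[md.get("capability")] = url.findtext(
            "sm:loc", namespaces=NS
        )
    return capabilities
```

For the Capability List above, `read_capabilities(...)["changelist"]` would give `http://example.com/dataset1/changelist.xml`.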
Large Resource Lists
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         modified="2013-01-03T09:00:00Z"/>
  <sitemap>
    <loc>http://example.com/resourcelist-part2.xml</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/resourcelist-part1.xml</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </sitemap>
</sitemapindex>
Source
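A client has to tell a plain Resource List (`urlset`) from a split one (`sitemapindex`) before processing it. A small sketch of that check, with the hypothetical helper name `list_parts`; fetching the individual parts is left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}


def list_parts(xml_text):
    """Return the part URLs of a sitemap index, or [] for a plain urlset."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{SITEMAP_NS}}}sitemapindex":
        return []  # a single-document Resource List has no parts
    return [
        sitemap.findtext("sm:loc", namespaces=NS)
        for sitemap in root.findall("sm:sitemap", NS)
    ]
```

Each returned URL points at an ordinary `urlset` document that can be handled like a small Resource List.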
Other Capabilities
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics Walkthrough
• Demos
• Status and Next Steps
Available code
• ResourceSync client
and library (Python)
• ResourceSync source
simulator
http://github.com/resync
Install resync client/library
$ git clone git://github.com/resync/resync.git
$ cd resync/
$ python setup.py build
$ sudo python setup.py install
or
$ sudo easy_install resync
or
$ sudo pip install resync
Install resync simulator
$ sudo easy_install tornado
$ git clone git://github.com/resync/simulator.git
$ cd simulator/
$ chmod u+x simulate-source
$ ./simulate-source
Run client against simulator
$ resync -b http://localhost:8888
$ resync -i http://localhost:8888
resync @ arxiv.org
$ resync -v --noauth \
    http://resync.library.cornell.edu/arxiv-q-bio=/tmp/qbio \
    http://resync.library.cornell.edu/arxiv=/tmp/arxiv
resync @ en.wikipedia.org
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics Walkthrough
• Demos
• Status and Next Steps
Status
• Beta spec (v0.6) for public comment:
http://www.openarchives.org/rs/0.6/resourcesync
• Tool development started
• Separate documents for archiving and push
deployments
Next Steps
• Continue tool development & deployment
• Collect
  • public comments on
    resourcesync@googlegroups.com
  • implementation issues on
    https://github.com/resync/resync/issues
• Version 0.9 to be released in summer 2013
• Version 1.0 in fall 2013 (NISO standard)
Thanks!
@bhaslhofer
http://slideshare.net/bhaslhofer
http://openarchives.org/rs
resourcesync@googlegroups.com
