ResourceSync: Leveraging Sitemaps for Resource Synchronization


  1. ResourceSync: Leveraging Sitemaps for Resource Synchronization
     WWW 2013, Rio de Janeiro, May 17th
     Bernhard Haslhofer | University of Vienna
     Simeon Warner | Cornell University
     Carl Lagoze | University of Michigan
     Martin Klein, Robert Sanderson | Los Alamos National Labs
     Michael L. Nelson | Old Dominion University
     Herbert van de Sompel | Los Alamos National Labs
     http://www.openarchives.org/rs/
  2. ResourceSync
     • What and Why?
     • Synchronization Scenarios
     • ResourceSync Basics
     • Demos
     • Status and Next Steps
  3. What?
     • A framework for synchronizing Web resources from a Source to a Destination
     [Figure labels: Web, sync]
     $ resync http://example.com
  4. Why?
     • rsync: filesystem sync, but not Web
     • OAI-PMH: metadata, but not resources
     • WebDAV: extends HTTP, requires server installation at source
     • ...
     … because lots of projects and services are doing synchronization but rely on ad-hoc solutions!
  5. ResourceSync
     • What and Why?
     • Synchronization Scenarios
     • ResourceSync Basics
     • Demos
     • Status and Next Steps
  6. arxiv.org mirroring
     • 2.4M resources (PDF, metadata, LaTeX src)
     • ~800/day created or updated
     • uses homebrew mirroring since 1994 (!)
     • look for more general solution to support independent destinations
  7. Wikipedia
     • 1.4 updates / sec
     • many dependent services reusing Wikipedia content (e.g., DBpedia, Freebase, etc.)
     • harvest articles via OAI-PMH, retrieve changes via IRC, download dumps
  8. data.europeana.eu
     • aggregates metadata from >200 data providers in Europe
     • 10 largest providers contribute 80%
     • >190 providers contribute 20%
  9. Design Guidelines
     • Sync small websites / repositories (few resources) but also large data collections (millions of resources)
     • Support low change frequency (weeks / months) to high change frequency (seconds) sources
     • Low adoption barrier!
  10. ResourceSync
      • What and Why?
      • Synchronization Scenarios
      • ResourceSync Basics
      • Demos
      • Status and Next Steps
  11. Resource List
      [Figure labels: Source, Destination, XML Sitemap]
      $ resync -b http://example.com

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist"
               modified="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
        </url>
      </urlset>
  12. Resource List
      [Figure label: Source]

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist"
               modified="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"/>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T14:00:00Z</lastmod>
          <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"/>
        </url>
      </urlset>
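
To make the structure of the Resource List above concrete, here is a minimal Python sketch that reads it with the standard library only. It is not the resync client's implementation; the function name `parse_resource_list` and the returned dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes for the lookups below (URIs taken from the slide).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Resource List shown on the slide.
RESOURCE_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Hypothetical helper: map each resource URI to its lastmod and hash."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    resources = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)  # per-resource metadata, may be absent
        resources[loc] = {
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "hash": md.get("hash") if md is not None else None,
        }
    return resources


if __name__ == "__main__":
    for uri, info in parse_resource_list(RESOURCE_LIST).items():
        print(uri, info["lastmod"], info["hash"])
```

A Destination could compare the `lastmod` or `hash` values against its local copies to decide which resources to fetch during a baseline sync.
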
  13. Change List
      [Figure labels: Source, Destination, XML Sitemap]
      $ resync -b http://example.com
      $ resync -i http://example.com

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changelist"
               modified="2013-01-03T11:00:00Z"/>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md change="updated"/>
        </url>
        <url>
          <loc>http://example.com/res3</loc>
          <lastmod>2013-01-02T18:00:00Z</lastmod>
          <rs:md change="deleted"/>
        </url>
      </urlset>
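
As an illustration of how a Destination might act on the Change List above, the following Python sketch splits its entries into resources to fetch and resources to delete. This is a simplified sketch, not the logic of `resync -i`; the function name `plan_changes` is hypothetical, and handling of a `created` change value is an assumption, since the slide shows only `updated` and `deleted`.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Change List shown on the slide.
CHANGE_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" modified="2013-01-03T11:00:00Z"/>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod>2013-01-02T18:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>"""


def plan_changes(xml_text):
    """Hypothetical helper: return (uris_to_fetch, uris_to_delete)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    to_fetch, to_delete = [], []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if change in ("created", "updated"):  # "created" is assumed here
            to_fetch.append(loc)
        elif change == "deleted":
            to_delete.append(loc)
    return to_fetch, to_delete


if __name__ == "__main__":
    fetch, delete = plan_changes(CHANGE_LIST)
    print("fetch:", fetch)
    print("delete:", delete)
```
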
  14. Resource Dump
      [Figure labels: Source, XML Sitemap]

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcedump"
               modified="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/resourcedump.zip</loc>
          <lastmod>2013-01-03T09:00:00Z</lastmod>
        </url>
      </urlset>
  15. Resource Dump
      http://example.com/resourcedump.zip
      |- manifest.xml
      |- resources
         |- res1
         |- res2
  16. Resource Dump Manifest
      manifest.xml (XML Sitemap)

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcedump-manifest"
               modified="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-03T03:00:00Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 path="/resources/res1"/>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-03T04:00:00Z</lastmod>
          <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"
                 path="/resources/res2"/>
        </url>
      </urlset>
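
The manifest ties each resource URI to a file inside the dump and to a digest. The Python sketch below shows one way a Destination could read that mapping and check a file's bytes against the `md5:` value. The helper names `parse_manifest` and `md5_matches` are hypothetical, and the sketch assumes only the `md5:` prefix shown on the slide.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Resource Dump Manifest shown on the slide.
MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-03T03:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-03T04:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e" path="/resources/res2"/>
  </url>
</urlset>"""


def parse_manifest(xml_text):
    """Hypothetical helper: map resource URI to (path in dump, hash)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = {}
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is None:
            continue  # no path information for this entry
        loc = url.findtext("sm:loc", namespaces=NS)
        entries[loc] = (md.get("path"), md.get("hash"))
    return entries


def md5_matches(data, hash_attr):
    """Check bytes against a hash attribute of the form 'md5:<hex digest>'."""
    algo, _, expected = hash_attr.partition(":")
    return algo == "md5" and hashlib.md5(data).hexdigest() == expected


if __name__ == "__main__":
    for uri, (path, digest) in parse_manifest(MANIFEST).items():
        print(uri, "->", path, digest)
```

After unpacking the ZIP file, each file's bytes could be passed to `md5_matches` together with the hash recorded for its URI.
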
  17. Capability List
      [Figure labels: Source, Destination, XML Sitemap]
      $ resync -x http://example.com

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln href="http://example.com/info-about-source.xml"
               rel="describedby"
               type="application/xml"/>
        <rs:md capability="capabilitylist"
               modified="2013-01-02T14:00:00Z"/>
        <url>
          <loc>http://example.com/dataset1/resourcelist.xml</loc>
          <rs:md capability="resourcelist"/>
        </url>
        <url>
          <loc>http://example.com/dataset1/resourcedump.xml</loc>
          <rs:md capability="resourcedump"/>
        </url>
        <url>
          <loc>http://example.com/dataset1/changelist.xml</loc>
          <rs:md capability="changelist"/>
        </url>
      </urlset>
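
A Destination can use the Capability List above as an entry point for discovering what the Source offers. The following Python sketch collects the advertised capabilities into a dictionary; the function name `parse_capability_list` is an assumption for illustration and is not part of the resync library.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Capability List shown on the slide.
CAPABILITY_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln href="http://example.com/info-about-source.xml"
         rel="describedby" type="application/xml"/>
  <rs:md capability="capabilitylist" modified="2013-01-02T14:00:00Z"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>"""


def parse_capability_list(xml_text):
    """Hypothetical helper: map capability name to the URL of its document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    capabilities = {}
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        name = md.get("capability") if md is not None else None
        if name:
            capabilities[name] = url.findtext("sm:loc", namespaces=NS)
    return capabilities


if __name__ == "__main__":
    for name, location in parse_capability_list(CAPABILITY_LIST).items():
        print(name, "->", location)
```
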
  18. Large Resource Lists
      [Figure label: Source]

      <?xml version="1.0" encoding="UTF-8"?>
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                    xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist"
               modified="2013-01-03T09:00:00Z"/>
        <sitemap>
          <loc>http://example.com/resourcelist-part2.xml</loc>
          <lastmod>2013-01-03T09:00:00Z</lastmod>
        </sitemap>
        <sitemap>
          <loc>http://example.com/resourcelist-part1.xml</loc>
          <lastmod>2013-01-03T09:00:00Z</lastmod>
        </sitemap>
      </sitemapindex>
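
When a Resource List is split into parts, the top-level document is a `sitemapindex` rather than a `urlset`. The Python sketch below shows one way to detect that case and collect the part URLs in document order; `component_lists` is a hypothetical helper name, and fetching the parts is left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}

# The Resource List index shown on the slide.
RESOURCE_LIST_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>
  <sitemap>
    <loc>http://example.com/resourcelist-part2.xml</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/resourcelist-part1.xml</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </sitemap>
</sitemapindex>"""


def component_lists(xml_text):
    """Hypothetical helper: return part URLs if the document is an index."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != "{%s}sitemapindex" % SITEMAP_NS:
        return []  # a plain <urlset> has no component lists
    return [
        sitemap.findtext("sm:loc", namespaces=NS)
        for sitemap in root.findall("sm:sitemap", NS)
    ]


if __name__ == "__main__":
    for part in component_lists(RESOURCE_LIST_INDEX):
        print(part)
```
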
  19. Other Capabilities
  20. ResourceSync
      • What and Why?
      • Synchronization Scenarios
      • ResourceSync Basics Walkthrough
      • Demos
      • Status and Next Steps
  21. Available code
      • ResourceSync client and library (Python)
      • ResourceSync source simulator
      http://github.com/resync
  22. Install resync client/library
      $ git clone git://github.com/resync/resync.git
      $ cd resync/
      $ python setup.py build
      $ sudo python setup.py install
      or
      $ sudo easy_install resync
      or
      $ sudo pip install resync
  23. Install resync simulator
      $ git clone git://github.com/resync/simulator.git
      $ cd simulator/
      $ chmod u+x simulate-source
      $ ./simulate-source
      $ sudo easy_install tornado
  24. Run client against simulator
      $ resync -b http://localhost:8888
      $ resync -i http://localhost:8888
  25. resync @ arxiv.org
      resync -v --noauth http://resync.library.cornell.edu/arxiv-q-bio=/tmp/qbio http://resync.library.cornell.edu/arxiv=/tmp/arxiv
  26. resync @ en.wikipedia.org
  27. ResourceSync
      • What and Why?
      • Synchronization Scenarios
      • ResourceSync Basics Walkthrough
      • Demos
      • Status and Next Steps
  28. Status
      • Beta spec (v.0.6) for public comment
        http://www.openarchives.org/rs/0.6/resourcesync
      • Tool development started
      • Separate documents for archiving and push deployments
  29. Next Steps
      • Continue tool development & deployment
      • Collect
        • public comments on resourcesync@googlegroups.com
        • implementation issues on https://github.com/resync/resync/issues
      • Version 0.9 to be released in Summer 2013
      • Version 1.0 in fall 2013 (NISO standard)
  30. Thanks!
      @bhaslhofer
      http://slideshare.net/bhaslhofer
      http://openarchives.org/rs
      resourcesync@googlegroups.com