ResourceSync Tutorial

  1. ResourceSync: A Web-Based Resource Synchronization Framework #resourcesync ResourceSync is funded by The Sloan Foundation & JISC ResourceSync Tutorial DANS, January 21 2014, Den Haag, Netherlands 1
  2. These slides were presented at the LITA Forum, Louisville, Kentucky, November 10 2013. The most recent version of the slides is available at http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial
  3. ResourceSync Tutorial History
     • First outing: OAI8, June 2013
     • Second run: Open Repositories, July 2013
     • Third run: JCDL, July 2013
     • Fourth run: TPDL 2013, September 2013
     • Fifth run: LITA Forum, November 2013
     • Sixth run: SWIB 2013, November 2013
     Presenter: Herbert Van de Sompel, Los Alamos National Laboratory <hvdsomp@gmail.com> @hvdsomp
  4. ResourceSync Tutorial Contributors
     • Martin Klein, Los Alamos National Laboratory <martinklein0815@gmail.com> @mart1nkle1n
     • Herbert Van de Sompel, Los Alamos National Laboratory <hvdsomp@gmail.com> @hvdsomp
     • Robert Sanderson, Los Alamos National Laboratory <azaroth24@gmail.com> @azaroth24
     • Simeon Warner, Cornell University <simeon.warner@cornell.edu> @zimeon
     • Michael L. Nelson, Old Dominion University <mln@cs.odu.edu> @phonedude_mln
     • Richard Jones, Cottage Labs <richard@cottagelabs.com> @cottagelabs
  5. • OAI: Herbert Van de Sompel, Martin Klein, Robert Sanderson (Los Alamos National Laboratory); Simeon Warner (Cornell University); Bernhard Haslhofer (University of Vienna); Michael L. Nelson (Old Dominion University); Carl Lagoze (University of Michigan)
     • NISO: Todd Carpenter, Nettie Lagace
     • University of Oxford: Graham Klyne
     • Lyrasis: Peter Murray
  6. ResourceSync Technical Group
     • LOCKSS: David Rosenthal
     • Ex Libris Inc.: Shlomo Sanders
     • JISC: Paul Walk, Richard Jones, Graham Klyne, Stuart Lewis
     • RedHat: Christian Sadilek
     • OCLC: Jeff Young
     • Library of Congress: Kevin Ford
  7. Timeline, Status of Specification(s)
     • August 2013
       o Release of ResourceSync framework Core specification - Version 0.9.1
       o Public draft of ResourceSync Archives specification released
     • September 2013
       o Core specification on its way to becoming an ANSI standard
     • November 2013
       o Internal draft of ResourceSync Notification specification
     • January 2014
       o Public draft of ResourceSync Notification specification
     • Mid 2014
       o Core specification becomes ANSI/NISO standard
  8. Pointers
     • Specification
       http://www.openarchives.org/rs/
       http://www.openarchives.org/rs/resourcesync
       http://www.openarchives.org/rs/notification
       http://www.openarchives.org/rs/archives
     • List for public comment
       https://groups.google.com/d/forum/resourcesync
     • Client and simulator code
       http://github.com/resync/resync
       http://github.com/resync/simulator
  9. Papers
     • Klein, M., and Van de Sompel, H. (2013) Extending Sitemaps for ResourceSync. ACM/IEEE JCDL 2013. http://arxiv.org/abs/1305.4890
     • Haslhofer, B., Warner, S., Lagoze, C., Klein, M., Sanderson, R., Nelson, M.L., and Van de Sompel, H. (2013) ResourceSync: Leveraging Sitemaps for Resource Synchronization. WWW 2013 Developer Track. http://arxiv.org/abs/1305.1476
     • Klein, M., Sanderson, R., Van de Sompel, H., Warner, S., Haslhofer, B., Lagoze, C., and Nelson, M.L. (2013) A Technical Framework for Resource Synchronization. D-Lib Magazine. http://dx.doi.org/10.1045/january2013-klein
     • Van de Sompel, H., Sanderson, R., Klein, M., Nelson, M.L., Haslhofer, B., Warner, S., and Lagoze, C. (2012) A Perspective on Resource Synchronization. D-Lib Magazine. http://dx.doi.org/10.1045/september2012-vandesompel
  10. ResourceSync - Agenda
      1. ResourceSync: Problem Perspective & Conceptual Approach
      2. Motivation & Use Cases
      3. Framework Walkthrough
      4. Framework (Technical) Details
      5. Implementation
      6. Q&A
  11. ResourceSync - Agenda 1. ResourceSync: Problem Perspective & Conceptual Approach
  12. Synchronize What?
      • Web resources
        o things with a URI that can be dereferenced
      • Focus on needs of research communication and cultural heritage organizations
        o but aim for generality
  13. Synchronize What?
      • Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
  14. Synchronize What?
      • Low change frequency (weeks/months) to high change frequency (seconds)
  15. Synchronize What?
      • Synchronization latency and accuracy needs may vary
  16. Why? … because lots of projects and services are doing synchronization but have to resort to ad-hoc, case-by-case approaches!
      • Project team involved with projects that need this
      • Experience with OAI-PMH: widely used in repos but
        o XML metadata only
        o Web technology has moved on since 1999
      • Devise a shared solution for data, metadata, linked data?
  17. ResourceSync Problem
      • Consideration:
        • Source (server) A has resources that change over time: they get created, modified, deleted
        • Destinations (servers) X, Y, and Z leverage (some) resources of Source A.
      • Problem:
        • Destinations want to keep in step with the resource changes at Source A: resource synchronization.
      • Goal:
        • Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities.
        • The approach must scale better than recurrent HTTP HEAD/GET on resources.
  18. Source: Core Synchronization Capabilities (PULL)
      1. Describing content – publish a list of resources available for synchronization to enable Destinations to perform an initial load or catch up with a Source
      2. Packaging content – bundle resources to enable bulk download by Destinations
      3. Describing changes – publish a list of resource changes to enable Destinations to stay synchronized and decrease latency
      4. Packaging changes – bundle resource changes for bulk download by Destinations
  19. Source: Notification Capabilities (PUSH)
      To reduce synchronization latency and to optimize the synchronization process the Source can support:
      1. Change Notification
         • Notifies about changes to particular resources
         • e.g., resource A has been updated | created | deleted
      2. Framework Notification
         • Notifies about changes to capabilities, i.e., their documents
         • e.g., a Change List has been updated | created | deleted
  20. Source: Archival Capabilities (ARCHIVES)
      The Source may hold on to historical data, for example, to allow Destinations to catch up with events they missed or revisit prior resource states. To this end, the Source can publish archives, i.e. documents that enumerate historical capability documents:
      1. Resource List Archive
      2. Resource Dump Archive
      3. Change List Archive
      4. Change Dump Archive
  21. Source: Synchronization Features
      1. Discovery of capabilities – support Destinations in discovering all offered capabilities
         o Applies to PULL, PUSH, ARCHIVES capabilities
      2. Linking to related resources – provide links from resources subject to synchronization to related resources
         o Applies to PULL, PUSH capabilities
  22. Destination: Synchronization Needs
      1. Baseline synchronization – A Destination must be able to perform an initial load or catch up with a Source
         - avoid out-of-band setup
      2. Incremental synchronization – A Destination must have some way to keep up to date with changes at a Source
         - subject to some latency; minimal: create/update/delete
         - allow catching up after the Destination has been offline
      3. Audit – A Destination should be able to determine whether it is synchronized with a Source
         - regarding coverage and accuracy
  23. ResourceSync - Agenda 2. Motivation & Use Cases
  24. Use Cases – The Basics: a), b) [figures]
  25. Use Cases – The Basics: c), d) [figures]
  26. Use Cases – The not-so-Basics: e), f) [figures]
  27. Use Case 1: arXiv Mirroring and Data Sharing
      • Repository of scholarly articles in physics, mathematics, computer science, etc.
      • > 850k articles
      • approx. 1.5 revisions per article on average
      • approx. 75k new articles per year
      • Each article has full-text and separate metadata record
      • approx. 3.8M resources
  28. Use Case 1: arXiv Mirroring and Data Sharing
      • 2,700 updates daily
        o at 8pm EST
        o Currently using homebrew mirroring solution (running with minor modifications since 1994!)
        o occasional rsync (file system-specific, auth issues)
  29. Use Case 1: arXiv Mirroring
      • GOAL: Keep mirror sites synchronized with daily changes
      • WANT:
        o high consistency
        o moderate latency
        o robustness to global network outages (low admin effort)
        o ability to verify sync status in case of questions
  30. Use Case 1: arXiv Data Sharing
      • GOAL: Make resources and update information publicly available so that any other service may synchronize at the frequency it needs, e.g.
        o Math Front at UC Davis
        o EprintWeb from IOP in UK
        o Data for bibliometric and scientometric analysis
      • WANT:
        o low admin effort (i.e. standard approach, standard tools)
        o reasonable consistency, latency, efficiency
  31. Use Case 2: DBpedia Live Duplication
      • Average of 2 updates per second
      • Low latency desirable => need for a push technology
  32. Use Case 2: DBpedia Live Duplication
      • Daily traffic:
        o 99% updates
        o 0.6% deletions
        o 0.03% creations
  33. Use Case 2: DBpedia Live Duplication [figures]
      • # of content transfer events in two 8 hour intervals
      • Max. queue size of remote duplication process
  34. ResourceSync - Agenda 3. Framework Walkthrough
  35. Source Capability 1: Describing Content
      In order to advertise the resources that a Source wants Destinations to know about, it may describe them:
      o Publish a Resource List, a list of resource URIs and possibly associated metadata
        - Destination GETs the Resource List
        - Destination GETs listed resources by their URI
      o A Resource List describes the state of a set of resources at one point in time (snapshot)
  38. Source Capability 2: Packaging Content
      By default, content is transferred in response to a GET issued by a Destination against a URI of a Source's resource. But a Source may support additional mechanisms:
      o Publish a Resource Dump, a document that points to packages of resource representations and necessary metadata
        - Destination GETs the package
        - Destination unpacks the package
        - ZIP format supported
      o A Resource Dump and the packages it points to reflect the state of a set of resources at one point in time (snapshot)
  41. Source: Modular Capabilities
  42. Source Capability 3: Describing Changes
      In order to achieve lower latency and/or greater efficiency, a Source may communicate about changes to its resources:
      o Publish a Change List, a list of recent change events (created, updated, deleted resource)
        - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
      o A Change List pertains to resources that changed in a temporal interval with a start- and an end-date
        - If a resource changed more than once, it will be listed more than once
  47. Source Capability 4: Packaging Changes
      In order to reduce the number of requests to obtain resource changes, a Source may provide packaged bitstreams for changed resources:
      o Publish a Change Dump, a document that points to packages containing bitstreams of recently changed resources and necessary metadata
        - Destination GETs the package
        - Destination unpacks the package
        - ZIP format supported
      o A Change Dump and its packages pertain to resources that changed in a temporal interval with a start- and an end-date
        - If a resource changed more than once, it will be included more than once
  50. Source: Modular Capabilities
  51. Destination: Key Processes
  52. ResourceSync - Agenda 4. Framework (Technical) Details
  53. ResourceSync - Agenda 4. Framework (Technical) Details
      1. Sitemaps
      2. Core synchronization capabilities (PULL)
      3. Discovery
      4. Linking to related resources
      5. Notification Capabilities (PUSH)
      6. Archival capabilities (ARCHIVES)
  55. So Many Choices (Push and Pull approaches): DSNotify, OAI-PMH, rsync, Crawl, OAI-ORE, RDFsync, WebDAV Col. Syn., XMPP, Atom, SWORD, Sitemap, SPARQLpush, SDShare, AtomPub, RSS, PubSubHubbub
  58. A Framework Based on Sitemaps
      • Modular framework allowing selective deployment
      • Sitemap is the core format throughout the framework
        o Introduce extension elements and attributes:
          - In ResourceSync namespace (rs:) to accommodate synchronization needs
        o Reuse Sitemap format for all capability documents: Resource List, Resource Dump, Change List, Change Dump, as well as for manifest in Dumps
        o Utilize Sitemap index format where needed/allowed
  59. Sitemap Format
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T14:00:00Z</lastmod>
        </url>
        …
      </urlset>
  60. Sitemap Index Format
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <sitemap>
          <loc>http://example.com/sitemap1.xml</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
        </sitemap>
        <sitemap>
          <loc>http://example.com/sitemap2.xml</loc>
          <lastmod>2013-01-02T14:00:00Z</lastmod>
        </sitemap>
        …
      </sitemapindex>
  61. ResourceSync Sitemap Extensions
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln …/>
        <rs:md …/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:ln …/>
          <rs:md …/>
        </url>
        <url> … </url>
      </urlset>
  62. ResourceSync Sitemap Extensions
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                    xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln …/>
        <rs:md …/>
        <sitemap>
          <loc>http://example.com/sitemap1.xml</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:ln …/>
          <rs:md …/>
        </sitemap>
        …
      </sitemapindex>
  63. Resource Metadata Summary
      Element/Attribute | Description | Defined by
      <loc> | Resource URI (identity) | sitemaps
      <lastmod> | Timestamp of last change | sitemaps
      <changefreq> | Expected update frequency | sitemaps
      <rs:md> | | ResourceSync
        change | Change type (Change List & Change Dump Manifest only) | ResourceSync
        encoding | HTTP Content-Encoding header value | RFC2616
        hash | One or more content digests (md5, sha-1, sha-256) | Atom Link Ext.
        length | HTTP Content-Length header value | RFC4287
        path | Path in ZIP package (Dump Manifests only) | ResourceSync
        type | HTTP Content-Type header value | RFC4287
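The hash and length attributes in the table above give a Destination a way to check a downloaded representation. The following is a minimal sketch, not part of the slides: the function name `matches_metadata` is hypothetical, and it assumes that several digests in one hash value are separated by whitespace and written as `name:hexdigest`, as in the md5 examples used in this tutorial.

```python
import hashlib
from typing import Optional

# Digest names from the slide mapped to hashlib algorithm names.
# Assumption: only these three digest types need to be handled.
HASHLIB_NAMES = {"md5": "md5", "sha-1": "sha1", "sha-256": "sha256"}


def matches_metadata(content: bytes, hash_attr: str,
                     length_attr: Optional[str] = None) -> bool:
    """Return True if content agrees with the given hash and length values."""
    if length_attr is not None and len(content) != int(length_attr):
        return False
    for entry in hash_attr.split():
        name, _, expected = entry.partition(":")
        if name not in HASHLIB_NAMES:
            continue  # unknown digest type: ignored in this sketch
        actual = hashlib.new(HASHLIB_NAMES[name], content).hexdigest()
        if actual != expected.lower():
            return False
    return True
```

A Destination could run such a check after each GET and re-fetch the resource when it fails.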
  64. Related Resource Metadata Summary
      • Attributes of the <rs:ln> element; c.f. resource metadata + pri
      Element/Attribute | Description | Defined by
      <rs:ln> | | ResourceSync
        encoding | HTTP Content-Encoding header value | RFC2616
        hash | One or more content digests (md5, sha-1, sha-256) | Atom Link Ext.
        href | Related resource URI (identity) | RFC4287
        length | HTTP Content-Length header value | RFC4287
        modified | Timestamp of last change (c.f. <lastmod>) | Atom Link Ext.
        path | Path in ZIP package (Dump Manifests only) | ResourceSync
        pri | Priority of link | RFC6249
        rel | Relation - IANA registered or URI | RFC4287
        type | HTTP Content-Type header value | RFC4287
  65. Link Relation Summary
      Relation | Use in ResourceSync | Defined in
      rel="alternate" | Link from generic to specific URI | HTML 5
      rel="canonical" | Link from specific to generic URI | RFC6596
      rel="collection" | Resource is member of collection | RFC6573
      rel="contents" | Link from dump to manifest | HTML4
      rel="describedby" | Has metadata | Protocol for Web Description Resources (POWDER): Description Resources
      rel="describes" | Is metadata for | The 'describes' Link Relation Type
      rel="duplicate" | Mirror or alternative copy | RFC6249
      rel=".../rs/terms/patch" | A patch -- efficient change information | This specification
      rel="memento" | Link to time-specific URI | Memento Internet Draft
      rel="timegate" | Link to timegate | Memento Internet Draft
      rel="via" | Provenance chain, came from | RFC4287
  66. ResourceSync Sitemap Validation
      • All ResourceSync capability documents are valid according to the Sitemap XML Schema
        o http://www.sitemaps.org/schemas/sitemap/0.9
      • For a more thorough validation use the ResourceSync XML Schema
        o http://www.openarchives.org/rs/0.9.1/resourcesync.xsd
  67. ResourceSync - Agenda 4. Framework (Technical) Details
      1. Sitemaps
      2. Core synchronization capabilities (PULL)
      3. Discovery
      4. Linking to related resources
      5. Notification Capabilities (PUSH)
      6. Archival capabilities (ARCHIVES)
      http://www.openarchives.org/rs/resourcesync
  68. Describing Content: Resource List http://www.openarchives.org/rs/resourcesync#DescResources
  69. Resource List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist"
               at="2013-01-03T09:00:00Z"
               completed="2013-01-03T09:01:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 length="8876"
                 type="text/html"/>
        </url>
        <url> … </url>
      </urlset>
  70. Resource List
      • Describes the Source's resources that are subject to synchronization
      • At one point in time (snapshot)
      • Creation can take some time – duration can be conveyed
      • Typical Destination use: Baseline Synchronization, Audit
      • Each URI typically listed only once
      • Might be expensive to generate
      • Destinations use @at to determine freshness
      • [@at, @completed] – interval of uncertainty
      • Destination issues GETs against URIs to obtain resources
      • Very similar to current Sitemaps
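As an illustration of how a Destination might read a Resource List, here is a minimal sketch using only Python's standard library. The function name `parse_resource_list` and the returned dictionary layout are assumptions made for this example; the XML is the Resource List shown earlier in this tutorial, with the elided second entry left out.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876" type="text/html"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return (document-level rs:md attributes, list of resource entries)."""
    root = ET.fromstring(xml_text)
    doc_md = root.find("rs:md", NS)
    info = dict(doc_md.attrib) if doc_md is not None else {}
    resources = []
    for url in root.findall("sm:url", NS):
        entry = {
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
        }
        md = url.find("rs:md", NS)
        if md is not None:
            entry.update(md.attrib)  # hash, length, type, ...
        resources.append(entry)
    return info, resources


info, resources = parse_resource_list(EXAMPLE)
print(info["at"], info.get("completed"))  # the interval of uncertainty
for res in resources:
    print(res["uri"], res["lastmod"], res.get("hash"))
```

For a baseline synchronization the Destination would then issue a GET for each listed URI; for an audit it would compare the listed metadata with its local copies.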
  71. What if I have a million resources?
      • Current sitemap limit is 50k resources (or maximum document size of 50MB)
      • Break the complete list of resources into 50k-resource chunks, each in its own Resource List document
      • Create a Resource List Index document to group them:
        o Based on <sitemapindex>
        o May have up to 50k component Resource Lists
        o Extends capacity to 2,500,000,000 resources (50k × 50k) within current community practices
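The arithmetic behind these limits can be sketched as follows. The helper names `lists_needed` and `chunked` are hypothetical; only the 50k limit comes from the slide.

```python
MAX_ENTRIES = 50_000  # per-document limit quoted on the slide


def lists_needed(total_resources, max_entries=MAX_ENTRIES):
    """Number of Resource List documents needed for total_resources URIs."""
    if total_resources < 0:
        raise ValueError("total_resources must be non-negative")
    count = (total_resources + max_entries - 1) // max_entries  # ceiling division
    if count > max_entries:
        raise ValueError("more lists than one Resource List Index can group")
    return count


def chunked(uris, size=MAX_ENTRIES):
    """Yield consecutive slices of at most size URIs, one per Resource List."""
    for start in range(0, len(uris), size):
        yield uris[start:start + size]


print(lists_needed(1_000_000))    # 20 Resource Lists for a million resources
print(MAX_ENTRIES * MAX_ENTRIES)  # 2500000000, the capacity of one index
```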
  72. Resource List Index: resourcelist_index.xml
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                    xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist" at="2013-01-02T09:00:02Z"/>
        <sitemap>
          <loc>http://example.com/resourcelist1.xml</loc>
          <rs:md type="application/xml"/>
        </sitemap>
        <sitemap>
          <loc>http://example.com/resourcelist2.xml</loc>
          <rs:md type="application/xml"/>
        </sitemap>
      </sitemapindex>
  73. Resource List: resourcelist1.xml
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln rel="index" href="http://example.com/resourcelist_index.xml"/>
        <rs:md capability="resourcelist" at="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T08:07:06Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 length="8876"
                 type="text/html"/>
        </url>
        ...
      </urlset>
  74. Resource List Index
  75. Packaging Content: Resource Dump http://www.openarchives.org/rs/resourcesync#ResourceDump
  76. Resource Dump
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcedump" at="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/resourcedump_part1.zip</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md length="97553" type="application/zip"/>
          <rs:ln rel="contents"
                 href="http://example.com/resourcedump_manifest-part1.xml"
                 type="application/xml"/>
        </url>
        <url>
          <loc>http://example.com/resourcedump_part2.zip</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
        </url>
      </urlset>
  77. Resource Dump Manifest
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcedump-manifest" at="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md type="text/html" path="/resources/res1"/>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md type="application/pdf" path="/resources/res2"/>
        </url>
      </urlset>
  78. Resource Dump
      • A Resource Dump points to packages (ZIP files) that contain representations of the Source's resources
      • At one point in time (snapshot)
      • Resource Dump is mandatory, even if there is only one ZIP file
      • ZIP package contains manifest, listing contained bitstreams
      • Typical Destination use: Baseline Synchronization, bulk download
      • Each URI typically listed only once
      • Might be expensive to generate
      • Destinations use @at to determine freshness
      • [@at, @completed] – interval of uncertainty
      • GETs against individual URIs from a Resource List achieve the same result (ignoring varying freshness)
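Unpacking a dump package can be sketched as below, using Python's standard zipfile module. This is an illustration under stated assumptions: the function name `read_dump_package` is hypothetical, the manifest is assumed to be stored in the ZIP under the name `manifest.xml`, and each manifest path is assumed to be relative to the ZIP root once its leading slash is dropped.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def read_dump_package(zip_bytes, manifest_name="manifest.xml"):
    """Return {resource URI: bitstream} for one downloaded ZIP package."""
    resources = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as package:
        manifest = ET.fromstring(package.read(manifest_name))
        for url in manifest.findall("sm:url", NS):
            uri = url.findtext("sm:loc", namespaces=NS)
            md = url.find("rs:md", NS)
            if md is None or md.get("path") is None:
                continue  # manifest entry without a path: nothing to extract
            # Assumption: "/resources/res1" refers to "resources/res1" in the ZIP.
            member = md.get("path").lstrip("/")
            resources[uri] = package.read(member)
    return resources
```

The mapping from URI to bitstream is what lets the Destination store each unpacked file as if it had fetched the resource with an individual GET.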
  79. Describing Changes: Change List http://www.openarchives.org/rs/resourcesync#DesChanges
  80. Change List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changelist"
               from="2013-01-02T09:00:00Z"
               until="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md change="updated"
                 hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 length="8876"
                 type="text/html"/>
        </url>
        <url> … </url>
      </urlset>
  81. Open Change List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changelist" from="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md change="updated"
                 hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 length="8876"
                 type="text/html"/>
        </url>
      </urlset>
  82. Change List
      • A Change List pertains to a Source's resources that changed
      • Changes that occurred during a temporal interval with start- and end-date
      • Typical Destination use: Incremental Synchronization, Audit
      • Changes are listed in chronological order
      • Multiple changes to one resource result in the resource being listed multiple times, once per change
      • Source determines duration of temporal interval
      • Destinations use @from and @until to determine freshness
      • Destinations issue GETs against URIs to obtain changed resources
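Because changes are listed chronologically and a resource may appear more than once, a Destination can collapse a Change List to one pending action per URI before issuing any GETs. The sketch below shows one way to do that; the function name `pending_actions` and the "fetch"/"remove" labels are assumptions for illustration, not terms from the specification.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def pending_actions(change_list_xml):
    """Collapse a Change List into {URI: "fetch" | "remove"}.

    Entries are processed in document order, which the Change List
    keeps chronological, so the last change to a URI wins.
    """
    root = ET.fromstring(change_list_xml)
    actions = {}
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if change in ("created", "updated"):
            actions[uri] = "fetch"
        elif change == "deleted":
            actions[uri] = "remove"
        else:
            raise ValueError(f"unexpected change type for {uri}: {change!r}")
    return actions
```

A resource that was created and then deleted within the interval ends up as "remove", so the Destination never requests a representation that no longer exists.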
  83. Change List Index: changelist_index.xml, changelist1.xml [figure]
  84. Change List Index: changelist_index.xml
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                    xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changelist"
               from="2013-01-02T09:00:00Z"
               until="2013-01-03T09:00:00Z"/>
        <sitemap>
          <loc>http://example.com/changelist1.xml</loc>
          <lastmod>2013-01-02T11:00:00Z</lastmod>
          <rs:md type="application/xml"/>
        </sitemap>
        <sitemap>
          <loc>http://example.com/changelist2.xml</loc>
          <lastmod>2013-01-02T23:00:00Z</lastmod>
          <rs:md type="application/xml"/>
        </sitemap>
      </sitemapindex>
  85. Change List: changelist1.xml
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln rel="index" href="http://example.com/changelist_index.xml"/>
        <rs:md capability="changelist"
               from="2013-01-02T09:00:00Z"
               until="2013-01-02T21:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md change="updated"
                 hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 length="8876"
                 type="text/html"/>
        </url>
      </urlset>
  86. Open Change List Index
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
                    xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changelist" from="2013-01-02T09:00:00Z"/>
        <sitemap>
          <loc>http://example.com/changelist1.xml</loc>
          <lastmod>2013-01-02T11:00:00Z</lastmod>
        </sitemap>
        <sitemap>
          <loc>http://example.com/changelist2.xml</loc>
          <lastmod>2013-01-02T23:00:00Z</lastmod>
        </sitemap>
        <sitemap>
          <loc>http://example.com/changelist_open.xml</loc>
        </sitemap>
      </sitemapindex>
  87. Change List Index
  88. Packaging Changes: Change Dump http://www.openarchives.org/rs/resourcesync#PackChanges
  89. Capability 4: Change Dump
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changedump"
               from="2013-01-02T09:00:00Z"
               until="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/change_dump_part1.zip</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md length="887" type="application/zip"/>
        </url>
        <url>
          <loc>http://example.com/change_dump_part2.zip</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md length="9767" type="application/zip"/>
        </url>
      </urlset>
  90. Change Dump Manifest
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changedump-manifest"
               from="2013-01-02T09:00:00Z"
               until="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md change="updated"
                 length="2887"
                 type="text/html"
                 path="/changes/res1"/>
        </url>
        <url> … </url>
      </urlset>
  91. Change Dump
      • A Change Dump points at packages (ZIP files) that contain bitstreams of the Source's resources that changed
      • Changes that occurred during a temporal interval with start- and end-date
      • Change Dump is mandatory, even if there is only one ZIP file
      • ZIP package contains manifest, listing contained bitstreams
      • Typical Destination use: Incremental Synchronization, bulk download of changes
      • Changes in Change Dump Manifest listed in chronological order
      • Same URI can be listed multiple times
      • Might be expensive to generate
      • Destinations use @from and @until to determine freshness
  92. ResourceSync - Agenda 4. Framework (Technical) Details
      1. Sitemaps
      2. Core synchronization capabilities (PULL)
      3. Discovery
      4. Linking to related resources
      5. Notification Capabilities (PUSH)
      6. Archival capabilities (ARCHIVES)
      http://www.openarchives.org/rs/resourcesync#Discovery
  93. Discovery of Capabilities
      Requirements:
      • Need to discover capabilities, i.e. Resource List, Resource Dump, Change List, Change Dump, Archives, Notification channels
      • Need to know the type of capability each document represents.
      Approach:
      • The Source publishes a Capability List that enumerates the capabilities it supports.
      • It does so by pointing at the Resource List, Change List, Resource Dump, etc. using appropriate relation types, e.g. "resourcelist", "changelist", "resourcedump" etc.
      http://www.openarchives.org/rs/resourcesync#CapabilityList
  94. Discovery of Capabilities
  95. Capability List <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="capabilitylist"/> <url> <loc>http://example.com/dataset1/resourcelist.xml</loc> <rs:md capability="resourcelist"/> </url> <url> <loc>http://example.com/dataset1/changelist.xml</loc> <rs:md capability="changelist"/> </url> <url> <loc>http://example.com/dataset1/resourcedump.xml</loc> <rs:md capability="resourcedump"/> </url> </urlset>
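A Destination would typically turn a Capability List into a lookup from capability name to document URI. A minimal Python sketch of that step, using only the standard library; the function name `read_capability_list` is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

def read_capability_list(xml_text):
    """Map each advertised capability name to the URI of its document."""
    root = ET.fromstring(xml_text)
    top = root.find("rs:md", NS)
    if top is None or top.get("capability") != "capabilitylist":
        raise ValueError("not a Capability List")
    capabilities = {}
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is not None and md.get("capability"):
            capabilities[md.get("capability")] = url.find("sm:loc", NS).text.strip()
    return capabilities

# The Capability List from the slide above.
CAPABILITY_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
</urlset>"""
```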
  96. Discovery of Capability Lists Requirements: • Need to discover a Capability List Approaches: • Introduce a link in the HTTP Link header of a resource that is subject to synchronization, pointing at the Capability List with the relation type “resourcesync” • Introduce a link from an HTML document that is subject to synchronization (<head> section), pointing at the Capability List with the relation type “resourcesync” • Link from a Resource List, etc. to the Capability List with the relation type “up” Link header on http://example.com/res1.pdf: Link: <http://example.com/dataset1/capabilitylist.xml>; rel="resourcesync"
  97. Discovery of Capabilities
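The HTTP Link header approach can be sketched from the Destination side as follows. This is a deliberately simplified parser written for illustration (the function name `find_capability_list` is an assumption); it does not handle quoted parameter values that themselves contain commas or semicolons.

```python
import re

def find_capability_list(link_header):
    """Return the target URI of the first link with rel="resourcesync", else None."""
    # Each link-value looks like: <URI>; param="value"; param="value"
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^,;]+)*)', link_header):
        target, params = match.group(1), match.group(2)
        rel = re.search(r'\brel\s*=\s*"?([^";]+)"?', params)
        # A rel value may hold several space-separated relation types.
        if rel and "resourcesync" in rel.group(1).split():
            return target
    return None

# Hypothetical header value combining an unrelated link with the one of interest.
HEADER = ('<http://example.com/res1.html>; rel="alternate"; type="text/html", '
          '<http://example.com/dataset1/capabilitylist.xml>; rel="resourcesync"')
```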
  98. Discovery: Source Description Requirements: • Support for multiple Capability Lists, one per “set of resources” • Need to discover these Capability Lists • Need descriptive information about each set of resources that a Capability List pertains to • Useful to have descriptive information about the Source itself Approach: • The Source Description document meets these requirements. • It should be at a particular location to avoid having registries: http://(hostname)/.well-known/resourcesync • It can be linked to from the Capability Lists as well. http://www.openarchives.org/rs/resourcesync#SourceDesc
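Given the fixed location, a Destination can derive the Source Description URI from any resource URI on the same host. A small Python sketch; the helper name `source_description_uri` is an illustrative assumption.

```python
from urllib.parse import urlsplit, urlunsplit

def source_description_uri(resource_uri):
    """Build the well-known Source Description URI for the host of resource_uri."""
    parts = urlsplit(resource_uri)
    # Keep scheme and host; replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/resourcesync", "", ""))
```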
  99. Discovery of Capabilities
  100. Discovery of Capabilities
  101. Source Description <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="description"/> <rs:ln rel="describedby" href="http://example.com/info_about_source.xml"/> <url> <loc>http://example.com/dataset1/capabilitylist.xml</loc> <rs:md capability="capabilitylist"/> <rs:ln rel="describedby" href="http://example.com/dataset1/info_about_dataset1.xml"/> </url> <url> <loc>http://example.com/dataset2/capabilitylist.xml</loc> <rs:md capability="capabilitylist"/> <rs:ln rel="describedby" href="http://example.com/dataset2/info_about_dataset2.xml"/> </url> </urlset>
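Reading a Source Description amounts to listing the Capability Lists it points at, together with any descriptive document linked via `describedby`. A minimal Python sketch of that, with `list_resource_sets` as an illustrative name.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

def list_resource_sets(xml_text):
    """Return [(capability_list_uri, describedby_uri_or_None), ...]."""
    root = ET.fromstring(xml_text)
    sets = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is None or md.get("capability") != "capabilitylist":
            continue
        described_by = None
        for ln in url.findall("rs:ln", NS):
            if ln.get("rel") == "describedby":
                described_by = ln.get("href")
        sets.append((url.find("sm:loc", NS).text.strip(), described_by))
    return sets

# The Source Description from the slide above.
SOURCE_DESCRIPTION = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="description"/>
  <rs:ln rel="describedby" href="http://example.com/info_about_source.xml"/>
  <url>
    <loc>http://example.com/dataset1/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
    <rs:ln rel="describedby" href="http://example.com/dataset1/info_about_dataset1.xml"/>
  </url>
  <url>
    <loc>http://example.com/dataset2/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
    <rs:ln rel="describedby" href="http://example.com/dataset2/info_about_dataset2.xml"/>
  </url>
</urlset>"""
```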
  102. Discovery via robots.txt • Resource Lists are (enhanced) Sitemaps • Sitemaps can be discovered via robots.txt • Ergo, Resource Lists should be discoverable via robots.txt User-agent: * Disallow: /cgi-bin/ Disallow: /tmp/ Sitemap: http://example.com/dataset1/resourcelist.xml
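Extracting the advertised Resource Lists from a robots.txt body is a matter of collecting its `Sitemap:` lines. A minimal Python sketch; `sitemap_urls` is an illustrative name, and comment stripping on `#` is a simplification that would truncate a URL containing a fragment.

```python
def sitemap_urls(robots_txt):
    """Return the URLs listed on Sitemap: lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and whitespace
        field, sep, value = line.partition(":")   # split on the first colon only
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

# The robots.txt from the slide above.
ROBOTS = """User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
"""
```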
  103. Discovery of Capabilities http://www.openarchives.org/rs/resourcesync#Discovery
  104. Framework Navigation http://www.openarchives.org/rs/resourcesync#Navigation
  105. e.g., Capability List <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="capabilitylist"/> <rs:ln rel="up" href="http://example.com/.well-known/resourcesync"/> <url> <loc>http://example.com/dataset1/resourcelist.xml</loc> <rs:md capability="resourcelist"/> </url> <url> <loc>http://example.com/dataset1/changelist.xml</loc> <rs:md capability="changelist"/> </url> <url> <loc>http://example.com/dataset1/resourcedump.xml</loc> <rs:md capability="resourcedump"/> </url> </urlset>
  106. Framework Structure http://www.openarchives.org/rs/resourcesync#Structure
  107. Framework Structure
  108. ResourceSync - Agenda 4. Framework (Technical) Details 4. Linking to related resources http://www.openarchives.org/rs/resourcesync#LinkRelRes
  109. Supported Linking Use Cases Provide links to related resources to address specific resource synchronization needs. 1. Mirrored content with multiple download locations 2. Alternate representations of the same content 3. Patching content rather than replacing it 4. Resources and metadata about resources 5. Prior versions of resources 6. Collection membership of resources 7. Republishing synchronized resources All cases are handled with a <rs:ln> element referring to the linked resource
  110. Notes about Linked Resources Some important things to keep in mind about linked resources: • They may also be subject to synchronization • They may be updated on a very different schedule than the resources that link to them • Therefore, it is recommended to convey metadata about the linked resource too • Links can be bi-directional – the linked resource can link back to the linking resource
  111. Linking #1 - Mirror 1. Content with multiple download locations This may be of interest for: • Content distribution networks • Mirror sites • Backup locations • Load balancing http://www.openarchives.org/rs/0.9.1/resourcesync#MirCon
  112. Linking #1 - Mirror <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="duplicate" pri="1" href="http://mirror1.example.com/res1"/> <rs:ln rel="duplicate" pri="2" href="http://mirror2.example.com/res1"/> </url> </urlset>
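A Destination can use the `pri` attribute to decide which mirror to try first. The sketch below assumes that a lower `pri` value means a higher priority and that the canonical `<loc>` is a last-resort fallback; both the fallback policy and the name `download_order` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

def download_order(url_element):
    """Return mirror URIs sorted by pri, followed by the canonical <loc>."""
    mirrors = []
    for ln in url_element.findall("rs:ln", NS):
        if ln.get("rel") == "duplicate":
            # Links without pri are tried last (assumed default).
            mirrors.append((int(ln.get("pri", "999999")), ln.get("href")))
    mirrors.sort()
    canonical = url_element.find("sm:loc", NS).text.strip()
    return [href for _, href in mirrors] + [canonical]

# The Change List from the slide above, with the mirrors listed out of order.
MIRROR_CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="duplicate" pri="2" href="http://mirror2.example.com/res1"/>
    <rs:ln rel="duplicate" pri="1" href="http://mirror1.example.com/res1"/>
  </url>
</urlset>"""

FIRST_ENTRY = ET.fromstring(MIRROR_CHANGE_LIST).find("sm:url", NS)
```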
  113. Linking #2 – Alternate Representations 2. Alternate representations of the same content This may be of interest for: • Resources subject to HTTP content negotiation • Format migration for preservation reasons • Different clients wanting different formats • Multiple languages of the content http://www.openarchives.org/rs/resourcesync#AltRep
  114. Linking #2 – Alternate Representations <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="alternate" type="text/html" href="http://example.com/res1.html"/> <rs:ln rel="alternate" type="application/pdf" href="http://example.com/res1.pdf"/> </url> </urlset>
  115. Linking #2 – Alternate Representations <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1.html</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="canonical" href="http://example.com/res1"/> </url> </urlset>
  116. Linking #3 – Patching Content 3. Patching content rather than replacing it This may be of interest when: • Resources are very large and server wishes to conserve bandwidth where possible • Changes are frequent and small • Changes are managed in a CMS that tracks differences Need: • Machine processable format to describe a change in a manner that allows patching a representation • Existing or newly defined by communities http://www.openarchives.org/rs/resourcesync#PatchCon
  117. Linking #3 – Patching Content <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1.json</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated" length="398723"/> <rs:ln rel="http://www.openarchives.org/rs/terms/patch" type="application/json-patch" modified="2013-01-02T17:00:00Z" length="58" href="http://example.com/res1-patch.json"/> </url> </urlset>
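On the Destination side, a patch link is only useful when a prior copy of the resource is already held. A minimal Python sketch of that decision; the function name `choose_download` and the policy of always preferring an available patch are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}
PATCH_REL = "http://www.openarchives.org/rs/terms/patch"

def choose_download(url_xml, have_previous_copy):
    """Return ('patch', href) when a patch can be applied, else ('full', loc)."""
    url = ET.fromstring(url_xml)
    loc = url.find("sm:loc", NS).text.strip()
    if have_previous_copy:
        for ln in url.findall("rs:ln", NS):
            if ln.get("rel") == PATCH_REL:
                return ("patch", ln.get("href"))
    # Without a local copy there is nothing to patch: fetch the full resource.
    return ("full", loc)

# The <url> entry from the slide above, made standalone with namespace declarations.
PATCH_ENTRY = """<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
     xmlns:rs="http://www.openarchives.org/rs/terms/">
  <loc>http://example.com/res1.json</loc>
  <lastmod>2013-01-02T13:00:00Z</lastmod>
  <rs:md change="updated" length="398723"/>
  <rs:ln rel="http://www.openarchives.org/rs/terms/patch"
         type="application/json-patch"
         modified="2013-01-02T17:00:00Z"
         length="58"
         href="http://example.com/res1-patch.json"/>
</url>"""
```

In the slide's example the patch is 58 bytes against 398723 bytes for the full resource, which is the bandwidth saving this use case is after.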
  118. Linking #4 – Metadata about Resources 4. Resources and metadata about resources This may be of interest when: • Resources have associated descriptive metadata records, which are useful for understanding the resource • Such as cultural heritage images, audio, video • Resources that have associated technical, administrative, rights metadata http://www.openarchives.org/rs/resourcesync#ResMDLinking
  119. Linking #4 – Metadata about Resources <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="describedby" type="application/xml" href="http://example.com/metadata/res1.xml"/> </url> </urlset>
  120. Linking #4 – Metadata about Resources <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/metadata/res1.xml</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="describes" type="text/html" href="http://example.com/res1"/> </url> </urlset>
  121. Linking #5 – Prior Versions of Resources This may be of interest when: • A Destination needs to have a copy of all versions of a resource http://www.openarchives.org/rs/resourcesync#ResVers
  122. Memento Intermezzo http://www.mementoweb.org/
  123. URI for Original, URI for Version Web Archive URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/ URI-R - http://www.cnn.com/
  124. URI for Original, URI for Version CMS URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 URI-R - http://en.wikipedia.org/wiki/September_11_attacks
  125. Memento Time Travel extension for Chrome Download extension at http://bit.ly/memento-for-chrome
  126. Linking #5 – Prior Versions of Resources <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="memento" href="http://example.com/past/20130102130000/res1"/> <rs:ln rel="timegate" href="http://example.com/timegate/res1"/> <rs:ln rel="timemap" href="http://example.com/timemap/res1" type="application/link-format"/> </url> </urlset>
  127. Linking #6 – Collection Membership 6. Collection membership of resources This may be of interest when: • Resources are part of OAI-ORE aggregations • Resources are part of OAI-PMH sets • To indicate any other type of collections of resources Collections are named with URIs and can then be linked to with rel=“collection” • Nice if the collection URI resolves to a useful description http://www.openarchives.org/rs/resourcesync#ColMem
  128. Linking #6 – Collection Membership <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="collection" href="http://example.com/aggregation/allres"/> </url> </urlset>
  129. Linking #7 – Republishing Resources 7. Republishing synchronized resources This may be of interest when: • Aggregator systems harvest resources from Sources and then republish them at new URIs Examples include blog republishing, content distribution networks, mirrored or combined collections Hypothetical scenario: Lots of little museums with small collections, and a large European/American aggregating digital library system that wants to provide fast, combined access to the content (with permission) http://www.openarchives.org/rs/resourcesync#RePub
  130. Linking #7 – Republishing Resources #1 • Original Source publishes information about a changed resource via a Change List <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-03T00:00:00Z"/> <url> <loc>http://original.example.com/res1</loc> <lastmod>2013-01-03T07:00:00Z</lastmod> <rs:md change="updated"/> </url> </urlset>
  131. Linking #7 – Republishing Resources #2 • Aggregator 1 republishes information about the changed resource with reference to the original Source <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-03T11:00:00Z"/> <url> <loc>http://aggregator1.example.com/res1</loc> <lastmod>2013-01-03T20:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="via" modified="2013-01-03T07:00:00Z" href="http://original.example.com/res1"/> </url> </urlset>
  132. Linking #7 – Republishing Resources #3 • Aggregator 2 does the same • Take care when republishing links: make sure they are still appropriate from the aggregator’s perspective <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" from="2013-01-03T12:00:00Z"/> <url> <loc>http://aggregator2.example.com/res1</loc> <lastmod>2013-01-04T09:00:00Z</lastmod> <rs:md change="updated"/> <rs:ln rel="via" modified="2013-01-03T07:00:00Z" href="http://original.example.com/res1"/> </url> </urlset>
  133. ResourceSync - Agenda 4. Framework (Technical) Details 1. Sitemaps 2. Core synchronization capabilities (PULL) 3. Discovery 4. Linking to related resources 5. Notification Capabilities (PUSH) 6. Archival capabilities (ARCHIVES) http://www.openarchives.org/rs/notification
  134. Motivation for Notifications • Reduce synchronization latency by having the Source push out resource change information • To avoid continuous pull of Change Lists by Destinations • Share information about changes to the Source’s ResourceSync implementation, e.g. announcement of new Resource List, new Capability List, etc. • To avoid continuous polling of e.g. Resource Lists, Source Description
  135. Source: Notification Capabilities (PUSH) 1. Change Notification • Notifies about changes to particular resources • e.g., resource A has been updated | created | deleted 2. Framework Notification • Notifies about changes to capabilities, i.e., their documents • e.g., a Change List has been updated | created | deleted • Also for Capability Lists and Source Description
  136. Notification Channels • Notifications are sent via channels • Change Notification: one channel per set of resources • Framework Notification: one channel per set of resources • Sent on level of capability document, not on index-level • Notifications about changes to Source Description sent on all Framework Notification channels • Payload for notifications: <urlset> documents • Transport protocol for notifications: • PubSubHubbub – https://pubsubhubbub.googlecode.com/git/pubsubhubbub-core-0.4.html – current choice • WebSockets – http://tools.ietf.org/html/rfc6455 – may be added later
  138. Framework Notification Structure
  139. Framework Notification Structure
  140. Change Notification Payload <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T09:07:00Z</lastmod> <rs:md change="updated" hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/> </url> <url> … </url> </urlset>
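After fetching a resource announced in a notification, a Destination can check the bytes against the `hash` and `length` attributes of `<rs:md>`. A minimal Python sketch, assuming an md5 digest is present (possibly alongside other space-separated digests); the name `matches_metadata` is illustrative.

```python
import hashlib

def matches_metadata(content, md_hash, md_length):
    """Check downloaded bytes against the hash and length attributes of <rs:md>."""
    if len(content) != int(md_length):
        return False
    # The hash attribute is treated as space-separated "algorithm:digest" tokens.
    for token in md_hash.split():
        algorithm, _, digest = token.partition(":")
        if algorithm == "md5":
            return hashlib.md5(content).hexdigest() == digest.lower()
    raise ValueError("no md5 digest in hash attribute")
```

Checking the length first avoids hashing content that is already known to be wrong.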
  141. Framework Notification Payload <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <url> <loc>http://example.com/resourceset1/resourcelist.xml</loc> <rs:md change="created" capability="resourcelist"/> </url> <url> <loc>http://example.com/resourceset1/resourcedump.xml</loc> <rs:md change="created" capability="resourcedump"/> </url> </urlset>
  142. Framework Notification Payload (w/ index) <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <url> <loc>http://example.com/resourceset1/resourcelist.xml</loc> <rs:md change="created" capability="resourcelist"/> <rs:ln rel="index" href="http://example.com/dataset1/resourcelist-index.xml"/> </url> </urlset>
  143. Framework Notification Discovery
  144. ResourceSync - Agenda 4. Framework (Technical) Details 1. Sitemaps 2. Core synchronization capabilities (PULL) 3. Discovery 4. Linking to related resources 5. Notification Capabilities (PUSH) 6. Archival capabilities (ARCHIVES) http://www.openarchives.org/rs/archives
  145. Source: Archival Capabilities (ARCHIVES) The Source may hold on to historical data, for example, to allow Destinations to catch up with events they missed or revisit prior resource states. To this end, the Source can publish archives, i.e. documents that enumerate historical capability documents: 1. Resource List Archive 2. Resource Dump Archive 3. Change List Archive 4. Change Dump Archive
  146. Resource List Archive http://www.openarchives.org/rs/archives#ResourceListArch
  147. Resource List Archive <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcelist-archive" at="2013-01-09T13:00:00Z"/> <url> <loc>http://example.com/resourcelist1.xml</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> </url> <url> <loc>http://example.com/resourcelist2.xml</loc> <lastmod>2013-01-09T13:00:00Z</lastmod> </url> <url> … </url> </urlset>
  148. Resource Dump Archive http://www.openarchives.org/rs/archives#ResourceDumpArch
  149. Resource Dump Archive <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcedump-archive" at="2013-02-10T03:00:00Z"/> <url> <loc>http://example.com/resourcedump1.xml</loc> <lastmod>2013-01-10T03:00:00Z</lastmod> </url> <url> <loc>http://example.com/resourcedump2.xml</loc> <lastmod>2013-02-10T03:00:00Z</lastmod> </url> <url> … </url> </urlset>
  150. Change List Archive http://www.openarchives.org/rs/archives#ChangeListArch
  151. Change List Archive <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist-archive" from="2013-02-01T23:00:00Z" until="2013-02-03T23:00:00Z"/> <url> <loc>http://example.com/changelist1.xml</loc> <lastmod>2013-02-01T23:00:00Z</lastmod> </url> <url> <loc>http://example.com/changelist2.xml</loc> <lastmod>2013-02-02T23:00:00Z</lastmod> </url> <url> <loc>http://example.com/changelist3.xml</loc> <lastmod>2013-02-03T23:00:00Z</lastmod> </url> </urlset>
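A Destination that was offline for a while can use the Change List Archive to find the Change Lists it still has to process. The sketch below assumes that a Change List whose `<lastmod>` is later than the Destination's last synchronization time may hold unseen changes, and that all timestamps are UTC in the exact form used on the slides; `pending_change_lists` is an illustrative name.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

def parse_utc(timestamp):
    """Parse timestamps of the form 2013-02-01T23:00:00Z (assumed format)."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")

def pending_change_lists(archive_xml, last_sync):
    """Return URIs of Change Lists modified after last_sync, oldest first."""
    cutoff = parse_utc(last_sync)
    root = ET.fromstring(archive_xml)
    pending = []
    for url in root.findall("sm:url", NS):
        modified = parse_utc(url.find("sm:lastmod", NS).text.strip())
        if modified > cutoff:
            pending.append((modified, url.find("sm:loc", NS).text.strip()))
    pending.sort()
    return [loc for _, loc in pending]

# The Change List Archive from the slide above.
CHANGE_LIST_ARCHIVE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist-archive"
         from="2013-02-01T23:00:00Z" until="2013-02-03T23:00:00Z"/>
  <url>
    <loc>http://example.com/changelist1.xml</loc>
    <lastmod>2013-02-01T23:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changelist2.xml</loc>
    <lastmod>2013-02-02T23:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changelist3.xml</loc>
    <lastmod>2013-02-03T23:00:00Z</lastmod>
  </url>
</urlset>"""
```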
  152. Change Dump Archive http://www.openarchives.org/rs/archives#ChangeDumpArch
  153. Change Dump Archive <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changedump-archive" from="2013-02-10T03:00:00Z" until="2013-02-17T03:00:00Z"/> <url> <loc>http://example.com/changedump1.xml</loc> <lastmod>2013-02-10T03:00:00Z</lastmod> </url> <url> <loc>http://example.com/changedump2.xml</loc> <lastmod>2013-02-17T03:00:00Z</lastmod> </url> <url> … </url> </urlset>
  154. Capability List for Archives <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="capabilitylist"/> <url> <loc>http://example.com/dataset1/resourcelist.xml</loc> <rs:md capability="resourcelist"/> </url> … <url> <loc>http://example.com/dataset1/resourcelist-archive.xml</loc> <rs:md capability="resourcelist-archive"/> </url> <url> <loc>http://example.com/dataset1/changelist-archive.xml</loc> <rs:md capability="changelist-archive"/> </url> </urlset>
  155. ResourceSync Framework with Archives
  156. ResourceSync - Agenda 5. Implementation
  157. Implementation #1: The Metadata Harvesting Use Case
  158. The Metadata Harvesting Use Case 1. Identification of metadata records within a service 2. Use of standards in metadata formats 3. Incremental updates 4. Create, Update, Delete 5. Sets
  159. The Metadata Harvesting Use Case 1. Identification of metadata records within a service ResourceSync does not specifically care about metadata records, only resources. It is up to the server to identify which of those resources are metadata. 2. Use of standards in metadata formats We are free to annotate a resource's entry with appropriate metadata to indicate the format.
  160. The Metadata Harvesting Use Case 3. Incremental updates ResourceSync publishes changes as static documents. The client is then free to walk up and down the change lists provided by the server. 4. Create, Update, Delete All resources that can be obtained from a change list will be annotated with the kind of change that happened to them. 5. Sets ResourceSync allows the server to publish lists of resources and changes and indexes of those lists all annotated with metadata.
  161. (Required) Documents for metadata harvesting use case
  162. Describing Metadata Resources <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcelist" from="2013-05-05T13:00:00Z"/> <url> <loc>http://mydspace.edu/dspace-rs/resource/123456789/7/qdc</loc> <lastmod>2013-05-01T19:09:35Z</lastmod> <changefreq>never</changefreq> <rs:md type="application/xml"/> <rs:ln href="http://mydspace.edu/bitstream/123456789/7/1/bitstream.pdf" rel="describes"/> <rs:ln href="http://mydspace.edu/bitstream/123456789/7/2/image.jpg" rel="describes"/> <rs:ln href="http://mydspace.edu/123456789/3" rel="collection"/> </url> </urlset>
  163. Describing Bitstream Resources <urlset … <url> <loc>http://mydspace.edu/bitstream/123456789/7/1/bitstream.pdf</loc> <lastmod>2013-05-01T19:09:35Z</lastmod> <changefreq>never</changefreq> <rs:md hash="md5:75d0ea94097a05fce9aca5b079e2f209" length="419805" type="application/pdf"/> <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/7/qdc" rel="describedby"/> <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/7/mets" rel="describedby"/> <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/12/qdc" rel="describedby"/> <rs:ln href="http://mydspace.edu/123456789/2" rel="collection"/> </url> </urlset>
  164. Serving Metadata Resources http://mydspace.edu/dspace-rs/resource/123456789/7/qdc (URL components: ResourceSync webapp, item handle, metadata format) metadata.formats = qdc = http://purl.org/dc/terms/, mets = http://www.loc.gov/METS/ metadata.types = qdc = application/xml, mets = application/xml <loc>http://mydspace.edu/dspace-rs/resource/123456789/7/qdc</loc> <rs:md type="application/xml"/> <rs:ln href="http://purl.org/dc/terms/" rel="describedby"/> <loc>http://mydspace.edu/dspace-rs/resource/123456789/7/mets</loc> <rs:md type="application/xml"/> <rs:ln href="http://www.loc.gov/METS/" rel="describedby"/>
  165. Generating Documents 1. Initialise Creates initial Capability List and Resource List documents [dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -i 2. Update Creates a new Change List which covers the period since the last Change List was created [dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -u 3. Rebase A combination of both Initialise and Update. [dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -r
  166. Usage of Resources by clients
  167. Impact on DSpace
  168. URLs • Stable identifiers for archived items • Stable identifiers for unarchived items • Stable identifiers for metadata resources (in their various formats) • Stable identifiers for previous versions ? Provenance • History of changes to an item/bitstream • Item/bitstream deletions (vs withdraw) • Bitstream create/update dates • Item create/update dates
  169. Versioning • Access of previous versions of both metadata and bitstreams ? • Stable identifiers for previous versions of both metadata and bitstreams ? Metadata Resources • Metadata in a variety of formats • Metadata as file/bitstream
  170. Admin Files • ResourceSync documents (Resource Lists, Change Lists, etc) • ResourceSync exports - Resource Dumps, Change Dumps • Metadata exports in a number of formats Scheduled Tasks • Regular generation of RS documents Complex Objects • Item/bitstream relationships • Collections of content
  171. Get the software! DSpace module: https://github.com/CottageLabs/DSpaceResourceSync depends on the common Java library: https://github.com/CottageLabs/ResourceSyncJava PHP client: https://github.com/stuartlewis/resync-php depends on the SWORDv2 client library: https://github.com/swordapp/swordappv2-php-library/
  172. Implementation #2: ResourceSync at arXiv.org
  173. ResourceSync @ arXiv • Use ResourceSync for both mirroring and public data access o efficient updates o ability to do periodic audits o public synchronization capability o reduce admin burden • Likely start with metadata + source for mirroring use case (doing experiments now) • Open access use cases also require processed PDF • Some concerns about likely use/load…
  175. Alternate download location • Likely want to separate machine accesses from human accesses to preserve response time on main server => Use Mirrored Content part of spec o <loc> specifies canonical URI - e.g. http://arxiv.org/pdf/1306.1073v1.pdf o <rs:ln rel="duplicate"> specifies preferred download location - e.g. http://export.arxiv.org/pdf/1306.1073v1.pdf
  176. Alternate download location <url> <loc>http://arxiv.org/pdf/1306.1073v1.pdf</loc> <lastmod>2013-06-06T00:57:12Z</lastmod> <rs:md hash="md5:e08e0c4e4d7b0895120014f0aa09e7c4" length="287714" type="application/pdf"/> <rs:ln rel="duplicate" pri="1" href="http://export.arxiv.org/pdf/1306.1073v1.pdf" modified="2013-06-06T02:00:59Z"/> </url>
  177. Getting a copy of arXiv It might be as easy as: (of course, you probably have to wait a while but it is nice to know ResourceSync is stateless so one can efficiently restart)
  178. Python Library and Client • Aim to provide library code implementing all ResourceSync facilities for use in both source and destination implementations o o Designed for python 2.6 (RHEL6) and 2.7 Will not work with python <= 2.5 • Client (resync) supports many destination operations, inspired by the common Unix rsync program • Client also supports some operations that might be useful in a source, such as generation of static Resource Lists, or periodic Change Lists (used in arXiv experiments) • Explorer (resync-explorer) intended to allow easy inspection of a source’s resource sets and capabilities • Developed since ResourceSync v0.5, updated for v0.9 http://github.org/resync/resync ResourceSync Tutorial DANS, January 21 2014, Den Haag, Netherlands
  179. ResourceSync Source Simulator • Python code using Tornado server • Provides random set of resources of different sizes updated at a particular rate • Very useful for testing Destination code http://github.com/resync/simulator
  180. ResourceSync - Agenda 6. Q&A
  184. ResourceSync: A Web-Based Resource Synchronization Framework #resourcesync ResourceSync is funded by The Sloan Foundation & JISC

Editor's Notes

  1. LANL Memento Aggregator of IIPC; Europeana does metadata via OAI-PMH but anticipate content also; arXiv – mirroring and data sharing; Linked data @ BBC; DBpedia, journal data at LANL. REST not about in 1999
  2. XML <-> OAI-PMH. Large data begs a different question
  3. protected mostly about existing HTTP auth methods, stats -> just inventory
  4. Switching to a standardized resource-centric framework could
  5. Semantic web version of Wikipedia; want mirror to provide reliable basis for local services
  6. Semantic web version of Wikipedia; want mirror to provide reliable basis for local services
  7. Semantic web version of Wikipedia; want mirror to provide reliable basis for local services
  8. Semantic web version of Wikipedia; want mirror to provide reliable basis for local services
  9. Top line – just metadata about resources, destination uses GET to get them (duh). Bottom line – packaged content => fewer round trips
  10. Rsync etc. just reference; push vs pull -> both; many other parts
  11. Rsync etc. just reference; push vs pull -> both; many other parts
  12. Add: rel="contents", rel="archives"
  13. What they have in common: versions exist at different URIs, because only the representation of a single state of a resource is available from a URI.
  14. What they have in common: versions exist at different URIs, because only the representation of a single state of a resource is available from a URI.
  15. Pattern exists in, e.g., Wikipedia, W3C specs, Dryad. Not sure whether DOI in general follows this paradigm.
  16. Now the question is “How do we access those versions?” - Can interlink them. There are RFCs that describe how to do that. - But that URI-R is special. It is what typically is being bookmarked, put in email. Want to leverage the fact that this URI-R is always there. Use it as the entry point.
  17. Memento addresses the problem in a resource-centric way: Resource, URI, state, representation, link, content negotiation
  18. Test site, has subsets of arXiv and even complete source plus metadata (at present not up to date with 0.9)
  19. No way around the difficulty of transferring 1TB initially but then a daily or weekly sync is efficient, and it still works even after some arbitrary time.