ResourceSync: Web-based Resource Synchronization

Simeon Warner (Cornell)


Open Repositories 2012, Edinburgh, 11 July 2012
Core team – Todd Carpenter (NISO), Bernhard Haslhofer (Cornell
University), Martin Klein (Los Alamos National Laboratory), Nettie
Lagace (NISO), Carl Lagoze (Cornell University), Peter Murray
(NISO), Michael L. Nelson (Old Dominion University), Robert
Sanderson (Los Alamos National Laboratory), Herbert Van de
Sompel (Los Alamos National Laboratory), Simeon Warner (Cornell
University)

Team members – Richard Jones (JISC/Cottage Labs), Stuart
Lewis (JISC/Cottage Labs), Graham Klyne (JISC), Shlomo Sanders
(Ex Libris), Kevin Ford (LoC), Ed Summers (LoC), Jeff Young
(OCLC), David Rosenthal (Stanford)

Funding – The Sloan Foundation (core team) and the JISC (UK
participation)

Thanks for slides from – Stuart Lewis, Herbert Van de Sompel
Synchronize what?
•  Web resources – things with a URI that can be
   dereferenced and are cache-able (no dependency on
   underlying OS, technologies etc.)

•  Small websites/repositories (a few resources) to
   large repositories/datasets/linked data collections
   (many millions of resources)

•  That change slowly (weeks/months) or quickly
   (seconds), and where latency needs may vary

•  Focus on needs of research communication and
   cultural heritage organizations, but aim for generality

Why?
… because lots of projects and services are
doing synchronization but have to roll their
own on a case by case basis!


•  Project team involved with projects that need this

•  Experience with OAI-PMH: widely used in repos but
   o    XML metadata only
   o    Web technology has moved on since 1999

•  Data / Metadata / Linked Data – Shared solution?
Use cases – the basics




More use cases




Out-of-scope (for now)
•  Bidirectional synchronization

•  Destination-defined selective synchronization (query)

•  Special understanding of complex objects

•  Bulk URI migration

•  Diffs (hooks?)

•  Intra-application event notification

•  Content tracking
Use case: DBpedia Live duplication
•  20M entries updated @ 1/s though sporadic

•  Want low latency => need a push technology
Use case: arXiv mirroring
•  1M article versions, ~800/day created
   or updated at 8pm US eastern time

•  Metadata and full-text for each article

•  Accuracy important

•  Want low barrier for others to use
•  Look for more general solution than current
   homebrew mirroring (running with minor
   modifications since 1994!) and occasional
   rsync (filesystem layout specific, auth issues)
Terminology
•  Resource: an object to be synchronized, a web resource

•  Source: system with the original or master resources

•  Destination: system to which resources from the source will be
   copied and kept in synchronization

•  Pull: process to get information from source to destination
   initiated by the destination.

•  Push: process to get information from source to destination
   initiated by the source (and some subscription mechanism)

•  Metadata: information about resources such as URI,
   modification time, checksum, etc. (Not to be confused with
   resources that may themselves be metadata records)
Three basic needs
1.  Baseline synchronization – A destination must be
    able to perform an initial load or catch-up with a
    source
     -     avoid out-of-band setup; provide discovery

2.  Incremental synchronization – A destination must
     have some way to keep up-to-date with changes at a
     source
     -    subject to some latency; minimal: create/update/delete

3.  Audit – It should be possible to determine whether a
    destination is synchronized with a source
     -    subject to some latency; want efficiency > HTTP HEAD
Baseline synchronization
Either

•  Get inventory of resources and then copy them one-
   by-one using HTTP GET
     o    simplest, inventory is list of resources plus perhaps metadata
     o    inventory format?

or

•  Get dump of resources and all necessary metadata
     o    more efficient: reduce number of round trips
     o    dump format?
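
The inventory option above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of any ResourceSync specification: the function names, the inventory shape (a plain list of URIs) and the injectable `fetch` callable are all invented for the example.

```python
from urllib.request import urlopen


def http_get(uri):
    """Fetch one resource with HTTP GET and return its bytes."""
    with urlopen(uri) as response:
        return response.read()


def baseline_sync(inventory, fetch=http_get):
    """Copy every resource listed in the inventory, one by one.

    `inventory` is an iterable of URIs (hypothetical shape for this sketch).
    Returns a dict mapping URI -> content, standing in for the destination.
    """
    destination = {}
    for uri in inventory:
        destination[uri] = fetch(uri)  # one round trip per resource
    return destination
```

Passing a different `fetch` makes it possible to try the loop without network access; the one-request-per-resource cost is what the dump option aims to reduce.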
Audit
Could do new Baseline synchronization and compare …
but likely very inefficient! Optimize by adding:

•  Get inventory and compare with copy at destination
   o    use timestamp, digest or other metadata in inventory to
        check content (effort ↔ accuracy tradeoff)
   o    latency depends on freshness of inventory and time to copy
        and check (easier to cope with if modification times included
        in metadata)
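
A possible shape for the inventory comparison, sketched in Python. The `audit` function and the representation of an inventory as a dict from URI to digest (or timestamp) are assumptions made for illustration only.

```python
def audit(source_inventory, destination_inventory):
    """Compare two inventories mapping URI -> digest (or timestamp).

    Returns three sorted lists: URIs missing at the destination,
    URIs whose metadata differs, and URIs no longer at the source.
    """
    src, dst = source_inventory, destination_inventory
    created = sorted(uri for uri in src if uri not in dst)
    updated = sorted(uri for uri in src if uri in dst and src[uri] != dst[uri])
    deleted = sorted(uri for uri in dst if uri not in src)
    return created, updated, deleted
```

Comparing digests is more accurate than comparing timestamps but costs more to compute, which is the tradeoff noted in the slide.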
Incremental synchronization
Simplest method is Audit and then copy of all new/
updated resources, plus removal of deleted resources.
Optimize by adding:

•  Change Communication – Exchange ChangeSet
   listing only updates
      -  How to understand sequence, schedule?

•  Resource Transfer – Exchange dumps for
   ChangeSets or even diffs appropriate to resource type

Change Memory necessary to record sequence or
intermediate states.
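
As a rough Python sketch of consuming a ChangeSet at the destination: the event encoding as `(event_type, uri)` tuples, the `apply_changeset` name and the `fetch` callable are hypothetical, chosen only to show why event order matters.

```python
def apply_changeset(destination, changeset, fetch):
    """Apply change events, in order, to a destination store (dict URI -> content).

    Each event is an (event_type, uri) tuple with event_type one of
    "create", "update" or "delete" (hypothetical encoding for this sketch).
    """
    for event_type, uri in changeset:
        if event_type in ("create", "update"):
            destination[uri] = fetch(uri)
        elif event_type == "delete":
            destination.pop(uri, None)  # tolerate already-missing resources
        else:
            raise ValueError(f"unknown event type: {event_type!r}")
    return destination
```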
Template to map approaches




Approaches and technologies
[Figure: technologies placed on the template, which carries Push and Pull labels: DSNotify, OAI-PMH, rsync, Crawl, OAI-ORE, RDFsync, WebDAV Col. Syn., XMPP, Atom, SWORD, AtomPub, Sitemap, RSS, SPARQLpush, PubSubHubbub, SDShare]
A framework based on Sitemaps
•  Modular framework allowing selective deployment
•  Sitemap is the most basic component of the
   framework
•  Reuse Sitemap form for changesets and notifications
   (same <url> element describing resource)
•  Selective synchronization via tagging
•  Discovery of capabilities via <atom:link>!
•  Further extension possible



Baseline Sync with Inventory




Level zero → Publish a Sitemap
•  Periodic publication of an up-to-date Sitemap is
   base level implementation

•  Use Sitemap <url> as is with <loc> and
   <lastmod> as core elements for each Resource
   o    Introduce optional extra elements to convey fixity information,
        size, tags for selective synchronization, etc.

•  Extend to:
   o    Convey Source capabilities, discovery information, locations of
        dumps, locations of changesets, change memory, etc.
   o    Provide timestamp and/or additional metadata for the
        Sitemap
Two resources, with lastmod times
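
The slide image is not preserved in this transcript. As a stand-in, the Python sketch below generates a plain Sitemap listing two resources with `<loc>` and `<lastmod>`. The example URIs and timestamps are invented; only the Sitemap namespace and element names come from the Sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(resources):
    """Build a Sitemap <urlset> from (loc, lastmod) pairs and return XML text."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default namespace
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical example resources
sitemap_xml = build_sitemap([
    ("http://example.org/res1", "2012-07-01T12:00:00Z"),
    ("http://example.org/res2", "2012-07-10T08:30:00Z"),
])
print(sitemap_xml)
```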
Two resources, with lastmod times, sizes and digests; the second also has a tag
Sitemap details & issues
•  Sitemap XML format designed to allow extension

•  ResourceSync additions:
   o    Additional core elements in ResourceSync namespace
        (digest, size, update information)
   o    Discovery information using <atom:link> elements

•  Use existing Sitemap Index scheme for large sets of
   resources (handles up to 2.5 billion resources before
   further extension required)

•  Provide mapping to RDF semantics but keep XML
   simple
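
The 2.5 billion figure follows from the Sitemap protocol limits of 50,000 URLs per Sitemap and 50,000 Sitemaps per Sitemap Index. A short Python check of that arithmetic (the constant names are chosen for this example):

```python
MAX_URLS_PER_SITEMAP = 50_000    # Sitemap protocol limit per Sitemap file
MAX_SITEMAPS_PER_INDEX = 50_000  # Sitemap protocol limit per Sitemap Index

# One Sitemap Index pointing at full Sitemaps
capacity = MAX_URLS_PER_SITEMAP * MAX_SITEMAPS_PER_INDEX
print(f"{capacity:,}")  # 2,500,000,000
```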


Incremental Sync with ChangeSet




ChangeSet
•  Reuse Sitemap format but include information only for change
   events over a certain period:
    •  One <url> element per change event
    •  The <url> element uses <loc> and <lastmod> as is and
       is extended with:
        •  an event type to express create/update/delete
        •  an optional event id to provide a unique identifier for the
            event.
        •  can further extend to include fixity, tag info, Memento
            TimeGate link, special-purpose access-point, etc.
    •  Introduce minimal <urlset>-level extensions to support:
        •  Navigation between ChangeSets via <atom:link>
        •  Timestamping the ChangeSet



Expt: arXiv – Inventory and ChangeSet
 •  Baseline synchronization and Audit (Inventory):
    o    2.3M resources (300GB content)
    o    46 sitemaps and 1 sitemapindex (50k resources/sitemap)
    o    sitemaps ~9.3MB each -> 430MB total uncompressed; 1.7MB
         each -> 78MB total if gzipped (<0.03% content size)

 •  Incremental synchronization (ChangeSet):
    o    arXiv has updates daily @ 8pm so create daily ChangeSet
    o    ~1k additions and 700 updates per day
    o    1 sitemap ~300kB or 20kB gzipped, can be generated and
         served statically
    o    keep chain of ChangeSets, link with <atom:link>
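
The inventory figures above can be reproduced with a few lines of Python. The variable names are chosen for this example, and decimal units (1 GB = 1000 MB) are assumed.

```python
import math

resources = 2_300_000
per_sitemap = 50_000
sitemaps = math.ceil(resources / per_sitemap)  # 46

raw_total_mb = sitemaps * 9.3    # uncompressed, about 430 MB
gzip_total_mb = sitemaps * 1.7   # gzipped, about 78 MB
content_mb = 300 * 1000          # 300 GB of content, decimal units assumed

# Size of the gzipped inventory relative to the content it describes
overhead_percent = 100 * gzip_total_mb / content_mb
print(sitemaps, round(raw_total_mb), round(gzip_total_mb), overhead_percent < 0.03)
```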
Incremental Sync with Push via XMPP




Change Communication: Push via XMPP
  •  Rapid notification of change events via XMPP
     PubSub node; one notification per event
  •  Each change event is conveyed using a Sitemap
     <url> element contained in a dedicated XMPP
     <item> wrapper
  •  Use same resource metadata (e.g. <loc>,
     <lastmod>) and same extensions as with
     changesets
  •  Multiple change events can be grouped into a single
     XMPP message (using <items>)
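
A sketch of how such a notification payload might be assembled in Python. It only builds the XML; sending it over XMPP is out of scope. The PubSub event namespace is the one defined in XEP-0060, while the `build_notification` function, the node name and the tuple shape of `changes` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
PUBSUB_EVENT = "http://jabber.org/protocol/pubsub#event"  # XEP-0060


def build_notification(node, changes):
    """Wrap Sitemap <url> elements in XMPP PubSub <items>/<item> elements.

    `changes` is a list of (item_id, loc, lastmod) tuples (hypothetical shape).
    Returns the <event> payload as an Element.
    """
    event = ET.Element(f"{{{PUBSUB_EVENT}}}event")
    items = ET.SubElement(event, f"{{{PUBSUB_EVENT}}}items", {"node": node})
    for item_id, loc, lastmod in changes:
        # One <item> wrapper per change event
        item = ET.SubElement(items, f"{{{PUBSUB_EVENT}}}item", {"id": item_id})
        url = ET.SubElement(item, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = loc
        ET.SubElement(url, f"{{{SM}}}lastmod").text = lastmod
    return event
```

Passing several tuples in `changes` corresponds to grouping multiple change events under a single `<items>` element.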
Expt: LiveDBpedia with XMPP Push
•  LANL Research Library ran a significant scale
   experiment in synchronization of the LiveDBpedia
   database from Los Alamos to two remote sites using
   XMPP to push change notifications
   o    Push for change communication only, content then obtained
        with HTTP GET

•  Destination sites were able to keep in close
   synchronization with sources
   o    Maximum queued updates <400 over 6 runs with 100k
        updates; and bursty updates averaging ~1/s
   o    Small number of errors suggests use for audit in many real-
        life situations
Dumps
Optimization over making repeated HTTP GET requests
for multiple resources. Use for baseline and changeset.
Options:

1.  ZIP+Sitemap
  o    simple and ZIP very widely used
  o    consistent inventory/changeset format
  o    con: “custom”

2.  WARC
  o    designed for exactly this purpose
  o    con: little used outside web archiving community
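
The ZIP+Sitemap option could look roughly like this in Python. The `build_dump` function and the `manifest.xml` file name are hypothetical; the slides do not define the dump layout.

```python
import io
import zipfile


def build_dump(resources, manifest_xml):
    """Pack resources plus a Sitemap-style manifest into an in-memory ZIP.

    `resources` maps archive path -> bytes; "manifest.xml" is a hypothetical
    name for the inventory or changeset describing the packed resources.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.xml", manifest_xml)
        for path, content in resources.items():
            archive.writestr(path, content)
    return buffer.getvalue()
```

A destination would then need a single HTTP GET for the dump instead of one request per resource.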
Sitemaps + XMPP + Dumps




Timeline and input
•  July 2012 – First draft of sitemap-based spec (SOON)

•  August 2012 – Publicize and solicit feedback (there will
   be a NISO email list)

•  September 2012 – Revise, more experiments, more
   feedback

•  December 2012 – Finalize specification (?)



•  NISO webspace

•  Code on github: http://github.org/resync/simulator
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 

Recently uploaded (20)

PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

ResourceSync: Web-based Resource Synchronization

  • 1. ResourceSync: Web-based Resource Synchronization Simeon Warner (Cornell) Open Repositories 2012, Edinburgh, 11 July 2012
  • 2. Core team – Todd Carpenter (NISO), Bernhard Haslhofer (Cornell University), Martin Klein (Los Alamos National Laboratory), Nettie Lagace (NISO), Carl Lagoze (Cornell University), Peter Murray (NISO), Michael L. Nelson (Old Dominion University), Robert Sanderson (Los Alamos National Laboratory), Herbert Van de Sompel (Los Alamos National Laboratory), Simeon Warner (Cornell University) Team members – Richard Jones (JISC/Cottage Labs), Stuart Lewis (JISC/Cottage Labs), Graham Klyne (JISC), Shlomo Sanders (Ex Libris), Kevin Ford (LoC), Ed Summers (LoC), Jeff Young (OCLC), David Rosenthal (Stanford) Funding – The Sloan Foundation (core team) and the JISC (UK participation) Thanks for slides from – Stuart Lewis, Herbert Van de Sompel
  • 3. Synchronize what? •  Web resources – things with a URI that can be dereferenced and are cache-able (no dependency on underlying OS, technologies etc.) •  Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources) •  That change slowly (weeks/months) or quickly (seconds), and where latency needs may vary •  Focus on needs of research communication and cultural heritage organizations, but aim for generality
  • 4. Why? … because lots of projects and services are doing synchronization but have to roll their own on a case by case basis! •  Project team involved with projects that need this •  Experience with OAI-PMH: widely used in repos but o  XML metadata only o  Web technology has moved on since 1999 •  Data / Metadata / Linked Data – Shared solution?
  • 5. Use cases – the basics JISC
  • 7. Out-of-scope (for now) •  Bidirectional synchronization •  Destination-defined selective synchronization (query) •  Special understanding of complex objects •  Bulk URI migration •  Diffs (hooks?) •  Intra-application event notification •  Content tracking
  • 8. Use case: DBpedia Live duplication •  20M entries updated @ 1/s though sporadic •  Want low latency => need a push technology
  • 9. Use case: arXiv mirroring •  1M article versions, ~800/day created or updated at 8pm US eastern time •  Metadata and full-text for each article •  Accuracy important •  Want low barrier for others to use •  Look for more general solution than current homebrew mirroring (running with minor modifications since 1994!) and occasional rsync (filesystem layout specific, auth issues)
  • 10. Terminology •  Resource: an object to be synchronized, a web resource •  Source: system with the original or master resources •  Destination: system to which resources from the source will be copied and kept in synchronization •  Pull: process to get information from source to destination initiated by the destination. •  Push: process to get information from source to destination initiated by the source (and some subscription mechanism) •  Metadata: information about resources such as URI, modification time, checksum, etc. (Not to be confused with resources that may themselves be metadata records)
  • 11. Three basic needs 1.  Baseline synchronization – A destination must be able to perform an initial load or catch-up with a source -  avoid out-of-band setup; provide discovery 2.  Incremental synchronization – A destination must have some way to keep up-to-date with changes at a source -  subject to some latency; minimal: create/update/delete 3.  Audit – It should be possible to determine whether a destination is synchronized with a source -  subject to some latency; want efficiency > HTTP HEAD
  • 12. Baseline synchronization Either •  Get inventory of resources and then copy them one-by-one using HTTP GET o  simplest, inventory is list of resources plus perhaps metadata o  inventory format? or •  Get dump of resources and all necessary metadata o  more efficient: reduce number of round trips o  dump format?
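The inventory option on slide 12 can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `baseline_sync`, the in-memory `fake_source`, and the example URIs are all assumptions, and the `fetch` callable stands in for an HTTP GET.

```python
from typing import Callable, Dict, Iterable


def baseline_sync(
    inventory: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Dict[str, bytes]:
    """Copy every resource named in the inventory, one request per URI."""
    destination: Dict[str, bytes] = {}
    for uri in inventory:
        # One round trip per resource: simple, but costly for large sets.
        destination[uri] = fetch(uri)
    return destination


# Hypothetical stand-in for a source, so the sketch runs without a network.
fake_source = {
    "http://example.com/res1": b"first resource",
    "http://example.com/res2": b"second resource",
}

copy = baseline_sync(fake_source.keys(), fake_source.__getitem__)
```

Against a live source, `fetch` could be something like `lambda uri: urllib.request.urlopen(uri).read()`. The per-resource round trip is the cost that the dump option is meant to reduce.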
  • 13. Audit Could do new Baseline synchronization and compare … but likely very inefficient! Optimize by adding: •  Get inventory and compare with copy at destination o  use timestamp, digest or other metadata in inventory to check content (effort ↔ accuracy tradeoff) o  latency depends on freshness of inventory and time to copy and check (easier to cope with if modification times included in metadata)
  • 14. Incremental synchronization Simplest method is Audit and then copy of all new/updated resources, plus removal of deleted resources. Optimize by adding: •  Change Communication – Exchange ChangeSet listing only updates -  How to understand sequence, schedule? •  Resource Transfer – Exchange dumps for ChangeSets or even diffs appropriate to resource type Change Memory necessary to record sequence or intermediate states.
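A rough Python sketch of the inventory-based audit on slide 13 follows. It assumes the inventory maps each URI to a SHA-256 hex digest; the slides do not fix a digest algorithm, and the names `audit`, `AuditReport`, and `digest_hex` are illustrative only.

```python
import hashlib
from typing import Dict, List, NamedTuple


class AuditReport(NamedTuple):
    missing: List[str]  # listed in the inventory, absent at the destination
    changed: List[str]  # present at both, but digests differ
    extra: List[str]    # held at the destination, no longer in the inventory


def digest_hex(content: bytes) -> str:
    # Assumed digest choice for this sketch.
    return hashlib.sha256(content).hexdigest()


def audit(inventory: Dict[str, str], destination: Dict[str, bytes]) -> AuditReport:
    """Compare a source inventory (URI -> digest) with the destination copy."""
    missing = sorted(u for u in inventory if u not in destination)
    changed = sorted(
        u for u, d in inventory.items()
        if u in destination and digest_hex(destination[u]) != d
    )
    extra = sorted(u for u in destination if u not in inventory)
    return AuditReport(missing, changed, extra)


inventory = {
    "http://example.com/a": digest_hex(b"alpha"),
    "http://example.com/b": digest_hex(b"beta v2"),
    "http://example.com/c": digest_hex(b"gamma"),
}
destination = {
    "http://example.com/a": b"alpha",
    "http://example.com/b": b"beta v1",
    "http://example.com/d": b"delta",
}
report = audit(inventory, destination)
```

Comparing digests is the accurate but expensive end of the tradeoff. Comparing only modification times would be cheaper but could miss content changes that do not alter the timestamp.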
  • 15. Template to map approaches
  • 16. Approaches and technologies [Diagram placing technologies on the push/pull template: DSNotify, OAI-PMH, rsync, Crawl, OAI-ORE, RDFsync, WebDAV Col. Syn., XMPP, Atom, SWORD, AtomPub, Sitemap, RSS, SPARQLpush, PubSubHubbub, SDShare]
  • 17.
  • 18. A framework based on Sitemaps •  Modular framework allowing selective deployment •  Sitemap is the most basic component of the framework •  Reuse Sitemap form for changesets and notifications (same <url> element describing resource) •  Selective synchronization via tagging •  Discovery of capabilities via <atom:link> •  Further extension possible
  • 19. Baseline Sync with Inventory
  • 20. Level zero → Publish a Sitemap •  Periodic publication of an up-to-date Sitemap is base level implementation •  Use Sitemap <url> as is with <loc> and <lastmod> as core elements for each Resource o  Introduce optional extra elements to convey fixity information, size, tags for selective synchronization, etc. •  Extend to: o  Convey Source capabilities, discovery information, locations of dumps, locations of changesets, change memory, etc. o  Provide timestamp and/or additional metadata for the Sitemap
  • 21. Two resources, with lastmod times
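The example on slide 21 did not survive as text. As a stand-in, the following Python sketch builds and reads back a plain Sitemap for two resources using only the standard `<url>`, `<loc>` and `<lastmod>` elements. The URIs, timestamps, and function names are made up for illustration; no ResourceSync extension elements are shown because the transcript does not give their names.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(resources):
    """Return Sitemap XML (as str) for an iterable of (loc, lastmod) pairs."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default namespace
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


def read_sitemap(xml_text):
    """Parse Sitemap XML back into a list of (loc, lastmod) pairs."""
    ns = {"sm": SITEMAP_NS}
    root = ET.fromstring(xml_text)
    return [
        (u.findtext("sm:loc", namespaces=ns), u.findtext("sm:lastmod", namespaces=ns))
        for u in root.findall("sm:url", ns)
    ]


# Hypothetical resources and modification times.
resources = [
    ("http://example.com/res1", "2012-07-10T14:00:00Z"),
    ("http://example.com/res2", "2012-07-11T09:30:00Z"),
]
xml_text = build_sitemap(resources)
round_trip = read_sitemap(xml_text)
```

A destination could use the `(loc, lastmod)` pairs returned by `read_sitemap` as its inventory for baseline synchronization or audit.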
  • 22. Two resources, with lastmod times, sizes and digests. The second with a tag also
  • 23. Sitemap details & issues •  Sitemap XML format designed to allow extension •  ResourceSync additions: o  Additional core elements in ResourceSync namespace (digest, size, update information) o  Discovery information using <atom:link> elements •  Use existing Sitemap Index scheme for large sets of resources (handles up to 2.5 billion resources before further extension required) •  Provide mapping to RDF semantics but keep XML simple
  • 24. Incremental Sync with ChangeSet
  • 25. ChangeSet •  Reuse Sitemap format but include information only for change events over a certain period: •  One <url> element per change event •  The <url> element uses <loc> and <lastmod> as is and is extended with: •  an event type to express create/update/delete •  an optional event id to provide a unique identifier for the event. •  can further extend to include fixity, tag info, Memento TimeGate link, special-purpose access-point, etc. •  Introduce minimal <urlset>-level extensions to support: •  Navigation between ChangeSets via <atom:link> •  Timestamping the ChangeSet
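The 2.5 billion figure on slide 23 follows from the limits in the Sitemap protocol: at most 50,000 URLs per Sitemap and at most 50,000 Sitemaps per Sitemap Index. A short Python check, with the constant and function names chosen here for illustration:

```python
# Limits from the Sitemap protocol.
MAX_URLS_PER_SITEMAP = 50_000
MAX_SITEMAPS_PER_INDEX = 50_000

# Capacity of a single Sitemap Index with every Sitemap full.
index_capacity = MAX_URLS_PER_SITEMAP * MAX_SITEMAPS_PER_INDEX


def fits_in_one_index(resource_count: int) -> bool:
    """True if the resources can be listed under a single Sitemap Index."""
    return resource_count <= index_capacity
```

Here `index_capacity` is 2,500,000,000, so a collection of a few million resources sits far below the point where further extension would be needed.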
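How a destination might apply such a ChangeSet can be sketched as below. The `ChangeEvent` class and its field names are hypothetical stand-ins for the extended `<url>` element, since the transcript does not name the extension elements. The sketch also assumes all `lastmod` values are UTC timestamps in one consistent ISO 8601 format, so that string order matches time order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class ChangeEvent:
    loc: str         # resource URI, as in <loc>
    lastmod: str     # time of the change, as in <lastmod>
    event_type: str  # "create", "update" or "delete"


def apply_changeset(
    destination: Dict[str, bytes],
    events: Iterable[ChangeEvent],
    fetch: Callable[[str], bytes],
) -> None:
    """Bring the destination up to date by replaying events in time order."""
    for event in sorted(events, key=lambda e: e.lastmod):
        if event.event_type in ("create", "update"):
            destination[event.loc] = fetch(event.loc)
        elif event.event_type == "delete":
            destination.pop(event.loc, None)
        else:
            raise ValueError(f"unknown event type: {event.event_type!r}")


# Hypothetical state: the source changed "a", added "c" and removed "b".
source_now = {"http://example.com/a": b"a2", "http://example.com/c": b"c1"}
destination = {"http://example.com/a": b"a1", "http://example.com/b": b"b1"}
changeset = [
    ChangeEvent("http://example.com/b", "2012-07-11T03:00:00Z", "delete"),
    ChangeEvent("http://example.com/a", "2012-07-11T01:00:00Z", "update"),
    ChangeEvent("http://example.com/c", "2012-07-11T02:00:00Z", "create"),
]
apply_changeset(destination, changeset, source_now.__getitem__)
```

Note that `fetch` retrieves the current state of a resource, not its state at the time of the event, so several updates to one resource within a ChangeSet collapse to the latest content.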
  • 26. Expt: arXiv – Inventory and ChangeSet •  Baseline synchronization and Audit (Inventory): o  2.3M resources (300GB content) o  46 sitemaps and 1 sitemapindex (50k resources/sitemap) o  sitemaps ~9.3MB each -> 430MB total uncompressed; 1.7MB each -> 78MB total if gzipped (<0.03% content size) •  Incremental synchronization (ChangeSet): o  arXiv has updates daily @ 8pm so create daily ChangeSet o  ~1k additions and 700 updates per day o  1 sitemap ~300kB or 20kB gzipped, can be generated and served statically o  keep chain of ChangeSets, link with <atom:link>
  • 27. Incremental Sync with Push via XMPP
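The inventory figures on slide 26 can be reproduced with a little arithmetic. The Python below only restates the numbers from the slide; it assumes decimal units (1 GB = 1000 MB), which the slide does not specify.

```python
import math

resources = 2_300_000          # resources in the arXiv inventory
per_sitemap = 50_000           # resources listed per sitemap

sitemaps = math.ceil(resources / per_sitemap)

raw_total_mb = sitemaps * 9.3  # ~9.3MB per uncompressed sitemap
gz_total_mb = sitemaps * 1.7   # ~1.7MB per gzipped sitemap

content_mb = 300 * 1000        # 300GB of content, decimal units assumed
overhead_pct = 100 * gz_total_mb / content_mb
```

This gives 46 sitemaps, roughly 428MB uncompressed (quoted as 430MB), roughly 78MB gzipped, and an overhead of about 0.026% of the content size, consistent with the "<0.03%" on the slide.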
  • 28. Change Communication: Push via XMPP •  Rapid notification of change events via XMPP PubSub node; one notification per event •  Each change event is conveyed using a Sitemap <url> element contained in a dedicated XMPP <item> wrapper •  Use same resource metadata (e.g. <loc>, <lastmod>) and same extensions as with changesets •  Multiple change events can be grouped into a single XMPP message (using <items>)
  • 29. Expt: LiveDBpedia with XMPP Push •  LANL Research Library ran a significant scale experiment in synchronization of the LiveDBpedia database from Los Alamos to two remote sites using XMPP to push change notifications o  Push for change communication only, content then obtained with HTTP GET •  Destination sites were able to keep in close synchronization with sources o  Maximum queued updates <400 over 6 runs with 100k updates; and bursty updates averaging ~1/s o  Small number of errors suggests use for audit in many real-life situations
  • 30. Dumps Optimization over making repeated HTTP GET requests for multiple resources. Use for baseline and changeset. Options: 1.  ZIP+Sitemap o  simple and ZIP very widely used o  consistent inventory/change/set format o  con: “custom” 2.  WARC o  designed for exactly this purpose o  con: little used outside web archiving community
  • 31. Sitemaps + XMPP + Dumps
  • 32. Timeline and input •  July 2012 – First draft of sitemap-based spec (SOON) •  August 2012 – Publicize and solicit feedback (will be NISO email list) •  September 2012 – Revise, more experiments, more feedback •  December 2012 – Finalize specification (?) •  NISO webspace •  Code on github: http://github.org/resync/simulator