Multi-Archival Syndicated Storage Platform


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Multi-Archival Syndicated Storage Platform

  1. 1. Bryan Beecher University of Michigan Director, Computing & Network Services E: [email_address] W: Micah Altman Harvard University Archival Director, Henry A. Murray Research Archive Associate Director, Harvard-MIT Data Center Senior Research Scientist, Institute for Quantitative Social Sciences E: [email_address] W:
  2. 2. <ul><li>Roadmap </li></ul><ul><ul><li>Why replicate for preservation? </li></ul></ul><ul><ul><li>What is institutional model for replication in Data-PASS use? </li></ul></ul><ul><ul><li>How do we build on LOCKSS to support these institutional needs? </li></ul></ul><ul><li>Collaborators and Conspirators </li></ul><ul><ul><li>Leonid Andreev, IQSS; Steve Burling, ICPSR; Jonathan Crabtree, Odum; Marc Maynard, Roper; Nancy McGovern, ICPSR; </li></ul></ul>
  3. 3. <ul><li>Technical </li></ul><ul><ul><li>Media failure: storage conditions, media characteristics </li></ul></ul><ul><ul><li>Format obsolescence </li></ul></ul><ul><ul><li>Preservation infrastructure software failure </li></ul></ul><ul><ul><li>Storage infrastructure software failure </li></ul></ul><ul><ul><li>Storage infrastructure hardware failure </li></ul></ul><ul><li>External Threats to Institutions </li></ul><ul><ul><li>Third party attacks </li></ul></ul><ul><ul><li>Institutional funding </li></ul></ul><ul><ul><li>Change in legal regimes </li></ul></ul><ul><li>Quis custodiet ipsos custodes? </li></ul><ul><ul><li>Unintentional curatorial modification </li></ul></ul><ul><ul><li>Loss of institutional knowledge & skills </li></ul></ul><ul><ul><li>Intentional curatorial deaccessioning </li></ul></ul><ul><ul><li>Change in institutional mission </li></ul></ul>
  4. 4. <ul><li>There are potential single points of failure in both technology, organization and legal regimes: </li></ul><ul><ul><li>Diversify your portfolio: multiple software systems, hardware, organization </li></ul></ul><ul><ul><li>Find diverse partners – diverse business models, legal regimes </li></ul></ul>
  5. 5. <ul><li>Consider organizational credentials </li></ul><ul><li>No organization is absolutely certain to be reliable </li></ul><ul><li>Consider the trust relationships across institutions </li></ul> /
  6. 6. <ul><li>Policy Driven </li></ul><ul><ul><li>Institutional policy creates formal replication commitments </li></ul></ul><ul><ul><li>Replication commitments are described in metadata, using schema </li></ul></ul><ul><ul><li>Metadata drives </li></ul></ul><ul><ul><ul><li>Configuration of replication network </li></ul></ul></ul><ul><ul><ul><li>Auditing of replication network </li></ul></ul></ul>
  7. 7. <ul><li>Asymmetric Commitments Partners vary in … </li></ul><ul><ul><li>… storage commitments to replication </li></ul></ul><ul><ul><li>… size of holdings being replicated </li></ul></ul><ul><ul><li>… what holdings of other partners they replicate </li></ul></ul>
  8. 8. <ul><li>Completeness </li></ul><ul><ul><li>Complete public holdings of each partner </li></ul></ul><ul><ul><li>Retain previous version of holdings </li></ul></ul><ul><ul><li>Include </li></ul></ul><ul><ul><ul><li>metadata </li></ul></ul></ul><ul><ul><ul><li>data </li></ul></ul></ul><ul><ul><ul><li>documentation </li></ul></ul></ul><ul><ul><ul><li>legal agreements </li></ul></ul></ul>
  9. 9. <ul><li>Restoration guarantees </li></ul><ul><ul><li>Restore groups of versioned content </li></ul></ul><ul><ul><ul><li>to owning archive </li></ul></ul></ul><ul><ul><ul><li>to replication hosts </li></ul></ul></ul><ul><ul><li>Institutional failure restoration: – transfer entire holdings of an archive to another </li></ul></ul>
  10. 10. <ul><li>Trust & Verification </li></ul><ul><ul><li>Each partner is trusted … </li></ul></ul><ul><ul><ul><li>… to hold the public content of other (not to disseminate improperly) </li></ul></ul></ul><ul><ul><ul><li>… to add units to be harvested </li></ul></ul></ul><ul><ul><li>No partner is trusted to be “super-user” </li></ul></ul><ul><ul><ul><li>No deletion (or directly manipulation of replication storage owned by another partner </li></ul></ul></ul><ul><ul><li>Legal agreements reinforce trust model </li></ul></ul><ul><ul><li>Schema based auditing used to … </li></ul></ul><ul><ul><ul><li>… verify replication guarantees are met </li></ul></ul></ul><ul><ul><ul><li>… record replication and storage commitments </li></ul></ul></ul><ul><ul><ul><li>… document related TRAC criteria </li></ul></ul></ul>
  11. 11. <ul><li>Network level: </li></ul><ul><ul><li>Identification: name; description; contact; access point URI </li></ul></ul><ul><ul><li>Capabilities: protocol version; number of replicates maintained; replication frequency; versioning/deletion support </li></ul></ul><ul><ul><li>Human readable documentation: restrictions on content that may be placed in the network; services guaranteed by the network; Virtual Organization policies relating to network maintenance </li></ul></ul><ul><li>Host level </li></ul><ul><ul><li>Identification: name; description; contact; access point URI </li></ul></ul><ul><ul><li>Capabilities: protocol version; storage available </li></ul></ul><ul><ul><li>Human readable terms of use: Documentation of hardware, software and operating personnel in support of TRAC criteria  </li></ul></ul><ul><li>Archival unit level </li></ul><ul><ul><li>Identification: name; description; contact; access point URI </li></ul></ul><ul><ul><li>Attributes: update frequency, plugin required for harvesting, storage required </li></ul></ul><ul><ul><li>Terms of use: Required statement of content compliance with network terms. ; Dissemination terms and conditions </li></ul></ul><ul><li>TRAC Integration </li></ul><ul><ul><li>A number of elements comprise documentation showing how the replication system itself supports relevant TRAC criteria </li></ul></ul><ul><ul><li>Other elements that may be use to include text, or reference external text that documents evidence of compliance with TRAC criteria. </li></ul></ul><ul><ul><li>Specific TRAC criteria are identified implicitly, can be explicitly identified with attributes </li></ul></ul><ul><ul><li>Schema documentation describes each elements relevance to TRAC, and mapping to particular TRAC criteria </li></ul></ul>
  12. 12. <ul><ul><li>Initialization: Given schema instance distribute AU harvesting responsibility to hosts </li></ul></ul><ul><ul><li>Auditing: Does current host harvesting allocation & history match replication commitment in schema? </li></ul></ul><ul><ul><li>Recovery of hosts </li></ul></ul><ul><ul><li>Deliver AU content to source archive </li></ul></ul><ul><ul><li>Addition of AU’s, hosts </li></ul></ul><ul><ul><li>Growth of AU’s over initial commitment </li></ul></ul><ul><ul><li>Assumptions </li></ul></ul><ul><ul><li>Nothing is deleted </li></ul></ul><ul><ul><li>Resources in network grow monotonically </li></ul></ul><ul><ul><li>Off-the-path behavior is detected automatically, resolved manually </li></ul></ul>
  13. 15. <ul><li>LOCKSS </li></ul><ul><li>CLOCKSS </li></ul><ul><li>Very easy to build and deploy </li></ul><ul><ul><li>5 minutes </li></ul></ul><ul><li>Very easy to plug into public LOCKSS network </li></ul><ul><ul><li>5 minutes </li></ul></ul><ul><li>Very easy to manage thereafter – it is basically an appliance </li></ul><ul><li>Also easy to set-up </li></ul><ul><li>Grouping your CLOCKSS devices into a private network and paring the 20k-line configuration file into the right 200-line configuration file is not </li></ul><ul><li>Managing a network now, not a device </li></ul>
  14. 16. <ul><li>Standard LOCKSS used for </li></ul><ul><ul><li>Harvesting </li></ul></ul><ul><ul><li>Recovery </li></ul></ul><ul><li>New LOCKSS bulk update mechanism used for </li></ul><ul><ul><li>Initial configuration </li></ul></ul><ul><ul><li>Adding AU’s </li></ul></ul><ul><li>CLOCKSS mechanisms (certificates, cache monitor) </li></ul><ul><ul><li>Content delivery </li></ul></ul><ul><ul><li>Optimize recovery </li></ul></ul><ul><ul><li>Auditing </li></ul></ul><ul><li>Data-PASS customizations for schema processing </li></ul><ul><ul><li>Translating schema instance into bulk update requests </li></ul></ul><ul><ul><li>Reporting on compliance based on cache monitor database </li></ul></ul>
  15. 17. <ul><li>Summer 2007: Attended the MetaArchive LOCKSS tutorial </li></ul><ul><ul><li>Very good overview of LOCKSS </li></ul></ul><ul><li>Summer 2007: SSP System Requirements Developed & Approved </li></ul><ul><li>Winter 2007: First public LOCKSS network nodes built at two Data-PASS sites </li></ul><ul><li>Winter 2007: SSP Replication Commitment Schema Developed </li></ul><ul><li>Spring 2008: Completed Test harvest of MRA collection into LOCKSS </li></ul><ul><li>Sprint 2008: SSP System Use Cases Developed </li></ul><ul><li>Spring 2008: Prototype plugin developed to harvest Dataverse Networks </li></ul><ul><li>Spring 2008: Data-PASS sites are joined into single Private LOCKSS Network (PLN) </li></ul><ul><li>Spring 2008: Met with LOCKSS developers to review use cases </li></ul><ul><ul><li>SSP will leverage functionality “in the works” by LOCKSS team </li></ul></ul>
  16. 19. <ul><ul><li>Replication ameliorates institutional risks to preservation </li></ul></ul><ul><ul><li>Data PASS requires policy based, auditable, asymmetric replication commitments </li></ul></ul><ul><ul><li>Formalize policy in schema </li></ul></ul><ul><ul><li>(Re)Configure & audit LOCKSS using schema </li></ul></ul><ul><ul><li>Replication uses standard LOCKSS mechanisms </li></ul></ul>