The structure and design of digital storage systems is a cornerstone of digital preservation. To better understand ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This talk discusses findings from this survey, common gaps, and trends in this area.
(I also have a little fun highlighting the hidden assumptions underlying Amazon Glacier's reliability claims. For more on that see this earlier post: http://drmaltman.wordpress.com/2012/11/15/amazons-creeping-glacier-and-digital-preservation )
Approaches to Preservation Storage Technologies
1. Prepared for
MIT Libraries Informatics Program Brown Bag Talk
June 2013
Approaches to Preservation Storage
Technologies
Dr. Micah Altman
<escience@mit.edu>
Director of Research, MIT Libraries
2. DISCLAIMER
These opinions are my own, they are not the opinions
of MIT, Brookings, any of the project funders, nor (with
the exception of co-authored previously published
work) my collaborators
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi,
Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle,
George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White,
etc.
3. Collaborators & Co-Conspirators
• Jefferson Bailey, Karen Cariani, Jonathan
Crabtree, Michelle Gallinger, Jane
Mandelbaum, Nancy McGovern, Trevor
Owens
• NDSA Coordination Committee & Working
Group Chairs
• Research Support
Thanks to the Library of Congress, & the
Massachusetts Institute of Technology.
4. Related Work
• Altman et al., 2013. “NDSA Storage Report: Reflections on
National Digital Stewardship Alliance Member Approaches to
Preservation Storage Technologies”, D-Lib Magazine 19 (5/6)
• National Digital Stewardship Alliance, 2013 (Forthcoming), 2014
National Agenda for Digital Stewardship.
• Micah Altman, Jonathan Crabtree (2011) Using the SafeArchive
System : TRAC-Based Auditing of LOCKSS, 165-170. In Archiving
2011.
Most reprints available from:
informatics.mit.edu
5. Simple question?
• If you have 1000 files (bitstreams), and you’d
like to have a 99.99% chance of accessing them
in 20 years, how do you store them?
6. Simplistic Answer: Put it in AWS
• Amazon Glacier claims a design reliability of
99.999999999%
• Neat-o !!!!!!!!!!
– Longer odds than winning Powerball OR
– Getting struck by lightning, three times OR
– (Possibly) eventually finding alien civilization
• But …
7. Clarifying Requirements
• What are the units of reliability? - Collection?
Object? Bit?
• What is the natural unit of risk?
• Is value of information uniform across units?
• How many of these do you have?
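The questions about units can be made concrete with a little arithmetic. The sketch below is not from the talk: it assumes the eleven-nines figure is an annual, per-object loss probability of 1e-11 and that losses are independent across objects and years (the very assumptions the talk goes on to question). The function names are hypothetical.

```python
import math


def collection_survival(annual_loss_prob: float, n_objects: int, years: int) -> float:
    """P(every object survives), assuming independent per-object-year losses."""
    # (1 - q)^(n * years), evaluated in log space so a tiny q is not rounded away.
    return math.exp(n_objects * years * math.log1p(-annual_loss_prob))


def required_annual_loss_prob(target: float, n_objects: int, years: int) -> float:
    """Largest per-object annual loss probability that still meets the target."""
    # Solve (1 - q)^(n * years) = target for q.
    return -math.expm1(math.log(target) / (n_objects * years))


if __name__ == "__main__":
    eleven_nines = 1e-11  # assumed reading of the design claim
    print(collection_survival(eleven_nines, 1_000, 20))   # 1000 files, 20 years
    print(collection_survival(eleven_nines, 10**9, 20))   # a billion objects, 20 years
    print(required_annual_loss_prob(0.9999, 1_000, 20))   # what the 99.99% goal needs
```

Under these assumptions the 1000-file collection survives intact with probability of about 0.9999998, while a billion-object collection does so only about 82% of the time, which is why the unit and the count matter as much as the number of nines.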
8. Hidden Assumptions
• Reliability estimates appear entirely theoretical
– (MTBF + Independence)* enough replicas -> as many 9’s as you like…
– No details for estimate provided
– No historical reliability statistics provided
– No service reliability auditing provided
• Empirical Issues
– Storage manufacturers' hardware MTBF (mean time between failures) does not
match observed error rates in real environments…
– Failures across hardware replicas are observed to be correlated
• Unmodeled failure modes
– software failure
(e.g. a bug in the AWS software for its control backplane might result in
permanent loss that would go undetected for a substantial time)
– legal threats (leading to account lock-out, deletion, or content
removal);
– institutional threats (such as a change in Amazon’s business model)
– Process threats (someone hits the delete button by mistake; forgets to pay
the bill; or AWS rejects the payment)
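To illustrate why the independence assumption carries so much weight, here is a minimal sketch, not taken from the talk. It compares the loss probability for k replicas under pure independence with a toy model that adds a single common-mode event (a software bug, account lock-out, or unpaid bill) destroying every replica at once. The parameter common_mode_prob and both function names are hypothetical.

```python
def loss_independent(per_replica_loss: float, n_replicas: int) -> float:
    """Loss probability if replicas fail independently: all k must fail."""
    return per_replica_loss ** n_replicas


def loss_with_common_mode(per_replica_loss: float, n_replicas: int,
                          common_mode_prob: float) -> float:
    """Loss probability when one shared event can destroy every replica.

    Either the common-mode event occurs, or it does not and all replicas
    still fail independently.
    """
    independent = per_replica_loss ** n_replicas
    return common_mode_prob + (1.0 - common_mode_prob) * independent


if __name__ == "__main__":
    q = 0.01   # assumed per-replica loss probability over the period
    c = 1e-4   # assumed probability of a common-mode event
    for k in (1, 2, 3, 5, 10):
        print(k, loss_independent(q, k), loss_with_common_mode(q, k, c))
```

In this toy model, adding replicas keeps adding nines only in the independent column; the second column never drops below the common-mode probability, however many copies are kept.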
9. Business Risks?
• Amazon SLA’s do not incorporate or reflect
“design” reliability claims:
– No claim to reliability in SLA’s
– Sole recovery for breach limited to refund of fees for
periods the service was unavailable
– No right to audit logs, or other evidence of reliability
10. What practices are
leading stewardship
organizations using?
11. Results from the NDSA Bi-Annual
Preservation Storage Survey
• 74 institutions surveyed.
58 met selection criteria.
– Follow up on non-responders: 100% response rate.
– Low rolloff on individual questions
– Next round will be > 2x bigger
• Survey Methods
– Closed-ended, with open-ended extensions
– Selected qualitative follow-up
• Survey Data
– Instrument and data available as open data
13. Key Findings: What are Current
Institutional Practices?
• 90% of respondents are distributing copies of at least part of their content
geographically
• 88% of respondents are responsible for their content for an indefinite
period of time
• 80% of respondents use some form of fixity checking for their content
• 75% of respondents report a strong preference to host and control their
own technical infrastructure for preservation storage
• 69% of respondents are considering or currently participating in a
distributed storage cooperative or system (ex. LOCKSS alliance, MetaArchive,
Data-PASS)
• 64% of respondents are planning to make significant changes in the
technologies in their preservation storage architecture in the next three
years
• 51% of respondents are considering or already using a cloud storage
provider to keep one or more copies of their content
• 48% of respondents are considering, or currently contracting out storage
services to be managed by another organization or company
17. What do organizations want from their
preservation systems?
18. What are most memory organizations
not doing yet?
• Formal cost and valuation models
• Auditing & evaluation
• Certification
• Comprehensive content review
20. Emerging State of the
Practice
21. Methods for Mitigating Bit-Level Risk
• Physical: media, hardware, environment
• Number of copies
• Diversification of copies
• Formats
• File transforms: compression, encoding, encryption
• Fixity
• Repair
• Local storage
• File systems: transforms, deduplication, redundancy
• Replication
• Verification
• Audit
22. Emerging State of Practice
• Organizational – Multi Institutional Stewardship
– Institutions hold digital assets they wish to preserve,
many unique
– Many of these assets are not replicated at all
– Even when institutions keep multiple backups offsite,
many single points of failure remain,
because replicas are managed by single institution
– Approaches: LOCKSS, Digital Preservation Network,
MetaArchive, Data-PASS, Datanet Federation
Consortium, Data-ONE
• Technical: Fixity, verification and auditing
• Legal: Succession planning, Confidentiality, …
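As a rough illustration of the technical item above (fixity, verification and auditing), the following sketch checks replicas against a recorded SHA-256 digest and repairs damaged copies from an intact one. It is an assumption-laden toy, not the method of any system named on this slide: replicas are held as in-memory bytes keyed by hypothetical site names, whereas a real audit would stream files from storage and protect the manifest itself.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Fixity value: hex SHA-256 digest of the content."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(replicas: dict[str, bytes], expected_digest: str) -> dict[str, bool]:
    """Verify each replica against the digest recorded at ingest."""
    return {name: sha256_hex(data) == expected_digest
            for name, data in replicas.items()}


def repair_replicas(replicas: dict[str, bytes], expected_digest: str) -> dict[str, bytes]:
    """Overwrite damaged replicas with the first replica that passes the audit."""
    status = audit_replicas(replicas, expected_digest)
    good = next((replicas[name] for name, ok in status.items() if ok), None)
    if good is None:
        raise ValueError("no intact replica available for repair")
    return {name: good for name in replicas}


if __name__ == "__main__":
    manifest_digest = sha256_hex(b"payload")  # recorded when the object was ingested
    copies = {"site_a": b"payload", "site_b": b"payl0ad", "site_c": b"payload"}
    print(audit_replicas(copies, manifest_digest))
    print(repair_replicas(copies, manifest_digest))
```

Repair is only possible while at least one replica still matches the recorded digest, which is one reason audits need to run often enough to catch damage before it reaches every copy.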
25. The Risk Problem Restated
Keeping risk of object loss fixed
-- what choices minimize $?
“Dual problem”
Keeping $ fixed, what choices minimize risk?
Extension
For specific cost functions for the loss of each object,
Loss(object_i), summed over all lost objects,
what choices minimize:
Total cost = preservation cost + sum_i E(Loss(object_i))
[Figure: plot of risk against cost, labeled "Are we there yet?"]
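The total-cost formulation can be sketched numerically. The example below is illustrative only: it treats the number of copies as the single choice variable, assumes each copy is lost independently with the same probability, and uses made-up costs and object values. None of these figures or names come from the talk.

```python
def total_cost(n_copies: int, cost_per_copy: float,
               per_copy_loss_prob: float, object_values: list[float]) -> float:
    """Total cost = preservation cost + sum of expected losses over objects."""
    # Assumed model: an object is lost only if every copy is lost, independently.
    p_object_lost = per_copy_loss_prob ** n_copies
    expected_loss = sum(p_object_lost * value for value in object_values)
    return n_copies * cost_per_copy + expected_loss


def best_copy_count(cost_per_copy: float, per_copy_loss_prob: float,
                    object_values: list[float], max_copies: int = 10) -> int:
    """Number of copies (1..max_copies) that minimizes total cost."""
    return min(range(1, max_copies + 1),
               key=lambda k: total_cost(k, cost_per_copy,
                                        per_copy_loss_prob, object_values))


if __name__ == "__main__":
    values = [1000.0] * 10  # hypothetical loss cost for each of ten objects
    for k in range(1, 5):
        print(k, total_cost(k, 100.0, 0.1, values))
    print("best:", best_copy_count(100.0, 0.1, values))
```

With these invented numbers the minimum falls at two copies: a third copy costs more than the expected loss it removes. Assigning different values to different objects shifts the optimum, which is why the formulation depends on valuation models.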
26. Research Directions
• Growing the evidence base
– Descriptive inference – patterns of use
– Descriptive inference – outcomes
– Predictive inference – trend analysis
– Causal inference – effectiveness of interventions
• Modes of inquiry
– probability-based surveys
(e.g. of information management practice and outcomes)
– replicable simulation experiments tied to theoretically grounded
models of information management and risk;
– creation of testbeds and test-corpuses which can be used to
systematically compare new practices, tools, and methods;
– field experiments, in which randomized interventions are applied and
evaluated in real operational environments.
27. Bibliography (Selected)
• David S.H. Rosenthal, Thomas S. Robertson, Tom Lipkis, Vicky Reich,
Seth Morabito. “Requirements for Digital Preservation Systems: A
Bottom-Up Approach”, D-Lib Magazine, vol. 11, no. 11, November
2005.
• Pinheiro, E., Weber, W.D., & Barroso, L. A. (2007). Failure trends in a large
disk drive population. In Proceedings of 5th USENIX Conference on File and
Storage Technologies.
• Rosenthal, David SH. "Bit preservation: a solved problem?." International
Journal of Digital Curation 5.1 (2010): 134-148.
This work, by Micah Altman (http://micahaltman.com), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.