
Digital Document Preservation Simulation - Boston Python User's Group


Mr. Rick Landau, Research Affiliate, will present a briefing at the Boston Python Users Group meeting on the topic of simulating document loss.



Durable access to information requires insuring against multiple risks such as media failures, format obsolescence, fires, floods, earthquakes, institutional failures, mergers, funding cuts, and malicious insiders. Real libraries need empirical guidance on storage strategies and costs, and vendor claims of reliability are often uninformative or suspect. The talk describes how event-based simulations, coded in Python, are used to simulate document loss under a variety of risk profiles and to provide practical guidance.



This meeting will be held at the Microsoft NERD building, 1 Memorial Drive, and is open to the public. More information is available through the meeting web page:

Published in: Technology

  1. Digital Document Preservation Simulation
     Richard Landau (late of Digital Equipment, Compaq, Dell)
     Research Affiliate, MIT Libraries Program on Information Science
     http://informatics.mit.edu/people/richard-landau
     2014-07-22
  2. The Problem
     • How do you preserve digital documents long-term?
     • Surely not on CDs, tapes, etc., with short lifetimes
     • LOCKSS: Lots Of Copies Keep Stuff Safe
     • What's the threat model?
       – Failures: media, format obsolescence, fires, floods, earthquakes, institutional failures, mergers, funding cuts, malicious insiders, …
     • Some data exists on disk media reliability, RAID, etc.
     • Little data on reliability of storage strategies
     BPG - Digital Document Preservation Simulation 20140722 RBL 2
  3. The Project
     • Digital Document Preservation Simulation
     • Part of Program on Information Science, MIT Libraries: Dr. Micah Altman, Director of Research
     • Develop empirical data that real libraries can use to make decisions about storage strategies and costs
       – Simulate a range of situations, let the clients decide what level of risk they are willing to accept
     • I do this for fun: volunteer intern, 1-2 days/week
  4. Questions to Be Answered
     • Start small: 10,000 documents for 10 years, 1-20 copies
     • (Short term questions, lots more in the long term)
     • Question 0: If I place a number of copies out into the network, how many will I lose over time?
       – For various error rates and copies, what's the risk?
     • Question 1: If I audit the remote copies and repair them if they're broken, how many will I lose over time?
       – For various auditing strategies and frequencies, what's the risk?
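As a rough baseline for Question 0, the loss risk can be estimated in closed form under strong simplifying assumptions: every copy fails independently, copy lifetimes are exponentially distributed, and nothing is ever repaired. These assumptions, the function name `loss_probability`, and the numbers below are illustrative only and are not taken from the simulation itself.

```python
import math


def loss_probability(copies, years, mean_lifetime_years):
    """P(all copies fail within `years`), assuming independent
    exponential lifetimes and no repair (a deliberate simplification)."""
    p_copy_fails = 1.0 - math.exp(-years / mean_lifetime_years)
    return p_copy_fails ** copies


# Hypothetical numbers, for illustration only.
one_copy = loss_probability(1, 10, 100.0)
five_copies = loss_probability(5, 10, 100.0)
```

A simulation becomes necessary once failures are correlated (a whole shelf dies) or auditing and repair enter the picture, which is where this kind of formula stops applying.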
  5. Tools
     • Windows 7 on largish PCs
     • Cygwin for Windows (like Linux on Windows)
     • Python 2.7
       – SimPy library for discrete event simulations
       – argparse module for CLI
       – csv module for reading parameter files
       – logging module for recording events
       – itertools for generating serial numbers
       – random to generate exponentials, uniforms, Gaussians
     • Random seed values from random.org
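A minimal sketch of how argparse can define a command line of this kind. The argument names are hypothetical and only loosely modeled on the command template shown later in the deck; this is not the project's actual CLI.

```python
import argparse

parser = argparse.ArgumentParser(description="Hypothetical simulation CLI")
parser.add_argument("seed", type=int, help="random seed")
parser.add_argument("--ncopies", type=int, default=1, help="copies per document")
parser.add_argument("--ber", type=float, default=0.0, help="error-rate parameter")

# Parse an explicit list instead of sys.argv so the example is self-contained.
args = parser.parse_args(["919028296", "--ncopies=5"])
```

The `type=` arguments make argparse convert the strings from the command line, so `args.ncopies` is already an integer.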
  6. Approach
     • Very traditional object and service oriented design
       – Client, Document, Collection of documents
       – Server, Copy of document, Storage Shelf
       – Auditor
     • Asynchronous processes managed by SimPy
       – Sector failure, shelf failure, at exponential intervals
       – Audit a collection on a server, at regular intervals
     • SimPy resource, network mutex for auditors
     • Slightly perverse code (e.g., Hungarian naming)
  7. How SimPy Works
     • Library for discrete event simulation, V3
     • Discrete event queue sorted by time
     • Asynchronous processes, resources, events, timers
       – Process is a Python generator, yield to wait for event(s)
     • Timeouts, discrete events, resource requests
     • Very simple to use, very efficient, good tutorial examples
     • Time is floating point, choose your units
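The mechanism described on this slide (a time-sorted event queue driving generator-based processes) can be sketched with the standard library alone. This is not SimPy's API or implementation; `run`, `sector_failures`, the shelf names, and all parameter values are hypothetical and chosen only for illustration.

```python
import heapq
import random


def sector_failures(name, mean_interval, log, rng):
    """A process: wait an exponentially distributed time, then log a failure."""
    while True:
        delay = rng.expovariate(1.0 / mean_interval)
        now = yield delay          # yield means "wake me after this much time"
        log.append((now, name))


def run(processes, until):
    """Drive generator processes from an event queue sorted by time."""
    queue = []                     # entries: (wake_time, sequence_number, process)
    for seq, proc in enumerate(processes):
        delay = next(proc)         # run each process to its first yield
        heapq.heappush(queue, (delay, seq, proc))
    while queue and queue[0][0] <= until:
        now, seq, proc = heapq.heappop(queue)
        delay = proc.send(now)     # resume the process at its wake time
        heapq.heappush(queue, (now + delay, seq, proc))


rng = random.Random(12345)
log = []
run([sector_failures("shelf-A", 2.0, log, rng),
     sector_failures("shelf-B", 5.0, log, rng)], until=10.0)
```

The sequence number in each queue entry breaks ties between equal wake times, so the heap never has to compare generator objects. Simulated time only advances when an event is popped, which is why a run covering years of simulated time can finish in seconds.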
  8. Early Results for Q0
  9. Is Python Fast Enough?
     • Yes, if one is careful
       – Code it all in C++? No. Get results faster with Python.
       – "Premature optimization is the root of all evil" -- Knuth
     • Numerous small optimizations on the squeaky wheels
       – Just a few reduced CPU and I/O by 98%
     • If it's still not fast enough, use a bigger computer
  10. Burn Those CPUs!
      • One test takes only 2 sec to 8 min
      • 2000-4000 tests
      • Scheduler runs 5-7 programs at once
      • All 4 cores
      • (Poor man's parallelism)
      • 99% CPU use
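One way to get this kind of poor man's parallelism is a loop that keeps a fixed number of child processes running. The deck itself throttles by counting processes with `ps`; the sketch below swaps in a different technique and tracks its own children with `subprocess.Popen.poll()`. The limit, the job list, and the variable names are hypothetical.

```python
import subprocess
import sys
import time

MAX_RUNNING = 3                      # hypothetical limit; the slide mentions 5-7
# Stand-in jobs: each child process just exits immediately.
pending = [[sys.executable, "-c", "pass"] for _ in range(6)]
running = []
finished = 0
peak = 0

while pending or running:
    # Drop children that have exited (poll() returns None while still running).
    still_running = [p for p in running if p.poll() is None]
    finished += len(running) - len(still_running)
    running = still_running
    # Start new children while there is a free slot.
    while pending and len(running) < MAX_RUNNING:
        running.append(subprocess.Popen(pending.pop()))
    peak = max(peak, len(running))
    time.sleep(0.05)                 # avoid a busy loop while children work
```

Each test is an independent process, so no locking or shared state is needed; the operating system spreads the children across the cores.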
  11. Class Instances Should Have Names
      • <client.CDocument object at 0x6ffff9f6690> is imperfectly informative
        – D127 is much better for humans: name, not address
        – With zillions of instances, identifying one during debugging is hard
      • Suggestion:
        – Assign a unique ID to every instance that denotes class and a serial number
        – Always always always pass IDs as arguments rather than instance pointers
  12. How I Assign IDs
      • Every class begins like this:

          import itertools

          class CDocument(object):
              # Python 2.7: .next is the counter's bound method
              getID = itertools.count(1).next

              def __init__(self):
                  self.ID = "D" + str(self.getID())
                  G.dID2Document[self.ID] = self   # G holds the global dictionaries

      • itertools.count(1).next function returns a unique integer for this class, starting at 1, when you invoke it
      • Global dictionary dID2Xxxx translates from an ID string to an instance of class Xxxx
      • Pass ID as argument; when a function needs the pointer, it can use a single fast dictionary lookup
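The `.next` attribute used on the slide exists only in Python 2. A variant of the same idea that also runs on Python 3 can call the built-in `next()` on a per-class counter. In this sketch the module-level dictionary stands in for the deck's global `G.dID2Document`, and `_serials` and `__repr__` are additions for illustration, not part of the original code.

```python
import itertools

dID2Document = {}   # ID string -> instance lookup (stand-in for G.dID2Document)


class CDocument(object):
    _serials = itertools.count(1)      # one counter shared by all CDocuments

    def __init__(self):
        self.ID = "D" + str(next(CDocument._serials))
        dID2Document[self.ID] = self

    def __repr__(self):
        return self.ID                 # prints as D1, D2, ... instead of an address


first = CDocument()
second = CDocument()
```

Code that receives only the string `"D2"` can recover the instance with `dID2Document["D2"]`, and log lines stay readable because they carry the ID rather than a memory address.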
  13. Lessons Learned
      • Python is fast enough
      • SimPy is really dandy
      • argparse is a great lib for CLIs
      • csv.DictReader makes everything a string, beware
      • item-in-list is really slow, O(n) (about n/2 comparisons on average), use dict or set instead
      • Lots of comments in the code!
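Two of these lessons are easy to demonstrate in a few lines. The sketch below is written for Python 3 (it uses `io.StringIO` with text), and the column names and values are made up.

```python
import csv
import io

# Hypothetical parameter file contents; the column names are made up.
text = "copies,lifetime\n3,1000\n"
row = next(csv.DictReader(io.StringIO(text)))

copies_raw = row["copies"]        # the string "3", not the integer 3
copies = int(copies_raw)          # convert explicitly before doing arithmetic

# Membership: a list is scanned element by element; a set uses hashing.
id_list = ["D%d" % i for i in range(1, 10001)]
id_set = set(id_list)
found_in_list = "D9999" in id_list   # linear scan through the list
found_in_set = "D9999" in id_set     # hash lookup, constant time on average
```

Forgetting the conversion is easy to miss because many operations still work on strings: `"3" * 2` is `"33"`, not `6`.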
  14. Code is on GitHub
      • On GitHub: MIT-Informatics/PreservationSimulation
      • Questions: mailto:landau@ricksoft.com
  15. Further Details (Not Tonight)
      • Many small optimizations add up
        – Fast find of victim document on a shelf
        – Shorten log file by removing many or all details
        – Minimize just-in-time checks during auditing
        – Change item-in-list checks to item-in-dict
      • Wide variety of scenarios covered
        – 1000X span on document size, 100X on storage shelf size, 10000X span on failure rates, 20X span on number of copies
        – A test run takes from 2 seconds to 8 minutes (CPU)
      • Running external programs
      • Post-processing of structured log files
  16. Forming and Running External Commands
      • Substitute multiple arguments into a string with format()
      • Run command and capture output into string (or list)

          import os

          # dictArgs must supply a value for the {Name} placeholder
          TemplateCmd = "ps | grep {Name} | wc -l"
          RealCmd = TemplateCmd.format(**dictArgs)
          ResultString = ""
          for Line in os.popen(RealCmd):
              ResultString += Line.strip()
          nProcesses = int(ResultString)
          if nProcesses < nProcessMax:
              . . .
  17. A More Complicated Command
      • Read a dict of parameters with csv.DictReader
      • And then substitute lots of arguments with format()
      • Execute and capture output with os.popen()

        Parameter file (header row, then one data row):

          dir,specif,len,seed,cop,ber,berm,extra1
          ../q1,v3d50,0,919028296,01,0050,0050000,

        Command template:

          python main.py {family} {specif} {len} {seed} --ncopies={cop} --ber={berm} {extra1} > {dir}/{specif}/log{ber}/c{cop}b{ber}s{seed}.log 2>&1 &
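The read-then-substitute step can be sketched as follows, for Python 3. The parameter text and the template here are shortened, hypothetical versions of the ones on the slide, and the command is only built, not executed.

```python
import csv
import io

# Hypothetical two-line parameter file: header row plus one data row.
params_text = "dir,specif,seed,cop,ber\n../q1,v3d50,919028296,01,0050\n"
# Hypothetical, shortened template; the field names match the CSV columns.
template = ("python main.py {specif} {seed} --ncopies={cop} "
            "> {dir}/{specif}/log{ber}/c{cop}b{ber}s{seed}.log")

for params in csv.DictReader(io.StringIO(params_text)):
    command = template.format(**params)   # each {name} is filled from the dict
```

Because DictReader yields strings, values such as `01` and `0050` keep their leading zeros in the file names. A placeholder with no matching column raises `KeyError`, so a typo in either the template or the header shows up immediately.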
  18. Subprocess Module Instead
      • os.popen is being replaced by more general functions
        – subprocess.check_output
        – subprocess.Popen
        – subprocess.PIPE
        – Popen.communicate
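A minimal sketch of the subprocess equivalents. A child Python interpreter stands in for an arbitrary external command so that the example runs on any platform; the commands themselves are placeholders.

```python
import subprocess
import sys

# Run a child process and capture its stdout in one call.
# Passing the command as a list avoids shell quoting problems.
output = subprocess.check_output([sys.executable, "-c", "print(6 * 7)"])
answer = int(output.decode().strip())

# The same idea with Popen, PIPE, and communicate() for finer control.
proc = subprocess.Popen([sys.executable, "-c", "print('done')"],
                        stdout=subprocess.PIPE)
stdout_data, _ = proc.communicate()     # waits for the child to exit
status = stdout_data.decode().strip()
```

`check_output` raises `CalledProcessError` if the command exits with a nonzero status, which `os.popen` would not report unless the caller inspected the result of `close()`.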
  19. Supersede That Old Function
      • Have a complicated function that works, but you want to replace it with a spiffier version?
        – Edit in place?
          • Might break the whole thing for a while
        – Comment it out with ''' or """?
          • Not perfectly reliable, like "if 0" in C
      • Supersede it: add the new version after the old
        – Last version replaces previous one in module or class dict

          def Foo(…):
              (moldy old code)

          def Foo(…):
              (shiny new code)
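A tiny runnable illustration of superseding. A `def` statement binds a name when it executes, so a second `def` with the same name simply rebinds it. The function bodies are placeholders.

```python
def Foo(x):
    return x + 1      # old version, kept in the file for reference


def Foo(x):           # same name: this definition replaces the one above
    return x + 2      # new version


result = Foo(1)       # calls the later definition
```

Deleting the second definition restores the old behavior, which makes the swap easy to back out while the new version is still being tested.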