Threat model
• “Preservation” means nothing unmodified.
  • This is why it becomes such a bogeyman!
• Two things you need to know first:
  • why you’re preserving what you’re preserving, and
  • what you’re preserving it against.
• Libraries: your collection-development policy should inform the first question.
  • Your coll-dev policy doesn’t include local born-digital or digitized materials? This is a problem. Fix it.
• The second question is your “threat model.”
Why did I just make you do that?
• I’m weird.
• I’m trying to destroy the myth that any given medium “preserves itself.”
• Media do not preserve themselves. People preserve media, or media get bizarrely lucky.
• We need not panic over digital preservation any more than we panic about print.
• Approach digital preservation the same way you approach print preservation.
Ignorance
• “It’s in Google, so it’s preserved.” (Not even “Google Books!”)
• “I make backups, so I’m fine.”
• “I have a graduate student who takes care of these things.”
• “Metadata? What’s that? I have to have it?”
• “Digital preservation is an unsolvable problem, so why even try?” (I’ve heard this one from librarians. I bet you have too.)
Mitigating the risks: planning and auditing tools
Audit frameworks
• Trustworthy Repositories Audit & Certification (TRAC) checklist
  • (If you see “NARA/RLG” somewhere? This is the framework that evolved into TRAC. Long story.)
  • You can get an actual formal TRAC audit from CRL! Who has? Portico, Hathi, “Chronicle of Life,” two or three others. This audit is HARSH. (So don’t write off a repo because it hasn’t had a TRAC audit.)
  • If you hear the phrase “trusted digital repository,” it should mean that the repo has had (or is pursuing) a TRAC audit.
• DRAMBORA
  • More flexible, less finger-shaking than TRAC.
  • Less of this “designated community” nonsense.
  • Less dependent on OAIS model (which I consider a strength).
  • Encourages archives to consider and document their individual situations and think hard about risk mitigation.
Newer: SPOT model
• Even less clunky than DRAMBORA.
• I quite like this one.
• “Identifying Threats to Successful Digital Preservation: the SPOT Model for Risk Assessment”
  • http://www.dlib.org/dlib/september12/vermaaten/09vermaaten.html
So what do they audit?
• Mission (and adherence to it)
• Plans and policies
  • including contingency plans
• Staff infrastructure
• Operations documentation
  • including tech infrastructure, service infrastructure
• Sustainable funding
• “Doing the right things with the stuff.”
  • identifiers, ingest, file format management, migration, etc.
• NOTICE WHAT’S FIRST ON THE LIST.
  • remember, the tech part is the easy part!
TRAC, DRAMBORA, and DH
• TRAC, DRAMBORA, and SPOT are designed to audit repositories, not individual datasets, data files, or research projects.
  • They assume a lot of infrastructure and (in TRAC’s case) a long-term time horizon that you probably don’t have.
• So if you’re trying to think through a project, where do you go?
  • TRAC and DRAMBORA are probably overkill!
  • (Though parts of DRAMBORA won’t hurt you.)
Data Curation Profiles
• Research project out of Purdue’s Digital Data Curation Center (“D2C2”)
• “Toolkit:” interview instrument, user guide for interview instrument, worksheet.
• Small library of completed profiles
• Ignore the user guide. Grab the worksheet, and use the interview instrument for reference.
• http://datacurationprofiles.org
  • You have to make a login to download the toolkit pieces.
Physical medium failure
• Gold CDs are not the panacea we thought.
  • They’re not bad; they’re just hard to audit, so they fail (when they fail) silently. Silent failure is DEADLY.
• Current state of the art: get it on spinning disk.
• Back up often. Distribute your backups geographically. Test them now and then.
  • Consider a LOCKSS cooperative agreement. Others have.
• Bitrot-detection techniques may help here too.
• Any physical medium WILL FAIL. Have a plan for when it does.
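“Test them now and then” can be as simple as an occasional byte-by-byte comparison of a backup against the working copy. The sketch below is one hedged way to do that in Python; the function name `check_backup` and the directory layout are assumptions made for illustration, not part of any tool named in these slides.

```python
import filecmp
from pathlib import Path


def check_backup(original_dir, backup_dir):
    """Return (relative_path, problem) pairs for files that are missing
    from the backup or whose bytes differ from the original."""
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    # filecmp remembers earlier comparisons; clear that cache so every
    # audit re-reads the bytes actually on disk.
    filecmp.clear_cache()
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems.append((relative.as_posix(), "missing"))
        # shallow=False compares file contents instead of trusting
        # size and modification time alone.
        elif not filecmp.cmp(source, copy, shallow=False):
            problems.append((relative.as_posix(), "differs"))
    return problems
```

An empty list means the backup matched at the moment of the check. Note that a mismatch only says the two copies disagree; it does not say which copy is the damaged one.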
“Digital forensics”
• The art and science of investigating digital file formats and media.
  • Reading obsolete ones.
  • Reverse-engineering and/or documenting existing ones so they don’t go obsolete.
  • Ensuring secure deletion, when necessary.
  • Reconstructing what used to be on a physical storage medium. (Surprising how often this is possible!)
  • Audit trails for legal and records-management purposes.
  • AMAZING report (highly highly recommended!): “Digital Forensics and Born-Digital Content in Cultural Heritage Institutions.” http://www.clir.org/pubs/abstract/pub149abst.html. Both computer-nerdy and humanities-nerdy in the best possible way.
Avoiding “bitrot”
• Sometimes used for “file format obsolescence.”
• I use it for “the bits flipped unexpectedly.”
• Checking a file bit-by-bit against a backup copy is computationally impractical for every day.
  • Though on ingest it’s a good idea to verify bit-by-bit!
• Checksums
  • A file is, fundamentally, a great big number.
  • Do math on the number. Store the result as metadata.
  • To check for bitrot, redo the math and check the answer against the stored result. If they’re different, scream.
  • Several checksum algorithms; for our purposes, which one you use doesn’t matter much.
  • “Hash collision:” it’s possible, but unlikely, for different files to have the same checksum. Potential hack vector!
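The checksum routine described above (do the math, store the result, redo the math later and compare) can be sketched with Python’s standard `hashlib` module. The function names, the JSON manifest, and the choice of SHA-256 are assumptions made for this example; the slides do not prescribe any of them.

```python
import hashlib
import json
from pathlib import Path


def checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Do math on the file: hash it in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(directory, manifest_path):
    """Store one checksum per file as metadata in a JSON manifest.
    Keep the manifest outside `directory` so it is not hashed itself."""
    directory = Path(directory)
    manifest = {
        path.relative_to(directory).as_posix(): checksum(path)
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def find_bitrot(directory, manifest_path):
    """Redo the math; return files that are gone or no longer match."""
    directory = Path(directory)
    stored = json.loads(Path(manifest_path).read_text())
    return [
        name
        for name, expected in stored.items()
        if not (directory / name).is_file()
        or checksum(directory / name) != expected
    ]
```

Run `record_checksums` once at ingest and `find_bitrot` on a schedule. Any name it returns is a reason to scream, and then to restore that file from a backup whose checksum still matches.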
Migration vs. emulation: dealing with obsolescence
• Migration
  • change the file to be usable in new software/hardware configurations
  • risks: information loss (FONTS!), imperfect transfer, choosing the wrong migration path
  • smart systems don’t throw away the old files!
• Emulation
  • keep the file, train new software/hardware to behave like the old
  • risks: imperfect emulation, impractical emulation
  • makes more sense for software (games!), less for files
• Pragmatically: redigitization.
Finding tools
• Migration
  • Current versions of the original software may be able to open old files.
  • Open-source software in the same genre may be able to translate proprietary file formats (often imperfectly). Open-source projects tend to maintain translators longer than you’d think.
  • Look on the web!
  • MIGRATE FAST. Once it’s damaged or obsolete, it’s probably too late.
• Emulation
  • look for the gamers! it’s WILD what they’ll emulate!
  • Look to the open-source community for operating-system and hardware-driver emulators.
  • Frankly, there’s a lot of hype and vaporware here.
When is a PDF not a PDF?
• When it’s a .doc with the wrong file extension
• When there’s no file extension on it at all
• When it’s so old it doesn’t follow the standardized PDF conventions
• When it’s otherwise malformed, made by a bad piece of software.
• How do you know whether you have a good PDF? (Or .doc, or .jpg, or .xml, or anything else.)
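One partial answer to the first two cases is to ignore the file extension and look at the first few bytes of the file, since many formats open with a fixed signature (a “magic number”). The sketch below checks a handful of well-known signatures; the table is deliberately tiny, and the function names are assumptions made for illustration. It only identifies a format. It does not tell you whether the file is a well-formed example of that format.

```python
# Well-known leading byte signatures. A real format registry knows
# far more formats and far more subtle rules than this.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.xls/.ppt
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),  # also .docx, .epub, and other zip containers
]


def sniff_format(path):
    """Guess a file's format from its leading bytes, ignoring its name.
    Returns a short label, or None if no signature matches."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in MAGIC_NUMBERS:
        if header.startswith(magic):
            return label
    return None


def looks_like_pdf(path):
    """True if the file starts the way a PDF starts, whatever it is called."""
    return sniff_format(path) == "pdf"
```

A file named `report.pdf` that sniffs as `ole2` is most likely a legacy Word document with the wrong extension. A `None` result means “unknown to this tiny table,” not “broken.”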
File format registries and testing tools
• JHOVE: JSTOR/Harvard Object Validation Environment
  • Java software intended to be pluggable into other software environments
  • Answers “What format is this thing?” and “Is this thing a good example of the format?”
  • Limited repertoire of formats
• PRONOM/DROID + GDFR = Unified Digital Formats Registry
• Wrapper tool: FITS, File Information Tool Set
  • JHOVE + DROID + various other testers. State of the art.
Thanks!
• Copyright 2011 by Dorothea Salo.
• This lecture and slide deck are licensed under a Creative Commons Attribution 3.0 United States License.