Preservation for the
digital arts and
University of Wisconsin
And I said...
... you’re giving me how
much time for this?
• “Preservation” means nothing unmodified.
• This is why it becomes such a bogeyman!
• Two things you need to know first:
• why you’re preserving what you’re preserving, and
• what you’re preserving it against.
• Your collection-development policy should
inform the first question.
• Your coll-dev policy doesn’t include local born-digital or
digitized materials? This is a problem. Fix it.
• The second question is your “threat model.”
Why did I just make you
• I’m weird.
• I’m trying to destroy the myth that any given
medium “preserves itself.”
• Media do not preserve themselves. People preserve media
—or media get bizarrely lucky.
• We need not panic over digital preservation
any more than we panic about print.
• Approach digital preservation the same way you approach
• Strategically: this approach helps your colleagues get a
grip, too. Your colleagues may well be the biggest barrier
to digital preservation in your library!
In your groups...
List important threats
to digital data.
• “It’s in Google, so it’s preserved.” (Not even
• “I make backups, so I’m fine.”
• “I have a graduate student who takes care of
• “Metadata? What’s that? I have to have it?”
• “Digital preservation is an unsolvable problem,
so why even try?” (I’ve heard this one from
librarians. I bet you have too.)
But first, a word about
• “We can’t save everything digital!”
• Well, no, we can’t.
• We can’t save everything printed either.
• That’s no excuse, in either medium. Why do we
let it be one for digital materials?
• Yes, we will lose some stuff. That’s life in the
big city. Dive in anyway.
And a word about scale
• Many of those currently panicking about digital
preservation are thinking about huge scales.
• At some repository size, bitrot happens faster than you can
detect and fix it.
• Last I heard, this was somewhere in the exabyte range.
• We’re not. So let’s relax about some of this
stuff. At our scale, many problems are solvable.
• Unless your problem is digital video. Good luck with that.
• Our scale problems happen on the front end, as
we’ve been learning this week.
Physical medium failure
• Gold CDs are not the panacea we thought.
• They’re not bad; they’re just hard to audit, so they fail
(when they fail) silently. Silent failure is DEADLY.
• How long will hardware be able to read them?
• ALL such physical media are risky, for the same reasons!
• Current state of the art: get it on spinning disk.
• Back up often. Distribute your backups
geographically. Test them now and then.
• Consider a LOCKSS cooperative agreement. Others have.
• Any physical medium WILL FAIL. Have a plan
for when it does.
Bitrot
• Sometimes used for “file format obsolescence.”
• I use it for “the bits flipped unexpectedly.”
• Checking a file bit-by-bit against a backup copy
is computationally impractical to do every day.
• Though on ingest it’s a good idea to verify bit-by-bit!
• A file is, fundamentally, a great big number.
• Do math on that number. Store the result (a checksum) as metadata.
• To check for bitrot, redo the math and check the answer
against the stored result. If they’re different, scream.
• Several checksum algorithms exist (MD5, SHA-1, SHA-256...);
for our purposes, which one you use doesn’t matter much.
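The checksum routine described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a production fixity service: the function names (`checksum`, `verify_copy`, `record_fixity`, `audit`), the in-memory manifest, and the choice of SHA-256 are all illustrative. Only the standard-library `hashlib` and `filecmp` modules are used.

```python
import filecmp
import hashlib
from pathlib import Path


def checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Do the math: return the hex digest of the file's bytes."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(original, copy):
    """On ingest: compare two files byte for byte (slow but thorough)."""
    filecmp.clear_cache()  # never trust a cached answer for a fixity check
    return filecmp.cmp(original, copy, shallow=False)


def record_fixity(paths, algorithm="sha256"):
    """Store the result as metadata: map each path to its checksum."""
    return {str(path): checksum(path, algorithm) for path in paths}


def audit(manifest, algorithm="sha256"):
    """Redo the math; return the paths that are missing or have changed."""
    failures = []
    for path, stored in manifest.items():
        if not Path(path).is_file() or checksum(path, algorithm) != stored:
            failures.append(path)
    return failures
```

In practice the manifest would be saved alongside the collection's other metadata and `audit` run on a schedule; any non-empty result is the cue to scream and reach for a backup.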
File format obsolescence
• When possible, prefer file formats that are:
• Open/non-proprietary. (If a software vendor goes out of
business, does their format?)
• Standardized, non-patent-encumbered
• In widespread use. (If the format dies, lots of people have
incentive to solve the problem.)
• For text, non-binary
• For everything else, lossless rather than lossy
• For compound objects, compound documents rather than
• Realistically? We often have to take what we get.
Lossless? Lossy? What?
• Essential tradeoff: quality and fidelity vs. file size
• Clipping information out makes the file size
smaller! But once it’s gone, it’s gone.
• Tremendous problem with video. Lossless video
formats are HUGE.
• Lossy image formats: JPEG, JPEG2000 (much
• (more or less) Lossless: TIFF, PNG, GIF
• Compression may be lossless or lossy. Find out!
• I am NOT going to talk about codecs vs.
container formats. Consider it homework.
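The lossless/lossy tradeoff can be shown on a toy byte stream. This is a sketch for illustration only: `lossy_quantize` is a made-up stand-in for what a real lossy codec does far more cleverly, and the random test data and step size of 16 are arbitrary assumptions. `zlib` (standard library) plays the part of lossless compression.

```python
import random
import zlib


def lossless_roundtrip(data):
    """Compress and decompress with zlib; every byte comes back."""
    return zlib.decompress(zlib.compress(data))


def lossy_quantize(data, step=16):
    """Throw information away: round each byte down to a multiple of step."""
    return bytes((value // step) * step for value in data)


rng = random.Random(0)  # fixed seed so the demonstration is repeatable
original = bytes(rng.randrange(256) for _ in range(4096))

# Lossless: whatever the compressed size, the original is fully recoverable.
assert lossless_roundtrip(original) == original

# Lossy: the quantized data compresses better, but the discarded
# detail cannot be reconstructed from what is left.
quantized = lossy_quantize(original)
print(len(zlib.compress(original)), len(zlib.compress(quantized)))
print(quantized == original)  # False
```

The second compressed size printed is the smaller one, and that saving is exactly the information that is gone for good.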
• No ideal choice here; lossless formats are
patent-encumbered and/or proprietary
• WAV and AIFF are okay. Ogg Vorbis is ideal, but
nobody supports it.
• mp3: if you must, it’s lossy.
Migration vs. emulation
• Migration: move the file to a new format
• Don’t throw away your original! You may have made the
wrong migration decision.
• Not necessarily a lossless process. (Fonts!)
• Emulation: create a modern hardware/software
environment that can deal with the old format
• For some cultural artifacts such as games, this is the only
• Emulation advocates make big claims that I’m not sure
they can back up. Proceed with caution.
• Migration of a dataset toward a well-defined
• “Treat the same thing the same way.”
• E.g. census data... define a set of data tables, move all
data into them.
• Great for interoperability and preservation!
• Pitfall: “the same thing”?
• Humanities: TEI is a de facto normalizer for
humanities textual data.
• (Other XML formats in other ﬁelds: e.g. ChemML, NLM
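“Define a set of data tables, move all data into them” might look like the following in miniature. Everything here is hypothetical: the two source layouts (`survey_a`, `survey_b`), their field names, and the three-field target schema are invented for illustration. The sketch also shows the pitfall: anything the mapping does not recognize as “the same thing” has to be set aside and dealt with by a person.

```python
# Hypothetical target schema and per-source field mappings.
TARGET_FIELDS = ("name", "age", "place")
FIELD_MAPS = {
    "survey_a": {"Name": "name", "Age": "age", "Town": "place"},
    "survey_b": {"full_name": "name", "age_years": "age", "residence": "place"},
}


def normalize(record, source):
    """Map one source record onto the target schema.

    Returns (row, leftovers): the normalized row, plus any source fields
    that have no agreed equivalent in the target schema.
    """
    mapping = FIELD_MAPS[source]
    row = {field: None for field in TARGET_FIELDS}
    leftovers = {}
    for key, value in record.items():
        if key in mapping:
            row[mapping[key]] = value
        else:
            leftovers[key] = value  # keep it; do not silently drop data
    if row["age"] is not None:
        row["age"] = int(row["age"])  # treat the same thing the same way
    return row, leftovers


row, extra = normalize(
    {"full_name": "Ada", "age_years": "36", "residence": "London",
     "occupation": "analyst"},
    "survey_b",
)
print(row)    # {'name': 'Ada', 'age': 36, 'place': 'London'}
print(extra)  # {'occupation': 'analyst'}
```

Keeping the leftovers (and the original records) alongside the normalized table means a wrong mapping decision can be revisited later.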
• Migration can preserve information content
and (often but not always) appearance.
• Preserving interaction patterns is much
• Or a database with a query engine
• Or an applet or Flash object
• Or a collection whose interactions are based on an
obsolete software system. (DynaText anyone?)
• Hard problem. No obvious solutions; certainly
no easy ones.
When is a PDF not a PDF?
• When it’s a .doc with the wrong file extension
• When there’s no file extension on it at all
• When it’s so old it doesn’t follow the
standardized PDF conventions
• When it’s otherwise malformed, made by a
bad piece of software.
• How do you know whether you have a good
PDF? (Or .doc, or .jpg, or .xml, or anything else.)
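A first, crude answer to “is this really a PDF?” is to look at the file's leading bytes instead of its extension. The sketch below assumes a hand-picked table of well-known signatures; the names `SIGNATURES`, `EXPECTED`, `sniff_format`, and `extension_is_honest` are hypothetical. It only identifies a format. It cannot tell you whether the file is a well-formed example of that format, which takes a dedicated validation tool.

```python
from pathlib import Path

# Leading "magic" bytes for a few common formats.
SIGNATURES = {
    "pdf": b"%PDF-",
    "ole2": bytes.fromhex("d0cf11e0a1b11ae1"),  # legacy .doc/.xls/.ppt container
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}

# Hypothetical mapping from file extension to the signature we expect.
EXPECTED = {
    ".pdf": "pdf",
    ".doc": "ole2",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".png": "png",
}


def sniff_format(path):
    """Return the name of the first matching signature, or None."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    for name, magic in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def extension_is_honest(path):
    """True or False when the extension is known; None when we cannot tell."""
    expected = EXPECTED.get(Path(path).suffix.lower())
    if expected is None:
        return None
    return sniff_format(path) == expected
```

A file named `report.pdf` that sniffs as `ole2` is the “.doc with the wrong file extension” case; a file with no extension at all yields `None` from `extension_is_honest`, but `sniff_format` can still take a guess.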
File format registries and
• JHOVE: JSTOR/Harvard Object Validation Environment
• Java software intended to be pluggable into other
• Answers “What format is this thing?” and “Is this thing a
good example of the format?”
• Limited repertoire of formats
• PRONOM/DROID + GDFR = Unified Digital
Forgetting what you have
• Absolutely pernicious problem. We don’t know
what we have to begin with!
• Do you know how much Faculty Stuff is scattered
throughout your institution’s .edu domain? Me neither.
But I know it’s a lot. How much of that is irreplaceable?
• We’re also bad at labelling and tracking what
• No easy answer to this one; the solution lies in
a complete praxis reinvention.
• Yeah. Good luck with that.
... but I thought you meant
in libraries, Dorothea!
• Come on, we’ve solved that one: Metadata!
• Once it’s in the library, it’s probably ﬁne. The
real problem is all that Other Stuff Out There.
• This is a collection-development problem and
should be treated as one.
• Don’t dump it on some poor “digital preservation
librarian!” That flat out doesn’t scale.
• Don’t make the mistake of drawing thick lines around
“our stuff” and “their stuff.” Like it or not, our coll-dev
universe has moved beyond what’s published and what’s
What the stuff you have
• Collect whatever it takes to answer this
• If the owner of this material were hit by a bus tomorrow,
what would be needed for others to use it?
• Nasty discipline-specific problem.
• This is what the NARA/RLG Trusted Digital Repository
checklist is aiming at with “designated community.”
• Where NARA/RLG goes off the rails is assuming you have
to go through this exercise with EVERYTHING YOU HAVE.
• Data-dictionaries, algorithms, specifications, tech
metadata, whatever it takes. Use common sense!
Rights and DRM
• Not having IP rights to something may mean
you can’t preserve it.
• Brian Lavoie writes well about this problem.
• Copyright law and its exceptions haven’t caught up to the
• Third-party services (e.g. blogs, iTunes U, SlideShare) are a
• DRM means that no matter the rights
situation, you’re stuck.
• PDFs: Users turn on “security” features. This is DRM. Tell
them not to do that!
• Huge headache with third-party services, again.
... and other hassles
• Privacy, confidentiality, and human-subject
• Think “we’re the humanities; IRBs don’t happen to us”?
Think again. One word: FOLKLORE.
• Third-party copyright
• Campus musical or dramatic performances
• Issues of cultural sensitivity, heritage,
• You need a dark (or at least dim) archive if
you’re serious about digital preservation.
There is no way around this. Sorry.
• There is only one answer: POLICY.
• Unfortunately, it’s not a quick, easy, or
• Digital preservation costs money.
• People in high places are scared of it.
• It requires process and staff change.
• You have to make the case. And then make it
again. And again. Until they get it!
• Where I am, Somebody Else’s Problem fields are
everywhere around this issue.
You are probably the
of last resort.
Be prepared for anything
excluded from your policy
When organizations fail
• Remember Geocities? We’re worse.
• Mellon: Can’t make a list of its funded on-the-web
projects, because most of them are GONE. G-O-N-E.
• We do not, as a profession, have a safety net
for each other’s projects and materials.
• This is, frankly, unconscionable.
• I don’t know how to fix it; I am just warning
you that project rescues are and will continue
to be necessary.
• Institutional boundaries are a barrier here.
Great policy guidance
• Policy-making for research data in repositories:
• Practical data management: a legal and policy
• Australian, so take “legal” with a grain of salt
• Guide to social science data preparation and
Summary: the OAIS model
• “Reference model” for archival systems
• All theory, no praxis, by design. (Because praxis changes!)
• Four parts
• Data (and interaction) model
• Required responsibilities of an archive
• Recommended functions (in the computer-programming
sense) for carrying out those responsibilities
• My favorite distillation: Ockerbloom
For our purposes...
• We’re talking about the software.
• I’m not going to rant (much) about what IRs
are for or how they’re run.
• If you want that, read Roach Motel. Better yet, read
Palmer et al. 2009.
• We’re interested in the application (or lack
thereof) of IRs to data curation in the arts and
humanities. Right? Right.
• I’m not afraid of the technical, and you
shouldn’t be either.
The IR content use-case
• A research paper
• In a single file; possibly more than one format
• Is not related to any other item in the history
• The user can download it, and... um... just
download it, really.
How much of our stuff
does that work for?
• Image collections
• Page-scanned books (with or without OCR)
• Marked-up books
• Theses and dissertations
• Website preservation
• Audio and video
• Complex multimedia
• Database (linguistic, geographic...)
One metadata standard
does not fit all
• VRA Core
• MODS
• TEI Header
• Dublin Core
• ... the beat goes on.
• The simple fact is that EPrints and DSpace do
Dublin Core, METS, and nothing else natively.
This is purely inadequate for humanities data.
One file format does not
• Yes, we have to take what we get.
• With discrete files, most IR software is fine.
• Forget about streaming audio/video.
• DSpace is good with static websites.
• For other composite objects, you’re in trouble.
• For anything like a database, you’re in trouble.
The DSpace/EPrints view
of the universe
• Communities and collections
• must be given explicit permission to add or edit materials
• Metadata entry forms
• DSpace: fields configurable by collection
• EPrints: auto-configures fields based on content type
• Many permitted per item; must upload one by one in DSpace!
• Get friendly with the DSpace batch importer. You’ll need it.
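Getting friendly with the batch importer mostly means generating its input. The sketch below assumes the item-per-directory layout of DSpace's Simple Archive Format as commonly documented: each item directory holds a `dublin_core.xml` of `dcvalue` elements, a `contents` file naming the bitstreams, and the files themselves. Check the documentation for your DSpace version before relying on it. The function `write_item` and its argument shapes are hypothetical.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def write_item(archive_dir, item_name, metadata, files):
    """Write one item directory for batch import.

    metadata: list of (element, qualifier, value) tuples; qualifier may be None.
    files:    paths of the files (bitstreams) to attach to the item.
    """
    item_dir = Path(archive_dir) / item_name
    item_dir.mkdir(parents=True, exist_ok=True)

    # dublin_core.xml: one <dcvalue> per metadata statement.
    root = ET.Element("dublin_core")
    for element, qualifier, value in metadata:
        node = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        node.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy the files in, then list them one per line in "contents".
    names = []
    for source in files:
        source = Path(source)
        shutil.copy2(source, item_dir / source.name)
        names.append(source.name)
    (item_dir / "contents").write_text("\n".join(names) + "\n", encoding="utf-8")
    return item_dir
```

Looping `write_item` over a spreadsheet of metadata and a folder of files produces an archive directory that can be handed to the importer in one go, instead of uploading files one by one.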
The Fedora view of the
• You can do anything at all with anything at all
as long as you’re willing to tell Fedora how to
do it. Infinite flexibility! But also infinite
• “Content model:” what’s in this thing?
• “Service:” what should the user-interface do
with what’s in this thing?
• Metadata, relationships, stuff
Can you use Fedora for
• Yes, but not alone; you need all the Content
Models and Services bolted on top.
• Try Islandora or Muradora. Fez is a last resort; it
acts like DSpace, and this is not a good thing.
• Even if you can’t build a real Fedora digital
library now, you may not be able to do so in
future if you stick with DSpace...
• ... but the Fedora/DSpace merger may change
What is this FOXML
• Think of it as the Fedora batch-import format.
• It’s complex! But it can absorb any amount or
type of XML metadata or data, which is really
• Out-of-the-box IR software will handle some
A&H data-curation jobs adequately...
• ... but by no means all of them.
• If you need sophisticated UI, bite the bullet
and go with Fedora. Islandora and Muradora
make Fedora simpler for simple things than it
• If you don’t need sophisticated user-facing UI,
go with EPrints.
• DSpace is a loser choice.