Digital preservation
1. Digital preservation
for ongoing access
Presentation for Council July 2008
David Pearson
Manager, Digital Preservation Section
2. Overview
1. We have lots of “digital stuff” in our collections
and it is growing
2. We will lose access to it unless we take action
3. We need to manage the process of keeping it
accessible and usable
4. Solutions have to be scalable, reliable and
automated
3. 1. “Digital stuff”: many collections
• Pictures
• Oral History
• Manuscripts
• Historical Web sites
• Sheet music
• Newspapers
• Maps
• Ephemera
• Books
• Serial
4. How does it grow?
1. We collect it
– Physical carriers
– Online
• PANDORA web archive
• Australian web domain harvests
2. We create it
– Oral history interviews
– Photographs
– Publications
3. We convert it
– Digitise our collections
5. Web Archives
• Web sites are collected selectively
– Individually for access via PANDORA, or
– On a large scale via annual domain snapshots
• No control over content creation
• Lots of
– File formats
– Individual files (PANDORA ≈ 51 million, domain
harvest ≈ 1.3 billion files)
– Links
– Software (browser, plug-ins, readers)
• Internet content changes over time
8. Digitisation
• Around 135,000 items
digitised
• Newspaper project = 4
million pages by 2010
• Internally created so we
can control
– Standards
– File formats (e.g. TIFF,
JPEG, PDF )
– Metadata
– Workflows
• Issues
– Growing volume
9. Physical carriers
• Approx. 12,000 items – grows by
1,000 a year
Issues
• No control over creation
• Time lag before acquisition
• Variety of carriers (fragile) and file
formats
• Require various hardware, software,
operating systems, drivers to
access
• Labour intensive to process and
transfer to safe storage (growing
backlog)
11. Type of Digital Collections, 2008
• Pandora: 3%
• Maps: 2%
• Sheet Music: 4%
• Manuscripts: 2%
• Pictures: 7%
• Australian Web Harvest: 40%
• Oral History: 18%
• Other: 3%
• Historical Newspapers: 21%
12. Growth: compared to books
[Chart: Comparison of books collection & digital collection "book equivalents". Vertical axis: "Book Equivalents" (millions), 0.00 to 6.00. Horizontal axis: year end June, 2005 to 2008. Series: Digital Collection (20 MB "book equivalents") and Books Collection.]
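The "book equivalent" unit in the chart can be illustrated with a little arithmetic. The sketch below assumes that the chart's "20 mb" label means one book equivalent is 20 megabytes, and that a megabyte is 1,000,000 bytes; neither is stated explicitly on the slide. The function name and the sample volume are hypothetical and are not figures from the presentation.

```python
# Assumption: one "book equivalent" is 20 MB, with 1 MB = 1,000,000 bytes.
BYTES_PER_BOOK_EQUIVALENT = 20 * 1000 * 1000


def book_equivalents(total_bytes):
    """Convert a storage volume in bytes into 20 MB "book equivalents"."""
    return total_bytes / BYTES_PER_BOOK_EQUIVALENT


# Hypothetical volume, chosen only to show the conversion.
sample_bytes = 100 * 10**12  # 100 terabytes
print(book_equivalents(sample_bytes) / 1_000_000, "million book equivalents")
```

Expressing digital holdings this way puts them on the same scale as a count of physical books, which is what lets the chart compare the two collections directly.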
13. 2. Act or risk losing it
• “Digital stuff” is dependent on technology at all
stages
– Creation/capture
– Storage
– Access
• Technology changes rapidly, so software,
hardware, media, file formats and operating systems
become obsolete
• Unless managed, deterioration can occur rapidly,
e.g. data can be corrupted or lost in the storage or
transfer process
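One common way to detect corruption or loss in storage or transfer is a fixity check: record a checksum when a file is received, then recompute and compare it later. The sketch below is only an illustration under stated assumptions. The presentation does not say which method or algorithm is used; SHA-256 and the function names `sha256_of` and `verify_fixity` are choices made here for the example.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_digest):
    """Return True if the file still matches the digest recorded earlier."""
    return sha256_of(path) == expected_digest
```

In use, the digest would be stored alongside the file at ingest and `verify_fixity` run again after each transfer or on a regular schedule. A mismatch shows that the bytes have changed, though not why.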
15. 3. Managing to keep it
• “Not managing it” is not an option
• We need to
– Understand our “digital stuff” & associated risks
– Provide safe storage & ensure integrity
– Ensure access over time as technology
changes
– Develop & implement preservation workflows,
skills, standards, & strategies for ongoing
access
– Enable content to be shared and used in
different ways in the future
16. 4. Solutions and implications
• Large scale automated processes
• Original research & time to deliver the solutions
• Reasonably long lead times
• Audit processes and quality control monitoring
are critical
• Significant resources are required
17. Conclusions
• We are responsible for a lot of “digital stuff”
• If we simply collect and store it, it will become
unusable in a relatively short time as
technologies change
• Maintaining the ability to access it requires a
lot of good management, planning, &
dedicated resources
• We have to find and use solutions that can be
applied automatically and reliably to billions of
digital files