Pain points for preservation services / workflows in repositories

Paul Wheatley
SPRUCE Project Manager
University of Leeds

Twitter: @prwheatley
http://openplanetsfoundation.org/blogs/paul

What I’m going to talk about...

• What evidence we have for answering this question
• What we know about our approach to developing
preservation tools and services
• Practitioner needs captured by SPRUCE

• Suggestions

SPRUCE Project

Sustainable Preservation Using
Community Engagement

• JISC funded
• 2 years in length (until Nov 2013)
• £250k funding
• Concepts developed from AQuA Project
http://wiki.opf-labs.org/display/SPR

Channel this thought...

“Sharing best practice?
We don’t even share practice!”
Andrew N. Jackson
Curate Camp, Toronto, 2nd October 2012

Crude maturity model for DP TM

Evidence

Best Practice

Standards

Evidence based approach to DP – some examples

• What is actually happening to file formats over
time?

Formats over Time: Exploring UK Web History
Andrew N. Jackson

Evidence based approach to DP (2)

• Are our file format ID tools improving?

LDS3: applying digital preservation principals to
linked data systems
David Tarrant and Leslie Carr

Evidence based approach to DP (3)

• Even if our digital files aren’t “obsolete”, do they
render “correctly”?
Figure: Percentage of tested attributes where a change was observed in at least one file when rendered in a test environment
Original Software on Emulated Hardware: 24%
LibreOffice Writer: 86%
Corel WordPerfect X5: 76%
Microsoft Word 2007: 59%

Rendering Matters: Report on the results of research into digital object rendering
Euan Cochrane, Archives New Zealand

Suggestion 1: Evidence

• Evidence based approach
• Support the capture and sharing of evidence about
the problem and about the experiences of
performing preservation and curation

Approach to developing new preservation services

• Problems:
–Tools/services that don’t solve concrete problems
–Duplicated / wasted effort
–Insufficient re-use of existing
tools/technologies/approaches
–Tools that are difficult to maintain / re-use
• Reasons:
–Lack of practitioner/user driven development
–Requirements not shared
–Lack of awareness of existing work
–Poor development approaches / technology choices

Digital preservation costing initiatives

• LIFE 1, 2 and 3: Projects to explore digital preservation costing and develop costing models
• Cost Model for Digital Preservation (CMDP): Project at the Royal Danish Library and the Danish National Archives to develop a new cost model. Currently covers Planning, Migrations and Ingest
• Keeping Research Data Safe 1 and 2 (KRDS): Cost model and benefits analysis for preserving research data
• Presto Prime cost model for digital storage
• Cost Estimation Toolkit (CET): Data centre costing model and toolkit, from NASA Goddard
• Cost Model for Small Scale Automated Digital Preservation Archives (Strodl and Rauber)
• APARSEN Project activity focused on digital preservation costing
• EPSRC and JISC study on Cost analysis of cloud computing for research
• Cost forecasting model for new digitization projects (Excel and web tool under development) (Karim Boughida, Martha Whittaker, Linda Colet, Dan Chudnov)
• DP4lib business and cost model for a digital preservation service
• DANS Costs of Digital Archiving Volume 2 Project, focusing on preservation and dissemination of research datasets
• Blue Ribbon Task Force on Sustainable Digital Preservation and Access
• Economic Sustainability Reference Model
• ENSURE Project - Enabling kNowledge Sustainability Usability and Recovery for Economic value
• Cost Model for Electronic Health Records (Bote, Fernandez-Feijoo, and Ruiz)
• 4C: EU funded project on costing. Due to commence in 2012. Led by JISC
• http://wiki.opf-labs.org/display/CDP/Home
• An extended blog-rant on why this typifies a big #fail for our community

In summary...

“10 Years on we are still pretty much talking about
the same things...
...Tools like DROID and PRONOM etc. didn’t work
properly then, and they still don’t work properly
now.”
Steve Knight, New Zealand National Library, iPRES2012
(blogged by Inge Angevaare: How are we doing as a community?)

Suggestion 2: Development Approach
• Make it practitioner/user led
– Solve concrete problems
• Re-use, don't re-invent the wheel
– Most problems have already been solved, although often not by this community
– Re-use existing code where possible
• Keep it small, keep it simple
– Functional preservation tools should be atomic
– Modularise in the face of growing requirements
– Ensure results can be exploited and integrated with other orchestration/repository
platforms
• Make it easy to use, build on, re-purpose and ultimately, maintain
– Share your source
– Automate your build
– Package for easy install
• Share outputs, exchange knowledge, learn from each other
– Write up dev and user experiences and share them
– Publish data about usage
– Shout about it, blog it, tweet it, and add it to a tool/service registry (or three)
• Adapted from: the SPRUCE Mashup Manifesto

The SPRUCE Mashup

Identify and Solve concrete problems
• 3 day workshop for ~30 people
• Practitioners bring along digital
collections
• We identify preservation challenges
• Pair up practitioners with technical
experts
• Apply existing open source tools to
solve the problems
• In doing so, we exchange knowledge
about digital preservation
• Develop a supportive community
(Photo: Glasgow Mashup, April 2012)

Mashup results: Datasets, Issues and Solutions

• Capture information about our work on a wiki:
–http://bit.ly/spruce-results
• Datasets: about the content, some context
• Issues: what the challenge or problem is
• Solutions: an experience of solving the Issue
• Begin as triples, build into networks
• Most sourced from mashup events, some from the
SCAPE Project (these are preceded with Isxx or Soxx codes)
• All are from “practitioners” or “problem owners”
• Useful to separate problem owner from solution
provider
• Understand the problem before diving into the solution

Suggestion 3: preservation challenges to target

• Key themes drawn from the practitioner sourced
preservation issues collected
• It gives an impression of what practitioners are facing
• Capture some of that “practice” (i.e. What we learnt about
trying out approach W, with tool X, on data Y, in situation Z)

• But, a word of caution...
• Not a scientific approach to assessing user needs
• Steered to an extent by the shape of the events and the
time available
• First 2 mashup events had some focus on QA
– The following 3 events had an “open” focus on any DP related topic

Theme 1: Quality Assurance

• The problem:
–Some have broken data
–Some have suspected broken data
–Some intend to process data in some way, but are
concerned that they cannot check that the process
doesn't break the data
• The solution:
–Cross section of automated QA approaches required.
• How do we spot the flaws automatically?
• How do we fix them automatically?
• Often involves cross checking (e.g. data to metadata)
• Sometimes explorative. What actually caused the problem, how
do we prevent it?
• Every case feels unique, but often strikes a chord more widely

Jpylyzer example

• New characterisation + validation tool for JPEG2000
–Development and operation driven by use cases
• E.g. JP2 is used at scale in mass digitisation efforts, truncation is
a common potential problem, yet existing tools didn’t check the (quite
complex) end of file conditions
• Flawed creation tools omit critical metadata in created files
• Examples of broken files enabled testing!
–Also enables validation against a profile, enables
automated QA of content from external suppliers
• More information:
–http://openplanetsfoundation.org/software/jpylyzer
–Also see page on JP2 preservation risks:
–http://wiki.opf-labs.org/display/TR/JP2
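
To illustrate the truncation problem described above, here is a minimal sketch in Python. It is not how jpylyzer works: it only checks whether a file ends with the JPEG 2000 end-of-codestream (EOC) marker, 0xFFD9, and assumes the codestream is the last thing in the file, which is common but not guaranteed. The function name `ends_with_eoc` is made up for this example. A real validator checks far more than this.

```python
import os

# JPEG 2000 end-of-codestream marker (EOC).
EOC_MARKER = b"\xff\xd9"


def ends_with_eoc(path):
    """Return True if the last two bytes of the file are the EOC marker.

    Assumption: the codestream is the final item in the file. A file
    that fails this check is worth a closer look with a real validator.
    """
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() < len(EOC_MARKER):
            return False  # too short to hold the marker at all
        handle.seek(-len(EOC_MARKER), os.SEEK_END)
        return handle.read(len(EOC_MARKER)) == EOC_MARKER
```

Run over a delivery from a digitisation supplier, a check like this flags candidates for truncation cheaply, before heavier validation is applied.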

Theme 2: Appraisal + Ingest preparation

• The problem:
–We have stuff, what is it, what should I worry about, what
do I do next?
–We know roughly what we've got (we've had some
before) but we have a largely manual appraisal process
that doesn't scale well
–How do we turn this blob of content into something we
can ingest into our repository?
• The solution:
–Characterisation capability needs to vastly improve
–Automatic extraction of properties / flavour of content to
aid appraisal/selection
–Inform processing of data prior to ingest
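
As a rough illustration of the "we have stuff, what is it?" step, the sketch below profiles a directory by a handful of well-known file signatures. It is a toy, not a replacement for a signature registry: the `SIGNATURES` list and the function names `identify` and `profile` are assumptions made for this example, and only a few formats are covered.

```python
from collections import Counter
from pathlib import Path

# A few well-known leading byte signatures. Illustrative only.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JP2"),
]


def identify(path):
    """Return a format label based on the first bytes of the file."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def profile(root):
    """Count files under root by identified format."""
    return Counter(identify(p) for p in Path(root).rglob("*") if p.is_file())
```

A summary like this gives an appraiser a first impression of a blob of content, and the size of the "unknown" bucket shows how much characterisation work remains.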

Theme 3: Identify/locate preservation worthy data

• The problem:
– Institution has preservation worthy data scattered across shared server space
– Data is unmanaged, not checksummed, often doesn’t have a responsible owner
– Sorting this data from non-preservation worthy data is a challenge
• The solution:
– Find it
• Tools/approaches to “smell” preservation worthy data
– Make it safe
• Checksumming, creating manifests, registering basic details with a central authority with
preservation responsibility, periodically recalculating checksums. All components are there but not in
a usable package
– Get it ready to ingest
• De-duplication, curation, management, add metadata, other ingest preparation

– Notes: not recorded specifically by SPRUCE but is a superset of theme 2.
Almost universally (anecdotally) acknowledged as a big problem.
– Not a problem people like to talk about in public forums
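
The "make it safe" step above can be sketched in a few lines of Python. This is one possible shape, not a packaged tool: the function names `sha256_of`, `build_manifest` and `verify_manifest` are made up for this example, SHA-256 is an assumed choice of algorithm, and registering details with a central authority is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Recalculate checksums and report differences from the manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        "new": sorted(set(current) - set(manifest)),
    }
```

Building the manifest once and running the verification on a schedule covers the periodic recalculation mentioned above. The manifest itself needs to be stored somewhere other than the shared server space it describes.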

Theme 4: Conformance to institutional profile/policy

• The problem:
–Institution has policy driven requirements for the shape of
its content, defined by specific profiles
–Does data conform to these profiles?
–If not (in some cases), can it be made to conform?
• The solution:
–Conformance checking focused characterisation and
validation
–Modification of content + associated QA

–Notes (personal opinion warning):
• Some specific cases where this is a good idea (eg. Digitisation)
• Many cases where this is not a good idea

Theme 5: Identify preservation risks

• The problem:
– Data is in the repository, what risks does it face?
– Some worry about whether they should be migrating their content
– Some specifically want to format migrate and want help doing it
– Root of problem is: what are the risks?
– Risks themselves not well understood
– Woeful tool provision to assist in automated risk assessment
• The solution:
– Tools/approaches for identifying specific preservation risks in digital
data
– Logical progression is then for planning, action and QA

Theme 6: Long tail of issues

• The problem:
–Rights
–Structural issues
–Contextual issues
–Data capture / harvesting
–Data integrity
–Planning
–........

Preservation risks: PDF example

• Common format, particularly in IRs
• Reasonable understanding of the risks (e.g. non-embedded
fonts, password protection / encryption)
• Lots of tools that do part of the job
• No simple, straightforward, automated tool to
identify PDFs with clear preservation risks

• Popular tools that provide potentially misleading
information
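
To make one of the risks above concrete, here is a deliberately crude Python sketch that flags PDFs which may be encrypted. It only scans the raw bytes for the `/Encrypt` key and for the `%PDF-` header, so it can give both false positives and false negatives, and it says nothing about font embedding. The function name `pdf_risk_flags` is made up for this example. It is the kind of partial check the slide describes, not the missing tool.

```python
def pdf_risk_flags(path):
    """Return a list of coarse warning flags for a PDF file.

    Assumption: a byte-level scan is good enough for a first triage.
    A proper check needs a real PDF parser.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    flags = []
    if not data.startswith(b"%PDF-"):
        flags.append("no PDF header")
    # Encrypted PDFs reference an encryption dictionary via /Encrypt.
    if b"/Encrypt" in data:
        flags.append("possible encryption")
    return flags
```

Any file that comes back with a flag still needs inspection with a tool that actually parses the PDF structure.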

Summary

• Capture and share more evidence about the
problem and about our needs, as we move forward
• Apply resources in a careful manner. Don’t reinvent
the wheel
• Consider tackling these areas:
–Quality Assurance
–Appraisal and ingest preparation
–Identify/locate preservation worthy data
–Conformance to profiles/policy
–Identify preservation risks

Thanks for listening! Any questions?

Paul Wheatley
SPRUCE Project Manager
University of Leeds

Twitter: @prwheatley
Email: p.r.wheatley@leeds.ac.uk
http://openplanetsfoundation.org/blogs/paul
