Pain points for preservation services / workflows in repositories
1. Pain points for preservation services / workflows in repositories
Paul Wheatley
SPRUCE Project Manager
University of Leeds
Twitter: @prwheatley
http://openplanetsfoundation.org/blogs/paul
2. What I’m going to talk about...
• What evidence we have for answering this question
• What we know about our approach to developing
preservation tools and services
• Practitioner needs captured by SPRUCE
• Suggestions
3. SPRUCE Project
Sustainable Preservation Using
Community Engagement
• JISC funded
• 2 years in length (until Nov 2013)
• £250k funding
• Concepts developed from AQuA Project
http://wiki.opf-labs.org/display/SPR
4. Channel this thought...
“Sharing best practice?
We don’t even share practice!”
Andrew N. Jackson
Curate Camp, Toronto, 2nd October 2012
6. Evidence based approach to DP – some examples
• What is actually happening to file formats over
time?
Formats over Time: Exploring UK Web History
Andrew N. Jackson
7. Evidence based approach to DP (2)
• Are our file format ID tools improving?
LDS3: applying digital preservation principals to
linked data systems
David Tarrant and Leslie Carr
8. Evidence based approach to DP (3)
• Even if our digital files aren’t “obsolete”, do they
render “correctly”?
Chart: Percentage of tested attributes where a change was observed in at least one file when rendered in a test environment
Tested environment: percentage of attributes changed
Original Software on Emulated Hardware: 24%
LibreOffice Writer: 86%
Corel WordPerfect X5: 76%
Microsoft Word 2007: 59%
Rendering Matters: Report on the results of research into digital object rendering
Euan Cochrane, Archives New Zealand
9. Suggestion 1: Evidence
• Evidence-based approach
• Support the capture and sharing of evidence about
the problem and about the experiences of
performing preservation and curation
10. Approach to developing new preservation services
• Problems:
–Tools/services that don’t solve concrete problems
–Duplicated / wasted effort
–Insufficient re-use of existing
tools/technologies/approaches
–Tools that are difficult to maintain / re-use
• Reasons:
–Lack of practitioner/user driven development
–Requirements not shared
–Lack of awareness of existing work
–Poor development approaches / technology choices
11. Digital preservation costing initiatives
LIFE 1, 2 and 3. Projects to explore digital preservation costing, and develop costing models.
Cost Model for Digital Preservation (CMDP): Project at the Royal Danish Library and the
Danish National Archives to develop a new cost model. Currently covers Planning, Migrations
and Ingest
Keeping Research Data Safe 1 and 2 (KRDS): Cost model and benefits analysis for preserving
research data
Presto Prime cost model for digital storage
Cost Estimation Toolkit (CET): Data centre costing model and toolkit, from NASA Goddard
Cost Model for Small Scale Automated Digital Preservation Archives (Strodl and Rauber)
APARSEN Project activity focused on digital preservation costing
EPSRC and JISC study on Cost analysis of cloud computing for research
Cost forecasting model for new digitization projects (Excel and web tool under development)
(Karim Boughida, Martha Whittaker, Linda Colet, Dan Chudnov)
DP4lib business and cost model for a digital preservation service
DANS Costs of Digital Archiving Volume 2 Project, focusing on preservation and
dissemination of research datasets
Blue Ribbon Task Force on Sustainable Digital Preservation and Access
Economic Sustainability Reference Model
ENSURE Project - Enabling kNowledge Sustainability Usability and Recovery for Economic
value
Cost Model for Electronic Health Records (Bote, Fernandez-Feijoo, and Ruiz)
4C. EU funded project on costing. Due to commence in 2012. Led by JISC
http://wiki.opf-labs.org/display/CDP/Home
An extended blog-rant on why this typifies a big #fail for our community
12. In summary...
“10 Years on we are still pretty much talking about
the same things...
...Tools like DROID and PRONOM etc. didn’t work
properly then, and they still don’t work properly
now."
Steve Knight, New Zealand National Library, iPRES2012
(blogged by Inge Angevaare: How are we doing as a community?)
13. Suggestion 2: Development Approach
• Make it practitioner/user led
– Solve concrete problems
• Re-use, don't re-invent the wheel
– Most problems have already been solved, although often not by this community
– Re-use existing code where possible
• Keep it small, keep it simple
– Functional preservation tools should be atomic
– Modularise in the face of growing requirements
– Ensure results can be exploited and integrated with other orchestration/repository
platforms
• Make it easy to use, build on, re-purpose and ultimately, maintain
– Share your source
– Automate your build
– Package for easy install
• Share outputs, exchange knowledge, learn from each other
– Write up dev and user experiences and share them
– Publish data about usage
– Shout about it, blog it, tweet it, and add it to tool/service registry (or three)
• Adapted from: the SPRUCE Mashup Manifesto
14. The SPRUCE Mashup
Identify and Solve concrete problems
• 3 day workshop for ~30 people
• Practitioners bring along digital
collections
• We identify preservation challenges
• Pair up practitioners with technical
experts
• Apply existing open source tools to
solve the problems
• In doing so, we exchange knowledge
about digital preservation
(Photo: Glasgow Mashup, April 2012)
• Develop a supportive community
15. Mashup results: Datasets, Issues and Solutions
• Capture information about our work on a wiki:
–http://bit.ly/spruce-results
• Datasets: about the content, some context
• Issues: what the challenge or problem is
• Solutions: an experience of solving the Issue
• Begin as triples, build into networks
• Most sourced from mashup events, some from the
SCAPE Project (these are preceded with Isxx or Soxx codes)
• All are from “practitioners” or “problem owners”
• Useful to separate problem owner from solution
provider
• Understand the problem before diving into the solution
16. Suggestion 3: preservation challenges to target
• Key themes drawn from the practitioner sourced
preservation issues collected
• It gives an impression of what practitioners are facing
• Capture some of that “practice” (i.e. What we learnt about
trying out approach W, with tool X, on data Y, in situation Z)
• But, a word of caution...
• Not a scientific approach to assessing user needs
• Steered to an extent by the shape of the events and the
time available
• First 2 mashup events had some focus on QA
– The following 3 events had an “open” focus on any DP related topic
17. Theme 1: Quality Assurance
• The problem:
–Some have broken data
–Some have suspected broken data
–Some intend to process data in some way, but are
concerned that they cannot check that the process
doesn't break the data
• The solution:
–Cross section of automated QA approaches required.
• How do we spot the flaws automatically?
• How do we fix them automatically?
• Often involves cross checking (eg. Data to metadata)
• Sometimes explorative. What actually caused the problem, how
do we prevent it?
• Every case feels unique, but often strikes a chord more widely
18. Jpylyzer example
• New characterisation + validation tool for JPEG2000
–Development and operation driven by use cases
• E.g. JP2 is used at scale in mass digitisation efforts, truncation is a
common potential problem, yet existing tools didn’t check the (quite
complex) end of file conditions
• Flawed creation tools omit critical metadata in created files
• Examples of broken files enabled testing!
–Also enables validation against a profile, enables
automated QA of content from external suppliers
• More information:
–http://openplanetsfoundation.org/software/jpylyzer
–Also see page on JP2 preservation risks:
–http://wiki.opf-labs.org/display/TR/JP2
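To make the truncation use case above concrete, here is a rough sketch in Python. It is not how jpylyzer works and is far less thorough: it only tests for the fixed 12-byte JP2 signature box at the start of the file and for the JPEG 2000 end-of-codestream (EOC) marker, 0xFFD9, at the very end. The function names are hypothetical, and the sketch assumes the codestream is the last thing in the file, which is common but not required by the format.

```python
from pathlib import Path
from typing import List

# Fixed 12-byte JP2 signature box that opens a JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")
# End-of-codestream (EOC) marker that closes a JPEG 2000 codestream.
EOC_MARKER = b"\xff\xd9"


def quick_jp2_check(data: bytes) -> List[str]:
    """Return suspected problems; an empty list means none were spotted."""
    problems = []
    if not data.startswith(JP2_SIGNATURE):
        problems.append("missing JP2 signature box")
    # Assumption: the codestream box is the final box in the file.
    if not data.endswith(EOC_MARKER):
        problems.append("no EOC marker at end of file (possible truncation)")
    return problems


def check_file(path: str) -> List[str]:
    """Convenience wrapper (hypothetical name) that reads a file from disk."""
    return quick_jp2_check(Path(path).read_bytes())
```

A file that passes this check may still be broken in other ways; full validation needs a dedicated tool such as jpylyzer.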
19. Theme 2: Appraisal + Ingest preparation
• The problem:
–We have stuff, what is it, what should I worry about, what
do I do next?
–We know roughly what we've got (we've had some
before) but we have a largely manual appraisal process
that doesn't scale well
–How do we turn this blob of content into something we
can ingest into our repository?
• The solution:
–Characterisation capability needs to vastly improve
–Automatic extraction of properties / flavour of content to
aid appraisal/selection
–Inform processing of data prior to ingest
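As a small illustration of "what is it?" for a blob of content, the sketch below walks a directory and counts files by a handful of well-known leading byte signatures. The signature list and function names are assumptions made for this example only. Real identification tools use much larger signature registries, and this sketch makes no attempt to match them.

```python
from collections import Counter
from pathlib import Path

# A few well-known leading byte signatures; illustrative, not exhaustive.
MAGIC = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP"),  # also the container for DOCX, ODT, EPUB, etc.
]


def guess_format(path: Path) -> str:
    """Guess a format label from the first bytes of a file."""
    with path.open("rb") as handle:
        head = handle.read(16)
    for signature, label in MAGIC:
        if head.startswith(signature):
            return label
    return "unknown"


def profile_tree(root: str) -> Counter:
    """Count the files under root by guessed format."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[guess_format(path)] += 1
    return counts
```

Calling profile_tree on a deposit directory gives a first impression of the mix of content, including how much of it falls into "unknown" and needs a closer look before ingest.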
20. Theme 3: Identify/locate preservation worthy data
• The problem:
– Institution has preservation worthy data scattered across shared server space
– Data is unmanaged, not checksummed, often doesn’t have a responsible owner
– Sorting this data from non-preservation worthy data is a challenge
• The solution:
– Find it
• Tools/approaches to “smell” preservation worthy data
– Make it safe
• Checksumming, creating manifests, registering basic details with a central authority with
preservation responsibility, periodically recalculating checksums. All the components exist, but not in
a usable package
– Get it ready to ingest
• De-duplication, curation, management, add metadata, other ingest preparation
– Notes: not recorded specifically by SPRUCE but is a superset of theme 2.
Almost universally (anecdotally) acknowledged as a big problem.
– Not a problem people like to talk about in public forums
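The "make it safe" step above can be sketched with the Python standard library alone. The example below builds a SHA-256 manifest for a directory and later compares the directory against that manifest. The function names and the shape of the report are assumptions for illustration. Registering the manifest with a central authority and scheduling the periodic re-check are left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    base = Path(root)
    return {
        path.relative_to(base).as_posix(): sha256_of(path)
        for path in sorted(base.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: str, manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Recalculate checksums and report missing, new and changed files."""
    current = build_manifest(root)
    shared = manifest.keys() & current.keys()
    return {
        "missing": sorted(set(manifest) - set(current)),
        "new": sorted(set(current) - set(manifest)),
        "changed": sorted(n for n in shared if manifest[n] != current[n]),
    }
```

Storing the manifest somewhere other than the shared server space it describes matters, since a manifest lost or altered along with the data cannot detect anything.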
21. Theme 4: Conformance to institutional profile/policy
• The problem:
–Institution has policy driven requirements for the shape of
its content, defined by specific profiles
–Does data conform to these profiles?
–If not (in some cases), can it be made to conform?
• The solution:
–Conformance checking focused characterisation and
validation
–Modification of content + associated QA
–Notes (personal opinion warning):
• Some specific cases where this is a good idea (eg. Digitisation)
• Many cases where this is not a good idea
22. Theme 5: Identify preservation risks
• The problem:
– Data is in the repository, what risks does it face?
– Some worry about whether they should be migrating their content
– Some specifically want to format migrate and want help doing it
– Root of problem is: what are the risks?
– Risks themselves not well understood
– Woeful tool provision to assist in automated risk assessment
• The solution:
– Tools/approaches for identifying specific preservation risks in digital
data
– Logical progression is then for planning, action and QA
23. Theme 6: Long tail of issues
• The problem:
–Rights
–Structural issues
–Contextual issues
–Data capture / harvesting
–Data integrity
–Planning
–........
24. Preservation risks: PDF example
• Common format, particularly in IRs
• Reasonable understanding of the risks (eg. Non
embedded fonts, password protection / encryption)
• Lots of tools that do part of the job
• No simple, straightforward, automated tool to
identify PDFs with clear preservation risks
• Popular tools that provide potentially misleading
information
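To show how crude a first pass at one of these risks can be, the sketch below scans the raw bytes of a PDF for an /Encrypt entry, which a PDF carries when it is encrypted or password protected. This is a heuristic under stated assumptions, not the simple automated tool called for above: the function name is hypothetical, a byte scan can produce false positives, and detecting non-embedded fonts needs a real PDF parser, so it is not attempted here.

```python
import re
from typing import List

# Matches the /Encrypt key but not longer names such as /EncryptMetadata.
ENCRYPT_KEY = re.compile(rb"/Encrypt\b")


def pdf_risk_flags(data: bytes) -> List[str]:
    """Return heuristic risk flags for the raw bytes of a PDF file."""
    flags = []
    # Assumption: the header sits at the very start of the file.
    if not data.startswith(b"%PDF-"):
        flags.append("no %PDF- header at start of file")
    if ENCRYPT_KEY.search(data):
        flags.append("contains an /Encrypt entry (possible encryption)")
    return flags
```

A flagged file is a candidate for closer inspection with a proper PDF tool, not a confirmed problem.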
25. Summary
• Capture and share more evidence about the
problem and about our needs, as we move forward
• Apply resources in a careful manner. Don’t reinvent
the wheel
• Consider tackling these areas:
–Quality Assurance
–Appraisal and ingest preparation
–Identify/locate preservation worthy data
–Conformance to profiles/policy
–Identify preservation risks
26. Thanks for listening! Any questions?
Paul Wheatley
SPRUCE Project Manager
University of Leeds
Twitter: @prwheatley
Email: p.r.wheatley@leeds.ac.uk
http://openplanetsfoundation.org/blogs/paul