
Pain points for preservation services / workflows in repositories





  1. Pain points for preservation services / workflows in repositories
     Paul Wheatley, SPRUCE Project Manager, University of Leeds
     Twitter: @prwheatley
  2. What I’m going to talk about...
     • What evidence we have for answering this question
     • What we know about our approach to developing preservation tools and services
     • Practitioner needs captured by SPRUCE
     • Suggestions
  3. SPRUCE Project: Sustainable Preservation Using Community Engagement
     • JISC funded
     • 2 years in length (until Nov 2013)
     • £250k funding
     • Concepts developed from AQuA Project
  4. Channel this thought...
     “Sharing best practice? We don’t even share practice!”
     Andrew N. Jackson, Curate Camp, Toronto, 2nd October 2012
  5. Crude maturity model for DP TM
     Diagram labels: Evidence, Best Practice, Standards
  6. Evidence based approach to DP – some examples
     • What is actually happening to file formats over time?
     Formats over Time: Exploring UK Web History, Andrew N. Jackson
  7. Evidence based approach to DP (2)
     • Are our file format ID tools improving?
     LDS3: applying digital preservation principles to linked data systems, David Tarrant and Leslie Carr
  8. Evidence based approach to DP (3)
     • Even if our digital files aren’t “obsolete”, do they render “correctly”?
     Chart: Percentage of tested attributes where a change was observed in at least one file when rendered in a test environment
       Original Software on Emulated Hardware: 24%
       LibreOffice Writer: 86%
       Corel WordPerfect X5: 76%
       Microsoft Word 2007: 59%
     Source: Rendering Matters: Report on the results of research into digital object rendering, Euan Cochrane, Archives New Zealand
  9. Suggestion 1: Evidence
     • Evidence based approach
     • Support the capture and sharing of evidence about the problem and about the experiences of performing preservation and curation
  10. Approach to developing new preservation services
      • Problems:
        – Tools/services that don’t solve concrete problems
        – Duplicated / wasted effort
        – Insufficient re-use of existing tools/technologies/approaches
        – Tools that are difficult to maintain / re-use
      • Reasons:
        – Lack of practitioner/user driven development
        – Requirements not shared
        – Lack of awareness of existing work
        – Poor development approaches / technology choices
  11. Digital preservation costing initiatives
      • LIFE 1, 2 and 3: Projects to explore digital preservation costing, and develop costing models
      • Cost Model for Digital Preservation (CMDP): Project at the Royal Danish Library and the Danish National Archives to develop a new cost model. Currently covers Planning, Migrations and Ingest
      • Keeping Research Data Safe 1 and 2 (KRDS): Cost model and benefits analysis for preserving research data
      • Presto Prime cost model for digital storage
      • Cost Estimation Toolkit (CET): Data centre costing model and toolkit, from NASA Goddard
      • Cost Model for Small Scale Automated Digital Preservation Archives (Strodl and Rauber)
      • APARSEN Project activity focused on digital preservation costing
      • EPSRC and JISC study on Cost analysis of cloud computing for research
      • Cost forecasting model for new digitization projects (Excel and web tool under development) (Karim Boughida, Martha Whittaker, Linda Colet, Dan Chudnov)
      • DP4lib business and cost model for a digital preservation service
      • DANS Costs of Digital Archiving Volume 2 Project, focusing on preservation and dissemination of research datasets
      • Blue Ribbon Task Force on Sustainable Digital Preservation and Access
      • Economic Sustainability Reference Model
      • ENSURE Project - Enabling kNowledge Sustainability Usability and Recovery for Economic value
      • Cost Model for Electronic Health Records (Bote, Fernandez-Feijoo, and Ruiz)
      • 4C: EU funded project on costing. Due to commence in 2012. Led by JISC
      An extended blog-rant on why this typifies a big #fail for our community
  12. In summary...
      “10 Years on we are still pretty much talking about the same things... Tools like DROID and PRONOM etc. didn’t work properly then, and they still don’t work properly now.”
      Steve Knight, New Zealand National Library, iPRES2012 (blogged by Inge Angevaare: How are we doing as a community?)
  13. Suggestion 2: Development Approach
      • Make it practitioner/user led
        – Solve concrete problems
      • Re-use, don’t re-invent the wheel
        – Most problems have already been solved, although often not by this community
        – Re-use existing code where possible
      • Keep it small, keep it simple
        – Functional preservation tools should be atomic
        – Modularise in the face of growing requirements
        – Ensure results can be exploited and integrated with other orchestration/repository platforms
      • Make it easy to use, build on, re-purpose and ultimately, maintain
        – Share your source
        – Automate your build
        – Package for easy install
      • Share outputs, exchange knowledge, learn from each other
        – Write up dev and user experiences and share them
        – Publish data about usage
        – Shout about it, blog it, tweet it, and add it to a tool/service registry (or three)
      • Adapted from: the SPRUCE Mashup Manifesto
  14. The SPRUCE Mashup: Identify and solve concrete problems
      • 3 day workshop for ~30 people
      • Practitioners bring along digital collections
      • We identify preservation challenges
      • Pair up practitioners with technical experts
      • Apply existing open source tools to solve the problems
      • In doing so, we exchange knowledge about digital preservation
      • Develop a supportive community
      Photo caption: Glasgow Mashup, April 2012
  15. Mashup results: Datasets, Issues and Solutions
      • Capture information about our work on a wiki:
        –
      • Datasets: about the content, some context
      • Issues: what the challenge or problem is
      • Solutions: an experience of solving the Issue
      • Begin as triples, build into networks
      • Most sourced from mashup events, some from the SCAPE Project (these are preceded with Isxx or Soxx codes)
      • All are from “practitioners” or “problem owners”
      • Useful to separate problem owner from solution provider
      • Understand the problem before diving into the solution
  16. Suggestion 3: preservation challenges to target
      • Key themes drawn from the practitioner sourced preservation issues collected
      • It gives an impression of what practitioners are facing
      • Capture some of that “practice” (i.e. what we learnt about trying out approach W, with tool X, on data Y, in situation Z)
      • But, a word of caution...
      • Not a scientific approach to assessing user needs
      • Steered to an extent by the shape of the events and the time available
      • First 2 mashup events had some focus on QA
        – The following 3 events had an “open” focus on any DP related topic
  17. Theme 1: Quality Assurance
      • The problem:
        – Some have broken data
        – Some have suspected broken data
        – Some intend to process data in some way, but are concerned that they cannot check whether the process breaks the data
      • The solution:
        – Cross section of automated QA approaches required.
          • How do we spot the flaws automatically?
          • How do we fix them automatically?
          • Often involves cross checking (e.g. data to metadata)
          • Sometimes explorative. What actually caused the problem, how do we prevent it?
          • Every case feels unique, but often strikes a chord more widely
  18. Jpylyzer example
      • New characterisation + validation tool for JPEG2000
        – Development and operation driven by use cases
          • E.g. JP2 used at scale in mass digitisation efforts; truncation is a common potential problem, yet existing tools didn’t check (quite complex) end of file conditions
          • Flawed creation tools omit critical metadata in created files
          • Examples of broken files enabled testing!
        – Also enables validation against a profile, enables automated QA of content from external suppliers
      • More information:
        –
        – Also see page on JP2 preservation risks:
        –
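The truncation problem mentioned in slide 18 can be illustrated with a deliberately crude sketch. It is not how jpylyzer works (jpylyzer parses the full box structure and codestream); it only checks two well-known byte patterns: the 12-byte JP2 signature box at the start of the file and the JPEG 2000 end-of-codestream (EOC) marker `0xFFD9` at the end. The function names are hypothetical, and the check assumes the codestream is the last thing in the file, which is common but not guaranteed.

```python
from pathlib import Path

# 12-byte JP2 signature box: length 12, type "jP  ", content 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")
# End-of-codestream marker that closes a JPEG 2000 codestream.
EOC_MARKER = b"\xff\xd9"


def quick_jp2_check(data: bytes) -> dict:
    """Return two crude indicators; a False value is a reason to look closer."""
    return {
        "has_jp2_signature": data.startswith(JP2_SIGNATURE),
        # Assumption: the codestream is the final element of the file.
        "ends_with_eoc": data.endswith(EOC_MARKER),
    }


def quick_jp2_check_file(path) -> dict:
    """Convenience wrapper (hypothetical helper) that reads a file from disk."""
    return quick_jp2_check(Path(path).read_bytes())
```

A file that fails `ends_with_eoc` is only a candidate for truncation; a full validator is still needed to confirm it.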
  19. Theme 2: Appraisal + Ingest preparation
      • The problem:
        – We have stuff, what is it, what should I worry about, what do I do next?
        – We know roughly what we’ve got (we’ve had some before) but we have a largely manual appraisal process that doesn’t scale well
        – How do we turn this blob of content into something we can ingest into our repository?
      • The solution:
        – Characterisation capability needs to vastly improve
        – Automatic extraction of properties / flavour of content to aid appraisal/selection
        – Inform processing of data prior to ingest
  20. Theme 3: Identify/locate preservation worthy data
      • The problem:
        – Institution has preservation worthy data scattered across shared server space
        – Data is unmanaged, not checksummed, often doesn’t have a responsible owner
        – Sorting this data from non-preservation worthy data is a challenge
      • The solution:
        – Find it
          • Tools/approaches to “smell” preservation worthy data
        – Make it safe
          • Checksumming, creating manifests, registering basic details with a central authority with preservation responsibility, periodically recalculating checksums. All components are there but not in a usable package
        – Get it ready to ingest
          • De-duplication, curation, management, add metadata, other ingest preparation
        – Notes: not recorded specifically by SPRUCE but is a superset of theme 2. Almost universally (anecdotally) acknowledged as a big problem.
        – Not a problem people like to talk about in public forums
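The “make it safe” step in slide 20 (checksum, write a manifest, recalculate later) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a tool from the talk: the function names are hypothetical, SHA-256 is an arbitrary choice of algorithm, and the manifest is kept as an in-memory dictionary rather than registered with any central authority.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Recalculate checksums; report files that are missing or have changed."""
    root = Path(root)
    problems = {}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            problems[relative_path] = "missing"
        elif sha256_of(target) != expected:
            problems[relative_path] = "changed"
    return problems
```

Running `verify_manifest` on a schedule against a stored manifest is one way to cover the periodic recalculation; an empty result means every listed file is still present and unchanged. Files added since the manifest was built are not reported by this sketch.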
  21. Theme 4: Conformance to institutional profile/policy
      • The problem:
        – Institution has policy driven requirements for the shape of its content, defined by specific profiles
        – Does data conform to these profiles?
        – If not (in some cases), can it be made to conform?
      • The solution:
        – Conformance checking focused characterisation and validation
        – Modification of content + associated QA
        – Notes (personal opinion warning):
          • Some specific cases where this is a good idea (e.g. digitisation)
          • Many cases where this is not a good idea
  22. Theme 5: Identify preservation risks
      • The problem:
        – Data is in the repository, what risks does it face?
        – Some worry about whether they should be migrating their content
        – Some specifically want to format migrate and want help doing it
        – Root of problem is: what are the risks?
        – Risks themselves not well understood
        – Woeful tool provision to assist in automated risk assessment
      • The solution:
        – Tools/approaches for identifying specific preservation risks in digital data
        – Logical progression is then for planning, action and QA
  23. Theme 6: Long tail of issues
      • The problem:
        – Rights
        – Structural issues
        – Contextual issues
        – Data capture / harvesting
        – Data integrity
        – Planning
        – ...
  24. Preservation risks: PDF example
      • Common format, particularly in IRs
      • Reasonable understanding of the risks (e.g. non-embedded fonts, password protection / encryption)
      • Lots of tools that do part of the job
      • No simple, straightforward, automated tool to identify PDFs with clear preservation risks
      • Popular tools that provide potentially misleading information
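To make the two risks named in slide 24 concrete, here is a crude sketch that scans raw PDF bytes for the `/Encrypt` and `/FontFile` name tokens. It is an assumption-laden heuristic, not a recommended tool, and the function name is hypothetical. It can miss or misreport risks, for example when dictionaries sit inside compressed object streams or when the token appears in ordinary content, so it falls squarely into the “does part of the job” category the slide describes.

```python
def pdf_risk_hints(data: bytes) -> dict:
    """Return byte-level hints only; every value needs confirming with a real parser."""
    return {
        # Readers commonly accept a header anywhere in the first 1024 bytes.
        "looks_like_pdf": b"%PDF-" in data[:1024],
        # An /Encrypt entry in the trailer signals password protection or encryption.
        "mentions_encrypt": b"/Encrypt" in data,
        # /FontFile, /FontFile2 and /FontFile3 hold embedded font programs; if none
        # is visible, fonts may not be embedded (or may be hidden by compression).
        "mentions_embedded_font_file": b"/FontFile" in data,
    }
```

A hint of `mentions_encrypt` is a reasonable trigger for closer inspection; the absence of `/FontFile` on its own proves nothing.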
  25. Summary
      • Capture and share more evidence about the problem and about our needs, as we move forward
      • Apply resources in a careful manner. Don’t reinvent the wheel
      • Consider tackling these areas:
        – Quality Assurance
        – Appraisal and ingest preparation
        – Identify/locate preservation worthy data
        – Conformance to profiles/policy
        – Identify preservation risks
  26. Thanks for listening! Any questions?
      Paul Wheatley, SPRUCE Project Manager, University of Leeds
      Twitter: @prwheatley
      Email: