Pain points for preservation services /
workflows in repositories
                                 Paul Wheatley
                        SPRUCE Project Manager
                             University of Leeds

                                  Twitter: @prwheatley
            http://openplanetsfoundation.org/blogs/paul
What I’m going to talk about...

• What evidence we have for answering this question
• What we know about our approach to developing
  preservation tools and services
• Practitioner needs captured by SPRUCE

• Suggestions
SPRUCE Project

 Sustainable Preservation Using
 Community Engagement




 • JISC funded
 • 2 years in length (until Nov 2013)
 • £250k funding
 • Concepts developed from AQuA Project
 http://wiki.opf-labs.org/display/SPR
Channel this thought...




“Sharing best practice?
     We don’t even share practice!”
                                         Andrew N. Jackson
                      Curate Camp, Toronto, 2nd October 2012
Crude maturity model for DP™



                   Evidence



                 Best Practice



                  Standards
Evidence based approach to DP – some examples

• What is actually happening to file formats over
  time?




                         Formats over Time: Exploring UK Web History
                                                  Andrew N. Jackson
Evidence based approach to DP (2)

• Are our file format ID tools improving?




                           LDS3: applying digital preservation principals to
                                                       linked data systems
                                            David Tarrant and Leslie Carr
Evidence based approach to DP (3)

• Even if our digital files aren’t “obsolete”, do they
  render “correctly”?
  Figure: Percentage of tested attributes where a change was observed in at
  least one file when rendered in a test environment

    Tested environment                        Attributes changed
    Original Software on Emulated Hardware    24%
    LibreOffice Writer                        86%
    Corel WordPerfect X5                      76%
    Microsoft Word 2007                       59%

         Rendering Matters: Report on the results of research into digital
                                                           object rendering
                                        Euan Cochrane, Archives New Zealand
Suggestion 1: Evidence

• Evidence based approach
• Support the capture and sharing of evidence about
  the problem and about the experiences of
  performing preservation and curation
Approach to developing new preservation services

• Problems:
  –Tools/services that don’t solve concrete problems
  –Duplicated / wasted effort
  –Insufficient re-use of existing
   tools/technologies/approaches
  –Tools that are difficult to maintain / re-use
• Reasons:
  –Lack of practitioner/user driven development
  –Requirements not shared
  –Lack of awareness of existing work
  –Poor development approaches / technology choices
Digital preservation costing initiatives

 • LIFE 1, 2 and 3: Projects to explore digital preservation costing, and develop costing models
 • Cost Model for Digital Preservation (CMDP): Project at the Royal Danish Library and the
   Danish National Archives to develop a new cost model. Currently covers Planning, Migrations
   and Ingest
 • Keeping Research Data Safe 1 and 2 (KRDS): Cost model and benefits analysis for preserving
   research data
 • Presto Prime cost model for digital storage
 • Cost Estimation Toolkit (CET): Data centre costing model and toolkit, from NASA Goddard
 • Cost Model for Small Scale Automated Digital Preservation Archives (Strodl and Rauber)
 • APARSEN Project activity focused on digital preservation costing
 • EPSRC and JISC study on Cost analysis of cloud computing for research
 • Cost forecasting model for new digitization projects (Excel and web tool under development)
   (Karim Boughida, Martha Whittaker, Linda Colet, Dan Chudnov)
 • DP4lib business and cost model for a digital preservation service
 • DANS Costs of Digital Archiving Volume 2 Project, focusing on preservation and
   dissemination of research datasets
 • Blue Ribbon Task Force on Sustainable Digital Preservation and Access
 • Economic Sustainability Reference Model
 • ENSURE Project - Enabling kNowledge Sustainability Usability and Recovery for Economic
   value
 • Cost Model for Electronic Health Records (Bote, Fernandez-Feijoo, and Ruizb)
 • 4C: EU funded project on costing. Due to commence in 2012. Led by JISC
 • http://wiki.opf-labs.org/display/CDP/Home
 • An extended blog-rant on why this typifies a big #fail for our community
In summary...




“10 Years on we are still pretty much talking about
  the same things...
...Tools like DROID and PRONOM etc. didn’t work
  properly then, and they still don’t work properly
  now.”
    Steve Knight, New Zealand National Library, iPRES2012
      (blogged by Inge Angevaare: How are we doing as a community?)
Suggestion 2: Development Approach
• Make it practitioner/user led
   – Solve concrete problems
• Re-use, don't re-invent the wheel
   – Most problems have already been solved, although often not by this community
   – Re-use existing code where possible
• Keep it small, keep it simple
   – Functional preservation tools should be atomic
   – Modularise in the face of growing requirements
   – Ensure results can be exploited and integrated with other orchestration/repository
     platforms
• Make it easy to use, build on, re-purpose and ultimately, maintain
   – Share your source
   – Automate your build
   – Package for easy install
• Share outputs, exchange knowledge, learn from each other
   – Write up dev and user experiences and share them
   – Publish data about usage
   – Shout about it, blog it, tweet it, and add it to a tool/service registry (or three)
• Adapted from: the SPRUCE Mashup Manifesto
The SPRUCE Mashup

Identify and Solve concrete problems
• 3 day workshop for ~30 people
• Practitioners bring along digital
  collections
• We identify preservation challenges
• Pair up practitioners with technical
  experts
• Apply existing open source tools to
  solve the problems
• In doing so, we exchange knowledge
  about digital preservation
• Develop a supportive community

  (Photo: Glasgow Mashup, April 2012)
Mashup results: Datasets, Issues and Solutions

• Capture information about our work on a wiki:
  –http://bit.ly/spruce-results
• Datasets: about the content, some context
• Issues: what the challenge or problem is
• Solutions: an experience of solving the Issue
• Begin as triples, build into networks
• Most sourced from mashup events, some from the
  SCAPE Project (these are preceded with Isxx or Soxx codes)
• All are from “practitioners” or “problem owners”
• Useful to separate problem owner from solution
  provider
• Understand the problem before diving into the solution
Suggestion 3: preservation challenges to target

• Key themes drawn from the practitioner sourced
  preservation issues collected
• It gives an impression of what practitioners are facing
• Capture some of that “practice” (i.e. What we learnt about
  trying out approach W, with tool X, on data Y, in situation Z)

• But, a word of caution...
• Not a scientific approach to assessing user needs
• Steered to an extent by the shape of the events and the
  time available
• First 2 mashup events had some focus on QA
  – The following 3 events had an “open” focus on any DP related topic
Theme 1: Quality Assurance

• The problem:
  –Some have broken data
  –Some have suspected broken data
  –Some intend to process data in some way, but are
   concerned that they cannot check that the process
   doesn't break the data
• The solution:
  –Cross section of automated QA approaches required.
    • How do we spot the flaws automatically?
    • How do we fix them automatically?
    • Often involves cross checking (e.g. data to metadata)
    • Sometimes explorative: what actually caused the problem, and how
      do we prevent it?
    • Every case feels unique, but often strikes a chord more widely
Jpylyzer example

• New characterisation + validation tool for JPEG2000
  –Development and operation driven by use cases
    • E.g. JP2 is used at scale in mass digitisation efforts; truncation is
      a common potential problem, yet existing tools didn’t check the (quite
      complex) end-of-file conditions
    • Flawed creation tools omit critical metadata in created files
    • Examples of broken files enabled testing!
  –Also enables validation against a profile, enables
   automated QA of content from external suppliers
• More information:
  –http://openplanetsfoundation.org/software/jpylyzer
  –Also see page on JP2 preservation risks:
  –http://wiki.opf-labs.org/display/TR/JP2
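
To illustrate the truncation use case above, here is a minimal Python sketch. It is not how jpylyzer works. It only checks whether a file ends with the JPEG 2000 end-of-codestream (EOC) marker, 0xFFD9, and it assumes the codestream is the last thing in the file, which the JP2 container does not guarantee. The function name `ends_with_eoc` is made up for this example.

```python
import os

# JPEG 2000 end-of-codestream (EOC) marker.
EOC_MARKER = b"\xff\xd9"


def ends_with_eoc(path):
    """Return True if the last two bytes of the file are the EOC marker.

    Simplifying assumption: the codestream is the final element of the
    file. A file that fails this check is a candidate for closer
    inspection, not proof of truncation.
    """
    if os.path.getsize(path) < 2:
        return False
    with open(path, "rb") as handle:
        handle.seek(-2, os.SEEK_END)
        return handle.read(2) == EOC_MARKER
```

A check like this is cheap enough to run over a whole batch from a supplier, with the flagged files passed on to a full validator.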
Theme 2: Appraisal + Ingest preparation

• The problem:
  –We have stuff, what is it, what should I worry about, what
   do I do next?
  –We know roughly what we've got (we've had some
   before) but we have a largely manual appraisal process
   that doesn't scale well
  –How do we turn this blob of content into something we
   can ingest into our repository?
• The solution:
  –Characterisation capability needs to vastly improve
  –Automatic extraction of properties / flavour of content to
   aid appraisal/selection
  –Inform processing of data prior to ingest
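
A first pass at "we have stuff, what is it?" could be a simple content profile. The Python sketch below counts files under a directory by MIME type guessed from the file extension. This is an assumption-heavy stand-in for real characterisation, which would inspect file contents rather than names. The function name `profile_tree` is hypothetical.

```python
import mimetypes
from collections import Counter
from pathlib import Path


def profile_tree(root):
    """Count the files under root by MIME type guessed from the extension.

    Extension-based guessing only: files with no or unknown extensions
    are counted under "unknown".
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            mime, _encoding = mimetypes.guess_type(path.name)
            counts[mime or "unknown"] += 1
    return counts
```

Even a rough profile like this can show where manual appraisal effort should go first, for example a large "unknown" bucket.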
Theme 3: Identify/locate preservation worthy data

• The problem:
   – Institution has preservation worthy data scattered across shared server space
   – Data is unmanaged, not checksummed, often doesn’t have a responsible owner
   – Sorting this data from non-preservation worthy data is a challenge
• The solution:
   – Find it
       • Tools/approaches to “smell” preservation worthy data
   – Make it safe
       • Checksumming, creating manifests, registering basic details with a central authority with
         preservation responsibility, periodically recalculating checksums. All components are there
         but not in a usable package
   – Get it ready to ingest
       • De-duplication, curation, management, add metadata, other ingest preparation


   – Notes: not recorded specifically by SPRUCE but is a superset of theme 2.
     Almost universally (anecdotally) acknowledged as a big problem.
   – Not a problem people like to talk about in public forums
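
The "make it safe" step above (checksum, build a manifest, periodically recalculate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: SHA-256 is chosen here for the example, the manifest is a plain in-memory dictionary, and the function names are invented. Registering the manifest with a central authority is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or have changed."""
    root = Path(root)
    problems = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative)
    return sorted(problems)
```

Running `verify_manifest` on a schedule against a stored manifest gives the periodic recalculation; an empty list means nothing listed in the manifest has changed or gone missing.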
Theme 4: Conformance to institutional profile/policy

• The problem:
  –Institution has policy driven requirements for the shape of
   its content, defined by specific profiles
  –Does data conform to these profiles?
  –If not (in some cases), can it be made to conform?
• The solution:
  –Conformance checking focused characterisation and
   validation
  –Modification of content + associated QA

  –Notes (personal opinion warning):
    • Some specific cases where this is a good idea (e.g. digitisation)
    • Many cases where this is not a good idea
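
The question "does data conform to these profiles?" can be sketched as a comparison between the properties extracted from a file and the values a profile requires. In the Python sketch below the property names and values are hypothetical; in practice they would come from a characterisation tool and from institutional policy.

```python
def check_conformance(properties, profile):
    """Return (key, expected, actual) for every property that fails the profile.

    A property missing from `properties` is reported with actual=None.
    """
    failures = []
    for key, expected in profile.items():
        actual = properties.get(key)
        if actual != expected:
            failures.append((key, expected, actual))
    return failures


# Hypothetical profile and extracted properties, for illustration only.
profile = {"format": "JP2", "compression": "lossless", "resolution_dpi": 300}
extracted = {"format": "JP2", "compression": "lossy", "resolution_dpi": 300}

for key, expected, actual in check_conformance(extracted, profile):
    print(f"{key}: expected {expected!r}, found {actual!r}")
```

Keeping the check separate from any fix-up step means the same comparison can be re-run as QA after content has been modified.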
Theme 5: Identify preservation risks

• The problem:
  – Data is in the repository, what risks does it face?
  – Some worry about whether they should be migrating their content
  – Some specifically want to format migrate and want help doing it
  – Root of problem is: what are the risks?
  – Risks themselves not well understood
  – Woeful tool provision to assist in automated risk assessment
• The solution:
  – Tools/approaches for identifying specific preservation risks in digital
    data
  – Logical progression is then for planning, action and QA
Theme 6: Long tail of issues

• The problem:
   –Rights
   –Structural issues
   –Contextual issues
   –Data capture / harvesting
   –Data integrity
   –Planning
   –........
Preservation risks: PDF example

• Common format, particularly in IRs
• Reasonable understanding of the risks (e.g. non-embedded
  fonts, password protection / encryption)
• Lots of tools that do part of the job
• No simple, straightforward, automated tool to
  identify PDFs with clear preservation risks

• Popular tools that provide potentially misleading
  information
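
To make one of the risks above concrete, here is a deliberately crude Python sketch that flags PDFs which may be encrypted. It only looks for the `/Encrypt` key as raw bytes anywhere in the file, so it can give both false positives and false negatives, and it says nothing about fonts or other risks. It is a triage heuristic under those assumptions, not the simple, reliable tool this slide asks for. The function name `may_be_encrypted` is invented for the example.

```python
def may_be_encrypted(path):
    """Heuristic: True if the raw bytes of the file contain '/Encrypt'.

    Encrypted PDFs reference an encryption dictionary through an
    /Encrypt entry. This check does not parse the file, so treat a
    True result as "inspect further", not as a confirmed finding.
    """
    with open(path, "rb") as handle:
        return b"/Encrypt" in handle.read()
```

Anything flagged this way would still need checking with a proper PDF parser before acting on the result.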
Summary

• Capture and share more evidence about the
  problem and about our needs, as we move forward
• Apply resources in a careful manner. Don’t reinvent
  the wheel
• Consider tackling these areas:
  –Quality Assurance
  –Appraisal and ingest preparation
  –Identify/locate preservation worthy data
  –Conformance to profiles/policy
  –Identify preservation risks
Thanks for listening! Any questions?

                                 Paul Wheatley
                        SPRUCE Project Manager
                             University of Leeds

                                   Twitter: @prwheatley
                       Email: p.r.wheatley@leeds.ac.uk
            http://openplanetsfoundation.org/blogs/paul

More Related Content

Viewers also liked

Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...JISC KeepIt project
 
Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, pleaseAndy Jackson
 
Cochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsCochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsFuture Perfect 2012
 
Digging into File Formats: Poking around at data using file, DROID, JHOVE, an...
Digging into File Formats: Poking around at data using file, DROID, JHOVE, an...Digging into File Formats: Poking around at data using file, DROID, JHOVE, an...
Digging into File Formats: Poking around at data using file, DROID, JHOVE, an...stepheneisenhauer
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Preservation content in_files
Preservation content in_filesPreservation content in_files
Preservation content in_filesRichard Wright
 

Viewers also liked (7)

Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
 
Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, please
 
Cochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsCochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and Formats
 
Digging into File Formats: Poking around at data using file, DROID, JHOVE, an...
Digging into File Formats: Poking around at data using file, DROID, JHOVE, an...Digging into File Formats: Poking around at data using file, DROID, JHOVE, an...
Digging into File Formats: Poking around at data using file, DROID, JHOVE, an...
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
[Dpf manager] berlin workshop
[Dpf manager] berlin workshop[Dpf manager] berlin workshop
[Dpf manager] berlin workshop
 
Preservation content in_files
Preservation content in_filesPreservation content in_files
Preservation content in_files
 

Similar to Pain points for preservation services / workflows in repositories

Doing Science Properly In The Digital Age - Rutgers Seminar
Doing Science Properly In The Digital Age - Rutgers SeminarDoing Science Properly In The Digital Age - Rutgers Seminar
Doing Science Properly In The Digital Age - Rutgers SeminarNeil Chue Hong
 
Accelrys Announces Experiment Knowledge Base (EKB) for Enterprise Lab Management
Accelrys Announces Experiment Knowledge Base (EKB) for Enterprise Lab ManagementAccelrys Announces Experiment Knowledge Base (EKB) for Enterprise Lab Management
Accelrys Announces Experiment Knowledge Base (EKB) for Enterprise Lab ManagementBIOVIA
 
What to curate? Preserving and Curating Software-Based Art
What to curate? Preserving and Curating Software-Based ArtWhat to curate? Preserving and Curating Software-Based Art
What to curate? Preserving and Curating Software-Based Artneilgrindley
 
Claudia Bauzer Medeiros Digital preservation – caring for our data to foster...
Claudia Bauzer Medeiros  Digital preservation – caring for our data to foster...Claudia Bauzer Medeiros  Digital preservation – caring for our data to foster...
Claudia Bauzer Medeiros Digital preservation – caring for our data to foster...Beniamino Murgante
 
Deroure Repo3
Deroure Repo3Deroure Repo3
Deroure Repo3guru122
 
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDriving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDataWorks Summit
 
Software system design sample
Software system design sampleSoftware system design sample
Software system design sampleNorman K Ma
 
Smartere test og udvikling med virtualiserede miljøer, Mark Garcia, IBM US
Smartere test og udvikling med virtualiserede miljøer, Mark Garcia, IBM USSmartere test og udvikling med virtualiserede miljøer, Mark Garcia, IBM US
Smartere test og udvikling med virtualiserede miljøer, Mark Garcia, IBM USIBM Danmark
 
discopen
discopendiscopen
discopenJisc
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeCarole Goble
 
Web-Based Self- and Peer-Assessment of Teachers’ Educational Technology Compe...
Web-Based Self- and Peer-Assessment of Teachers’ Educational Technology Compe...Web-Based Self- and Peer-Assessment of Teachers’ Educational Technology Compe...
Web-Based Self- and Peer-Assessment of Teachers’ Educational Technology Compe...Hans Põldoja
 
Eudat user forum-london-11march2013-biovel-v3
Eudat user forum-london-11march2013-biovel-v3Eudat user forum-london-11march2013-biovel-v3
Eudat user forum-london-11march2013-biovel-v3Alex Hardisty
 
COSC 426 Lect. 7: Evaluating AR Applications
COSC 426 Lect. 7: Evaluating AR ApplicationsCOSC 426 Lect. 7: Evaluating AR Applications
COSC 426 Lect. 7: Evaluating AR ApplicationsMark Billinghurst
 
Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)Lviv Startup Club
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudAdianto Wibisono
 
Scientific Software: Sustainability, Skills & Sociology
Scientific Software: Sustainability, Skills & SociologyScientific Software: Sustainability, Skills & Sociology
Scientific Software: Sustainability, Skills & SociologyNeil Chue Hong
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 
Towards Lensfield
Towards LensfieldTowards Lensfield
Towards LensfieldJim Downing
 
Are cloud based virtual labs cost effective? (CSEDU 2012)
Are cloud based virtual labs cost effective? (CSEDU 2012)Are cloud based virtual labs cost effective? (CSEDU 2012)
Are cloud based virtual labs cost effective? (CSEDU 2012)Nane Kratzke
 

Similar to Pain points for preservation services / workflows in repositories (20)

Doing Science Properly In The Digital Age - Rutgers Seminar
Doing Science Properly In The Digital Age - Rutgers SeminarDoing Science Properly In The Digital Age - Rutgers Seminar
Doing Science Properly In The Digital Age - Rutgers Seminar
 
Accelrys Announces Experiment Knowledge Base (EKB) for Enterprise Lab Management
Accelrys Announces Experiment Knowledge Base (EKB) for Enterprise Lab ManagementAccelrys Announces Experiment Knowledge Base (EKB) for Enterprise Lab Management
Accelrys Announces Experiment Knowledge Base (EKB) for Enterprise Lab Management
 
What to curate? Preserving and Curating Software-Based Art
What to curate? Preserving and Curating Software-Based ArtWhat to curate? Preserving and Curating Software-Based Art
What to curate? Preserving and Curating Software-Based Art
 
Claudia Bauzer Medeiros Digital preservation – caring for our data to foster...
Claudia Bauzer Medeiros  Digital preservation – caring for our data to foster...Claudia Bauzer Medeiros  Digital preservation – caring for our data to foster...
Claudia Bauzer Medeiros Digital preservation – caring for our data to foster...
 
Deroure Repo3
Deroure Repo3Deroure Repo3
Deroure Repo3
 
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDriving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
 
Software system design sample
Software system design sampleSoftware system design sample
Software system design sample
 
Smartere test og udvikling med virtualiserede miljøer, Mark Garcia, IBM US
Smartere test og udvikling med virtualiserede miljøer, Mark Garcia, IBM USSmartere test og udvikling med virtualiserede miljøer, Mark Garcia, IBM US
Smartere test og udvikling med virtualiserede miljøer, Mark Garcia, IBM US
 
discopen
discopendiscopen
discopen
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
 
Web-Based Self- and Peer-Assessment of Teachers’ Educational Technology Compe...
Web-Based Self- and Peer-Assessment of Teachers’ Educational Technology Compe...Web-Based Self- and Peer-Assessment of Teachers’ Educational Technology Compe...
Web-Based Self- and Peer-Assessment of Teachers’ Educational Technology Compe...
 
Eudat user forum-london-11march2013-biovel-v3
Eudat user forum-london-11march2013-biovel-v3Eudat user forum-london-11march2013-biovel-v3
Eudat user forum-london-11march2013-biovel-v3
 
COSC 426 Lect. 7: Evaluating AR Applications
COSC 426 Lect. 7: Evaluating AR ApplicationsCOSC 426 Lect. 7: Evaluating AR Applications
COSC 426 Lect. 7: Evaluating AR Applications
 
Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)
 
CADA english
CADA englishCADA english
CADA english
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the Cloud
 
Scientific Software: Sustainability, Skills & Sociology
Scientific Software: Sustainability, Skills & SociologyScientific Software: Sustainability, Skills & Sociology
Scientific Software: Sustainability, Skills & Sociology
 
Data Quality
Data QualityData Quality
Data Quality
 
Towards Lensfield
Towards LensfieldTowards Lensfield
Towards Lensfield
 
Are cloud based virtual labs cost effective? (CSEDU 2012)
Are cloud based virtual labs cost effective? (CSEDU 2012)Are cloud based virtual labs cost effective? (CSEDU 2012)
Are cloud based virtual labs cost effective? (CSEDU 2012)
 

Pain points for preservation services / workflows in repositories

  • 1. Pain points for preservation services / workflows in repositories Paul Wheatley SPRUCE Project Manager University of Leeds Twitter: @prwheatley http://openplanetsfoundation.org/blogs/paul
  • 2. What I’m going to talk about... • What evidence we have for answering this question • What we know about our approach to developing preservation tools and services • Practitioner needs captured by SPRUCE • Suggestions
  • 3. SPRUCE Project Sustainable Preservation Using Community Engagement • JISC funded • 2 years in length (until Nov 2013) • £250k funding • Concepts developed from AQuA Project http://wiki.opf-labs.org/display/SPR
  • 4. Channel this thought... “Sharing best practice? We don’t even share practice!” Andrew N. Jackson Curate Camp, Toronto, 2nd October 2012
  • 5. Crude maturity model for DP TM Evidence Best Practice Standards
  • 6. Evidence based approach to DP – some examples • What is actually happening to file formats over time? Formats over Time: Exploring UK Web History Andrew N. Jackson
  • 7. Evidence based approach to DP (2) • Are our file format ID tools improving? LDS3: applying digital preservation principals to linked data systems David Tarrant and Leslie Carr
  • 8. Evidence based approach to DP (3) • Even if our digital files aren’t “obsolete”, do they render “correctly”? Percentage of tested attributes where a change was observed in at least one file when rendered in a test environment Rendering Original Software Matters: Report on Emulated Hardware 24% Attributes on the results of research into Tested Environment LibreOffice Writer 86% digital object Attributes rendering Euan CorelWordPerfect X5 Attributes 76% Cochrane, Arch ives New Microsoft Word 2007 Zealand 59% Attributes 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Percentage of attributes where a change was observed in at least one file
  • 9. Suggestion 1: Evidence • Evidence base approach • Support the capture and sharing of evidence about the problem and about the experiences of performing preservation and curation
  • 10. Approach to developing new preservation services • Problems: –Tools/services that don’t solve concrete problems –Duplicated / wasted effort –Insufficient re-use of existing tools/technologies/approaches –Tools that are difficult to maintain / re-use • Reasons: –Lack of practitioner/user driven development –Requirements not shared –Lack of awareness of existing work –Poor development approaches / technology choices
  • 11. Digital preservation costing initiatives  LIFE 1, 2 and 3. Projects to explore digital preservation costing, and develop costing models.  Cost Model for Digital Preservation (CMDP): Project at the Royal Danish Library and the Danish National Archives to develop a new cost model. Currently covers Planning, Migrations and Ingest  Keeping Research Data Safe 1 and 2 (KRDS):Cost model and benefits analysis for preserving research data  Presto Prime cost model for digital storage  Cost Estimation Toolkit (CET): Data centre costing model and toolkit, from NASA Goddard  Cost Model for Small Scale Automated Digital Preservation Archives (Strodl and Rauber)  APARSEN Project activity focused on digital preservation costing  EPRSC and JISC study on Cost analysis of cloud computing for research  Cost forecasting model for new digitization projects (Excel and web tool under development) (Karim Boughida, Martha Whittaker, Linda Colet, Dan Chudnov)  DP4lib business and cost model for a digital preservation service  DANS Costs of Digital Archiving Volume 2 Project, focusing on preservation and dissemination of research datasets  Blue Ribbon Task Force on Sustainable Digital Preservation and Access  Economic Sustainability Reference Model  ENSURE Project - Enabling kNowledge Sustainability Usability and Recovery for Economic value  Cost Model for Electronic Health Records (Bote, Fernandez-Feijoo, and Ruizb)  4C. EU funded project on costing. Due to commence in 2012. Led by JISC  http://wiki.opf-labs.org/display/CDP/Home  An extended blog-rant on why this typifies a big #fail for our community
  • 12. In summary... “10 Years on we are still pretty much talking about the same things... ...Tools like DROID and PRONOM etc. didn’t work properly then, and they still don’t work properly now." Steve Knight, New Zealand National Library, iPRES2012 (blogged by Inge Angevarre: How are we doing as a community?)
  • 13. Suggestion 2: Development Approach • Make it practitioner/user led – Solve concrete problems • Re-use, don't re-invent the wheel – Most problems have already been solved, although often not by this community – Re-use existing code where possible • Keep it small, keep it simple – Functional preservation tools should be atomic – Modularise in the face of growing requirements – Ensure results can be exploited and integrated with other orchestration/repository platforms • Make it easy to use, build on, re-purpose and ultimately, maintain – Share your source – Automate your build – Package for easy install • Share outputs, exchange knowledge, learn from each other – Write up dev and user experiences and share them – Publish data about usage – Shout about it, blog it, tweet it, and add it to tool/service registry (or three) • Adapted from: the SPRUCE Mashup Manifesto
  • 14. The SPRUCE Mashup Identify and Solve concrete problems • 3 day workshop for ~30 people • Practitioners bring along digital collections • We identify preservation challenges • Pair up practitioners with technical experts • Apply existing open source tools to solve the problems • In doing so, we exchange knowledge Glasgow Mashup about digital preservation April 2012 • Develop a supportive community
  • 15. Mashup results: Datasets, Issues and Solutions • Capture information about our work on a wiki: –http://bit.ly/spruce-results • Datasets: about the content, some context • Issues: what the challenge or problem is • Solutions: an experience of solving the Issue • Begin as triples, builds into networks • Most sourced from mashup events, some from the SCAPE Project (these are preceeded with Isxx or Soxx codes) • All are from “practitioners” or “problem owners” • Useful to separate problem owner from solution provider • Understand the problem before diving into the solution
  • 16. Suggestion 3: preservation challenges to target • Key themes drawn from the practitioner sourced preservation issues collected • It gives an impression of what practitioners are facing • Capture some of that “practice” (i.e. What we learnt about trying out approach W, with tool X, on data Y, in situation Z) • But, a word of caution... • Not scientific approach to assessing user needs • Steered to an extent by the shape of the events and the time available • First 2 mashup events had some focus on QA – The following 3 events had an “open” focus on any DP related topic
  • 17. Theme 1: Quality Assurance • The problem: –Some have broken data –Some have suspected broken data –Some have an intention to process data in some way, but are concerned about the lack of ability to check that the process doesn't break the data • The solution: –Cross section of automated QA approaches required. • How do we spot the flaws automatically? • How do we fix them automatically? • Often involves cross checking (e.g. data to metadata) • Sometimes explorative. What actually caused the problem, how do we prevent it? • Every case feels unique, but often strikes a chord more widely
  • 18. Jpylyzer example • New characterisation + validation tool for JPEG2000 –Development and operation driven by use cases • E.g. JP2 used at scale in mass digitisation efforts, truncation is a common potential problem, yet existing tools didn’t check (quite complex) end of file conditions • Flawed creation tools omit critical metadata in created files • Examples of broken files enabled testing! –Also enables validation against a profile, enables automated QA of content from external suppliers • More information: –http://openplanetsfoundation.org/software/jpylyzer –Also see page on JP2 preservation risks: –http://wiki.opf-labs.org/display/TR/JP2
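The truncation problem described on this slide can be illustrated with a minimal Python sketch. This is not how jpylyzer works internally; it only tests two simple conditions on the raw bytes: that the file starts with the 12-byte JP2 signature box, and that it ends with the JPEG 2000 end-of-codestream marker (0xFFD9). It assumes the codestream box is the last box in the file, which is common but not required, so a failed check is only a hint that a full validator should be run. The function name `quick_jp2_check` is made up for this example.

```python
# JP2 signature box: length 12, type 'jP  ', content 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
# End-of-codestream (EOC) marker that closes a JPEG 2000 codestream.
EOC_MARKER = b"\xff\xd9"


def quick_jp2_check(data: bytes) -> dict:
    """Return two crude indicators for a JP2 file held in memory.

    Assumption: the codestream is the final box, so a complete file
    should end with the EOC marker. This is a triage heuristic only.
    """
    return {
        "has_jp2_signature": data.startswith(JP2_SIGNATURE),
        "ends_with_eoc": data.endswith(EOC_MARKER),
    }
```

A file that has the signature but does not end with the marker is a candidate for closer inspection with a proper validation tool.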
  • 19. Theme 2: Appraisal + Ingest preparation • The problem: –We have stuff, what is it, what should I worry about, what do I do next? –We know roughly what we've got (we've had some before) but we have a largely manual appraisal process that doesn't scale well –How do we turn this blob of content into something we can ingest into our repository? • The solution: –Characterisation capability needs to vastly improve –Automatic extraction of properties / flavour of content to aid appraisal/selection –Inform processing of data prior to ingest
  • 20. Theme 3: Identify/locate preservation worthy data • The problem: – Institution has preservation worthy data scattered across shared server space – Data is unmanaged, not checksummed, often doesn’t have a responsible owner – Sorting this data from non-preservation worthy data is a challenge • The solution: – Find it • Tools/approaches to “smell” preservation worthy data – Make it safe • Checksumming, creating manifests, registering basic details with a central authority with preservation responsibility, periodically recalculating checksums. All components are there but not in a usable package – Get it ready to ingest • De-duplication, curation, management, add metadata, other ingest preparation – Notes: not recorded specifically by SPRUCE but is a superset of theme 2. Almost universally (anecdotally) acknowledged as a big problem. – Not a problem people like to talk about in public forums
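The “make it safe” step (checksumming, creating a manifest, periodically recalculating checksums) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tool from the SPRUCE project: the function names are hypothetical, SHA-256 is an arbitrary choice of algorithm, and the manifest is a plain in-memory dictionary rather than a registered record held by a central authority.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum one file, reading in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Recalculate checksums and report files that are missing, changed or new."""
    current = build_manifest(root)
    problems = {}
    for name, checksum in manifest.items():
        if name not in current:
            problems[name] = "missing"
        elif current[name] != checksum:
            problems[name] = "changed"
    for name in current.keys() - manifest.keys():
        problems[name] = "new"
    return problems
```

Running `verify_manifest` on a schedule against a stored manifest would cover the periodic recalculation; an empty result means nothing has changed since the manifest was built.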
  • 21. Theme 4: Conformance to institutional profile/policy • The problem: –Institution has policy driven requirements for the shape of its content, defined by specific profiles –Does data conform to these profiles? –If not (in some cases), can it be made to conform? • The solution: –Conformance checking focused characterisation and validation –Modification of content + associated QA –Notes (personal opinion warning): • Some specific cases where this is a good idea (eg. Digitisation) • Many cases where this is not a good idea
  • 22. Theme 5: Identify preservation risks • The problem: – Data is in the repository, what risks does it face? – Some worry about whether they should be migrating their content – Some specifically want to format migrate and want help doing it – Root of problem is: what are the risks? – Risks themselves not well understood – Woeful tool provision to assist in automated risk assessment • The solution: – Tools/approaches for identifying specific preservation risks in digital data – Logical progression is then for planning, action and QA
  • 23. Theme 6: Long tail of issues • The problem: –Rights –Structural issues –Contextual issues –Data capture / harvesting –Data integrity –Planning –........
  • 24. Preservation risks: PDF example • Common format, particularly in IRs • Reasonable understanding of the risks (e.g. non-embedded fonts, password protection / encryption) • Lots of tools that do part of the job • No simple, straightforward, automated tool to identify PDFs with clear preservation risks • Popular tools that provide potentially misleading information
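To make the encryption risk concrete, here is a deliberately crude Python sketch. It is an assumption-laden heuristic, not a validated risk assessment tool, and the function name `pdf_risk_hints` is made up for this example. It looks for the `%PDF-` header near the start of the file and for the `/Encrypt` key name, which is the key a PDF uses to reference its encryption dictionary. A byte search like this can give false positives (the string may appear in ordinary content) and says nothing about font embedding, which requires a real PDF parser.

```python
def pdf_risk_hints(data: bytes) -> dict:
    """Return crude hints about a PDF held in memory.

    Heuristic only: a plain byte search, not a parse of the file.
    Treat a positive result as a reason to inspect the file further.
    """
    return {
        # The header is expected at, or very near, the start of the file.
        "looks_like_pdf": b"%PDF-" in data[:1024],
        # /Encrypt is the key that points at the encryption dictionary.
        "mentions_encrypt": b"/Encrypt" in data,
    }
```

The point of the sketch is the gap the slide describes: a quick hint is easy to produce, while a reliable, automated answer is not.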
  • 25. Summary • Capture and share more evidence about the problem and about our needs, as we move forward • Apply resources in a careful manner. Don’t reinvent the wheel • Consider tackling these areas: –Quality Assurance –Appraisal and ingest preparation –Identify/locate preservation worthy data –Conformance to profiles/policy –Identify preservation risks
  • 26. Thanks for listening! Any questions? Paul Wheatley SPRUCE Project Manager University of Leeds Twitter: @prwheatley Email: p.r.wheatley@leeds.ac.uk http://openplanetsfoundation.org/blogs/paul