Preservation Planning: Choosing a suitable digital preservation strategy

Presentation given at the Long-term Archiving Perspectives meeting at the Office for Official Publications of the European Communities, Luxembourg, 11 November 2011.

  1. Preservation Planning: Choosing a suitable preservation approach. Long-term Archiving Perspectives of European Union Publications meeting, Office for Official Publications of the European Communities, Luxembourg, November 10-11, 2011. Gareth Knight, Centre for e-Research.
  2. Preservation Objectives. Authenticity – it is what it purports to be; Understandability – what does this information mean? Content preservation; Bitstream preservation. (Priscilla Caplan's revised Preservation Pyramid.)
  3. Identity • The exact sameness of things. • Leibniz's law indicates that two items that share common attributes are not only similar, but are the same thing. • Can two things be the same? "Ultimately nothing is the same as something else" (Paskin, 2003). [Images: a painting of Leibniz; a scanned copy of the painting.] Questions: • Both images are a pictorial representation of Leibniz. Image A is constructed using paint on a canvas; Image B is constructed as 0s and 1s. Do they share the same identity? • Is it necessary for all object attributes to be the same, or is it acceptable to have some degree of granularity? • How much is identity based upon the ability to measure attributes?
  4. Integrity
  5. Is integrity maintained? = Yes/No • Linked to notions of consistency, wholeness and truth. • There has not been deliberate or accidental damage or change that has caused meaning to be altered or lost, in part or in its entirety. • A checksum algorithm applied to a file generates a distinct (possibly unique) alphanumeric value. • Commonly used to check for accidental or deliberate data change/corruption: generate a checksum on October 1st; generate a checksum on October 14th and compare it to the October 1st value – are they the same? YES/NO (see the sketch below).
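A minimal Python sketch of the two-date checksum comparison described on this slide. The file name, algorithm choice (SHA-256) and chunk size are assumptions for illustration only.

```python
import hashlib

def file_checksum(path: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# 1 October: compute the value and store it alongside the object.
stored_value = file_checksum("report.pdf")   # hypothetical file

# 14 October: recompute and compare. A mismatch signals accidental or
# deliberate change since the first measurement.
is_intact = file_checksum("report.pdf") == stored_value
print("Integrity maintained:", is_intact)
```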
  6. Is integrity maintained? = 0-100%. If one chunk became corrupted, the hashes for the other chunks, which hadn't changed, could be used to prove their integrity. Piecewise hashing: • divides an input file into sections and checksums each chunk separately. • Intended to measure the integrity of disk images (dcfldd). • However, an insert or delete changes all subsequent hashes. Rolling hash: • looks at each point of the file in semi-random order. • Depends only on the last few bytes. (A piecewise hashing sketch follows below.)
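A minimal Python sketch of piecewise hashing as described above: the input is split into fixed-size chunks and each chunk is hashed separately, so corruption in one chunk leaves the other hashes verifiable. The chunk size, hash algorithm and file names are assumptions; real tools such as dcfldd differ in detail.

```python
import hashlib

def piecewise_hashes(path: str, chunk_size: int = 1024 * 1024) -> list:
    """Hash each fixed-size chunk of a file separately."""
    hashes = []
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            hashes.append(hashlib.md5(chunk).hexdigest())
    return hashes

# Compare a reference copy with a later copy chunk by chunk. Note that an
# insertion or deletion shifts the chunk boundaries and so changes all
# subsequent hashes, which is the weakness the slide points out.
before = piecewise_hashes("disk_image.dd")        # hypothetical images
after = piecewise_hashes("disk_image_copy.dd")
unchanged = sum(a == b for a, b in zip(before, after))
print(f"{unchanged} of {len(before)} chunks unchanged")
```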
  7. Example of Piecewise hashing (1): 19e33h213a7865b2b664348b ea3fe191227a4eg933bc41ge 2d839db2996b412e84h77a33 872e73ab867c883e7391ae65
  8. Example of Piecewise hashing (2): 19e33h213a7865b2b664348b SAME! ea3fe191227a4eg933bc41ge SAME! a73921e173c94e8232fa91bb DIFFERENT TEXT 7894af8211c12bb123ah9912 INCOMPLETE
  9. Renderability
  10. Data Interpretation in practice. OAIS Reference Model; NAA Performance Model: data + computer + operating system + application = information content.
  11. Information Object: Information Properties. Some definitions: • Information Property/Description (IP): a description of part of the information content (OAIS RM v2, 2009). • Property: an abstract attribute, trait or peculiarity suitable for describing preservation objects, actions or environments (Dappert, 2009). Observations: • No interpretation of significance – it merely exists. • May be held in different locations and at different levels of detail.
  12. Information Property categories (1). Rothenberg & Bikson (1999) identify five types of Information Property: • Content: the author's intellectual work, e.g. text, still image, audio waveform, etc. • Context: information that affects the content's intended meaning and establishes its provenance. • Appearance: information that contributes to the recreation of the performance, e.g. font type/colour/size, bit depth. • Structure: relationship between 2+ types of content, e.g. e-mail attachments, internal hyperlinks. • Behaviour: information that establishes how content interacts with the user, or other objects or components, e.g. hyperlink handling. http://www.panix.com/~jeffr/Prof/digilong.html
  13. [Diagram: an example object annotated with the five categories – Content, Context, Appearance, Structure, Behaviour; an image & text link – content and context?]
  14. Information Property categories (2). The PLANETS Digital Object Properties WG uses a different classification based upon the ability to identify: • Extractable properties: properties that can be extracted from the object or calculated on the fly, e.g. file size, image dimensions, MD. • Observational properties: can only be determined by human observation, e.g. licence restriction(?). • Performance properties: properties that emerge through the combination of hardware, software and the data object. Source: PLANETS Digital Object Properties WG.
  15. [Diagram: Extractable information, Observational Property, Performance Property.]
  16. Preservation Metadata: Documenting the technical encoding and intellectual content
  17. PREMIS • "Things that most working repositories are likely to need to know in order to support digital preservation." • Core metadata that defines "viability, renderability, understandability, authenticity, and identity in a preservation context." What metadata assists with rendering? • Format • Size • Fixity • Creating Application: name, version, date the data was created • Inhibitors: features intended to inhibit access, use, or migration. (PREMIS DD 1.0, May 2005; PREMIS DD 2.0, March 2008.) A sketch of these elements follows below.
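An illustrative sketch only: the rendering-related elements named on this slide recorded as a plain Python dictionary. This is not the official PREMIS XML serialisation, and every value is an invented example.

```python
# Hypothetical object record covering the PREMIS elements listed above.
object_metadata = {
    "objectIdentifier": "report-2011-001",            # invented identifier
    "format": {"name": "PDF", "version": "1.4"},
    "size": 482133,                                    # bytes
    "fixity": {
        "messageDigestAlgorithm": "SHA-256",
        "messageDigest": "<hex digest recorded at ingest>",
    },
    "creatingApplication": {
        "name": "ExampleWriter",                       # invented application
        "version": "3.3",
        "dateCreatedByApplication": "2011-10-01",
    },
    "inhibitors": [{"type": "password", "target": "print"}],
}
print(object_metadata["fixity"]["messageDigestAlgorithm"])
```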
  18. Technical Metadata for still images. Standards: Z39.87, MIX and others. Information on: • image characteristics • encoding scheme • metadata. (Image: http://www.flickr.com/photos/k4chii/200303113/)
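A minimal sketch of pulling basic image characteristics with the Pillow library; the choice of tool and the file name are assumptions, since the slide names the standards (Z39.87/MIX) rather than any particular software.

```python
from PIL import Image

# Open a (hypothetical) scanned image and record a few basic characteristics
# of the kind a MIX/Z39.87 record would carry.
with Image.open("scan_0001.tif") as img:
    characteristics = {
        "format": img.format,     # container format, e.g. "TIFF"
        "width": img.width,       # pixel dimensions
        "height": img.height,
        "mode": img.mode,         # colour model, e.g. "RGB", "L" (greyscale)
    }
print(characteristics)
```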
  19. DocumentMD. Applicable to formats that are primarily text, allow a choice of font, and support embedded multimedia and page layouts. Example elements: • Page Count • Word Count • Character Count • Paragraph Count • Line Count • Table Count • Graphics Count • Language • Fonts (list of each font in the document) • Features (additional document features, e.g. hasTransparency, hasOutline, hasAnnotation)
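An illustrative sketch of deriving a few DocumentMD-style counts, here from a plain text file for simplicity; real document metadata extractors work on formatted documents (PDF, word-processor formats) and also report fonts, tables and features. The file name is hypothetical.

```python
from pathlib import Path

text = Path("paper.txt").read_text(encoding="utf-8")
paragraphs = [p for p in text.split("\n\n") if p.strip()]

document_md = {
    "characterCount": len(text),
    "wordCount": len(text.split()),
    "lineCount": len(text.splitlines()),
    "paragraphCount": len(paragraphs),
}
print(document_md)
```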
  20. Third party services: Representation Information Registries. • Require trusted third party services capable of identifying formats: PRONOM, UDFR. • Providing information on rendering data: OpenWith, various RI services.
  21. Preserving your object across changing technologies
  22. Change in process over time. In each case, hardware + operating system + software application + information content = performance: Intel PC (2000); Mac laptop (2006); x64 Ubuntu laptop (2010). Potential for change to the 'Performance' over time.
  23. Change is a necessity… and a risk. "Traditionally, preserving things meant keeping them unchanged; however… if we hold on to digital information without modifications, accessing the information will become increasingly more difficult, if not impossible." (Su-Shing Chen, 2001) "The fundamental challenge of digital preservation is to preserve the accessibility and authenticity of digital objects over time and domains, and across changing technical environments." (Wilson, 2008)
  24. Authenticity
  25. Authenticity: "the degree to which a person (or system) may regard an object as what it is purported to be" (OAIS RM v2). Questions: • How do you distinguish the authentic original from the imitators? • What is authenticity in the digital realm? Which is the real Elvis? Img src: http://www.flickr.com/photos/mymollypop/2904798835/ http://www.flickr.com/photos/blahflowers/3827096787/ © 1973, Elvis Presley Enterprises, Inc. and RCA Records http://en.wikipedia.org/wiki/File:ElvisPresleyAlohafromHawaii.jpg
  26. What do we need to keep for an information object to be authentic? "Understanding, defining and assessing the individual properties… important… for informing decisions about which characteristics of that object should be preserved over time, in circumstances where it is not possible, for reasons such as cost, practicality or technical constraints, to preserve all the elements of that object" (Montague et al., The Concept of Significant Properties, 2010). "Unless such properties can be defined in a rigorous and measurable manner, cultural memory institutions have no objective framework for identifying, implementing, and validating appropriate preservation strategies, nor for asserting the continued authenticity of their digital collections" (Dappert, 2009).
  27. Acceptable vs unacceptable change. • Easy to identify when preservation has gone wrong, but how do you decide when it goes right? • Interpretation is a value judgement – often influenced by different criteria. • Uncertainty about the level at which evaluation should be performed – technical encoding, object type (e.g. still image), object sub-type (e.g. business document, research paper). • How do you measure attributes that are considered significant? Technical properties may vary between formats; observational properties require manual identification.
  28. Planning your strategy; strategising your plan. • Preservation Plan: "defines a series of preservation actions to be taken by a responsible institution due to an identified risk for a given set of digital objects or records" (http://www.dlib.org/dlib/november09/kulovits/11kulovits.html). • Preservation Strategy: indicates commitment to preservation and the high-level approach adopted – organisational mission, applied principles (e.g. use a lifecycle approach), sequence of actions (immediate, medium-term, long-term), risk management.
  29. Why develop a preservation plan? It assists the decision-making process: • Evaluate different strategies. • Evaluate different tools. • Determine which is the most effective approach for your needs. • Transparency of operation – enable others to view and understand the approach adopted, inspiring confidence and trust. • Provide evidence of decision-making – decisions may be questioned: how do you prove that the approach taken was appropriate for the circumstances?
  30. Evaluation frameworks. Various approaches may be adopted to develop a preservation plan: • Produce an internal decision tree – fits the intrinsic needs of the organisation, but requires staff time to develop and may be limiting when considering new approaches. • Perform an informal "bottom-up" object analysis and develop a bespoke plan – fits the requirements of the object type, but may be time-intensive to produce and may be incompatible with broader policies. • Adopt a 3rd party standardised plan (aka copy and paste) – adopting an existing plan saves time, but may be inappropriate for the context. • Use analysis frameworks and toolkits – a structured process by which an organisation can identify objectives and develop a plan to address them: DRAMBORA/DIRKS – analyse environment and practices, identify risks and brainstorm methods of mitigating or avoiding them; Data Asset Framework – identify data held, assess management practices and make recommendations for improvement; PLANETS Preservation Planning – define requirements, evaluate alternative approaches, analyse and compare results, recommend a preferred approach, and develop a plan.
  31. Preservation Planning workflow. • Developed as part of the DELOS project and adopted by the PLANETS Consortium. • Conforms to the General COTS (Commercial Off-The-Shelf) selection process (GCS). • Abstract steps: define criteria, search for products, create shortlist, evaluate candidates, analyse data and select product. • Uses a utility analysis approach.
  32. PLANETS Planning workflow http://olymp.ifs.tuwien.ac.at:8080/plato/
  33. Define Requirements: Factors to consider. Identify and analyse the environment in which decisions are made (e.g. assumptions and constraints) to determine context: • Organisational/dept objectives (e.g. mission statement, mandate) • National/local policy framework (e.g. acquisition, legal framework) • Codes of practice • Financial limitations – what can you afford? • Object types to be maintained • Expertise and needs of key stakeholders, e.g. Designated Community.
  34. Whose views do you need to take into account? Digital archive perspective: • General trend to simplify the object to make it (speculatively) easier to manage in future: reduce the cost of the preservation process; limit the risk that accessibility/preservation issues will emerge; increase the number of preservation options available. Creator perspective: • Author intent is difficult to establish. • Differs for each object – do you seek to treat each object individually or identify broad classes? • When do you ask them? On creation, after 5 years? They may have different views on value. User perspective: • How do you analyse the interpretation of the current user community? • How do you predict the needs of future users?
  35. InSPECT Requirements Analysis Framework (2008). • Adopted a design method used to assist engineers and designers to create and re-design artefacts. • Based upon the theory that artefact construction is a product of designated function(s). • Assessment draws upon two philosophical approaches: 1. Teleology: the study of the design and purpose of an object – why was it created? 2. Epistemology: understanding the meaning and process by which knowledge is acquired. • In combination, these encourage evaluation of the context of creation and the information needed to communicate intrinsic knowledge to a new audience (designated community).
  36. Requirements Analysis activities. Step 1: Object Analysis – interpret the context of creation: 1. Analyse the object to find out what it contains. 2. Identify the original audience and the functions that the object was created to perform. 3. Determine the information properties necessary to achieve each function. Step 2: Stakeholder Analysis – determine the future requirements of the digital object: 1. Identify the stakeholders that will use the object. 2. Determine the function set they may perform when using the object. 3. Identify the quality thresholds for each information property that must be met to allow each function to be achieved – what is acceptable loss?
  37. Define Requirements: PLANETS Requirement Categories. • Produce a list of criteria that will be used to evaluate different preservation strategies in a specific domain. • May take a top-down (organisational) or bottom-up (object) approach. • PLANETS identify four groups of characteristics to be evaluated: 1. Object: attributes of the information content itself, e.g. behaviour, context. 2. Record: attributes of the record including context, relationships and MD – potential overlap with Object in some cases. 3. Process: attributes of the preservation process, e.g. processing speed, usability of the tool, ability to batch process, etc. 4. Cost: set-up of process, cost per object, hardware and software, personnel. • Non-prescriptive – the evaluator may identify further top-level and sub-categories or ignore existing criteria (e.g. technical characteristics for format evaluation). • May be expressed as a spreadsheet, list, mind map, post-it notes and other forms.
  38. Record requirements as an Evaluation Tree. • The set of requirements may be expressed as a mind map, spreadsheet, or other form. • Define the structure of the evaluation process, grouping similar items together. • Assign a measurement value to each 'leaf': objective measure, e.g. colour depth, duration; subjective measure, e.g. acceptable variance. (A sketch of such a tree follows below.)
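A minimal sketch of an evaluation tree as a nested Python structure: branches group related requirements and each leaf carries a weight and a measurement scale. The branches, criteria and weights are invented purely for illustration.

```python
# Hypothetical objective tree for evaluating image preservation strategies.
evaluation_tree = {
    "Object characteristics": {
        "weight": 0.5,
        "leaves": {
            "colour depth preserved":     {"weight": 0.6, "scale": "boolean"},
            "image dimensions preserved": {"weight": 0.4, "scale": "boolean"},
        },
    },
    "Process characteristics": {
        "weight": 0.3,
        "leaves": {
            "seconds per object":         {"weight": 1.0, "scale": "numeric"},
        },
    },
    "Costs": {
        "weight": 0.2,
        "leaves": {
            "cost per object (EUR)":      {"weight": 1.0, "scale": "numeric"},
        },
    },
}
```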
  39. Define Requirements: Measure each criterion. • Assign a measurement value to each 'leaf'. • Objective measures: unambiguous, possibly automated, e.g. seconds to process an object, colour depth, cost value. • Subjective measures: acceptable, but often require manual evaluation, e.g. degree of format support. • Type of scale: numeric measure (e.g. 15 bit); Boolean (Yes/No); controlled vocabulary (e.g. Yes/Acceptable/No); ordinal numbers (controlled list); subjective criteria (0-5).
  40. Objective tree for web sites
  41. Define Alternatives. • On the basis of the object type and the expressed requirements, what strategies are feasible? • Many different approaches are available; e.g. TIFF images could undergo the following actions: format conversion to JPEG 2000; format conversion to PNG (to save space); format conversion to PDF (though not recommended); emulation/virtual machine; do nothing! • For each alternative strategy, you may wish to define: the tool to be tested (e.g. name, version, OS); configuration parameters; the function to be tested. (One candidate conversion is sketched below.)
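A minimal sketch of one of the candidate actions listed above, converting a TIFF image to PNG with Pillow. The tool choice and file names are assumptions; in a real plan each alternative would record the exact tool, version and parameters.

```python
from PIL import Image

# Hypothetical source file; assumes an RGB or greyscale TIFF.
with Image.open("master_0001.tif") as img:
    img.save("master_0001.png", format="PNG")   # lossless conversion
```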
  42. Trial the preservation approaches. Develop a set of experiments to trial the preservation approach: • Define the workflow • Select representative test files • Perform the evaluation • Evaluate the outcome according to your objective tree • Were there undesired/unexpected results? (A small experiment loop is sketched below.)
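A minimal sketch of such an experiment: run one candidate action over a folder of representative test files and record simple measurements that map onto leaves of the objective tree. The folder, file names and chosen measurements are assumptions.

```python
import time
from pathlib import Path
from PIL import Image

results = []
for source in Path("testbed").glob("*.tif"):       # representative sample
    target = source.with_suffix(".png")
    start = time.perf_counter()
    with Image.open(source) as img:
        original_size = img.size
        img.save(target, format="PNG")
    # Re-open the converted file and check one objective-tree leaf.
    with Image.open(target) as converted:
        dimensions_preserved = converted.size == original_size
    results.append({
        "file": source.name,
        "seconds_per_object": round(time.perf_counter() - start, 3),
        "dimensions_preserved": dimensions_preserved,
    })
print(results)
```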
  43. PLATO conversion tool/format comparison. Definition of alternative approaches to preserve a GIF image (conversion to alternative formats) and identification of the tool services available to perform the action.
  44. Compare results. A common basis is required for comparing different strategies. Normalise disparate results: each evaluation factor is measured differently (Y/N, cost, speed of conversion); they can be made comparable by converting them to a uniform scale. Set important factors: not all assessment criteria are equal – do you wish to prioritise specific requirements (e.g. scalability, cost)? Compare outcomes and select the most appropriate preservation strategy. (A weighted comparison is sketched below.)
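A minimal sketch of the comparison step as a weighted sum (utility analysis): each criterion has already been normalised to a common 0-5 scale, weights express its importance, and strategies are ranked by their totals. The strategies, scores and weights are invented for illustration.

```python
# Importance of each criterion (weights sum to 1.0).
weights = {"dimensions preserved": 0.5, "seconds per object": 0.2, "cost": 0.3}

# Measurements already normalised to 0 (worst) - 5 (best).
strategies = {
    "Convert to PNG":       {"dimensions preserved": 5, "seconds per object": 4, "cost": 4},
    "Convert to JPEG 2000": {"dimensions preserved": 5, "seconds per object": 3, "cost": 3},
    "Emulation":            {"dimensions preserved": 5, "seconds per object": 2, "cost": 2},
}

# Weighted total per strategy, ranked from best to worst.
ranked = sorted(
    ((sum(weights[c] * score for c, score in scores.items()), name)
     for name, scores in strategies.items()),
    reverse=True,
)
for total, name in ranked:
    print(f"{name}: {total:.2f}")
```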
  45. Conclusions. Preservation is an iterative process – you must climb many steps to reach the top of the pyramid. Preservation Planning enables an organisation to understand and document its requirements. It demonstrates decision-making – inspiring confidence and trust. It is not a perform-once-and-forget process: it must be repeated.
  46. Discussion points. • Are traditional checksum techniques acceptable for measuring integrity, or do we need a more granular approach? • How should we utilise and build upon third-party services, such as RI registries and preservation planning tools, to achieve our preservation objectives? • What would a preservation plan for our scanned images, documents and metadata look like?
  47. Thank you for your attention. Questions? Gareth Knight, gareth.knight@kcl.ac.uk
