Preservation Planning: Choosing a suitable digital preservation strategy
Preservation Planning: Choosing a suitable preservation approach
Long-term Archiving Perspectives of European Union Publications meeting
Office for Official Publications of the European Communities, Luxembourg, November 10-11, 2011
Gareth Knight, Centre for e-Research
Preservation Objectives
• Authentic – it is what it purports to be
• Understandability – what does this information mean?
• Content preservation
• Bitstream preservation
Priscilla Caplan's revised Preservation Pyramid
Identity
• The exact sameness of things.
• Leibniz's law indicates that two items that share all of their attributes are not only similar, but are the same thing
• Can two things be the same? “ultimately nothing is the same as something else” (Paskin, 2003)
Images: a painting of Leibniz (Image A); scanned copy of the painting (Image B)
Questions:
• Both images are a pictorial representation of Leibniz
• Image A is constructed using paint on a canvas
• Image B is constructed as 0s and 1s
• Do they share the same identity?
• Is it necessary for all object attributes to be the same, or is it acceptable to have some degree of granularity?
• How much is identity based upon the ability to measure attributes?
Is integrity maintained? = Yes/No
• Linked to notions of consistency, wholeness and truth
• There has not been deliberate or accidental damage/change that has caused meaning to be altered or lost, in part or in its entirety
• A checksum algorithm applied to a file generates a distinct (possibly unique) alphanumeric value
• Commonly used to check for accidental/deliberate data change/corruption
  • Generate checksum on October 1st
  • Generate checksum on October 14th & compare to the October 1st value – are they the same? YES/NO
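The two-date comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the function names, the sample byte strings and the choice of SHA-256 are all assumptions made for the example, since the slide does not name an algorithm.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal digest of `data` (algorithm choice is an assumption)."""
    return hashlib.new(algorithm, data).hexdigest()


def integrity_maintained(stored_value: str, data: bytes, algorithm: str = "sha256") -> bool:
    """Yes/No answer: does the current checksum match the stored one?"""
    return checksum(data, algorithm) == stored_value


# October 1st: generate and store the checksum (sample content is hypothetical)
original = b"Official Journal, issue 1"
stored_value = checksum(original)

# October 14th: regenerate and compare against the stored value
unchanged_copy = b"Official Journal, issue 1"
altered_copy = b"Official Journal, issue 2"

print("unchanged copy:", "YES" if integrity_maintained(stored_value, unchanged_copy) else "NO")
print("altered copy:  ", "YES" if integrity_maintained(stored_value, altered_copy) else "NO")
```

A single checksum can only say that something changed; it cannot say where or how much.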
Is integrity maintained? = 0-100%
If one chunk became corrupted, the hashes for the other chunks, which hadn't changed, could be used to prove their integrity.
• Piecewise hashing:
  • Divides an input file into sections and checksums each chunk separately
  • Intended to measure integrity of disk images (dcfldd)
  • However, an insert or delete changes all subsequent hashes
• Rolling hash:
  • Computes a pseudo-random value at each position in the file
  • Depends only on the last few bytes
Example of Piecewise hashing (1)
  19e33h213a7865b2b664348b
  ea3fe191227a4eg933bc41ge
  2d839db2996b412e84h77a33
  872e73ab867c883e7391ae65
Example of Piecewise hashing (2)
  19e33h213a7865b2b664348b  SAME!
  ea3fe191227a4eg933bc41ge  SAME!
  a73921e173c94e8232fa91bb  DIFFERENT TEXT
  7894af8211c12bb123ah9912  INCOMPLETE
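The piecewise comparison in the two example slides can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names, the 16-byte chunk size and the use of SHA-256 are hypothetical choices for the example and are not taken from dcfldd.

```python
import hashlib
from typing import List


def piecewise_hashes(data: bytes, chunk_size: int = 16) -> List[str]:
    """Split `data` into fixed-size chunks and checksum each chunk separately."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunk_report(before: List[str], after: List[str]) -> List[str]:
    """Label each original chunk as SAME, DIFFERENT or MISSING in the later copy."""
    report = []
    for index, digest in enumerate(before):
        if index >= len(after):
            report.append("MISSING")
        elif after[index] == digest:
            report.append("SAME")
        else:
            report.append("DIFFERENT")
    return report


def integrity_percentage(report: List[str]) -> float:
    """Share of original chunks whose hash is unchanged, as 0-100%."""
    return 100.0 * report.count("SAME") / len(report)


original = bytes(range(64))            # four 16-byte chunks
corrupted = bytearray(original)
corrupted[40] ^= 0xFF                  # damage one byte inside the third chunk
shifted = b"\x00" + original           # insert one byte at the start

baseline = piecewise_hashes(original)
corrupted_report = chunk_report(baseline, piecewise_hashes(bytes(corrupted)))
shifted_report = chunk_report(baseline, piecewise_hashes(shifted))

print(corrupted_report, integrity_percentage(corrupted_report))
print(shifted_report, integrity_percentage(shifted_report))
```

The second case shows the limitation noted on the earlier slide: a single inserted byte shifts every chunk boundary, so every fixed-size chunk hash changes even though almost all of the content survives.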
Data Interpretation in practice
• OAIS Reference Model
• NAA Performance Model
[Diagram: data + computer + OS + application = information content]
Information Object: Information Properties
Some definitions:
• Information Property/Description (IP):
  • A description of part of the information content (OAIS RM v2, 2009)
• Property:
  • An abstract attribute, trait or peculiarity suitable for describing preservation objects, actions or environments (Dappert, 2009)
Observations:
• No interpretation of significance – merely exists
• May be held in different locations and at different levels of detail
Information Property categories (1)
Rothenberg & Bikson (1999) identify five types of Information Property:
• Content: the author's intellectual work, e.g. text, still image, audio waveform, etc.
• Context: information that affects the content's intended meaning and establishes its provenance
• Appearance: information that contributes to the recreation of the performance, e.g. font type/colour/size, bit depth
• Structure: relationship between two or more types of content, e.g. e-mail attachments, internal hyperlinks
• Behaviour: information that establishes how content interacts with the user, or other objects or components, e.g. hyperlink handling
http://www.panix.com/~jeffr/Prof/digilong.html
[Figure labels: Context; Content; Image & Text link; Content and Context?; Structure; Appearance; Behaviour]
Information Property categories (2)
PLANETS Digital Object Properties WP use a different classification based upon ability to identify:
• Extractable properties:
  • Properties that can be extracted from the object or calculated on the fly, e.g. file size, image dimensions, MD
• Observational properties:
  • Can only be determined by human observation, e.g. licence restriction(?)
• Performance properties:
  • Properties that emerge through the combination of HW, SW & Data Object
Source: PLANETS Digital Object Properties WG
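As a rough illustration of what "extractable" means in practice, the sketch below calculates file size and a checksum from the bytes alone, and reads image dimensions from a PNG header. The function name and the returned keys are hypothetical; the PNG layout used (8-byte signature, then an IHDR chunk holding width and height as big-endian 32-bit integers) follows the published PNG format. Real characterisation tools cover far more formats and properties than this.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_properties(data: bytes) -> dict:
    """Return properties that can be calculated on the fly from the bytes alone."""
    props = {
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    # PNG: signature (8 bytes), chunk length (4), chunk type "IHDR" (4),
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        width, height = struct.unpack(">II", data[16:24])
        props["image_width"] = width
        props["image_height"] = height
    return props


# Hand-built header fragment for illustration only; it is not a complete PNG file.
sample = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
sample_props = extract_properties(sample)
print(sample_props["size_bytes"], sample_props["image_width"], sample_props["image_height"])
```

Observational and performance properties have no equivalent of this function: the first needs a person, the second needs the object to be rendered in a particular environment.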
Preservation Metadata: Documenting the technical encoding and intellectual content
PREMIS
• “things that most working repositories are likely to need to know in order to support digital preservation”
• Core metadata that defines “viability, renderability, understandability, authenticity, and identity in a preservation context”
• PREMIS DD 1.0 (May 2005)
• PREMIS DD 2.0 (March 2008)
What metadata assists with rendering?
• Format
• Size
• Fixity
• Creating Application: name, version, date data was created
• Inhibitors: features intended to inhibit access, use, or migration
Technical Metadata for still images
Standards: Z39.87, MIX and others
Information on:
• Image characteristics
• Encoding scheme
• Metadata
Image: http://www.flickr.com/photos/k4chii/200303113/
Document MD
Applicable to formats that are primarily text, allow choice of font, and support embedded multimedia & page layouts
Example elements:
• Page Count
• Word Count
• Character Count
• Paragraph Count
• Line Count
• Table Count
• Graphics Count
• Language
• Fonts (list of each font in document)
• Features (additional document features, e.g. hasTransparency, hasOutline, hasAnnotation)
Third party services: Representation Information Registries
• Require trusted third party services capable of identifying formats
  • PRONOM, UDFR
• Providing information on rendering data
  • OpenWith, various RI services
Preserving your object across changing technologies
Change in process over time
[Diagram: SOURCE, PROCESS, PERFORMANCE. For each environment (Intel PC, 2000; Mac laptop, 2006; x64 Ubuntu laptop, 2010): hardware + operating system + software application = information content]
Potential for the ‘Performance’ to change over time
Change is a necessity… and a risk
“traditionally, preserving things meant keeping them unchanged; however… if we hold on to digital information without modifications, accessing the information will become increasingly more difficult, if not impossible.” (Su-Shing Chen, 2001)
“The fundamental challenge of digital preservation is to preserve the accessibility and authenticity of digital objects over time and domains, and across changing technical environments” (Wilson, 2008)
What do we need to keep for an Information Object to be authentic?
“Understanding, defining and assessing the individual properties… important… for informing decisions about which characteristics of that object should be preserved over time, in circumstances where it is not possible, for reasons such as cost, practicality or technical constraints, to preserve all the elements of that object” (Montague et al. The Concept of Significant Properties. 2010)
“Unless such properties can be defined in a rigorous and measurable manner, cultural memory institutions have no objective framework for identifying, implementing, and validating appropriate preservation strategies, nor for asserting the continued authenticity of their digital collections” (Dappert, 2009)
Acceptable vs Unacceptable change
• Easy to identify when preservation has gone wrong, but how do you decide when it has gone right?
  • Interpretation is a value judgement – often influenced by different criteria
  • Uncertainty about the level at which evaluation should be performed – technical encoding, object type (e.g. still image), object sub-type (e.g. business document, research paper)
• How do you measure attributes that are considered significant?
  • Technical properties may vary between formats
  • Observational properties require manual identification
Planning your strategy; strategising your plan
• Preservation Plan: “defines a series of preservation actions to be taken by a responsible institution due to an identified risk for a given set of digital objects or records” http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
• Preservation strategy: indicates commitment to preservation and the high-level approach adopted – organisational mission, applied principles (e.g. use lifecycle approach), sequence of actions (immediate, medium term, long-term), risk management
Why develop a preservation plan?
Assists decision-making process
• Evaluate different strategies
• Evaluate different tools
Determine which is the most effective approach for your needs
• Transparency of operation – enable others to view and understand the approach adopted – inspire confidence and trust
• Provide evidence of decision-making – decisions may be questioned. How do you prove that the approach taken was appropriate for the circumstances?
Evaluation frameworks
Various approaches may be adopted to develop a preservation plan:
• Produce internal decision tree
  • Fits intrinsic needs of organisation, but requires staff time to develop & may be limiting when considering new approaches
• Perform informal “bottom-up” object analysis & develop bespoke plan
  • Fits requirements of object type, but may be time-intensive to produce & may be incompatible with broader policies
• Adopt 3rd party standardised plan (aka copy and paste)
  • Adopting an existing plan saves time, but may be inappropriate for context
• Use analysis frameworks and toolkits
  • Structured process by which organisation can identify objectives & develop plan to address them
  • DRAMBORA/DIRKS – analyse environment & practices, identify risks and brainstorm methods of mitigating or avoiding them
  • Data Asset Framework – identify data held, assess management practices & make recommendations for improvement
  • PLANETS Preservation Planning – define requirements, evaluate alternative approaches, analyse and compare results, recommend preferred approach, and develop plan
Preservation Planning workflow
• Developed as part of DELOS project & adopted by PLANETS Consortium
• Conforms to the ‘General COTS (Commercial-Off-The-Shelf) selection process’ (GCS)
• Abstract steps: Define criteria, Search for products, Create shortlist, Evaluate candidates, Analyze data & Select product
• Uses utility analysis approach
Define Requirements: Factors to consider
• Identify & analyse the environment in which decisions are made (e.g. assumptions & constraints) to determine context:
  • Organisational/dept objectives (e.g. mission statement, mandate)
  • National/local policy framework (e.g. acquisition, legal framework)
  • Codes of practice
  • Financial limitations – what can you afford?
  • Object types to be maintained
  • Expertise & needs of key stakeholders, e.g. Designated Community
Whose views do you need to take into account?
Digital archive perspective
• General trend to simplify object to make it (speculatively) easier to manage in future:
  • Reduce cost of preservation process
  • Limit risk that accessibility/preservation issues will emerge
  • Increase number of preservation options available
Creator perspective
• Author intent difficult to establish
• Differs for each object – do you seek to treat each object individually or identify broad classes?
• When do you ask them? On creation, after 5 years? May have different views on value.
User perspective
• How do you analyse interpretation of current user community?
• How do you predict needs of future users?
InSPECT Requirements Analysis Framework (2008)
• Adopted a design method used to assist engineers & designers to create & re-design artefacts
• Based upon the theory that artefact construction is a product of designated function(s)
• Assessment based upon two philosophical approaches:
  1. Teleology: study of design and purpose of object – why was it created?
  2. Epistemology: understand meaning and the process by which knowledge is acquired
• In combination, these encourage evaluation of the context of creation and the information needed to communicate intrinsic knowledge to a new audience (designated community)
Requirements Analysis activities
Step 1: Object Analysis
Interpret context of creation:
  1. Analyse object to find out what it contains
  2. Identify original audience and functions that object was created to perform
  3. Determine info. properties necessary to achieve each function
Step 2: Stakeholder Analysis
Determine future requirements of digital object:
  1. Identify stakeholders that will use object
  2. Determine function set they may perform when using object
  3. Identify quality thresholds for each information property that must be met to allow each function to be achieved – what is acceptable loss?
Define Requirements: PLANETS Requirement Categories
• Produce list of criteria that will be used to evaluate different preservation strategies in a specific domain
• May take top-down (organisational) or bottom-up (object) approach
• PLANETS identify four groups of characteristic to be evaluated:
  1. Object: attributes of information content itself, e.g. behaviour, context
  2. Record: attributes of record including context, relationships & MD – potential overlap with Object in some cases
  3. Process: attributes of preservation process, e.g. processing speed, usability of tool, ability to batch process, etc.
  4. Cost: set-up of process, cost per object, H/W & S/W, personnel
• Non-prescriptive – evaluator may identify further top-level & sub-categories or ignore existing criteria (e.g. technical characteristics for format evaluation)
• May be expressed as spreadsheet, list, mind-map, post-it notes & other forms
Record requirements as Evaluation Tree
• Set of requirements may be expressed as mind map, spreadsheet, or other form
• Define structure of evaluation process, grouping similar items together
• Assign a measurement value to each ‘leaf’
  • Objective measure: e.g. colour depth, duration
  • Subjective measure: e.g. acceptable variance
Define Requirements: Measure each criterion
• Assign a measurement value to each ‘leaf’
• Objective measures:
  • Unambiguous, automated (possibly), e.g. seconds to process object, colour depth, cost value
• Subjective measures:
  • Acceptable, but often require manual evaluation, e.g. degree of format support
• Type of scale:
  • Numeric measure (e.g. 15 bit)
  • Boolean (Yes/No)
  • Controlled vocab (e.g. Yes/Acceptable/No)
  • Ordinal numbers (controlled list)
  • Subjective criteria (0-5)
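One way to record an evaluation tree with a scale attached to each leaf is sketched below. This is an assumption-laden illustration rather than the PLANETS or Plato data model: the `Criterion` class, the scale names and the example leaves are hypothetical, and only the four top-level category names come from the slides.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Criterion:
    """One leaf of the evaluation tree and the scale used to measure it."""
    name: str
    scale: str                                # "numeric", "boolean" or "vocab"
    allowed: Optional[Sequence[str]] = None   # controlled vocabulary ("vocab" only)

    def accepts(self, value) -> bool:
        """Check that a recorded measurement fits the declared scale."""
        if self.scale == "boolean":
            return isinstance(value, bool)
        if self.scale == "numeric":
            # bool is a subclass of int in Python, so exclude it explicitly
            return isinstance(value, (int, float)) and not isinstance(value, bool)
        if self.scale == "vocab":
            return self.allowed is not None and value in self.allowed
        return False


# Top-level branches follow the four PLANETS groups; the leaves are examples only.
evaluation_tree = {
    "Object": [Criterion("colour depth (bits)", "numeric")],
    "Record": [Criterion("embedded metadata retained", "boolean")],
    "Process": [
        Criterion("seconds to process object", "numeric"),
        Criterion("degree of format support", "vocab", ["Yes", "Acceptable", "No"]),
    ],
    "Cost": [Criterion("cost per object", "numeric")],
}

format_support = evaluation_tree["Process"][1]
print(format_support.accepts("Acceptable"))   # value from the controlled list
print(format_support.accepts("Maybe"))        # value outside the controlled list
```

Declaring the scale alongside each leaf makes it possible to reject a measurement that does not fit before any strategies are compared.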
Define Alternatives
• On the basis of object type and expressed requirements, what strategies are feasible?
• Many different approaches are available, e.g. TIFF images could undergo the following actions:
  • Format conversion to JPG2k
  • Format conversion to PNG (to save space)
  • Format conversion to PDF (though would not recommend)
  • Emulation/virtual machine
  • Do nothing!
• For each alternative strategy, may wish to define:
  • Tool to be tested (e.g. name, version, OS)
  • Configuration parameters
  • Function to be tested
Trial the preservation approaches
Develop a set of experiments to trial the preservation approach:
• Define workflow
• Select representative test files
• Perform evaluation
• Evaluate the outcome according to your objective tree
• Were there undesired/unexpected results?
PLATO conversion tool/format comparison
Definition of alternative approaches to preserve a GIF image (conversion to alternative formats) and identification of tool services available to perform the action
Compare results
Require a common basis for comparing different strategies
• Normalise disparate results
  • Each evaluation factor is measured differently (Y/N, cost, speed of conversion)
  • Can make them comparable by converting them to a uniform scale
• Set important factors
  • Not all assessment criteria are equal – do you wish to prioritise specific requirements (e.g. scalability, cost)?
• Compare outcomes & select most appropriate preservation strategy
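The normalise, weight and compare steps can be sketched as a small utility calculation. Everything specific here is an assumption made for illustration: the 0-5 target scale, the linear transformation, the weighted-average aggregation, the weights and the measurements for the two hypothetical strategies. A real plan would set its own transformation rules and weights with its stakeholders.

```python
def normalise_numeric(value: float, worst: float, best: float, top: float = 5.0) -> float:
    """Map a numeric measurement linearly onto 0..top, clamped at both ends."""
    if best == worst:
        raise ValueError("best and worst must differ")
    fraction = (value - worst) / (best - worst)
    return top * min(1.0, max(0.0, fraction))


def normalise_boolean(value: bool, top: float = 5.0) -> float:
    """Map Yes/No onto the ends of the uniform scale."""
    return top if value else 0.0


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of normalised scores; larger weights mark priorities."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Hypothetical weights: cost matters most, then batch support, then speed.
weights = {"cost per object": 5, "seconds per object": 2, "batch processing": 3}

# Hypothetical measurements for two candidate strategies.
strategy_a = {
    "cost per object": normalise_numeric(0.5, worst=2.0, best=0.0),
    "seconds per object": normalise_numeric(4.0, worst=10.0, best=0.0),
    "batch processing": normalise_boolean(True),
}
strategy_b = {
    "cost per object": normalise_numeric(1.0, worst=2.0, best=0.0),
    "seconds per object": normalise_numeric(1.0, worst=10.0, best=0.0),
    "batch processing": normalise_boolean(False),
}

results = {
    "Strategy A": weighted_score(strategy_a, weights),
    "Strategy B": weighted_score(strategy_b, weights),
}
preferred = max(results, key=results.get)
print(results, "preferred:", preferred)
```

Strategy B is faster, but the weighting makes its higher cost and lack of batch processing count for more, so Strategy A scores higher under these assumed priorities.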
Conclusions
• Preservation is an iterative process – must climb many steps to reach the top of the pyramid
• Preservation Planning enables an organisation to understand and document its requirements
• Demonstrates decision-making – inspires confidence & trust
• Not a perform-once-and-forget process; it must be repeated
Discussion points
• Are traditional checksum techniques acceptable for measuring integrity, or do we need a more granular approach?
• How should we utilise & build upon third party services, such as RI Registries & preservation plan tools, to achieve our preservation objectives?
• What would a preservation plan for our scanned images, documents and metadata look like?
Thank you for your attention
Questions?
Gareth Knight, firstname.lastname@example.org