I was asked to give a presentation on some of the ideas which the Digital Preservation team at the NLA has been working on over the last year. These ideas have formed the basis for the requirements and a subsequent tender to replace key components of our Digital Library Infrastructure. The NLA wants to either: source a product that provides this functionality; work with a product to extend its functionality; or build this functionality ourselves.
Like many libraries, the NLA has a very diverse and complex ecosystem. In relation to preservation requirements we have to consider:
• lots of stuff (around 1.5 PB of total data);
• lots of relationships (especially in our domain harvests);
• mixed levels of intellectual control (catalogued at the file level and at the box level); and
• many different format types, possibly requiring different and recurring actions at different times in the life of a digital object.
Because we do not mandate the formats that are accepted into the ecosystem, the NLA will have many formats that we are unable to identify and support.
A general breakdown of the collections is shown:
• The largest proportion of the content in the repository is digitised material (primarily newspapers, although digitised material can be found in almost all other collections).
• More problematic for us is born-digital material, which is also found in most collections, but in lesser volumes.
• Arguably the most problematic collection area is web archives (domain harvests and selective harvests) because of their size, because they can contain potentially anything, and because of their complex relationships.
First, a caveat.
Pete and Jay from the NLNZ and I have been talking for some time about an ecology, or layers of consciousness, for the need for digital preservation intervention. What is presented to you is not a perfect representation of reality. However, it is useful when we are trying to explore whether our aims, goals and expectations of preservation are even remotely compatible. At the NLA we have been trying to change the perception of the library over the last 5-10 years to steer us towards integrated preservation systems. We are currently in the process of trying to achieve this last state. If you in the audience are somewhere else in this ecology, perhaps the rest of my talk will be gobbledegook.
We can see this ecology in another way: high versus low resources; high versus low awareness; and long-term versus short-term retention.
The following is a mixture of observations, common practice and some new ideas.
In order to preserve content over time we believe that we need to do the following: maintain access to the bit streams; maintain access to the content encoded in the bit streams; and maintain access to the meaning of the content. We need to have all of these components covered in order to have a chance to preserve content over time.
Thus, the primary mission of the digital preservation section at the NLA is to maintain the ability to meaningfully access digital collection content over time. For example, two models are presented. In model A, we do nothing, and over time we will lose access not only to the content but also to the bits. In model B, through managed systems we are more likely to maintain access to the bits and, through preservation actions, to the content. However, if we don’t have the context then we will be literally ‘preserving in the dark’.
Furthermore, our preservation processes need to allow us to: understand what is in the collection; understand why we have it and what we want to do with it; understand how to access it; understand when access is going to become, or already is, problematic; continue to take steps to maintain access; and audit our arrangements.
In preservation we talk a lot about risk. The concept of risk is itself risky because we have potentially many different types of risk. Also, obsolescence information is subjective and relative. Much of what we know is a best guess. The only real concrete information that we can get is:
• Can ‘we’ access the format?
• What is our level of support and the vendor’s level of support?
• When is it likely that we won’t be able to access it?
• Does a format have characteristics which are problematic, and which may therefore make it more risky?
Also, when we do use risk metrics in some kind of meaningful way, we tend to lump them into one bucket. However, there are different kinds of risk which are useful for different circumstances in a repository. We have started to refine risk based on a high-level use-case classification. For example:
• parameter-based risk, e.g. specific metadata values which we consider good or bad;
• exception risk, e.g. the format is not valid or did not validate;
• change risk, e.g. a number of files have failed fixity checks;
• conflict risk, e.g. tool X says it is a TIFF, tool Y says that it is a PDF!;
• unknown value risk, e.g. our tools cannot identify 100,000 files in this transfer; and
• access support risk, e.g. we can no longer get access to 1,000,000 files in format X.
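The risk classes above can be sketched as a small classifier. This is only an illustration, not the NLA's system: the `FileReport` fields, the `classify` function and its inputs are all hypothetical names chosen for the example, and a real repository would draw these values from its own identification, validation and fixity tools.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class Risk(Enum):
    PARAMETER = "parameter-based"
    EXCEPTION = "exception"
    CHANGE = "change"
    CONFLICT = "conflict"
    UNKNOWN = "unknown value"
    ACCESS_SUPPORT = "access support"


@dataclass
class FileReport:
    """Hypothetical per-file summary gathered from repository tools."""
    identifications: List[Optional[str]]  # one format guess per tool, None if unidentified
    valid: Optional[bool] = None          # validation result, None if not run
    fixity_ok: bool = True                # did the latest checksum match?
    codec: Optional[str] = None           # example of a monitored metadata value


def classify(report: FileReport, bad_codecs: Set[str], unsupported: Set[str]) -> Set[Risk]:
    """Return every risk class that applies, rather than one lumped score."""
    risks: Set[Risk] = set()
    formats = {f for f in report.identifications if f is not None}
    if not formats:
        risks.add(Risk.UNKNOWN)         # no tool could identify the file
    if len(formats) > 1:
        risks.add(Risk.CONFLICT)        # tools disagree about the format
    if report.valid is False:
        risks.add(Risk.EXCEPTION)       # the file did not validate
    if not report.fixity_ok:
        risks.add(Risk.CHANGE)          # fixity check failed
    if report.codec in bad_codecs:
        risks.add(Risk.PARAMETER)       # matches a staff-defined criterion
    if formats & unsupported:
        risks.add(Risk.ACCESS_SUPPORT)  # no longer able to access this format
    return risks
```

Keeping the classes separate means a report can ask, say, only for conflict risks in one transfer, which is the 'splitter' position rather than the 'lumper' one.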
We are going to need many different types of preservation actions. We want to be both proactive and reactive: to be able to see the current state of the repository as well as to run ‘what if’ scenarios. We need to understand whether we need to take any action on a file, and whether we need to take action on all files of a particular type in the repository or only on those that belong to a particular group (e.g. TIFFs in a particular collection). These actions could be as simple as replacing the access software (not touching the file), as complex as replacing a file and links inside a complex web object, or even building and maintaining emulation environments over time. We also want to be able to get rid of stuff we don’t want (e.g. material that may not be our responsibility or should not have been taken into the collection in the first place). If we are going to take an action on a file, we want to know what about the file is important to the collection owners.
Based on the last point we think that the system at the NLA must take into account: Preservation Intent; Significance; and Level of support of formats and therefore access to the content.
We need a system that can express Pres Intent: does the content need to be preserved? If so, who is responsible for it, for how long, and what aspects? We don’t believe in the ‘it is impossible to define significant properties for digital objects’ school of thought, but we also find significant properties very problematic, so we have adopted a middle position, expressing pres intent at a fairly high level (e.g. want to view, edit, navigate, manipulate the content). This is a collaborative process of defining and articulating how the collections see their content and their required level of support for access to the content over time, including which specific aspects they think are important.
Also, we need to build a system in which an intellectual entity at any given level of granularity can be recorded as being significant. We also need to be able to know what the ‘Level of Support’ is for any given digital object within our ecosystem at a given time; that is, given that we can identify what it is, how well (or not) do we maintain access to the content in this file? This will help us to work out priorities and what pres action/s we will need to take.
So, some of these ideas are expressed on this early painting that we found on a cave wall.
Then we took most of the fun out of it. On your right are knowledgebases and systems that deal with: formats; software; level of support; pres intent; priority; pres actions; pres options; and pres evaluation.
This model is based on being able to access:
• both human- and machine-accessible information; and
• consistent preservation metadata which has been recorded and maintained for every digital object in the repository.
There will also need to be consistent, specific metadata for particular format types (if identified). A summary of this information, which can be grouped into defined sets of managed content (e.g. collections), needs to be readily available.
To start to build a system which looks at pres intent, significance, level of support and other risk metrics, we need to be careful about the level of detail in our system: is it best to have relative indicators, or will we drown in the detail? Having said this, we require:
• relevant information on formats and versions in our system; and
• relevant information on software, versions and dependencies that can access particular formats in our system.
To be able to build relationships between formats and software (specifically, what can open or open/edit a format), we need to know:
• What software is available?
• Do I have it?
• What is the external and internal support?
• What is the proximity of the software to the format, e.g. was this software made for this format or is it generic software?
We also need to:
• take into account Pres Intent and Significance (does it matter?);
• use other risk indicators carefully, in a measured and meaningful way; and
• report on level of support: based on these relationships, we can determine whether we can maintain access to the format and what priority (if any) should be given to its treatment.
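One way to picture a relative level-of-support indicator built from format-to-software relationships is sketched below. It is a minimal sketch under stated assumptions: the `Software` and `Relationship` types, the scoring rule and the four labels ('none', 'low', 'medium', 'high') are invented for illustration, and only the open, open/edit and transcode vocabulary comes from the prototype described in this talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Software:
    name: str
    held: bool              # "Do I have it?"
    vendor_supported: bool  # external support
    internal_support: bool  # internal support
    native: bool            # proximity: made for the format rather than generic


@dataclass(frozen=True)
class Relationship:
    """Links one format version to one piece of software."""
    software: Software
    function: str  # "open", "open/edit" or "transcode"


def level_of_support(relationships: List[Relationship], needs_edit: bool) -> str:
    """Return a relative indicator for one format: none, low, medium or high."""
    # Pres intent decides which functions count: editing needs "open/edit".
    wanted = {"open/edit"} if needs_edit else {"open", "open/edit"}
    usable = [r.software for r in relationships
              if r.function in wanted and r.software.held]
    if not usable:
        return "none"
    # Score each usable tool by how many support factors it satisfies (0 to 3)
    # and let the best tool set the level for the format.
    best = max(sum([s.vendor_supported, s.internal_support, s.native]) for s in usable)
    return {0: "low", 1: "low", 2: "medium", 3: "high"}[best]
```

The indicator is deliberately coarse: it answers 'how well can we get at this format for the intended use?' without recording more detail than a relative system needs.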
There are other parts of this system which we have not prototyped. However, these will need to be built as a part of the new system. These are: preservation monitoring, reporting and prioritisation; preservation options and preservation action planning; and preservation action evaluation.
So I will describe this part of the ecology within the red box.
These are the preservation intent statements which we have currently compiled.
We started with an agreed statement of Pres Intent for each collection. This was divided into a number of parts: the context of the collection and what they collect; the preservation intent of the collection for their identified material; identified collecting issues/limitations in how the material is collected; and other issues which may affect preservation.
We then started to look at how we might systematise this info at a high level. This raised some very interesting questions about vocabulary and granularity. This partially worked but not to my satisfaction.
This table summarises the previous screen. For example, the fields can be characterised as:
• Owner
• Description of material
• Intent: preserve (yes/no), time, what aspects (e.g. view, edit, navigate, manipulate content)
• Responsibility/Authorisation
• Detailed notes
Interestingly, the collections tended to view their material based on ingest workflows and catalogue-level records, not files or formats.
We have a slightly different view on how to describe their material. We tend to think about it in terms of files and formats and not workflows. However, resolving this in a systematic way will be a job for our next system.
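The fields listed above could be captured in a simple record, sketched here. The class name `PreservationIntent`, the field names and the validation rules are assumptions made for illustration; in particular, using `None` for the retention period to mean 'in perpetuity' is a choice of this sketch, not something the prototype is known to do.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# High-level aspects of access named in the talk.
ASPECTS = frozenset({"view", "edit", "navigate", "manipulate"})


@dataclass(frozen=True)
class PreservationIntent:
    owner: str                      # collection that owns the material
    description: str                # description of the material
    preserve: bool                  # is the content to be preserved?
    retention_years: Optional[int]  # None is used here to mean "in perpetuity"
    aspects: FrozenSet[str] = frozenset()
    responsibility: str = ""        # who authorises / is responsible
    notes: str = ""                 # detailed free-text notes

    def __post_init__(self) -> None:
        # A controlled vocabulary keeps statements comparable across collections.
        unknown = self.aspects - ASPECTS
        if unknown:
            raise ValueError(f"unknown aspects: {sorted(unknown)}")
        if not self.preserve and self.aspects:
            raise ValueError("aspects only make sense when preserve is True")
```

A record like this says nothing about whether material is described by workflow or by file and format, so the vocabulary and granularity questions raised in the talk remain open.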
Now I will describe this part of the ecology within the red box. We have prototyped some of these systems, which I will briefly show you.
This is the File Format home page. It contains formats that are relevant to our environment. The levels of abstraction are format family followed by versions.
If we take a look at the entry for TIFF: we spent a lot of time thinking about description headings for the free text (what makes a sustainable format). This is subjective information, but helpful. We also have a controlled vocab which is integrated with the text field. We currently have a staff member working full time populating these fields for 6 months, and hopefully longer.
On the version page we have listed the software that can be used to identify this format (we could have many). This info will be linked back to format metadata summaries from the repository. You can also see the relationship to our software registered in the system; this is expressed by the vocab (open, open/edit and transcode). We could list other info as required.
This is the Software home page. It lists software relevant to our environment. We have not concentrated on this.
If we choose software like Photoshop CS 5, we have a vocab and free-text descriptions. We see: version releases; plug-ins; support levels at the NLA; support levels from the vendor; software and hardware requirements; etc. We came to the conclusion that, in a relative system, major releases are what we should record.
And the most important aspect for establishing the level of support is the format-to-software relationship. The summary list created by these two knowledgebases shows the list of formats that this software can access. These relationships can be used in other knowledgebases and to run reports on access configurations, possible migration paths, and ‘what if’ scenarios.
Another important part of our future system will be the level of support and prioritisation KBs.
As stated, the key to level of support is the relationship between a format and many software instances. However, we could only build that part of this system because, at this time, we:
• currently cannot get consistent info for all files from the repository (except the number of files by collection);
• can only connect to text fields on the pres intent screen;
• have no consistent significance info in the system; and
• have no systematised risk metrics in our current system.
This will change soon!
Another way of looking at the level of support to give us access risk could be:
• the overall level of support by format (including vendor support, internal support and proximity);
• other risk metrics (e.g. has it been deemed obsolete in the outside world?);
• the number of files affected;
• the preservation intent, by collection; and
• any significance info.
Prioritisation of treatment could be based on a summary of all the previous fields, together with:
• any constraints imposed by rights policies or agreements; and
• the amount of resources available.
This summary could give the NLA collections and management the information that they require to prioritise what they want preserved.
We have a number of other modules in this model. For example, pres options and action planning.
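A prioritisation report along these lines might be sketched as follows. The weights, the `Candidate` fields and the `capacity` parameter are all hypothetical; the talk does not say how the factors would be combined, so multiplying an access-risk score by a significance score is just one plausible rule chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical relative scales: weaker support means higher access risk.
SUPPORT_RISK = {"high": 0, "medium": 1, "low": 2, "none": 3}
SIGNIFICANCE = {"low": 1, "medium": 2, "high": 3}


@dataclass
class Candidate:
    collection: str
    fmt: str
    files_affected: int
    support: str                      # level of support for the format
    significance: str                 # as rated by collection curators
    preserve: bool                    # from the collection's pres intent
    rights_allow_action: bool = True  # constraint from rights agreements


def prioritise(candidates: List[Candidate], capacity: Optional[int] = None) -> List[Candidate]:
    """Rank treatable candidates; capacity stands in for available resources."""
    eligible = [c for c in candidates if c.preserve and c.rights_allow_action]
    ranked = sorted(
        eligible,
        key=lambda c: (SUPPORT_RISK[c.support] * SIGNIFICANCE[c.significance],
                       c.files_affected),
        reverse=True,
    )
    return ranked[:capacity]  # a capacity of None keeps the whole list
```

The output is a ranked list for people to review, not a decision: collections and management would still choose what actually gets treated.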
We would like to know what options we can support. For example:
• Report on relationships between specific options which have been linked to specific formats (e.g. migration).
• Report on specific software in our KB which is noted as being relevant for specific preservation actions. For example, tell us all the software we have which can access X and is registered as a migration path.
• Link to other information available through other linked data sources.
The part of the system that looks at generating options should also:
• enable staff to define, approve and prioritise preservation action plans for sets of managed content; and
• support preservation action plans which include:
 – multiple steps, combining manual and automated workflows;
 – replacing files and linkages within a complex object;
 – linking to a specific emulation environment;
 – replacing existing software to change the level of support; and
 – specifying that no action is required.
It should also be able to support simulating changes to the environment.
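The example query ('all the software we have which can access X and is registered as a migration path') could look something like the sketch below. The `SoftwareEntry` record and the `software_for` function are hypothetical stand-ins for whatever the software knowledgebase actually exposes.

```python
from typing import FrozenSet, List, NamedTuple


class SoftwareEntry(NamedTuple):
    """Hypothetical row in the software knowledgebase."""
    name: str
    held: bool               # do we actually have this software?
    opens: FrozenSet[str]    # formats it is registered as able to access
    actions: FrozenSet[str]  # preservation actions it is registered for


def software_for(kb: List[SoftwareEntry], fmt: str, action: str = "migration") -> List[str]:
    """Names of held software that can access fmt and is registered for action."""
    return sorted(s.name for s in kb
                  if s.held and fmt in s.opens and action in s.actions)
```

For instance, `software_for(kb, "X")` would list the held tools that can open format X and are flagged for migration, which is the starting point for building a migration option.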
And finally, pres options evaluation
Ultimately, we want to be able to tell if what we have planned is any good before we start any processing in the repository that could take some time.
Currently, these ideas and requirements have become ‘partially real’ (almost like ‘Mostly Dead’ from the movie The Princess Bride). They still need to be implemented. They formed the basis for the preservation requirements in a subsequent: RFP (Request for Proposal) process; and RFT (Request for Tender) process.
RFP: Went to market in July 2011. A number of responses were received for: core systems; preservation; digitisation; and other workflows. Selected vendors were invited to participate in the next stage.
RFT: Closed at the end of December 2011.
So which version of reality have we decided upon? The evaluation report has recommended that the Library proceed to contract negotiations with selected tenderers for each scope of work. Currently the Library is preparing a submission for ministerial approval prior to commencement of contract negotiations with vendors. Thanks for your time.
Dave Pearson The Adventures of Digi
The Adventures of Digi: Ideas, Requirements and Reality
David Pearson, National Library of Australia
Future Perfect 2012
Digi by Imogene Pearson (7 years) (March 2012)
1.) Some Context
From a preservation point of view, the Library’s digital collections present:
• A mix of materials needing to be kept in perpetuity, along with materials that can be discarded after specified periods or events;
• Mixed levels of complexity in terms of object structure, relationships and dependencies;
• Mixed levels of intellectual control;
• A wide range of file formats (and carrier formats);
• Different levels of complexity in preservation planning and processing;
• Different timetables for preservation action;
• A need for different preservation approaches, often at different scales; and
• A need for recurring, and possibly changing, preservation action cycles over time, using a changing suite of tools.
Ecology
Ecology or layers of consciousness for the need for digital preservation intervention (given some need to access content over time)
Unaware:
• I am unaware if I have any digital content; or
• I am unaware if I may have a problem accessing any of my digital content.
Aware – no response:
• I don’t think that I have a problem accessing any of my digital content;
• I recognise that I have a problem accessing some of my digital content;
• I recognise that I have a problem accessing some of my digital content. However, the problem is not my problem; or
• I recognise that I have a problem, but have no response in place, not even a limited one.
Aware – taking some action:
• I accept that I may have a problem accessing some of my digital content. I am taking limited actions to manage this problem; or
• I accept that I may have a problem accessing some of my digital content. The preservation mandate is a part of my enterprise or system ecology.
Another way of looking at it might be: David Pearson 2012
3.) What we have come to understand over time. http://www.motifake.com/79532 via Google Images
Preservation responsibilities:
Preservation of the Library’s digital collections involves three main goals:
• Maintaining access to reliable data at bit-stream level;
• Maintaining access to content encoded in the bit streams; and
• Maintaining access to the intended and available meaning of the content.
While specific preservation activities may focus on one or more of these goals, the Library’s preservation responsibility is only fulfilled when all three goals have been adequately addressed.
This responsibility applies across all digital collections, subject to curatorial and policy decisions for specific groups of digital objects.
Mission: The primary objective of preservation activities within the NLA is to maintain the ability to meaningfully access digital collection content over time.
[Diagram, David Pearson 2012: models A and B shown over time. Labels: ‘Logical on Physical Stuff’; Contextual Information – About Content; Dependency Information – About Formats etc.; Systems to Ingest, Manage, Report and take Actions; Systems to Access – Master or Derivative; ‘Stuffed?’. Image via Google Images.]
Required preservation processes
The Library must be able to:
• Understand what it holds in its collections;
• Understand what its preservation intentions are for every digital object and what it is entitled to do to realise its intentions;
• Understand what is required to provide access, existing inhibitors to access, and the current level of support the Library is able to provide;
• Evaluate and monitor the degree of risk arising from collection composition, preservation intentions and available level of support within the Library for digital collection content, and monitor for risk conditions arising during general Digital Library operations;
• Anticipate the effects of changes in support;
• Recognise planning triggers, and plan and take appropriate action on a scale appropriate to the size of the target; and
• Audit the effectiveness of its preservation arrangements and modify the arrangements if necessary.
Risk or ‘Risk-on’ (are you a splitter or a lumper?)
• ‘parameter-based’ risks: a match against a criterion defined by Library staff to indicate a preservation risk – for example, video encoded with a codec considered to be problematic;
• ‘exception’ risks: the value of a monitored parameter is outside a set of acceptable values;
• ‘change’ risks: there has been a change in status for a monitored parameter for content – for example, the confidence in format identification for a particular file has changed;
• ‘conflict’ risks: conflicting values for the parameter are reported by one or more tools – for example, file format identification returns conflicting values;
• ‘unknown value’ risks: undetermined values for defined parameters – for example, undetermined values for file format and version;
• ‘access support’ risks: changes in level of support which affect the Library’s ability to access content in accordance with preservation intent and significance – for example, reduction below an acceptable threshold in the availability of supporting software for a particular file format; and
• ‘content-based’ risks: characteristics of content that may not be identifiable from metadata – for example, presence of deprecated HTML tags.
Likely preservation treatment actions
Broad preservation action approaches that are likely to be required will include:
• Format migration at the point of collecting;
• Format migration on recognition of risks;
• Format migration at the point of delivery;
• Emulation of various levels of software and hardware environments;
• Maintenance or supply of appropriate software or hardware;
• Documenting known problems for which no other action can be taken; and
• Deaccessioning or deletion.
Prioritising Preservation Treatment:
The Library expects to take into account indicators of ‘preservation intent’, ‘significance’, and ‘level of support’ within monitoring and reporting activities, and in evaluations of risk and prioritisation for preservation planning and action. http://callmemilo.deviantart.com/art/Thunderbirds-are-GO-20717927
Preservation intent – indicates the expectations for preservation for content:
• whether content is to be preserved;
• who is responsible for preservation of the content;
• the period over which content must be preserved;
• the required level of support for access to the content over time; for example, that the Library intends to actively maintain the ability to both present and modify content, or only to present content, or does not intend to actively maintain access to content beyond its expected useful life.
• Preservation intent may also extend to include more specific characteristics to be supported, based on curatorial input or constraints imposed by rights policies or agreements with rights holders.
Significance – indicates the relative priority required for taking preservation action to maintain access to content, as determined by collection curators; for example, content rated as highly significant would be prioritised for preservation planning and action before content of lower significance.
Level of support – indicates how well a digital collection object is supported within the Library, based on a combination of how much is known about the object and its components (including their file formats), and the degree to which supporting software or hardware environments are available.
NLA Image
Preservation assessment and reporting
The Library must be able to review the composition and characteristics of its digital collections to assess trends that may affect preservation management, to aid setup of preservation monitoring, planning and action, and to report on specific aspects of content when necessary.
A solution must enable staff to define and request, on both an ad hoc and scheduled basis:
• summary reports of content, metadata characteristics and risks across collections or defined sets of managed content;
• detailed metadata reports for individual items or sets of items; and
• audit trail history reports for individual items or sets of items.
Reference knowledgebases (General)
Enable staff to create, update and maintain reference information knowledgebases on:
• File formats and versions;
• Software and hardware components that support access to file formats and versions, for maintaining access to managed content;
• The level of support available for particular file formats and versions:
 – i. sets of software or hardware components available to support access to formats;
 – ii. functions supported, both for providing access to content and for use in preservation action – for example, presentation, modification, batch processing;
 – iii. fidelity of support – how well functions are supported; and
 – iv. known risks, including potential inhibitors to preservation, associated with formats or supporting software or hardware; and
• Preservation intent descriptions and parameters for sets of content.
Other systems are also required to interrelate in this ecosystem, such as:
• Preservation monitoring, reporting and prioritisation
• Preservation options and preservation action planning
• Preservation action evaluation
5.) Pres Intent (current NLA prototype) David Pearson 2012
Collections
• Preservation Intent - Asian Collections and Overseas Collections Management (Version 1.0)
• Preservation Intent - Australian Books and Serials (Version 1.0)
• Preservation Intent - Dance (Version 1.0)
• Preservation Intent - Manuscripts (Draft)
• Preservation Intent - Maps (Version 1.0)
• Preservation Intent - Music (Draft)
• Preservation Intent - Newspaper Digitisation (Version 1.0)
• Preservation Intent - Oral History (Unknown)
• Preservation Intent - Pictures (Version 1.0)
• Preservation Intent - Selective Web Harvesting (Version 1.0)
• Preservation Intent - Web Domain Harvests (Version 1.0)
An attempt to systematise Pres Intent (requires some additional thinking)
This is how we tended to think about it (a job for a new system).
6.) Info on Formats, software and level of support (some prototyping) NLA 2011
7.) Level of support and Prioritisation NLA 2011
Level of support (an early concept model) DP 2011
Prioritising preservation treatment based on level of support
In evaluations of risk and prioritisation for preservation planning and action, we must take into account the Level of Support/Access Risks and:
• Any constraints imposed by rights policies or agreements; and
• The amount of resources available.
Based on these factors, the Library (Management, Collections and Digi Pres) should be able to prioritise material to be preserved.
8.) Preservation actions and options generation NLA 2011
Options for preservation actions
We would like to be able to enable staff to:
• define types of preservation actions for use within preservation planning and evaluation;
• update and delete reference information on options for preservation action, both in general and for particular formats or format types;
• link to information available from the software KB which provides information on what actions specific software might be useful for and the proximity of the software to the format; and
• link to other linked data sources.
Pres action options generation
The Library must be able to test and evaluate preservation action plans to determine if they satisfactorily achieve the preservation intent for managed content. For example, a solution should:
• enable staff to develop and test executable preservation action plans for sets of managed content, including:
 – Single and multiple step actions (combining manual and automated workflows)
 – Replacing file/s and linkages in complex objects
 – Linking to a specific emulation environment (if available)
 – Replacing access software
 – Specifying that no action is required
• support simulations or testing of preservation actions against a content testbed. For example, enable staff to perform ‘what if’ simulations to determine the impact of changes to availability of support for access, including:
 – a. Removal of software or hardware sets supporting access, to assess risks or impacts on access; and
 – b. Addition or revision of software or hardware sets supporting access, to assess proposed remedial preservation action plans.
• enable staff to define quality assurance criteria for preservation action plan outcomes.
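A 'what if' simulation of the kind described in (a) and (b) can be sketched very simply. This is an illustrative toy under stated assumptions: the `simulate` function and the shape of the `access` mapping (format name to the set of software able to open it) are invented for the example and are not taken from any NLA system.

```python
from typing import Dict, Iterable, Optional, Set


def simulate(access: Dict[str, Set[str]],
             remove: Iterable[str] = (),
             add: Optional[Dict[str, Set[str]]] = None) -> Set[str]:
    """Return the formats left with no supporting software after a change.

    access maps each format to the software that can currently open it.
    remove lists software to withdraw (case a); add maps formats to newly
    available software (case b). The input mapping is never modified.
    """
    removed = set(remove)
    # Work on a copy so the registered state of the repository is untouched.
    after = {fmt: set(tools) - removed for fmt, tools in access.items()}
    for fmt, tools in (add or {}).items():
        after.setdefault(fmt, set()).update(tools)
    return {fmt for fmt, tools in after.items() if not tools}
```

Running the same call first with only `remove` and then with `remove` plus `add` shows whether a proposed remedial plan actually closes the gap that the removal opened.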
Preservation options evaluation
• support import and integration of preservation-treated content and metadata, from either internal or external processes, including:
 – a. Verifying that preservation-treated digital content conforms to acceptance criteria for preservation outcomes for designated sets of digital content;
 – b. Enabling staff to quality assure and approve preservation-treated digital content for incorporation into the collection; and
 – c. After approval, sending to the preservation action scheduler for treatment of file/s, metadata and associated relationships.
• support ‘rollback’ of updated versions of content, metadata and associated relationships to restore previous versions, if necessary.
• enable staff to define and approve acceptance criteria for preservation action outcomes for sets of managed content.
10.) So what!
Currently, these ideas and requirements have become ‘partially real’. They still need to be implemented. They formed the basis for the preservation requirements in a subsequent:
• RFP (Request for Proposal) process; and
• RFT (Request for Tender) process.
http://www.wildsound-filmmaking-feedback-events.com/images/austin_powers_dr_evil.jpg
RFP
So all of these ideas were consolidated as requirements for a Request for Proposal which went to the market in July 2011. A number of responses were received for:
• Core systems
• Preservation
• Digitisation
• Other Workflows
These were evaluated and some of the vendors were invited to participate in the next stage.
http://www.melbournesumos.com.au/pics/twister/Twister078.jpg
RFT
Based on the RFP, the NLA clarified the requirements for the next process. A select group from the RFP process was invited to participate in a Request for Tender, which closed in late December 2011.
http://simpro.co/wp-content/uploads/2010/10/paperwork2.jpg
What version of reality have we decided upon?
Everything, for Everyone Forever
http://www.flickr.com/photos/ricksmit/15671245/