Can repositories provide digital preservation services, and why should you be interested? Digital preservation (DP) refers to any actions needed for long-term continuity of, and access to, digital information. You may already run a repository that stores objects, manages metadata and delivers content to users, but there may come a day when some additional intervention is needed to ensure this continuity. One method is file format migration. It’s not something that happens automatically in a repository, so you have to plan for it, integrate its actions into the workflow, and keep track of what is being done through the database. One of the ways I teach it is by using this reference model developed by NASA, now a standard in the DP world. Note some basic functions, many of which are already provided by a repository.
Here we see EPrints' attempt to map to OAIS. This was devised by one of the EPrints team at Southampton. He mostly just renamed some of the OAIS entities. Note how it puts the EPrints IR in the position of 'Data Management', while the Archival Storage box has become a dedicated Preservation Service Provider. This feels like a credible workflow which matches OAIS. This service doesn't have to provide the Access function, since it's assumed the IR can do that already. It looks great for access, but it doesn’t fully explain how it’s going to do archival storage. The implication is that another new service has to be built, one that EPrints can integrate with. That said, EPrints does offer a lot of plugins and tools that will perform ingest as described earlier.
Here’s Fedora, another IR package. They have a slightly more complex approach because they use a very specific digital object model which they call FOXML. It’s an XML format that structures each object as a thing with a unique ID, its properties, its datastreams, and its dissemination copies in a single wrapper. Note their version of Archival Storage is more articulated than the one on the previous slide. XML objects and bitstreams are stored separately. Their ingest explicitly includes validation of SIPs. Their idea of Access also shows they are thinking of dissemination copies, and about metadata. The governing database function, I assume, sits at PolicyEnforce and FOXML.
Here’s that same sequence again, with some possible tools we could use to do these specific actions. [May need to explain some of these] PREMIS metadata is also going to help, by keeping a detailed log of all these events, with results / outcomes. (More on PREMIS after this). Ideally we are looking for a system that can integrate all of these external tools and plugins, and automate this process. We want to be able to batch process many objects quickly; and we want to run scripts, some of which can invoke other scripts. All of these steps should go into a procedure which takes place as seamlessly as possible. And here are some possible solutions…
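To make the idea of a scripted, logged procedure a little more concrete, here is a minimal Python sketch of a batch runner that applies a list of steps to each file and keeps an event record for every action. Everything in it is an assumption for illustration: `run_ingest`, `record_size` and `reject_empty` are hypothetical names, the two steps are stand-ins for real tools, and the record keys are only loosely borrowed from PREMIS event semantic units (this is not a valid PREMIS serialization).

```python
from datetime import datetime, timezone
from pathlib import Path


def record_size(path: Path) -> str:
    """Placeholder step: report the file size."""
    return f"{path.stat().st_size} bytes"


def reject_empty(path: Path) -> str:
    """Placeholder step: fail on zero-byte files."""
    if path.stat().st_size == 0:
        raise ValueError("file is empty")
    return "not empty"


def run_ingest(paths, steps):
    """Run each (event_type, step) pair on each file and log the outcome."""
    events = []
    for path in paths:
        for event_type, step in steps:
            try:
                outcome, detail = "success", step(path)
            except Exception as exc:  # any step failure is logged, not raised
                outcome, detail = "failure", str(exc)
            events.append({
                "object": str(path),
                "eventType": event_type,
                "eventDateTime": datetime.now(timezone.utc).isoformat(),
                "eventOutcome": outcome,
                "eventOutcomeDetail": detail,
            })
            if outcome == "failure":
                break  # stop processing this object, move on to the next
    return events
```

In a real workflow each placeholder step would call out to one of the external tools, and the event list would be written to the repository database rather than held in memory.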
Now if we were interested in implementing a migration policy as part of a DP strategy, one thing we would want to do is take a bit more care in identifying the file formats that are ingested. The DROID tool can do just this. It integrates with EPrints, so it can be automatically included as a check in the workflow, and its outputs can automatically be added to the repository database. This means that when we come to a migration we can
Repositories and Preservation by Ed Pinsent
Repositories and Preservation
Ed Pinsent, ULCC / LEAP IRM Workshop, 15 June 2012
Why preservation?
• Long-term value
• Legal needs – compliance or rights
• Business needs
• Cost a lot to produce
• Enhance reputation
May depend on…
• Institutional commitment to doing it
• Mandate to preserve
• Business drivers
• Advocacy for “best practice”
• “We are all in the business of knowledge and its preservation”
Open Archival Information System (OAIS) Reference Model
Ingest procedure… + tools
1. Fixity generation – MD5 checksum
2. Virus checking – AVG
3. Format identification – DROID + PRONOM
4. Format validation – JHOVE
5. Environmental metadata extraction – NLNZ Metadata Extract Tool
6. Format specific metadata extraction – NLNZ Metadata Extract Tool
7. Store in digital archive – software script
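As an illustration of step 1, fixity generation can be sketched in a few lines of Python using the standard `hashlib` module. The function names `md5_checksum` and `verify_fixity` are hypothetical, chosen for this example; the slide does not say which tool or script generates the checksum.

```python
import hashlib


def md5_checksum(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected):
    """Compare a freshly computed checksum with the value stored at ingest."""
    return md5_checksum(path) == expected.lower()
```

The checksum is generated once at ingest and stored as metadata; running `verify_fixity` later against that stored value shows whether the bitstream has changed in the meantime.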
Integrated tools and plugins
• DROID (Digital Record Object Identification) - an automatic file format identification tool
  – Looks up live data held in PRONOM
  – Outputs definitive file format profile, size, extension + checksum
• EPrints plugin
  – Automates process
  – Stores in database as metadata
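The general idea behind signature-based identification can be sketched as follows. This is not DROID's algorithm or API, and it does not consult PRONOM: the `SIGNATURES` table is a tiny hand-written assumption covering a few well-known file headers, and `identify` is a hypothetical name. It only shows why looking inside the file is more reliable than trusting the extension, and what a metadata record for the database might contain.

```python
from pathlib import Path

# Illustrative subset of leading-byte signatures (assumption, not PRONOM data).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(path):
    """Return a small metadata record based on the file's leading bytes."""
    path = Path(path)
    with open(path, "rb") as handle:
        header = handle.read(16)
    fmt = next(
        (name for magic, name in SIGNATURES if header.startswith(magic)),
        "unknown",
    )
    return {
        "name": path.name,
        "size": path.stat().st_size,
        "extension": path.suffix.lower(),
        "format": fmt,
    }
```

A file called `report.doc` that actually begins with `%PDF-` would come back with extension `.doc` but format `PDF`, which is exactly the kind of mismatch worth recording before planning a migration.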