Developing & Implementing Tools for Managing Hybrid Archives - Notes


Published in: Technology, Education

Developing and Implementing Tools to Manage Hybrid Libraries
By Peter Cliff
21st May 2010, Centre for Archives and Information Studies, Dundee

The futureArch project is a Mellon-funded project to provide the Bodleian Library, and the wider community, with the skills and infrastructure to archive and present born-digital manuscripts. We call the service the project is establishing “Bodleian Electronic Archives & Manuscripts”, or BEAM.

I am assuming a predominantly non-technical audience (at least in the Information Technology sense – archivists bring their own technical expertise and terminology), and so rather than bore you with the relative merits of working in Java and Eclipse RCP, this talk will bore you with the philosophy that underpins BEAM development work.

A simple view of an archive work flow

I'm starting from a very simple model of the way archiving works at the Bodleian. Of course, there is more to it than this! There are four steps – the arrival of the material at the Library (“transfer”), the storage of that material, the processing of that material (“cataloguing”) and finally the use of that material by researchers.

We can also insert a step between arrival and storage, which I'll call “quarantine”, where the materials are examined and any risks to the incoming material, or to materials already in the archive, are assessed. These risks could be active mould, insects, hazardous materials, and the like.

It is also important to notice that there can be a significant period of time between storage and any kind of processing. Often materials arrive, are put in a “safe place” and are left there until it becomes politically sensible, or the funds arrive, to process the collection: appraising, cataloguing and preparing it for presentation.

Augmenting the work flow

It would not be appropriate (or indeed possible) for the futureArch project to change the working practices of the Bodleian archivists.
While significant cultural change will probably result from the introduction and subsequent ubiquity of digital materials, this change must happen at a natural pace. By establishing itself as a service to archivists (perhaps like conservation), BEAM neatly sidesteps the need to enforce cultural change. Instead we are attempting to make the handling of digital material as easy as possible for the archivists. Because of this we are not replacing the archival work flow. Rather, we are inserting hooks into it, augmenting it to enable the archivist to continue their work with digital as well as paper-based materials.

The first “hook” is into the aforementioned materials risk assessment (“quarantine”). We are encouraging archivists to identify whether the collection contains digital material as soon as possible, and this is the time to do it. The reason for early identification is twofold. Firstly, we talk a lot about “born-digital” but I think we need to think of it as “born-mouldy”. This is not to scare people, but rather to highlight that digital material is decaying from the moment it is created. (This includes reviewing collections already accessioned.) Perhaps more important is that BEAM needs to know early what it will be processing, to ensure the materials can be prepared in time to meet the needs of the cataloguers and to allow BEAM to plan for any capture, migration and storage of the digital material.

The second hook is the transfer of the digital material to BEAM staff for processing. We manage this with that most technical of inventions – the separation sheet. The digital material is removed from its original place in the accession, and replaced with a sheet outlining what has been removed. The material travels with an accompanying sheet – a copy of the one left in the box – to enable us (should we need to) to reunite the digital media with the accession once the digital manuscripts on those media have been recovered. We also have a “Collections Management Database” which assigns a unique identifier to the collection. That number then goes on all the materials associated with that collection, including the digital material's metadata.

That is the transfer of the physical media, but as we move forward we are also developing digital transfer – from the donor (or an archivist helping the donor) direct to BEAM. This means we can collect digital materials on or near their creation date rather than years later. Doing so helps avoid obsolete formats and hardware.

The third hook is the presentation of the digital materials back to the archivist for use during the appraisal and cataloguing steps in the work flow. This means the archivist does not need to worry about the formats or the hardware (unless they want to – we record metadata about these things), but can focus on the content of the materials.

The fourth hook is the presentation of the digital materials to researchers. The Bodleian Library collections are made available in the reading rooms only. This paradigm has not changed with digital materials, though I envisage it will, due to reader requests. Given that, BEAM is building a system whereby digital manuscripts can be used alongside the paper in the reading room. This is done using a customised laptop.
The laptop is configured to ensure the materials cannot be transferred onto other media and removed from the reading room. This is achieved with a combination of encryption, virtual machines and a bit of luck! (There will be further details on this available as the project progresses.)

BEAM work flow

That is, in a nutshell, how BEAM might augment the existing archivist work flow, but what are we doing within the BEAM service to manage digital materials? At a very high level, the answer is not much different. The BEAM work flow borrows shamelessly from the archival work flow, so we have steps for transfer, storage, cataloguing and usage as well. We also add digital preservation actions – monitoring the stores – which is much the same as ensuring the stacks do not get wet, etc.

The process of capturing digital materials from their media into a safe digital store varies depending on the type of material. We have used everything from sophisticated digital forensics to brute force: hacking together bits of old machines purchased from auction sites in the hope they'll read obsolete disks. I'll talk a bit more about storage in a moment.

The cataloguing is achieved in both simple and complex ways. There are the free-text notes the archivists may have made during the capture process. On ingest into the store, automatic metadata generation tools start up and process the digital material, firstly identifying any material at risk due to obsolescence or intrinsic hazards (like viruses), and secondly, where possible, extracting text, titles, keywords, etc. After the automated processes have completed, the archivists can view that metadata and add their own.
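To make the ingest step above a little more concrete, here is a minimal sketch in Python of what automatic metadata generation might look like. It is not BEAM's code: the function names, the record fields and the list of "at risk" extensions are all assumptions made for illustration, and real format identification and virus scanning would rely on dedicated tools rather than file extensions.

```python
import hashlib
from pathlib import Path

# Hypothetical list of extensions treated as at risk of obsolescence.
# A real service would use a proper format identification tool instead.
AT_RISK_EXTENSIONS = {".wps", ".wpd", ".lwp"}


def describe_file(path: Path) -> dict:
    """Generate a basic, automatic metadata record for one file."""
    data = path.read_bytes()
    suffix = path.suffix.lower()
    return {
        "name": path.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "extension": suffix,
        # Flag material that may need attention before it can be presented.
        "at_risk": suffix in AT_RISK_EXTENSIONS,
        # Left empty for the archivist to fill in after the automated pass.
        "archivist_notes": "",
    }


def describe_accession(root: Path) -> list[dict]:
    """Describe every file under root, in a stable (sorted) order."""
    return [describe_file(p) for p in sorted(root.rglob("*")) if p.is_file()]
```

Calling `describe_accession(Path("some/accession"))` on a directory of captured files returns one record per file. The empty `archivist_notes` field stands in for the point above that archivists add their own metadata once the automated processes have finished.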
I spoke a little about hybrid usage above. Suffice to say we are trying to support a hybrid reading room as much as possible and so “use” is blended here. In time, I would envisage cataloguing becoming more of a mixed environment, but at the moment there are established methods of creating metadata for the paper materials and still-developing methods for the digital, and the two remain discrete.

BEAM Storage

The Bodleian Library is in the process of building a digital asset management system (DAMS). BEAM will use storage and services offered by that system. The DAMS will provide two geographically and technically separate sites for storage. Geographic separation is to mitigate physical risks such as fire, flood and theft. Technical separation is to enable us to use “tried and tested”, albeit limited, file store technology alongside what we believe to be self-healing, resilient and reliable storage technology that does not yet have a long track record. Should a fundamental bug be found in one storage technology, it is hoped it will not be manifest in the other.

It is essential for our needs that the storage is local to our processing room. It is not on any public network and probably never will be. Using storage “in the cloud” is not an option for us. This is mainly a matter of donor trust: our donors seem reluctant to have their data housed with a third party.

While the DAMS will provide various interfaces to store data, including object-based storage, BEAM has specified the use of simple file systems. This has been the result of much discussion but is essentially founded on the principle that simple file systems have withstood the test of time and provide for the “natural (file) order” of the digital materials. As we will be storing compound objects, it is also difficult to see how treating a bundle of, say, 20,000 files as a single object will help.
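As an illustration of how two copies held on simple file systems might be checked against each other, the following Python sketch builds a checksum manifest for a directory tree and compares two manifests. The function names and the choice of SHA-256 are assumptions for the example; the talk does not say which tools or algorithms the DAMS uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large media images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Report files that are missing, unexpected or changed in `actual`."""
    shared = expected.keys() & actual.keys()
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "changed": sorted(k for k in shared if expected[k] != actual[k]),
    }
```

Running `compare_manifests(build_manifest(site_a), build_manifest(site_b))` on two copies of the same store should report three empty lists. Because the manifest is keyed by relative path, the original file order and directory structure stay visible in the report.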
Given this, if you have access to a well-maintained, secure, backed-up and checksum-monitored file system, you can do digital preservation. Repository systems, for us, fulfil a different (though no less important) function.

Ingester Demo

In order to recreate the demo experience :-) you can download the Ingester: Once unpacked, refer to the BEAM Ingest Quick Start Guide available in the BEAM Ingest directory.

Tool Building Principles

Where possible BEAM will use, contribute to and create Open Source Software. Of course, there are ideological and cost reasons regarding freedom, but for us it is also about practicality. We do not want to worry about licensing when archiving our software tool chain.

Where appropriate we will use Open Standards. Open Standards are a very good thing, but they must be used appropriately. It seems foolish to attempt to adopt an open standard for internal use only, when a simple home-grown system would do the job and be easier to implement. Trying to use a standard just because it's there can lead to having to jump through hoops that are probably unnecessary. That said, developing with standards in mind is important, and if there is one that meets your needs, use it! We have adopted METS and EAD at BEAM. Open standards matter more at the boundaries of the preservation system, so keep in mind mapping on export and when exposing your data to the world.

We have built a system that has the shortest path to stabilisation. That is to say, digital materials come off their media, have a minimal amount of processing done, and are then put into the preservation store whole. This is to support the previously mentioned passage of time between storage and cataloguing. Data sitting in a quarantine zone is not preserved, and may not be checked for some time, so it needs to be moved to storage quickly.

We currently keep both extracted files and media “images” (bit-by-bit copies of the media). This is both because we want to avoid second-guessing future researchers and because we do not have time to appraise all the digital material on some media (anything larger than a 2GB hard drive). In a similar vein, we also try to transform obsolete files as late as possible – usually for presentation, rather than at ingest. This is because it is likely such transformations will be done better in the future. Both these ideologies will have to be reassessed when we run out of storage capacity!

Finally, and perhaps most importantly, we are adopting a modular approach to our system. Imagine a car where all the parts were glued together: it would probably work fine until one thing failed – say the headlamp bulb. If that bulb were not replaceable you would have to replace the entire car to get it fixed. Cars have replaceable headlamps, and we want the preservation system to have replaceable parts too. Such an approach has numerous advantages, including the ability to scale the capacity of processing and storage by adding modules, and the ability to upgrade or replace parts of the system independently of the others – essential for the long-term sustainability of the system. This also means that if you can only manage to do one part of the system, then just do that one part – for example, just keeping stuff in a secure store. You can then add modules as they are built.
In this way you can start small and grow as and when you can or need to.

The end

The final slide shows some of the further tools futureArch hopes to develop for BEAM – either by adopting existing products, or building them ourselves. It is an ambitious project, but we're trying hard!
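The modular approach described under Tool Building Principles can be sketched in a few lines of Python. This is only an illustration of the idea of replaceable parts, not a description of how BEAM's tools are structured: the `Module` type, `run_pipeline` and both example modules are hypothetical names invented here.

```python
from typing import Callable

# A record is a plain dictionary of metadata about one item.
Record = dict
# A module is any function that takes a record and returns an updated copy.
Module = Callable[[Record], Record]


def run_pipeline(record: Record, modules: list[Module]) -> Record:
    """Pass the record through each module in turn."""
    for module in modules:
        record = module(record)
    return record


def store_module(record: Record) -> Record:
    """Hypothetical module: mark the item as placed in the secure store."""
    return {**record, "stored": True}


def identify_module(record: Record) -> Record:
    """Hypothetical module: note the file extension as a crude format hint."""
    name = record["name"]
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return {**record, "extension": extension}
```

A service that can only manage secure storage would run `run_pipeline(record, [store_module])`. Adding identification later means changing the list to `[store_module, identify_module]`, and swapping in a better identification module does not touch the storage step.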