Developing & Implementing Tools for Managing Hybrid Archives - Notes
By Peter Cliff
21st May 2010, Centre for Archives and Information Studies, Dundee
The futureArch project is a Mellon-funded project to provide the Bodleian Library, and the wider
community, with the skills and infrastructure to archive and present born-digital manuscripts. We
call the service the project is establishing “Bodleian Electronic Archives & Manuscripts” or BEAM.
I am assuming a predominately non-technical (at least Information Technology technical –
archivists bring their own technical expertise and terminology) audience and so rather than bore you
with the relative merits of working in Java and Eclipse RCP, this talk will bore you with the
philosophy that underpins BEAM development work.
A simple view of an archive work flow
I'm starting from a very simple model of the way archiving works at the Bodleian. Of course, there
is more to it than this! There are four steps – the arrival of the material at the Library (“transfer”),
the storage of that material, the processing of that material (“cataloguing”) and finally the use of
that material by researchers.
We can also insert a step between arrival and storage, which I'll call “quarantine”, where
the materials are examined and any risks to either the incoming material or materials already in the
archive are assessed. These risks could be active mould, insects, hazardous materials, and the like.
Also important to notice is that there can be a significant period of time between storage and any
kind of processing. Often materials arrive and are put in a “safe place” and left there until such
time as it becomes politically sensible, or the funds arrive, to process the collection: appraising,
cataloguing and preparing it for presentation.
Augmenting the work flow
It would not be appropriate (or indeed possible) for the futureArch project to change the working
practices of the Bodleian archivists. While significant cultural change will probably result from the
introduction and subsequent ubiquity of digital materials, this change must happen at a natural pace.
By establishing itself as a service to archivists (perhaps like conservation), BEAM neatly sidesteps
the need to enforce cultural change. Instead we are attempting to make the handling of digital
material as easy as possible for the archivists. Because of this we are not replacing the archival
work flow. Rather, we are inserting hooks into it, augmenting it to enable the archivist to continue
their work with digital as well as paper-based materials.
The first “hook” is into the aforementioned materials risk assessment (“quarantine”). We are
encouraging archivists to identify whether the collection contains digital material as soon as
possible, and this is the time to do it. (This includes reviewing collections already accessioned.)
The reason for early identification is twofold. Firstly, we talk a lot about “born-digital” but I
think we need to think of it as “born-mouldy”. This is not to scare people, but rather to highlight
that digital material is decaying from the moment it is created. Perhaps more important is that
BEAM needs to know what it will be processing early, to ensure the materials can be prepared in
time to meet the needs of the cataloguers and to allow BEAM to plan for any capture, migration
and storage of the digital material.
The second hook is the transfer of the digital material to BEAM staff for processing. We manage
this with that most technical of inventions – the separation sheet. The digital material is removed
from its original place in the accession, and replaced with a sheet outlining what has been removed.
The material travels with an accompanying sheet – a copy of the one left in the box – to enable us
(should we need to) to reunite the digital media with the accession once the digital manuscripts on
those media have been recovered. We also have a “Collections Management Database” which assigns a unique
identifier to the collection and that number then goes on all the materials associated with that
collection, including in the digital material's metadata.
That is the transfer of the physical media, but as we move forward, we are also developing digital
transfer – from the donor (or an archivist helping the donor) direct to BEAM. This means we can
collect digital materials on or near creation date rather than years later. Doing so helps avoid
obsolete formats and hardware.
The third hook is the presentation of the digital materials back to the archivist for use during the
appraisal and cataloguing steps in the work flow. This means the archivist does not need to worry
about the formats or the hardware (unless they want to – we record metadata about these things),
but can focus on the content of the materials.
The fourth hook is the presentation of the digital materials to researchers. The Bodleian Library
collections are made available in the reading rooms only. This paradigm has not changed with
digital materials, though I envisage it will in response to reader requests. Given that, BEAM is
building a system whereby digital manuscripts can be used alongside the paper in the reading room. This is
done using a customised laptop. The laptop is configured to ensure the materials cannot be
transferred onto other media and removed from the reading room. This is achieved with a
combination of encryption, virtual machines and a bit of luck! (Further details will be
available as the project progresses.)
BEAM work flow
That is, in a nutshell, how BEAM might augment the existing archivist work flow, but what are we
doing within the BEAM service to manage digital materials? At a very high level, the answer is not
much different. The BEAM work flow borrows shamelessly from the archival work flow, so we
have steps for transfer, storage, cataloguing and usage as well. We also add digital
preservation actions – monitoring the stores – which is much the same as ensuring the stacks do not
get wet, etc.
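As a rough illustration of what “monitoring the stores” can mean on a plain file system, the sketch
below records a checksum for every file and later reports anything missing, changed or unexpected.
It is not BEAM's tooling: the function names and the choice of SHA-256 are assumptions made for
this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(store: Path) -> dict[str, str]:
    """Map each file's path (relative to the store) to its checksum."""
    return {
        p.relative_to(store).as_posix(): sha256_of(p)
        for p in sorted(store.rglob("*"))
        if p.is_file()
    }


def check_store(store: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the store as it is now against a manifest recorded earlier."""
    current = build_manifest(store)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
    }
```

Run on a schedule against a manifest kept apart from the store, a check like this is one simple way
to notice silent damage; three empty lists mean every file still matches its recorded checksum.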
The process of capture of digital materials from their media into a safe digital store varies
depending on the type of material. We have used everything from sophisticated digital forensics to
brute-force hacking together of bits of old machines purchased from auction sites in the hope
they'll read obsolete disks. I'll talk a bit more about storage in a moment.
The cataloguing is achieved in both simple and complex ways. There are the free-text notes the
archivists may have made during the capture process. On ingest into the store, automatic metadata
generation tools start to process the digital material: firstly to identify any material at risk
due to obsolescence or intrinsic hazards (like viruses), and secondly, where possible, to extract
text, titles, keywords, etc. After the automated processes have completed, the archivists can then view that
metadata and add their own.
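To give a flavour of the automated pass at ingest, here is a minimal sketch that walks a folder of
captured files and records size, checksum and a crude “at risk” flag, leaving a field for the
archivist's own notes. The extension list, field names and function names are all hypothetical.
This is not the BEAM ingest tool, and a real service would identify formats from file content
rather than trusting extensions.

```python
import hashlib
from pathlib import Path

# Hypothetical list, for illustration only. A real service would consult a
# format registry instead of relying on file extensions.
AT_RISK_EXTENSIONS = {".wps", ".wpd", ".lwp"}


def describe(path: Path) -> dict:
    """Collect basic technical metadata for one file."""
    data = path.read_bytes()  # fine for a sketch; large files need chunked reads
    return {
        "name": path.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "at_risk": path.suffix.lower() in AT_RISK_EXTENSIONS,
        "archivist_notes": "",  # filled in by hand after the automated pass
    }


def describe_accession(folder: Path) -> list[dict]:
    """Describe every file found beneath a folder of captured material."""
    return [describe(p) for p in sorted(folder.rglob("*")) if p.is_file()]
```

The point of the empty notes field is the division of labour described above: the machine fills in
what it can, and the archivist adds their own metadata afterwards.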
I spoke a little about hybrid usage above. Suffice to say we are trying to support a hybrid reading
room as much as possible and so “use” is blended here. In time, I would envisage cataloguing
becoming more of a mixed environment, but at the moment there are established methods of
creating metadata for the paper materials, and developing methods for the digital and they remain
The Bodleian Library is in the process of building a digital asset management system. BEAM will
use storage and services offered by that system. The DAMS will provide two geographically and
technically separate sites for storage. Geographic separation is to mitigate physical risks
such as fire, flood and theft. Technical separation is to enable us to use both “tried and tested”,
albeit limited, file store technology alongside what we believe to be self-healing, resilient and
reliable storage technology, but without the long track record. Should a fundamental bug be found
in one storage technology, it is hoped it will not be manifest in the other. Essential for our storage
needs is that it is local to our processing room. It is not on any public network and probably never
will be. Using storage “in the cloud” is not an option for us. This is mainly a matter of trust: our
donors seem reluctant to have their data housed with a third party.
While the DAMS will provide various interfaces to store data, including object-based storage,
BEAM has specified the use of simple file systems. This has been the result of much discussion but
is essentially founded on the principle that simple file systems have withstood the test of time
and provide for the “natural (file) order” of the digital materials. As we will be storing compound
objects, it is also difficult to see how treating a bundle of say 20,000 files as a single object will
Given this, if you have access to a well-maintained, secure, backed-up and checksum-monitored file
system, you can do digital preservation. Repository systems, for us, fulfil different (though no less
In order to recreate the demo experience :-) you can download the Ingester:
Once unpacked, refer to the BEAM Ingest Quick Start Guide available in the BEAM Ingest
Tool Building Principles
Where possible BEAM will use, contribute to and create Open Source Software. Of course,
there are ideological and cost reasons regarding freedom, but for us it is also about practicality. We
do not want to worry about licensing when archiving our software tool chain.
Where appropriate we will use Open Standards. Open Standards are a very good thing, but they must
be used appropriately. It seems foolish to attempt to adopt an open standard for internal use only,
when a simple home-grown system would do the job and be easier to implement. Trying to use a
standard just because it's there can lead to having to jump through hoops that are probably unnecessary. That
said, developing with standards in mind is important, and if there is one that meets your needs, use
it! We have adopted METS and EAD at BEAM. Open standards matter more at the boundaries of
the preservation system so keep in mind mapping on export and when exposing your data to the
We have built a system that has the shortest path to stabilisation. That is to say digital materials
come off their media, have a minimal amount of processing done, and then we put them into the
preservation store whole. This is to support the previously mentioned passage of time between
storage and cataloguing. Data sitting in a quarantine zone is not preserved, but may not be checked
for some time, so it needs to be moved to storage quickly. We currently keep both extracted files
and media “images” (bit-by-bit copies of the media). This is both because we want to avoid
second-guessing future researchers and also because we do not have time to appraise all the digital
material on some media (anything over 2GB, typically hard drives). In a similar vein, we also try to transform obsolete
files as late as possible – usually for presentation, rather than at ingest. This is because it is likely
such transformations will be done better in the future. Both these ideologies will have to be
reassessed when we run out of storage capacity!
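The “shortest path to stabilisation” might be sketched as follows: copy a captured media image into
the store under its collection identifier, then confirm the copy matches the original before
treating it as safe. The store layout and the names stabilise, file_digest and collection_id are
assumptions for illustration only, not a description of the BEAM code.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stabilise(image: Path, store: Path, collection_id: str) -> Path:
    """Copy a media image into the store, whole, and verify the copy."""
    target_dir = store / collection_id
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / image.name
    if target.exists():
        # Never overwrite material that is already in the store.
        raise FileExistsError(f"{target} is already in the store")
    shutil.copy2(image, target)
    if file_digest(image) != file_digest(target):
        target.unlink()  # discard the bad copy so it is not mistaken for a good one
        raise OSError(f"copy of {image.name} does not match the original")
    return target
```

Nothing here appraises, extracts or transforms the content; those steps can wait, which is the
whole idea of stabilising first and processing later.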
Finally, and perhaps most importantly, we are adopting a modular approach to our system. If you
imagined a car where all the parts were glued together, it would probably work fine until one thing
failed – say the headlamp bulb. If that bulb were not replaceable you would have to replace the
entire car to get it fixed. Cars have replaceable headlamps, and we want the preservation system to
have replaceable parts too. Such an approach has numerous advantages, including the ability to
scale the capacity of processing and storage by adding additional modules, and also being able to
upgrade or replace parts of the system independently of the others – essential for the long-term
sustainability of the system. This also means that if you can only manage to do one part of the
system, then just do that one part – for example, just keeping stuff in a secure store. You can then
add modules as they are built. In this way you can start small and grow as and when you can or you
The final slide shows some of the further tools futureArch hopes to develop for BEAM – either by
adopting existing products, or building them ourselves. It is an ambitious project, but we're trying