1. Implementing Archivematica for
research data preservation at
York and Hull
Jenny Mitcham (Digital Archivist) - University of York
Jisc RDN event - 06 September 2016
2. What I’m going to cover
This is a presentation in 4 parts:
1. Background to our project
2. Implementing Archivematica
3. The challenges of preserving research data
4. Future plans
3. Part one: The Filling the Digital
Preservation Gap project
4. Filling the digital preservation gap:
Project aim
“…to investigate
Archivematica and explore
how it might be used to
provide digital preservation
functionality within a wider
infrastructure for Research
Data Management.”
5. Project structure
• Phase 1 – explore: testing, research,
thinking - produce a report (3 months)
• Phase 2 – develop: make
Archivematica better for RDM, plan
implementation - report (4 months)
• Phase 3 – implement: set up proof of
concepts at York and Hull and further
investigation of the file format problem (6
months)
6. The team
University of Hull:
• Chris Awre – Head of Information Services,
Library and Learning Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist
University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist
Artefactual Systems
Funded by Jisc
(Research Data Spring)
8. What are we trying to achieve?
Demonstrate that it is possible to:
• pull metadata from PURE / pull content from Box
• capture further data to help us manage the dataset
• automatically initiate ingest by Archivematica
• set up Archivematica to package the data up for
longer term preservation (automatically)
• provide a dissemination copy of the data for our
Hydra repository
...basically what we said in our implementation plans
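The steps above can be sketched as a simple pipeline. This is only an illustration of the order of operations, not the project's actual code: `Deposit`, `run_workflow` and the four callables are hypothetical names, and the real calls to PURE, Box, Archivematica and Hydra are deliberately left as functions supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Deposit:
    """One research dataset moving through the workflow (hypothetical structure)."""
    dataset_id: str
    metadata: Dict[str, str] = field(default_factory=dict)
    files: List[str] = field(default_factory=list)
    steps_done: List[str] = field(default_factory=list)


def run_workflow(
    dataset_id: str,
    fetch_metadata: Callable[[str], Dict[str, str]],  # stands in for the PURE lookup
    fetch_files: Callable[[str], List[str]],          # stands in for the Box download
    start_ingest: Callable[[Deposit], None],          # stands in for the Archivematica transfer
    publish_copy: Callable[[Deposit], None],          # stands in for the Hydra deposit
    extra_metadata: Optional[Dict[str, str]] = None,  # further data captured locally
) -> Deposit:
    """Run the steps in the order listed on the slide and record each one."""
    deposit = Deposit(dataset_id)

    deposit.metadata = dict(fetch_metadata(dataset_id))
    deposit.steps_done.append("metadata")

    deposit.files = list(fetch_files(dataset_id))
    deposit.steps_done.append("content")

    # Merge in anything captured to help manage the dataset.
    deposit.metadata.update(extra_metadata or {})
    deposit.steps_done.append("management data")

    start_ingest(deposit)
    deposit.steps_done.append("ingest")

    publish_copy(deposit)
    deposit.steps_done.append("dissemination")
    return deposit
```

Passing the integrations in as functions keeps the sketch runnable without any of the real systems, and mirrors the idea that each integration can be swapped out at a different institution.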
9. In addition…
• Keep an eye on the broader picture
– How can preservation processes for research
data be used for other materials, e.g. archives?
• Consider different use cases for research
data organisation on deposit
– Single file, multiple files, hierarchical files, etc.
– With or without associated metadata
• Share experiences across two institutions
with different environments
10. How did we approach it?
We wanted to work in a way that:
• was useful to others
• was open and accessible
• had the bigger picture in mind
So we are:
• sharing code on GitHub
• working in Google Docs
• engaging Hydra and Archivematica communities
• blogging and talking at events like this
15. What were the challenges?
• mostly time!
– recruiting a suitably skilled developer at short
notice
– relying on Artefactual Systems who have their
own list of priorities and timescales
– working with local IT department and different
priorities
• outstanding tasks from phase 2 which
needed further development
• integration/APIs (e.g. with PURE and Box)
16. What worked well?
• Re-using existing code (rather than re-
inventing the wheel)
– The puree gem from Lancaster University: this is
a way of pulling metadata out of PURE and it
saved us a huge amount of work
– Automation tools from Artefactual Systems: a
lightweight method of automating transfers
within Archivematica. We funded a webinar about
this in phase 2 of our project.
• Flexibility and capacity in house to do the
work
17. Part three: The challenges of
preserving files that we can’t
identify
18. A quick look at file formats
Research data file formats are:
• Numerous
• Sometimes a bit obscure
• Sometimes very big
• Ever-changing
• Often very new
This means they can be hard to preserve... The
first hurdle is that we can’t identify them. If we
can’t identify them how can we carry out
preservation activities?
20. The NDSA Levels
of Digital
Preservation:
Level 2 requires
you to know what
you’ve got ...
and levels 3 and 4
build on this
21. Can we identify our research data?
We ran DROID* over the research data
deposited with us over the past year.
Out of 3752 individual files:
• only 37% (1382) of the files were identified
(with varying degrees of accuracy)
• there were 34 different identified file
formats in the sample
* DROID is a free tool from The National Archives that can be
used to automatically identify file formats
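A minimal sketch of how figures like these could be worked out from an identification report. It assumes each row is a dict for one file (folders already filtered out) and that an identified file has a non-empty `PUID` value; that column name is an assumption about the export being used, and `identification_summary` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple


def identification_summary(rows: Iterable[Dict[str, str]]) -> Tuple[int, int, float, int]:
    """Return (total files, identified files, % identified, distinct formats)."""
    total = 0
    identified = 0
    formats = set()
    for row in rows:
        total += 1
        puid = (row.get("PUID") or "").strip()  # assumed column name
        if puid:
            identified += 1
            formats.add(puid)
    rate = 100.0 * identified / total if total else 0.0
    return total, identified, rate, len(formats)


# Checking the slide's own figures: 1382 identified out of 3752 files.
print(round(100.0 * 1382 / 3752))  # 37
```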
22. Identified research data files
Files identified by DROID (listed by file type)
...note that native files of the software in the
previous graph of research data applications
are not represented
23. Unidentified research data files
• Files not identified by DROID (listed by file ext)
• 107 different file extensions not identified
– huge number with no extension (help!)
– how do we solve the .dat file problem?
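A listing like this can be produced by tallying the unidentified files by extension. The sketch below makes the same assumptions as any simple report reader: one dict per file, a `NAME` key holding the file name and an empty `PUID` meaning "not identified". Both key names and the helper `unidentified_by_extension` are assumptions for illustration.

```python
import os
from collections import Counter
from typing import Dict, Iterable


def unidentified_by_extension(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count unidentified files per lower-cased extension."""
    counts: Counter = Counter()
    for row in rows:
        if (row.get("PUID") or "").strip():
            continue  # identified, so not part of this listing
        ext = os.path.splitext(row.get("NAME", ""))[1].lower()
        counts[ext or "(no extension)"] += 1
    return counts
```

Calling `.most_common()` on the result gives the extensions ordered by how often they occur, which makes the size of the "no extension" and `.dat` groups easy to see.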
24. What is the project doing to solve
the file identification problem?
• We have sponsored the development of 8
new file format signature records in PRONOM
for different types of research data
• We have created our own research data file
signatures for inclusion in PRONOM (and
blogged about it to encourage others to do
the same)
• We have been talking to TNA about how to
engage the community more
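For anyone unfamiliar with the idea, a file format signature is essentially a known byte pattern expected at a known position in the file. The sketch below is a deliberately simplified illustration using two well-known formats; it is not PRONOM's signature syntax, which also supports things like variable offsets, wildcards and end-of-file anchoring. `Signature`, `SIGNATURES` and `identify` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Signature:
    """A fixed byte sequence expected at a fixed offset from the start of a file."""
    name: str
    offset: int
    magic: bytes


# Two widely documented examples, used here only for illustration.
SIGNATURES = [
    Signature("PDF", 0, b"%PDF"),
    Signature("PNG", 0, b"\x89PNG\r\n\x1a\n"),
]


def identify(data: bytes, signatures: Iterable[Signature] = SIGNATURES) -> Optional[str]:
    """Return the name of the first signature that matches, or None."""
    for sig in signatures:
        end = sig.offset + len(sig.magic)
        if data[sig.offset:end] == sig.magic:
            return sig.name
    return None
```

A file with no matching pattern returns `None`, which is the situation described for most of the research data in the sample: without a signature, there is nothing for the tool to match against.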
26. Future plans
• We have a week left to finish our active
project work (eeeek!)
• ...and look out for our phase 3 report in
mid-October (and other dissemination outputs)
• We need to work out how to move from
‘proof of concept’ to production
– York will be establishing how to move seamlessly
from this project into the Jisc Shared Service
– Hull will be using the work to inform a City of
Culture digital archive
28. Do talk to me if you are interested in
finding out more about this project
Useful links:
Project website: http://www.york.ac.uk/borthwick/archivematica
Digital archiving blog: http://digital-archiving.blogspot.co.uk/
Archivematica: https://www.archivematica.org/en/
Phase 1 report: http://dx.doi.org/10.6084/m9.figshare.1481170
Phase 2 report: https://dx.doi.org/10.6084/m9.figshare.2073220