This document summarizes the findings of the Jisc Research Data Spring project at the Universities of York and Hull, which investigated how Archivematica could be used to provide digital preservation for research data. The project tested Archivematica, explored how it handles different file formats and research data, and identified ways to improve Archivematica and integrate it into research data management workflows. The next phases will develop Archivematica further and implement proofs of concept at York and Hull to preserve research data using Archivematica.
“Filling the digital preservation gap”: an update from the Jisc Research Data Spring project at York and Hull
1. “Filling the digital preservation gap”
an update from the Jisc Research
Data Spring project at York and Hull
Jenny Mitcham
Digital Archivist
Borthwick Institute for Archives
University of York
13 August 2015
2. Project aim
“…to investigate
Archivematica and explore
how it might be used to
provide digital preservation
functionality within a wider
infrastructure for Research
Data Management.”
3. What about Hydra?
• Hydra is not mentioned much in our project
report ...this is deliberate!
• We wanted to keep our findings generic to
make them as useful as possible to a wide range of
institutions that may be interested in digital
preservation...
• ...this means we are more likely to get further
funding
• However...
4. Project team
University of Hull:
• Chris Awre – Head of Information Services,
Library and Learning Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist
University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist
5. About the project
• Funded as part of Jisc Research Data Spring
• Started 30th March 2015
• Phase 1 is complete
• Phase 2 has just started and will run until
November
• …and we hope phase 3 will
be funded
6. Project structure
• Phase 1 – explore: testing, research, thinking -
produce a report (3 months)
• Phase 2 – develop: make Archivematica better
for RDM, plan implementation (4 months)
• Phase 3 – implement: set up proofs of
concept at York and Hull (6 months)
7. Phase 1 - The key questions
• Why? Why are we bothering to 'preserve' research data? What are
the drivers here and what are the risks if we don't? Why are we
looking at Archivematica?
• What? What are the characteristics of research data and how might
it differ from other born digital data that memory institutions are
establishing digital archives to manage and preserve? What types of
files are our researchers producing and how would Archivematica
handle these? What does Archivematica offer us and what benefits
does it bring?
• How? How would we incorporate Archivematica into a wider
technical infrastructure for research data management and what
workflows would we put in place? Where would it sit and what
other systems would it need to talk to? How can we improve
Archivematica for RDM?
• Who? Who else is using Archivematica (or other digital preservation
systems) to do similar things and what can we learn from them?
What staff resource is needed to preserve research data with
Archivematica?
9. Why Archivematica?
“The goal of the Archivematica project is to give
archivists and librarians with limited technical
and financial capacity the tools, methodology
and confidence to begin preserving digital
information today.”
10. Why Archivematica?
• Standards-based
• Open Source
• Flexible and customisable
• Compatible with hundreds of file formats
• Advanced search and storage management
• Integrated with third-party systems
From https://www.archivematica.org/en/
11. What does research data look like?
York RDM questionnaire
2013: Please select the main
types of electronic research
data you generate
13. The importance of identification
How well are these
formats identified by
digital preservation
tools?
• Better than expected!
• Sometimes partial
• Sometimes quite
generic (without a
version number)
MATLAB               N
SPSS                 Partial
Stata                N
R                    N
EndNote              Partial
NVivo                N
LaTeX                Partial
Python               N
Wolfram Mathematica  Partial
Gaussian             N
ChemDraw             Partial
SAS                  Partial
ArcGIS               Partial
GraphPad Prism       Partial
Adobe Photoshop      Partial
ATLAS.ti             N
C++                  N
Eclipse              NA? No native file formats
MS Excel             Y
RSB - ImageJ         Partial
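The results above can be tallied to see how they break down. The following is a minimal sketch in Python, with the slide's table re-entered by hand as a dictionary; the variable names are illustrative and not part of the project's tooling.

```python
from collections import Counter

# Identification results transcribed by hand from the slide above.
# "Y" = identified, "Partial" = partly identified, "N" = not identified.
results = {
    "MATLAB": "N",
    "SPSS": "Partial",
    "Stata": "N",
    "R": "N",
    "EndNote": "Partial",
    "NVivo": "N",
    "LaTeX": "Partial",
    "Python": "N",
    "Wolfram Mathematica": "Partial",
    "Gaussian": "N",
    "ChemDraw": "Partial",
    "SAS": "Partial",
    "ArcGIS": "Partial",
    "GraphPad Prism": "Partial",
    "Adobe Photoshop": "Partial",
    "ATLAS.ti": "N",
    "C++": "N",
    "Eclipse": "NA?",  # the slide notes it has no native file formats
    "MS Excel": "Y",
    "RSB - ImageJ": "Partial",
}

# Count how many software packages fall under each outcome.
tally = Counter(results.values())
for outcome, count in tally.most_common():
    print(f"{outcome}: {count} of {len(results)}")
```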
14. What does research data look like?
• Potentially quite big
• Wide range of file formats (some well understood
but a long tail of more specialist/obscure formats)
• Sometimes sensitive and/or confidential
• Ever changing (new software and techniques are
used for dynamic and cutting edge research)
• May be different versions of the data (as new
publications are released)
• Value not well understood at the point of deposit
15. What does Archivematica do?
The short answer:
“It packages data up in a standards compliant
way and prepares it to be stored for the long
term”
16. What does Archivematica do?
The longer answer:
• Assigns unique identifiers
• Creates a checksum for each object
• Creates a text file with a directory tree of the transfer
• Option to quarantine data for a specified period
• Runs virus checks
• Cleans up file and directory names (removing characters that may cause
problems)
• Runs identification tools so you can find out what file formats you have
• Extracts data from zip files (or not if you would rather not)
• Extracts metadata embedded in the files (if you want)
• Normalises files (if a migration path exists)
• ...
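Three of the steps above (checksums, the directory tree listing and the clean-up of names) can be illustrated with a short sketch. It uses only the Python standard library and is not Archivematica's own code: the function names, the choice of SHA-256 and the set of characters treated as safe are all assumptions made for illustration.

```python
import hashlib
import re
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clean_name(name: str) -> str:
    """Replace characters that may cause problems with underscores.

    Assumption: only letters, digits, dot, hyphen and underscore are safe.
    """
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def directory_tree(root: Path) -> str:
    """Return an indented text listing of everything under root."""
    lines = [root.name + "/"]
    for path in sorted(root.rglob("*")):
        depth = len(path.relative_to(root).parts)
        suffix = "/" if path.is_dir() else ""
        lines.append("  " * depth + path.name + suffix)
    return "\n".join(lines)
```

For example, `clean_name("my data (v2).csv")` gives `my_data__v2_.csv`, and `directory_tree(Path("transfer"))` gives the kind of text listing that could be stored alongside a transfer.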
17. What does Archivematica do?
The really really long answer (if you have time):
• Read the manual
https://www.archivematica.org/en/docs/archivematica-1.4/
18. What does Archivematica do?
One final answer (honest):
It gives us a greater level of confidence that we
will be able to continue to provide access to
usable copies of research data over the longer
term
19. What are the downsides?
• It isn’t a magic bullet
• There is no guarantee your data will be
readable in the future
• It can only be as good as current digital
preservation practice
• It can be fiddly to install correctly
• The GUI isn’t that intuitive
• You need staff who understand it
20. Phase 2: ‘develop’
1. Enable better workflows for RDM (producing a
DIP on request)
2. Allow the DIP (access copy of data) to be
usable by different repository systems
3. Help reduce bottlenecks for big data
4. Provide workflows for unidentified files
5. Enable easier querying of data within
Archivematica by third party applications
6. Improve documentation
21. Phase 2: RDM Workflows at York
• We get a copy of data from researcher
• We transfer it to Archivematica
• Archivematica packages it up for storage and
creates the Archival Information Package (AIP)
• Archivematica sends the AIP to archival storage
• Metadata is published in data catalogue
• If someone requests the data Archivematica will
create a Dissemination Information Package (DIP)
• DIP will be uploaded to Digital Library for access
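The workflow above can be sketched as a simple model that separates the steps run once at ingest from the steps run only when the data is requested. This is a hypothetical illustration and makes no calls to Archivematica: the `Deposit` class and both function names are invented here, and the rule that a second request reuses an existing DIP is an assumption rather than something stated in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Deposit:
    """A hypothetical record of one research data deposit."""
    title: str
    events: list = field(default_factory=list)
    has_dip: bool = False


def ingest(deposit: Deposit) -> None:
    """Steps that happen once, when the data arrives."""
    deposit.events += [
        "copy received from researcher",
        "transferred to Archivematica",
        "AIP created",
        "AIP sent to archival storage",
        "metadata published in data catalogue",
    ]


def request_access(deposit: Deposit) -> None:
    """Steps that happen only if someone asks for the data."""
    # Assumption: an existing DIP is reused rather than regenerated.
    if not deposit.has_dip:
        deposit.events += ["DIP created", "DIP uploaded to Digital Library"]
        deposit.has_dip = True


deposit = Deposit("Example dataset")
ingest(deposit)
request_access(deposit)
request_access(deposit)  # second request adds no new events

for event in deposit.events:
    print(event)
```

Keeping DIP creation out of `ingest` reflects the point of the workflow: an access copy is produced on request rather than for every deposit.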