A presentation given by Jenny Mitcham at the Northern Collaboration Conference on 10th September 2015 at Leeds. It describes work underway in the "Filling the Digital Preservation Gap" project using Archivematica to preserve research data
A collaborative approach to "filling the digital preservation gap" for Research Data Management
1. A collaborative approach to
“filling the digital preservation
gap” for Research Data
Management
Jenny Mitcham
Digital Archivist
Borthwick Institute for Archives
University of York
10 September 2015
2. What am I talking about...?
Me: I’m going to talk to you about digital archiving...
You: Does she mean storage?
Me: OAIS blaa blaa AIP blaa TRAC, PREMIS blaa...
You: Who invited her anyway?
Me: Digital preservation is all about the active management of data
You: Is she still just talking about storage...?
3. So what am I talking about?
Me: Digital preservation refers to the series of managed activities necessary to ensure continued access to digital materials for as long as necessary.
Me: Digital preservation ...refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological change.*
You: Oh I see!
You: zzzzzzzz
You: Not just storage then?
* Text shamelessly stolen from the DPC Preservation Handbook
4. This is a digital archive
The Open Archival Information System (OAIS)
8. Filling the digital preservation gap:
Project aim
“…to investigate
Archivematica and explore
how it might be used to
provide digital preservation
functionality within a wider
infrastructure for Research
Data Management.”
9. This is a collaboration
University of Hull:
• Chris Awre – Head of Information Services, Library and
Learning Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist
University of York:
• Julie Allinson – Manager,
Digital York
• Jen Mitcham – Digital Archivist
Artefactual Systems
Jisc
10. Project structure
• Phase 1 – explore: testing, research,
thinking - produce a report (3 months)
• Phase 2 – develop: make
Archivematica better for RDM, plan
implementation (4 months)
• Phase 3 – implement: set up proof of
concepts at York and Hull (6 months)
12. Why do we need digital preservation
for research data?
• There is a digital preservation gap in current RDM
infrastructures
• We can’t ignore digital preservation – moving targets for
data retention mean we need to take this seriously
• Funder requirements around retention:
– NERC - data should be retained for a minimum of 10 years but
for projects of major importance this may need to be 20 years
or longer
– STFC - expect data to be retained for a minimum of 10 years and
data that cannot be re-measured should be retained indefinitely
– Wellcome Trust – expect data to be kept for a minimum of 10
years but suggest longer periods for certain types of data
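The minimum periods above can be turned into a simple date calculation. The following is only a minimal sketch: the function name, the `retained_from` starting point and the `indefinite` flag are assumptions made for illustration, and the funders' own policies define when the clock actually starts and which data needs longer retention.

```python
from datetime import date
from typing import Optional

# Minimum retention periods in years, as summarised on the slide above.
MINIMUM_RETENTION_YEARS = {"NERC": 10, "STFC": 10, "Wellcome Trust": 10}


def earliest_review_date(funder: str, retained_from: date,
                         indefinite: bool = False) -> Optional[date]:
    """Return the earliest date the data could be reviewed for disposal.

    Returns None when the data should be kept indefinitely (for example
    STFC data that cannot be re-measured). `retained_from` is an assumed
    starting point for the retention period.
    """
    if indefinite:
        return None
    years = MINIMUM_RETENTION_YEARS[funder]
    try:
        return retained_from.replace(year=retained_from.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return retained_from.replace(year=retained_from.year + years, day=28)
```

A minimum is only a floor: NERC projects of major importance and some Wellcome Trust data types would need a longer period than this calculation returns.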
13. University of York RDM questionnaire 2013
• Which data management issues have you come
across in your research over the last five years?
– “Inability to read files in old software formats on old
media or because of expired software licences”
– 24% of 181 researchers who answered this question
admitted this had been a problem for them
14. Why Archivematica?
“The goal of the Archivematica project is to give
archivists and librarians with limited technical
and financial capacity the tools, methodology
and confidence to begin preserving digital
information today.”
15. Why Archivematica?
• Standards-based
• Open Source
• Flexible and customisable
• Compatible with hundreds of file formats
• Advanced search and storage management
• Integrated with third-party systems
From https://www.archivematica.org/en/
16. What does Archivematica do?
The short answer:
“It packages data up in a standards compliant
way and prepares it to be stored for the long
term”
17. What does Archivematica do?
The longer answer:
• Assigns unique identifiers
• Creates a checksum for each object
• Creates a text file with a directory tree of the transfer
• Option to quarantine data for a specified period
• Runs virus checks
• Cleans up file and directory names (removing characters that may cause
problems)
• Runs identification tools so you can find out what file formats you have
• Extracts data from zip files (or not if you would rather not)
• Extracts metadata embedded in the files (if you want)
• Normalises files (if a migration path exists)
• ...
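To make a few of these steps concrete, here is a minimal Python sketch of assigning identifiers, creating checksums and cleaning up file names for a folder of files. It is not Archivematica's own code: the function names, the choice of SHA-256 and the set of "safe" characters are all assumptions made for illustration.

```python
import hashlib
import re
import uuid
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clean_name(name: str) -> str:
    """Replace characters outside a conservative safe set with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def describe_transfer(root: Path) -> list[dict]:
    """Build one record per file: identifier, original and cleaned path, checksum."""
    records = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root)
        records.append({
            "uuid": str(uuid.uuid4()),  # unique identifier for this object
            "original_path": relative.as_posix(),
            "clean_path": "/".join(clean_name(part) for part in relative.parts),
            "sha256": sha256_of(path),
        })
    return records
```

Calling `describe_transfer(Path("my_transfer"))` on a hypothetical folder returns a list that doubles as a simple directory listing of the transfer. Keeping the original path next to the cleaned one means the renaming stays traceable.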
18. What does Archivematica do?
The really really long answer (if you have time):
• Read the manual
https://www.archivematica.org/en/docs/archivematica-1.4/
19. What does research data look like?
York RDM questionnaire
2013: Please select the main
types of electronic research
data you generate
21. What does research data look like?
York RDM questionnaire 2013:
If your project is not yet
complete, can you make an
estimate of the ‘final’ size of
your digital data
22. Value of research data
“There has probably been an awful lot
of good data lost due to poor practice
in archiving ...”
“Storing vast datasets which are not
part of the final publication adds a lot
of cost for very little benefit.”
“Unprocessed data is generally large
and difficult to analyse, unless the
analysis tools are provided in the
archive.”
“I hope strongly that in the future I
might contribute to a widely available
repository for musical
instruction/examples ....both for
other players/composers and for
musicological researchers.”
Researchers
24. How could you use Archivematica?
• Host it in-house and link it to an existing
repository/access system (for example DSpace,
CONTENTdm, Fedora/Hydra ...or a CRIS)
• Host it in-house and use as a standalone system
(you would need to have a storage system in place and establish a
way of facilitating access to the data)
• Sign up for a hosted instance of Archivematica
with archivesDIRECT (combines Archivematica with
DuraCloud storage)
• Sign up for a hosted instance of Archivematica
with Arkivum (combines Archivematica with Arkivum storage)
25. Why would we recommend
Archivematica for RDM?
• It is flexible and can be configured in different ways for
different institutional needs and workflows
• It allows many of the tasks around digital preservation to
be carried out in an automated fashion
• It can be used alongside other existing systems as part of a
wider workflow for research data
• It is a good digital preservation solution for those with
limited resources
• It is an evolving solution that is continually driven and
enhanced by and for the digital preservation community
• It gives institutions greater confidence that they will be
able to continue to provide access to usable copies of
research data over time
26. What are the downsides?
• It isn’t a magic bullet
• There is no guarantee your data will be
readable in the future
• It can only be as good as current digital
preservation practice
• It can be fiddly to install correctly
• The GUI isn’t that intuitive
• You need staff who understand it
27. FAQs
• Why do we need a digital preservation system for research
data?
• What are the risks if we don't address digital preservation?
• Why are we interested in Archivematica?
• Why do we recommend Archivematica to help preserve
research data?
• What does Archivematica actually do?
• How could Archivematica be incorporated into a wider
technical infrastructure for research data management?
• What does research data look like?
• How would Archivematica handle research data?
• What are the limitations of Archivematica for research data?
• What costs are associated with using Archivematica?
• ....
28. Read all about it!
http://digital-archiving.blogspot.co.uk/
29. RDM Workflows at York
• We get a copy of data from a researcher
• We transfer it to Archivematica
• Archivematica packages it up for storage and
creates the Archival Information Package (AIP)
• Archivematica sends the AIP to archival storage
• Metadata is published in the data catalogue
• If someone requests the data, Archivematica will
create a Dissemination Information Package (DIP)
• The DIP will be uploaded to the Digital Library for access
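The workflow above can be sketched as a small Python model. This is purely illustrative: the `Dataset` class and the function names are hypothetical, nothing here calls Archivematica or any real York system, and each step is reduced to a log entry. The point it shows is that the AIP is created at ingest, while a DIP is only produced when someone asks for the data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    """A research dataset moving through the (hypothetical) workflow."""
    title: str
    files: list[str]
    aip_stored: bool = False
    catalogue_record: Optional[dict] = None
    dips_created: int = 0
    log: list[str] = field(default_factory=list)


def ingest(dataset: Dataset) -> None:
    """Receive the data, transfer it, package it as an AIP and store it."""
    dataset.log += [
        "copy of data received from researcher",
        "data transferred to Archivematica",
        "AIP created",
        "AIP sent to archival storage",
    ]
    dataset.aip_stored = True


def publish_metadata(dataset: Dataset) -> None:
    """Publish a minimal metadata record in the data catalogue."""
    dataset.catalogue_record = {
        "title": dataset.title,
        "file_count": len(dataset.files),
    }
    dataset.log.append("metadata published in data catalogue")


def request_access(dataset: Dataset) -> str:
    """Create a DIP on request; it would then be uploaded for access."""
    if not dataset.aip_stored:
        raise ValueError("no AIP in archival storage for this dataset")
    dataset.dips_created += 1
    dataset.log.append("DIP created and uploaded to Digital Library")
    return f"DIP-{dataset.dips_created} for {dataset.title}"
```

In this sketch a dataset that is never requested never gets a DIP, so access copies are only made for data that is actually wanted.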
32. How can we improve Archivematica?
1. Enable better workflows for RDM (producing a
DIP on request)
2. Allow the DIP (access copy of data) to be
usable by different repository systems
3. Help reduce bottlenecks for big data
4. Provide workflows for unidentified files
5. Enable easier querying of data within
Archivematica by third party applications
6. Improve documentation
33. Where to find out more
http://www.york.ac.uk/borthwick/
35. Any questions?
Thanks for listening
You: Has she finished yet...?
Me: Feel free to contact me at jenny.mitcham@york.ac.uk
Me: ...I really am keen to talk to you about this project
Useful links:
Borthwick website: http://www.york.ac.uk/borthwick/
Digital archiving blog: http://digital-archiving.blogspot.co.uk/
Archivematica: https://www.archivematica.org/en/
Report: http://dx.doi.org/10.6084/m9.figshare.1481170