
Using Archivematica to preserve research data


Jenny Mitcham from the University of York and Chris Awre from the University of Hull share lessons learned from their project exploring the potential of the digital preservation solution Archivematica to help manage the research data that academics at their universities produce. The project, 'Filling the Digital Preservation Gap', was carried out with funding from Jisc as part of its Research Data Spring programme and was a collaboration between the University of York and the University of Hull. The project explored not only Archivematica as a possible solution but also how it could integrate with repositories and other systems for managing research data.

The Series is jointly sponsored by ANDS and CAUL.

Published in: Education


  1. 1. Filling the Digital Preservation Gap Chris Awre (@clawre) and Jenny Mitcham (@Jenny_Mitcham) 7th or 8th November 2016
  2. 2. Introduction: who we are and why are we doing this?
  3. 3. Research at Hull and York • University of Hull: – 5 Faculties, 11 Schools – c. 22,000 students – 62% of research classed as 3* or 4* in REF 2014 – In top 50 UK institutions by ‘research power’ • University of York: – 30+ academic departments – c. 16,000 students – Ranked in the top ten of UK universities for research council income – Secured £46 million in research council income in 2014/15
  4. 4. Why do we need digital preservation for research data? We can’t ignore digital preservation – moving targets for data retention mean we need to take this seriously. Funder requirements around retention: • NERC – data should be retained for a minimum of 10 years but for projects of major importance this may need to be 20 years or longer • STFC – expect data to be retained for a minimum of 10 years and data that cannot be re-measured should be retained indefinitely • Wellcome Trust – expect data to be kept for a minimum of 10 years but suggest longer periods for certain types of data
  5. 5. Why do we need digital preservation for research data? University of York RDM questionnaire 2013: Which data management issues have you come across in your research over the last five years? “Inability to read files in old software formats on old media or because of expired software licences”: 24% of the 181 researchers who answered this question admitted this had been a problem for them
  6. 6. Jisc Research Data Spring initiative • A three phase funding programme starting in March 2015 • Looking for ideas for technical tools to help with RDM • Ideas crowd-sourced and voted on by anyone interested • Best ideas invited to pitch for funding • See for more information
  7. 7. Project aim - our pitch! “…to investigate Archivematica and explore how it might be used to provide digital preservation functionality within a wider infrastructure for Research Data Management.”
  8. 8. Project structure • Phase 1 – explore: testing, research, thinking (3 months) • Phase 2 – develop: make Archivematica better for RDM, plan implementation (4 months) • Phase 3 – implement: set up proofs of concept at York and Hull and further investigate the file format problem (6 months)
  9. 9. The team University of Hull: • Chris Awre – Head of Information Services, Library and Learning Innovation • Richard Green – Independent Consultant • Simon Wilson – University Archivist University of York: • Julie Allinson – Manager, Digital York • Jen Mitcham – Digital Archivist
  10. 10. Phase 1 - Explore
  11. 11. What were we trying to achieve? A feasibility study: • What does research data look like? • What does Archivematica do to preserve data? • How can Archivematica integrate with our other RDM systems? • What are our institutional requirements for digital preservation? • Does Archivematica meet those requirements? • Where does Archivematica fall short? All written up and published in a report…
  12. 12. What is Archivematica? ● Free and open-source digital preservation system (AGPLv3) designed to maintain standards-based, long-term access to digital objects ● Allows users to process digital objects from ingest to access using OAIS functional model ● Implements format normalization upon ingest and preserves originals to support emulation and migration strategies
  13. 13. What is Archivematica? ● Archivematica is a processing pipeline consisting of a bundle of open-source tools and Python scripts which deliver a series of preservation micro-services ● Archivematica is designed to output high-quality, standards-compliant Archival Information Packages (AIPs) ● BagIt, METS, PREMIS
  14. 14. Archivematica development partners and more!
  15. 15. Why would we recommend Archivematica for RDM? • It is flexible and can be configured in different ways for different institutional needs and workflows • It allows many of the tasks around digital preservation to be carried out in an automated fashion • It can be used alongside other existing systems as part of a wider workflow for research data • It is a good digital preservation solution for those with limited resources • It is an evolving solution that is continually driven and enhanced by and for the digital preservation community • It gives institutions greater confidence that they will be able to continue to provide access to usable copies of research data over time
  16. 16. What are the downsides? • It isn’t a magic bullet • There is no guarantee your data will be readable in the future • It can only be as good as current digital preservation practice • It can be fiddly to install correctly • The GUI isn’t that intuitive • You need staff who understand it
  17. 17. How could you use Archivematica? • Host it in-house and link it to an existing repository/access system (for example DSpace, CONTENTdm, Fedora/Hydra ...or a CRIS) • Host it in-house and use as a standalone system (you would need to have a storage system in place and establish a way of facilitating access to the data) • Sign up for a hosted instance of Archivematica with archivesDIRECT (combines Archivematica with DuraCloud storage) • Sign up for a hosted instance of Archivematica with Arkivum (combines Archivematica with Arkivum storage)
  18. 18. Phase 2 - Develop
  19. 19. Phase 2 development work • Six different areas of work • Development carried out by Artefactual Systems from July 2015 to January 2016 • Weekly Google Hangouts to report on progress • Will be available in Archivematica soon...
  20. 20. Deliverable 1 Problem: Research Data needs to be kept, but we don’t know if anyone will ever want it and it might be *massive* The Solution: enable the DIP to be generated ‘on request’ and not as part of the initial ingest
  21. 21. Deliverable 2 Problem: We want to be able to grab the DIP, and metadata about it and pull it into our repository The Solution: a library to help with parsing and creating METS files
  22. 22. Deliverable 3 Problem: We want to be able to report on what we have The Solution: a search API to answer basic questions about number of files in storage, their formats, date of ingest etc.
  23. 23. Deliverable 4 Problem: With large datasets, the current checksum mechanism in Archivematica could be a bottleneck The Solution: support for multiple checksum algorithms
  24. 24. Deliverable 5 Problem: What about all those file formats that Archivematica can’t identify? The Solution: mechanism for running file identification with multiple tools and a report of unidentified formats. ...and I’ll talk a bit more about file format identification later!
  25. 25. Deliverable 6 Problem: We want to make it easier for institutions to adopt Archivematica The Solution: a webinar describing Archivematica’s Automation Tools
  26. 26. What worked well • Artefactual staff were good people to work with (and patient) • Artefactual have the bigger picture in mind and really want to understand the use cases • Our work builds on work that others have done and is being used and built on by future work that is in the pipeline: – Search API work being looked at by Bentley Historical Library – DIP generation by Simon Fraser University
  27. 27. What didn’t work well • Many of the areas of development were only partially solved through our work: – the problems were big and complex – what does success look like? – perhaps we tried to do too much? • It was hard to prove the impact of our checksum work • Solving the file identification problem is a huge task and needed more thought....
  28. 28. Implementation plans • Hull and York also worked on implementation plans for Archivematica. This was key because deciding to use a system is easy; deciding exactly how to use it is much harder • Separate plans were created for Hull and York as different RDM systems were in place and there were different institutional needs and priorities
  29. 29. Phase 3 - Implement
  30. 30. York p-o-c implementation York wanted to provide: • an easy way of depositing data • a way of monitoring datasets for RDM staff • a way of requesting access to data with: • data sent to Archivematica • dataset metadata pulled from PURE
  31. 31. York p-o-c implementation • Metadata from PURE pulled in nightly or on-demand • Fedora objects created for the dataset to store local admin info and help connect the PURE and Archivematica records • Visual representation of workflow status
  32. 32. York PCDM modelling • Dataset = dataset record from PURE • Individual data files stored, but folder structure is not • Folder structure available in Archivematica METS • Dataset can be made up of multiple ‘Packages’ of data, e.g. a newer version
  33. 33. What next in York? • Our RDM staff love the p-o-c and we have agreement to turn it into a production system over the autumn/winter • This has been a helpful exercise for broader data modelling / Hydra implementation at York • York is a pilot in the Jisc Shared Service for Research Data and will move forward with this work over the next couple of years
  34. 34. Hull p-o-c implementation • Hull keen to make Archivematica part of a workflow for any type of repository content – not just research data. You may have seen a poster at Hydra Connect last year: Hull’s p-o-c implements most of the automated bulk ingest route, creates AIP(s) and builds repository objects from the DIP(s)
  35. 35. Hull p-o-c implementation • User assembles files and simple descriptive file(s) in Box folder. Shares the folder with Archivematica • System checks folder contents and if OK creates a bag (BagIt standard) for each object which is passed to Archivematica • Archivematica processes the bag to create an AIP which goes to a preservation store… • …and also a DIP which is passed to the DIP processor • DIP processor creates Hydra objects from the DIP contents and injects them into the repository QA queue… • …matched to the AIP by UUID. Thanks to Cottage Labs for all the new development work!
  36. 36. Hull p-o-c options • Depositors have several options: • A folder containing multiple data files and one descriptive file ➔ a single AIP and a single repository object with (optionally) one or more surrogate files for download (so can be a “metadata-only” record) • A folder containing multiple files and a csv file (one row per file) ➔ multiple AIPs with multiple repository objects, each with (optionally) a surrogate for download • A folder containing the top-level folder of a structure ➔ a zipped structure in a single AIP and a single repository object (optionally) containing the zipped file for download
  37. 37. What’s next in Hull? • We hope to be able to take the p-o-c work and turn it into a production system • Hull is the UK’s “City of Culture” next year and there will be a great deal of digital material that the University Archives want to capture for posterity
  38. 38. Phase 3 - The file formats problem
  39. 39. File formats problem Research data file formats are: • Numerous • Sometimes a bit obscure • Sometimes very big • Ever-changing • Often very new This means they can be hard to preserve... The first hurdle is that we can’t identify them. If we can’t identify them how can we carry out preservation activities?
  40. 40. Top research data applications at York
  41. 41. Can we identify our research data? We ran DROID* over the research data deposited with Research Data York over the past year. Out of 3752 individual files: • only 37% (1382) of the files were identified (with varying degrees of accuracy) • there were 34 different identified file formats in the sample * DROID is a free tool from The National Archives that can be used to automatically identify file formats
  42. 42. Unidentified research data files Files not identified by DROID (listed by file extension): – 107 different file extensions not identified – huge number with no extension (help!) – how do we solve the .dat file problem?
  43. 43. Supporting signature development at The National Archives
  44. 44. Creating our own signatures
  45. 45. Conclusions and where to find out more
  46. 46. Impact “In many ways the project at York and Hull felt like a precursor to the Shared Services pilot; highlighting both the potential problems in working with a wide range of stakeholders and systems, as well as the massive benefits possible from pooling our collective knowledge and resources to tackle the technical challenges which remain in RDM.” From ‘Unlocking Research’ blog from the University of Cambridge Office of Scholarly Communication (16 September 2016) “I've just read your paper on linking repositories and Archivematica - fascinating and full of very useful information! I will certainly be following up with many of the links and ideas you presented, especially the Jisc Research Data Shared Service work on digital preservation, and I will discuss with my colleagues the possibility of using Archivematica at our data centre and our options for collaboration. Many of our issues relate to the long tail of research data and the preservation of data already archived.” Comment on Digital Archiving blog (18 October 2016)
  47. 47. Challenges • Impact of short, but focused, timeframes and short lead in times • Access to appropriate skills (mainly technical development) limited scope of work • Limited budget, hence ‘parsimonious’ approach to making the best use of this • Interpretations of digital archiving across preservation, RDM and IT communities • Balancing dissemination with actual doing!
  48. 48. What have we learned? • Archivematica can be used to manage the preservation of research data • And that this can be embedded within similar, but different, institutional workflows • There is benefit in getting focused systems to do what they do best rather than adding functionality to any particular one • There is a file format recognition issue that will affect long-term preservation of research data files • But there is a way to address this through the development of additional file signatures • There is real benefit in working collaboratively in this area, both within the project and beyond it, to identify common ways of tackling the problem of preserving research data
  49. 49. Jisc Research Data Shared Service • Almost in parallel with the Research Data Spring projects Jisc were planning a Research Data Shared Service • The resulting system will be managed and hosted, and will offer three core modules: repository, preservation and reporting • Phase 1 and 2 reports from Hull and York very influential for the preservation module • Commercial and open source offerings for each module, including Archivematica (for preservation) and Hydra (for the repo) • Over 20 pilot institutions recruited (including York) – all identified preservation as a priority
  50. 50. Further information Website: Blog: Reports:
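
Deliverable 4 (slide 23) adds support for multiple checksum algorithms, because the existing checksum mechanism could become a bottleneck with large datasets. The following is a minimal Python sketch of that idea, not Archivematica's own code: the function name `file_checksums`, the default algorithm list and the chunk size are all assumptions made for illustration. It reads a file once, in chunks, and feeds each chunk to one `hashlib` digest per requested algorithm, so a large file is never loaded into memory whole.

```python
import hashlib
from pathlib import Path


def file_checksums(path, algorithms=("md5", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm name: hex digest} for the file at `path`.

    Hypothetical helper for illustration; not part of Archivematica.
    """
    # One hashlib object per requested algorithm. hashlib.new raises
    # ValueError if the algorithm name is not supported.
    digests = {name: hashlib.new(name) for name in algorithms}
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Calling `file_checksums("dataset.csv", algorithms=("md5",))` would compute only an MD5 digest, which shows how the choice of algorithm could be left to each institution's workflow rather than fixed in the code.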