From Box to Hydra via Archivematica
Turning proof of concept into reality
Background
• University of Hull and University of York working on a Research Data Spring
project
• Filling the digital preservation gap, 2015-16
• https://www.york.ac.uk/borthwick/projects/archivematica/
• Dual use cases for the University of Hull
• Digital preservation of archival materials
• Management and preservation of research data
Systems background
• Box
• Institutional subscription from 2015
• Supported and managed personal cloud storage service
• Archivematica
• No experience prior to the project, but had watched its development over a period
of years
• Particularly liked the combination of microservices that can be used flexibly
according to use case
Repository
• Hydra digital repository – http://hydra.hull.ac.uk
• Implemented 2012 based on previous Fedora repository
• Designed to hold any structured digital collection (within reason!) to meet
University’s needs
• NB ** Hydra is now Samvera **
• Community is refreshing and re-launching for the next decade
• Watch this space – http://samvera.org
• New website and logo coming shortly
Questions
• How can we enable a preservation workflow with the systems environment
available to us?
• How can we facilitate pathways to preserving archival materials and
research data alongside each other?
• What is required to bring these different components together to best
effect?
• Ingest to the system, either direct or via ingest folder (Box)
• Archivematica captures content and processes it through microservices
• Archivematica outputs AIP for storage and DIP for repository
• DIP processor unpacks DIPs and creates repository objects
• Repository manages access to objects
Project focus
• User assembles files and simple descriptive file(s) in Box
folder. Shares the folder with Archivematica
• System checks folder contents and if OK creates a bag
(BagIt standard) for each object which is passed to
Archivematica
• Archivematica processes the bag to create an AIP which
goes to a preservation store…
• …and also a DIP which is passed to the DIP processor
• DIP processor creates Hydra objects from the DIP
contents and injects them into the repository QA
queue…
• …matched to the AIP by UUID
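The final matching step could be sketched as below. This is a minimal illustration rather than the project's own code: it assumes that each DIP and AIP package name ends with the same UUID, and the function names `trailing_uuid` and `match_dips_to_aips` are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Assumption: package names end in a UUID, e.g. "survey-<uuid>".
UUID_SUFFIX = re.compile(
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)


def trailing_uuid(package_name: str) -> Optional[str]:
    """Return the UUID at the end of a package name, or None if absent."""
    match = UUID_SUFFIX.search(package_name)
    return match.group(1).lower() if match else None


def match_dips_to_aips(
    dip_names: Iterable[str], aip_names: Iterable[str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each DIP's UUID to (DIP name, matching AIP name or None)."""
    aips_by_uuid = {}
    for name in aip_names:
        uuid = trailing_uuid(name)
        if uuid is not None:
            aips_by_uuid[uuid] = name

    matches = {}
    for name in dip_names:
        uuid = trailing_uuid(name)
        if uuid is not None:
            # A DIP with no stored AIP is kept with None so it can be flagged.
            matches[uuid] = (name, aips_by_uuid.get(uuid))
    return matches
```

A DIP paired with `None` has no AIP with the same UUID and would probably need to be held back from the QA queue.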
Joining up the dots
• The joins between the three components were:
• A ‘Box-watcher’ – users share their data with a nominated Box user account for the
Archivematica system. This account watches for shares with it, and automatically
creates a BagIt bag of the files found and transfers this to Archivematica for processing
• A ‘DIP processor’ – this takes the BagIt DIP from Archivematica, breaks it open and
uses the information within this to create repository objects
• These tools were wrapped into a single gem, hullsync
• https://github.com/uohull/hullsync
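The bag-creation step of the Box-watcher might look roughly like the sketch below. It is not taken from hullsync (a Ruby gem); it is a standard-library Python illustration that assumes the shared files have already been downloaded to a local folder, and `make_bag` is a hypothetical name. It writes only the minimum the BagIt specification requires: a `data/` payload directory, a `bagit.txt` declaration and a payload manifest.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy source_dir into bag_dir/data and write minimal BagIt tag files."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    payload_dir = bag_dir / "data"
    # copytree creates bag_dir and data/; it fails if data/ already exists.
    shutil.copytree(source_dir, payload_dir)

    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to bag>" line per file.
    lines = []
    for path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

Reading each file whole keeps the sketch short; large research data files would be better hashed in chunks.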
Deposit options
• Depositors have several options:
• A folder containing multiple data files and one descriptive file → a single AIP and a single repository
object with (optionally) one or more surrogate files for download (so can be a “metadata-only”
record)
• A folder containing multiple files and a csv file (one row per file) → multiple AIPs with multiple
repository objects, each with (optionally) a surrogate for download
• A folder containing the top-level folder of a structure → a zipped structure in a single AIP and a single
repository object (optionally) containing the zipped file for download
In detail – option 1
• A folder containing multiple data files and one descriptive file → a single
AIP and a single repository object with (optionally) one or more surrogate
files for download (so can be a “metadata-only” record)
• Data files are associated with a .txt descriptive file providing associated metadata
• Descriptive file can be used to determine access permissions and content model
• Descriptive metadata can be provided using Dublin Core
• Can also submit README.txt for information to inform repository staff on
appropriate actions
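As an illustration only, a descriptive file for option 1 could be read as below. The slides do not specify the file's layout, so this sketch assumes one `field: value` pair per line, using Dublin Core element names plus a hypothetical `access` field; `parse_description` is likewise a made-up name.

```python
from collections import defaultdict
from typing import Dict, List


def parse_description(text: str) -> Dict[str, List[str]]:
    """Parse 'field: value' lines; repeated fields accumulate in order."""
    fields = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split on the first colon only, so values may contain colons (URLs).
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"Expected 'field: value', got: {raw_line!r}")
        fields[key.strip().lower()].append(value.strip())
    return dict(fields)
```

Values are kept as lists because Dublin Core elements such as creator are repeatable.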
In detail – option 2
• A folder containing multiple files and a csv file (one row per file) → multiple
AIPs with multiple repository objects, each with (optionally) a surrogate for
download
• Use a .csv file instead of a .txt file for the descriptive information
• Use column headings to cover the same fields as in option 1
• Can associate the same or different metadata with each object
• Can create simple or compound objects
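A rough sketch of reading the option 2 descriptive file, one row per file, follows. The column headings (`filename`, `title`, `creator`) and the function name `read_deposit_rows` are assumptions for illustration, not the headings the project used, and compound objects are not handled.

```python
import csv
import io
from typing import Dict


def read_deposit_rows(
    csv_text: str, file_column: str = "filename"
) -> Dict[str, Dict[str, str]]:
    """Return {file name: metadata} with one entry per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or file_column not in reader.fieldnames:
        raise ValueError(f"CSV needs a {file_column!r} column")

    objects = {}
    for row in reader:
        name = (row[file_column] or "").strip()
        if name in objects:
            raise ValueError(f"File listed more than once: {name!r}")
        # Skip the file column itself and any unnamed overflow columns.
        objects[name] = {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None and key != file_column
        }
    return objects
```

Each entry in the result would then become its own bag, and so its own AIP and repository object.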
In detail – option 3
• A folder containing the top-level folder of a structure → a zipped structure
in a single AIP and a single repository object (optionally) containing the
zipped file for download
• Aim is to allow the submission of a folder or nested folders of data, replicating how
the files are organised
• Files are unpacked by Archivematica, analysed, and then re-zipped up for submission
to the repository
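The re-zipping step could be sketched as follows. This is not how Archivematica does it internally; it is a standard-library illustration, with the hypothetical name `zip_structure`, of packaging a nested folder so that the relative paths inside the zip replicate the original organisation.

```python
import zipfile
from pathlib import Path


def zip_structure(top_level: Path, zip_path: Path) -> Path:
    """Zip top_level, keeping its folder name and nesting inside the archive.

    zip_path should sit outside top_level so the archive does not include
    itself. Empty directories are not recorded in this sketch.
    """
    top_level = Path(top_level)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(top_level.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the top folder is kept.
                archive.write(path, path.relative_to(top_level.parent).as_posix())
    return Path(zip_path)
```

Dropping empty directories is one example of the detail the zipping and unzipping workflows would need to settle before production use.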
Lessons learned
• Error handling needs attention when turning the proof of concept into production
• But the testing highlighted a lot of the errors that would need handling
• A key element when joining systems together
• Normalisation of filetypes requires additional consideration
• E.g., how to deal with TIFF files converted to JPG
• The zipping and unzipping workflows require further attention to ensure
success for this option
Next steps
• Take learning and tools from the Research Data Spring project and use these
as the basis for development of services
• Two use cases
• Research data storage and management service development
• City of Culture digital archive
• Understanding Archivematica pipelines and options better – Perpetua test!
• Focus on improving proof-of-concept and developing additional
functionality
Research data storage and management
• Joint Library and ICTD project to discover and understand research data
storage and management needs amongst academic staff
• Open workshops
• Data interviews
• Capture and processing of research data is part of local provision, alongside
advice and guidance on options outside the institution
City of Culture digital archive
• Hull2017 – City of Culture
• Events throughout the year
• Four data elements
• Business archive
• Creative archive
• Participatory archive
• Research and evaluation archive
• Applying the same technology environment to manage ingest and delivery
Key issues going forward
• What are the differences in pipeline processing in Archivematica between
research data and archival materials?
• Dealing with unusual file formats – a key learning point from the RDS
project
• Scaling up to meet heavier data demands
• Being realistic about what we can’t use this environment for and where we need
alternative approaches, e.g., Big Data
To conclude
• Combining components has its issues, but it has been better to exploit
systems that do certain parts of the workflow well and turn them into more
than the sum of their parts
• Data is not simple
• We need flexibility in how we look to manage it
• We need engagement with researchers to understand it
• Turning an idea into production needs careful planning
• Scope for community exchange or training on how to do this?
Thank you
c.awre@hull.ac.uk
(And many thanks to the University of York and my colleagues Richard Green and
Simon Wilson, plus Cottage Labs LLC for their work on this)
