GROWING OPEN DATA: MAKING
THE SHARING OF XXL-SIZED
RESEARCH DATA FILES ONLINE
A REALITY, USING EDINBURGH
DATASHARE
PAULINE WARD: PAULINE.WARD@ED.AC.UK @PAULINEDATAWARD
GEORGE HAMILTON
THE CHALLENGE
• Researchers are generating bigger files. At the University of
Edinburgh, all researchers are entitled to 500 GB of storage.
THE CHALLENGE
• Researchers need to be able to share their data online.
• For impact.
• For discoverability.
• For reproducibility.
• For compliance.
THE CHALLENGE
• DataShare is the Institutional Repository for research data for
staff and students at the University of Edinburgh:
datashare.is.ed.ac.uk .
• Previous file size limit of 2.1 GB.
• Largest file we’ve been asked to share: 20 GB – split into
smaller files.
• Largest fileset we’ve been asked to share: 226 GB – split into
smaller filesets.
THE CHALLENGE
• Some files had to be imported via a time-consuming batch
import process because they were too big or too numerous for
web deposit.
• Some files are still waiting to be shared because they are too big
for users to download conveniently.
• These files come from a wide range of disciplines and are
generated by a wide range of methods.
THE SOLUTION
• Getting the files from the depositors: address upload
• Allowing users to get the files: address download
THE SOLUTION: UPLOAD
• HTML5 resumable upload
THE SOLUTION: UPLOAD
• EDINA’s code for implementing HTML5 upload in DSpace is on
GitHub:
https://github.com/edina/DSpace/tree/xml-html5-upload
• Uses resumable.js
• This was the XMLUI re-write of functionality that was available
for DSpace 5.0 JSPUI. See
https://jira.duraspace.org/browse/DS-1562 for further details.
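As an illustration of the idea behind resumable upload, the sketch below splits a file into numbered chunks and works out which chunks still need sending after an interruption. It is a minimal sketch of the general technique, not the code in the EDINA branch or the internals of resumable.js; the names `Chunk`, `plan_chunks` and `chunks_to_send` are hypothetical, and the sketch assumes the server can report which chunk numbers it already holds.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Chunk:
    number: int  # 1-based chunk number
    offset: int  # byte offset of the chunk within the file
    length: int  # number of bytes in the chunk


def plan_chunks(total_size: int, chunk_size: int) -> List[Chunk]:
    """Split a file of total_size bytes into fixed-size chunks.

    Assumption for this sketch: every chunk has chunk_size bytes
    except the last one, which holds whatever remains.
    """
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    count = (total_size + chunk_size - 1) // chunk_size  # integer ceiling
    return [
        Chunk(
            number=n + 1,
            offset=n * chunk_size,
            length=min(chunk_size, total_size - n * chunk_size),
        )
        for n in range(count)
    ]


def chunks_to_send(plan: List[Chunk], already_received: Set[int]) -> List[Chunk]:
    """Return the chunks the server does not yet hold, in upload order."""
    return [chunk for chunk in plan if chunk.number not in already_received]
```

When an upload is resumed, only the chunks returned by `chunks_to_send` need to be transferred again, so an interrupted upload of a large file does not have to restart from the first byte.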
THE SOLUTION: UPLOAD
• Testing shows files up to 15 GB upload successfully.
• (cf. figshare 5 GB file size limit, Zenodo 2 GB)
• A 20 GB file upload has been done in testing, but it generates an
error message in the browser, and the user must find and Resume the
submission from the Submissions page.
• Multiple files can be uploaded by drag and drop.
THE SOLUTION: DOWNLOAD
We wanted a mechanism for zipping up files for download, which
DSpace doesn’t provide.
• BitTorrent was one possible approach: could be added at a later
date
• Other approaches are possible (rsync, Secure Copy (SCP))
THE SOLUTION: DOWNLOAD
• FTP download: agreed
• Tried and tested technology that we are confident we can put in place
and that will work well
• All files will be accessed from the FTP server anonymously
• Users can still download files over FTP in their browser
• Users who wish can use an FTP client, allowing them to resume a
download
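What "resuming" means for an FTP client can be sketched with Python's standard `ftplib`: the client checks how many bytes it already has and asks the server to restart the transfer from that offset. This is a minimal sketch under stated assumptions: the host and paths are placeholders supplied by the caller, the server is assumed to accept anonymous login and the FTP `REST` command, and `resume_offset` and `download_with_resume` are hypothetical names.

```python
import os
from ftplib import FTP


def resume_offset(local_path: str) -> int:
    """Return how many bytes of the file are already on disk (0 if none)."""
    return os.path.getsize(local_path) if os.path.exists(local_path) else 0


def download_with_resume(host: str, remote_path: str, local_path: str) -> None:
    """Download remote_path, continuing from any partial local copy.

    Assumes the server allows anonymous access and supports restarting
    a transfer at a byte offset (the REST command).
    """
    offset = resume_offset(local_path)
    with FTP(host) as ftp:
        ftp.login()  # no arguments: anonymous login
        # Append to the partial file so earlier bytes are kept.
        with open(local_path, "ab") as out:
            ftp.retrbinary(f"RETR {remote_path}", out.write, rest=offset or None)
```

If the connection drops, calling `download_with_resume` again with the same arguments picks up from the size of the partial file rather than from the start.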
THE SOLUTION: DOWNLOAD
• Specification:
• All files will still be required to have appropriate metadata stored in
DSpace
• All filesets will now be downloadable as a zip file (previous 5.2 GB limit)
• Move DSpace assetstore to a location where more storage is available
• Statistics (i.e. numbers) of file downloads by SFTP will be added to
DSpace statistics
THE SOLUTION: DOWNLOAD
• This is a replacement for our current on-the-fly zip file
creation of Item bitstreams.
• Will mitigate potential performance issues, because it will use
fewer server resources (Java threads and RAM)
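One way to avoid building a zip file on every download request is to write the zip for a fileset to disk once and serve that file afterwards. The sketch below illustrates that approach only; it assumes the zip is pre-built, which the slides do not state explicitly, and it is not the DataShare implementation. `build_fileset_zip` is a hypothetical name.

```python
import os
import zipfile
from typing import List


def build_fileset_zip(source_dir: str, zip_path: str) -> List[str]:
    """Write every file under source_dir into one zip file at zip_path.

    zip_path must be outside source_dir, otherwise the archive would
    try to include itself. Files are stored without compression
    (an assumption: many research data formats are already compressed),
    and ZIP64 is enabled so the archive can exceed 4 GiB.
    """
    with zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_STORED, allowZip64=True
    ) as archive:
        for root, dirs, files in os.walk(source_dir):
            dirs.sort()  # visit subdirectories in a stable order
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to the fileset root.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
        return archive.namelist()
```

Because the archive is created once, later downloads only read an existing file from disk, and the web application does not have to hold a thread open while it zips the files.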
SUMMARY
• We have implemented HTML5 upload in the DataShare (DSpace)
web interface to allow depositors to easily and quickly deposit
individual files up to 15 GB.
• We are working on integrating an SFTP server to allow users to
retrieve filesets larger than our current 20 GB limit. Storage
rather than network/browser timeout will become the limiting
factor on fileset size. We anticipate making numerous filesets
around 100 GB available in this way in the medium term.
