Growing Open Data: Making the sharing of XXL-sized research data files online a reality, using Edinburgh DataShare

I explained how we overcame the technical challenges preventing our depositors from sharing bigger files: we added HTML5 upload to our DSpace repository (Edinburgh DataShare), allowing upload of files up to 15 GB, and even 20 GB in some circumstances. These slides also explain our plans to link up an FTP server to Edinburgh DataShare to make it convenient for users to download filesets of around 10 GB and possibly even bigger. Unfortunately I hit my 7-minute limit before I had a chance to get to the FTP bit. Thank you Slideshare for allowing everybody to find out what was in the rest of my talk!

The code for the HTML5 upload is at https://github.com/edina/DSpace/tree/xml-html5-upload . This 7-minute presentation ("24x7") was given at Open Repositories 2016 on the 14th of June 2016. The full abstract is at: https://www.conftool.com/or2016/index.php?page=browseSessions&form_session=126
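
As a rough illustration of the idea behind resumable upload (this is not EDINA's code and does not use the resumable.js API), the sketch below splits a file into fixed-size chunks, lets a stand-in server remember which chunks have arrived, and sends only the missing ones after an interruption. The names `ChunkStore`, `upload`, `fail_after` and the chunk sizes are all hypothetical, chosen for the example.

```python
import math

CHUNK_SIZE = 1024 * 1024  # hypothetical 1 MiB chunk size, for illustration only


class ChunkStore:
    """Stand-in for the server side: remembers which chunks have arrived."""

    def __init__(self):
        self.chunks = {}

    def has_chunk(self, number):
        return number in self.chunks

    def save_chunk(self, number, data):
        self.chunks[number] = data

    def assemble(self, total_chunks):
        # Only valid once every chunk from 1..total_chunks has arrived.
        return b"".join(self.chunks[n] for n in range(1, total_chunks + 1))


def upload(data, store, chunk_size=CHUNK_SIZE, fail_after=None):
    """Send only the missing chunks; return how many were sent in this attempt.

    fail_after simulates a dropped connection after that many chunks.
    """
    total = math.ceil(len(data) / chunk_size)
    sent = 0
    for number in range(1, total + 1):
        if store.has_chunk(number):
            continue  # already on the server, so skip it when resuming
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated network drop")
        start = (number - 1) * chunk_size
        store.save_chunk(number, data[start:start + chunk_size])
        sent += 1
    return sent


# Demo: a 10,240-byte payload in 1,024-byte chunks (10 chunks in total).
payload = bytes(range(256)) * 40
store = ChunkStore()
try:
    upload(payload, store, chunk_size=1024, fail_after=3)  # drops after 3 chunks
except ConnectionError:
    pass
remaining = upload(payload, store, chunk_size=1024)  # resumes with the other 7
```

Because the second call skips chunks the store already holds, an interrupted deposit costs only the chunks that were still outstanding rather than a restart from zero.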

N.B. The 20 GB file which I have uploaded, and which is our current upper size limit, is 20 GB in the strict binary sense, i.e. 20 x 1,024 x 1,024 x 1,024 bytes (20 GiB). Linux displays this as 21,474,836,480 bytes, whereas Windows usually expresses the size in KB, i.e. 20,971,520 KB.
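
The arithmetic above can be checked with a few lines of Python. The constant and variable names are illustrative only and are not part of DataShare.

```python
# Size of a "strict" 20 GB file, i.e. 20 x 1,024 x 1,024 x 1,024 bytes.
BYTES_PER_KB = 1024
BYTES_PER_GB = 1024 ** 3

size_bytes = 20 * BYTES_PER_GB
size_kb = size_bytes // BYTES_PER_KB

print(f"{size_bytes:,} bytes")  # 21,474,836,480 bytes
print(f"{size_kb:,} KB")        # 20,971,520 KB
```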

  1. GROWING OPEN DATA: MAKING THE SHARING OF XXL-SIZED RESEARCH DATA FILES ONLINE A REALITY, USING EDINBURGH DATASHARE
     PAULINE WARD: PAULINE.WARD@ED.AC.UK @PAULINEDATAWARD
     GEORGE HAMILTON
  2. THE CHALLENGE
     • Researchers are generating bigger files. At the University of Edinburgh, all researchers are entitled to 500 GB of storage.
  3. THE CHALLENGE
     • Researchers need to be able to share their data online.
     • For impact.
     • For discoverability.
     • For reproducibility.
     • For compliance.
  4. THE CHALLENGE
     • DataShare is the Institutional Repository for research data for staff and students at the University of Edinburgh: datashare.is.ed.ac.uk
     • Previous file size limit of 2.1 GB.
     • Largest file we’ve been asked to share: 20 GB – split into smaller files.
     • Largest fileset we’ve been asked to share: 226 GB – split into smaller filesets.
  5. THE CHALLENGE
     • Some files had to be imported via a time-consuming batch import process because they were too big or too numerous for web deposit.
     • Some files are still waiting to be shared because they are too big for users to download conveniently.
     • These files come from a wide range of disciplines and are generated by a wide range of methods.
  6. THE SOLUTION
     • Getting the files from the depositors: address upload
     • Allowing users to get the files: address download
  7. THE SOLUTION: UPLOAD
     • HTML5 resumable upload
  8. THE SOLUTION: UPLOAD
     • EDINA’s code for implementing HTML5 upload in DSpace is on GitHub: https://github.com/edina/DSpace/tree/xml-html5-upload
     • Uses resumable.js
     • This was the XMLUI re-write of functionality that was available for DSpace 5.0 JSPUI. See https://jira.duraspace.org/browse/DS-1562 for further details.
  9. THE SOLUTION: UPLOAD
     • Testing shows files up to 15 GB upload successfully.
     • (cf. figshare’s 5 GB file size limit, Zenodo’s 2 GB)
     • A 20 GB file upload has been done in testing, but it generates an error message in the browser, and the user must find and Resume the submission from the Submissions page.
     • Multiple files can be uploaded by drag’n’drop.
  10. THE SOLUTION: DOWNLOAD
     We wanted a mechanism, which DSpace doesn’t provide, of zipping up files for download.
     • BitTorrent was one possible approach: could be added at a later date
     • Other approaches possible (Rsync, Secure Copy (SCP))
  11. THE SOLUTION: DOWNLOAD
     • FTP download: agreed
     • Tried and tested technology that we are confident we can put in place and that will work well
     • All files will be accessed from the FTP server anonymously
     • Users can still download files in the browser, via FTP
     • Users who wish can use an FTP client, allowing them to resume a download
  12. THE SOLUTION: DOWNLOAD
     • Specification:
       • All files will still be required to have appropriate metadata stored in DSpace
       • All filesets will now be downloadable as a zip file (previous 5.2 GB limit)
       • Move the DSpace assetstore to a location where more storage is available
       • Statistics (i.e. numbers) of file downloads by SFTP will be added to DSpace statistics
  13. THE SOLUTION: DOWNLOAD
     • This is a replacement for our current on-the-fly zip file creation of Item bitstreams.
     • It will mitigate potential performance issues because it will use fewer server resources (Java threads and RAM).
  14. SUMMARY
     • We have implemented HTML5 upload in the DataShare (DSpace) web interface to allow depositors to easily and quickly deposit individual files up to 15 GB.
     • We are working on integrating an SFTP server to allow users to retrieve filesets larger than our current 20 GB limit. Storage rather than network/browser timeout will become the limiting factor on fileset size. We anticipate making numerous filesets around 100 GB available in this way in the medium term.
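
Slide 11 notes that an FTP client lets users resume a download. A minimal sketch of how that might look with Python's standard `ftplib` is given below, assuming a plain FTP server with anonymous access. The function name `resume_download`, the host and the file names are hypothetical and do not describe the DataShare deployment. The function appends to any partial local file and passes its size as the `rest` offset of `retrbinary`, so the server is asked to continue from that byte.

```python
import os


def resume_download(ftp, remote_name, local_path, blocksize=8192):
    """Fetch remote_name through an ftplib.FTP-like object, continuing a partial file.

    Returns the byte offset the transfer was resumed from (0 for a fresh download).
    """
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    with open(local_path, "ab") as out:
        # rest=offset makes ftplib send a REST command so the server starts there.
        ftp.retrbinary(f"RETR {remote_name}", out.write,
                       blocksize=blocksize, rest=offset or None)
    return offset


# Hypothetical usage against a real server (host and file name are placeholders):
# from ftplib import FTP
# with FTP("ftp.example.org") as ftp:
#     ftp.login()  # no arguments means anonymous login
#     resume_download(ftp, "fileset.zip", "fileset.zip")
```

This relies on the server supporting the FTP `REST` command. It does not verify that the partial local file matches the start of the remote one, so a checksum comparison after the download would be a sensible addition.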
