Moving Gigantic Files Into and Out of the Alfresco Repository

This talk is a technical case study showing how Metaversant solved a problem for one of their clients, Noble Research Institute. Researchers at Noble deal with very large files, which are often difficult to move into and out of the Alfresco repository.

Published in: Technology

  1. Moving Gigantic Files In and Out of the Repository. Jeff Potts, Metaversant Group, Inc.
  2. Learn. Connect. Collaborate. What’s the Deal with Large Files?
     • Alfresco can manage files of any size, but getting large files into and out of the repo is often problematic
     • They take way too long to transfer
       – Sessions time out
       – Machines go to sleep
       – Incomplete files get transferred
       – Users think, “Is this thing hung?” and then cancel
     • End-users must actively monitor transfers in most cases
  3. This talk is a technical case study about an approach to significantly improving large file transfers.
  4. About Noble Research Institute
     • Research organization focused on improving agriculture for all mankind
       – Research
       – Producer Relations
       – Applied agricultural systems and stewardship
       – Education
     • About 400 employees from all over the world
     • Headquartered in Ardmore, Oklahoma
     • https://www.noble.org
  5. • Consulting firm focused on solving business problems with open source Content Management, Workflow, & Search technology
     • Founded in 2010
     • Clients all over the world in a variety of industries, including:
       – Airlines
       – Manufacturing
       – Construction
       – Financial Services
       – Higher Education
       – Life Sciences
       – Professional Services
     • https://www.metaversant.com
  6. The Problem
     • Researchers work with very large files
     • Typical size ranges from a few GB to hundreds of GB
     • Source of the files is mixed
       – Generated internally (e.g., by gene sequencing machines)
       – Acquired as data sets from other research institutions
     • Data governance team wants everything in Alfresco
     • Large size makes moving files in and out of Alfresco difficult
  7. What We Tried
     • Desktop Sync
     • CMIS update content stream
       – Versions are created, somewhat painful to disable auto-versioning
     • Increasing timeouts
       – Losing battle when files are multiple gigabytes
     • Using Alfresco FTP
       – Usually requires thick client installed
       – Not preferred by end-users
     • Resumable upload Share customization
       – Actually worked pretty well
       – Only handles uploads, not downloads
  8. Sidebar: Resumable Upload Details
     • Share customization (closed source)
     • Leverages resumable.js, see http://www.resumablejs.com/
     • Utilizes the HTML5 File API
     • If an upload stalls or ends prematurely, the end-user can restart where it left off
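
The "restart where it left off" behavior can be illustrated with a minimal sketch. This is not the closed-source Share customization and it does not reproduce resumable.js's exact chunking rules; the function name `pending_chunks`, the 1-based chunk numbering, and the fixed chunk size are assumptions made for illustration. The idea is only that the client tracks which chunks the server already holds and sends the rest.

```python
import math


def pending_chunks(file_size: int, chunk_size: int, uploaded: set[int]) -> list[tuple[int, int, int]]:
    """Return (chunk_number, start, end) for every chunk not yet uploaded.

    Chunk numbers are 1-based (an assumption for this sketch) and `end`
    is exclusive, so the byte slice to send is file[start:end].
    """
    total = math.ceil(file_size / chunk_size)
    remaining = []
    for number in range(1, total + 1):
        if number in uploaded:
            continue  # the server already has this chunk, skip it on resume
        start = (number - 1) * chunk_size
        end = min(start + chunk_size, file_size)  # the last chunk may be short
        remaining.append((number, start, end))
    return remaining
```

With a 10-byte file, 4-byte chunks, and chunk 1 already uploaded, only chunks 2 and 3 are returned, so a stalled upload resumes from byte 4 rather than byte 0.
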
  9. Inescapable math related to moving large files
     • How long does it take to move 25 GB of data?
       – Ethernet = 10 Mbit/s = 333.33 minutes
       – Fast Ethernet = 100 Mbit/s = 33.33 minutes
       – Gigabit Ethernet = 1 Gbit/s = 3.33 minutes
       – 10 Gigabit Ethernet = 10 Gbit/s = 0.33 minutes
       – 100 Gigabit Ethernet = 100 Gbit/s = 0.03 minutes
     • Assumes full bandwidth is available
     • Network only, does not account for disk or other non-network latencies on either end
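
The figures on this slide follow from simple arithmetic: 25 GB is 200,000 megabits when using decimal units (1 GB = 1,000 MB), and dividing by the link speed gives seconds. A small sketch, with the function name `transfer_minutes` chosen here for illustration:

```python
def transfer_minutes(size_gb: float, link_mbit_per_s: float) -> float:
    """Best-case minutes to move size_gb over a link, ignoring all overhead.

    Uses decimal units (1 GB = 1000 MB = 8000 Mbit) and assumes the full
    link bandwidth is available, as the slide does.
    """
    megabits = size_gb * 8 * 1000
    seconds = megabits / link_mbit_per_s
    return seconds / 60


# Reproduce the slide's table for a 25 GB file.
for name, mbit in [("Ethernet", 10), ("Fast Ethernet", 100), ("Gigabit Ethernet", 1_000),
                   ("10 Gigabit Ethernet", 10_000), ("100 Gigabit Ethernet", 100_000)]:
    print(f"{name}: {transfer_minutes(25, mbit):.2f} minutes")
```
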
  10. It’s not the actual import/export that’s killing us, it’s the movement of so many bytes over the network.
  11. Technologies That Move Large Files
      • BitTorrent
        – Looked at BitTorrent Sync, which became Resilio Sync
        – Performance increases when multiple people have the same file
        – Primarily peer-to-peer with an emphasis on desktop-to-desktop or between devices
      • GridFTP
        – Extends FTP to add parallelism
        – Multiple implementations, including at least one that is commercially supported
        – Works between servers, desktop-to-server, and between devices
  12. GridFTP was created to move large files to clusters
      • Extension of FTP
      • Defined by the Open Grid Forum (http://www.ogf.org)
      • Designed specifically to facilitate transfers of large files and large sets of files
      • Uses multiple parallel streams to move data over TCP
      • One of several methods that a product called Globus uses to move data between endpoints
      • More information at http://toolkit.globus.org/toolkit/docs/6.0/gridftp/
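
To make the "multiple parallel streams" idea concrete, here is a minimal sketch that splits a file into byte ranges and copies each range on its own worker. It is not GridFTP and uses no GridFTP library; it copies between two local files purely to show how a transfer can be partitioned so that several streams work at once. All function names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size: int, streams: int) -> list[tuple[int, int]]:
    """Split [0, size) into up to `streams` contiguous (start, end) ranges."""
    base, extra = divmod(size, streams)
    ranges, start = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)  # spread the remainder evenly
        if length == 0:
            continue
        ranges.append((start, start + length))
        start += length
    return ranges


def copy_range(src: str, dst: str, start: int, end: int, block: int = 1 << 20) -> None:
    """Copy bytes [start, end) from src into the same offsets of dst."""
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(start)
        fout.seek(start)
        remaining = end - start
        while remaining > 0:
            data = fin.read(min(block, remaining))
            if not data:
                break
            fout.write(data)
            remaining -= len(data)


def parallel_copy(src: str, dst: str, streams: int = 4) -> None:
    """Copy src to dst using one worker per byte range."""
    size = os.path.getsize(src)
    with open(dst, "wb") as f:
        f.truncate(size)  # pre-size the target so each worker can seek into it
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(copy_range, src, dst, s, e) for s, e in byte_ranges(size, streams)]
        for future in futures:
            future.result()  # re-raise any worker error
```

On a real network the benefit comes from each stream having its own TCP connection; the local copy above only demonstrates the partitioning.
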
  13. Globus provides data migration tools to researchers
      • Non-profit business within the University of Chicago
      • Focused on providing low-cost tools to researchers doing data-intensive research
      • Globus is SaaS that acts as a middleman to coordinate transfers of data between endpoints
      • Publishes a list of public endpoints
      • Provides API and services such as authentication
      • Sync between two endpoints typically uses GridFTP protocol
      • It is possible to use GridFTP without leveraging Globus
        – See http://toolkit.globus.org/toolkit/docs/latest-stable/admin/install/
  14. Globus/GridFTP helps move bytes over the network. The Alfresco Bulk File System Import Tool (BFSIT) does fast imports once the files are on the server.
  15. High-Level Approach: Two-Step Import. First step: Globus Personal Connect to Globus Endpoint. (Diagram label: Shared Mount)
  16. High-Level Approach: Two-Step Import. Second step: Alfresco Bulk File System Import. (Diagram label: Shared Mount)
  17. High-Level Approach: Two-Step Export. First step: Write file(s) to file system. (Diagram label: Shared Mount)
  18. High-Level Approach: Two-Step Export. Second step: Globus Endpoint to Globus Personal Connect. (Diagram label: Shared Mount)
  19. With the high-level approach determined, it was time to work on the details.
  20. Where to Put the UI?
      • Considered Share
        – But researchers were already looking for a more streamlined interface
      • Considered ADF
        – But it was too new at the time
        – Wasn’t the right fit for this particular application
      • Decided on custom Spring Boot application
        – Needed an app anyway
        – Could bring in ADF later if desired
  21. Custom Globus Alfresco Transfers application
      Simple Scope
      • Start transfer jobs
      • See the status of transfer jobs
      • Publishes and subscribes to queues used to coordinate multi-step transfers
      • Authentication
        – Authenticates against Alfresco
        – Accounts linked to Globus via OAuth
      Built With
      • Spring Boot
      • Angular 4
      • Bootstrap 3
      • Apache ActiveMQ
      • Apache Maven
  22. Solution Components (Diagram label: Shared Mount)
      • Alfresco Enterprise Edition, Clustered
      • Globus Server Endpoint
      • Both point to the same shared mount
  23. Solution Components (Diagram label: Shared Mount)
      • Globus SaaS communicates with
        – Globus Server Endpoint
        – Each individual’s Globus Personal Connect
      • Globus SaaS provides a REST API
  24. Solution Components (Diagram label: Shared Mount)
      • Spring Boot application used to create transfer jobs
      • Coordinates the transfers
      • Persists transfer job and user objects to PostgreSQL
  25. Solution Components (Diagram label: Shared Mount)
      • Everything is asynchronous
      • Apache ActiveMQ acts as the message broker, persists queues
  26. Queues and Listeners
      • Alfresco Import Listener: Given a file path, imports it into a specified node ref using BFSIT
      • Alfresco Export Listener: Given a node ref, exports it to a specified file path
      • Globus Inbound Transfer Listener: Given an endpoint ID and a path, transfers it to the Noble endpoint
      • Globus Outbound Transfer Listener: Given a path on the Noble endpoint, transfers it to a specified path on an endpoint
      • Transfer Status Listener: Persists status changes; kicks off next step
      (Diagram labels: AMP; Globus Alfresco Transfers Spring Boot App)
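
The queue-and-listener arrangement on this slide can be sketched in a few lines. The real system uses Apache ActiveMQ with listeners in a Spring Boot application and an Alfresco AMP; the sketch below replaces all of that with a tiny in-process broker so the decoupling is visible. The `Broker` class, the queue names, the message fields, and the endpoint ID and path are all hypothetical.

```python
from collections import defaultdict, deque


class Broker:
    """Toy in-process stand-in for a message broker (not ActiveMQ)."""

    def __init__(self):
        self.listeners = defaultdict(list)
        self.pending = deque()

    def subscribe(self, queue_name, listener):
        self.listeners[queue_name].append(listener)

    def publish(self, queue_name, message):
        self.pending.append((queue_name, message))

    def run(self):
        # Deliver messages until no listener publishes anything further.
        while self.pending:
            queue_name, message = self.pending.popleft()
            for listener in self.listeners[queue_name]:
                listener(message)


broker = Broker()
status_log = []


def globus_inbound_listener(msg):
    # A real listener would ask Globus to move msg["path"] from
    # msg["endpoint_id"] to the server endpoint; here we only report success.
    broker.publish("transfer.status", {"job_id": msg["job_id"], "status": "GLOBUS_TRANSFER_DONE"})


def transfer_status_listener(msg):
    # A real listener would persist the change; here we keep it in memory.
    status_log.append((msg["job_id"], msg["status"]))


broker.subscribe("globus.inbound", globus_inbound_listener)
broker.subscribe("transfer.status", transfer_status_listener)
broker.publish("globus.inbound", {"job_id": 1, "endpoint_id": "hypothetical-endpoint", "path": "/data/sample.dat"})
broker.run()
```

Each listener knows only the queue it reads from and the queue it reports to, which is what lets the multi-step transfer run as an asynchronous job.
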
  27. Importing into Alfresco
  28. Transfer to Alfresco (1) (Diagram label: Shared Mount)
      1. Save transfer job
      2. Put message on a queue: “Do Globus transfer”
  29. Transfer to Alfresco (2) (Diagram label: Shared Mount)
      1. See message: “Do Globus transfer”
      2. Start transfer
      3. Perform the transfer
      4. Put message on the queue: “Globus transfer done”
  30. Transfer to Alfresco (3) (Diagram label: Shared Mount)
      1. See message: “Globus transfer done”
      2. Update status
      3. Queue message: “Do Alfresco transfer”
  31. Transfer to Alfresco (4) (Diagram label: Shared Mount)
      1. See message: “Do Alfresco import”
      2. BFSIT import
      3. Queue message: “Alfresco import done”
      4. See message: “Alfresco import done”
      5. Update status
  32. Downloading from Alfresco
  33. Transfer from Alfresco (1) (Diagram label: Shared Mount)
      1. Save transfer job
      2. Put message on a queue: “Do Alfresco export”
  34. Transfer from Alfresco (2) (Diagram label: Shared Mount)
      1. See message: “Do Alfresco export”
      2. Custom export
      3. Queue message: “Alfresco export done”
  35. Transfer from Alfresco (3) (Diagram label: Shared Mount)
      1. See message: “Alfresco export done”
      2. Update status
      3. Queue message: “Do Globus transfer”
  36. Transfer from Alfresco (4) (Diagram label: Shared Mount)
      1. See message: “Do Globus transfer”
      2. Initiate transfer
      3. Do transfer
      4. Queue message: “Globus transfer done”
      5. See message
      6. Set status
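
The import and export walkthroughs share one pattern: a "done" message arrives, the status is updated, and the next command is queued until the job has no further step. A minimal sketch of that routing follows, as a plain lookup table rather than the actual Spring Boot listener. The message strings follow the slide text (using “Do Alfresco import” from the fourth import slide); the `Direction` enum and the function `trace` are hypothetical names.

```python
from enum import Enum


class Direction(Enum):
    IMPORT = "import"
    EXPORT = "export"


# First command queued when a transfer job is saved.
FIRST_COMMAND = {
    Direction.IMPORT: "Do Globus transfer",
    Direction.EXPORT: "Do Alfresco export",
}

# Message a worker listener emits after finishing each command.
DONE_MESSAGE = {
    "Do Globus transfer": "Globus transfer done",
    "Do Alfresco import": "Alfresco import done",
    "Do Alfresco export": "Alfresco export done",
}

# What the status listener queues next; None means the job is complete.
NEXT_COMMAND = {
    (Direction.IMPORT, "Globus transfer done"): "Do Alfresco import",
    (Direction.IMPORT, "Alfresco import done"): None,
    (Direction.EXPORT, "Alfresco export done"): "Do Globus transfer",
    (Direction.EXPORT, "Globus transfer done"): None,
}


def trace(direction: Direction) -> list[str]:
    """Return the ordered messages exchanged for one job in this direction."""
    messages = []
    command = FIRST_COMMAND[direction]
    while command is not None:
        done = DONE_MESSAGE[command]
        messages += [command, done]
        command = NEXT_COMMAND[(direction, done)]
    return messages
```

Note that “Globus transfer done” means different things in the two directions: it triggers the Alfresco import on the way in, and it ends the job on the way out, which is why the table is keyed on direction as well as message.
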
  37. How did we do?
  38. Metrics: Multi-file* Upload/Download

      Method                    | Upload time | Upload rate          | Download time | Download rate
      Out-of-the-box            | 5 minutes   | 612 MB/min           | 6.4 minutes   | 476.6 MB/min
      Globus Alfresco Transfers | 2 minutes   | 1530 MB/min          | 3.6 minutes   | 1020 MB/min
      Improvement               | 60% faster  | 150% more throughput | 53% faster    | 114% more throughput

      *Four files totaling 3,060 MB
  39. Metrics: Single-file* Upload/Download

      Method                    | Upload time | Upload rate         | Download time     | Download rate
      Out-of-the-box            | 7.2 minutes | 616.2 MB/min        | DNF**             | DNF**
      Globus Alfresco Transfers | 3.6 minutes | 1220.4 MB/min       | 5.1 minutes       | 862.9 MB/min
      Improvement               | 50% faster  | 98% more throughput | Infinitely faster | Infinitely greater throughput

      *Single file of size 4,418 MB
      **Alfresco throws an exception at around 1 GB
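
The "Improvement" rows combine two different ratios: time saved relative to the old time, and throughput gained relative to the old throughput. A short sketch, with the function name `improvement` chosen for illustration, shows how the multi-file upload figures (5 minutes down to 2 minutes for 3,060 MB) yield both percentages:

```python
def improvement(before_minutes: float, after_minutes: float, size_mb: float) -> tuple[int, int]:
    """Return (percent faster, percent more throughput), rounded to whole numbers."""
    percent_faster = (before_minutes - after_minutes) / before_minutes * 100
    rate_before = size_mb / before_minutes  # MB per minute
    rate_after = size_mb / after_minutes
    percent_more_throughput = (rate_after - rate_before) / rate_before * 100
    return round(percent_faster), round(percent_more_throughput)


# Multi-file upload row: 5 minutes -> 2 minutes for 3,060 MB.
print(improvement(5, 2, 3060))
```

The two numbers differ because they use different baselines: cutting the time by 60% multiplies the rate by 2.5, which is 150% more throughput.
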
  40. Results
      • Transfers can now be done as “fire-and-forget” jobs
      • Any number of files, any size
      • Streamlined, purpose-built UI keeps researchers focused
      • Integrates with existing sync technology researchers like
      • Reduced transfer time by 50–60%
      • Increased transfer rate by 100–150%
  41. Futures
      • Improve download by doing a move from content store rather than a write
      • Send files to/from any Globus endpoint, including external
        – Currently transfer source/target is Globus Personal Connect on Noble workstations
      • Security hardening
      • Set metadata on multiple files during import
      • Auditing/usage reports
      • Possible new requirements
        – Scheduled/recurring transfers
        – Share integration
        – ADF integration
  42. Thank You! https://www.metaversant.com https://ecmarchitect.com @jeffpotts01
