Big Data Processing in the Cloud: A Hydra/Sufia Experience

This presentation addresses the challenge of processing big data in a cloud-based data repository. Using the Hydra Project's Hydra and Sufia Ruby gems, and working with the Hydra community, we created a dedicated repository for the project and set up background jobs. Our approach is to create the metadata with these jobs, which are distributed across multiple computing cores. This allows us to scale our infrastructure out on an as-needed basis and decouples automatic metadata creation from the response times seen by the user. While the metadata is not immediately available after ingestion, the object itself is. By distributing the jobs, we can compute complex properties without impacting the repository server. Hydra and Sufia gave us a head start by providing a simple self-deposit repository, complete with background job support via Redis and Resque.
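
As a rough illustration of the pattern described above, the sketch below shows how an expensive metadata step could be pushed into a Resque background job at ingest time. The class and method names (CharacterizeJob, GenericFile, characterize) are illustrative assumptions, not necessarily the project's actual code.

    # A minimal sketch of deferred metadata creation with Resque.
    # CharacterizeJob and GenericFile are illustrative names only.
    require 'resque'

    class CharacterizeJob
      @queue = :characterize          # Resque queue the workers listen on

      # Resque workers call self.perform with the enqueued arguments.
      def self.perform(file_id)
        file = GenericFile.find(file_id)
        file.characterize             # compute the expensive metadata here
        file.save
      end
    end

    # At ingest time the web process only enqueues the job and returns,
    # so the object is available immediately while its metadata arrives later.
    Resque.enqueue(CharacterizeJob, file.id)

Because the enqueue call is cheap, user-facing response times stay flat no matter how heavy the characterization work is; the only cost is that derived metadata shows up after a short delay.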

Big Data Processing in the Cloud: A Hydra/Sufia Experience

  1. BIG DATA PROCESSING IN THE CLOUD: A HYDRA/SUFIA EXPERIENCE Helsinki June 2014 Collin Brittle Zhiwu Xie
  2. WHO?
  3. WHAT?
  4. WHY?
  5. SENSORS
  6. SMART INFRASTRUCTURE
  7. DATA SHARING • Encourage exploratory and multidisciplinary research • Foster open and inclusive communities around • modeling of dynamic systems • structural health monitoring and damage detection • occupancy studies • sensor evaluation • data fusion • energy reduction • evacuation management • …
  8. CHARACTERIZATION • Compute intensive • Storage intensive • Communication intensive • On-demand • Scalability challenge
  9. COMPUTE INTENSIVE • About 6GB raw data per hour • Must be continuously processed, ingested, and further processed • User-generated computations • Must not interfere with data retrieval
  10. STORAGE INTENSIVE • SEB will accumulate about 60TB of raw data per year • To facilitate researchers, we must keep raw data for an extended period of time, e.g., >= 5 years • VT currently does not have an affordable storage facility to hold this much data • Within XSEDE, only TACC's Ranch can allocate this much storage
  11. COMMUNICATION INTENSIVE • What if hundreds of researchers around the world each tried to download hundreds of TB of our data?
  12. ON DEMAND • Exploratory and multidisciplinary research cannot predict the data usage beforehand
  13. SCALABILITY • How to deal with these challenges in a scalable manner?
  14. BIG DATA + CLOUD • Affordable • Elastic • Scalable
  15. FRAMEWORK REQUIREMENTS • Mix local and remote content • Support background processing • Be distributable
  16. FRAMEWORK REQUIREMENTS • Mix local and remote content • Support background processing • Be distributable
  17–18. OBJECTS AND DATASTREAMS [diagram: a local object holding metadata datastreams and a file]
  19. REMOTE STORAGE [diagram: the local repository linked to Amazon EC2, S3, and Glacier]
  20. FRAMEWORK REQUIREMENTS • Mix local and remote content • Support background processing • Be distributable
  21. BACKGROUND PROCESSING [diagram: clients, public server, database, Redis, and worker processes]
  22–32. FROM QUEUES TO THE CLOUD [animation frames: blocks of data moving from the local queue out to the cloud]
  33. QUEUEING
  34. QUEUEING
  35. FRAMEWORK REQUIREMENTS • Mix local and remote content • Support background processing • Be distributable
  36–38. FROM QUEUES TO THE CLOUD [animation frames continued]
  39. DISTRIBUTED PROCESSING [diagram: clients, public server, database, Redis master and slave, and private worker servers; a worker configuration sketch follows the transcript]
  40. SCALE UP
  41. SCALE UP
  42. WE CHOSE SUFIA
  43. WHAT IS SUFIA? • Ruby on Rails framework… • Based on Hydra… • Using Fedora Commons… • And Resque
  44. FRAMEWORK REQUIREMENTS • Mix local and remote content • Support background processing • Be distributable
  45. QUESTIONS? rotated8 (who works at) vt.edu
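
Relating to the background and distributed processing slides (21 and 39): below is a minimal sketch, under assumed host names, of how Resque workers running on separate private servers can all point at one shared Redis instance, so that jobs enqueued by the public server are processed elsewhere. This is not the project's actual configuration.

    # config/initializers/resque.rb
    # Point every Rails process and every worker at the same Redis server
    # so jobs enqueued on the public server can run on private servers.
    # The host name below is an assumption, not the real deployment value.
    require 'resque'
    require 'redis'

    Resque.redis = Redis.new(host: 'redis-master.internal', port: 6379)

Each private server can then start one worker per core with the standard Resque rake task, for example QUEUE=characterize rake environment resque:work, while the public server only enqueues jobs and keeps serving requests.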
