Adrian Cole / Cloudsoft. Big Blobs: moving big data in and out of the cloud. Wednesday, November 2, 2011

Java Tech & Tools | Big Blobs: Moving Big Data In and Out of the Cloud | Adrian Cole

  1. Adrian Cole / Cloudsoft. Big Blobs: moving big data in and out of the cloud
  2. Adrian Cole (@jclouds)
      • founded jclouds March 2009
      • chief evangelist at Cloudsoft
  3. Agenda
      • intro to jclouds blobstore
      • Omixon case study
      • awkward silence (or Q/A)
  4. Portable APIs: BlobStore, LoadBalancer, Compute, Table; Provider-Specific Hooks; Embeddable; Over 30 Tested Providers!
  5. Who’s integrating?
  6. Blob Storage
      • global name space
      • key, value with metadata
      • sites on demand
      • unlimited size
  7. Blob Storage: Set<String> containers = namespacesInMyAccount; Map<String, InputStream> keyValues = contentsOfContainer
  8. Blob Storage (diagram labels): adrian@googlestorage; Love Letters; Movies; Tron; putBlob; The One; Shrek; Goonies; The Blob; 3d = true; url = http://disney.go.com/tron
  9. java overview (github jclouds/jclouds)

          // init
          BlobStoreContext context = new BlobStoreContextFactory().createContext("s3", accesskeyid, secret);
          BlobStore blobStore = context.getBlobStore();
          // create container
          blobStore.createContainerInLocation(null, "adriansmovies");
          // add blob
          Blob blob = blobStore.blobBuilder("sushi.avi").payload(file).build();
          blobStore.putBlob("adriansmovies", blob);

  10. clojure overview (github jclouds/jclouds)

          (use 'org.jclouds.blobstore2)
          (def *blobstore* (blobstore "azureblob" account key))
          (create-container *blobstore* "movies")
          (put-blob *blobstore* "movies" (blob "tron.mp4" :payload tron-file))

  11. Big data pipelines with Scale-out on the cloud; @tiborkisstibor
  12. bioinformatic pipelines
      • Usually requires high CPU
      • Continuously increasing data volumes
      • Complex algorithms on top of large datasets
  13. bioinformatics SaaS
  14. challenges of SaaS building
      • Hadoop cluster startup/shutdown
          - Cluster starting problems
          - Automatic cluster shutdown strategies
      • Hadoop cluster monitoring on the cloud
          - System monitoring
          - Consumption based monitoring
      • Data transfer paths
          - AWS Import -> S3 -> hdfs -> S3 -> AWS Export
          - ACL settings for clients buckets
          - S3 <=> hdfs transfers
  15. where did we start?
      • 30GB file @max 16MB/s upload to S3: 32 minutes
      • 1TB file @max 16MB/s upload to S3: 18.2 hours
  16. where did we end up?
      • 30GB file @max 100MB/s upload to S3: 5 minutes (was 32)
      • 1TB file @max 100MB/s upload to S3: 2.9 hours (was 18.2)
  17. How did we get there?
      • Add multi-part upload support
      • Optimize slicing
      • Optimize parallel upload strategy
      • Find big guns
  18. Multi-Part upload
      • Large Blobs cannot be sent in a single request in most BlobStores (ex. 5GB max in S3).
      • Large X-fers are likely to fail at inconvenient positions, and without resume.
      • Multi-part uploads allow you to send slices of a payload, which the server assembles later.
  19. Slicing
      • Each upload part must advance to the appropriate position in the source payload efficiently.
      • Payload slice(Payload input, long offset, long length);
      • ex. NettyPayloadSlicer uses ChunkedFileInputStream
  20. Slicing Algorithm
      • A Blob can be sliced into a maximum number of parts, and these parts have min and max sizes.
      • up to 3.2GB, converge 32M parts
      • then increase part size approaching max (5GB)
      • then continue at max part size or overflow
  21. Upload Strategy: start sequential, stabilize, then parallelize
      • SequentialMultipartUploadStrategy: simpler, less likely to fail, easier to retry, little to optimize outside chunk size
      • ParallelMultipartUploadStrategy: much better throughput, but need to optimize degree, retries & error handling
  22. (no text on this slide)
  23. What’s the top-speed?
  24. Is this as good as it gets?
      • 10GigE should be able to do 1280MB/s
      • cc1.4xlarge has been measured up to ~560MB/s local
      • but we’re only getting ~100MB/s sustained
  25. So, where do we go now?
      • zero copy transfer
      • more work on slice algorithms
      • tools and integrations (ex. hdfs)
      • add implementations for other blobstores
  26. Wanna play?

          blobStore.putBlob("movies", blob, multipart());
          (put-blob *blobstore* "movies" blob :multipart? true)

      or just visit github jclouds-examples: blobstore-largeblob, blobstore-hdfs
  27. Questions? github jclouds-examples; @jclouds; @tiborkisstibor; adrian@cloudsoftcorp.com
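
The slicing ideas on slides 19 and 20 (each part is an offset/length window into the source, with a bounded number of parts and a bounded part size) can be sketched roughly as follows. This is a simplified illustration, not the jclouds slicing algorithm: the class `SlicePlanner`, its method names, and the rule "start at a default part size and grow it only when the part count would overflow" are assumptions made for the example, and the limits are passed in as parameters rather than taken from any provider.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plans (offset, length) slices for a multi-part upload. */
class SlicePlanner {
    /** One slice of the source payload. */
    static final class Slice {
        final long offset;
        final long length;

        Slice(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Picks a part size: start at defaultPartSize, grow it only when the blob
     * would otherwise need more than maxParts parts, and fail on overflow.
     */
    static long partSize(long blobLength, long defaultPartSize, long maxPartSize, int maxParts) {
        long size = defaultPartSize;
        if (blobLength > size * maxParts) {
            // Ceiling division, so that maxParts parts cover the whole blob.
            size = (blobLength + maxParts - 1) / maxParts;
        }
        if (size > maxPartSize) {
            throw new IllegalArgumentException("blob too large for " + maxParts + " parts");
        }
        return size;
    }

    /** Splits [0, blobLength) into consecutive slices; the last one may be shorter. */
    static List<Slice> plan(long blobLength, long partSize) {
        List<Slice> slices = new ArrayList<>();
        for (long offset = 0; offset < blobLength; offset += partSize) {
            slices.add(new Slice(offset, Math.min(partSize, blobLength - offset)));
        }
        return slices;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;
        long length = 30 * gb; // the 30GB file from the case study
        // Assumed limits for the demo: 32MB default part, 5GB max part, 10,000 parts.
        long size = partSize(length, 32 * mb, 5 * gb, 10_000);
        List<Slice> slices = plan(length, size);
        System.out.println(slices.size() + " parts of up to " + size / mb + "MB");
    }
}
```

With the 30GB file from slide 15 and a 32MB default part size, the plan comes out at 960 parts. Each `Slice` could then be handed to something like the `slice(input, offset, length)` call shown on slide 19.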

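The two strategies named on slide 21 differ mainly in how parts are dispatched. The sketch below is a minimal illustration under stated assumptions: `PartUploader` is a hypothetical stand-in for whatever sends one part to the blobstore, failures are modeled as unchecked exceptions, and `uploadSequential` / `uploadParallel` are illustrative names, not the jclouds `SequentialMultipartUploadStrategy` or `ParallelMultipartUploadStrategy` classes. The parallel variant bounds the degree of parallelism with a fixed thread pool and retries each part a limited number of times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: two ways to dispatch the parts of a multi-part upload. */
class UploadStrategies {
    /** Hypothetical stand-in for "send part N to the blobstore"; returns its ETag. */
    interface PartUploader {
        String upload(int partNumber);
    }

    /** Uploads one part, retrying up to maxAttempts times before giving up. */
    static String uploadWithRetry(PartUploader uploader, int partNumber, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return uploader.upload(partNumber);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    /** Sequential: one part at a time, in order. */
    static List<String> uploadSequential(PartUploader uploader, int parts, int maxAttempts) {
        List<String> etags = new ArrayList<>();
        for (int part = 1; part <= parts; part++) {
            etags.add(uploadWithRetry(uploader, part, maxAttempts));
        }
        return etags;
    }

    /** Parallel: at most `degree` parts in flight; results are collected in part order. */
    static List<String> uploadParallel(PartUploader uploader, int parts, int degree, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(degree);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int part = 1; part <= parts; part++) {
                final int partNumber = part;
                Callable<String> task = () -> uploadWithRetry(uploader, partNumber, maxAttempts);
                pending.add(pool.submit(task));
            }
            List<String> etags = new ArrayList<>();
            for (Future<String> f : pending) {
                etags.add(f.get()); // blocks until that part has finished
            }
            return etags;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while uploading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("part upload failed", e.getCause());
        } finally {
            pool.shutdownNow(); // stop any parts still running after a failure
        }
    }

    public static void main(String[] args) {
        PartUploader fake = partNumber -> "etag-" + partNumber; // pretend upload
        System.out.println(uploadSequential(fake, 4, 3));
        System.out.println(uploadParallel(fake, 4, 2, 3));
    }
}
```

The knobs the slide says need tuning show up here as plain parameters: `degree` for how many parts are in flight and `maxAttempts` for retries. Error handling is deliberately crude, since the first failed part aborts the whole upload.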