Java Tech & Tools | Big Blobs: Moving Big Data In and Out of the Cloud | Adrian Cole

Transcript

  • 1. Adrian Cole / Cloudsoft. Big Blobs: moving big data in and out of the cloud
  • 2. Adrian Cole (@jclouds): founded jclouds in March 2009; chief evangelist at Cloudsoft
  • 3. Agenda • intro to jclouds blobstore • Omixon case study • awkward silence (or Q/A)
  • 4. Portable APIs: BlobStore, LoadBalancer, Compute, Table. Provider-specific hooks. Embeddable. Over 30 tested providers!
  • 5. Who's integrating?
  • 6. Blob Storage: global namespace; key/value with metadata; sites on demand; unlimited size
  • 7. Blob Storage, conceptually: Set<String> containers = namespacesInMyAccount; Map<String, InputStream> keyValues = contentsOfContainer;
  • 8. Blob Storage [diagram: account adrian@googlestorage contains the containers "Love Letters" and "Movies"; putBlob adds blobs such as Tron, The One, Shrek, Goonies, The Blob; the Tron blob carries metadata 3d = true, url = http://disney.go.com/tron]
  • 9. Java overview (github jclouds/jclouds):

        // init
        context = new BlobStoreContextFactory().createContext("s3", accesskeyid, secret);
        blobStore = context.getBlobStore();

        // create container
        blobStore.createContainerInLocation(null, "adriansmovies");

        // add blob
        blob = blobStore.blobBuilder("sushi.avi").payload(file).build();
        blobStore.putBlob("adriansmovies", blob);
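    Reading a blob back is symmetric. A minimal sketch of the read side, assuming the same jclouds 1.x API as the slide (accesskeyid and secret are placeholders, as above; types come from org.jclouds.blobstore and org.jclouds.blobstore.domain):

        BlobStoreContext context = new BlobStoreContextFactory().createContext("s3", accesskeyid, secret);
        BlobStore blobStore = context.getBlobStore();
        try {
            // fetch the blob and stream its contents
            Blob blob = blobStore.getBlob("adriansmovies", "sushi.avi");
            InputStream in = blob.getPayload().getInput();
            // ... consume the stream ...
        } finally {
            context.close(); // release connections
        }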
  • 10. Clojure overview (github jclouds/jclouds):

        (use 'org.jclouds.blobstore2)
        (def *blobstore* (blobstore "azureblob" account key))
        (create-container *blobstore* "movies")
        (put-blob *blobstore* "movies" (blob "tron.mp4" :payload tron-file))
  • 11. Big data pipelines with scale-out on the cloud (@tiborkisstibor)
  • 12. Bioinformatics pipelines: usually require high CPU; continuously increasing data volumes; complex algorithms on top of large datasets
  • 13. Bioinformatics SaaS
  • 14. Challenges of building the SaaS: Hadoop cluster startup/shutdown (cluster starting problems, automatic cluster shutdown strategies); Hadoop cluster monitoring on the cloud (system monitoring, consumption-based monitoring); data transfer paths (AWS Import -> S3 -> HDFS -> S3 -> AWS Export); ACL settings for clients' buckets; S3 <=> HDFS transfers
  • 15. Where did we start? 30GB file @ max 16MB/s upload to S3: 32 minutes; 1TB file @ max 16MB/s upload to S3: 18.2 hours
  • 16. Where did we end up? 30GB file @ max 100MB/s upload to S3: 5 minutes (down from 32); 1TB file @ max 100MB/s upload to S3: 2.9 hours (down from 18.2)
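    The before/after figures are easy to verify. A quick sanity check of the arithmetic (binary units assumed; this check is ours, not from the deck):

        // transfer time = size / throughput, with 1 GB = 1024 MB and 1 TB = 1024 GB
        long GB = 1024;                  // MB per GB
        long TB = 1024 * GB;             // MB per TB
        System.out.printf("30GB @ 16MB/s:  %.0f min%n", 30 * GB / 16.0 / 60);    // ~32 min
        System.out.printf("30GB @ 100MB/s: %.1f min%n", 30 * GB / 100.0 / 60);   // ~5.1 min
        System.out.printf("1TB @ 16MB/s:   %.1f h%n", (double) TB / 16 / 3600);  // ~18.2 h
        System.out.printf("1TB @ 100MB/s:  %.1f h%n", (double) TB / 100 / 3600); // ~2.9 h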
  • 17. How did we get there? Add multipart upload support; optimize slicing; optimize the parallel upload strategy; find big guns
  • 18. Multi-part upload: large blobs cannot be sent in a single request in most BlobStores (ex. 5GB max in S3). Large transfers are likely to fail at inconvenient positions, and without resume. Multi-part uploads let you send slices of a payload, which the server assembles later.
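    The protocol behind this has the same shape across providers. A hypothetical interface (names are ours; it mirrors S3's initiate/upload/complete flow, not the jclouds internals):

        // generic three-phase multipart protocol (Map is java.util.Map)
        interface MultipartApi {
            String initiateUpload(String container, String blobName);             // returns an uploadId
            String uploadPart(String uploadId, int partNumber, byte[] partData);  // returns a part ETag
            void completeUpload(String uploadId, Map<Integer, String> partEtags); // server assembles the parts
        }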
  • 19. Slicing: each upload part must advance to the appropriate position in the source payload efficiently. Payload slice(Payload input, long offset, long length); ex. NettyPayloadSlicer uses ChunkedFileInputStream
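    The point of the slice signature is that a part should start reading at its offset without scanning or copying the bytes before it. A minimal file-based illustration (a hypothetical helper, not the Netty-based slicer the slide names):

        // seek straight to the part boundary, then read exactly `length` bytes
        static byte[] readSlice(File file, long offset, int length) throws IOException {
            byte[] part = new byte[length];
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                raf.seek(offset);     // O(1) positioning; no bytes before `offset` are read
                raf.readFully(part);
            }
            return part;
        }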
  • 20. Slicing algorithm: a blob can be sliced into a maximum number of parts, and those parts have min and max sizes. Up to 3.2GB, converge on 32MB parts; then increase part size, approaching the max (5GB); then continue at max part size, or overflow.
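    One way to encode such a policy, shown below; the thresholds mirror the slide, but the rounding and the target part count are our assumptions, not the shipped jclouds algorithm:

        // choose a part size: fixed 32MB parts for small blobs, growing parts
        // (capped at 5GB) once the blob passes the threshold
        static long choosePartSize(long blobSize) {
            final long MB = 1024L * 1024;
            final long GB = 1024 * MB;
            final long DEFAULT_PART = 32 * MB;   // the 32MB parts from the slide
            final long MAX_PART = 5 * GB;        // provider cap, e.g. S3
            if (blobSize <= 3200L * MB) {        // "up to 3.2GB": stay at 32MB parts
                return DEFAULT_PART;
            }
            long parts = 100;                    // assumed target part count past the threshold
            long grown = (blobSize + parts - 1) / parts;  // grow part size instead of part count
            return Math.min(grown, MAX_PART);    // at the cap, part count grows again (or overflows)
        }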
  • 21. Upload strategy: start sequential, stabilize, then parallelize. SequentialMultipartUploadStrategy: simpler, less likely to fail, easier to retry, little to optimize outside chunk size. ParallelMultipartUploadStrategy: much better throughput, but you need to optimize degree of parallelism, retries & error handling.
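    What the parallel strategy amounts to, sketched with a plain ExecutorService (a hedged outline reusing the hypothetical uploadPart/readSlice helpers above; the real ParallelMultipartUploadStrategy differs in detail):

        // one task per part, a bounded pool for the degree of parallelism, per-part retries
        static List<String> uploadAllParts(String uploadId, File file, long partSize,
                                           int partCount, ExecutorService pool) throws Exception {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 1; i <= partCount; i++) {
                final int part = i;
                futures.add(pool.submit(() -> {
                    IOException last = null;
                    for (int attempt = 0; attempt < 3; attempt++) {  // retry each part independently
                        try {
                            long offset = (part - 1) * partSize;
                            int length = (int) Math.min(partSize, file.length() - offset); // last part may be short
                            return uploadPart(uploadId, part, readSlice(file, offset, length));
                        } catch (IOException e) { last = e; }
                    }
                    throw last;                                      // this part failed after 3 tries
                }));
            }
            List<String> etags = new ArrayList<>();
            for (Future<String> f : futures) etags.add(f.get());     // block; rethrows any part failure
            return etags;
        }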
  • 22. [image-only slide]
  • 23. What's the top speed?
  • 24. Is this as good as it gets? 10GigE should be able to do 1280MB/s (10 Gbit/s / 8 bits per byte); a cc1.4xlarge has been measured at up to ~560MB/s locally; but we're only getting ~100MB/s sustained
  • 25. So, where do we go now? Zero-copy transfer; more work on slice algorithms; tools and integrations (ex. HDFS); add implementations for other blobstores
  • 26. Wanna play?

        blobStore.putBlob("movies", blob, multipart());

        (put-blob *blobstore* "movies" blob :multipart? true)

    or just visit github jclouds-examples: blobstore-largeblob, blobstore-hdfs
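    The Java one-liner above slots into the same setup as slide 9. A sketch of the full wiring, assuming the jclouds 1.x PutOptions.Builder.multipart() static import (credentials are placeholders):

        import static org.jclouds.blobstore.options.PutOptions.Builder.multipart;

        BlobStoreContext context = new BlobStoreContextFactory().createContext("s3", accesskeyid, secret);
        BlobStore blobStore = context.getBlobStore();
        Blob blob = blobStore.blobBuilder("tron.mp4").payload(new File("tron.mp4")).build();
        blobStore.putBlob("movies", blob, multipart()); // jclouds handles slicing and the upload strategy
        context.close();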
  • 27. Questions? github jclouds-examples • @jclouds • @tiborkisstibor • adrian@cloudsoftcorp.com