Java Tech & Tools | Big Blobs: Moving Big Data In and Out of the Cloud | Adrian Cole

Transcript

  • 1. Adrian Cole / Cloudsoft. Big Blobs: moving big data in and out of the cloud
  • 2. Adrian Cole (@jclouds), founded jclouds March 2009, chief evangelist at Cloudsoft
  • 3. Agenda: intro to jclouds blobstore; Omixon case study; awkward silence (or Q&A)
  • 4. Portable APIs: BlobStore, LoadBalancer, Compute, Table. Provider-specific hooks. Embeddable. Over 30 tested providers!
  • 5. Who’s integrating?
  • 6. Blob Storage: global name space; key, value with metadata; sites on demand; unlimited size
  • 7. Blob Storage, conceptually: Set<String> containers = namespacesInMyAccount; Map<String, InputStream> keyValues = contentsOfContainer
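    The Set/Map picture above maps directly onto the portable API. A minimal Java sketch of reading it back, assuming a blobStore initialized as in the java overview on slide 9 (the container and key names are the ones that slide uses):

        import org.jclouds.blobstore.domain.Blob;
        import org.jclouds.blobstore.domain.StorageMetadata;
        import java.io.InputStream;

        // the "Set<String> containers" half: enumerate namespaces in the account
        for (StorageMetadata container : blobStore.list()) {
            System.out.println("container: " + container.getName());
        }
        // the "Map<String, InputStream>" half: fetch one value by key
        Blob blob = blobStore.getBlob("adriansmovies", "sushi.avi");
        InputStream data = blob.getPayload().getInput();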
  • 8. Blob Storage (diagram): the account adrian@googlestorage holds containers Love Letters and Movies; putBlob adds to Movies, which contains Tron, The One, Shrek, Goonies, and The Blob; the Tron blob carries metadata 3d = true, url = http://disney.go.com/tron
  • 9. java overview (github jclouds/jclouds)

        // init
        context = new BlobStoreContextFactory().createContext("s3", accesskeyid, secret);
        blobStore = context.getBlobStore();
        // create container
        blobStore.createContainerInLocation(null, "adriansmovies");
        // add blob
        blob = blobStore.blobBuilder("sushi.avi").payload(file).build();
        blobStore.putBlob("adriansmovies", blob);
  • 10. clojure overview (github jclouds/jclouds)

        (use 'org.jclouds.blobstore2)
        (def *blobstore* (blobstore "azureblob" account key))
        (create-container *blobstore* "movies")
        (put-blob *blobstore* "movies" (blob "tron.mp4" :payload tron-file))
  • 11. Big data pipelines with scale-out on the cloud (@tiborkisstibor)
  • 12. bioinformatic pipelines: usually require high CPU; continuously increasing data volumes; complex algorithms on top of large datasets
  • 13. bioinformatics SaaS
  • 14. challenges of building SaaS: Hadoop cluster startup/shutdown (cluster starting problems, automatic cluster shutdown strategies); Hadoop cluster monitoring on the cloud (system monitoring, consumption-based monitoring); data transfer paths (AWS Import -> S3 -> HDFS -> S3 -> AWS Export); ACL settings for clients’ buckets; S3 <=> HDFS transfers (one hop sketched below)
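    Of the transfer paths above, the S3 <=> HDFS hops can be scripted against Hadoop's FileSystem API. A hedged sketch of one S3 -> HDFS copy; the bucket, key, and HDFS path are hypothetical, and a production pipeline would more likely reach for distcp for bulk moves:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileUtil;
        import org.apache.hadoop.fs.Path;

        Configuration conf = new Configuration();                // picks up fs.s3n.* credentials
        Path src = new Path("s3n://example-input/sample.fastq"); // hypothetical bucket/key
        Path dst = new Path("hdfs:///input/sample.fastq");       // hypothetical HDFS path
        // copy one object into the cluster without deleting the source
        FileUtil.copy(src.getFileSystem(conf), src,
                      dst.getFileSystem(conf), dst,
                      false, conf);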
  • 15. where did we start? 30GB file @ max 16MB/s upload to S3: 32 minutes. 1TB file @ max 16MB/s upload to S3: 18.2 hours
  • 16. where did we end up? 30GB file @ max 100MB/s upload to S3: 5 minutes (down from 32). 1TB file @ max 100MB/s upload to S3: 2.9 hours (down from 18.2)
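    Those figures are plain bandwidth arithmetic: the 2.9-hour number falls out of 2^40 bytes / (100 x 2^20 bytes/s) ≈ 10,486 s ≈ 2.9 hours, and the 5-minute number from 30 x 1024 MB / (100 MB/s) ≈ 307 s.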
  • 17. How did we get there? Add multi-part upload support; optimize slicing; optimize the parallel upload strategy; find big guns
  • 18. Multi-part upload: large blobs cannot be sent in a single request in most BlobStores (ex. 5GB max in S3). Large transfers are likely to fail at inconvenient positions, and without resume. Multi-part uploads allow you to send slices of a payload, which the server assembles later.
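    Slide 26 shows the one-liner that turns this on in jclouds; spelled out here with its import, and with placeholder container/key/file names:

        import static org.jclouds.blobstore.options.PutOptions.Builder.multipart;
        import org.jclouds.blobstore.domain.Blob;
        import java.io.File;

        // build a blob around a large local file and let jclouds slice it;
        // the provider reassembles the parts server-side once the last one lands
        Blob big = blobStore.blobBuilder("genome.bam")       // placeholder key
                            .payload(new File("genome.bam")) // placeholder file
                            .build();
        blobStore.putBlob("adriansmovies", big, multipart());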
  • 19. Slicing: each upload part must advance to the appropriate position in the source payload efficiently. Payload slice(Payload input, long offset, long length); ex. NettyPayloadSlicer uses ChunkedFileInputStream
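    A sketch of the loop a slicer enables; Payload and the slice signature come from the slide, while blobLength, slicer, and uploadPart are stand-ins for the surrounding upload machinery:

        long partSize = 32L << 20; // 32MB, the part size slide 20 converges on
        int partNumber = 1;
        for (long offset = 0; offset < blobLength; offset += partSize) {
            long length = Math.min(partSize, blobLength - offset);
            // the slicer must seek straight to `offset` (e.g. via a chunked
            // file stream) instead of re-reading the payload from the start
            Payload part = slicer.slice(payload, offset, length);
            uploadPart(container, key, uploadId, partNumber++, part); // stand-in provider call
        }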
  • 20. Slicing algorithm: a blob can be sliced into a maximum number of parts, and the parts have min and max sizes. Up to 3.2GB, converge on 32MB parts; then increase the part size, approaching the max (5GB); then continue at the max part size, or overflow.
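    One hedged reading of that algorithm as code: a 100-part cap reproduces the slide's thresholds (32MB x 100 parts = 3.2GB; 5GB x 100 parts = 500GB before "overflow"), though the real jclouds constants may differ:

        static final long MAX_PART  = 5L << 30;  // 5GB: S3's max part size, per slide 18
        static final long BASE_PART = 32L << 20; // 32MB: the size parts converge on
        static final long MAX_PARTS = 100;       // assumed cap implied by the 3.2GB figure

        static long choosePartSize(long blobLength) {
            long size = BASE_PART;
            // past 3.2GB, grow the part size so the part count stays under the cap
            while (blobLength / size > MAX_PARTS && size < MAX_PART) {
                size *= 2;
            }
            // past 500GB we simply continue at the max part size (the overflow case)
            return Math.min(size, MAX_PART);
        }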
  • 21. Upload strategy: start sequential, stabilize, then parallelize. SequentialMultipartUploadStrategy: simpler, less likely to fail, easier to retry, little to optimize outside chunk size. ParallelMultipartUploadStrategy: much better throughput, but you need to optimize the degree of parallelism, retries & error handling.
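    A minimal sketch of the parallel strategy's shape; the pool size is the "degree" knob the slide names, and uploadPartWithRetry is a placeholder for per-part retry and error handling:

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.*;

        ExecutorService pool = Executors.newFixedThreadPool(8); // degree of parallelism to tune
        List<Future<String>> etags = new ArrayList<>();
        for (int i = 1; i <= numParts; i++) {
            final int partNumber = i;
            // each task slices and sends one part, retrying independently on failure
            etags.add(pool.submit(() -> uploadPartWithRetry(partNumber)));
        }
        for (Future<String> etag : etags) {
            etag.get(); // wait for completion; rethrows the first part failure
        }
        pool.shutdown();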
  • 22. (image-only slide)
  • 23. What’s the top speed?
  • 24. Is this as good as it gets? 10GigE should be able to do 1280MB/s; cc1.4xlarge has been measured at up to ~560MB/s locally; but we’re only getting ~100MB/s sustained
  • 25. So, where do we go now? Zero-copy transfer; more work on slice algorithms; tools and integrations (ex. HDFS); add implementations for other blobstores
  • 26. Wanna play?

        blobStore.putBlob("movies", blob, multipart());
        (put-blob *blobstore* "movies" blob :multipart? true)

    or just visit github jclouds-examples: blobstore-largeblob, blobstore-hdfs
  • 27. Questions? github jclouds-examples, @jclouds, @tiborkisstibor, adrian@cloudsoftcorp.com