
(STG406) Using S3 to Build and Scale an Unlimited Storage Service


Amazon Cloud Drive's plan to provide a low-cost, unlimited storage service presented a major engineering challenge. In this session, the lead engineers share how the Amazon Cloud Drive team designed and optimized the storage back end, built on Amazon S3, to handle millions of users while containing infrastructure costs, including how they built the service for massive scale, the regular steps they take to increase performance and efficiency, and proven techniques for scaling and optimization learned from experience.



  1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Using Amazon S3 to Build and Scale an Unlimited Storage Service for Millions of Consumers. Tarlochan Cheema, Kevin Christen. October 2015. STG406
  2. What to expect from the session • What is Amazon Cloud Drive? • Key challenges • Service design & architecture • Content store deep dive • Lessons learned
  3. What Is Amazon Cloud Drive? • Unlimited cloud storage from Amazon for consumers • Subscription-based storage plans: Unlimited Photos (unlimited photo storage, plus 5 GB for videos and files, for $11.99 per year); Unlimited Everything (securely store all of your photos, videos, files, and documents for $59.99 per year)
  4. How do I use it from anywhere and any device? Amazon apps for photos and files: mobile, computer (Mac and PC), web
  5. What’s in it for developers & partners? • Reach millions of customers • RESTful APIs • Android & iOS SDKs • Revenue sharing
  6. A growing partner ecosystem • Access to millions of Amazon customers • Revenue sharing for developers and partners (new!)
  7. Key challenges? • Unlimited storage • Millions of users • Billions of files • Variety of content (photos/videos/docs) • Variety of metadata • Flexible indexing & querying • Terabytes of logs
  8. Key design goals? • Highly scalable • Durable • Reliable • RESTful • Low latency • Near real-time queries • Consistency • Idempotency • Low cost
  9. Amazon Cloud Drive service architecture (diagram): users and apps call the Amazon Cloud Drive service (Amazon EC2 behind Amazon ELB), which uses a content store (Amazon S3) and a metadata store (Amazon DynamoDB); an asynchronous pipeline (Amazon Kinesis stream, message queue) feeds indexing & query, analytics, notifications, and content processing (Amazon Elastic Transcoder)
  10. What does Cloud Drive store in Amazon S3? • Customer content • Derived content • Transcoded videos • Thumbnails of videos, documents • Log files • Dynamic configuration • DynamoDB backups • Using the publicly available AWS Java SDK
  11. Storing customer content • Single Amazon S3 bucket per geographical region • Billions of objects per content bucket • Randomly generated keys • Keys are stored in Amazon DynamoDB • Avoids hot key prefixes • No list operations • Amazon S3 server-side encryption • AES-256
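The key scheme on this slide can be sketched as follows. This is a minimal illustration in Python (the service itself uses the AWS Java SDK); the function names and the dict standing in for the DynamoDB table are hypothetical, and the 16-byte key length is an assumption:

```python
import secrets

# Stand-in for the DynamoDB metadata table: node id -> S3 key.
metadata_table = {}

def new_content_key() -> str:
    """Uniformly random key, so writes spread evenly across S3 key space
    (no hot prefixes). 16 random bytes -> 32 hex characters."""
    return secrets.token_hex(16)

def store_file(node_id: str) -> str:
    """Assign a random S3 key to a file and record the mapping
    (the real service writes this mapping to DynamoDB)."""
    key = new_content_key()
    metadata_table[node_id] = key
    return key
```

Because the mapping lives in DynamoDB, the service never needs S3 list operations to find a customer's objects, which is what makes fully random keys practical.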
  12. Managing log files • Cloud Drive consists of 800+ servers in 3 AWS regions • More during peak load times • 200 GB+ of logs per hour • Delivered to Timber log archiving service • Timber encrypts and stores in Amazon S3
  13. Log file types • Application logs • Time-stamped and severity-tagged messages • Service logs • Amazon-wide standard format • Record per service invocation • Source for metrics • Wire logs
  14. Log files • All logs archived in Amazon S3 by Timber
  15. Log files • Service logs processed into Amazon Redshift load files
  16. Log files • Amazon Redshift COPY command loads files into the data warehouse in parallel
  17. Coordinating dynamic configuration • Dynamic values like feature toggles • Enable feature for test customers • Dial capabilities up from 0% -> 100% • Configuration files stored in S3 • Servers poll for changes using HTTP HEAD (GetObjectMetadata) • File is reloaded only if ETag has changed
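The ETag-based polling on this slide reduces to a small conditional-reload step. A minimal sketch, with injected `fetch_etag` and `load_config` callables standing in for the S3 HEAD request and the config parser (names are hypothetical):

```python
def poll_config(fetch_etag, load_config, last_etag):
    """One polling cycle: HEAD the config object (fetch_etag) and reload
    only when the ETag differs from the one seen last time.
    Returns the ETag to remember for the next cycle."""
    etag = fetch_etag()
    if etag != last_etag:
        load_config()   # real service re-downloads and parses the S3 object
        return etag
    return last_etag
```

Polling a HEAD request is cheap enough to run frequently on every server, while the full GET happens only on actual change.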
  18. Challenge 1/6: Upload size variation • Uploads vary widely in size • Text files to VM images • Even images vary from 10K GIFs to 20MB RAW • Maintain reasonable performance for all file sizes • Prevent large files from causing resource starvation
  19. Challenge 1/6: Upload size variation • Solution: Size-aware upload logic • Size < 15MB: PUT object • Upload performed by the request thread • Size larger or unknown: multipart upload API • Parts uploaded by a thread pool with a blocking queue in front • Fixed-size 5MB parts • 50GB file size limit, due to 10,000-part limit for multipart API
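The decision logic above can be sketched directly from the slide's numbers (15 MB single-PUT threshold, fixed 5 MB parts, S3's 10,000-part multipart cap). Function names are hypothetical; the actual PUT/multipart calls are omitted:

```python
SINGLE_PUT_LIMIT = 15 * 1024 * 1024   # below this, a single PUT Object
PART_SIZE = 5 * 1024 * 1024           # fixed 5 MB multipart parts
MAX_PARTS = 10_000                    # S3 multipart upload part limit

def choose_upload_strategy(size):
    """size is None when the client did not send a Content-Length."""
    if size is not None and size < SINGLE_PUT_LIMIT:
        return "put_object"           # done on the request thread
    return "multipart"                # handed to the upload thread pool

def part_count(size):
    """Number of 5 MB parts needed; the 10,000-part cap is what yields
    the ~50 GB file size limit mentioned on the slide."""
    n = -(-size // PART_SIZE)         # ceiling division
    if n > MAX_PARTS:
        raise ValueError("file exceeds multipart part limit")
    return n
```

The unknown-size case must go multipart, since a single PUT requires knowing the length up front.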
  20. Challenge 2/6: Rapid upload availability • Content should be available as soon as possible • But some content processing takes time • Solution: a mix of synchronous, asynchronous, and optimistic synchronous processing
  21. Challenge 2/6: Rapid upload availability • Metadata extraction from images and videos • Quick • Largely independent of file size • Handled synchronously
  22. Challenge 2/6: Rapid upload availability • Video transcoding • Necessary for playback on different devices • Time consuming and size dependent • We use the Amazon Elastic Transcoder service • Handled asynchronously
  23. Challenge 2/6: Rapid upload availability • Document transformation to PDF • Timing is unpredictable • Try synchronous with a timeout • If timeout, queue an SQS message for async processing • Handled optimistic-synchronously
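The "optimistic synchronous" pattern on this slide (try inline, fall back to a queue on timeout) can be sketched with a thread pool; the list standing in for the SQS queue and all names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

_pool = ThreadPoolExecutor(max_workers=4)
async_queue = []   # stand-in for the SQS queue

def transform(doc_id, convert, timeout_s):
    """Attempt the conversion synchronously; if it exceeds timeout_s,
    hand the document off for asynchronous processing instead."""
    future = _pool.submit(convert, doc_id)
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        async_queue.append(doc_id)   # real service sends an SQS message
        return None                  # caller reports "processing" to client
```

Fast documents get their PDF immediately; slow ones cost only the timeout before falling back, so the upload response stays bounded.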
  24. Challenge 3/6: Intermittent connections • Clients may have slow and intermittent connections to our service • Especially mobile devices • This makes uploading a large file in a single HTTP request difficult • But multipart upload APIs are complex for clients • Even for the happy path • Solution: Resumable uploads
  25. Challenge 3/6: Intermittent connections • Client attempts large upload • If it fails mid-stream, Cloud Drive saves the transmitted bytes • Leveraging existing Amazon S3 multipart upload • Client queries for resumption point • Client resumes upload • HTTP Content-Range header • Cloud Drive completes multipart upload
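Computing the resumption point from parts already saved via S3 multipart upload can be sketched as below. This assumes fixed 5 MB parts (per slide 19) and that only a contiguous prefix of parts counts as safely resumable; the function name is hypothetical:

```python
PART_SIZE = 5 * 1024 * 1024   # fixed 5 MB parts, as in the upload path

def resumption_offset(saved_part_numbers):
    """Given the set of 1-based part numbers already stored in the
    multipart upload, return the byte offset the client should resume
    from (used to build its Content-Range header). Only the contiguous
    prefix of parts counts; a gap forces re-upload from the gap."""
    n = 0
    while (n + 1) in saved_part_numbers:
        n += 1
    return n * PART_SIZE
```

The client asks the service for this offset, then continues the upload with `Content-Range: bytes <offset>-...`, and the service appends further parts before completing the multipart upload.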
  26. Challenge 3/6: Intermittent connections • Problem: Can’t use instance profile credentials from different instances for a single multipart upload
  27. Challenge 3/6: Intermittent connections • We used the AWS Security Token Service (STS) to provide consistent credentials for each step of the upload • Amazon S3 presigned URLs are another option
  28. Challenge 4/6: Download size variation • Like uploads, downloads vary widely in size • Maintain reasonable performance for all file sizes • Prevent large requests from causing resource starvation • Solution: Size-aware download logic
  29. Challenge 4/6: Download size variation • Small downloads (<5MB) • Single GET object • In the request thread • Retry once on failure • This covers 90% of our customers’ files
  30. Challenge 4/6: Download size variation • Large downloads • Custom parallel download logic for large files • 5MB part size (range requests) • Dedicated thread pool with blocking queue to avoid affecting uploads and small-file downloads • Connection reuse • Single retry on failure or timeout • Uses Apache HttpClient
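Splitting a large object into 5 MB range requests, as described above, is a small piece of arithmetic. A minimal sketch producing the HTTP `Range` header values that the parallel GETs would use (function name is hypothetical; the actual GETs and thread pool are omitted):

```python
PART_SIZE = 5 * 1024 * 1024   # 5 MB ranges, matching the slide

def byte_ranges(total_size, part_size=PART_SIZE):
    """Return the HTTP Range header values for fetching an object of
    total_size bytes in part_size chunks. Range ends are inclusive."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges
```

Each range becomes one `GET` with a `Range` header, issued from the dedicated thread pool and reassembled in order.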
  31. Challenge 5/6: Thumbnails of large images • High traffic for thumbnails of images • 3000+ requests per second • Image thumbnails generated on the fly • Large-image thumbnails are expensive • Large object to download from Amazon S3 • More time to generate thumbnail
  32. Challenge 5/6: Thumbnails of large images • Solution: Create an intermediate JPEG thumbnail and cache it in Amazon S3 (flow: content bucket -> Cloud Drive -> thumbnail bucket)
  33. Challenge 5/6: Thumbnails of large images • Cache in S3 bucket with 48-hour expiry • Key on hash of customer id + image id + image version • 2K x 2K JPEG, ~1MB • Cache candidates: • JPEG, PNG, TIFF >10MB • All other images (primarily RAW)
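The cache key scheme above (hash of customer id + image id + image version) can be sketched as follows. The slide says only "hash"; SHA-256 and the separator are assumptions for illustration:

```python
import hashlib

def thumbnail_cache_key(customer_id, image_id, image_version):
    """Deterministic S3 key for the cached intermediate thumbnail.
    Including the image version means an edited image naturally gets
    a fresh key, so stale thumbnails are never served."""
    raw = f"{customer_id}:{image_id}:{image_version}".encode()
    return hashlib.sha256(raw).hexdigest()
```

Hashed keys also keep the thumbnail bucket's key space uniformly distributed, the same hot-prefix consideration as the content bucket.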
  34. Challenge 6/6: Large direct downloads • No on-the-fly transformations for large files • Downloading to disk doesn’t make sense • Redirect to a short-lived Amazon S3 presigned URL
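The redirect pattern above hands the client a URL that embeds an expiry and a signature, so S3 can serve the bytes directly. The real service would generate this with the AWS SDK (SigV4); the sketch below is a deliberately simplified stand-in using a plain HMAC, just to show the shape of a presigned URL — the secret, host, and parameter names are not the real AWS scheme:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-signing-key"   # stand-in for AWS credentials

def presign(bucket, key, ttl_s, now=None):
    """Simplified presigned-URL sketch: sign the object path plus an
    absolute expiry time. NOT real SigV4; use the SDK in practice."""
    expires = int(now if now is not None else time.time()) + ttl_s
    path = f"/{bucket}/{key}"
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(),
                   hashlib.sha256).hexdigest()
    return f"https://s3.amazonaws.com{path}?" + urlencode(
        {"Expires": expires, "Signature": sig})
```

The service returns this URL in an HTTP 302, so large transfers bypass the Cloud Drive fleet entirely, and the short TTL limits how long a leaked URL is useful.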
  35. Takeaways • Amazon S3 is flexible • Not just for big data, but caching, coordinating configuration • Selection of Amazon S3 keys is important • Upload and download strategies depend on file size and workflow • First fallacy of distributed computing: the network is reliable • Retrying upload and download requests may be appropriate • Limit retries
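The "retry, but limit retries" takeaway (the slides use a single retry for uploads and downloads) can be sketched as a small bounded-retry helper; the name and default are illustrative:

```python
def with_retries(op, max_attempts=2):
    """Run op, retrying on exception up to max_attempts total tries.
    The bound is deliberate: unbounded retries turn a transient S3
    error into a self-inflicted traffic amplification."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise
```

With `max_attempts=2` this matches the "single retry on failure" used in the download path.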
  36. Final thoughts • Experience Amazon Cloud Drive • Build apps with the Amazon Cloud Drive API • Earn revenue & reach millions of Amazon customers
  37. Thank you!
  38. Remember to complete your evaluations!