Maximizing Amazon S3 Performance (STG304) | AWS re:Invent 2013

This advanced session targets Amazon Simple Storage Service (Amazon S3) technical users. We will discuss the impact of object naming conventions and parallelism on S3 performance; provide real-world examples and code that implements best practices for naming objects and parallelizing both PUTs and GETs; cover multipart uploads and byte-range downloads; and introduce GNU parallel as a quick and easy way to improve S3 performance.


Transcript

  • 1. Maximizing Amazon S3 Performance Craig Carl, AWS November 15, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. Trillions Of Unique Customer Objects
  • 3. 1.5 Million+ Peak Transactions Per Second
  • 4. Agenda • Architecture – Choosing a region – Building a naming scheme – Considering LISTs • Optimizing PUTs – Multipart upload • Optimizing GETs – Using CloudFront – Range-based GETs
  • 5. Choosing a Region • Performance – Proximity to your users – Co-locating with compute, other AWS resources • Other things to think about – Legal and regulatory requirements – Costs vary by region
  • 6. Pay Attention to Your Naming Scheme If: • You want consistent performance from a bucket • You want a bucket capable of routinely exceeding 100 TPS http://amzn.to/18oF5LC
  • 7. Transactions Per Second (TPS) • 100/8 = 12.5 events/sec • 100,000 users @ 10 events an hour = 224 TPS
  • 8. Distributing Key Names • Don’t do this <my_bucket>/2013_11_13-164533125.jpg <my_bucket>/2013_11_13-051033564.jpg <my_bucket>/2013_11_13-061133789.jpg <my_bucket>/2013_11_13-051033458.jpg <my_bucket>/2013_11_12-063433125.jpg <my_bucket>/2013_11_12-021033564.jpg <my_bucket>/2013_11_12-065533789.jpg <my_bucket>/2013_11_12-011033458.jpg <my_bucket>/2013_11_11-022333125.jpg <my_bucket>/2013_11_11-153433564.jpg <my_bucket>/2013_11_11-065233789.jpg <my_bucket>/2013_11_11-065633458.jpg
  • 9. Distributing Key Names • Add randomness to the beginning of the key name <my_bucket>/521335461-2013_11_13.jpg <my_bucket>/465330151-2013_11_13.jpg <my_bucket>/987331160-2013_11_13.jpg <my_bucket>/465765461-2013_11_13.jpg <my_bucket>/125631151-2013_11_13.jpg <my_bucket>/934563160-2013_11_13.jpg <my_bucket>/532132341-2013_11_13.jpg <my_bucket>/565437681-2013_11_13.jpg <my_bucket>/234567460-2013_11_13.jpg <my_bucket>/456767561-2013_11_13.jpg <my_bucket>/345565651-2013_11_13.jpg <my_bucket>/431345660-2013_11_13.jpg
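The random-prefix scheme above can be sketched in a few lines of Python. `randomized_key` is a hypothetical helper name, not part of any AWS SDK; it simply prepends a zero-padded random number to a key so that consecutive uploads spread across S3's keyspace:

```python
import random

def randomized_key(name, width=9):
    # Hypothetical helper: prepend a zero-padded random numeric
    # prefix so key names no longer share a common leading sequence.
    prefix = random.randint(0, 10**width - 1)
    return f"{prefix:0{width}d}-{name}"

key = randomized_key("2013_11_13.jpg")
# e.g. "521335461-2013_11_13.jpg"
```

The prefix width is a free parameter; a handful of leading characters of entropy is what matters, since S3 partitions on the leading portion of the key.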
  • 10. Other Techniques for Distributing Key Names • Store objects as a hash of their name – add the original name as metadata • “deadmau5_mix.mp3”  0aa316fb000eae52921aab1b4697424958a53ad9 – watch for duplicate names! – prepend keyname with short hash • 0aa3-deadmau5_mix.mp3 • Epoch time (reverse) – 5321354831-deadmau5_mix.mp3
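The hash-based variants on this slide can be sketched as follows, assuming SHA-1 as the hash (the 40-character hex digest shown on the slide is consistent with SHA-1). `hashed_key` is an illustrative name; the original filename would be kept as object metadata so it can be recovered:

```python
import hashlib

def hashed_key(name, short=4):
    # Sketch of the slide's two hash techniques: store the object
    # under the full SHA-1 of its name, or prepend a short hash
    # prefix to the original name. Keep `name` as metadata either way.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return digest, f"{digest[:short]}-{name}"

full, short = hashed_key("deadmau5_mix.mp3")
```

Note the slide's caveat: hashing only the name means two different files with the same name collide, so deduplicate or include more than the name in the hash input.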
  • 11. Randomness in a Key Name Can Be an Anti-Pattern • Lifecycle policies • LISTs with prefix filters • Maintaining thumbnails of images – craig.jpg -> stored as orig-09329jed0fc – thumb-09329jed0fc • When you need to recover a file with its original name
  • 12. Solving for the Anti-Pattern • Add additional prefixes to help sorting <my_bucket>/images/521335461-2013_11_13.jpg <my_bucket>/images/465330151-2013_11_13.jpg <my_bucket>/movies/293924440-2013_11_13.jpg <my_bucket>/movies/987331160-2013_11_13.jpg <my_bucket>/thumbs-small/838434842-2013_11_13.jpg <my_bucket>/thumbs-small/342532454-2013_11_13.jpg <my_bucket>/thumbs-small/345233453-2013_11_13.jpg <my_bucket>/thumbs-small/345453454-2013_11_13.jpg • Amazon S3 maintains keys lexicographically in its internal indices
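Because S3 keeps keys lexicographically sorted in its internal index, a prefix filter on LIST groups related objects cheaply even when the trailing portion of each key is random. A minimal stand-in for that behavior:

```python
def list_with_prefix(all_keys, prefix):
    # Mimic an S3 LIST with a prefix filter: keys are held in
    # lexicographic order, so a shared prefix clusters objects
    # together regardless of the random component after it.
    return [k for k in sorted(all_keys) if k.startswith(prefix)]

keys = [
    "images/521335461-2013_11_13.jpg",
    "thumbs-small/838434842-2013_11_13.jpg",
    "movies/293924440-2013_11_13.jpg",
    "images/465330151-2013_11_13.jpg",
]
images = list_with_prefix(keys, "images/")
```

This is why the prefix goes first and the randomness second: the prefix preserves sortability for LIST and lifecycle policies, while the random component still spreads load within each prefix.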
  • 13. Distributing Your Key Names Is Always a Good Idea! It can take some time for improvements to manifest Open a support case if you need an immediate bump or if you’ve got any questions! http://amzn.to/18oF5LC
  • 14. Amazon CloudFront
  • 15. Using Amazon CloudFront for Distribution • Caches objects from Amazon S3 • Reduces the number of Amazon S3 GETs • Low latency with multiple endpoints • High transfer rate • Two flavors: – Web distribution (static content) – RTMP distribution (on-demand streaming of media)
  • 16. Multipart Upload Provides Parallelism • Allows faster, more flexible uploads • Allows you to upload a single object as a set of parts • Upon upload, Amazon S3 then presents all parts as a single object • Enables parallel uploads, pausing and resuming an object upload, and beginning uploads before you know the total object size
  • 17. Choose the Right Part Size • Strike a balance between part size and number of parts – Lots of small parts increase connection overhead, invalidating the benefits of parallelism – Too few large parts don’t get you enough benefits of multipart; don’t get you resiliency to network errors • We recommend parts of 25–50 MB on higher-bandwidth networks and parts of 10 MB on mobile networks
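Splitting an object into parts of the recommended size is simple arithmetic; a sketch of the boundary computation (inclusive byte ranges, as S3 part/range semantics use):

```python
def part_ranges(object_size, part_size):
    # Split an object into (start, end) inclusive byte ranges,
    # each at most part_size bytes; the final part may be smaller.
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# 100 MB object with 25 MB parts -> 4 parts.
ranges = part_ranges(100 * 2**20, 25 * 2**20)
```

With the slide's guidance, `part_size` would be 25–50 MB on higher-bandwidth networks and around 10 MB on mobile networks.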
  • 18. You Can Parallelize Your GETs, Too • Use range-based GETs to get multithreaded performance when downloading objects • Compensates for unreliable networks • Benefits of multithreaded parallelism • Align your ranges with your parts!
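A parallel range-based GET can be sketched with a thread pool: each worker fetches one byte range (expressed as an HTTP `Range` header) and the parts are reassembled in order. The `fetch` callable here is a hypothetical stand-in for an S3 GET so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def range_header(start, end):
    # HTTP Range header for an inclusive byte span (RFC-style "bytes=").
    return f"bytes={start}-{end}"

def parallel_get(fetch, ranges, workers=4):
    # Fetch each byte range concurrently; pool.map preserves input
    # order, so the parts concatenate back into the original object.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(fetch, (range_header(s, e) for s, e in ranges)))
    return b"".join(parts)

# Stand-in for an S3 GET: slice a local buffer by the requested range.
blob = bytes(range(256)) * 4
def fake_fetch(header):
    span = header.split("=", 1)[1]
    s, e = (int(x) for x in span.split("-"))
    return blob[s:e + 1]
```

Per the slide's advice, the ranges passed here should line up with the part boundaries used at upload time.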
  • 19. If you’re using SSL and parallelizing… • You’re likely to become CPU-constrained because encryption is CPU-intensive • Amazon S3 recommends using AES-256 to optimize for security and performance • You can leverage AES-NI hardware on your host to improve your performance
  • 20. If Your Application Relies on LIST… • Getting the objects your customers have stored • Seeing sets of files (all animations, videos) • Getting logs • Viewing inventories • Sorting keys based on metadata
  • 21. What Should You Do? • Parallelize LIST when you need a sequential list of your keys • You should build a secondary index of your keys, such as with Amazon DynamoDB, to get a faster alternative to LIST when a sequential list isn’t sufficient – Sorting by metadata – Looking up by category – Objects by time stamp
  • 22. LIST Operations with Amazon DynamoDB • Maintain metadata in DynamoDB – Keep data about what’s in your buckets in DynamoDB • On PUTs, enter data about your objects in DynamoDB • On GETs, use DynamoDB to assist in your search for specific objects • You can use DynamoDB to give you “LIST” based on specific criteria
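The pattern on this slide can be sketched with an in-memory class standing in for the DynamoDB table (a real implementation would write items with the DynamoDB API on each PUT and query them on lookup; the class and method names here are illustrative):

```python
from collections import defaultdict

class MetadataIndex:
    # Toy stand-in for a DynamoDB metadata table: record object
    # metadata at PUT time, then answer "LIST by criterion" queries
    # without ever issuing a LIST against the bucket itself.
    def __init__(self):
        self._by_category = defaultdict(list)

    def on_put(self, key, category, timestamp):
        # Called alongside every S3 PUT.
        self._by_category[category].append({"key": key, "ts": timestamp})

    def list_category(self, category):
        # Fast alternative to a sequential S3 LIST + client-side filter.
        return sorted(e["key"] for e in self._by_category[category])
```

The same table can carry whatever attributes you need to query by: category, timestamp, original filename, and so on.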
  • 23. Wrap up: Maximizing Amazon S3 Performance • Architecture – Choosing a region – Building a naming scheme – Considering LISTs • Optimizing PUTs – Multipart upload • Optimizing GETs – Using CloudFront – Range-based GETs
  • 24. Please give us your feedback on this presentation STG304 As a thank you, we will select prize winners daily for completed surveys!