AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Accelerate Innovation and Growth with Amazon S3 (STG309)

Startups around the world use AWS services to access the power of the cloud to grow faster and more cost effectively. In this session, Smartsheet talks about how they were able to cost-effectively build their prototype for scale and avoid replatforming at different points in the adoption curve, and Quantcast discusses how they are running a high-performance analytics solution on AWS. They provide several tips and tricks for S3, and show how they removed a traditional MySQL data store from a distributed-image hosting application so that the only required data store is S3. They also show how to avoid common, cumbersome database practices by working with the eventually consistent nature of S3 objects and the fact that objects and directories share the same namespace.


1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
   Kevin Stinson, Senior Software Engineer, Big Data Services, Quantcast Corporation
   D.J. Hanson, Director of Infrastructure, Smartsheet
   November 30, 2016
   Case Study: How Startups like Smartsheet and Quantcast Accelerate Innovation and Growth with Amazon S3 (STG309)
2. What to Expect from the Session
   • Quick overview of Quantcast’s MapReduce system
   • Changes made to move to AWS and Amazon S3
   • Problems we encountered on the way and their resolutions
3. A little bit about Quantcast
   • Uses real-time data about consumer behavior to significantly improve the relevancy of digital advertising
   • Over 100 billion bids and 40 PB of data processed per day
   • 180 engineers globally across San Francisco, Seattle, Singapore, and London
   • We’re hiring – reinvent@quantcast.com
4. MapReduce at Quantcast
5. QFS – Quantcast’s distributed file system
   • Open sourced – https://github.com/quantcast/qfs
   • Written in C++
   • Compatible with Hadoop 0.23 and higher, Hive, Spark, Storm, etc.
   • Supports replication, erasure coding, tiered storage, and rack awareness
6. QFS – continued
   • Many of our internal tools assume data is on QFS
   • Quantcast has more than 17 PB of data stored in QFS
7. Basic QFS setup
   [Diagram: a QFS client talks to the metaserver and to chunkservers; each chunkserver stores data in RAM, SSD, and disk tiers]
8. Quantflow – Quantcast’s MapReduce system
   • Over 40 PB processed daily
   • Relies heavily on QFS
   • Uses a QFS instance tiered with RAM disks and SSDs for intermediate data
   • Bundled with control/monitoring systems like ZooKeeper and Ganglia
9. Moving Quantflow to AWS
10. Adding Amazon S3 support to QFS
   • Uses an S3 bucket as a block device
   • Replication and erasure coding are not supported because S3 is already reliable
   • Makes S3 appear as just another tier in QFS
   • I/O performance comparable to other S3-based file systems such as EMRFS
   • Supports fast renames and deletes
   • Usable with standard Hadoop and Hadoop-friendly tools
11. QFS setup with S3 bucket
   [Diagram: the basic QFS setup from before, with an S3 bucket added behind the chunkservers as one more storage tier]
12. Changes to Quantflow
   • Repackaged Quantflow for easier installation on a fresh Amazon EC2 cluster
   • Some important services run on dedicated instances, but all MapReduce workers can run on spot instances
13. Data flow
   • The ends of the data pipeline are generally QFS on S3
   • Intermediate data is on QFS using tiered RAM disks, SSDs, or Amazon EBS volumes, with replication or erasure coding
   • Direct access to QFS data in the data center is possible but limited by bandwidth and cost-control concerns
14. Copying data to S3 QFS
   • Copied 8 PB of data center data to S3, as backup and as input for AWS Quantflow jobs
   • Done as a copy from one QFS instance to another
   • The process took weeks to complete
   • The major bottleneck was the 20 Gb/sec link between the data center and Amazon
   • We still copy 120–150 TB/day
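A quick back-of-the-envelope check on the "weeks" figure (a sketch assuming decimal petabytes and a fully saturated link, which real transfers never achieve):

```python
# Minimum transfer time for 8 PB over a 20 Gb/sec link, best case.
petabyte_bytes = 10 ** 15
total_bits = 8 * petabyte_bytes * 8   # 8 PB expressed in bits
link_bps = 20 * 10 ** 9               # 20 Gb/sec
seconds = total_bits / link_bps
days = seconds / 86_400
print(f"{days:.0f} days")             # ~37 days at 100% link utilization
```

With protocol overhead and contention on the shared link, "weeks" stretching past a month is exactly what this arithmetic predicts.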
15. Issues and Resolutions
16. Low S3 performance
   • Initial tests of Quantflow in AWS ran slower than expected
   • S3 performance hit an apparent cap at 20–30 GB/sec
   • Adding more EC2 instances to the Quantflow cluster did not improve performance
   • Tests accessing S3 directly had the same problem
17. Finding the cause
   • It took us 2 months to find the root cause, even with help from AWS engineers
   • A tcpdump showed that 8% of traffic was DNS queries
   • A parallel DNS query benchmark showed our internal DNS server achieved only 200 QPS, vs. 10,000 QPS using Amazon DNS
   • All DNS queries went to a DNS server on a t2.micro instance – a legacy of our data center setup
   • S3 uses short DNS TTLs for load balancing
18. Fixing the problem
   • We configured dnscache on worker nodes to forward DNS queries for S3 endpoints to Amazon DNS
   • We achieved 75 GB/sec with 3,200 concurrent processes on 200 c3.8xlarge instances with dnscache; 100 GB/sec is easily achievable by using c4.8xlarge and adding a few more instances
   • Using Amazon VPC DNS should also work
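A resolver bottleneck like the one above shows up in a small parallel-lookup benchmark. A minimal sketch (the hostname, query count, and worker count are placeholders; a real test would resolve an S3 endpoint against each candidate resolver from inside the VPC):

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor

def dns_qps(hostname: str, queries: int = 200, workers: int = 20) -> float:
    """Rough resolver throughput: issue many lookups in parallel, report QPS."""
    def resolve(_):
        socket.getaddrinfo(hostname, 443)

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(resolve, range(queries)))
    return queries / (time.monotonic() - start)

# "localhost" keeps the sketch self-contained; a real benchmark would use
# an S3 endpoint such as s3.amazonaws.com.
print(f"{dns_qps('localhost'):.0f} QPS")
```

Comparing this number against the resolver's advertised capacity quickly separates a DNS ceiling from an S3 ceiling.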
19. Improvement from DNS caching change
   [Bar chart: S3 read performance on 200 × c3.8xlarge, 16 workers/instance, 64 MB × 16 objects, Boto2 APIs, 3,200 concurrent processes – single DNS forwarder: 22 GB/sec; with dnscache: 74 GB/sec]
20. Checklist to enable 100 GB/sec with S3
   • Use multipart upload with a large-enough object size
   • Use well-distributed object keys
   • Have enough DNS capacity to achieve 10,000 QPS
   • Enable partitioning of the bucket, which needs time and data
   • Pay attention to instance types and their bandwidth
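The deck does not spell out what "well-distributed object keys" look like. One common approach, which matches the hash-prefixed layout (`1f/ee/...`) visible in the Smartsheet listings later in this deck, is to derive a short hex prefix from a hash of the logical name (the two-level `aa/bb/` layout here is an illustrative choice, not a fixed rule):

```python
import hashlib

def distributed_key(name: str) -> str:
    """Prefix a logical name with hash-derived hex so keys spread evenly.

    S3 partitions a bucket by key prefix, so monotonically increasing keys
    (dates, sequence numbers) pile requests onto a single partition; a
    short hash prefix spreads load across partitions.
    """
    digest = hashlib.md5(name.encode()).hexdigest()
    return f"{digest[:2]}/{digest[2:4]}/{name}"

print(distributed_key("logs/2016-11-30/part-00000"))
```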
21. Tools that helped
   • dig, tcpdump, boto with logging
   • AWS CLI, S3 bucket logging
   • Parallel execution tools like GXP cluster shell
   • Try micro-benchmarking before checking the whole stack
22. Quantflow spot fleet issues
   • Getting a large spot fleet of more capable instances can be difficult, take a long time, or cost more than expected
   • With availability and pricing changes, we may want a mixture of several different spot instance types, and the ability to drop or lose instances
   • Because intermediate data is stored locally, losing instances can cause job failures
23. A workaround
   • Request multiple smaller spot fleets
   • Tell QFS that each fleet is its own virtual rack
   • QFS will try to spread the data out across racks
   • Using N-way replication, up to N−1 fleets can be lost
   • Using QFS’s standard 6+3 erasure coding, up to 3 fleets can be lost, while using less space than 4-way replication
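The space/durability claim in the last two bullets reduces to simple arithmetic (a sketch comparing raw storage ratios only, ignoring metadata overhead):

```python
def replication_overhead(copies: int) -> float:
    """N-way replication stores N full copies and tolerates N - 1 losses."""
    return float(copies)

def erasure_overhead(data: int, parity: int) -> float:
    """(data + parity) erasure coding stores data + parity chunks and
    tolerates `parity` losses."""
    return (data + parity) / data

# 4-way replication: 4.0x raw storage for a 3-fleet loss tolerance.
# QFS's 6+3 erasure coding: 1.5x raw storage for the same tolerance.
print(replication_overhead(4), erasure_overhead(6, 3))
```

So 6+3 erasure coding buys the same 3-fleet loss tolerance as 4-way replication at less than half the storage cost.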
24. Parting thoughts
   • An easily overlooked item in your setup can have a large impact on performance
   • As we started using AWS services on a larger scale, we hit a number of account limitations, such as instance limits, total provisioned SSD limits, etc.
   • If your performance levels off, ask your friendly AWS liaison whether an account limitation is the issue
25. The Smartsheet use case
26. In the before times… During the Waywhen.
27. Game changers disrupt prior assumptions
28. A dirty trick – Objects aren’t paths
   $ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e
                              PRE 1feed5-1337-d00d-2ba5e/
   2016-12-25 13:29:10    1048576 1feed5-1337-d00d-2ba5e
   $ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/
                              PRE mobile/
                              PRE thumbs/
   $ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/mobile/
   2016-11-22 10:17:21          0
   2016-11-22 10:17:34     165342 400.jpg
   $ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/thumbs/
   2016-11-22 10:15:23          0
   2016-11-22 10:17:13        455 20.png
   2016-11-22 10:17:13     169722 400.png
   2016-11-22 10:17:12     494804 700.png
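The listings behave this way because S3's namespace is flat: a key and the same key with a trailing slash are independent objects, and the `PRE` rows are synthesized at list time from a delimiter. A minimal pure-Python model of that listing behavior (the bucket contents here are invented for illustration):

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Mimic S3's delimiter listing: return (common_prefixes, objects)."""
    prefixes, objects = set(), []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter becomes a "PRE" entry.
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(prefixes), objects

# "pic" and "pic/" coexist: the namespace is flat, directories are an illusion.
bucket = {"pic", "pic/", "pic/mobile/400.jpg", "pic/thumbs/20.png"}
print(list_keys(bucket, prefix=""))      # pic/ is a PRE, pic is an object
print(list_keys(bucket, prefix="pic/"))  # the 0-byte pic/ marker lists too
```

This is why the slide's listings show both a `PRE 1feed5.../` row and a full-size object of the same name, plus zero-byte entries for the marker objects.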
29. The power of the trailing slash
   $ ls -la
   drwxrwxr-x. 2 djhanson djhanson 4096 Dec 25 10:38 bar
   -rw-rw-r--. 1 djhanson djhanson    0 Dec 25 10:37 foo.bar
   -rw-rw-r--. 1 djhanson djhanson    0 Dec 25 10:37 foo.baz
   -rw-rw-r--. 1 djhanson djhanson    0 Dec 25 10:37 foo.qux
   $ mv foo.bar bar    # Works – the directory exists
   $ mv foo.baz baz    # Oops – not what we wanted!
   $ mv foo.qux qux/   # Fails appropriately
   mv: cannot move `foo.qux' to `qux/': Not a directory
   $ find .
   .
   ./bar
   ./bar/foo.bar       ← Desired state
   ./baz               ← This is not what we wanted
   ./foo.qux           ← Proper error condition
30. Take care when operating against paths
   $ aws s3 cp s3://icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/ ./picture
   $ md5sum ./picture
   d41d8cd98f00b204e9800998ecf8427e
   $ aws s3 cp s3://icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e ./picture
   $ md5sum ./picture
   1cdb80e2693da95e7fa647895d6277c8
   $ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e
                              PRE 1feed5-1337-d00d-2ba5e/
   2016-12-25 13:29:10    1048576 1feed5-1337-d00d-2ba5e
   $ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/
                              PRE mobile/
                              PRE thumbs/
   2016-12-25 13:29:44         18 meta.json
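The first `md5sum` output is the tell: it is the MD5 of zero bytes, so the trailing-slash copy fetched the empty directory-marker object rather than the 1 MB picture. Easy to verify locally:

```python
import hashlib

# MD5 of the empty string - the fingerprint of an accidentally copied
# zero-byte S3 "directory marker" object.
empty_md5 = hashlib.md5(b"").hexdigest()
print(empty_md5)  # d41d8cd98f00b204e9800998ecf8427e
```

Memorizing (or grepping for) this hash is a cheap way to spot "I copied the marker, not the data" bugs in S3 tooling.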
31. X-AMZ-META-FTW
32. The power of the trailing slash
   X-AMZ-META-BILLTO: ALICE
   X-AMZ-META-CREATOR: BOB
   X-AMZ-META-STYLE: CLASSIFIED
   X-AMZ-META-RELATIONSHIP: COMPLICATED
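S3 returns user-defined object metadata like the headers above on every GET or HEAD as `x-amz-meta-*` headers; with boto3 it is supplied via the `Metadata` argument to `put_object`. A small sketch of the mapping (the helper is illustrative, with values echoing the slide):

```python
def to_s3_headers(metadata: dict) -> dict:
    """Render user metadata the way S3 serves it: x-amz-meta-<key> headers.

    S3 lowercases metadata keys, and the values ride along with every
    GET/HEAD, so per-object facts (owner, billing tag) need no separate
    database - the point of the Smartsheet design.
    """
    return {f"x-amz-meta-{key.lower()}": value for key, value in metadata.items()}

headers = to_s3_headers({"BillTo": "ALICE", "Creator": "BOB"})
print(headers)  # {'x-amz-meta-billto': 'ALICE', 'x-amz-meta-creator': 'BOB'}
```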
33. A caveat about consistency
34. Related Sessions
   • For more about Quantcast’s experiences with other AWS services, check out DAT310 – Building Real-Time Campaign Analytics Using AWS Services
   • For more info on S3, check out STG303 – Deep Dive on Amazon S3
35. Thank you!
36. Remember to complete your evaluations!
