
(APP402) Serving Billions of Web Requests Each Day with Elastic Beanstalk | AWS re:Invent 2014

AWS Elastic Beanstalk provides a number of simple and flexible interfaces for developing and deploying your applications. Follow Thinknear's rapid growth from inception to acquisition, scaling from a few dozen requests per hour to billions of requests served each day with AWS Elastic Beanstalk. Thinknear engineers demonstrate how they extended the AWS Elastic Beanstalk platform to scale to billions of requests while meeting response times below 100 ms, discuss tradeoffs they made in the process, and what did and did not work for their mobile ad bidding business.


  1. APP402 Serving Billions of Web Requests Each Day with AWS Elastic Beanstalk John Hinnegan, Engineer, Thinknear by Telenav Mik Quinlan, Engineer, Thinknear by Telenav
  2. © 2014 Telenav and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Telenav. Agenda: Resources and handouts; Who are we?; Why AWS Elastic Beanstalk?; Architecture; The data collection challenge; The scalability challenge; The monitoring challenge
  3. Resources and handouts
  4. Resources and handouts •We are releasing a full snapshot of our setup on AWS Elastic Beanstalk •All our files related to setup of the host and environment –RAID0 and logrotate –Monitoring and collectd –Unix and app server tuning •Open sourcing our Amazon Simple Storage Service (Amazon S3) log-rotation gem •Slides and presentation available through AWS re:Invent
  5. Resources and handouts •Full production AWS Elastic Beanstalk config available online –https://github.com/ThinkNear/aws_templates •We have blogged about the concepts in this presentation with more detailed walkthroughs –http://engineering.thinknear.com •This presentation should give you a good idea of the pieces needed to design a high-performance system on AWS Elastic Beanstalk
  6. Who are we?
  7. Who are we? •Thinknear: hyper-local, mobile advertising platform –Huh? A buying platform for mobile ads based on location •Continues to operate as a separate business unit within Telenav
  8. RTB mobile advertising •We are an RTB mobile ad buying platform –(a hyperlocal mobile DSP in advertising lingo) •RTB = real-time bidding
  9. Why AWS Elastic Beanstalk?
  10. AWS Elastic Beanstalk provides... •Speed to market –Very small engineering team –AWS Elastic Beanstalk gave us a preconfigured starting point •Ties together lots of AWS services –Lower team learning curve
  11. What are we doing with AWS Elastic Beanstalk? •Real-time ad bidder and ad server •Running on Tomcat •API-based (no web UI) •Peak of 150K+ requests per second •Peak of 100+ hosts •9+ billion requests / 2+ TB data per day •Autoscaling: 70-100% per day, up and down
  12. Why are we still using AWS Elastic Beanstalk? •The UI •Managed deployment mechanism •Configuration mechanism •Aggregated metrics •Ongoing development and support –Challenges get addressed; continued innovation –Someone is there to help when stuff goes wrong
  13. Architecture
  14. Architecture: Requirements and constraints •Must be scalable –Start small, but with plans to grow big –Handle lots of requests: minimum hundreds of requests/sec/host; thousands is better •Must be robust, reliable, and fault tolerant –May get kicked off exchanges if too many errors or timeouts –Goal is 99.9% valid responses to exchanges
  15. Architecture: Requirements and constraints •Must achieve low latency –Client-side limited to 100-150 ms –If we’re late, our response doesn’t count –Sometimes they take the first response, in which case it’s better to be faster than “righter” •Limited by the standard JSON interface
  16. Architecture: Principles •Read-only whenever possible •Async writes •Horizontally scalable •Don’t try to prevent all errors; that’s impossible –Instead, know that errors will happen and focus on detection
  17. Architecture: Principles •Encapsulate errors –Don’t expose failures to clients •Internal rate limiting –Protect from volume surges •Proven technologies –Limit technical risk
  18. Architecture: Principles •A successful system will achieve several goals –High request throughput –High request success rate –Low costs –Low maintenance requirements –Ability to scale 10x-100x without major improvements
  19. Architecture: AWS Elastic Beanstalk structure •AWS Elastic Beanstalk application => Thinknear stage –Sandbox: developer integration –Test: automated integration –Production
  20. Architecture: AWS Elastic Beanstalk structure •AWS Elastic Beanstalk environment => Thinknear cluster –One per large partner –Smaller partners share a cluster –Custom configs and easier troubleshooting
  21. Architecture: AWS Elastic Beanstalk structure
  22. Architecture
  23. Architecture
  24. Architecture: Amazon S3 •Amazon Simple Storage Service used for static data –Preloaded on each host at startup –Refreshed infrequently and asynchronously –Just check timestamps for changes –Data changes rarely (daily, weekly) –Relatively small data sets (total cache ~4 GB)
  25. Architecture
  26. Architecture: Memcached •Amazon ElastiCache (Memcached) is our primary data store –Yes, it’s transitory and must be rebuilt on restart –Super fast: client-side latencies average under 3 ms, 99.5% < 15 ms –Everything read online during a request is a key-value lookup –We serialize gets with foreign keys to avoid joins; software joins are faster for us •Still need to tune Memcached; nontrivial at scale •Pre-normalize data into a format optimized for reads
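The "serialize gets with foreign keys" pattern above can be sketched as follows. This is an illustrative reconstruction, not Thinknear's actual code: a HashMap stands in for the Memcached client, and the Campaign/Creative names are hypothetical.

```java
import java.util.*;

// Software-join sketch: the campaign record stores foreign keys, and we
// resolve them with a second round of key-value lookups instead of a
// relational join. HashMaps stand in for the Memcached client here.
class SoftwareJoin {
    record Campaign(String id, List<String> creativeIds) {}
    record Creative(String id, String markup) {}

    static final Map<String, Campaign> campaignStore = new HashMap<>();
    static final Map<String, Creative> creativeStore = new HashMap<>();

    static {
        // Pre-normalized sample data, as it would be loaded into the cache.
        creativeStore.put("cr1", new Creative("cr1", "<ad 1>"));
        creativeStore.put("cr2", new Creative("cr2", "<ad 2>"));
        campaignStore.put("camp1", new Campaign("camp1", List.of("cr1", "cr2")));
    }

    // Two round trips: one get for the campaign, one multi-get for all
    // the creatives it references.
    static List<Creative> creativesFor(String campaignId) {
        Campaign c = campaignStore.get(campaignId);          // get #1
        if (c == null) return List.of();
        List<Creative> out = new ArrayList<>();
        for (String fk : c.creativeIds()) {                  // multi-get #2
            Creative cr = creativeStore.get(fk);
            if (cr != null) out.add(cr);
        }
        return out;
    }
}
```

With a real Memcached client the second step would be a single bulk get over all foreign keys, which is what keeps this faster than a server-side join for small, hot objects.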
  27. Architecture
  28. Architecture: Ehcache •On-host hot cache: local Ehcache –Some items are extremely hot and appear in a disproportionate number of requests –Amazon DynamoDB is good with distributed keys •But not with hot keys –Use on-host caches to hold hot items for short durations before refresh –Also reduces call volume to the data store
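The hot-cache idea above can be sketched in a few lines. Production uses Ehcache; this minimal stand-in only illustrates the short-TTL read-through pattern, and the loader callback is a placeholder for the real data-store call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Short-TTL read-through cache sketch: hot keys are served locally for a
// few seconds instead of hitting DynamoDB/Memcached on every request.
class HotCache<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}
    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    HotCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Return a cached value if still fresh; otherwise load it from the
    // backing store and cache it for ttlMillis.
    V get(K key, Function<K, V> loader) {
        long now = System.currentTimeMillis();
        Entry<V> e = entries.get(key);
        if (e != null && e.expiresAtMillis() > now) return e.value();
        V v = loader.apply(key);
        entries.put(key, new Entry<>(v, now + ttlMillis));
        return v;
    }
}
```

The TTL trades a bounded amount of staleness for a large cut in backing-store call volume on the hottest keys.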
  29. Architecture
  30. Architecture: Amazon DynamoDB •Reading –Extremely large (multi-TB) datasets needing fast access –Data sets change infrequently (weekly-ish) –Can load extremely large datasets efficiently from Amazon Elastic MapReduce –Slightly slower than Memcached for reads, but... –Variability in read times is extremely low
  31. Architecture: Amazon DynamoDB •Writing –Durable, fast-propagating values –Perfect for counters: writing increments asynchronously in offline threads for minimal request-latency impact
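The async counter pattern can be sketched like this. It is a reconstruction under stated assumptions: the request thread only bumps an in-memory counter, and a background thread periodically drains the deltas; the writer callback stands in for a DynamoDB atomic-increment (UpdateItem ADD) call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.BiConsumer;

// Async counter sketch: the hot path does no I/O; an offline thread
// flushes accumulated deltas to the durable store as increments.
class AsyncCounters {
    private final ConcurrentHashMap<String, LongAdder> deltas = new ConcurrentHashMap<>();

    // Hot path: nanoseconds of work, no request-latency impact.
    void increment(String key, long amount) {
        deltas.computeIfAbsent(key, k -> new LongAdder()).add(amount);
    }

    // Offline path: drain the deltas and hand each one to the writer
    // (in production, a DynamoDB atomic increment per key).
    void flush(BiConsumer<String, Long> writer) {
        for (Map.Entry<String, LongAdder> e : deltas.entrySet()) {
            long n = e.getValue().sumThenReset();
            if (n != 0) writer.accept(e.getKey(), n);
        }
    }
}
```

`sumThenReset()` makes the drain safe to run concurrently with the hot path; an increment racing the flush is simply picked up on the next cycle.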
  32. The data collection challenge
  33. The big data challenge: Volume and rate •Terabytes generated per day •Solid use cases –What do you need to report on? –What conclusions do you need to draw? •Strategize on what to keep –Report on spend, clicks, performance: persist critical entries 100% –Analyze and forecast inventory: sample auctions at 1%
  34. The big data challenge: Analysis •Too many dimensions and diverse reports to use something like Amazon Kinesis •Lots of work to ensure completeness –Scan Amazon S3 for missing log periods –Alert on failed uploads –Keep event totals in DynamoDB as an audit trail
  35. The big data challenge: Collection •Use local storage over Amazon Elastic Block Store –Nothing important persists on the local drive –Amazon EBS (used to be) slower –Amazon EBS is another point of failure •Set up local drives in RAID0 for better write performance •Details in handout
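A minimal host-setup sketch of the RAID0 striping mentioned above, assuming two instance-store (ephemeral) devices at /dev/xvdb and /dev/xvdc; device names vary by instance type, and the full version lives in the handout. Nothing durable lives on this volume, since the logs are shipped to S3 before a host is recycled.

```shell
# Stripe the two ephemeral drives into a single RAID0 device for faster
# sequential writes, then mount it where the app writes its logs.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.ext4 /dev/md0
mkdir -p /mnt/logs
mount /dev/md0 /mnt/logs
```

RAID0 gives no redundancy; that is acceptable here precisely because the local drive holds nothing important.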
  36. The big data challenge: Collection •Writing to multiple files (vs. all the same data to one file) also improves disk performance •Be sure to monitor for low disk space •Example in the handout
  37. The big data challenge: Amazon S3
  38. The big data challenge: Amazon S3 gotchas •AWS Elastic Beanstalk already uses logrotate –Not tuned for extremely large data sets –Will run redundantly alongside your own configs unless removed/overwritten •Amazon S3 push availability is not 100%; you need retry logic –It is really good, but with, say, 100 hosts each pushing a log every 5 minutes, that’s ~29,000 pushes per day…
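The retry logic the slide calls for can be sketched generically. This is an illustrative wrapper, not the tn_s3_uploader gem itself: the Callable stands in for the actual S3 PUT.

```java
import java.util.concurrent.Callable;

// Bounded retries with exponential backoff around an upload that may
// fail transiently at high push volumes.
class RetryingUploader {
    static <T> T withRetries(Callable<T> op, int maxAttempts, long baseDelayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Exponential backoff: base, 2x base, 4x base, ...
                Thread.sleep(baseDelayMillis << attempt);
            }
        }
        throw last; // caller can alert and/or mark the host unhealthy
    }
}
```

At ~29,000 pushes a day even a 99.99% per-push success rate means a few failures daily, which is why the retry wrapper (and alerting on exhaustion) is not optional.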
  39. The big data challenge: Amazon S3 gotchas •Host names get reused; this confuses monitoring –We used to use the host name for logfiles; scaling up and down dozens of hosts per day, you get the same names over again –Instead, each host generates a UUID locally on startup and prefixes all files with that UUID –The monitoring process can then tell when a host missed a 5-minute window •Cron can crash; you need to monitor cron
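The UUID naming scheme above amounts to very little code. The key layout shown here is illustrative, not Thinknear's exact format:

```java
import java.util.UUID;

// One UUID per host, generated at process startup, prefixes every
// uploaded log file so names from a recycled hostname never collide
// and per-host gaps in the 5-minute windows are detectable.
class LogNaming {
    static final String HOST_ID = UUID.randomUUID().toString();

    // e.g. "<uuid>/requests-201411121205.log" -- one object per window.
    static String s3Key(String logType, String window) {
        return HOST_ID + "/" + logType + "-" + window + ".log";
    }
}
```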
  40. The big data challenge: Logrotate •tn_s3_uploader –Custom gem to reliably upload files to Amazon S3 –Will retry uploads that fail –Notifies Honeybadger of failures –Repeated failures to upload to Amazon S3 will cause the script to mark the host as unhealthy; if you can’t get the logs, better to shut down •Recently open sourced •Full logrotate setup detailed in handout
  41. The scalability challenge
  42. The scalability challenge: Network latency •We need consistently minimal network latency •Must respond to the exchange in <= 100 ms •We use AWS (obviously) •No network congestion issues
  43. The scalability challenge: Network latency •Must know where clients are –Effect of cross-country endpoints •Persistent connections •Timeouts from the ELB load balancer •Tuned on Tomcat as well (see later)
  44. The scalability challenge: ELB •Set the ELB idle timeout (via UI or ELB API) to clean up dangling connections –AWS recommends the ELB idle timeout be 1s higher than the application-layer idle connection (keep-alive) timeout –We run the opposite, with the ELB idle timeout 1s lower –This results in mystery 500s (500 responses not captured in any metrics), but a better overall error rate (measured as 200 responses / total requests) –Need to test yourself
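The idle timeout can also be set from `.ebextensions` rather than the console. A hedged sketch, assuming the `aws:elb:policies` namespace option is available on your platform version (the value 59 is illustrative, paired with a 60s app-server keep-alive):

```yaml
# .ebextensions sketch: set the ELB idle timeout in config rather than
# through the UI, so it survives environment rebuilds.
option_settings:
  aws:elb:policies:
    ConnectionSettingIdleTimeout: "59"
```

Whichever side you make shorter, keep the two timeouts deliberately offset so one side always closes first, and measure the resulting error rate yourself.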
  45. The scalability challenge: .ebextensions

      Resources:
        AWSEBLoadBalancer:
          Type: "AWS::ElasticLoadBalancing::LoadBalancer"
          Properties:
            ConnectionDrainingPolicy: { "Enabled": "true", "Timeout": "300" }
            Listeners: [ { "LoadBalancerPort": "80", "InstancePort": "8080", "Protocol": "HTTP", "InstanceProtocol": "HTTP" } ]
            AvailabilityZones: ["us-east-1X", "us-east-1Y"]
            CrossZone: "true"
            AccessLoggingPolicy: { "EmitInterval": "5", "Enabled": "true", "S3BucketName": "XXXXX", "S3BucketPrefix": { "Ref": "AWSEBEnvironmentName" } }

      •ConnectionDrainingPolicy for graceful disconnects –Grace period to finish any in-progress requests –Doesn’t wait the whole time; can be generous •Access logging policy –Great for diagnosing issues –Records per-request bytes, response codes, latency measures –Can trace/group by host
  46. The scalability challenge: App server •We are running on Tomcat •Tried and true; the company wouldn’t fail because of it •100% API-based web service –We do not serve any websites or content –None of our responses are cacheable
  47. The scalability challenge: OS tuning •Tomcat typically has between 1,000-1,400 open files –Established TCP connections –JARs –Piped processes •Tune –limits.conf –sysctl.conf
  48. The scalability challenge: OS tuning •Tomcat hit “too many open files” errors, so in limits.conf: –hard nofile 16384 –soft nofile 16384
  49. The scalability challenge: OS sysctl.conf •Long list, see handout; the most important: –net.core.somaxconn = 4096: maximum length of the kernel’s listen queue; effectively limited by Tomcat’s acceptCount, so it must be > acceptCount; set high for bursty traffic –net.ipv4.tcp_fin_timeout = 15: default is 60 secs; we had many connections in TIME_WAIT state; lowering it speeds the closing of connections and release of resources
  50. The scalability challenge: Bypass Apache •AWS Elastic Beanstalk runs Apache locally as a proxy –(Presumably, this lets AWS Elastic Beanstalk keep standard configs across multiple platforms) •We bypass Apache and point the load balancers directly at the application server –If you don’t bypass Apache, you must tune it –Bypass simply by pointing the load balancers at the app server (port 8080)
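The bypass is just an ELB listener change. A sketch of the relevant `.ebextensions` fragment, using the default `AWSEBLoadBalancer` resource name; values other than the ports are illustrative:

```yaml
# Point the ELB listener's InstancePort straight at Tomcat's 8080,
# skipping the on-host Apache proxy on port 80 entirely.
Resources:
  AWSEBLoadBalancer:
    Type: "AWS::ElasticLoadBalancing::LoadBalancer"
    Properties:
      Listeners:
        - { LoadBalancerPort: "80", InstancePort: "8080",
            Protocol: "HTTP", InstanceProtocol: "HTTP" }
```

With the proxy out of the path there is one less hop to tune and one less process to monitor per host.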
  51. The monitoring challenge
  52. The monitoring challenge: Thresholds •In high-volume, high-performance systems, average, min, and max are not the whole story •Ideally, use percentiles, e.g., the 99th •When we started, we used Amazon CloudWatch (which didn’t support percentiles)
  53. The monitoring challenge: Thresholds •As a crutch, we used threshold counts –We knew which latency thresholds were important –We counted requests exceeding those thresholds –From the counts you can estimate where the percentiles are
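The threshold-count crutch can be sketched as follows. The threshold values are illustrative; the point is that the fraction of requests under each threshold brackets the percentiles without storing every sample.

```java
import java.util.concurrent.atomic.LongAdder;

// Count requests over a few fixed latency thresholds; the fraction at or
// under each threshold bounds the percentiles (if 99% of requests are
// <= 100 ms, the 99th percentile is at most ~100 ms).
class LatencyThresholds {
    static final long[] THRESHOLDS_MS = {25, 50, 75, 100};
    final LongAdder total = new LongAdder();
    final LongAdder[] over = new LongAdder[THRESHOLDS_MS.length];

    LatencyThresholds() {
        for (int i = 0; i < over.length; i++) over[i] = new LongAdder();
    }

    void record(long latencyMs) {
        total.increment();
        for (int i = 0; i < THRESHOLDS_MS.length; i++)
            if (latencyMs > THRESHOLDS_MS[i]) over[i].increment();
    }

    // Fraction of requests at or under the i-th threshold.
    double fractionUnder(int i) {
        long t = total.sum();
        return t == 0 ? 1.0 : 1.0 - (double) over[i].sum() / t;
    }
}
```

Two LongAdders per threshold is all the per-host state required, which is what makes this cheap enough to run on every request at 150K+ requests per second.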
  54. The monitoring challenge: Librato
  55. The monitoring challenge: Learning point •If you push dozens of metrics per request from a system doing tens of thousands of requests per second to CloudWatch, you get a personal phone call –Actually, they may now throttle you, but they didn’t use to –We wrote custom client-side data aggregation code running on each host; kind of ugly, but it got the job done
  56. The monitoring challenge: Server side •Most metrics are collected server side •Ideal to collect client side, but that requires a close relationship with the client •Compare the max at the ELB vs. server side to measure internal network and platform latency •Set goals conservatively and compare notes with clients
  57. The monitoring challenge: Current config •High-volume application metrics –Approximate percentiles –More economical –Eliminated client-side batching; moved to collectd •CloudWatch: issues with high volume
  58. The monitoring challenge: collectd •collectd for host performance •Installed via an .ebextensions file •Example config in the handout
  59. The monitoring challenge: Tomcat •Via collectd plugins •Export JVM MBeans

      <MBean "garbage_collector">
        ObjectName "java.lang:type=GarbageCollector,*"
        InstancePrefix "gc-"
        InstanceFrom "name"
        ...config...
      </MBean>

  60. Please give us your feedback on this session. Complete session evaluations and earn re:Invent swag. http://bit.ly/awsevals LinkedIn: https://www.linkedin.com/company/thinknear Blog: http://engineering.thinknear.com Twitter: @softwaregravy @MikQuinlan
