More Nines for Your Dimes: Improving Availability and Lowering Costs using Auto Scaling and Amazon EC2
Running your Amazon EC2 instances in Auto Scaling groups allows you to improve your application's availability right out of the box. Auto Scaling replaces impaired or unhealthy instances automatically to maintain your desired number of instances (even if that number is one). You can also use Auto Scaling to automate the provisioning of new instances and software configurations, as well as to track usage and costs by app, project, or cost center. Of course, you can also use Auto Scaling to adjust capacity as needed - on demand, on a schedule, or dynamically based on demand. In this session, we show you a few of the tools you can use to enable Auto Scaling for the applications you run on Amazon EC2.

Transcript of "More Nines for Your Dimes: Improving Availability and Lowering Costs using Auto Scaling and Amazon EC2 "

  1. 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. More Nines for Your Dimes: Improving Availability and Lowering Costs using Auto Scaling Derek Chiles, AWS Solutions Architecture (@derekchiles) July 10, 2014
  2. 2. Topics We’ll Cover Today • Auto Scaling introduction • Console demo • Maintaining application response times and fleet utilization • Handling cyclical demand, unexpected “weather events” • Auto Scaling for 99.9% Uptime • Single-instance groups • Cost control and asymmetric scaling responses • CloudFormation, custom scripts, and multiple inputs • Using performance testing to choose scaling strategies • Dealing with bouncy or steep curves AWS The Weather Channel Nokia Adobe Dreambox
  3. 3. Ways You Can Use Auto Scaling Launch EC2 instances and groups from reusable templates Scale up and down as needed automatically Auto-replace Instances and maintain EC2 capacity
  4. 4. Common Scenarios • Schedule a one-time scale out and flip to production • Follow daily, weekly, or monthly cycles • Provision capacity dynamically by scaling on CPU, memory, request rate, queue depth, users, etc. • Auto-tag instances with cost center, project, version, stage • Auto-replace instances that fail ELB or EC2 checks • Auto-balance instances across multiple zones. Prepare for a Big Launch Fit Capacity to Demand Be Ready for Spikes Simplify Cost Allocation Maintain Stable Capacity Go Multi-AZ
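The "prepare for a big launch" and "simplify cost allocation" scenarios above map directly onto the Auto Scaling API. A minimal sketch using boto3 (the group name, tag values, sizes, and timestamp are illustrative placeholders, not values from the talk):

    # Sketch: one-time scheduled scale-out plus cost-allocation tags (boto3; names are placeholders).
    import datetime
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Tag the group so every instance it launches carries cost center / stage for cost allocation.
    autoscaling.create_or_update_tags(
        Tags=[
            {"ResourceId": "web-asg", "ResourceType": "auto-scaling-group",
             "Key": "CostCenter", "Value": "marketing-site", "PropagateAtLaunch": True},
            {"ResourceId": "web-asg", "ResourceType": "auto-scaling-group",
             "Key": "Stage", "Value": "production", "PropagateAtLaunch": True},
        ]
    )

    # One-time scheduled action: scale out ahead of the launch, then flip traffic to production.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-asg",
        ScheduledActionName="big-launch-scale-out",
        StartTime=datetime.datetime(2014, 7, 15, 9, 0),  # UTC; placeholder launch time
        MinSize=10,
        MaxSize=40,
        DesiredCapacity=20,
    )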
  5. 5. Demo Learn the new terms: Launch Configuration Auto Scaling Group Scaling Policy Amazon CloudWatch Alarm Amazon SNS Notification
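The console demo can't be reproduced in text, but the five terms above correspond one-to-one to API calls. A minimal sketch with boto3 (the AMI ID, SNS topic ARN, ELB name, and all other names are placeholder assumptions):

    # Sketch: launch configuration, group, scaling policy, alarm, and SNS notification (boto3).
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # 1. Launch Configuration: the reusable template new instances are launched from.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="web-lc-v1",
        ImageId="ami-12345678",          # placeholder AMI
        InstanceType="m3.medium",
        SecurityGroups=["web-sg"],
    )

    # 2. Auto Scaling Group: maintains the desired number of instances across AZs.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc-v1",
        MinSize=2, MaxSize=10, DesiredCapacity=2,
        AvailabilityZones=["us-east-1a", "us-east-1b"],
        LoadBalancerNames=["web-elb"],
        HealthCheckType="ELB", HealthCheckGracePeriod=300,
    )

    # 3. Scaling Policy: what to do when triggered (here, add two instances).
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-asg",
        PolicyName="scale-up-by-2",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=2,
        Cooldown=300,
    )

    # 4. CloudWatch Alarm: decides *when* to invoke the policy.
    # 5. SNS Notification: the same alarm can also notify an operations topic.
    cloudwatch.put_metric_alarm(
        AlarmName="web-asg-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
        Statistic="Average", Period=60, EvaluationPeriods=5,
        Threshold=70.0, ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"],
                      "arn:aws:sns:us-east-1:123456789012:ops-notify"],  # placeholder topic
    )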
  6. 6. What’s New in Auto Scaling Better integration • EC2 console support • Scheduled scaling policies in CloudFormation templates • ELB connection draining • Auto-assign public IPs in VPC • Spot + Auto Scaling More APIs • Create groups based on running instances • Create launch configurations based on running instances • Attach running instances to a group • Describe account limits for groups and launch configs
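A short sketch of the "more APIs" items above: building a group from a running instance, attaching running instances to an existing group, and checking account limits (boto3; instance IDs and group names are placeholders):

    # Sketch of the newer Auto Scaling APIs mentioned above (boto3; IDs are placeholders).
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Create a group (and an implicit launch configuration) modeled on a running instance.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="from-running-instance",
        InstanceId="i-0abc1234",   # placeholder instance
        MinSize=1, MaxSize=4, DesiredCapacity=1,
    )

    # Attach an already-running instance to an existing group.
    autoscaling.attach_instances(
        AutoScalingGroupName="web-asg",
        InstanceIds=["i-0def5678"],
    )

    # Describe account limits for groups and launch configurations.
    limits = autoscaling.describe_account_limits()
    print(limits["MaxNumberOfAutoScalingGroups"],
          limits["MaxNumberOfLaunchConfigurations"])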
  7. 7. Why Auto Scaling? Scale Up Control CostsImprove Availability
  8. 8. Why Auto Scaling? Scale Up Control CostsImprove Availability
  9. 9. The Weather Company • Top 30 web property in the U.S. • 2nd most viewed television channel in the U.S. • 85% of U.S. airlines depend on our forecasts • Major retailers base marketing spend and store displays on our forecasts • 163 million unique visitors across TV and web
  10. 10. Wunderground Radar and Maps: 100 million hits a day, one billion data points per day. Migrated wunderground.com's real-time radar mapping system to the AWS cloud.
  11. 11. 30,000 Personal Weather Stations Source: Wunderground, Inc. 2013
  12. 12. Why Auto Scaling?
  13. 13. Why Auto Scaling?
  14. 14. Why Auto Scaling?
  15. 15. Why Auto Scaling?
  16. 16. Why Auto Scaling? Hurricane Sandy
  17. 17. Before migration – the traditional IT model doesn't scale well: server count fixed at 110 servers, HTTP response latency reaching ~6000 ms under load. After migration – Wunderground radar app on AWS: server count scales from 110 to 170 instances with demand, HTTP response latency holds at 5-15 ms. (Charts compare server count, average CPU load, and HTTP response latency before and after.)
  18. 18. Radar on AWS Auto Scaling Architecture
  19. 19. Radar on AWS CPU Utilization
  20. 20. Radar on AWS Host Count
  21. 21. Radar on AWS
  22. 22. Radar on AWS
  23. 23. Radar on AWS
  24. 24. Scale up to ensure consistent performance during high demand
  25. 25. Why Auto Scaling? Scale Up Control CostsImprove Availability
  26. 26. Auto Scaling for 99.9% Uptime
  27. 27. Here.com Local Search Application • Local Search app • First customer-facing application on AWS • Obvious need for uptime
  28. 28. Here.com Local Search Architecture US-East-1 US-West-2 EU-West-1 US-East-1a Zookeeper1 Zookeeper2 Zookeeper3 Frontend Group Backend Groups US-East-1b Zookeeper1 Zookeeper2 Zookeeper3 Frontend Group Backend Groups AP-Southeast-1
  29. 29. Here.com Local Search Architecture US-East-1 US-West-2 EU-West-1 US-East-1a Zookeeper1 Zookeeper2 Zookeeper3 Frontend Group Backend Groups US-East-1b Zookeeper1 Zookeeper2 Zookeeper3 Frontend Group Backend Groups AP-Southeast-1 Single-Instance Auto Scaling Groups (Zookeeper) 1. Auto-healing: Instances auto-register in DNS via Route53 2. Dynamic: Auto Scaling Group Names are used for cluster-node lookups (cluster1-zookeeper1) 3. Used Standard Tools such as DNS instead of Queries or Elastic IPs
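The talk doesn't show Here.com's registration script, but the idea on the slide above — each single-instance Zookeeper group registers itself in DNS at boot, keyed by its Auto Scaling group name, so cluster peers can always look up "cluster1-zookeeper1" — can be sketched roughly as below. The hosted zone ID, domain, and naming convention are assumptions for illustration:

    # Sketch: self-registration in Route 53 at boot, keyed by the Auto Scaling group name.
    # (boto3 + EC2 instance metadata; zone ID, domain, and group-to-hostname mapping are assumed.)
    import urllib.request
    import boto3

    METADATA = "http://169.254.169.254/latest/meta-data/"

    def metadata(path):
        return urllib.request.urlopen(METADATA + path).read().decode()

    instance_id = metadata("instance-id")
    private_ip = metadata("local-ipv4")

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    route53 = boto3.client("route53")

    # Find the Auto Scaling group this instance belongs to (e.g. "cluster1-zookeeper1").
    info = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    group_name = info["AutoScalingInstances"][0]["AutoScalingGroupName"]

    # UPSERT an A record so a replacement instance takes over the same DNS name automatically.
    route53.change_resource_record_sets(
        HostedZoneId="Z1EXAMPLE",  # placeholder hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": group_name + ".search.example.com.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": private_ip}],
            },
        }]},
    )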
  30. 30. Auto Scaling when upgrading without any downtime
  31. 31. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old
  32. 32. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2
  33. 33. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2
  34. 34. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2
  35. 35. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2
  36. 36. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2 New v2 New V2
  37. 37. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2 New v2 New V2
  38. 38. Common scenario: blue-green deployments using Auto Scaling. (Architecture diagram: load balancing with ELB in front of a web server fleet on Amazon EC2 and a database fleet on RDS or a database on EC2; a v1.1 Auto Scaling group is replaced by a v1.2 group, each defined by min/max instances, a scaling trigger, custom metrics, upper and lower thresholds, and an increment.)
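A rough sketch of the blue-green pattern in the diagram above: bring up a v1.2 group behind the same ELB, wait until the new instances pass ELB health checks, then drain the v1.1 group. All names and counts are placeholders and error handling is omitted:

    # Sketch: blue/green deployment with two Auto Scaling groups behind one ELB (boto3; names assumed).
    import time
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    elb = boto3.client("elb", region_name="us-east-1")

    # Green group: same ELB, new launch configuration baked with v1.2.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg-v1-2",
        LaunchConfigurationName="web-lc-v1-2",
        MinSize=4, MaxSize=8, DesiredCapacity=4,
        AvailabilityZones=["us-east-1a", "us-east-1b"],
        LoadBalancerNames=["web-elb"],
        HealthCheckType="ELB", HealthCheckGracePeriod=300,
    )

    # Wait until both fleets are InService on the ELB (4 old + 4 new in this sketch).
    while True:
        states = elb.describe_instance_health(LoadBalancerName="web-elb")["InstanceStates"]
        if sum(1 for s in states if s["State"] == "InService") >= 8:
            break
        time.sleep(30)

    # Drain the blue (v1.1) group; its instances deregister from the ELB and terminate.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="web-asg-v1-1",
        MinSize=0, MaxSize=0, DesiredCapacity=0,
    )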
  39. 39. Here.com Local Search Success • Increased uptime to 99.9% • All instances with detected health problems have been successfully replaced by Auto Scaling with zero intervention • Zookeeper setup has performed flawlessly. “We’ve been paranoid so it still pages us; it’s beginning to feel silly.”
  40. 40. Why Auto Scaling? Scale Up Control CostsImprove Availability
  41. 41. Adobe Creative Cloud Runs on AWS
  42. 42. Adobe Shared Cloud Architecture on AWS
  43. 43. Auto Scaling the Web Layer Based on Number of HTTP requests Average CPU load Network in/out
  44. 44. Auto Scaling the Web Layer Auto Scaling the Worker Layer Based on SQS queue length Based on Number of HTTP requests Average CPU load Network in/out
  45. 45. Scale up fast, scale down slow
  46. 46. Cost Control • Scheduled scaling: we analyzed our traffic and picked numbers. – scale up in the morning, scale down in the evening • Policies for slow scale down • Stage environments: downscale everything to “min-size” daily (or more)
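The recurring schedules described above can be expressed as scheduled actions. A sketch with boto3 — the cron expressions, sizes, and group names are illustrative guesses, not Adobe's actual values:

    # Sketch: recurring scheduled scaling, morning up / evening down, plus a nightly
    # stage-environment downscale to minimum size (boto3; all values illustrative).
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Production web tier: scale up every weekday morning, back down in the evening.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-asg",
        ScheduledActionName="weekday-morning-up",
        Recurrence="0 13 * * 1-5",    # 13:00 UTC, start of the business day
        MinSize=8, MaxSize=40, DesiredCapacity=16,
    )
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-asg",
        ScheduledActionName="weekday-evening-down",
        Recurrence="0 2 * * 2-6",     # 02:00 UTC, after traffic drops off
        MinSize=2, MaxSize=40, DesiredCapacity=4,
    )

    # Stage environment: shrink everything to min-size every night.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="stage-web-asg",
        ScheduledActionName="nightly-min-size",
        Recurrence="0 4 * * *",
        MinSize=1, MaxSize=1, DesiredCapacity=1,
    )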
  47. 47. CloudFormation + Auto Scaling
  "ScaleUpPolicy" : {
    "Type" : "AWS::AutoScaling::ScalingPolicy",
    "Properties" : {
      "AdjustmentType" : "ChangeInCapacity",
      "AutoScalingGroupName" : { "Ref" : "WorkerAutoScalingGroup" },
      "Cooldown" : { "Ref" : "cooldown" },
      "ScalingAdjustment" : { "Ref" : "adjustup" }
    }
  },
  "WorkerAlarmScaleUp" : {
    "Type" : "AWS::CloudWatch::Alarm",
    "Properties" : {
      "EvaluationPeriods" : { "Ref" : "evalperiod" },
      "Statistic" : "Sum",
      "Threshold" : { "Ref" : "upthreshold" },
      "AlarmDescription" : "Scale up if the work load of transcode queue is high",
      "Period" : { "Ref" : "period" },
      "AlarmActions" : [ { "Ref" : "ScaleUpPolicy" }, { "Ref" : "scalingSNStopic" } ],
      "Namespace" : "AWS/SQS",
      "Dimensions" : [ { "Name" : "QueueName", "Value" : { "Ref" : "queuename" } } ],
      "ComparisonOperator" : "GreaterThanThreshold",
      "MetricName" : "ApproximateNumberOfMessagesVisible"
    }
  }
  48. 48. How – Custom Metrics
  . . .
  Sat Oct 6 05:51:03 UTC 2012
  Number of AZs: 4
  Number of Web Servers: 16
  Number of Healthy Web Servers: 16
  ELB Request Count: 9523.0
  Request Count Per Healthy Web Server: 595.1875
  Network In Per Healthy Web Server: 51 MB
  Network Out Per Healthy Web Server: 1 MB
  CPU Per Healthy Web Server: 25.23875
  Publishing Custom Metrics: InstanceRequestCount, HealthyWebServers, InstanceNetworkIn, InstanceNetworkOut, InstanceCPUUtilization to namespace WebServer in us-east-1
  . . .
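The log output above comes from Adobe's own cron job, which is not shown in the talk. A rough approximation of what such a script might do — read ELB request counts from CloudWatch, divide by the number of healthy web servers, and publish per-instance custom metrics — could look like this (boto3; the ELB name is assumed, and the logic is inferred from the output above rather than taken from Adobe's code):

    # Sketch: derive per-healthy-server metrics and publish them as custom CloudWatch metrics.
    # (boto3; ELB name assumed, logic inferred from the log output above.)
    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    elb = boto3.client("elb", region_name="us-east-1")

    now = datetime.datetime.utcnow()
    start = now - datetime.timedelta(minutes=5)

    # How many web servers are currently healthy behind the ELB?
    states = elb.describe_instance_health(LoadBalancerName="web-elb")["InstanceStates"]
    healthy = sum(1 for s in states if s["State"] == "InService")

    # Total ELB request count over the last 5 minutes.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName="RequestCount",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "web-elb"}],
        StartTime=start, EndTime=now, Period=300, Statistics=["Sum"],
    )
    requests = stats["Datapoints"][0]["Sum"] if stats["Datapoints"] else 0.0

    # Publish derived metrics; scaling alarms can then watch the "WebServer" namespace.
    cloudwatch.put_metric_data(
        Namespace="WebServer",
        MetricData=[
            {"MetricName": "HealthyWebServers", "Value": float(healthy), "Unit": "Count"},
            {"MetricName": "InstanceRequestCount",
             "Value": requests / max(healthy, 1), "Unit": "Count"},
        ],
    )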
  49. 49. How – multi-input scaling
  Scale up:
  • +2 instances if more than 50 visible messages for >5 min
  • +50% instances if more than 1000 msg for >2 min
  • +100 instances (fixed) if more than 10000 msg for >1 min
  Scale down:
  • -10 instances if 0 msg for more than 10 min
  • -25% if 0 msg for more than 30 min
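Each rule above is simply a scaling policy paired with a CloudWatch alarm on the same group. A partial sketch covering the first scale-up rule and the first scale-down rule (boto3; the queue name, group name, and cooldowns are placeholders); the "+50%" and "-25%" rules would use AdjustmentType="PercentChangeInCapacity" in the same way:

    # Sketch: two of the multi-input rules above as policy + alarm pairs on one worker group.
    # (boto3; queue and group names are placeholders.)
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    def sqs_alarm(name, threshold, periods, comparison, policy_arn):
        # Alarm on the visible-message depth of the transcode queue.
        cloudwatch.put_metric_alarm(
            AlarmName=name,
            Namespace="AWS/SQS",
            MetricName="ApproximateNumberOfMessagesVisible",
            Dimensions=[{"Name": "QueueName", "Value": "transcode-queue"}],
            Statistic="Sum", Period=60, EvaluationPeriods=periods,
            Threshold=threshold, ComparisonOperator=comparison,
            AlarmActions=[policy_arn],
        )

    # Scale up: +2 instances if more than 50 visible messages for >5 minutes.
    up = autoscaling.put_scaling_policy(
        AutoScalingGroupName="worker-asg", PolicyName="up-plus-2",
        AdjustmentType="ChangeInCapacity", ScalingAdjustment=2, Cooldown=120,
    )
    sqs_alarm("transcode-backlog-50", 50, 5, "GreaterThanThreshold", up["PolicyARN"])

    # Scale down: -10 instances if the queue has been empty for more than 10 minutes.
    down = autoscaling.put_scaling_policy(
        AutoScalingGroupName="worker-asg", PolicyName="down-minus-10",
        AdjustmentType="ChangeInCapacity", ScalingAdjustment=-10, Cooldown=600,
    )
    sqs_alarm("transcode-idle-10min", 1, 10, "LessThanThreshold", down["PolicyARN"])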
  50. 50. Adobe’s Advice • Use CloudFormation! • Know your system, thresholds • Watch your scaling history • Scaling up is easy, scaling down not so much • Mantra: scale up fast; scale down slow
  51. 51. Scaling strategies we use Scaling with CloudWatch alarms Scheduled scaling (one-time, recurring)
  52. 52. A little background on our application • Ruby on Rails • Unicorn • We teach kids math!
  53. 53. A workload well suited for auto scaling
  54. 54. Scaling with CloudWatch alarms
  55. 55. Performance test to get a baseline • Discover the ideal number of worker processes per server – Too few and resources go unused – Too many and performance suffers under load • Obtain the maximum load sustainable per server – Our performance test measures the number of concurrent users • Find the chokepoint – For us, this was CPU utilization
  56. 56. Performance testing
  57. 57. Identify the breaking point Breaking point was at about 400 users per server
  58. 58. Our first method to find scale points • Provision a static amount of servers that we know can handle peak load • Adjust scale up and scale down alarms based on observed highs and lows • This worked, but was super inefficient, both in time and money spent
  59. 59. Let’s do some math – identify variables Independent • Concurrent users Dependent • CPU utilization • Memory utilization • Disk I/O • Network I/O
  60. 60. Let’s do some math – find the slope • Adding about 1600 users per hour • Which is about 27 per minute • We know that we can handle a max of about 400 users per server at 80% CPU usage • Which is about 0.2% CPU usage per user
  61. 61. Let’s do some math – when to scale? • We know (from other testing) that it takes us about 5 minutes for a new node to come online • We’re adding 27 users per minute • Which means we need to start spinning up new nodes when we’re about 135 users ( 27 x 5 ) per node short of max • Which is at about 53% utilization: (80% - (0.2% * 135))
  62. 62. How much to scale up by? • The lowest we can scale up by is 1 node per AZ, otherwise we would be unbalanced • For us, this is an extra 800 users of capacity in five minutes, more than enough to keep up with our rate of adding 1600 users per hour • Adding 800 users of capacity every five minutes, we could support 9600 additional users per hour
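The arithmetic from the last three slides, written out in plain Python. The 80% CPU target and 400-users-per-server breaking point come from the performance tests described above; the two-AZ figure is implied by 800 = 2 × 400, and everything else is derived:

    # The scale-point arithmetic from slides 60-62, written out.
    users_per_hour = 1600.0
    users_per_minute = users_per_hour / 60            # ~27 users per minute
    max_users_per_server = 400                        # breaking point from perf testing
    cpu_at_max = 80.0                                 # % CPU at that load
    cpu_per_user = cpu_at_max / max_users_per_server  # 0.2% CPU per user

    boot_minutes = 5                                  # time for a new node to come online
    headroom_users = users_per_minute * boot_minutes  # ~135 users of lead time
    scale_up_cpu = cpu_at_max - cpu_per_user * headroom_users
    print(f"Scale up at about {scale_up_cpu:.0f}% CPU")   # ~53%

    # One node per AZ (two AZs) adds 800 users of capacity every 5 minutes,
    # i.e. up to 9600 extra users per hour -- well ahead of the 1600/hour growth rate.
    added_capacity_per_hour = 2 * max_users_per_server * (60 / boot_minutes)
    print(f"Capacity added per hour if scaling every cycle: {added_capacity_per_hour:.0f} users")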
  63. 63. Evaluate your predictions • In the real world, we’ve inched up from scaling at 53% • Our perf test is a little harsher than the real world • Numbers derived from the perf test are only as accurate as the simulation of traffic in your perf test
  64. 64. Scheduled scaling
  65. 65. Acceleration in load is not constant Request count for a 24 hour period
  66. 66. We can’t use one size fits all • Scale too aggressively – Overprovisioning: increases cost – Bounciness: we add more than we need and have to partially scale back shortly after scaling up, which increases cost • Scale too timidly – Poor performance – Outages due to lack of capacity
  67. 67. Putting it all together
  68. 68. The opportunity cost of NOT scaling • Our usage curve from 3/20 • Low of about 5 concurrent users • High of about 10,000 concurrent users
  69. 69. The opportunity cost of NOT scaling • No autoscaling • 672 instance hours • $302.40 at on-demand prices
  70. 70. The opportunity cost of NOT scaling • Autoscaling four times per day • 360 instance hours • $162 at on-demand prices • 46% savings vs no autoscaling
  71. 71. The opportunity cost of NOT scaling • Autoscaling as needed, twelve times per day • 272 instance hours • $122.40 at on-demand prices • 24% savings vs scaling 4 times per day • 60% savings vs no autoscaling
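The three cost figures above are consistent with a flat on-demand rate of $0.45 per instance-hour ($302.40 / 672); the instance type itself is not stated in the talk. A small check of the numbers and savings percentages:

    # Checking the instance-hour and savings figures from slides 69-71.
    rate = 302.40 / 672            # implied on-demand rate: $0.45 per instance-hour
    scenarios = {
        "no autoscaling": 672,
        "scheduled, 4x per day": 360,
        "as needed, 12x per day": 272,
    }
    baseline = scenarios["no autoscaling"]
    for name, hours in scenarios.items():
        cost = hours * rate
        savings = 100 * (1 - hours / baseline)
        print(f"{name:>24}: {hours} instance-hours, ${cost:.2f}/day, {savings:.0f}% vs baseline")
    # -> $302.40, $162.00 (46% savings), $122.40 (60% savings);
    #    272 vs 360 hours is the additional ~24% saved by scaling as needed.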
  72. 72. The opportunity cost of NOT scaling $302/day $162/day $122/day
  73. 73. Demand curve hugs the usage curve…
  74. 74. …and a (mostly) flat response curve
  75. 75. “Auto Scaling saves us a lot of money; with a little bit of math, flexibility of AWS allows us to further save by aligning our demand curve with usage curve.” -- Dreambox
  76. 76. Why Auto Scaling? Scale Up Control CostsImprove Availability
  77. 77. Key Takeaways • Maintaining application response times and fleet utilization • Scaling up and handling unexpected “weather events” • Auto Scaling for 99.9% Uptime • Single-instance groups • Cost control and asymmetric scaling responses • CloudFormation, custom scripts, and multiple inputs • Using performance testing to choose scaling strategies • Dealing with bouncy or steep curves The Weather Channel Nokia Adobe Dreambox
  78. 78. Thank You! Derek Chiles derekch@amazon.com @derekchiles