AWS Summit London 2014 | Improving Availability and Lowering Costs (300)

This mid-level technical session will focus on helping you to improve availability and lower costs by using Auto Scaling and Amazon EC2.

Transcript of "AWS Summit London 2014 | Improving Availability and Lowering Costs (300)"

  1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. More Nines for Your Dimes: Improving Availability and Lowering Costs using Auto Scaling April 30, 2014 “Fitz” Philip Fitzsimons Solutions Architecture
  2. Topics We’ll Cover Today •  Ways to use Auto Scaling •  Auto Scaling introduction •  Maintaining application response times and fleet utilization •  Handling cyclical demand, “weather events” •  Auto Scaling for 99.9% Uptime •  Single-instance groups •  Cost control and asymmetric scaling responses •  CloudFormation, custom scripts, and multiple inputs •  High availability, low latency & high resiliency •  Cassandra & Zookeeper AWS The Weather Channel Nokia Adobe SwiftKey
  3. Ways You Can Use Auto Scaling Launch EC2 instances and groups from reusable templates Scale up and down as needed automatically Auto-replace instances and maintain EC2 capacity
  4. Common Scenarios •  Schedule a one-time scale out and flip to production •  Follow daily, weekly, or monthly cycles •  Provision capacity dynamically by scaling on CPU, memory, request rate, queue depth, users, etc. •  Auto-tag instances with cost center, project, version, stage •  Auto-replace instances that fail ELB or EC2 checks •  Auto-balance instances across multiple zones. Scenario labels: Prepare for a Big Launch, Fit Capacity to Demand, Be Ready for Spikes, Simplify Cost Allocation, Maintain Stable Capacity, Go Multi-AZ
  5. Auto Scaling group {min: 1, max: 1, desired: 1, plan: maintain}. Diagram: a region with Availability Zone #1 and #2, a security group, an EC2 instance running a web app server, Elastic Load Balancing for www.example.com, and CloudWatch metrics (standard or custom) whose alarm triggers scaling. Launch Config: AMI, instance type, key pairs, security groups, block device mappings, etc. Auto Scaling Policy: AdjustmentType, ASGName, Cooldown, MinAdjustmentStep, PolicyName, ScalingAdjustment. Health: action, health state, custom health check. Scaling plan: maintain, manual, schedule, demand.
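The {min: 1, max: 1, desired: 1, plan: maintain} group on this slide is the simplest case: the maintain plan just terminates unhealthy instances and launches replacements until the fleet is back at the desired size. A toy Python sketch of that reconcile loop (class and field names are hypothetical, not the AWS API):

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)

@dataclass
class Instance:
    instance_id: str
    healthy: bool = True

@dataclass
class AutoScalingGroup:
    """Toy model of an Auto Scaling group with plan: maintain."""
    min_size: int
    max_size: int
    desired: int
    instances: list = field(default_factory=list)

    def reconcile(self):
        # Clamp desired capacity to [min_size, max_size], as the service does.
        target = max(self.min_size, min(self.max_size, self.desired))
        # Auto-replace: drop instances that failed their health checks...
        self.instances = [i for i in self.instances if i.healthy]
        # ...then launch fresh instances until capacity is restored.
        while len(self.instances) < target:
            self.instances.append(Instance(f"i-{next(_ids):08d}"))

group = AutoScalingGroup(min_size=1, max_size=1, desired=1)
group.reconcile()                   # initial launch brings the group to 1
first = group.instances[0].instance_id
group.instances[0].healthy = False  # the instance fails its health check
group.reconcile()                   # plan "maintain" replaces it
```

This is the behaviour the single-instance Zookeeper groups later in the talk rely on: even a one-instance group gets auto-healing for free.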
  6. What’s New in Auto Scaling Better integration •  EC2 console support •  Scheduled scaling policies in CloudFormation templates •  ELB connection draining •  Auto-assign public IPs in VPC •  Spot + Auto Scaling More APIs •  Create groups based on running instances •  Create launch configurations based on running instances •  Attach running instances to a group •  Describe account limits for groups and launch configs
  7. Why Auto Scaling? Scale Up, Control Costs, Improve Availability
  8. Why Auto Scaling? Scale Up, Control Costs, Improve Availability
  9. The Weather Company •  Top 30 web property in the U.S. •  2nd most viewed television channel in the U.S. •  85% of U.S. airlines depend on our forecasts •  Major retailers base marketing spend and store displays on our forecasts •  163 million unique visitors across TV and web
  10. Wunderground Radar and Maps: 100 million hits a day, one billion data points per day. Migrated wunderground.com’s real-time radar mapping system to the AWS cloud.
  11. 30,000 Personal Weather Stations. Source: Wunderground, Inc. 2013
  12. Why Auto Scaling?
  13. Why Auto Scaling?
  14. Why Auto Scaling?
  15. Why Auto Scaling?
  16. Why Auto Scaling? Hurricane Sandy
  17. Before migration – traditional IT model doesn’t scale well: server count (110 servers), avg. CPU load, HTTP response latency (~6000 ms). After migration – Wunderground radar app: HTTP response latency (5–15 ms), server count (from 110 to 170 instances), avg. CPU load.
  18. Radar on AWS Auto Scaling Architecture
  19. Radar on AWS CPU Utilization
  20. Radar on AWS Host Count
  21. Radar on AWS
  22. Radar on AWS
  23. Why Auto Scaling? Scale Up, Control Costs, Improve Availability
  24. Auto Scaling for 99.9% Uptime
  25. Here.com Local Search Application •  Local Search app •  First customer-facing application on AWS •  Obvious need for uptime
  26. Here.com Local Search Architecture. Diagram: regions US-East-1, US-West-2, EU-West-1, AP-Southeast-1; within US-East-1a and US-East-1b, each zone runs Zookeeper1, Zookeeper2, Zookeeper3, a Frontend Group, and Backend Groups.
  27. Here.com Local Search Architecture (same diagram). Single-Instance Auto Scaling Groups (Zookeeper): 1. Auto-healing: instances auto-register in DNS via Route53. 2. Dynamic: Auto Scaling group names are used for cluster-node lookups (cluster1-zookeeper1). 3. Used standard tools such as DNS instead of queries or Elastic IPs.
  28. Auto Scaling when upgrading without any downtime
  29. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old
  30. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2
  31. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2
  32. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2
  33. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2
  34. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2 New v2 New V2
  35. Map Data on S3 US-East-1a Zookeeper1 cluster1 old old New Data V2 New v2 New V2
  36. “Auto scaling” Web Server Fleet (Amazon EC2), Database Fleet (RDS or DB on EC2), Load Balancing (ELB). Fleet mixes v1.1 and v1.2 instances. Auto Scaling settings: max instances, min instances, scaling trigger, custom metrics, upper threshold, lower threshold, increment by. Common scenario: blue-green deployments using Auto Scaling.
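The blue-green scenario on this slide amounts to shifting desired capacity between two groups behind the same load balancer while keeping the total fleet size constant. A hypothetical sketch of that capacity schedule (not tooling from the talk):

```python
def blue_green_steps(total, step):
    """Yield (blue_desired, green_desired) pairs that move capacity from
    the old (blue) fleet to the new (green) fleet in increments, keeping
    the combined fleet size constant throughout the cutover."""
    green = 0
    while green < total:
        green = min(total, green + step)
        yield (total - green, green)

# Cut a 4-instance fleet over from v1.1 to v1.2 in steps of 2.
steps = list(blue_green_steps(total=4, step=2))
```

Each pair would become the desired-capacity setting of the two Auto Scaling groups; rolling back is just replaying the schedule in reverse.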
  37. Here.com Local Search Success •  Increased uptime to 99.9% •  All instances with detected health problems have been successfully replaced by Auto Scaling with zero intervention •  Zookeeper setup has performed flawlessly “We’ve been paranoid so it still pages us; it’s beginning to feel silly.”
  38. Why Auto Scaling? Scale Up, Control Costs, Improve Availability
  39. Adobe Creative Cloud Runs on AWS
  40. Adobe Shared Cloud Architecture on AWS
  41. Auto Scaling the Web Layer. Based on: number of HTTP requests, average CPU load, network in/out.
  42. Auto Scaling the Web Layer / Auto Scaling the Worker Layer. Worker layer based on: SQS queue length. Web layer based on: number of HTTP requests, average CPU load, network in/out.
  43. Scale up fast, scale down slow
  44. Cost Control •  Scheduled scaling: we analyzed our traffic and picked numbers –  scale up in the morning, scale down in the evening •  Policies for slow scale down •  Stage environments: downscale everything to “min-size” daily (or more often)
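Adobe’s scheduled-scaling point ("we analyzed our traffic and picked numbers") reduces to a time-of-day capacity function. A minimal sketch, with made-up numbers standing in for the ones a traffic analysis would give you:

```python
def desired_capacity(hour, peak=20, off_peak=4, morning=7, evening=20):
    """Scheduled scaling sketch: scale up in the morning, scale down in
    the evening. All numbers are illustrative -- the slide's point is
    that you pick them from your own traffic analysis."""
    return peak if morning <= hour < evening else off_peak

# Business hours run at full capacity; nights drop to a cheap floor.
daytime = desired_capacity(9)    # mid-morning
night = desired_capacity(23)     # late evening
```

In AWS this maps onto scheduled actions on the Auto Scaling group (one "scale up" action at the morning hour, one "scale down" at the evening hour), with slow-scale-down policies covering the unscheduled dips.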
  45. CloudFormation + Auto Scaling "ScaleUpPolicy" : { "Type" : "AWS::AutoScaling::ScalingPolicy", "Properties" : { "AdjustmentType" : "ChangeInCapacity", "AutoScalingGroupName" : { "Ref" : "WorkerAutoScalingGroup" }, "Cooldown" : {"Ref": "cooldown"}, "ScalingAdjustment" : { "Ref" : "adjustup" } } }, "WorkerAlarmScaleUp": { "Type": "AWS::CloudWatch::Alarm", "Properties": { "EvaluationPeriods": {"Ref" : "evalperiod"}, "Statistic": "Sum", "Threshold": {"Ref" : "upthreshold"}, "AlarmDescription": "Scale up if the work load of transcode queue is high", "Period": {"Ref" : "period"}, "AlarmActions": [ { "Ref": "ScaleUpPolicy" }, { "Ref" : "scalingSNStopic" } ], "Namespace": "AWS/SQS", "Dimensions": [ { "Name": "QueueName", "Value": {"Ref" : "queuename" }}], "ComparisonOperator": "GreaterThanThreshold", "MetricName": "ApproximateNumberOfMessagesVisible" } }
  46. How – Custom Metrics . . . Sat Oct 6 05:51:03 UTC 2012 Number of AZs: 4 Number of Web Servers: 16 Number of Healthy Web Servers: 16 ELB Request Count: 9523.0 Request Count Per Healthy Web Server: 595.1875 Network In Per Healthy Web Server: 51 MB Network Out Per Healthy Web Server: 1 MB CPU Per Healthy Web Server: 25.23875 Publishing Custom Metrics: InstanceRequestCount, HealthyWebServers, InstanceNetworkIn, InstanceNetworkOut, InstanceCPUUtilization to namespace WebServer in us-east-1 . . .
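The per-healthy-server figures in this log come from dividing a fleet-wide aggregate by the healthy-host count: 9523.0 ELB requests over 16 healthy servers gives the 595.1875 shown. A sketch of that computation (the step that publishes the result to CloudWatch is omitted):

```python
def per_healthy_server(fleet_total, healthy_servers):
    """Divide a fleet-wide aggregate (e.g. ELB request count) by the
    number of healthy web servers, yielding the per-instance custom
    metric that the script above publishes to CloudWatch."""
    if healthy_servers == 0:
        raise ValueError("no healthy servers to divide by")
    return fleet_total / healthy_servers

# Matches the sample log: 9523.0 requests across 16 healthy servers.
request_count_per_server = per_healthy_server(9523.0, 16)
```

Normalising by healthy hosts (rather than total hosts) is what makes the metric usable as a scaling trigger: it reflects the load each working instance actually carries.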
  47. How – multi-input scaling. Scale up: +2 instances if more than 50 visible messages for >5 min; +50% instances if more than 1000 msg for >2 min; + fixed 100 instances if more than 10000 msg for >1 min. Scale down: -10 instances if 0 msg for more than 10 min; -25% if 0 msg for more than 30 min.
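The scale-up rules above read as an ordered policy table, most aggressive rule first. A sketch of that evaluation in plain Python (in AWS these would be separate CloudWatch alarms, each attached to its own scaling policy, rather than one function):

```python
def scale_up_adjustment(visible_msgs, minutes, current_instances):
    """Evaluate the slide's multi-input scale-up rules, most aggressive
    first, and return how many instances to add."""
    if visible_msgs > 10000 and minutes > 1:
        return 100                        # fixed jump straight to a big fleet
    if visible_msgs > 1000 and minutes > 2:
        return current_instances // 2     # +50% of current capacity
    if visible_msgs > 50 and minutes > 5:
        return 2                          # small nudge for a mild backlog
    return 0                              # no alarm firing
```

The asymmetry with the scale-down side (-10 instances after 10 idle minutes, -25% after 30) is the "scale up fast, scale down slow" mantra in concrete numbers.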
  48. Adobe’s Advice •  Use CloudFormation! •  Know your system, thresholds •  Watch your scaling history •  Scaling up is easy, scaling down not so much •  Mantra: scale up fast; scale down slow
  49. Why Auto Scaling? Scale Up, Control Costs, Improve Availability
  50. Key Takeaways •  Maintaining application response times and fleet utilization •  Handling cyclical demand, “weather events” •  Auto Scaling for 99.9% Uptime •  Single-instance groups •  Cost control and asymmetric scaling responses •  CloudFormation, custom scripts, and multiple inputs •  High availability, low latency & high resiliency •  Cassandra & Zookeeper The Weather Channel Nokia Adobe SwiftKey
  51. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Improving availability and lowering costs Dr Ian McDonald, SwiftKey April 30, 2014
  52. SwiftKey – best known for smart apps: SwiftKey Note (iPhone and iPad) and SwiftKey Keyboard (Android). >1 million downloads in first month; >30 million downloads so far. Best-selling paid app on Google Play, 2013 & 2012. 4.7 star rating. Editors’ Choice. Top 10 free app chart, US.
  53. Business issue •  High availability •  Low latency required – global low latency in future •  Highly resilient
  54. Architecture for a user service
  55. Architecture notes •  Deployed using Chef •  Two of each server to make a service – spread around AZs •  Using Redis to accelerate, but may remove it and just use Cassandra •  Use Zookeeper for services to find each other
  56. Cassandra •  Cassandra deployed at present as 3 nodes – 1 per AZ in a region. Can lose any node. •  Has been tested running between regions, including writes
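"Can lose any node" follows from quorum arithmetic: with 3 nodes each holding a replica, quorum operations need 2 responses, so any single node can fail without losing availability. A quick sketch of that arithmetic (assuming the replication factor equals the node count, as on the slide):

```python
def quorum(replication_factor):
    """Cassandra-style quorum: a majority of replicas must respond."""
    return replication_factor // 2 + 1

def tolerable_failures(nodes, replication_factor):
    """Nodes that can fail while QUORUM reads/writes still succeed,
    assuming every node holds a replica (nodes == RF, as on the slide)."""
    assert nodes == replication_factor
    return replication_factor - quorum(replication_factor)
```

The same arithmetic says why 3 is the practical minimum: a 2-node cluster with RF 2 has a quorum of 2 and tolerates no failures at all.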
  57. Zookeeper and Exhibitor •  We use Apache Zookeeper so servers can find each other and share configuration –  Run as multiple instances –  Works as a shared namespace –  State stored in S3 via Exhibitor •  Netflix Exhibitor is a Java supervisor system for ZooKeeper. It provides a number of features: –  Watches a ZK instance and makes sure it is running –  Performs periodic backups –  Performs periodic cleaning of ZK log directory –  A GUI explorer for viewing ZK nodes –  A rich REST API (above taken directly from Exhibitor webpage)
  58. Other thoughts •  Make compute stateless and parallelised –  Can then scale –  Doesn’t matter if a node fails –  Can cost optimise – look at CloudWatch to see whether you are CPU bound, IO bound, etc. •  Storage –  If possible, store state in S3 or a database that can shard globally, e.g. Cassandra
  59. Other thoughts •  Look at Trusted Advisor –  Warns you about ELBs that are not spread across AZs –  Warns about snapshots not taken –  Warns about under-utilised resources (i.e. spending too much) •  Use your AWS people: –  ask support questions –  talk to AWS Solution Architects –  get your account manager to give an RI report if on consolidated billing
  60. AWS Cost Explorer
  61. Find me on Twitter - @imcdnzl We’re hiring – visit our website
  62. AWS Partner Trail Win a Kindle Fire •  10 in total •  Get a code from our sponsors
  63. Please rate this session using the AWS Summits App and help us build better events
  64. #AWSSummit @AWScloud @AWS_UKI
