AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
This mid-level technical session will focus on helping you to improve availability and lower costs by using Auto Scaling and Amazon EC2.

    Presentation Transcript

    • © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. More Nines for Your Dimes: Improving Availability and Lowering Costs using Auto Scaling April 30, 2014 “Fitz” Philip Fitzsimons Solutions Architecture
    • Topics We’ll Cover Today •  Ways to use Auto Scaling; Auto Scaling introduction (AWS) •  Maintaining application response times and fleet utilization; handling cyclical demand and “weather events” (The Weather Channel) •  Auto Scaling for 99.9% uptime; single-instance groups (Nokia) •  Cost control and asymmetric scaling responses; CloudFormation, custom scripts, and multiple inputs (Adobe) •  High availability, low latency & high resiliency; Cassandra & Zookeeper (SwiftKey)
    • Ways You Can Use Auto Scaling: launch EC2 instances and groups from reusable templates; scale up and down automatically as needed; auto-replace instances to maintain EC2 capacity.
    • Common Scenarios •  Prepare for a Big Launch: schedule a one-time scale out and flip to production •  Fit Capacity to Demand: follow daily, weekly, or monthly cycles •  Be Ready for Spikes: provision capacity dynamically by scaling on CPU, memory, request rate, queue depth, users, etc. •  Simplify Cost Allocation: auto-tag instances with cost center, project, version, stage •  Maintain Stable Capacity: auto-replace instances that fail ELB or EC2 health checks •  Go Multi-AZ: auto-balance instances across multiple zones
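To make the "reusable templates" idea concrete, here is a minimal sketch of a launch configuration plus an Auto Scaling group spanning two Availability Zones behind an ELB. It uses boto3 (a Python AWS SDK that postdates this talk), and every name, AMI, and ID is a placeholder rather than anything from the deck.

    # Minimal sketch (boto3; names, AMI, and security group are placeholders).
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

    # Reusable template: AMI, instance type, key pair, security groups.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="web-lc-v1",
        ImageId="ami-12345678",
        InstanceType="m3.medium",
        KeyName="my-keypair",
        SecurityGroups=["sg-12345678"],
    )

    # Group that maintains capacity, auto-replaces failed instances, and
    # balances across zones; ELB health checks drive instance replacement.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc-v1",
        MinSize=2,
        MaxSize=10,
        DesiredCapacity=2,
        AvailabilityZones=["eu-west-1a", "eu-west-1b"],
        LoadBalancerNames=["web-elb"],
        HealthCheckType="ELB",
        HealthCheckGracePeriod=300,
        # Auto-tag for cost allocation, as in the "Simplify Cost Allocation" scenario.
        Tags=[{"Key": "cost-center", "Value": "web", "PropagateAtLaunch": True}],
    )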
    • Anatomy of an Auto Scaling group: a group ({min: 1, max: 1, desired: 1, plan: maintain}) spans Availability Zone #1 and #2 inside a region, with EC2 instances (web/app servers) in a security group behind Elastic Load Balancing (www.example.com). A launch config defines the AMI, instance type, key pairs, security groups, block device mappings, etc. CloudWatch metrics (standard or custom) raise alarms that trigger an Auto Scaling policy (AdjustmentType, ASGName, Cooldown, MinAdjustmentStep, PolicyName, ScalingAdjustment). The scaling plan can be maintain, manual, schedule, or demand, and health state comes from EC2 or a custom health check.
    • What’s New in Auto Scaling Better integration •  EC2 console support •  Scheduled scaling policies in CloudFormation templates •  ELB connection draining •  Auto-assign public IPs in VPC •  Spot + Auto Scaling More APIs •  Create groups based on running instances •  Create launch configurations based on running instances •  Attach running instances to a group •  Describe account limits for groups and launch configs
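A rough illustration of the newer APIs listed above, again with boto3 and placeholder identifiers (the instance IDs and group names are invented):

    # Sketch of the "More APIs" bullets: groups from running instances,
    # attaching instances, and describing account limits.
    import boto3

    autoscaling = boto3.client("autoscaling")

    # Create a group modelled on an already-running instance.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="from-running-instance",
        InstanceId="i-0123456789abcdef0",   # placeholder: instance to copy settings from
        MinSize=1,
        MaxSize=3,
    )

    # Attach an existing running instance to a group.
    autoscaling.attach_instances(
        AutoScalingGroupName="web-asg",
        InstanceIds=["i-0123456789abcdef0"],
    )

    # Describe account limits for groups and launch configurations.
    limits = autoscaling.describe_account_limits()
    print(limits["MaxNumberOfAutoScalingGroups"],
          limits["MaxNumberOfLaunchConfigurations"])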
    • Why Auto Scaling? Scale Up | Control Costs | Improve Availability
    • The Weather Company •  Top 30 web property in the U.S. •  2nd most viewed television channel in the U.S. •  85% of U.S. airlines depend on our forecasts •  Major retailers base marketing spend and store displays on our forecasts •  163 million unique visitors across TV and web
    • Wunderground Radar and Maps: 100 million hits a day; one billion data points per day. Migrated wunderground.com’s real-time radar mapping system to the AWS cloud.
    • 30,000 Personal Weather Stations Source: Wunderground, Inc. 2013
    • Why Auto Scaling? Hurricane Sandy
    • Before Migration – Traditional IT Model doesn’t scale well: server count (110 servers), average CPU load, HTTP response latency (~6000 ms). After Migration – Wunderground Radar App: HTTP response latency (5–15 ms), server count (from 110 to 170 instances), average CPU load.
    • Radar on AWS Auto Scaling Architecture
    • Radar on AWS CPU Utilization
    • Radar on AWS Host Count
    • Why Auto Scaling? Scale Up | Control Costs | Improve Availability
    • Auto Scaling for 99.9% Uptime
    • Here.com Local Search Application •  Local Search app •  First customer-facing application on AWS •  Obvious need for uptime
    • Here.com Local Search Architecture: deployed across US-East-1, US-West-2, EU-West-1, and AP-Southeast-1. Within US-East-1, each Availability Zone (US-East-1a, US-East-1b) runs Zookeeper1–3, a Frontend Group, and Backend Groups.
    • Here.com Local Search Architecture – Single-Instance Auto Scaling Groups (Zookeeper): 1. Auto-healing: instances auto-register in DNS via Route 53. 2. Dynamic: Auto Scaling group names are used for cluster-node lookups (cluster1-zookeeper1). 3. Used standard tools such as DNS instead of queries or Elastic IPs.
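A hypothetical sketch of the single-instance "auto-healing" pattern described above: one min=1/max=1 group per Zookeeper node, plus a boot-time script on the instance that looks up its own group name and upserts a Route 53 record. This uses boto3; the hosted zone ID, domain, and launch configuration name are invented for illustration and are not Here.com's actual setup.

    # Part 1 – run once, off-instance: a single-instance group per Zookeeper node.
    import boto3
    from urllib.request import urlopen

    autoscaling = boto3.client("autoscaling")
    route53 = boto3.client("route53")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="cluster1-zookeeper1",
        LaunchConfigurationName="zookeeper-lc",     # hypothetical launch config
        MinSize=1, MaxSize=1, DesiredCapacity=1,
        AvailabilityZones=["us-east-1a"],
    )

    # Part 2 – run on the instance at boot (e.g. from user data): find my group
    # name, then upsert a DNS record so peers can always resolve this node.
    instance_id = urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id").read().decode()
    private_ip = urlopen(
        "http://169.254.169.254/latest/meta-data/local-ipv4").read().decode()
    group = autoscaling.describe_auto_scaling_instances(
        InstanceIds=[instance_id])["AutoScalingInstances"][0]["AutoScalingGroupName"]

    route53.change_resource_record_sets(
        HostedZoneId="Z1234567890",                 # placeholder hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": group + ".example.com",     # e.g. cluster1-zookeeper1.example.com
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": private_ip}],
            },
        }]},
    )

Because min, max, and desired capacity are all 1, Auto Scaling replaces a failed Zookeeper node automatically, and the DNS record follows the replacement instance, which is the auto-healing behaviour the slide describes.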
    • Auto Scaling when upgrading without any downtime
    • Map Data on S3 (upgrade walkthrough): cluster1 in US-East-1a, coordinated through Zookeeper1, starts out serving the old map data from old instances. New V2 data is published to S3, new V2 instances are launched alongside the old ones, and the old instances are retired once V2 is serving – all without downtime.
    • Common scenario: Blue/Green Deployments Using Auto Scaling. Load Balancing (ELB) in front of a Web Server Fleet (Amazon EC2) and a Database Fleet (RDS or DB on EC2); an Auto Scaling group of v1.1 instances runs alongside a group of v1.2 instances, governed by max/min instances, a scaling trigger on custom metrics, upper and lower thresholds, and an increment size.
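One possible way to express that blue/green scenario with the Auto Scaling API (boto3; the group, AMI, and ELB names are illustrative and not taken from the deck):

    # Hedged blue/green sketch: bring up a v1.2 group behind the same ELB,
    # then retire the v1.1 group once the new version is healthy.
    import boto3

    autoscaling = boto3.client("autoscaling")

    # New launch configuration baked with the v1.2 AMI.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="web-lc-v1.2",
        ImageId="ami-0fedcba987654321",        # placeholder v1.2 AMI
        InstanceType="m3.medium",
    )

    # Green group registers with the same load balancer as the blue group.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg-v1.2",
        LaunchConfigurationName="web-lc-v1.2",
        MinSize=4, MaxSize=8, DesiredCapacity=4,
        AvailabilityZones=["eu-west-1a", "eu-west-1b"],
        LoadBalancerNames=["web-elb"],
        HealthCheckType="ELB",
    )

    # Once v1.2 instances pass ELB health checks, drain and remove the old group.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="web-asg-v1.1",
        MinSize=0, MaxSize=0, DesiredCapacity=0,
    )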
    • Here.com Local Search Success •  Increased uptime to 99.9% •  Every instance with a detected health problem has been replaced by Auto Scaling with zero intervention •  Zookeeper setup has performed flawlessly. “We’ve been paranoid so it still pages us; it’s beginning to feel silly.”
    • Why Auto Scaling? Scale Up | Control Costs | Improve Availability
    • Adobe Creative Cloud Runs on AWS
    • Adobe Shared Cloud Architecture on AWS
    • Auto Scaling the Web Layer. Based on: number of HTTP requests, average CPU load, network in/out.
    • Auto Scaling the Web Layer / Auto Scaling the Worker Layer. Web layer based on: number of HTTP requests, average CPU load, network in/out. Worker layer based on: SQS queue length.
    • Scale up fast, scale down slow
    • Cost Control •  Scheduled scaling: we analyzed our traffic and picked numbers – scale up in the morning, scale down in the evening •  Policies for slow scale-down •  Stage environments: downscale everything to “min-size” daily (or more often)
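A small sketch of scheduled scaling along these lines (boto3; the cron expressions and capacities are made-up illustrations, not Adobe's analyzed numbers):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Scale up every morning...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-asg",
        ScheduledActionName="morning-scale-up",
        Recurrence="0 7 * * *",            # 07:00 UTC daily
        MinSize=10, DesiredCapacity=20,
    )

    # ...and back down every evening.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-asg",
        ScheduledActionName="evening-scale-down",
        Recurrence="0 20 * * *",           # 20:00 UTC daily
        MinSize=2, DesiredCapacity=2,
    )

    # Stage environments: drop everything to min-size once a day.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="stage-asg",
        ScheduledActionName="nightly-min-size",
        Recurrence="0 22 * * *",
        MinSize=1, MaxSize=1, DesiredCapacity=1,
    )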
    • CloudFormation + Auto Scaling:
      "ScaleUpPolicy" : {
        "Type" : "AWS::AutoScaling::ScalingPolicy",
        "Properties" : {
          "AdjustmentType" : "ChangeInCapacity",
          "AutoScalingGroupName" : { "Ref" : "WorkerAutoScalingGroup" },
          "Cooldown" : { "Ref" : "cooldown" },
          "ScalingAdjustment" : { "Ref" : "adjustup" }
        }
      },
      "WorkerAlarmScaleUp" : {
        "Type" : "AWS::CloudWatch::Alarm",
        "Properties" : {
          "EvaluationPeriods" : { "Ref" : "evalperiod" },
          "Statistic" : "Sum",
          "Threshold" : { "Ref" : "upthreshold" },
          "AlarmDescription" : "Scale up if the work load of transcode queue is high",
          "Period" : { "Ref" : "period" },
          "AlarmActions" : [ { "Ref" : "ScaleUpPolicy" }, { "Ref" : "scalingSNStopic" } ],
          "Namespace" : "AWS/SQS",
          "Dimensions" : [ { "Name" : "QueueName", "Value" : { "Ref" : "queuename" } } ],
          "ComparisonOperator" : "GreaterThanThreshold",
          "MetricName" : "ApproximateNumberOfMessagesVisible"
        }
      }
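For completeness, here is one way a full version of that template might be launched with boto3, assuming the snippet lives in a complete template file (here called worker-scaling.json, a hypothetical name) that declares matching Parameters; the parameter values below are illustrative only:

    import boto3

    cloudformation = boto3.client("cloudformation")

    with open("worker-scaling.json") as f:
        template_body = f.read()

    # Create the stack, passing the same parameter names the snippet references.
    cloudformation.create_stack(
        StackName="worker-auto-scaling",
        TemplateBody=template_body,
        Parameters=[
            {"ParameterKey": "cooldown", "ParameterValue": "300"},
            {"ParameterKey": "adjustup", "ParameterValue": "2"},
            {"ParameterKey": "evalperiod", "ParameterValue": "2"},
            {"ParameterKey": "upthreshold", "ParameterValue": "50"},
            {"ParameterKey": "period", "ParameterValue": "60"},
            {"ParameterKey": "queuename", "ParameterValue": "transcode-queue"},
            {"ParameterKey": "scalingSNStopic",
             "ParameterValue": "arn:aws:sns:us-east-1:123456789012:scaling-events"},
        ],
    )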
    • How – Custom Metrics (sample script output):
      Sat Oct 6 05:51:03 UTC 2012
      Number of AZs: 4
      Number of Web Servers: 16
      Number of Healthy Web Servers: 16
      ELB Request Count: 9523.0
      Request Count Per Healthy Web Server: 595.1875
      Network In Per Healthy Web Server: 51 MB
      Network Out Per Healthy Web Server: 1 MB
      CPU Per Healthy Web Server: 25.23875
      Publishing Custom Metrics: InstanceRequestCount, HealthyWebServers, InstanceNetworkIn, InstanceNetworkOut, InstanceCPUUtilization to namespace WebServer in us-east-1
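The publishing step at the end of that output could look roughly like this in boto3 (the values are copied from the sample output above; the deck does not show the script's actual implementation):

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Publish the derived per-instance metrics to the custom "WebServer" namespace.
    cloudwatch.put_metric_data(
        Namespace="WebServer",
        MetricData=[
            {"MetricName": "InstanceRequestCount", "Value": 595.1875, "Unit": "Count"},
            {"MetricName": "HealthyWebServers", "Value": 16, "Unit": "Count"},
            {"MetricName": "InstanceNetworkIn", "Value": 51, "Unit": "Megabytes"},
            {"MetricName": "InstanceNetworkOut", "Value": 1, "Unit": "Megabytes"},
            {"MetricName": "InstanceCPUUtilization", "Value": 25.23875, "Unit": "Percent"},
        ],
    )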
    • How – multi-input scaling. Scale up: +2 instances if more than 50 visible messages for >5 min; +50% instances if more than 1000 msg for >2 min; +100 instances (fixed) if more than 10000 msg for >1 min. Scale down: -10 instances if 0 msg for more than 10 min; -25% if 0 msg for more than 30 min.
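A hedged sketch of those asymmetric responses as simple scaling policies (boto3; the group name and cooldowns are assumptions, and each policy would be wired to its own SQS-depth alarm in the same way as the CloudFormation alarm shown earlier):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Scale up fast: a small absolute step for moderate backlogs...
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="worker-asg",
        PolicyName="scale-up-plus-2",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=2,
        Cooldown=120,
    )

    # ...and a percentage step when the backlog is large.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="worker-asg",
        PolicyName="scale-up-plus-50-percent",
        AdjustmentType="PercentChangeInCapacity",
        ScalingAdjustment=50,
        Cooldown=120,
    )

    # Scale down slow: bigger decrement, but only after long idle periods,
    # with a much longer cooldown.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="worker-asg",
        PolicyName="scale-down-minus-10",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=-10,
        Cooldown=600,
    )

The deliberately long scale-down cooldown is one way to implement the "scale up fast, scale down slow" mantra from the earlier slide.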
    • Adobe’s Advice •  Use CloudFormation! •  Know your system, thresholds •  Watch your scaling history •  Scaling up is easy, scaling down not so much •  Mantra: scale up fast; scale down slow
    • Why Auto Scaling? Scale Up | Control Costs | Improve Availability
    • Key Takeaways •  Maintaining application response times and fleet utilization; handling cyclical demand and “weather events” (The Weather Channel) •  Auto Scaling for 99.9% uptime; single-instance groups (Nokia) •  Cost control and asymmetric scaling responses; CloudFormation, custom scripts, and multiple inputs (Adobe) •  High availability, low latency & high resiliency; Cassandra & Zookeeper (SwiftKey)
    • © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Improving availability and lowering costs Dr Ian McDonald, SwiftKey April 30, 2014
    • SwiftKey – best known for smart apps: SwiftKey Keyboard (Android) and SwiftKey Note (iPhone and iPad). >1 million downloads in first month; >30 million downloads so far; best-selling paid app on Google Play, 2013 & 2012; 4.7 star rating; Editors’ Choice; Top 10 free app chart, US.
    • Business issue •  High availability •  Low latency required – global low latency in future •  Highly resilient
    • Architecture for a user service
    • Architecture notes •  Deployed using Chef •  Two of each server to make a service – spread around AZs •  Using Redis to accelerate but may remove and just use Cassandra •  Use Zookeeper for services to find each other
    • Cassandra •  Cassandra deployed at present as 3 nodes – 1 per AZ in a region. Can lose any node. •  Has been tested running between regions including writes
    • Zookeeper and Exhibitor •  We use Apache Zookeeper so servers can find each other and their configuration –  Run as multiple instances –  Works as a shared namespace –  State stored in S3 via Exhibitor •  Netflix Exhibitor is a Java supervisor system for ZooKeeper. It provides a number of features: –  Watches a ZK instance and makes sure it is running –  Performs periodic backups –  Performs periodic cleaning of the ZK log directory –  A GUI explorer for viewing ZK nodes –  A rich REST API (above taken directly from the Exhibitor webpage)
    • Other thoughts •  Make compute stateless and parallelised –  Can then scale –  Doesn’t matter if a node fails –  Can cost-optimise – look at CloudWatch to see whether you are CPU bound, IO bound, etc. •  Storage –  If possible, store state in S3 or a database that can shard globally, e.g. Cassandra
    • Other thoughts •  Look at Trusted Advisor –  Warns you about ELBs that are not spread across AZs –  Warns about snapshots not done –  Warns about underutilized resources (i.e. spending too much) •  Use your AWS people: –  ask support questions –  talk to AWS Solutions Architects –  get your account manager to give you a Reserved Instance (RI) report if on consolidated billing
    • AWS Cost Explorer
    • Find me on Twitter - @imcdnzl We’re hiring – visit our website
    • AWS Partner Trail Win a Kindle Fire •  10 in total •  Get a code from our sponsors
    • Please rate this session using the AWS Summits App and help us build better events
    • #AWSSummit @AWScloud @AWS_UKI