2. Topics We’ll Cover Today
• Ways to use Auto Scaling
• Auto Scaling introduction
• Maintaining application response times and fleet utilization
• Handling cyclical demand, “weather events”
• Auto Scaling for 99.9% Uptime
• Single-instance groups
• Cost control and asymmetric scaling responses
• CloudFormation, custom scripts, and multiple inputs
• High availability, low latency & high resiliency
• Cassandra & Zookeeper
AWS
The Weather Channel
Nokia
Adobe
SwiftKey
3. Ways You Can Use Auto Scaling
Launch EC2 instances and groups from reusable templates
Scale up and down automatically as needed
Auto-replace instances and maintain EC2 capacity
4. Common Scenarios
• Prepare for a Big Launch: schedule a one-time scale out and flip to production (sketched below)
• Fit Capacity to Demand: follow daily, weekly, or monthly cycles
• Be Ready for Spikes: provision capacity dynamically by scaling on CPU, memory, request rate, queue depth, users, etc.
• Simplify Cost Allocation: auto-tag instances with cost center, project, version, stage
• Maintain Stable Capacity: auto-replace instances that fail ELB or EC2 checks
• Go Multi-AZ: auto-balance instances across multiple zones
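A minimal sketch of the "Prepare for a Big Launch" scenario using boto3 (the AWS SDK for Python, not part of the deck); the group name, capacities, and start time are hypothetical:

```python
import boto3
from datetime import datetime

autoscaling = boto3.client("autoscaling")

# One-time scheduled action (no Recurrence, so it runs exactly once):
# scale the hypothetical "prod-web" group out just before the launch window.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="prod-web",
    ScheduledActionName="big-launch-scale-out",
    StartTime=datetime(2014, 7, 10, 6, 0),
    MinSize=10,
    MaxSize=40,
    DesiredCapacity=20,
)
```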
5. Auto Scaling group
[Architecture diagram: a region with Availability Zone #1 and Availability Zone #2 behind Elastic Load Balancing (www.example.com). Each zone holds EC2 instances (web app server) in a security group, managed by an Auto Scaling group {min: 1, max: 1, desired: 1, plan: maintain}. A launch config defines the AMI, instance type, key pairs, security groups, block device mappings, etc. CloudWatch metrics (standard or custom) raise alarms that trigger an Auto Scaling policy (AdjustmentType, ASGName, Cooldown, MinAdjustmentStep, PolicyName, ScalingAdjustment); health state and custom health checks drive replacement actions. The scaling plan can be maintain, manual, schedule, or demand.]
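A minimal sketch of how these pieces fit together, written against boto3 (an assumption; the deck does not prescribe an SDK). The names, AMI ID, and CPU threshold are hypothetical, and the group is given a wider max than the slide's maintain-only {min: 1, max: 1, desired: 1} example so the policy has room to act:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Launch config: the reusable template (AMI, instance type, key pair,
# security groups, block device mappings, ...).
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-app-v1",
    ImageId="ami-12345678",
    InstanceType="m3.medium",
    KeyName="web-key",
    SecurityGroups=["web-sg"],
)

# Auto Scaling group spanning two Availability Zones behind an ELB.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-app",
    LaunchConfigurationName="web-app-v1",
    MinSize=1,
    MaxSize=4,
    DesiredCapacity=1,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)

# Scaling policy (AdjustmentType, ScalingAdjustment, Cooldown, ...) that a
# CloudWatch alarm can trigger.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-app",
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)

# CloudWatch alarm on a standard metric, wired to the policy.
cloudwatch.put_metric_alarm(
    AlarmName="web-app-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-app"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```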
6. What’s New in Auto Scaling
Better integration
• EC2 console support
• Scheduled scaling policies in
CloudFormation templates
• ELB connection draining
• Auto-assign public IPs in VPC
• Spot + Auto Scaling
More APIs
• Create groups based on running
instances
• Create launch configurations based
on running instances
• Attach running instances to a group
• Describe account limits for groups
and launch configs
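A minimal boto3 sketch of the new APIs listed above; the instance IDs and names are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Create a launch configuration modeled on a running instance.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="cloned-from-instance",
    InstanceId="i-0abc1234",
)

# Create a group based on a running instance: its AMI, type, etc.
# become the group's implicit launch configuration.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="from-instance",
    InstanceId="i-0abc1234",
    MinSize=1,
    MaxSize=4,
    DesiredCapacity=1,
)

# Attach an already-running instance to an existing group.
autoscaling.attach_instances(
    AutoScalingGroupName="from-instance",
    InstanceIds=["i-0def5678"],
)

# Describe account limits for groups and launch configurations.
print(autoscaling.describe_account_limits())
```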
9. The Weather Company
• Top 30 web property in the U.S.
• 2nd most viewed television channel in the U.S.
• 85% of U.S. airlines depend on our forecasts
• Major retailers base marketing spend and store displays on our forecasts
• 163 million unique visitors across TV and web
10. Wunderground Radar and Maps
• 100 million hits a day
• One billion data points per day
• Migrated wunderground.com's real-time radar mapping system to the AWS cloud
17. Before Migration – Traditional IT Model doesn't scale well / After Migration – Wunderground Radar App
[Charts, before migration: Server Count (110 servers), Avg. CPU Load, HTTP Response Latency (~6000 ms)]
[Charts, after migration: Server Count (from 110 to 170 instances), Avg. CPU Load, HTTP Response Latency (5–15 ms)]
25. Here.com Local Search Application
• Local Search app
• First customer-facing application on AWS
• Obvious need for uptime
26. Here.com Local Search Architecture
[Diagram: regions US-East-1, US-West-2, EU-West-1, and AP-Southeast-1. Within US-East-1, zones US-East-1a and US-East-1b each show Zookeeper1, Zookeeper2, and Zookeeper3 alongside a Frontend Group and Backend Groups.]
27. Here.com Local Search Architecture
[Same diagram as the previous slide, with the Zookeeper nodes called out]
Single-Instance Auto Scaling Groups (Zookeeper)
1. Auto-healing: instances auto-register in DNS via Route 53
2. Dynamic: Auto Scaling group names are used for cluster-node lookups (cluster1-zookeeper1)
3. Uses standard tools such as DNS instead of queries or Elastic IPs
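A minimal sketch of one such single-instance, auto-healing group with DNS registration, using boto3 and Route 53. The group, zone, hosted-zone ID, and record name are hypothetical, and this is not Nokia's actual tooling:

```python
import boto3

autoscaling = boto3.client("autoscaling")
route53 = boto3.client("route53")
ec2 = boto3.client("ec2")

# A min=max=desired=1 group per Zookeeper node: if the instance dies,
# Auto Scaling replaces it.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="cluster1-zookeeper1",
    LaunchConfigurationName="zookeeper-lc",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    AvailabilityZones=["us-east-1a"],
)

def register_in_dns(instance_id, hosted_zone_id="Z123EXAMPLE"):
    """Run on boot (e.g. from user data): the replacement instance
    re-registers itself so peers can always find cluster1-zookeeper1."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    ip = reservations[0]["Instances"][0]["PrivateIpAddress"]
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "cluster1-zookeeper1.example.internal.",
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip}],
                },
            }]
        },
    )
```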
29–35. Map Data on S3
[Diagram sequence: cluster1 in US-East-1a, coordinated by Zookeeper1. Two instances serve the old data set; a new data set (V2) is published to S3; new v2 instances come up alongside the old ones to serve it.]
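One way to realize the rolling data update pictured above is to swap the group's launch configuration and grow it so fresh V2 instances join the old ones; a hedged boto3 sketch (names and IDs hypothetical, not necessarily the Here.com team's exact mechanism):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Launch config whose baked AMI (or user data) points at the V2 data set on S3.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="cluster1-backend-v2",
    ImageId="ami-22334455",
    InstanceType="m3.large",
)

# Point the backend group at the new launch configuration; existing (old)
# instances keep running, but every new instance launches with V2.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="cluster1-backend",
    LaunchConfigurationName="cluster1-backend-v2",
)

# Grow the group so fresh V2 instances come up alongside the old ones;
# the old instances can be terminated once V2 is serving traffic.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="cluster1-backend",
    DesiredCapacity=4,
)
```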
36. Common Scenario: Blue-Green Deployments Using Auto Scaling
[Diagram: Load Balancing (ELB) in front of a Web Server Fleet (Amazon EC2) running v1.1 instances, backed by a Database Fleet (RDS or DB on EC2). An "Auto scaling" group brings up v1.2 instances alongside the v1.1 fleet. Settings shown: max instances, min instances, scaling trigger, custom metrics, upper threshold, lower threshold, increment by.]
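A minimal blue-green sketch along the lines of the diagram: a second Auto Scaling group for v1.2 registers with the same ELB, and the v1.1 group is drained once v1.2 is healthy. boto3 is assumed and all names are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Launch config for the new application version.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-v1.2",
    ImageId="ami-87654321",
    InstanceType="m3.medium",
)

# "Green" group for v1.2, registered with the same ELB as the v1.1 group.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-green-v1.2",
    LaunchConfigurationName="web-v1.2",
    MinSize=4,
    MaxSize=8,
    DesiredCapacity=4,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)

# Once v1.2 is healthy behind the ELB, drain the "blue" v1.1 group.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-blue-v1.1",
    MinSize=0,
    DesiredCapacity=0,
)
```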
37. Here.com Local Search Success
• Increased uptime to 99.9%
• Every instance with a detected health problem has been replaced by Auto Scaling with zero manual intervention
• Zookeeper setup has performed flawlessly
"We've been paranoid so it still pages us; it's beginning to feel silly."
41. Auto Scaling the Web Layer
Based on:
• Number of HTTP requests
• Average CPU load
• Network in/out
42. Auto Scaling the Web and Worker Layers
• Web layer – based on: number of HTTP requests, average CPU load, network in/out
• Worker layer – based on: SQS queue length
44. Cost Control
• Scheduled scaling: we analyzed our traffic and picked numbers
– scale up in the morning, scale down in the evening (sketched below)
• Policies for slow scale-down
• Stage environments: downscale everything to "min-size" daily (or more often)
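A minimal sketch of the scheduled morning/evening pattern above with boto3; the group names, cron expressions (UTC), and sizes are hypothetical, not Adobe's actual values:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale up every weekday morning.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web",
    ScheduledActionName="morning-scale-up",
    Recurrence="0 13 * * 1-5",
    MinSize=8,
    DesiredCapacity=12,
)

# Scale down in the evening.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web",
    ScheduledActionName="evening-scale-down",
    Recurrence="0 1 * * *",
    MinSize=2,
    DesiredCapacity=2,
)

# Stage environments: squeeze everything down to "min-size" daily.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="stage-web",
    ScheduledActionName="nightly-downscale",
    Recurrence="0 4 * * *",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
)
```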
46. How – Custom Metrics
. . .
Sat Oct 6 05:51:03 UTC 2012
Number of AZs: 4
Number of Web Servers: 16
Number of Healthy Web Servers: 16
ELB Request Count: 9523.0
Request Count Per Healthy Web Server: 595.1875
Network In Per Healthy Web Server: 51 MB
Network Out Per Healthy Web Server: 1 MB
CPU Per Healthy Web Server: 25.23875
Publishing Custom Metrics: InstanceRequestCount, HealthyWebServers,
InstanceNetworkIn, InstanceNetworkOut, InstanceCPUUtilization to namespace
WebServer in us-east-1
. . .
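Publishing metrics like the ones in the log above boils down to a single CloudWatch call; a minimal boto3 sketch with illustrative values (this is not Adobe's actual script):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative values; a real script would derive these from ELB/EC2 metrics
# divided by the current count of healthy web servers.
healthy_web_servers = 16
elb_request_count = 9523.0

cloudwatch.put_metric_data(
    Namespace="WebServer",
    MetricData=[
        {"MetricName": "HealthyWebServers",
         "Value": healthy_web_servers, "Unit": "Count"},
        {"MetricName": "InstanceRequestCount",
         "Value": elb_request_count / healthy_web_servers, "Unit": "Count"},
        {"MetricName": "InstanceCPUUtilization",
         "Value": 25.24, "Unit": "Percent"},
        {"MetricName": "InstanceNetworkIn",
         "Value": 51 * 1024 * 1024, "Unit": "Bytes"},
        {"MetricName": "InstanceNetworkOut",
         "Value": 1 * 1024 * 1024, "Unit": "Bytes"},
    ],
)
```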
47. How – Multi-Input Scaling
Scale up:
• +2 instances if more than 50 visible messages for >5 min
• +50% instances if more than 1,000 messages for >2 min
• + fixed 100 instances if more than 10,000 messages for >1 min
Scale down:
• -10 instances if 0 messages for more than 10 min
• -25% if 0 messages for more than 30 min
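Two of the scale-up rules above expressed as separate policies plus alarms (ChangeInCapacity and PercentChangeInCapacity); a hedged boto3 sketch with hypothetical queue and group names:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

QUEUE_DIMENSION = [{"Name": "QueueName", "Value": "work-queue"}]

def backlog_alarm(name, threshold, minutes, policy_arn):
    """Alarm when visible SQS messages exceed `threshold` for `minutes`."""
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=QUEUE_DIMENSION,
        Statistic="Average",
        Period=60,
        EvaluationPeriods=minutes,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy_arn],
    )

# +2 instances if more than 50 visible messages for >5 min.
small_step = autoscaling.put_scaling_policy(
    AutoScalingGroupName="workers",
    PolicyName="plus-two",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)
backlog_alarm("backlog-50", 50, 5, small_step["PolicyARN"])

# +50% instances if more than 1,000 messages for >2 min.
big_step = autoscaling.put_scaling_policy(
    AutoScalingGroupName="workers",
    PolicyName="plus-fifty-percent",
    AdjustmentType="PercentChangeInCapacity",
    ScalingAdjustment=50,
    Cooldown=300,
)
backlog_alarm("backlog-1000", 1000, 2, big_step["PolicyARN"])
```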
48. Adobe's Advice
• Use CloudFormation!
• Know your system and its thresholds
• Watch your scaling history
• Scaling up is easy; scaling down, not so much
• Mantra: scale up fast, scale down slow
50. Key Takeaways
• Maintaining application response times and fleet utilization
• Handling cyclical demand, “weather events”
• Auto Scaling for 99.9% Uptime
• Single-instance groups
• Cost control and asymmetric scaling responses
• CloudFormation, custom scripts, and multiple inputs
• High availability, low latency & high resiliency
• Cassandra & Zookeeper
The Weather Channel
Nokia
Adobe
SwiftKey
52. SwiftKey – Best known for smart apps
• SwiftKey Keyboard (Android): >30 million downloads so far; best-selling paid app on Google Play, 2012 & 2013; 4.7 star rating; Editors' Choice
• SwiftKey Note (iPhone and iPad): >1 million downloads in first month; Top 10 free app chart, US
53. Business issue
• High availability
• Low latency required – global low latency in future
• High resilience
55. Architecture notes
• Deployed using Chef
• Two of each server type per service, spread across AZs
• Using Redis as an accelerator, but may remove it and just use Cassandra
• Services use Zookeeper to find each other
56. Cassandra
• Cassandra currently deployed as 3 nodes – one per AZ in a region; can lose any node
• Has been tested running between regions, including writes
57. Zookeeper and Exhibitor
• We use Apache Zookeeper so servers can discover each other and share configuration
– Run as multiple instances
– Works as a shared namespace
– State stored in S3 via Exhibitor
• Netflix Exhibitor is a Java supervisor system for ZooKeeper. It provides a number of features:
– Watches a ZK instance and makes sure it is running
– Performs periodic backups
– Performs periodic cleaning of the ZK log directory
– A GUI explorer for viewing ZK nodes
– A rich REST API
(above taken directly from the Exhibitor webpage)
58. Other thoughts
• Make compute stateless and parallelised
– Can then scale
– Doesn't matter if a node fails
– Can cost-optimise – look at CloudWatch to see whether you are CPU-bound, I/O-bound, etc.
• Storage
– If possible, store state in S3 or a database that can shard globally, e.g. Cassandra
59. Other thoughts
• Look at Trusted Advisor
– Warns you about ELBs that are not spread across AZs
– Warns about missing snapshots
– Warns about under-utilised resources (i.e. spending too much)
• Use your AWS people:
– ask support questions
– talk to AWS Solution Architects
– get your account manager to provide a Reserved Instance report if you are on consolidated billing