2. Topics We’ll Cover Today
• Ways to use Auto Scaling
• Auto Scaling introduction
• Maintaining application response times and fleet utilization
• Handling cyclical demand, “weather events”
• Auto Scaling for 99.9% Uptime
• Single-instance groups
• Cost control and asymmetric scaling responses
• CloudFormation, custom scripts, and multiple inputs
• High availability, low latency & high resiliency
• Cassandra & Zookeeper
AWS
The Weather Channel
Nokia
Adobe
SwiftKey
3. Ways You Can Use Auto Scaling
Launch EC2 instances and groups from reusable templates
Scale up and down automatically as needed
Auto-replace instances and maintain EC2 capacity
4. Common Scenarios
• Prepare for a Big Launch: schedule a one-time scale out and flip to production (sketched below)
• Fit Capacity to Demand: follow daily, weekly, or monthly cycles
• Be Ready for Spikes: provision capacity dynamically by scaling on CPU, memory, request rate, queue depth, users, etc.
• Simplify Cost Allocation: auto-tag instances with cost center, project, version, stage
• Maintain Stable Capacity: auto-replace instances that fail ELB or EC2 checks
• Go Multi-AZ: auto-balance instances across multiple zones
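A minimal sketch of the "Prepare for a Big Launch" scenario using boto3 (the AWS SDK for Python, not part of the deck); the group name, capacities, and start time are hypothetical:

```python
import boto3
from datetime import datetime

autoscaling = boto3.client("autoscaling")

# One-time scheduled action (no Recurrence, so it runs exactly once):
# scale the hypothetical "prod-web" group out just before the launch window.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="prod-web",
    ScheduledActionName="big-launch-scale-out",
    StartTime=datetime(2014, 7, 10, 6, 0),
    MinSize=10,
    MaxSize=40,
    DesiredCapacity=20,
)
```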
5. Auto Scaling group
[Architecture diagram: a region with Availability Zone #1 and Availability Zone #2 behind Elastic Load Balancing (www.example.com). Each zone holds EC2 instances (web app server) in a security group, managed by an Auto Scaling group {min: 1, max: 1, desired: 1, plan: maintain}. A launch config defines the AMI, instance type, key pairs, security groups, block device mappings, etc. CloudWatch metrics (standard or custom) raise alarms that trigger an Auto Scaling policy (AdjustmentType, ASGName, Cooldown, MinAdjustmentStep, PolicyName, ScalingAdjustment); health state and custom health checks drive replacement actions. The scaling plan can be maintain, manual, schedule, or demand.]
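A minimal sketch of how these pieces fit together, written against boto3 (an assumption; the deck does not prescribe an SDK). The names, AMI ID, and CPU threshold are hypothetical, and the group is given a wider max than the slide's maintain-only {min: 1, max: 1, desired: 1} example so the policy has room to act:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Launch config: the reusable template (AMI, instance type, key pair,
# security groups, block device mappings, ...).
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-app-v1",
    ImageId="ami-12345678",
    InstanceType="m3.medium",
    KeyName="web-key",
    SecurityGroups=["web-sg"],
)

# Auto Scaling group spanning two Availability Zones behind an ELB.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-app",
    LaunchConfigurationName="web-app-v1",
    MinSize=1,
    MaxSize=4,
    DesiredCapacity=1,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)

# Scaling policy (AdjustmentType, ScalingAdjustment, Cooldown, ...) that a
# CloudWatch alarm can trigger.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-app",
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)

# CloudWatch alarm on a standard metric, wired to the policy.
cloudwatch.put_metric_alarm(
    AlarmName="web-app-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-app"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```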
6. What’s New in Auto Scaling
Better integration
• EC2 console support
• Scheduled scaling policies in
CloudFormation templates
• ELB connection draining
• Auto-assign public IPs in VPC
• Spot + Auto Scaling
More APIs
• Create groups based on running
instances
• Create launch configurations based
on running instances
• Attach running instances to a group
• Describe account limits for groups
and launch configs
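A minimal boto3 sketch of the new APIs listed above; the instance IDs and names are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Create a launch configuration modeled on a running instance.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="cloned-from-instance",
    InstanceId="i-0abc1234",
)

# Create a group based on a running instance: its AMI, type, etc.
# become the group's implicit launch configuration.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="from-instance",
    InstanceId="i-0abc1234",
    MinSize=1,
    MaxSize=4,
    DesiredCapacity=1,
)

# Attach an already-running instance to an existing group.
autoscaling.attach_instances(
    AutoScalingGroupName="from-instance",
    InstanceIds=["i-0def5678"],
)

# Describe account limits for groups and launch configurations.
print(autoscaling.describe_account_limits())
```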
9. The Weather Company
• Top 30 web property in the U.S.
• 2nd most viewed television channel in the U.S.
• 85% of U.S. airlines depend on our forecasts
• Major retailers base marketing spend and store displays on our forecasts
• 163 million unique visitors across TV and web
10. Wunderground Radar and Maps
• 100 million hits a day
• One billion data points per day
• Migrated wunderground.com's real-time radar mapping system to the AWS cloud
17. Before Migration – Traditional IT Model doesn't scale well / After Migration – Wunderground Radar App
[Charts, before migration: Server Count (110 servers), Avg. CPU Load, HTTP Response Latency (~6000 ms)]
[Charts, after migration: Server Count (from 110 to 170 instances), Avg. CPU Load, HTTP Response Latency (5–15 ms)]
25. Here.com Local Search Application
• Local Search app
• First customer-facing application on AWS
• Obvious need for uptime
26. Here.com Local Search Architecture
[Diagram: regions US-East-1, US-West-2, EU-West-1, and AP-Southeast-1. Within US-East-1, zones US-East-1a and US-East-1b each show Zookeeper1, Zookeeper2, and Zookeeper3 alongside a Frontend Group and Backend Groups.]
27. Here.com Local Search Architecture
[Same diagram as the previous slide, with the Zookeeper nodes called out]
Single-Instance Auto Scaling Groups (Zookeeper)
1. Auto-healing: instances auto-register in DNS via Route 53
2. Dynamic: Auto Scaling group names are used for cluster-node lookups (cluster1-zookeeper1)
3. Uses standard tools such as DNS instead of queries or Elastic IPs
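A minimal sketch of one such single-instance, auto-healing group with DNS registration, using boto3 and Route 53. The group, zone, hosted-zone ID, and record name are hypothetical, and this is not Nokia's actual tooling:

```python
import boto3

autoscaling = boto3.client("autoscaling")
route53 = boto3.client("route53")
ec2 = boto3.client("ec2")

# A min=max=desired=1 group per Zookeeper node: if the instance dies,
# Auto Scaling replaces it.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="cluster1-zookeeper1",
    LaunchConfigurationName="zookeeper-lc",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    AvailabilityZones=["us-east-1a"],
)

def register_in_dns(instance_id, hosted_zone_id="Z123EXAMPLE"):
    """Run on boot (e.g. from user data): the replacement instance
    re-registers itself so peers can always find cluster1-zookeeper1."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    ip = reservations[0]["Instances"][0]["PrivateIpAddress"]
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "cluster1-zookeeper1.example.internal.",
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip}],
                },
            }]
        },
    )
```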
29–35. Map Data on S3
[Diagram sequence: cluster1 in US-East-1a, coordinated by Zookeeper1. Two instances serve the old data set; a new data set (V2) is published to S3; new v2 instances come up alongside the old ones to serve it.]
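One way to realize the rolling data update pictured above is to swap the group's launch configuration and grow it so fresh V2 instances join the old ones; a hedged boto3 sketch (names and IDs hypothetical, not necessarily the Here.com team's exact mechanism):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Launch config whose baked AMI (or user data) points at the V2 data set on S3.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="cluster1-backend-v2",
    ImageId="ami-22334455",
    InstanceType="m3.large",
)

# Point the backend group at the new launch configuration; existing (old)
# instances keep running, but every new instance launches with V2.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="cluster1-backend",
    LaunchConfigurationName="cluster1-backend-v2",
)

# Grow the group so fresh V2 instances come up alongside the old ones;
# the old instances can be terminated once V2 is serving traffic.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="cluster1-backend",
    DesiredCapacity=4,
)
```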
36. Common Scenario: Blue-Green Deployments Using Auto Scaling
[Diagram: Load Balancing (ELB) in front of a Web Server Fleet (Amazon EC2) running v1.1 instances, backed by a Database Fleet (RDS or DB on EC2). An "Auto scaling" group brings up v1.2 instances alongside the v1.1 fleet. Settings shown: max instances, min instances, scaling trigger, custom metrics, upper threshold, lower threshold, increment by.]
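A minimal blue-green sketch along the lines of the diagram: a second Auto Scaling group for v1.2 registers with the same ELB, and the v1.1 group is drained once v1.2 is healthy. boto3 is assumed and all names are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Launch config for the new application version.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-v1.2",
    ImageId="ami-87654321",
    InstanceType="m3.medium",
)

# "Green" group for v1.2, registered with the same ELB as the v1.1 group.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-green-v1.2",
    LaunchConfigurationName="web-v1.2",
    MinSize=4,
    MaxSize=8,
    DesiredCapacity=4,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)

# Once v1.2 is healthy behind the ELB, drain the "blue" v1.1 group.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-blue-v1.1",
    MinSize=0,
    DesiredCapacity=0,
)
```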
37. Here.com Local Search Success
• Increased uptime to 99.9%
• Every instance with a detected health problem has been replaced by Auto Scaling with zero manual intervention
• Zookeeper setup has performed flawlessly
"We've been paranoid so it still pages us; it's beginning to feel silly."
41. Auto Scaling the Web Layer
Based on:
• Number of HTTP requests
• Average CPU load
• Network in/out
42. Auto Scaling the Web and Worker Layers
• Web layer – based on: number of HTTP requests, average CPU load, network in/out
• Worker layer – based on: SQS queue length
44. Cost Control
• Scheduled scaling: we analyzed our traffic and picked numbers
– scale up in the morning, scale down in the evening (sketched below)
• Policies for slow scale-down
• Stage environments: downscale everything to "min-size" daily (or more often)
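A minimal sketch of the scheduled morning/evening pattern above with boto3; the group names, cron expressions (UTC), and sizes are hypothetical, not Adobe's actual values:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale up every weekday morning.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web",
    ScheduledActionName="morning-scale-up",
    Recurrence="0 13 * * 1-5",
    MinSize=8,
    DesiredCapacity=12,
)

# Scale down in the evening.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web",
    ScheduledActionName="evening-scale-down",
    Recurrence="0 1 * * *",
    MinSize=2,
    DesiredCapacity=2,
)

# Stage environments: squeeze everything down to "min-size" daily.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="stage-web",
    ScheduledActionName="nightly-downscale",
    Recurrence="0 4 * * *",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
)
```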
46. How – Custom Metrics
. . .
Sat Oct 6 05:51:03 UTC 2012
Number of AZs: 4
Number of Web Servers: 16
Number of Healthy Web Servers: 16
ELB Request Count: 9523.0
Request Count Per Healthy Web Server: 595.1875
Network In Per Healthy Web Server: 51 MB
Network Out Per Healthy Web Server: 1 MB
CPU Per Healthy Web Server: 25.23875
Publishing Custom Metrics: InstanceRequestCount, HealthyWebServers,
InstanceNetworkIn, InstanceNetworkOut, InstanceCPUUtilization to namespace
WebServer in us-east-1
. . .
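Publishing metrics like the ones in the log above boils down to a single CloudWatch call; a minimal boto3 sketch with illustrative values (this is not Adobe's actual script):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative values; a real script would derive these from ELB/EC2 metrics
# divided by the current count of healthy web servers.
healthy_web_servers = 16
elb_request_count = 9523.0

cloudwatch.put_metric_data(
    Namespace="WebServer",
    MetricData=[
        {"MetricName": "HealthyWebServers",
         "Value": healthy_web_servers, "Unit": "Count"},
        {"MetricName": "InstanceRequestCount",
         "Value": elb_request_count / healthy_web_servers, "Unit": "Count"},
        {"MetricName": "InstanceCPUUtilization",
         "Value": 25.24, "Unit": "Percent"},
        {"MetricName": "InstanceNetworkIn",
         "Value": 51 * 1024 * 1024, "Unit": "Bytes"},
        {"MetricName": "InstanceNetworkOut",
         "Value": 1 * 1024 * 1024, "Unit": "Bytes"},
    ],
)
```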
47. How – Multi-Input Scaling
Scale up:
• +2 instances if more than 50 visible messages for >5 min
• +50% instances if more than 1,000 messages for >2 min
• + fixed 100 instances if more than 10,000 messages for >1 min
Scale down:
• -10 instances if 0 messages for more than 10 min
• -25% if 0 messages for more than 30 min
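Two of the scale-up rules above expressed as separate policies plus alarms (ChangeInCapacity and PercentChangeInCapacity); a hedged boto3 sketch with hypothetical queue and group names:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

QUEUE_DIMENSION = [{"Name": "QueueName", "Value": "work-queue"}]

def backlog_alarm(name, threshold, minutes, policy_arn):
    """Alarm when visible SQS messages exceed `threshold` for `minutes`."""
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=QUEUE_DIMENSION,
        Statistic="Average",
        Period=60,
        EvaluationPeriods=minutes,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy_arn],
    )

# +2 instances if more than 50 visible messages for >5 min.
small_step = autoscaling.put_scaling_policy(
    AutoScalingGroupName="workers",
    PolicyName="plus-two",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)
backlog_alarm("backlog-50", 50, 5, small_step["PolicyARN"])

# +50% instances if more than 1,000 messages for >2 min.
big_step = autoscaling.put_scaling_policy(
    AutoScalingGroupName="workers",
    PolicyName="plus-fifty-percent",
    AdjustmentType="PercentChangeInCapacity",
    ScalingAdjustment=50,
    Cooldown=300,
)
backlog_alarm("backlog-1000", 1000, 2, big_step["PolicyARN"])
```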
48. Adobe's Advice
• Use CloudFormation!
• Know your system and its thresholds
• Watch your scaling history
• Scaling up is easy; scaling down, not so much
• Mantra: scale up fast, scale down slow
50. Key Takeaways
• Maintaining application response times and fleet utilization
• Handling cyclical demand, “weather events”
• Auto Scaling for 99.9% Uptime
• Single-instance groups
• Cost control and asymmetric scaling responses
• CloudFormation, custom scripts, and multiple inputs
• High availability, low latency & high resiliency
• Cassandra & Zookeeper
The Weather Channel
Nokia
Adobe
SwiftKey
52. SwiftKey – Best known for smart apps
• SwiftKey Keyboard (Android): >30 million downloads so far; best-selling paid app on Google Play, 2012 & 2013; 4.7 star rating; Editors' Choice
• SwiftKey Note (iPhone and iPad): >1 million downloads in first month; Top 10 free app chart, US
53. Business issue
• High availability
• Low latency required – global low latency in future
• High resilience
55. Architecture notes
• Deployed using Chef
• Two of each server type per service, spread across AZs
• Using Redis as an accelerator, but may remove it and just use Cassandra
• Services use Zookeeper to find each other
56. Cassandra
• Cassandra currently deployed as 3 nodes – one per AZ in a region; can lose any node
• Has been tested running between regions, including writes
57. Zookeeper and Exhibitor
• We use Apache Zookeeper so servers can discover each other and share configuration
– Run as multiple instances
– Works as a shared namespace
– State stored in S3 via Exhibitor
• Netflix Exhibitor is a Java supervisor system for ZooKeeper. It provides a number of features:
– Watches a ZK instance and makes sure it is running
– Performs periodic backups
– Performs periodic cleaning of the ZK log directory
– A GUI explorer for viewing ZK nodes
– A rich REST API
(above taken directly from the Exhibitor webpage)
58. Other thoughts
• Make compute stateless and parallelised
– Can then scale
– Doesn't matter if a node fails
– Can cost-optimise – look at CloudWatch to see whether you are CPU-bound, I/O-bound, etc.
• Storage
– If possible, store state in S3 or a database that can shard globally, e.g. Cassandra
59. Other thoughts
• Look at Trusted Advisor
– Warns you about ELBs that are not spread across AZs
– Warns about missing snapshots
– Warns about under-utilised resources (i.e. spending too much)
• Use your AWS people:
– ask support questions
– talk to AWS Solution Architects
– get your account manager to provide a Reserved Instance report if you are on consolidated billing