Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:Invent 2013

They Don't Hug Back!
Or Why You Need To Stop Worrying About
prodweb001 And Start Loving i-98fb9856
Chris Munns, Amazon Web Services
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Why are we here?
Old-school IT practices continue to weigh us down
in the cloud. We need a way out.

“Everything now is a programmable
resource. There are no physical
things anymore. Things that you
needed to do by walking to the
datacenter, by hugging your
servers, and believe me I’ve
hugged servers enough in my life.
They DO NOT hug you back.”
– Dr. Werner Vogels (Re:Invent 2012)

“But I love my servers!”
- You (now)

https://secure.flickr.com/photos/schluesselbein/4157426778/

“They hate you, actually, I
honestly believe that they
hate you. At least that is
how they behaved
towards me.” –
Dr. Werner Vogels (Re:Invent 2012)

“But I love my servers!”
“Well now I’m kind of sad.”
- You (now)

https://secure.flickr.com/photos/bensonkua/2687804310/

So where does
server hugging
come from?

NAMING
THEM
https://secure.flickr.com/photos/quinnanya/4464205726

Why do we name them?


Because we have to know where to
find them.
Where do we need to find them?

Here
Or here?

https://secure.flickr.com/photos/arthur-caranta/2925352521

IF THIS THING
IS OUT OF
TAPE, YOU
HAD A REALLY
BAD DAY.
https://secure.flickr.com/photos/stephendotcarter/6587082437


Why did we need to find them in person?
Because we HAD to fix them. WHY?


We fixed them because:
Dead servers == dead space
Dead space == wasted $$$
Dead servers == worse performance
Worse performance == lost $$$

So where else does
server hugging
come from?

SERVERS != OUR PETS

https://secure.flickr.com/photos/thegirlsny/3877243166/

What we name our pets
• Greek gods: Zeus, Thor, Hercules…
• Elements: Hydrogen, Helium, Lithium…
• Comic book heroes: Superman, Ironman…
• Musicians, Cities, Countries, Movies
• Prodweb01, Prodapi01…
• Web01.prod, Web01.test…
• Tacotruck01
• P1cfw01v03

P1cfw01v03
https://secure.flickr.com/photos/75898532@N00/3243666946/

P1cfw01v03
https://secure.flickr.com/photos/verylastexcitingmoment/3118396767/

Waking when they cry:
*** Nagios ***
Notification Type: PROBLEM
Service: Web CPU
Host: web03.example.com
Address: 10.167.10.51
State: CRITICAL
Date/Time: Thu Oct 24 08:14:13 UTC 2013
Additional Info:
CRITICAL – CPU LOAD 29

Hugging server babies and you
• Is the site performing worse?
• Are your customers impacted?
• How impacted are they?
• What are the other 20 web instances doing?
• Did I really need to wake up at 4am for this?
• If a server uses 100% of its CPU, should I care?
• If this server is bad, how much work is there in fixing it?
• Is there something custom about this server?

Server hugging bad practices
• “Pet-ting” – caring about a server’s “name,” its
well-being, its individual status
• “Snowflakes” – unique hosts in a common pool
• “Model T-ing” – Hand-built one-off servers
• “Names In Stone” – overuse of host names as
a source of truth

In short, there are a lot of old-school, dated habits
being taken to cloud infrastructure. And once you’ve
brought them to the cloud, you lose out on a lot of the
benefits of the cloud.
Such as:
• Dynamic scale up/down
• Self healing infrastructures
• Increased flexibility
• Automation

https://secure.flickr.com/photos/tolomea/5113266973/

Letting go involves moving forward with
some of the best of what AWS can offer you
in terms of services and how you can work
with them in some pretty incredible ways.

Letting go and loving the new way
• Using Auto Scaling for everything
• ENIs and EIPs
• Tags are the new DNS
• Deployment tools
• Host-based configuration
• Service registries

Sleeping through
Infrastructure Recovery

https://secure.flickr.com/photos/dominiqs/331702231

The things that should never wake you up
• High CPU usage on anything
• High memory usage on anything
• Thread/process exhaustion
• Filled disks
• Software that is not running
• Failed instances

Common actions taken when paged
1. Look at logs
2. Look at graphs
3. Reboot/restart related application/instance

1. Look at logs
2. Look at graphs
Both of these are just looking at past data.


Why do this manually?

[Chart: traffic to our site vs. manually provisioned capacity. Chart labels: 76% and 24%.]
[Chart: traffic to our site vs. provisioned capacity with Auto Scaling.]


STONITH
“Shoot the other node in the head”
Don’t be afraid to kill a node with
something wrong with it as a resolution
to failure!
With Auto Scaling it’s fine!

STONITH
[Diagram, built up over several slides: an Internet Gateway and ELB in front of three Web Instances in an Auto Scaling group (min=3), spread across three Availability Zones inside a Virtual Private Cloud in the AWS Cloud. The builds then add, in order: CloudWatch, an Alarm, Amazon SNS, Amazon SQS, a Watcher Instance, and the EC2 API. In the next-to-last build one Web Instance is gone; in the last build there are three Web Instances again.]
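
The watcher in the STONITH diagram can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: it assumes boto3-style clients (rather than the boto 2 library used elsewhere in this deck), assumes the CloudWatch alarm reaches SQS wrapped in the usual SNS JSON envelope, and assumes the alarm carries an InstanceId dimension. The function names and the queue URL are hypothetical.

```python
import json


def instance_id_from_alarm(sqs_body):
    """Return the InstanceId named in a CloudWatch alarm delivered via SNS to SQS.

    Returns None if the alarm is not in the ALARM state or names no instance.
    Assumption: the alarm's dimensions use lowercase "name"/"value" keys; check
    this against a real message from your own account.
    """
    envelope = json.loads(sqs_body)          # SNS envelope stored in the SQS body
    alarm = json.loads(envelope["Message"])  # the alarm itself is nested JSON
    if alarm.get("NewStateValue") != "ALARM":
        return None
    for dim in alarm.get("Trigger", {}).get("Dimensions", []):
        if dim.get("name") == "InstanceId":
            return dim.get("value")
    return None


def shoot_sick_instances(sqs, ec2, queue_url):
    """Poll the queue once and terminate every instance named in an ALARM message.

    `sqs` and `ec2` are assumed to behave like boto3.client("sqs") and
    boto3.client("ec2"). Returns the list of terminated instance IDs.
    """
    terminated = []
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        instance_id = instance_id_from_alarm(message["Body"])
        if instance_id:
            # With an Auto Scaling group (min=3) behind this, the group
            # launches a replacement once the instance is gone.
            ec2.terminate_instances(InstanceIds=[instance_id])
            terminated.append(instance_id)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return terminated
```

Passing the clients in as arguments keeps the decision logic testable without touching AWS; a real watcher would call shoot_sick_instances in a loop.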

Auto Scaling for everything!
• You can use Auto Scaling for singular instances that
don’t scale up or down
– min = 1, max = 1

• Auto Scaling gives you the ability to specify multiple
Availability Zones, even if you only need a single host
– gives you multi-AZ failover

• Auto Scaling supports notifications on instance
creation/termination
– Useful for configuring other resources, bootstrapping, and
provisioning

• Auto Scaling is free!
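
A minimal sketch of the "min = 1, max = 1" idea: the helper below only builds the arguments for a one-instance Auto Scaling group spread over several Availability Zones. The function name, group name, and launch template name are hypothetical, and the keys are assumed to be passed to boto3's create_auto_scaling_group; the talk itself does not show this code.

```python
def single_instance_group(name, launch_template_name, availability_zones):
    """Build keyword arguments for an Auto Scaling group that always holds one instance."""
    return {
        "AutoScalingGroupName": name,
        # Assumption: a launch template with this name already exists.
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # min = max = 1: the group never scales, it only replaces a lost instance.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Several zones let the replacement come up somewhere else.
        "AvailabilityZones": list(availability_zones),
    }
```

Usage would look like boto3.client("autoscaling").create_auto_scaling_group(**single_instance_group("blog-app", "blog-app-template", ["us-west-2a", "us-west-2b"])). Inside a VPC you would likely pass subnets via VPCZoneIdentifier instead of AvailabilityZones.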

Auto Scaling for everything!
• Make use of the user data or configuration
management tools to do things like:
– Re-attaching an Amazon Elastic Block Store (EBS) volume with
application data
– Re-attaching an Elastic Network Interface (ENI)
– Update service registries
– Update DNS
– Notify other dependent applications about the new host
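
As an illustration of the first item, a boot-time script could find the data volume by its tags and re-attach it. This is a sketch under assumptions: `ec2` is taken to be a boto3-style EC2 client, the function name, tag values, and device name are hypothetical, and error handling and waiting for the attachment to complete are left out.

```python
def reattach_data_volume(ec2, instance_id, availability_zone, tags, device="/dev/sdf"):
    """Attach the first available EBS volume matching `tags` to this instance.

    Returns the volume ID, or None if no matching volume is free.
    """
    filters = [{"Name": "tag:" + key, "Values": [value]}
               for key, value in sorted(tags.items())]
    # Only volumes that are not attached elsewhere...
    filters.append({"Name": "status", "Values": ["available"]})
    # ...and that live in this instance's zone, since EBS volumes are zone-bound.
    filters.append({"Name": "availability-zone", "Values": [availability_zone]})

    volumes = ec2.describe_volumes(Filters=filters)["Volumes"]
    if not volumes:
        return None
    volume_id = volumes[0]["VolumeId"]
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id
```

Because a volume cannot follow an instance into another Availability Zone, a single-host group that re-attaches an EBS volume has to stay in one zone or restore the data from a snapshot.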

Elastic Network Interfaces/Elastic IPs
ENI:
• Add additional interfaces to an
instance
• One or more secondary private
IP addresses
• Has its own MAC address
• Can have Security Groups
assigned
• Tag-able
• Free

EIP:
• A static public IP address
• Can be assigned to either an
instance or an ENI
• Doesn’t replace private IP
• Small hourly charge when not
attached to an instance

Elastic Network Interfaces

Attaching multiple network interfaces to an instance is useful when you
want to:
• Create a management network.
• Use network and security appliances in your
Amazon Virtual Private Cloud (VPC).
• Create dual-homed instances with workloads/roles on distinct
subnets.
• Create a low-budget, high-availability solution.

Healing a single instance

[Diagram, built up over several slides: AWS CloudFormation and the EC2 API sit outside a single Availability Zone in the AWS Cloud. Inside it, the builds add an Internet Gateway, a NAT Instance, an App Instance, an Auto Scaling group around the App Instance, and finally an EBS Volume and an Elastic Network Interface next to it.]

AWS CloudFormation:

"myENI" : {
  "Type" : "AWS::EC2::NetworkInterface",
  "Properties" : {
    "Tags": [
      {"Key": "Name", "Value": "AppENI"},
      {"Key": "Project", "Value": "Blog"}
    ],
    "Description": "Blog One Off App Server ENI.",
    "SubnetId": "subnet-d2286cb9",
    "PrivateIpAddress": "192.168.11.100"
  }
}

# Connect to the API
import boto.ec2
import boto.utils
conn = boto.ec2.connect_to_region('us-west-2')

# Find the right ENI by its tags
myfilters = {'tag:Name': 'AppENI', 'tag:Project': 'Blog'}
myEni = conn.get_all_network_interfaces(filters=myfilters)

# Attach the ENI to this instance (instance ID comes from instance metadata)
myInstance = boto.utils.get_instance_metadata()['instance-id']
conn.attach_network_interface(myEni[0].id, myInstance, device_index=1,
                              dry_run=False)

Use tags as a source
of “truth” in your
infrastructure
https://secure.flickr.com/photos/cambodia4kidsorg/260004685

DNS bad. Tags good.
DNS
• 30-year-old technology
• Only tells us a single
thing about a host, a
hostname to IP mapping.
• Potential for split
brain/broken replicas
• Caching issues, caching
issues, caching issues

Tags
• Set by you the user, held
in AWS and available via
APIs
• Key:Value is totally up to
you
• Can have several per
resource
• Free to implement and
query

DNS bad. Tags good.
DNS
Web03.example.com:
– 10.167.10.51

Tags
i-933f81a4:
– Name:Web
– Env:Prod
– Project:Blog
– Owner:BobSmith
– aws:autoscaling:groupName: ProdBlogWebsASG
– aws:cloudformation:stack-name: BlogSiteProd

Tags as a source of truth
• Tie various resources together
• Billing reports
• IAM resource-level permissions
• Build automation
• Deploy automation
• Security resource grouping
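
To make "tags as a source of truth" concrete, here is a small sketch that asks the API for running instances by tag instead of resolving a hostname. It assumes a boto3-style EC2 client passed in as `ec2`; the function name is hypothetical, and pagination of large result sets is ignored for brevity.

```python
def instances_with_tags(ec2, tags):
    """Return (instance_id, private_ip) pairs for running instances matching all tags."""
    filters = [{"Name": "tag:" + key, "Values": [value]}
               for key, value in sorted(tags.items())]
    filters.append({"Name": "instance-state-name", "Values": ["running"]})

    found = []
    response = ec2.describe_instances(Filters=filters)
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            found.append((instance["InstanceId"], instance.get("PrivateIpAddress")))
    return found
```

For example, instances_with_tags(ec2, {"Name": "Web", "Env": "Prod", "Project": "Blog"}) would give a deploy script its target list without anyone maintaining host names.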

Stop hand-crafting servers!

https://secure.flickr.com/photos/ndrwfgg/115898387

Use automation!
https://secure.flickr.com/photos/genewolf/147722350

AWS management tools
From higher-level services (convenience) to do it yourself (control):
• AWS Elastic Beanstalk
• AWS OpsWorks
• AWS CloudFormation

Host-based configuration management

Fabric

Host-based configuration management
• All more or less accomplish the same things
– File configuration, package/software installation, user management, run
commands, interface with OS, process management

• All have their own syntax that isn’t too dissimilar
• Some rely on agents, some are agentless
• Use HBCM alongside one of the tools from the previous
slide
• Spend the time required to learn them
• Can’t scale easily without HBCM

“I don’t have time to learn Chef!?”

“I wrote custom shell
scripts instead!”

Go visit the AWS & Partner
exhibits and ask for more
info!


Making Use of
Service Registries

https://secure.flickr.com/photos/fringedbenefit/9178086713

NOT THAT KINDA
REGISTRY!
https://secure.flickr.com/photos/smartfinn/2651755337/

“A service registry is one of the fundamental
pieces of service-oriented architecture
(SOA) for achieving reuse. It refers to a
place in which service providers can impart
information about their offered services and
potential clients can search for services.”
- www.architecturejournal.net, Sept 2009

Service registry workflow
1. A new instance boots.
2. It registers itself with our “service registry.”
3. Changes to the service registry kick off changes on
other systems related to the new instance.
4. Other instances now know about our new instance.
5. On instance termination, instance is deregistered,
and other instances remove it from use.
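
The workflow above can be sketched with a toy, in-memory registry. This is only an illustration of the register / notify / deregister cycle; the class and method names are hypothetical, and a real registry (such as the ones listed on the next slide) would be a replicated network service.

```python
class ServiceRegistry:
    """Toy service registry: instances register, watchers hear about every change."""

    def __init__(self):
        self._services = {}  # service name -> {instance_id: (host, port)}
        self._watchers = []  # callbacks invoked as callback(service, backends)

    def watch(self, callback):
        """Steps 3 and 4: other systems subscribe to changes."""
        self._watchers.append(callback)

    def register(self, service, instance_id, host, port):
        """Step 2: a freshly booted instance announces itself."""
        self._services.setdefault(service, {})[instance_id] = (host, port)
        self._notify(service)

    def deregister(self, service, instance_id):
        """Step 5: a terminated instance is removed from use."""
        self._services.get(service, {}).pop(instance_id, None)
        self._notify(service)

    def lookup(self, service):
        """Return the current (host, port) backends for a service, sorted."""
        return sorted(self._services.get(service, {}).values())

    def _notify(self, service):
        for callback in self._watchers:
            callback(service, self.lookup(service))
```

A load balancer or configuration-management run would be one of the watchers, rebuilding its backend list each time the callback fires.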

Service registry examples:
• Zookeeper
• MuleSoft Anypoint Service Registry
• Netflix Eureka
• IBM WebSphere Service Registry and Repository
• Airbnb SmartStack

Zookeeper
“is a centralized service for maintaining
configuration information, naming, providing
distributed synchronization, and providing group
services.” – zookeeper.apache.org
– leader election
– group membership
– configuration maintenance
– event notification
– locking
– priority queue mechanism

Zookeeper
[Diagram: a Zookeeper Leader Host and Zookeeper Instances spread across three Availability Zones in the AWS Cloud, with Worker Instances alongside them.]

Customer Story: Airbnb SmartStack
Martin Rhoads

Airbnb SmartStack
Helping you build Service Oriented Architectures
Martin Rhoads
SRE @ Airbnb
November 13, 2013

Intros
not at Re:Invent

Igor Serebryany
+ SRE at Airbnb since 2012
+ Built datacenter automation at
SingleHop
+ Scientific computing at University
of Chicago
+ Hobbies: welding, biking, long
walks on the beach


Intros
This guy is even more bearded than the last!

Martin Rhoads
+ SRE at Airbnb
+ user of AWS since 2006
+ One of the first 10 employees at RightScale
+ Previously worked at
Cloudscaling deploying
OpenStack at Tier1s and Telcos
+ BioInformatics at UCSB
+ Obsessed with making things
easier

SmartStack
Helping you build SOA

Why do I need SOA?
What are you trying to sell me?

+ The definitive way to scale your architecture
+ Allow different people to work on different code without stepping on toes
+ Separate deployment schedules
+ Separate machine and data requirements
+ Fail separately -- so you can have graceful degradation

How SOA happens
When customers love a service very, very much...

Here’s how it ends up
A certain kind of fun

To sum up
1. Services help you scale
2. SOA is an architecture style designed around services
3. A SOA is hard to manage
4. SmartStack makes managing SOA a breeze

What is SmartStack?
And how does it help?

1. Service(s) you want to deliver
2. Zookeeper registry to track everything
3. Nerve checks health and updates Zookeeper
4. Synapse routes between services

[Diagram labels: SERVICE, NERVE, ZOOKEEPER, SYNAPSE. A second build names two services, MONORAIL and MOBILE WEB, each shown with NERVE and SYNAPSE, around ZOOKEEPER.]

ZOOKEEPER

+ /production/monorail/services/i-1234567 => {'host': '1.2.3.4', 'port': 5678}
+ /production/mobile_web/services/i-0abcdef => {'host': '5.6.7.8', 'port': 5678}
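
A registration like the entries above could be written to Zookeeper roughly as follows. This is a sketch, not Nerve's actual implementation: it assumes a client object with a kazoo-style create(path, value, ephemeral=..., makepath=...) method, and the function names are hypothetical. Using an ephemeral node is an assumption made here so that the entry disappears when the registering session ends.

```python
import json


def registration(environment, service, instance_id, host, port):
    """Return the (path, data) pair for one service instance."""
    path = "/{}/{}/services/{}".format(environment, service, instance_id)
    data = json.dumps({"host": host, "port": port}, sort_keys=True).encode("utf-8")
    return path, data


def register(zk, environment, service, instance_id, host, port):
    """Create the registration node; `zk` is assumed to be a started kazoo-style client."""
    path, data = registration(environment, service, instance_id, host, port)
    # ephemeral=True: Zookeeper removes the node when this client's session ends.
    # makepath=True: create missing parent nodes such as /production/monorail/services.
    zk.create(path, data, ephemeral=True, makepath=True)
    return path
```

With kazoo this would be used as zk = KazooClient(hosts=...); zk.start(); register(zk, "production", "monorail", "i-1234567", "1.2.3.4", 5678).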

haproxy

At the core of synapse

We get myriad benefits from haproxy
+ Stable and well-tested
+ Performs in-process connectivity
checks
+ Great introspection and logging
+ Lots of load-balancing algorithms
(RR, least-conn)
+ Somewhat dynamically reconfigurable
(stats socket)
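
To illustrate how registry entries can become haproxy configuration, the sketch below renders a backend stanza from a dictionary of registered instances. It is not Synapse's actual output format; the function name and the choice of round-robin balancing are assumptions made for the example.

```python
def render_backend(service, backends):
    """Render an haproxy backend stanza.

    `backends` maps an instance ID to a dict with "host" and "port" keys,
    mirroring the registry entries shown for Zookeeper.
    """
    lines = ["backend {}".format(service), "    balance roundrobin"]
    for instance_id, entry in sorted(backends.items()):
        # "check" turns on haproxy's own health checks for this server.
        lines.append(
            "    server {} {}:{} check".format(instance_id, entry["host"], entry["port"])
        )
    return "\n".join(lines) + "\n"
```

Regenerating this stanza whenever the registry changes, then reloading haproxy, is one way to keep routing in step with the instances that are actually alive.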

To Recap

SmartStack in action

Why SmartStack?
+ Abstraction and DRY
+ Automatic failure detection
+ Introspection
+ Distributed by design

Abstraction

+ The same code in the same language is always doing
discovery/registration
+ Your application doesn’t know about nerve/synapse -- it only knows about
its dependencies
+ Always consistent across your infrastructure

Automatic Failure Handling
You don’t have to wake up

+ Bad backends are automatically taken out of rotation
+ Useful during both problems and routine maintenance/deploys
+ Push-based => very rapid detection; avoid those little blips
+ haproxy even routes around network partitions!

Introspection
See what’s REALLY going on

Leverage the power of haproxy
+ status page that lets you see local
state
+ lots of available integrations to
gather global state
+ world-class logging for large-scale
analysis

Distributed by Design
No central point of failure

+ Traffic flows directly between boxes -- no routing layer
+ Even if SmartStack is stopped or broken, haproxy keeps traffic flowing
+ Zookeeper helps to avoid common pitfalls (like different backends in
different network segments)

The Impact
How SmartStack has changed Airbnb

100+ services using SmartStack
2K requests per second
3K LOC deleted
30 engineers using SmartStack

Our engineers love SmartStack

Spike: “Nerve and Synapse have greatly simplified my life as an application developer, and have enabled me to launch our first Node.js services with very little ops overhead.”
Sean: “Smart Stack has made deployment of new java services a matter of beer and 20 lines of ruby”
Ben: “SmartStack is great! It helped me to discover services – and quit smoking”
Barbara: “I love it!”
Phillippe: “Distributed computing? And all this time I thought everything was running on one machine”

Future Direction

Is this project, like, done...?

1. Better resiliency: more graceful handling of zookeeper edge cases
2. Better testing: improve on the current integration test suite
3. Dynamic registration: for services running on Mesos et al.
4. A push API for nerve: allow services to communicate coming downtime
5. An auto-scaling layer: use nerve information to determine load levels

I’m sold!
How do I get started?

Getting Started

1. install Vagrant
2. git clone https://github.com/airbnb/smartstack-cookbook.git
3. vagrant up

Where is the code?

https://github.com/airbnb/nerve.git
https://github.com/airbnb/synapse.git

AWS re:Invent Pub Crawl
Join the AWS Startup Team this evening at the AWS Pub Crawl
When: Wednesday November 13, 5:30pm - 7:30pm
Where: Canaletto at The Venetian, 2nd Floor
Who Will Be There: Startups, the AWS Startup Team,
Startup Launch Companies, and
AWS re:Invent Hackathon winners

Startup Spotlight Sessions with Dr. Werner Vogels
Thurs. Nov 14, Marcello Room 4406

SPOT 203 – Fireside Chats – Startup Founders, 1:30-2:30pm
– Eliot Horowitz, CTO of MongoDB
– Jeff Lawson, CEO of Twilio
– Valentino Volonghi, Chief Architect of AdRoll

SPOT 204 – Fireside Chats – Startup Influencers, 3:00-4:00pm
– Albert Wenger, Managing Partner at Union Square Ventures
– David Cohen, Founder and CEO of TechStars

SPOT 101 - Startup Launches, 4:15-5:15pm
– 5 companies powered by AWS launching at AWS re:Invent 2013

We are sincerely eager to hear
your feedback on this
presentation and on re:Invent.
Please fill out an evaluation form
when you have a chance.
