The document discusses old practices of managing servers individually and how they no longer apply in cloud environments. It advocates letting go of habits like naming servers and worrying about their individual status. Instead, it recommends taking advantage of cloud services like Auto Scaling that allow infrastructure to be treated as code and provide self-healing capabilities. Specific practices highlighted include using tags instead of hostnames, treating all resources as auto-scaling groups, and quickly replacing unhealthy instances through mechanisms like STONITH.
2. Why are we here?
Old-school IT practices continue to weigh us down
in the cloud. We need a way out.
4. “Everything now is a programmable
resource. There are no physical
things anymore. Things that you
needed to do by walking to the
datacenter, by hugging your
servers, and believe me I’ve
hugged servers enough in my life.
They DO NOT hug you back.” Dr. Werner Vogels (Re:Invent 2012)
5. “But I love my servers!”
- You (now)
https://secure.flickr.com/photos/schluesselbein/4157426778/
7. “They hate you, actually, I
honestly believe that they
hate you. At least that is
how they behaved
towards me.” –
Dr. Werner Vogels (Re:Invent 2012)
8. “But I love my servers!”
“Well now I’m kind of sad.”
- You (now)
https://secure.flickr.com/photos/bensonkua/2687804310/
16. IF THIS THING
IS OUT OF
TAPE, YOU
HAD A REALLY
BAD DAY.
https://secure.flickr.com/photos/stephendotcarter/6587082437
19. So where does server hugging come from?
Why did we need to find them in
person?
Because we HAD to fix them. WHY?
24. So where does server hugging come from?
We fixed them because:
Dead servers == dead space
Dead space == wasted $$$
Dead servers == worse performance
Worse performance == lost $$$
31. Waking when they cry:
*** Nagios ***
Notification Type: PROBLEM
Service: Web CPU
Host: web03.example.com
Address: 10.167.10.51
State: CRITICAL
Date/Time: Thu Oct 24 08:14:13 UTC 2013
Additional Info:
CRITICAL – CPU LOAD 29
32. Hugging server babies and you
• Is the site performing worse?
• Are your customers impacted?
• How impacted are they?
• What are the other 20 web instances doing?
• Did I really need to wake up at 4am for this?
• If a server uses 100% of its CPU, should I care?
• If this server is bad, how much work is there in fixing it?
• Is there something custom about this server?
33. Server hugging bad practices
• “Pet-ting” – caring about a server’s “name,” its
well being, its individual status
• “Snowflakes” – unique hosts in a common pool
• “Model T-ing” – Hand-built one-off servers
• “Names In Stone” – overuse of host names as
a source of truth
34. In short, there are a lot of old-school, dated habits
being taken to cloud infrastructure. And once you’ve
brought them to the cloud, you lose out on a lot of the
benefits of the cloud.
Such as:
• Dynamic scale up/down
• Self healing infrastructures
• Increased flexibility
• Automation
36. Letting go involves moving forward with
some of the best of what AWS can offer you
in terms of services and how you can work
with them in some pretty incredible ways.
37. Letting go and loving the new way
• Using Auto Scaling for everything
• ENIs and EIPs
• Tags are the new DNS
• Deployment tools
• Host-based configuration
• Service registries
39. The things that should never wake you up
• High CPU usage on anything
• High memory usage on anything
• Thread/process exhaustion
• Filled disks
• Not running software
• Failed instances
44. Common actions taken when paged
1. Look at logs (looking at past data)
2. Look at graphs (looking at past data)
3. Reboot/restart related application/instance
Why do this manually?
46. Traffic to our site vs. provisioned capacity manually
[Chart: traffic vs. "Provisioned capacity"; labels shown: 76%, 24%]
47. Traffic to our site vs. provisioned capacity with Auto Scaling
[Chart: traffic vs. "Provisioned capacity"]
48. STONITH
“Shoot the other node in the head”
Don’t be afraid to kill a node that has
something wrong with it as a resolution
to failure!
With Auto Scaling it’s fine!
56-58. STONITH
[Diagram sequence, three slides. Labels: CloudWatch, Alarm, Amazon SNS, Amazon SQS, EC2 API, Internet Gateway, three ELBs, Web Instances, a Watcher Instance, Auto Scaling group (min=3), three Availability Zones, Virtual Private Cloud, AWS Cloud. Slide 56 shows the alarm and three web instances; slide 57 shows the alarm and two web instances; slide 58 shows three web instances and no alarm.]
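The watcher flow in the STONITH diagram (CloudWatch alarm, Amazon SNS, Amazon SQS, watcher instance, EC2 API) might be sketched roughly as below. This is a minimal sketch, not the presenters' implementation. It assumes the SQS message body is the JSON envelope that SNS writes, with the alarm JSON carried as a string in its "Message" field, and that the alarm lists its dimensions under "Trigger" with lowercase "name"/"value" keys. The function names, the alarm name, and the instance ID are hypothetical, and the terminate call is injected so the logic can run without AWS credentials.

```python
import json


def instance_from_alarm(sqs_body):
    """Return the InstanceId named in an alarm notification, or None.

    Assumption: sqs_body is the SNS envelope (a JSON object) whose
    "Message" field holds the CloudWatch alarm JSON as a string.
    """
    envelope = json.loads(sqs_body)
    alarm = json.loads(envelope["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return None  # ignore transitions back to OK or INSUFFICIENT_DATA
    for dim in alarm.get("Trigger", {}).get("Dimensions", []):
        if dim.get("name") == "InstanceId":
            return dim.get("value")
    return None


def handle_message(sqs_body, terminate):
    """Shoot the unhealthy node; Auto Scaling is expected to replace it."""
    instance_id = instance_from_alarm(sqs_body)
    if instance_id is not None:
        terminate(instance_id)
    return instance_id


# Demo with a hand-built message and a fake terminate function.
alarm = {
    "AlarmName": "web-cpu-high",  # hypothetical alarm name
    "NewStateValue": "ALARM",
    "Trigger": {"Dimensions": [{"name": "InstanceId", "value": "i-0abc123"}]},
}
body = json.dumps({"Type": "Notification", "Message": json.dumps(alarm)})
killed = []
handle_message(body, killed.append)
print(killed)
```

On a real watcher instance, `terminate` would wrap an EC2 API call that terminates the instance, and the Auto Scaling group's minimum size would bring a replacement up.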
59. Auto Scaling for everything!
• You can use Auto Scaling for singular instances that
don’t scale up or down
– min = 1, max = 1
• Auto Scaling gives you the ability to specify multiple
Availability Zones, even if you only need a single host
– gives you multi-AZ failover
• Auto Scaling supports notifications on instance
creation/termination
– Useful for configuring other resources, bootstrapping, and
provisioning
• Auto Scaling is free!
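As an illustration of the min = 1, max = 1 pattern above, the sketch below builds the parameters for a one-instance Auto Scaling group spread over several Availability Zones. The group name, launch configuration name, and zone names are hypothetical; the keys follow the parameter names of the Auto Scaling CreateAutoScalingGroup call as exposed by boto3, which is an assumption about the client library you would use.

```python
def single_instance_group(name, launch_config, zones):
    """Build keyword arguments for a one-instance Auto Scaling group."""
    if len(zones) < 2:
        raise ValueError("list at least two Availability Zones for failover")
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        "MinSize": 1,          # never fewer than one instance
        "MaxSize": 1,          # never more than one instance
        "DesiredCapacity": 1,
        "AvailabilityZones": list(zones),
    }


# Hypothetical names, for illustration only.
params = single_instance_group(
    "blog-app", "blog-app-lc", ["us-west-2a", "us-west-2b"]
)
print(params)
# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("autoscaling").create_auto_scaling_group(**params)
```

If the single instance fails, the group launches a replacement in one of the listed zones, which is where the multi-AZ failover comes from.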
60. Auto Scaling for everything!
• Make use of the user data or configuration
management tools to do things like:
– Re-attaching an Amazon Elastic Block Store (EBS) volume with
application data
– Re-attaching an Elastic Network Interface (ENI)
– Update service registries
– Update DNS
– Notify other dependent applications about the new host
61. Elastic Network Interfaces/Elastic IPs
ENI:
• Add additional interfaces to an
instance
• One or more secondary private
IP addresses
• Has its own MAC address
• Can have Security Groups
assigned
• Tag-able
• Free
EIP:
• A static public IP address
• Can be assigned to either an
instance or an ENI
• Doesn’t replace private IP
• Small hourly charge when not
attached to an instance
62. Elastic Network Interfaces
Attaching multiple network interfaces to an instance is useful when you
want to:
• Create a management network.
• Use network and security appliances in your
Amazon Virtual Private Cloud (VPC).
• Create dual-homed instances with workloads/roles on distinct
subnets.
• Create a low-budget, high-availability solution.
65-75. Healing a single instance
[Diagram sequence, eleven slides. Labels, added step by step: AWS CloudFormation, EC2 API, Internet Gateway, NAT Instance, Availability Zone, Virtual Private Cloud, AWS Cloud; then App Instance; then Auto-Scaling Group; then "Elastic Network Instance" and EBS Volume; then "Instances".]
76. Healing a single instance
"myENI" : {
  "Type" : "AWS::EC2::NetworkInterface",
  "Properties" : {
    "Tags": [{"Key":"Name","Value":"AppENI"},
             {"Key":"Project","Value":"Blog"}],
    "Description": "Blog One Off App Server ENI.",
    "SubnetId": "subnet-d2286cb9",
    "PrivateIpAddress": "192.168.11.100"
  }
}
78. Healing a single instance
import boto.ec2
import boto.utils

# Connect to the EC2 API
conn = boto.ec2.connect_to_region('us-west-2')

# Find the right ENI by its tags
myfilters = {'tag:Name': 'AppENI', 'tag:Project': 'Blog'}
myEni = conn.get_all_network_interfaces(filters=myfilters)

# Look up this instance's ID from the instance metadata
myInstance = boto.utils.get_instance_metadata()['instance-id']

# Attach the ENI to this instance as a second interface
conn.attach_network_interface(myEni[0].id, myInstance, device_index=1,
                              dry_run=False)
79. Use tags as a source
of “truth” in your
infrastructure
https://secure.flickr.com/photos/cambodia4kidsorg/260004685
80. DNS bad. Tags good.
DNS
• 30-year-old technology
• Only tells us a single
thing about a host, a
hostname to IP mapping.
• Potential for split
brain/broken replicas
• Caching issues, caching
issues, caching issues
Tags
• Set by you the user, held
in AWS and available via
APIs
• Key:Value is totally up to
you
• Can have several per
resource
• Free to implement and
query
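To make the "several key:value pairs per resource" point concrete, here is a small sketch of looking resources up by tags instead of hostnames. The `tag:Key` filter form is the one the boto ENI example in this deck uses; the helper names and the sample inventory are hypothetical, and the lookup runs against a local list rather than the AWS API so it can be tried anywhere.

```python
def tag_filters(tags):
    """Turn {'Name': 'AppENI'} into {'tag:Name': 'AppENI'} style filters."""
    return {"tag:" + key: value for key, value in tags.items()}


def find_by_tags(resources, wanted):
    """Return the resources whose tags include every wanted key/value pair."""
    return [
        r for r in resources
        if all(r["tags"].get(k) == v for k, v in wanted.items())
    ]


# Hypothetical inventory standing in for what an API query would return.
inventory = [
    {"id": "i-01", "tags": {"Role": "web", "Project": "Blog"}},
    {"id": "i-02", "tags": {"Role": "db", "Project": "Blog"}},
    {"id": "i-03", "tags": {"Role": "web", "Project": "Shop"}},
]

wanted = {"Role": "web", "Project": "Blog"}
print(tag_filters(wanted))
print([r["id"] for r in find_by_tags(inventory, wanted)])
```

A hostname could only have told us one of these facts; the tag query answers "which web hosts belong to the Blog project" directly.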
82. Tags as a source of truth
• Tie various resources together
• Billing reports
• IAM resource-level permissions
• Build automation
• Deploy automation
• Security resource grouping
87. Host-based configuration management
• All more or less accomplish the same things
– File configuration, package/software installation, user management, run
commands, interface with OS, process management
• All have their own syntax that isn’t too dissimilar
• Some rely on agents, some are agentless
• Use HBCM alongside one of the tools from the previous
slide
• Spend the time required to learn them
• Can’t scale easily without HBCM
89. “I don’t have time to learn Chef!?”
“I wrote custom shell
scripts instead!”
https://secure.flickr.com/photos/45909111@N00/9374169461/
90. Go visit the AWS & Partner
exhibits and ask for more
info!
https://secure.flickr.com/photos/45909111@N00/9374169461/
91. Making Use of
Service Registries
https://secure.flickr.com/photos/fringedbenefit/9178086713
94. “A service registry is one of the fundamental
pieces of service-oriented architecture
(SOA) for achieving reuse. It refers to a
place in which service providers can impart
information about their offered services and
potential clients can search for services.”
- www.architecturejournal.net, Sept 2009
95. Service registry workflow
1. A new instance boots.
2. It registers itself with our “service registry.”
3. Changes to the service registry kick off changes on
other systems related to the new instance.
4. Other instances now know about our new instance.
5. On instance termination, instance is deregistered,
and other instances remove it from use.
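The five workflow steps above can be sketched with a tiny in-memory registry. This is only an illustration of the idea, assuming a single process; a real registry would be a networked service, and the class, method names, and addresses here are hypothetical.

```python
class ServiceRegistry:
    """Toy registry: tracks instances per service and notifies watchers."""

    def __init__(self):
        self._instances = {}   # service name -> set of "host:port" strings
        self._watchers = []    # callbacks run after every change

    def watch(self, callback):
        """Other systems subscribe here to hear about changes (step 3)."""
        self._watchers.append(callback)

    def _notify(self, service):
        members = sorted(self._instances.get(service, set()))
        for callback in self._watchers:
            callback(service, members)

    def register(self, service, address):
        """A newly booted instance announces itself (steps 1 and 2)."""
        self._instances.setdefault(service, set()).add(address)
        self._notify(service)

    def deregister(self, service, address):
        """A terminating instance is removed from use (step 5)."""
        self._instances.get(service, set()).discard(address)
        self._notify(service)

    def lookup(self, service):
        """Other instances ask who currently provides a service (step 4)."""
        return sorted(self._instances.get(service, set()))


registry = ServiceRegistry()
events = []
registry.watch(lambda service, members: events.append((service, members)))

registry.register("web", "10.0.0.5:8080")
registry.register("web", "10.0.0.6:8080")
registry.deregister("web", "10.0.0.5:8080")
print(registry.lookup("web"))
```

The watcher callback is where a load balancer reconfiguration or a DNS update would hang in a real setup.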
102. Intros
not at Re:Invent
Igor Serebryany
+ SRE at Airbnb since 2012
+ Built datacenter automation at
SingleHop
+ Scientific computing at University
of Chicago
+ Hobbies: welding, biking, long
walks on the beach
103. Intros
This guy is even more bearded than the last!
Martin Rhoads
+ SRE at Airbnb
+ user of AWS since 2006
+ One of the first 10 employees at RightScale
+ Previously worked at
Cloudscaling deploying
OpenStack at Tier1s and Telcos
+ BioInformatics at UCSB
+ Obsessed with making things
easier
105. Why do I need SOA?
What are you trying to sell me?
+ The definitive way to scale your architecture
+ Allow different people to work on different code without stepping on toes
+ Separate deployment schedules
+ Separate machine and data requirements
+ Fail separately -- so you can have graceful degradation
113. To sum up
1. Services help you scale
2. SOA is an architecture style designed around services
3. A SOA is hard to manage
4. SmartStack makes managing SOA a breeze
115.
1. Service(s) you want to deliver
2. Zookeeper registry to track everything
3. Nerve checks health and updates Zookeeper
4. Synapse routes between services
[Diagram labels: SERVICE, NERVE, ZOOKEEPER, SYNAPSE]
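The division of labor between the four pieces above might be sketched as follows. This is a rough illustration of the roles, not how Nerve or Synapse are actually implemented: a plain dict stands in for Zookeeper, the function names are hypothetical, and the rendered text only borrows haproxy's `listen` / `bind` / `server ... check` configuration keywords.

```python
def nerve_report(registry, service, address, is_healthy):
    """Nerve's role: keep healthy backends in the registry, drop the rest."""
    members = registry.setdefault(service, set())
    if is_healthy:
        members.add(address)
    else:
        members.discard(address)


def synapse_render(registry, service, local_port):
    """Synapse's role: render a local haproxy listener from the registry."""
    lines = [
        "listen " + service,
        "    bind 127.0.0.1:" + str(local_port),
    ]
    for i, address in enumerate(sorted(registry.get(service, set()))):
        lines.append("    server {}_{} {} check".format(service, i, address))
    return "\n".join(lines)


registry = {}  # stand-in for Zookeeper
nerve_report(registry, "search", "10.0.1.10:9200", True)
nerve_report(registry, "search", "10.0.1.11:9200", True)
nerve_report(registry, "search", "10.0.1.10:9200", False)  # health check failed
config = synapse_render(registry, "search", 3210)
print(config)
```

The application only ever talks to the local port (3210 here, chosen arbitrarily), so it never needs to know which backends are currently alive.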
117. haproxy
At the core of synapse
We get myriad benefits from haproxy
+ Stable and well-tested
+ Performs in-process connectivity
checks
+ Great introspection and logging
+ Lots of load-balancing algorithms
(RR, least-conn)
+ Somewhat dynamically reconfigurable
(stats socket)
120. Abstraction
+ The same code in the same language is always doing
discovery/registration
+ Your application doesn’t know about nerve/synapse -- it only knows about
its dependencies
+ Always consistent across your infrastructure
121. Automatic Failure Handling
You don’t have to wake up
+ Bad backends are automatically taken out of rotation
+ Useful during both problems and routine maintenance/deploys
+ Push-based => very rapid detection; avoid those little blips
+ haproxy even routes around network partitions!
122. Introspection
See what’s REALLY going on
Leverage the power of haproxy
+ status page that lets you see local
state
+ lots of available integrations to
gather global state
+ world-class logging for large-scale
analysis
123. Distributed by Design
No central point of failure
+ Traffic flows directly between boxes -- no routing layer
+ Even if SmartStack is stopped or broken, haproxy keeps traffic flowing
+ Zookeeper helps to avoid common pitfalls (like different backends in
different network segments)
124. The Impact
How SmartStack has changed Airbnb
Figures shown: 100+, 2K, 3K, 30
Labels shown: Services using SmartStack; Requests per second; LOC deleted; Engineers using SmartStack
125. Our engineers love SmartStack
Spike: “Nerve and Synapse have greatly simplified my life as an application developer, and have enabled me to launch our first Node.js services with very little ops overhead.”
Sean: “Smart Stack has made deployment of new java services a matter of beer and 20 lines of ruby”
Ben: “SmartStack is great! It helped me to discover services – and quit smoking”
Barbara: “I love it!”
Phillippe: “Distributed computing? And all this time I thought everything was running on one machine”
126. Future Direction
Is this project, like, done...?
1. Better resiliency: more graceful handling of Zookeeper edge cases
2. Better testing: improve on the current integration test suite
3. Dynamic registration: for services running on Mesos et al.
4. A push API for nerve: allow services to communicate coming downtime
5. An auto-scaling layer: use nerve information to determine load levels
129. Where is the code?
https://github.com/airbnb/nerve.git
https://github.com/airbnb/synapse.git
130. AWS re:Invent Pub Crawl
Join the AWS Startup Team this evening at the AWS Pub Crawl
When: Wednesday November 13, 5:30pm - 7:30pm
Where: Canaletto at The Venetian, 2nd Floor
Who Will Be There: Startups, the AWS Startup Team,
Startup Launch Companies, and
AWS re:Invent Hackathon winners
131. Startup Spotlight Sessions with Dr. Werner Vogels
Thurs. Nov 14, Marcello Room 4406
SPOT 203 – Fireside Chats – Startup Founders, 1:30-2:30pm
– Eliot Horowitz, CTO of MongoDB
– Jeff Lawson, CEO of Twilio
– Valentino Volonghi, Chief Architect of AdRoll
SPOT 204 – Fireside Chats – Startup Influencers, 3:00-4:00pm
– Albert Wenger, Managing Partner at Union Square Ventures
– David Cohen, Founder and CEO of TechStars
SPOT 101 - Startup Launches, 4:15-5:15pm
– 5 companies powered by AWS launching at AWS re:Invent 2013
132. We are sincerely eager to hear
your feedback on this
presentation and on re:Invent.
Please fill out an evaluation form
when you have a chance.