A Cloud Outage
                  Under the Lens of
                “Profound Knowledge”
                              @botchagalupe



                                              1

Wednesday, October 31, 12
Welcome to Devopsdays NYC (the first one), hell yeah...
Normally I do the SOTU, but I have done a few this year and they’re all about the same
(on video).


This morning I am going to Demingize you all by telling you a cloud outage story.
Going to use something called the System of Profound Knowledge (sound Profound?)

#### No apologies for spelling and grammar in the notes. If that kind of stuff annoys
you please wait for the screen cast.
GOALS




             • Understanding Complexity
             • Overview of SoPK
             • Amazon’s Outage on 10/22/12



                                             2

Goody we are going to talk about big bad old Amazon’s outage last week...
SoPK - Understanding Complexity




                                           3

An improvement might be an upgrade, a bug fix, an emergency change, or a new
product.
One variable X will change the outcome (Y).
X -> Y (Y is the dependent variable)
SoPK - Understanding Complexity




                                               4

In real life you get many variables (the messiness of life).
There are direct effects on the dependent variable (Y).
SoPK - Understanding Complexity




                            T1          T2


                                             5



You also get time-dependent variables.
SoPK - Understanding Complexity




                            T1            T2


                                               6

There are also indirect effects on the dependent variable (Y).
For example, X1 in concert with X4 conjointly affects the dependent variable Y,
as does X3->X4.
This is a different model than X->Y.
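
A toy sketch of the difference, in Python. The functions and numbers are made up for illustration (they are not from the slides): in the simple model a +1 change in X always moves Y the same amount, while in the conjoint model the effect of X1 depends on X4, which in turn depends on X3.

```python
def simple_model(x: float) -> float:
    """X -> Y: one variable directly drives the outcome."""
    return 2.0 * x


def conjoint_model(x1: float, x3: float) -> float:
    """X3 -> X4, then X1 and X4 together (conjointly) drive Y."""
    x4 = x3 + 1.0   # indirect effect: X3 only matters through X4
    return x1 * x4  # X1's effect on Y depends on the current X4


# In the simple model, raising X by 1 always raises Y by 2.
simple_delta = simple_model(3.0) - simple_model(2.0)

# In the conjoint model, the same +1 change in X1 has a different
# effect depending on X3 (through X4).
delta_when_x3_low = conjoint_model(3.0, x3=0.0) - conjoint_model(2.0, x3=0.0)
delta_when_x3_high = conjoint_model(3.0, x3=4.0) - conjoint_model(2.0, x3=4.0)

print(simple_delta, delta_when_x3_low, delta_when_x3_high)
```

Same change, different outcome: that is why treating a messy system as a plain X->Y can mislead you.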
System of Profound Knowledge (SoPK)




                                             7

Do we have any photographers in the audience?
I use a camera lens as a metaphor for SoPK.
They call this the exposure triangle.
To take a perfect picture of an event you must have a good lens and understand how it
works.
The ISO must be understood for sensitivity to light.
The aperture must be understood for depth of field (a portrait or an area).
The shutter speed must be understood for motion.
System of Profound Knowledge (SoPK)




                                        • Appreciation of a system
                                        • Knowledge of variation
                                        • Theory of knowledge
                                        • Knowledge of psychology




                                                8

Well, Dr. Deming gave us such a lens to break down complexity (the real world), just like a
camera does.
Let’s say a lens for improvement of something (an enhancement, a bug fix, a new product
idea).
An outcome X->Y.
Dr. Deming gave us a tool called “The System of Profound Knowledge”.

SoPK is a lens to break down complexity and give ourselves an advantage so that we do not
oversimplify what we are trying to do. In other words, it clears up the messiness of real life just like
a camera lens does.

(S) Appreciation of a System - Systems thinking - Deming would say understanding the
AIM of a system.
Deming said every system must have an AIM.
Is your AIM to keep a server up or to protect a customer SLA? (They might not be the
same thing, as we will soon see.)
Eli Goldratt (TOC) would say global optimization over local optimization, even if
that leaves a subsystem suboptimal. Understanding subsystems and dependent systems.

(V) Variation - Not understanding variation is the root of all evil. Deming would get mad
at people. Knee-jerk reactions come from not understanding the kind of variation. How do you
understand variation? Statistics (primarily standard deviation and its relationship to a process,
i.e., its distribution).
Let me give you an example. A large cloud provider rate-limits API calls at 100 per (x). For most
customers that’s fine; others, however, get treated as a DDoS. Where did they get
Amazon’s EBS Outage 10/22/2012


                                                            This is one System




                                              9

Disclaimers (Amazon’s true story, my knowledge of SoPK, literary license for training
purposes).
An EBS service outage, correct?
The way it’s supposed to work...
Fleet monitor (hardware monitoring)
Failover for both the fleet monitor and the EBS server
DNS, of course
Metrics, performance monitoring to disk from the agent...
Of course humans... remember, they are part of every system.
We will give Amazon the benefit of the doubt that it’s in their value stream map (VSM).
However, a lot of orgs do not have this “human” process in the VSM.
Pre-automation... remediation systems (complex adaptive systems),
or just plain old humans.

LENS OF SOPK
## This is a system, not just the EBS service. Hint: why did I say it’s just one system?
Amazon’s EBS Outage 10/22/2012


                                                  X0 -> Server Failure




                                             10

Monitor Server has a failure (system down)

X0 - Fleet Management monitoring server fails
Amazon’s EBS Outage 10/22/2012


                                                         X1 -> Failover




                                             11

X1 - Fleet Management failover - anyone see this first issue?

Lens #1 (S) Not having a systems view - not seeing this as dependent systems. You
might say surely they had automation for DNS. However, I would say no, because...
Lens #2 (K) Theory cannot be an unmeasured guess. Whoever did the failover
(automation or manual) apparently didn’t have a proper measure for success. They should
have verified that they were actually using the new server (duh).
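
A minimal sketch of what “a proper measure for success” could look like, in Python. Everything here is hypothetical (the host name, the addresses, the `failover_verified` helper); it is not how Amazon does it. The idea is only that a failover is not done until the name clients use actually resolves to the new server.

```python
from typing import Callable


def failover_verified(
    service_name: str,
    expected_address: str,
    resolve: Callable[[str], str],
) -> bool:
    """Return True only if the name now resolves to the new server."""
    try:
        return resolve(service_name) == expected_address
    except OSError:
        # A name that does not resolve at all is also a failed failover.
        return False


# Hypothetical stand-in for DNS: still pointing at the old, offline server.
stale_dns = {"fleet-monitor.internal": "10.0.0.1"}

ok = failover_verified(
    "fleet-monitor.internal",
    expected_address="10.0.0.2",
    resolve=lambda name: stale_dns[name],
)
print("failover verified:", ok)
```

The resolver is passed in so the check can be exercised without a network; against real DNS you could pass `socket.gethostbyname` as `resolve`.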
Amazon’s Outage 10/22/2012


                                                        X2 -> DNS Failure




                                            12

DNS does not propagate....

X2 - DNS propagation failure

Lens #1 (P) We could argue that maybe the fleet servers are managed by
hardware guys and DNS by systems guys, and maybe they are different cultural tribes
that don’t understand the importance of each other. Maybe they don’t go to lunch together.
Amazon’s Outage 10/22/2012


                                                         X4 -> Memory Leak




   ((X0->X1->X2)->X4)
         X3->X4                              13

So now we have fixed the first problem of the bad server.
Seemed like a flawless operation, right?
But DNS still says the offline server is the monitor server.
The agent on the EBS server (services) keeps trying to send hardware data to the original
fleet monitor server.
However, it is fault tolerant by design, so it does not screw with production if it fails.
The hardware guys probably don’t even know about this. They probably don’t monitor
things like server or process memory.

X4 - Memory leak in the hardware agent on the EBS server

Lens #1 (S) The hardware guys should know that they are part of a bigger system than
just hardware monitoring. Was there a systems view for QA and smoke testing of
agent code changes?
Lens #2 (P) Are the hardware developers (agent) doing TDD/CD like the EBS devs? Do they
have the same theories? Do the EBS guys do CD smoke testing with the hardware
monitoring agents?

X3 is the HW agent devs’ bug.
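
I have not seen the agent code, so the following Python is purely a hypothetical sketch (the `BoundedReporter` class and its names are mine). It shows one way a “fault tolerant” reporter can keep retrying an unreachable monitor without its memory growing forever: cap the retry buffer and let the oldest readings fall off.

```python
from collections import deque
from typing import Callable


class BoundedReporter:
    """Buffers readings while the monitor is unreachable, up to a fixed cap."""

    def __init__(self, send: Callable[[str], None], max_pending: int = 1000):
        self._send = send
        # deque(maxlen=...) silently drops the oldest item when full,
        # so memory use stays bounded no matter how long the outage lasts.
        self._pending: deque = deque(maxlen=max_pending)

    def report(self, reading: str) -> None:
        self._pending.append(reading)
        self.flush()

    def flush(self) -> None:
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return  # monitor unreachable: keep what fits, try later
            self._pending.popleft()

    @property
    def pending(self) -> int:
        return len(self._pending)


def unreachable_monitor(reading: str) -> None:
    raise ConnectionError("fleet monitor is offline")


reporter = BoundedReporter(unreachable_monitor, max_pending=3)
for i in range(10):
    reporter.report(f"disk-check-{i}")

print(reporter.pending)
```

Ten readings go in, only the cap stays in memory. The trade-off is losing old hardware data, which is usually a better failure mode than starving the production service on the same box.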
Amazon’s Outage 10/22/2012


                                                      X5 -> Out of Memory




   (X3, X4)->X5
                                           14

The fault-tolerant code has a memory leak that masks the underlying issue and creates low memory
on the EBS servers...
The EBS server starts to run out of memory.

X5 Out of Memory
Amazon’s Outage 10/22/2012


                                                         X6 -> Throttling
                                                              (X->Y)




   X5->X6
                                             15

The low memory wakes up the humans (yellow), who see low memory in the systems
monitoring DB.
The fucking humans get involved and all hell breaks loose.
The humans see something is wrong with low memory on the EBS servers.
They start to throttle API calls due to the low memory (without knowing why it is low: local
optimization).

X6 Throttling
The systems guys see this as an X->Y issue (low memory, therefore throttle).
However, really it’s (X0,X2,X4) in concert with (X3) conjoined to cause X6.
Chances are they might not even know about the fleet server failover.
The hardware guys have no idea that there is a DNS issue (X2->X3) (X4,X5)

Lens #1 (S) (X->Y) Humans try to correct the memory issue with throttling and they
don’t understand hardware monitoring as a subsystem.
Lens #2 (V) The systems guys don’t understand common vs special cause variation...
they react to an “S” that should have been a “C”. Turns out
Lens #3 (K) Measures without results are not fixes (throttling). They should have
looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets
worse. What do you think happened?
Lens #4 (P) Tribal understanding of behavior differences between hardware guys and
systems guys.
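
For the variation lens, here is a simplified Python sketch of telling a common-cause (“C”) reading from a special-cause (“S”) one. The free-memory numbers are invented, and the plain mean plus or minus three standard deviations is a simplification of what a real control chart would do.

```python
from statistics import mean, stdev


def control_limits(baseline: list[float]) -> tuple[float, float]:
    """Lower and upper limits at mean +/- 3 standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - 3 * spread, center + 3 * spread


def classify(value: float, limits: tuple[float, float]) -> str:
    """'C' for common-cause (inside the limits), 'S' for special-cause."""
    lower, upper = limits
    return "C" if lower <= value <= upper else "S"


# Hypothetical free-memory percentages from a stable period.
baseline = [42.0, 40.0, 41.0, 43.0, 39.0, 41.0, 40.0, 42.0]
limits = control_limits(baseline)

print(classify(40.5, limits))  # within normal variation
print(classify(12.0, limits))  # far outside the limits
```

The point is not the arithmetic; it is that you decide how to react only after you know which kind of variation you are looking at.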
Amazon’s Outage 10/22/2012


                                                         X7 -> API Issues




           (X6->X7)
           (X5->X7)                          16

Things continue to get worse...
Some customers (yellow) were already having issues, but throttling makes it worse for
them.

X7 API Issues

Lens #3 (K) Measures without results are not fixes (throttling). They should have
looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets
worse. What do you think happened?

X7 is caused by both X6 and X5 independently (X6 just made things worse).
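
The arrows drawn on the slides so far can be written down as a small graph. This Python sketch is my own encoding of those arrows (the `CAUSES` table and the `upstream` helper are illustrative names); it contrasts the X->Y view of throttling with the systems view.

```python
# Each key is an effect; the list holds its direct causes, as drawn on the slides.
CAUSES = {
    "X1": ["X0"],        # server failure -> failover
    "X2": ["X1"],        # failover -> DNS failure
    "X4": ["X2", "X3"],  # DNS failure + agent bug -> memory leak
    "X5": ["X3", "X4"],  # -> out of memory
    "X6": ["X5"],        # -> throttling
    "X7": ["X5", "X6"],  # -> API issues
}


def upstream(effect: str) -> set[str]:
    """All direct and indirect causes of an effect."""
    found: set[str] = set()
    stack = list(CAUSES.get(effect, []))
    while stack:
        cause = stack.pop()
        if cause not in found:
            found.add(cause)
            stack.extend(CAUSES.get(cause, []))
    return found


print(sorted(CAUSES["X6"]))    # the X -> Y view of throttling
print(sorted(upstream("X6")))  # the systems view
```

The first line shows only the low memory; the second pulls in everything back to the original monitoring server failure.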
Amazon’s Outage 10/22/2012


                                                           X9 -> EBS Failover




           (X7->X9)
                                              17

Wednesday, October 31, 12
The still can’t pin the problem and they decide to force a failover (red) EBS Server
Agin local optimization

X9 EBS Failover

Lens #1 (K) Measures with out results are not fixes (failover). They should have looked
at the results.Three potential outcomes a) get better b) Stays the same c) Gets worse.
What do you think happened?
Lens #2 (P) Not understanding customer behavior.. Customers increase there actions
(API) calls testing services, qa, smoke.

X8 is customer hammering the services...
Amazon’s Outage 10/22/2012


                                                           X9 -> EBS Failover




           (X7->X9)
           (X8->X9)                           17

They still can’t pin down the problem, so they decide to force a failover of the EBS server (red).
Again, local optimization.

X9 - EBS Failover

Lens #1 (K) Measures without results are not fixes (failover). They should have looked
at the results. Three potential outcomes: a) gets better b) stays the same c) gets worse.
What do you think happened?
Lens #2 (P) Not understanding customer behavior. Customers increase their actions
(API calls): testing services, QA, smoke tests.

X8 is customers hammering the services...
Amazon’s Outage 10/22/2012


                                                             X10 ->Twitter Effect




           (X8->X10)
           (X9->X10)                            18

Meanwhile... the throttling sets off a bigger problem.
The Twitter effect kicks in... people start hammering the AWS API web services to test
availability.
People start testing all sorts of aaS services (smoke tests start). Curiosity tests start...
Come on, admit it, how many of you tried to spank the monkey last Monday... (Netflix
Chaos Monkey)
Load goes up...

X10 - Twitter effect

Introducing another external subsystem.
Lens #1 (S) Not including this whole other subsystem (admittedly this is hard).
Lens #2 (P) Not understanding the Twitter effect and other outside ecosystems... They might
have handled this, but for the purposes of this presentation it’s fun to assume they didn’t,
as a learning exercise.

X9 is an aggregate effect from one customer to other customers and non-customers...
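
A toy model of "load goes up" under throttling. Everything here is an assumption made for illustration: the function name, the limit of 100 calls per round, and the idea that each throttled call comes back as a fixed number of extra calls (retries, smoke tests, curiosity checks). It is not a description of how AWS throttles.

```python
def simulate_demand(base_calls, limit, retry_factor, rounds):
    """Toy feedback loop between a throttle and the people behind the calls.

    Each round, calls above `limit` are rejected. Every rejected call is
    assumed to return as `retry_factor` extra calls in the next round.
    Returns a list of (demand, rejected) tuples, one per round.
    """
    demand = base_calls
    history = []
    for _ in range(rounds):
        rejected = max(0, demand - limit)
        history.append((demand, rejected))
        demand = base_calls + rejected * retry_factor
    return history


# Hypothetical numbers: normal demand slightly above the throttle limit.
for demand, rejected in simulate_demand(120, limit=100, retry_factor=2, rounds=4):
    print(demand, rejected)
```

In this sketch demand that starts only 20 calls over the limit grows every round, while demand that starts under the limit never moves. The throttle looks like a local fix, but it feeds the very load it was meant to reduce.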
Amazon’s Outage 10/22/2012


                                                          X11 -> FA Server Dies




           (X6->X11)
          (X10->X11)                          19

The breakdown becomes systemic...
Now the backup (FA) server fails.

X11 - Failover server fails

Could be X6 (throttling), might be X10 (Twitter effect)... who the f knows...
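
One way to see why "who knows" is a fair answer: write down the arrows and walk them backwards. The sketch below uses only the edges drawn on slides 16 to 19 (the earlier X0 to X5 arrows are left out), and the function name is made up for this example.

```python
# Edges as drawn on slides 16-19 (cause -> effect).
EDGES = [
    ("X6", "X7"), ("X5", "X7"),
    ("X7", "X9"), ("X8", "X9"),
    ("X8", "X10"), ("X9", "X10"),
    ("X6", "X11"), ("X10", "X11"),
]


def upstream_causes(target, edges):
    """Return every variable with a direct or indirect path into `target`."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for cause in parents.get(node, ()):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


causes = sorted(upstream_causes("X11", EDGES), key=lambda name: int(name[1:]))
print(causes)  # ['X5', 'X6', 'X7', 'X8', 'X9', 'X10']
```

Every variable in this part of the story feeds X11, directly or indirectly. That is a long way from a single X->Y.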
Amazon’s Outage 10/22/2012


                                          Systemic Outage




                            X->Y
                                     20

The whole system is hosed...
The complexity was masked.
Too bad they had not read Deming...

A Cloud Outage Under the Lens of “Profound Knowledge”

  • 1. A Cloud Outage Under the Lens of “Profound Knowledge” @botchagalupe 1 Wednesday, October 31, 12 Welcome to Devopsdays NYC (first one) hell yeah... Normally I do the SOTU but I have done a few this year and there all about the same (on video) This morning I am going to Demingize you all by telling you a cloud outage story. Going to use something called the System of Profound Knowledge (sound Profound?) #### No apologies for spelling and grammar in the notes. If that kind of stuff annoys you please wait for the screen cast.
  • 2. GOALS • Understanding Complexity • Overview of SoPK • Amazon’s Outage on 10/22/12 2 Wednesday, October 31, 12 Goody we are going to talk about big bad old Amazon’s outage last week...
  • 3. SoPK - Understanding Complexity 3 Wednesday, October 31, 12 An Improvement .. might be an upgrade, a bug fix, an emergency change a new product.. One variable X will change the outcome (y) x-> y (Y is the dependent variable)
  • 4. SoPK - Understanding Complexity 3 Wednesday, October 31, 12 An Improvement .. might be an upgrade, a bug fix, an emergency change a new product.. One variable X will change the outcome (y) x-> y (Y is the dependent variable)
  • 5. SoPK - Understanding Complexity 3 Wednesday, October 31, 12 An Improvement .. might be an upgrade, a bug fix, an emergency change a new product.. One variable X will change the outcome (y) x-> y (Y is the dependent variable)
  • 6. SoPK - Understanding Complexity 3 Wednesday, October 31, 12 An Improvement .. might be an upgrade, a bug fix, an emergency change a new product.. One variable X will change the outcome (y) x-> y (Y is the dependent variable)
  • 7. SoPK - Understanding Complexity 4 Wednesday, October 31, 12 In real life you get many variables (messiness of life) There are direct effects against the dependent var (y)
  • 8. SoPK - Understanding Complexity T1 T2 5 Wednesday, October 31, 12 You also get time dependent variables
  • 9. SoPK - Understanding Complexity T1 T2 6 Wednesday, October 31, 12 There are also indirect effects on the dependent variables (y) for example X1 in concert with X4 conjointly effect the dependent var Y as does X3->X4 This is a different model that X->Y
  • 10. System of Profound Knowledge (SoPK) 7 Wednesday, October 31, 12 Do we have any photographers in the audience? Use a camera lens as a metaphor for SoPK They call this the exposure triangle. To take a perfect picture of an event you must have a good lens and understand how it works. The ISO must be understood for sensitivity to light The Aperture must be understood for DOF (a portrait or an area) The Shutter Seed to understand motion
  • 11. System of Profound Knowledge (SoPK) • Appreciation of a system • Knowledge of variation • Theory of knowledge • Knowledge of psychology 8 Wednesday, October 31, 12 Well Dr. Deming gave such a lens to break down complexity (the real world just like a camera does) Let’s say a lens for improvement of something (an enhancement, a bug fix, new product idea) An outcome X->y Dr Deming gave us a tool called “The System of Profound Knowledge” SoPK is a Lens to break down complexity and give ourselves an advantage to not over simplify what we are trying to do. In otherwise clear up the messiness of real life just like a camera lens does. (S) Appreciation of a System - Systems thinking - Deming would say understanding the AIM of a system. Deming said every system must have an AIM. Is your AIM to keep a server up or keep a protect a customer SLA (they might not be the same thing as we will soon see) Eli Goldarat (TOC) would say Global optimization over local optimization even if suboptimization is sub optimal. Understanding subsystems and dependent systems. (V) Variation - Not understanding Variation is the root of all evil. Deming would get mad at ppl. Knee jerk reactions due to not understanding the kind of variation. How do you understand varation? Statistics (primarily STD and and it’s relationship to a process i.e., it’s distribution) Give you an example. A large cloud provider rates API calls at 100 per (x). for Most customers that’s fine, however, others they get treated as DDOS. Where did they get
  • 12. Amazon’s EBS Outage 10/22/2012 9 Wednesday, October 31, 12 Disclaimers (Amazons true Story, my knowledge of SoPK, Literary license for training purposes) An EBS Services outage correct? The way it’s supposed to work... Fleet Monitor (hardware monitoring) Failover for both Fleet monitor and EBS Server DNS of course Metics, performance monitoring to disk from agent... Of course humans.. remember they are part of of every system We will give Amazon the benefit of the doubt that it’s in their value stream map However, a lot of orgs do not have this “human” process in the VSM Pre automation .. remediation systems (complex adaptive systems). or just plane old humans LENS OF SOPK ## This is a system not just the EBS Service Hint: Why did I say it’s just one system?
  • 13. Amazon’s EBS Outage 10/22/2012 9 Wednesday, October 31, 12 Disclaimers (Amazons true Story, my knowledge of SoPK, Literary license for training purposes) An EBS Services outage correct? The way it’s supposed to work... Fleet Monitor (hardware monitoring) Failover for both Fleet monitor and EBS Server DNS of course Metics, performance monitoring to disk from agent... Of course humans.. remember they are part of of every system We will give Amazon the benefit of the doubt that it’s in their value stream map However, a lot of orgs do not have this “human” process in the VSM Pre automation .. remediation systems (complex adaptive systems). or just plane old humans LENS OF SOPK ## This is a system not just the EBS Service Hint: Why did I say it’s just one system?
  • 14. Amazon’s EBS Outage 10/22/2012 This is one System 9 Wednesday, October 31, 12 Disclaimers (Amazons true Story, my knowledge of SoPK, Literary license for training purposes) An EBS Services outage correct? The way it’s supposed to work... Fleet Monitor (hardware monitoring) Failover for both Fleet monitor and EBS Server DNS of course Metics, performance monitoring to disk from agent... Of course humans.. remember they are part of of every system We will give Amazon the benefit of the doubt that it’s in their value stream map However, a lot of orgs do not have this “human” process in the VSM Pre automation .. remediation systems (complex adaptive systems). or just plane old humans LENS OF SOPK ## This is a system not just the EBS Service Hint: Why did I say it’s just one system?
  • 15. Amazon’s EBS Outage 10/22/2012 10 Wednesday, October 31, 12 Monitor Server has a failure (system down) X0 - Fleet Management monitoring server fails
  • 16. Amazon’s EBS Outage 10/22/2012 10 Wednesday, October 31, 12 Monitor Server has a failure (system down) X0 - Fleet Management monitoring server fails
  • 17. Amazon’s EBS Outage 10/22/2012 X0 -> Server Failure 10 Wednesday, October 31, 12 Monitor Server has a failure (system down) X0 - Fleet Management monitoring server fails
  • 18-21. Amazon’s EBS Outage 10/22/2012: X1 -> Failover. X1: Fleet Management failover. Anyone see this first issue? Lens #1 (S): Not having a systems view; not seeing these as dependent systems. You might say surely they had automation to update DNS. However, I would say no, because... Lens #2 (K): Theory cannot be an unmeasured guess. Whoever did the failover (automation or manual) apparently didn’t have a proper measure for success. They should have verified that they were actually using the new server (duh).
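The "proper measure for success" in Lens #2 (K) can be made concrete with a small check. This is a minimal sketch, not Amazon's actual tooling: the hostname, the IP addresses, and the function name `failover_verified` are all hypothetical. The idea is that a failover only counts as done once the name the agents use actually resolves to the new server.

```python
import socket
from typing import Callable


def failover_verified(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True only if `hostname` currently resolves to `expected_ip`."""
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # Resolution failed outright, so the failover is not verified.
        return False


# Hypothetical stale DNS: the name still points at the dead monitor server.
stale_dns = {"fleet-monitor.example.internal": "10.0.0.1"}
print(failover_verified(
    "fleet-monitor.example.internal",
    "10.0.0.2",                      # assumed address of the new server
    resolve=stale_dns.__getitem__,   # stand-in for a real DNS lookup
))
```

With the stale record the check reports a failed failover, which is exactly the signal that was missing in the story. The resolver is injectable so the check can be exercised without touching real DNS.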
  • 22-25. Amazon’s Outage 10/22/2012: X2 -> DNS Failure. DNS does not propagate... X2: DNS propagation failure. Lens #1 (P): We could argue that because the fleet servers are managed by hardware guys and DNS is managed by systems guys, maybe they are different cultural tribes and don’t understand the importance of each other. Maybe they don’t go to lunch together.
  • 26-31. Amazon’s Outage 10/22/2012: X4 -> Memory Leak ((X0->X1->X2)->X4) X3->X4. So now we have fixed the first problem of the bad server. Seemed like a flawless operation, right? But DNS still says the offline server is the monitor server. The agent on the EBS server (services) keeps trying to send hardware data to the original fleet monitor server. However, it is fault tolerant by design, so as not to screw with production if it fails. The hardware guys probably don’t even know about this. They probably don’t monitor things like server or process memory. X4: memory leak in the hardware agent on the EBS server. Lens #1 (S): The hardware guys should know that they are part of a bigger system than just hardware monitoring. Was there a systems view for QA and smoke testing of agent code changes? Lens #2 (P): Are the hardware (agent) developers doing TDD/CD like the EBS devs? Do they have the same theories? Do the EBS guys do CD smoke testing with the hardware monitoring agents? X3 is the HW agent devs’ bug.
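One way a fault-tolerant agent can avoid turning an unreachable server into a memory problem is to cap what it holds on to. The story does not say how the real agent leaked, so this is only an illustrative sketch under an assumption: that undelivered reports pile up while the monitor server is unreachable. The class name `BoundedReportBuffer` and its parameters are hypothetical.

```python
from collections import deque


class BoundedReportBuffer:
    """Holds undelivered reports, discarding the oldest once the cap is hit."""

    def __init__(self, max_reports: int = 1000):
        # deque(maxlen=...) silently drops from the left when full.
        self._pending = deque(maxlen=max_reports)
        self.dropped = 0  # how many reports were discarded to stay bounded

    def add(self, report) -> None:
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1
        self._pending.append(report)

    def flush(self, send) -> int:
        """Try to deliver pending reports in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                send(self._pending[0])
            except OSError:
                break  # server unreachable: keep the rest for later
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

The `dropped` counter matters as much as the cap: it is a measure somebody can alert on, so the failure stays visible instead of being masked by the fault tolerance.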
  • 32-35. Amazon’s Outage 10/22/2012: X5 -> Out of Memory (X3, X4)->X5. The fault-tolerant code has a memory leak, which creates low memory on the EBS servers while masking the underlying issue... The EBS server starts to run out of memory. X5: out of memory.
  • 36-40. Amazon’s Outage 10/22/2012: X6 -> Throttling (X->Y) X5->X6. Low memory wakes up the humans (yellow), who see it in the systems monitoring DB. The fucking humans get involved and all hell breaks loose. The humans see that something is wrong: memory is low on the EBS servers. They start to throttle API calls because of the low memory, without knowing why it is low (local optimization). X6: throttling. The systems guys see this as an X->Y issue (low memory, therefore throttle). However, really it’s (X0, X2, X4) in concert with (X3) conjoined to cause X6. Chances are they might not even know about the fleet server failover. The hardware guys have no idea that there is a DNS issue (X2->X3) (X4, X5). Lens #1 (S) (X->Y): Humans try to correct the memory issue with throttling, and they don’t understand hardware monitoring as a subsystem. Lens #2 (V): The systems guys don’t understand common vs. special cause variation; they react to an “S” that should have been a “C”. Turns out... Lens #3 (K): Measures without results are not fixes (throttling). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened? Lens #4 (P): Tribal understanding of behavior differences between hardware guys and systems guys.
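The common vs. special cause distinction in Lens #2 (V) is usually drawn with control limits. The sketch below is a simplified illustration, assuming 3-sigma limits computed from the sample standard deviation of a baseline (a real individuals chart would typically estimate sigma from moving ranges). The free-memory numbers and the function names are made up for the example.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits: baseline mean +/- `sigmas` standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def classify(value, baseline, sigmas=3.0):
    """Label a new reading as 'common' (inside limits) or 'special' (outside)."""
    lower, upper = control_limits(baseline, sigmas)
    return "common" if lower <= value <= upper else "special"


# Hypothetical baseline: percent free memory on an EBS server during normal operation.
baseline_free_mem = [52, 48, 50, 51, 49, 50, 47, 53]

print(control_limits(baseline_free_mem))
print(classify(55, baseline_free_mem))  # ordinary wobble
print(classify(20, baseline_free_mem))  # something changed in the system
```

The point of the lens is the response, not the arithmetic: a reading inside the limits calls for working on the system as a whole, while a reading outside them calls for finding the specific cause before acting.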
  • 41-46. Amazon’s Outage 10/22/2012: X7 -> API Issues (X6->X7) (X5->X7). Things continue to get worse... Some customers (yellow) were already having issues, but throttling makes it worse for them. X7: API issues. Lens #3 (K): Measures without results are not fixes (throttling). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened? X7 is caused by both X6 and X5 independently (X6 just made things worse).
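Lens #3 (K) asks for the result of an intervention to be measured against its three possible outcomes. A minimal sketch of that check follows, assuming a lower-is-better metric such as API error rate sampled before and after the change. The metric, the tolerance, and the name `assess_change` are assumptions for illustration only.

```python
from statistics import fmean


def assess_change(before, after, tolerance):
    """Compare mean error rate before and after a change (lower is better).

    Returns 'better', 'same', or 'worse'; differences within `tolerance`
    are treated as no change.
    """
    delta = fmean(after) - fmean(before)
    if delta < -tolerance:
        return "better"
    if delta > tolerance:
        return "worse"
    return "same"


# Hypothetical API error rates sampled before and after throttling was applied.
before_throttle = [0.10, 0.12, 0.11]
after_throttle = [0.20, 0.22, 0.21]
print(assess_change(before_throttle, after_throttle, tolerance=0.01))
```

In this made-up data the throttle makes things worse, which is the cue to roll it back rather than stack another fix on top of it.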
  • 47-52. Amazon’s Outage 10/22/2012: X9 -> EBS Failover (X7->X9) (X8->X9). They still can’t pin down the problem, so they decide to force a failover (red) of the EBS server. Again, local optimization. X9: EBS failover. Lens #1 (K): Measures without results are not fixes (failover). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened? Lens #2 (P): Not understanding customer behavior. Customers increase their actions (API calls): testing services, QA, smoke tests. X8 is customers hammering the services...
  • 53-57. Amazon’s Outage 10/22/2012: X10 -> Twitter Effect (X8->X10) (X9->X10). Meanwhile, the throttling causes a bigger problem. The Twitter effect kicks in: people start hammering the AWS API web services to test availability. People start testing out all sorts of aaS services (smoke tests start). Curiosity tests start. Come on, admit it: how many of you tried to spank the monkey last Monday? (Netflix Chaos Monkey.) Load goes up... X10: Twitter effect, introducing another external subsystem. Lens #1 (S): Not including this whole other subsystem (admittedly, this is hard). Lens #2 (P): Not understanding the Twitter effect and other outside ecosystems. They might have handled this, but for the purposes of this presentation it’s fun to assume they didn’t, as a learning exercise. X10 is an aggregate effect from one customer to other customers and non-customers.
  • 58-61. Amazon’s Outage 10/22/2012: X11 -> FA Server Dies (X6->X11) (X10->X11). The failure becomes a systemic breakdown... Now the backup (FA) server fails. X11: failover server fails. Could be X6 (throttling). Might be X10 (Twitter effect). Who the f knows...
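The X labels on the slides form a small causal graph, and walking it shows how far the story is from a single X->Y. The sketch below encodes the arrows from the slide titles as edges; reading `((X0->X1->X2)->X4)` as a chain and `(X3, X4)->X5` as two separate edges is my interpretation, and the helper names are hypothetical.

```python
# Cause -> effect edges as read from the slide titles (interpretation, see above).
EDGES = [
    ("X0", "X1"), ("X1", "X2"), ("X2", "X4"), ("X3", "X4"),
    ("X3", "X5"), ("X4", "X5"), ("X5", "X6"), ("X6", "X7"),
    ("X5", "X7"), ("X7", "X9"), ("X8", "X9"), ("X8", "X10"),
    ("X9", "X10"), ("X6", "X11"), ("X10", "X11"),
]


def parents_of(edges):
    """Map each effect to the set of its direct causes."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)
    return parents


def ancestors(node, parents):
    """Return every direct or indirect cause of `node`."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def root_causes(node, parents):
    """Ancestors of `node` that have no recorded cause of their own."""
    return {a for a in ancestors(node, parents) if a not in parents}


PARENTS = parents_of(EDGES)
print(sorted(PARENTS["X6"]))              # what the X->Y view sees for throttling
print(sorted(ancestors("X6", PARENTS)))   # what actually led to it
print(sorted(root_causes("X11", PARENTS)))
```

The throttling decision (X6) has one direct cause but six upstream ones, and the dead failover server (X11) traces back to three independent roots: the server failure, the agent bug, and customer behavior.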
  • 62-64. Amazon’s Outage 10/22/2012: Systemic Outage X->Y. The whole system is hosed... The complexity was masked. Too bad they had not read Deming...