A Cloud Outage Under the Lens of “Profound Knowledge”
1. A Cloud Outage
Under the Lens of
“Profound Knowledge”
@botchagalupe
Wednesday, October 31, 12
Welcome to Devopsdays NYC (the first one), hell yeah...
Normally I do the SOTU, but I have done a few this year and they're all about the same
(on video).
This morning I am going to Demingize you all by telling you a cloud outage story.
I'm going to use something called the System of Profound Knowledge (sound profound?).
#### No apologies for spelling and grammar in the notes. If that kind of stuff annoys
you please wait for the screen cast.
2. GOALS
• Understanding Complexity
• Overview of SoPK
• Amazon’s Outage on 10/22/12
Goody we are going to talk about big bad old Amazon’s outage last week...
3. SoPK - Understanding Complexity
An improvement might be an upgrade, a bug fix, an emergency change, or a new
product.
One variable x will change the outcome y.
x -> y (y is the dependent variable)
7. SoPK - Understanding Complexity
In real life you get many variables (the messiness of life).
There are direct effects on the dependent variable (y).
8. SoPK - Understanding Complexity
T1 T2
You also get time dependent variables
9. SoPK - Understanding Complexity
T1 T2
There are also indirect effects on the dependent variable (y).
For example, X1 in concert with X4 jointly affects the dependent variable y,
as does X3->X4.
This is a different model than X->Y.
10. System of Profound Knowledge (SoPK)
Do we have any photographers in the audience?
We'll use a camera lens as a metaphor for SoPK.
They call this the exposure triangle.
To take a perfect picture of an event you must have a good lens and understand how it
works.
The ISO must be understood for sensitivity to light.
The aperture must be understood for depth of field (a portrait or an area).
The shutter speed must be understood for motion.
11. System of Profound Knowledge (SoPK)
• Appreciation of a system
• Knowledge of variation
• Theory of knowledge
• Knowledge of psychology
Well, Dr. Deming gave us such a lens to break down complexity (the real world), just like a
camera does.
Let's say a lens for improvement of something (an enhancement, a bug fix, a new product
idea).
An outcome: X->y
Dr. Deming gave us a tool called “The System of Profound Knowledge”.
SoPK is a lens to break down complexity and give ourselves an advantage so we don't
oversimplify what we are trying to do. In other words, it clears up the messiness of real life, just like
a camera lens does.
(S) Appreciation of a System - Systems thinking. Deming would say it means understanding the
AIM of a system.
Deming said every system must have an AIM.
Is your AIM to keep a server up or to protect a customer SLA? (They might not be the
same thing, as we will soon see.)
Eli Goldratt (TOC) would say global optimization over local optimization, even if
a subsystem ends up suboptimal. Understand subsystems and dependent systems.
(V) Variation - Not understanding variation is the root of all evil. Deming would get mad
at people for knee-jerk reactions due to not understanding the kind of variation. How do you
understand variation? Statistics (primarily standard deviation and its relationship to a process,
i.e., its distribution).
Let me give you an example. A large cloud provider rate-limits API calls at 100 per (x). For most
customers that's fine; others, however, get treated as a DDoS. Where did they get
12. Amazon’s EBS Outage 10/22/2012
Disclaimers (Amazon's true story, my knowledge of SoPK, literary license for training
purposes)
An EBS services outage, correct?
The way it's supposed to work...
Fleet monitor (hardware monitoring)
Failover for both the fleet monitor and the EBS server
DNS, of course
Metrics: performance monitoring to disk from the agent...
Of course humans... remember they are part of every system.
We will give Amazon the benefit of the doubt that it's in their value stream map (VSM).
However, a lot of orgs do not have this “human” process in the VSM.
Pre-automation... remediation systems (complex adaptive systems),
or just plain old humans.
LENS OF SOPK
## This is a system, not just the EBS service. Hint: Why did I say it's just one system?
14. Amazon’s EBS Outage 10/22/2012
This is one System
15. Amazon’s EBS Outage 10/22/2012
The monitor server has a failure (system down).
X0 - Fleet management monitoring server fails
17. Amazon’s EBS Outage 10/22/2012
X0 -> Server Failure
18. Amazon’s EBS Outage 10/22/2012
X1 - Fleet management failover. Anyone see the first issue?
Lens #1 (S) Not having a systems view: not seeing these as dependent systems. You
might say surely they had automation to update DNS. However, I would say no, because...
Lens #2 (K) A theory cannot be an unmeasured guess. Whoever did the failover
(automation or manual) apparently didn't have a proper measure for success. They should
have verified that they were actually using the new server (duh).
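A "proper measure for success" for the failover could be as small as the check below. This is only a sketch for illustration: the hostname, the IP addresses, and the function name are made up, and it says nothing about how Amazon actually ran the failover. The resolver is passed in so the check can be exercised without touching real DNS; by default it uses the standard library's `socket.gethostbyname`.

```python
import socket
from typing import Callable


def failover_verified(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True only if the name clients use now resolves to the new server."""
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # Resolution failed entirely, so the failover is not verified.
        return False


# Hypothetical stale record: DNS still points at the old monitor server.
stale_dns = {"fleet-monitor.example.internal": "10.0.0.1"}

print(failover_verified("fleet-monitor.example.internal", "10.0.0.2",
                        resolve=stale_dns.__getitem__))  # False
```

The point is the shape of the check, not the resolver: the failover is only "done" when the measure says the dependent systems can reach the new server.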
20. Amazon’s EBS Outage 10/22/2012
X1 -> Failover
22. Amazon’s Outage 10/22/2012
DNS does not propagate...
X2 - DNS propagation failure
Lens #1 (P) We could argue that maybe, because the fleet servers are managed by
hardware guys and DNS by systems guys, they are different cultural tribes
and don't understand the importance of each other. Maybe they don't go to lunch together.
24. Amazon’s Outage 10/22/2012
X2 -> DNS Failure
26. Amazon’s Outage 10/22/2012
So now we have fixed the first problem, the bad server.
Seemed like a flawless operation, right?
But DNS still says the offline server is the monitor server.
The agent on the EBS server (services) keeps trying to send hardware data to the original
fleet monitor server.
However, it is fault tolerant by design, so as not to screw with production if it fails.
The hardware guys probably don't even know about this. They probably don't monitor
things like server or process memory.
X4 - Memory leak in the hardware agent on the EBS server
Lens #1 (S) The hardware guys should know that they are part of a bigger system
than just hardware monitoring. Was there a systems view for QA and smoke testing of
agent code changes?
Lens #2 (P) Are the hardware (agent) developers doing TDD/CD like the EBS devs? Do they
have the same theories? Do the EBS guys do CD smoke testing with hardware
monitoring agents?
X3 is the HW agent devs' bug.
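To make the "fault tolerant but leaking" idea concrete, here is one hypothetical way an agent could hold unsent reports without growing forever. This is not Amazon's agent code, and nothing in these notes says how the real leak happened; the class name, the buffer size, and the `send` callback are all assumptions for the sketch. It uses a bounded `collections.deque`, which drops the oldest entry once it is full.

```python
from collections import deque


class ReportBuffer:
    """Holds unsent hardware reports while the monitor server is unreachable."""

    def __init__(self, max_reports: int) -> None:
        # A bounded deque discards the oldest report once max_reports is reached,
        # so memory use stays capped even if the monitor never comes back.
        self._pending = deque(maxlen=max_reports)

    def add(self, report: str) -> None:
        self._pending.append(report)

    def flush(self, send) -> int:
        """Try to send pending reports in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                send(self._pending[0])
            except OSError:
                break  # stay fault tolerant: keep the rest for a later retry
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


def unreachable(report: str) -> None:
    # Stand-in for a monitor server that DNS still points at but is offline.
    raise ConnectionError("monitor down")


buffer = ReportBuffer(max_reports=100)
for i in range(1000):
    buffer.add(f"disk-health-{i}")

print(len(buffer))                # 100
print(buffer.flush(unreachable))  # 0
```

The trade-off is explicit: old reports get lost, but the agent cannot starve the EBS server of memory.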
28. Amazon’s Outage 10/22/2012
X4 -> Memory Leak
((X0->X1->X2)->X4)
X3->X4
32. Amazon’s Outage 10/22/2012
The fault-tolerant code has a memory leak, so it masks one issue while creating another: low memory
on the EBS servers...
The EBS server starts to run out of memory.
X5 Out of Memory
34. Amazon’s Outage 10/22/2012
X5 -> Out of Memory
(X3, X4)->X5
36. Amazon’s Outage 10/22/2012
The low memory wakes up the humans (yellow), who see low memory in the systems
monitoring DB.
The fucking humans get involved and all hell breaks loose.
The humans see something is wrong: memory is low on the EBS servers.
They start to throttle API calls due to the low memory, without knowing why it is low (local
optimization).
X6 Throttling
Systems guys see this as an X->Y issue (low memory, therefore throttle).
However, really it's (X0,X2,X4) in concert with (X3), conjoined to cause X6.
Chances are they might not even know about the fleet server failover.
The hardware guys have no idea that there is a DNS issue (X2->X3) (X4,X5).
Lens #1 (S) (X->Y) Humans try to correct the memory issue with throttling, and they
don't understand hardware monitoring as a subsystem.
Lens #2 (V) The systems guys don't understand common vs. special cause variation...
they react to an “S” that should have been a “C”. Turns out
Lens #3 (K) Measures without results are not fixes (throttling). They should have
looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets
worse. What do you think happened?
Lens #4 (P) Tribal understanding of behavior differences between hardware guys and
systems guys.
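For the (V) lens, a rough way to tell common cause from special cause variation is a Shewhart-style control limit: take a stable baseline, compute its mean and standard deviation, and only treat points outside mean ± 3 sigma as candidates for special cause. The sketch below assumes made-up memory-usage percentages and a plain population standard deviation; real control charts often estimate sigma differently (for example from moving ranges), so treat this as an illustration only.

```python
from statistics import mean, pstdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits from a baseline of stable readings."""
    center = mean(baseline)
    spread = pstdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def classify(value, limits):
    """'common' if the reading sits inside the limits, otherwise 'special'."""
    lower, upper = limits
    return "common" if lower <= value <= upper else "special"


# Hypothetical memory usage (%) on a server during a stable period.
baseline = [61, 63, 60, 62, 64, 61, 62, 63]
limits = control_limits(baseline)

print(classify(64, limits))  # common
print(classify(78, limits))  # special
```

A reading labeled "common" is noise in the process and reacting to it is tampering; a reading labeled "special" is worth asking "what changed?" about before anyone reaches for the throttle.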
38. Amazon’s Outage 10/22/2012
X6 -> Throttling
(X->Y)
X5->X6
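The arrows drawn on the slides so far can be written down as a small cause-and-effect graph. The sketch below is one reading of those labels (it assumes ((X0->X1->X2)->X4) means a chain ending in X4, and (X3, X4)->X5 means both feed X5). Walking backwards from X6 shows how much sits behind what looks like a simple X->Y (low memory, therefore throttle).

```python
# Edges as (cause, effect), taken from the slide labels up to this point.
EDGES = [
    ("X0", "X1"),  # server failure -> failover
    ("X1", "X2"),  # failover -> DNS failure
    ("X2", "X4"),  # DNS failure -> memory leak shows up
    ("X3", "X4"),  # HW agent bug -> memory leak
    ("X3", "X5"),  # HW agent bug -> out of memory
    ("X4", "X5"),  # memory leak -> out of memory
    ("X5", "X6"),  # out of memory -> throttling
]


def ancestors(node, edges):
    """Return every upstream cause of node, direct or indirect."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)

    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("X6", EDGES)))
# ['X0', 'X1', 'X2', 'X3', 'X4', 'X5']
```

The people throttling only see the last edge, X5->X6; everything upstream of it is invisible from their part of the system.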
41. Amazon’s Outage 10/22/2012
Things continue to get worse...
Some customers (yellow) were already having issues, but throttling makes it worse for
them.
X7 API Issues
Lens #3 (K) Measures without results are not fixes (throttling). They should have
looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets
worse. What do you think happened?
X7 is caused by both X6 and X5 independently (X6 just made things worse).
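The (K) lens, "look at the results", can be reduced to a tiny check: pick one metric, record it before and after the change, and force yourself to name which of the three outcomes you got. The metric (API error rate), the numbers, and the tolerance below are all hypothetical; this is a sketch of the habit, not a monitoring tool.

```python
def judge_intervention(before: float, after: float, tolerance: float = 0.01) -> str:
    """Compare an error rate before and after a change (lower is better)."""
    delta = after - before
    if abs(delta) <= tolerance:
        return "same"
    return "better" if delta < 0 else "worse"


# Hypothetical API error rates measured around the throttling change.
print(judge_intervention(before=0.08, after=0.21))  # worse
```

If the answer is "same" or "worse", the change was a guess, not a fix, and it should be treated as new information about the system rather than as progress.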
43. Amazon’s Outage 10/22/2012
X7 -> API Issues
(X6->X7)
(X5->X7)
47. Amazon’s Outage 10/22/2012
They still can't pin down the problem, so they decide to force a failover of the (red) EBS server.
Again, local optimization.
X9 EBS Failover
Lens #1 (K) Measures without results are not fixes (failover). They should have looked
at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse.
What do you think happened?
Lens #2 (P) Not understanding customer behavior... Customers increase their actions
(API calls): testing services, QA, smoke tests.
X8 is customers hammering the services...
49. Amazon’s Outage 10/22/2012
X9 -> EBS Failover
(X7->X9)
(X8->X9)
53. Amazon’s Outage 10/22/2012
X10 ->Twitter Effect
Meanwhile... the throttling causes a bigger problem.
The Twitter effect kicks in... people start hammering AWS API web services to test
availability.
People start testing out all sorts of aaS services (smoke tests start). Curiosity tests start...
Come on, admit it, how many of you tried to spank the monkey last Monday... (Netflix
Chaos Monkey)
Load goes up...
#X10 Twitter effect
Introducing another external subsystem.
Lens #1 (S) Not including this whole other subsystem (admittedly this is hard).
Lens #2 (P) Not understanding the Twitter effect and other outside ecosystems... They might
have handled this, but for the purposes of this presentation it's fun to assume they didn't,
as a learning exercise.
X9 is an aggregate effect from one customer to other customers and
non-customers...
56. Amazon’s Outage 10/22/2012
X10 -> Twitter Effect
(X8->X10)
57. Amazon’s Outage 10/22/2012
X10 -> Twitter Effect
(X8->X10)
(X9->X10)
58. Amazon’s Outage 10/22/2012
X11 -> FA Server Dies
The system goes into a systemic breakdown...
Now the backup (FA) server fails.
X11 Failover server fails
Could be X6 (throttling), might be X10 (Twitter effect)... who the f knows...
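A minimal sketch of that "who knows" problem, assuming the only arrows in play are the ones drawn on these slides (X8->X10, X9->X10, X6->X11, X10->X11). The helper name upstream_causes is made up for illustration and is not from the talk. Walking the arrows backwards from X11 shows how one direct cause drags in every variable behind it.

```python
from collections import defaultdict

# Causal arrows as drawn on the slides, written as (cause, effect).
EDGES = [
    ("X8", "X10"),   # customers hammering the services -> Twitter effect
    ("X9", "X10"),   # aggregate customer-to-customer effect -> Twitter effect
    ("X6", "X11"),   # throttling -> failover server dies (a guess in the notes)
    ("X10", "X11"),  # Twitter effect -> failover server dies (also a guess)
]


def upstream_causes(edges, node):
    """Hypothetical helper: every variable that can reach `node` via the arrows."""
    parents = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)

    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for cause in parents[current]:
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


candidates = upstream_causes(EDGES, "X11")
# Sort by the numeric part of the name so X10 comes after X9.
print(sorted(candidates, key=lambda name: int(name[1:])))
# -> ['X6', 'X8', 'X9', 'X10']
```

Two direct arrows into X11 already give four candidate variables to investigate, and that is with only the handful of arrows shown here.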
60. Amazon’s Outage 10/22/2012
X11 -> FA Server Dies
(X6->X11)
61. Amazon’s Outage 10/22/2012
X11 -> FA Server Dies
(X6->X11)
(X10->X11)
62. Amazon’s Outage 10/22/2012
Systemic Outage
The whole system is hosed...
The complexity was masked
Too bad they had not read Deming...
64. Amazon’s Outage 10/22/2012
Systemic Outage
X->Y