A Cloud Outage Under the Lens of “Profound Knowledge”

This was supposed to be the SOTU at NYC Devopsdays 11/1/2012. I will also be doing a screencast later today.

    Document Transcript

    • A Cloud Outage Under the Lens of “Profound Knowledge” @botchagalupe (slide 1, Wednesday, October 31, 12)
      Welcome to Devopsdays NYC (the first one), hell yeah...
      Normally I do the SOTU, but I have done a few this year and they’re all about the same (on video).
      This morning I am going to Demingize you all by telling you a cloud outage story.
      I’m going to use something called the System of Profound Knowledge (sound profound?).
      No apologies for spelling and grammar in the notes. If that kind of stuff annoys you, please wait for the screencast.
    • GOALS (slide 2)
      • Understanding Complexity
      • Overview of SoPK
      • Amazon’s Outage on 10/22/12
      Goody, we are going to talk about big bad old Amazon’s outage last week...
    • SoPK - Understanding Complexity (slide 3)
      An improvement might be an upgrade, a bug fix, an emergency change, a new product...
      One variable X will change the outcome Y.
      X -> Y (Y is the dependent variable)
    • SoPK - Understanding Complexity (slide 4)
      In real life you get many variables (the messiness of life).
      There are direct effects on the dependent variable (Y).
    • SoPK - Understanding Complexity T1 T2 (slide 5)
      You also get time-dependent variables.
    • SoPK - Understanding Complexity T1 T2 (slide 6)
      There are also indirect effects on the dependent variable (Y).
      For example, X1 in concert with X4 conjointly affects the dependent variable Y, as does X3 -> X4.
      This is a different model than X -> Y.
    • System of Profound Knowledge (SoPK) (slide 7)
      Do we have any photographers in the audience?
      Use a camera lens as a metaphor for SoPK.
      They call this the exposure triangle.
      To take a perfect picture of an event you must have a good lens and understand how it works.
      The ISO must be understood for sensitivity to light.
      The aperture must be understood for DOF (a portrait or an area).
      The shutter speed must be understood for motion.
    • System of Profound Knowledge (SoPK) (slide 8)
      • Appreciation of a system
      • Knowledge of variation
      • Theory of knowledge
      • Knowledge of psychology
      Well, Dr. Deming gave us such a lens to break down complexity (the real world), just like a camera does.
      Let’s say a lens for improvement of something (an enhancement, a bug fix, a new product idea). An outcome X -> Y.
      Dr. Deming gave us a tool called “The System of Profound Knowledge”.
      SoPK is a lens to break down complexity and give ourselves an advantage so that we do not oversimplify what we are trying to do. In other words, it clears up the messiness of real life, just like a camera lens does.
      (S) Appreciation of a System - Systems thinking. Deming would say understanding the AIM of a system. Deming said every system must have an AIM. Is your AIM to keep a server up or to protect a customer SLA? (They might not be the same thing, as we will soon see.) Eli Goldratt (TOC) would say global optimization over local optimization, even if that leaves a subsystem suboptimal. Understanding subsystems and dependent systems.
      (V) Variation - Not understanding variation is the root of all evil. Deming would get mad at ppl. Knee-jerk reactions come from not understanding the kind of variation. How do you understand variation? Statistics (primarily standard deviation and its relationship to a process, i.e., its distribution).
      To give you an example: a large cloud provider rate-limits API calls at 100 per (x). For most customers that’s fine; others, however, get treated as a DDoS. Where did they get
    • Amazon’s EBS Outage 10/22/2012: This is one System (slide 9)
      Disclaimers: Amazon’s true story, my knowledge of SoPK, literary license for training purposes.
      An EBS service outage, correct?
      The way it’s supposed to work:
      • Fleet monitor (hardware monitoring)
      • Failover for both the fleet monitor and the EBS server
      • DNS, of course
      • Metrics: performance monitoring to disk from the agent...
      • Of course humans... remember, they are part of every system
      We will give Amazon the benefit of the doubt that it’s in their value stream map. However, a lot of orgs do not have this “human” process in the VSM.
      Pre-automation... remediation systems (complex adaptive systems), or just plain old humans.
      LENS OF SOPK: This is a system, not just the EBS service. Hint: why did I say it’s just one system?
    • Amazon’s EBS Outage 10/22/2012: X0 -> Server Failure (slide 10)
      The monitor server has a failure (system down).
      X0 - The fleet management monitoring server fails.
    • Amazon’s EBS Outage 10/22/2012: X1 -> Failover (slide 11)
      X1 - Fleet management failover. Anyone see this first issue?
      Lens #1 (S) Not having a systems view: not seeing this as dependent systems. You might say surely they had automation for DNS. However, I would say no, because...
      Lens #2 (K) A theory cannot be an unmeasured guess. Whoever did the failover (automation or manual) apparently didn’t have a proper measure of success. They should have verified that they were actually using the new server (duh).
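The "measure of success" in Lens #2 (K) can be made concrete with a post-failover check. The following is only a minimal sketch, not Amazon's actual procedure: the hostname and IP addresses are made-up placeholders, and the resolver is injectable so the check can be exercised without touching real DNS.

```python
import socket
from typing import Callable


def failover_verified(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True only if `hostname` now resolves to the standby's address."""
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # A lookup failure counts as a failed check, never as a silent pass.
        return False


# Hypothetical names: the old monitor is 10.0.0.1, the standby is 10.0.0.2,
# and DNS still hands out the old address after the failover.
stale_dns = {"fleet-monitor.internal.example": "10.0.0.1"}
print(failover_verified(
    "fleet-monitor.internal.example",
    "10.0.0.2",
    resolve=lambda name: stale_dns[name],
))  # False
```

A check like this only covers one resolver's view; agents with cached lookups would need their own verification.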
    • Amazon’s Outage 10/22/2012: X2 -> DNS Failure (slide 12)
      DNS does not propagate...
      X2 - DNS propagation failure.
      Lens #1 (P) We could argue that maybe the fleet servers are managed by hardware guys and DNS by systems guys, and maybe they are different cultural tribes that don’t understand the importance of each other. Maybe they don’t go to lunch together.
    • Amazon’s Outage 10/22/2012: X4 -> Memory Leak ((X0->X1->X2)->X4) X3->X4 (slide 13)
      So now we have fixed the first problem, the bad server. Seemed like a flawless operation, right?
      But DNS still says the offline server is the monitor server.
      The agent on the EBS server (services) keeps trying to send hw data to the original fleet monitor server. However, it is fault tolerant by design, so as not to screw w/ production if it fails...
      The hardware guys probably don’t even know about this. They probably don’t monitor things like server or process memory.
      X4 - Memory leak in the hardware agent on the EBS server.
      Lens #1 (S) The hardware guys should know that they are part of a bigger system than just the hardware monitor. Was there a systems view for QA and smoke testing of agent code changes?
      Lens #2 (P) Are the hardware (agent) developers doing TDD/CD like the EBS devs? Do they have the same theories? Do the EBS guys do CD smoke testing with the hardware monitoring agents?
      X3 is the HW agent devs’ bug.
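Monitoring process memory, which the notes suspect nobody was doing for the agent, could be as simple as watching the trend of periodic samples. This is a rough sketch under assumptions: the sample values and the 1 MB-per-sample threshold are invented for illustration, and a real detector would need to handle noise and restarts.

```python
from typing import Sequence


def growth_per_sample(rss_mb: Sequence[float]) -> float:
    """Least-squares slope of memory use against sample index (MB per sample)."""
    n = len(rss_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(rss_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(rss_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(rss_mb: Sequence[float], min_slope_mb: float = 1.0) -> bool:
    """Flag steady growth; the threshold is an arbitrary illustrative value."""
    return growth_per_sample(rss_mb) >= min_slope_mb


# Hypothetical agent memory samples, in MB.
print(looks_like_leak([100, 102, 104, 106, 108]))  # True
print(looks_like_leak([100, 101, 100, 99, 100]))   # False
```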
    • Amazon’s Outage 10/22/2012: X5 -> Out of Memory (X3, X4)->X5 (slide 14)
      The fault-tolerant code has a memory leak, which masks the issue and creates low memory on the EBS servers...
      The EBS server starts to run out of memory.
      X5 - Out of memory.
    • Amazon’s Outage 10/22/2012: X6 -> Throttling (X->Y) X5->X6 (slide 15)
      The low memory wakes up the humans (yellow), who see low memory in the systems monitoring DB.
      The fucking humans get involved and all hell breaks loose.
      The humans see that something is wrong: memory is low on the EBS servers. They start to throttle API calls because of the low memory, without knowing why it is low (local optimization).
      X6 - Throttling.
      The systems guys see this as an X -> Y issue (low memory, therefore throttle). However, really it’s (X0, X2, X4) in concert with (X3) conjoined to cause X6. Chances are they might not even know about the fleet server failover.
      The hardware guys have no idea that there is a DNS issue (X2->X3) (X4,X5).
      Lens #1 (S) (X->Y) Humans try to correct the memory issue with throttling, and they don’t understand hardware monitoring as a subsystem.
      Lens #2 (V) The systems guys don’t understand common vs. special cause variation... they react to an “S” that should have been a “C”. Turns out
      Lens #3 (K) Measures without results are not fixes (throttling). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened?
      Lens #4 (P) Tribal understanding of behavior differences between hardware guys and systems guys...
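The common vs. special cause distinction in Lens #2 (V) is usually drawn with a control chart. Below is a simplified sketch of the three-sigma rule; the baseline numbers are invented, and a proper individuals chart would typically estimate sigma from moving ranges rather than the plain sample standard deviation used here.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def control_limits(baseline: Sequence[float]) -> Tuple[float, float]:
    """Lower and upper limits at three standard deviations around the mean."""
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - 3 * sigma, center + 3 * sigma


def classify(value: float, limits: Tuple[float, float]) -> str:
    """'special' if the point falls outside the limits, else 'common'."""
    lower, upper = limits
    return "special" if value < lower or value > upper else "common"


# Hypothetical baseline: percent free memory on a healthy server.
limits = control_limits([42, 40, 41, 43, 39, 41, 40, 42])
print(classify(38, limits))  # common: routine noise, do not tamper
print(classify(20, limits))  # special: something changed, go find the cause
```

The point of the limits is to decide which kind of response is warranted before reacting, rather than treating every low reading the same way.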
    • Amazon’s Outage 10/22/2012: X7 -> API Issues (X6->X7) (X5->X7) (slide 16)
      Things continue to get worse...
      Some customers (yellow) were already having issues, but throttling makes it worse for them.
      X7 - API issues.
      Lens #3 (K) Measures without results are not fixes (throttling). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened?
      X7 is caused by both X6 and X5 independently (X6 just made things worse).
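"Looking at the results" in Lens #3 (K) means comparing a metric before and after the intervention and naming which of the three outcomes occurred. A minimal sketch, assuming a lower-is-better metric such as API error rate; the 5% tolerance and the sample numbers are illustrative assumptions only.

```python
def classify_outcome(before: float, after: float, tolerance: float = 0.05) -> str:
    """Compare a lower-is-better metric before and after a change."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    change = (after - before) / before  # relative change against the baseline
    if change < -tolerance:
        return "better"
    if change > tolerance:
        return "worse"
    return "same"


# Hypothetical API error rates before and after throttling was switched on.
print(classify_outcome(before=0.02, after=0.08))  # worse
```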
    • Amazon’s Outage 10/22/2012: X9 -> EBS Failover (X7->X9) (X8->X9) (slide 17)
      They still can’t pin down the problem, so they decide to force a failover of the (red) EBS server. Again, local optimization.
      X9 - EBS failover.
      Lens #1 (K) Measures without results are not fixes (failover). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened?
      Lens #2 (P) Not understanding customer behavior... Customers increase their actions (API calls): testing services, QA, smoke tests.
      X8 is customers hammering the services...
    • Amazon’s Outage 10/22/2012 X10 -> Twitter Effect (X8->X10) (X9->X10) 18 Wednesday, October 31, 12 Meanwhile, the throttling leads to a bigger problem. The Twitter effect kicks in: people start hammering the AWS API web services to test availability. People start testing out all sorts of aaS services (smoke tests start). Curiosity tests start. Come on, admit it, how many of you tried to spank the monkey last Monday... (Netflix Chaos Monkey). Load goes up... X10: Twitter effect. Introducing another external subsystem. Lens #1 (S): Not including this whole other subsystem (admittedly this is hard). Lens #2 (P): Not understanding the Twitter effect and other outside ecosystems... They might have handled this, but for the purposes of this presentation it’s fun to assume they didn’t, as a learning exercise. X10 is an aggregate effect from one customer to other customers and non-customers.
    • Amazon’s Outage 10/22/2012 X11 -> FA Server Dies (X6->X11) (X10->X11) 19 Wednesday, October 31, 12 The system becomes a systemic breakdown... Now the backup (FA) server fails. X11: Failover server fails. Could be X6 (throttling), might be X10 (Twitter effect)... who the f knows...
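To make the "who knows which X caused it" point concrete, here is a minimal sketch that treats the arrows from the last three slide titles as a directed graph and lists every factor upstream of X11. Only the edges come from the slides; the `upstream` helper and the graph representation are assumptions made for illustration, not part of the talk.

```python
from collections import defaultdict

# Cause -> effect arrows as written on slides 17-19.
EDGES = [
    ("X7", "X9"), ("X8", "X9"),    # X9: EBS failover
    ("X8", "X10"), ("X9", "X10"),  # X10: Twitter effect
    ("X6", "X11"), ("X10", "X11"), # X11: failover (FA) server dies
]


def upstream(target, edges):
    """Return every factor that can reach `target` by following arrows."""
    parents = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Sort numerically so X10 comes after X9.
print(sorted(upstream("X11", EDGES), key=lambda name: int(name[1:])))
```

Even with just these six arrows, five different factors sit upstream of the failed backup server, which is why pointing at a single X as "the" cause does not hold up.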
    • Amazon’s Outage 10/22/2012 Systemic Outage X->Y 20 Wednesday, October 31, 12 The whole system is hosed... The complexity was masked. Too bad they had not read Deming...