Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

An updated, latest and greatest version in my Deming to DevOps series.

Published in: Education
  1. Analyzing a Complex Cloud Outage
     @botchagalupe, VP of Services, enStratus
     Saturday, December 1, 2012
     Notes:
     - John Willis
     - Call me Botchagalupe
     - VP of Services
  2. WHO AM I
     Notes:
     - 30 yrs IT
     - Ubuntu cloud evangelist startup
     - Opscode
     - DTO awesome dudes
     - Enstratus GR called..
  3. GOALS
     • Look at a complex cloud outage.
     • Understanding complexity.
     • Analyze a complex cloud outage.
     Notes:
     - Review bullets...
  4. Amazon’s EBS Outage 10/22/2012
     Notes:
     - Fed reserve story, just the WSJ part
     - Amazon outages are a big deal
     - Minecraft
  5. Amazon’s EBS Outage 10/22/2012
  6. Amazon’s EBS Outage 10/22/2012
     Notes:
     - Let’s take a look at the value stream (VS) of the service that failed on 10/22.
     - If we look in the middle we see the green storage server.
     - This box is a simplified view of a larger service (meaning many servers; KISS for now).
     - We always try to look at a VS from right to left (from the customer back).
     - In this example customers use something called EBS (block storage, a cloud-based SAN).
     - Next is the thing that is often left out of most value streams: the humans. In this example they were an integral part of the system.
     - Next we have an operations monitoring database (disk): things the humans need to know about.
     - Could talk about autonomation (pre-automation): Sakichi Toyoda’s loom stopped automatically if threads broke or ran out.
     - Next there is the EBS server failover (FA) machine. Most production systems in large IT centers will have failover.
     - It is important to realize that there are core services (e.g., EBS) and non-core services on this box.
       - Non-core services are things like the operations monitoring agent that feeds into the monitoring disk DB.
       - Also, as we will see in a minute, there is a hardware agent on this EBS server for detecting hardware failures.
     - Next is a fleet monitor server (FMS): basically a hardware monitor that can phone home or automatically order replacements for defective parts from the manufacturer (in large infrastructures like Amazon, Google, and Facebook this is common).
     - It has a failover server.
  7. Amazon’s EBS Outage 10/22/2012: The EBS System
  8. Amazon’s EBS Outage 10/22/2012: Server Failure
     Notes:
     - On this one fine afternoon the fleet server died. Remember, it is receiving data from the hardware (HW) agents on the EBS server.
  9. Amazon’s EBS Outage 10/22/2012: Server Failover
     Notes:
     - At this point there is most likely automated failover (FA/HA).
     - We see the FA server, which is now supposed to be logically in the value stream (the new arrow).
     - Any systems thinkers out there see the first problem with this red circle?
  10. Amazon’s EBS Outage 10/22/2012: DNS Propagation Failure
     Notes:
     - The FA/HA seems to work flawlessly, from the fleet monitor (FM) guys’ perspective.
     - However, our second problem happens: DNS does not update its records correctly.
     - Therefore the HW agent running on the EBS server is still pointing to the down fleet server. (Everyone see that?)
  11. Amazon’s EBS Outage 10/22/2012: Agent Memory Leak
     Notes:
     - Now a third problem jumps in (the yellow box).
     - When the HW agent tries to write back to the wrong FMS (the dead one), some not-well-tested code fails and creates a memory leak.
     - To make matters worse, this particular agent is designed to be fault tolerant. In other words, it should fail silently and not disrupt any core service. It is designed so that it is OK to fail if it can’t send to the FMS; the assumption is that it will get it next time.
     - Now you can start seeing the first level of a complex system emerge.
       - The first issue seems to be fixed (the FMS failover).
       - DNS isn’t showing up as a failure on anybody’s dashboard.
       - And we have a silent error occurring on one of our core services (a customer-facing service).
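One way to picture the "fail silently, but don't leak" idea from the notes above is the sketch below. It is only an illustration under stated assumptions: `HardwareAgent`, `send`, and `max_pending` are hypothetical names, not anything from Amazon's agent. The point is that unsent reports sit in a bounded buffer, so an unreachable fleet monitor cannot grow memory without limit, and a failure counter keeps the silent error visible to monitoring.

```python
from collections import deque


class HardwareAgent:
    """Hypothetical reporting agent; illustrative only, not Amazon's code."""

    def __init__(self, send, max_pending=100):
        # send: callable that delivers one reading and raises OSError on failure
        self._send = send
        # Bounded buffer: once full, the oldest unsent reading is dropped
        self._pending = deque(maxlen=max_pending)
        # Exposed so a dashboard can alert on a "silent" failure
        self.consecutive_failures = 0

    def report(self, reading):
        """Queue a reading and try to flush; never raise into the host service."""
        self._pending.append(reading)
        try:
            while self._pending:
                self._send(self._pending[0])
                self._pending.popleft()
            self.consecutive_failures = 0
        except OSError:
            # Stay quiet toward the core service, but count the failure
            self.consecutive_failures += 1

    @property
    def pending(self):
        """Number of readings still waiting to be sent."""
        return len(self._pending)
```

With a dead endpoint, `pending` never exceeds `max_pending`, while `consecutive_failures` keeps climbing and can be alerted on.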
  12. Amazon’s EBS Outage 10/22/2012: EBS Service is slowing down
     Notes:
     - The memory leak continues undetected and eventually starts slowing down the EBS service because of low memory.
     - Key point here: it is probably still not detected by the IT staff, and maybe it’s just starting to annoy customers, but not enough to turn the customer box yellow (yet).
  13. Amazon’s EBS Outage 10/22/2012: Throttling the API
     Notes:
     - At some point the IT staff notices the slowdown. We would hope before the customers complain, and in AMZN’s case that is probably true (they are pretty good).
     - However, as we said earlier, they still don’t know why it’s slow.
     - Another bit of complexity is introduced here.
       - The EBS servers always run hot (high) on memory, so the memory leak most likely goes unnoticed at this point. (We will discuss this in detail when we get to the analysis part of this preso.)
       - From AMZN’s RCA it was pretty clear that this was the case: they had not detected the memory leak.
       - Next a human interaction takes place: they (the humans) decide to activate a throttling tool.
       - They use this to throttle customer API requests as a stopgap, to give them time to figure out and hopefully fix the issue (the slowdown).
  14. Amazon’s EBS Outage 10/22/2012: Customer Issues
     Notes:
     - By now the customer is getting a double whammy.
       - One, they were already experiencing slow responses from the service.
       - Two, the throttling has now really made it worse for them.
  15. Amazon’s EBS Outage 10/22/2012: EBS Failover
     Notes:
     - This situation continues, and the IT staff still doesn’t know why the EBS service is having issues.
     - The customers’ situation gets worse.
     - And now the IT guys decide to punt again (like throttling).
       - They force a FA/HA of the EBS service (servers).
       - Keep in mind they still don’t know what is wrong. Grasping at straws twice now; remember this for later.
       - Notice that the new HA/FA server is now in place (show the arrow).
       - Complexity strikes again. This is a classic IT outage scenario, where something seems to be fixed when it really isn’t.
       - The new FA server seems to have solved the problem. The new server is not slow at first...
       - However, what they don’t know is that all they have done is delay the inevitable.
     - The memory leak just starts all over again on the FA server.
       - The customer is still orange, mainly because of throttling.
  16. Amazon’s EBS Outage 10/22/2012: Twitter Effect
     Notes:
     - At this point we start getting what are called indirect effects.
       - The first effect (and this was in the RCA) is that users tend to use more services when a potential outage is perceived. They start testing more services and trying other services.
       - The next indirect effect is what I call the Twitter effect: the outage starts trending on Twitter and everyone in the extended system starts trying to kick the tires on AWS. “Let’s start up Netflix.” “I wonder if GitHub is working OK.” ...
  17. Amazon’s EBS Outage 10/22/2012: Failover Server Dies
     Notes:
     - And of course the FA server eventually gets to the same state as the original EBS server.
     - Meanwhile, it is very likely that the IT staff still does not know why this is all happening.
  18. Amazon’s EBS Outage 10/22/2012: Systemic Outage
     Notes:
     - Now our complete system is in a systemic failure.
     - Ironically, the original failed-over FMS is just fine (no red there). No one is using it. Remember why?
     - 10 minutes
  19. Understanding Complexity
     Notes:
     - So let’s talk about complexity from a theoretical standpoint.
     - Typically humans think linearly. Our first instinct is that it’s always an X->Y.
     - One variable X will change the outcome Y (Y is the dependent variable).
       - That is for a new improvement (change, bug fix, maintenance, etc.)
       - An emergency (like the Amazon issues)
       - A new product, feature, etc.
  23. Understanding Complexity
     Notes:
     - However, in real life it’s never really X->Y; you usually get many variables.
     - In statistics this is referred to as “don’t confuse correlation with causation”.
     - X->Y is correlation, but it’s dangerous to assume it’s causation.
     - Real life is not that simple; we call it the messiness of life.
     - X1: a simple server failure
     - X2: the failover
     - X3: the DNS
     - Deming wrote of Chanticleer, the barnyard rooster who had a theory. He crowed every morning, putting forth all his energy, flapped his wings. The sun came up. The connexion was clear: His crowing caused the sun to come up. There was no question about his importance. There came a snag. He forgot one morning to crow. The sun came up anyhow.
  24. Understanding Complexity (T1, T2)
     Notes:
     - You also more likely get time-dependent variables that add to the complexity.
     - X1-X3 happen at T1 and X4-X5 happen at T2.
     - X4 is the memory leak.
     - X5: the dreaded throttling...
  25. Understanding Complexity (T1, T2)
     Notes:
     - There are also indirect effects on the dependent variable (Y).
     - For example, X1 in concert with X4 can conjointly affect the dependent variable Y.
       - X1 changes X4, and the combined effect on Y is different.
       - Same with X5.
     - This is a different model than a simple X->Y.
     - X?: the customers respond with more usage
     - X?: the Twitter effect
     - 15 minutes
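To make "X1 in concert with X4" concrete, here is a toy comparison between a purely additive model and one with an interaction term. The weights (1.0, 2.0, 5.0) are made-up numbers chosen only for illustration; nothing here is fitted to the real outage. The sketch just shows that when two factors interact, the joint effect on Y is not the sum of the individual effects.

```python
def additive(x1, x4):
    """Linear thinking: each factor contributes on its own (illustrative weights)."""
    return 1.0 * x1 + 2.0 * x4


def with_interaction(x1, x4):
    """Same main effects, plus a term that only appears when both factors are present."""
    return 1.0 * x1 + 2.0 * x4 + 5.0 * x1 * x4


# Compare both models over every on/off combination of the two factors
for x1 in (0, 1):
    for x4 in (0, 1):
        print(x1, x4, additive(x1, x4), with_interaction(x1, x4))
```

The two models agree whenever only one factor is present and diverge only when both are, which is the pattern an X->Y view misses.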
  26. W. Edwards Deming (1900 – 1993)
     • Father of Quality
     • Understanding of the system
     • Understanding variation
     • Understanding human behavior
     • Introduced sampling into US Census
     • WWII success credited to his quality approach
     • Taught Japan after WWII and transformed quality
     • In 1980 transformed American quality revolution
     • The Foundations of Six Sigma
     Notes:
     - There is a tool that has been used by successful companies like Toyota (lean) and many others.
     - Dr. W. Edwards Deming gave us such a lens to break down complexity
       - (of the real world, just like a camera does)
     - 20 minutes
  27. System of Profound Knowledge (SoPK)
     Notes:
     - Let’s say a lens for improvement of something (an enhancement, a bug fix, a new product idea).
     - An outcome X->Y.
     - Dr. Deming gave us a tool called “The System of Profound Knowledge”.
     - Just common sense... However, Mark Twain said there is nothing common about common sense.
     - SoPK is a lens to break down complexity and give ourselves an advantage, so we do not oversimplify what we are trying to do.
     - In other words, clear up the messiness of real life just like a camera lens does.
  28. Knowledge of a System
     • Systems Thinking
     • End to End Value Stream
     • What is the Aim of a System?
     • The Purpose of the System
     • Global Optimization
     • Not Local Optimization
     Notes:
     - (S) Appreciation of a System: systems thinking. Deming would say understanding the AIM of a system.
     - Deming said every system must have an AIM.
     - Is your AIM to keep a server up or to protect a customer SLA? (They might not be the same thing, as we will soon see.)
     - Eli Goldratt (TOC) would say global optimization over local optimization, even if a subsystem ends up sub-optimal. Understanding subsystems and dependent systems.
  29. Knowledge of a System
     Notes:
     - One big exercise of non-systems thinking...
     - Clearly there were independent views of the system.
     - What was the AIM of this system?
     - Did the hardware guys have the same aim as the core services guys?
     - Lens #1: Not having a systems view. Not seeing this as dependent systems. You might say surely they had automation to DNS. However, I would say no. Because...
     - Lens #2: The hardware guys should know that they are part of a bigger system than just hardware monitoring. They had code on a core service. Was it smoke tested, immune tested? Was there a systems view for QA and smoke testing of agent code changes?
     - Lens #3 (X->Y): Humans try to correct the memory issue with throttling and they don’t understand hardware monitoring as a subsystem. Local optimization...
  31. Knowledge of Variation
     • There is always Variation
     • Special Cause Variation
     • Common Cause Variation
     • Understanding Variation
     Notes:
     - Continuous improvement requires the understanding of variation.
     - You have a power outage and it takes key personnel a long time to get to the data center. That is special cause variation. A bad reaction would be to create a new policy that all personnel live within 5 minutes of the data center (i.e., treating it like common cause).
     - Conversely, firing a new programmer who brings down a production system would be treating common cause as a special cause situation. More than likely it was bad safeguards, insufficient training...
     - (V) Variation: not understanding variation is the root of all evil. Deming would get mad at people. Knee-jerk reactions come from not understanding the kind of variation. How do you understand variation? Statistics (primarily standard deviation and its relationship to a process, i.e., its distribution).
     - To give you an example: a large cloud provider rate-limits API calls at 100 (why 100?) per (x). For most customers that’s fine; others, however, get treated as a DDoS. Where did they get 100? It had to be a guess. If they understood SPC (variation) they might derive the number and have a CI process in place for when they found special cause variation.
  32. Control Chart
     • Approximately 99% - 100% of the values will fall within 3 standard deviations of the mean
     • Approximately 90% - 98% of the values will fall within 2 standard deviations of the mean
     • Approximately 60% - 78% of the values will fall within 1 standard deviation of the mean
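The three-standard-deviation band above can be sketched in a few lines. This is a simplified illustration, not a full SPC implementation: it estimates the spread with the plain sample standard deviation of a baseline period, whereas production control charts often use moving-range estimates. The function names and sample numbers are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits computed from a stable baseline period."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center, center + sigmas * spread


def special_cause_points(baseline, samples, sigmas=3.0):
    """Return (index, value) pairs from samples that fall outside the control limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [(i, x) for i, x in enumerate(samples) if x < lower or x > upper]
```

Points inside the limits are treated as common cause variation (noise in the process); points outside are candidates for special cause investigation.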
  33. Knowledge of Variation
     Notes:
     - The biggest issue here is the knee-jerk reactions: throttling and the forced EBS server failover. They didn’t understand the type of variation.
     - Lens #1: The systems guys don’t understand common vs. special cause variation. They react to an “S” that should have been a “C”.
     - Turns out ... monitoring sub processes monitoring looking at individual monitors ... e.g., they might have gone from 95% to 96%, which caused the issue. However, if they were looking at the individual agent memory they...
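The contrast between an aggregate number that barely moves (95% to 96%) and one process that grows steadily can be sketched as a trend check on per-process samples. This is a hedged illustration: `slope`, `steadily_growing`, and the threshold are hypothetical, and the readings are invented, not taken from the RCA.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sampling interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def steadily_growing(samples, min_slope):
    """True if the fitted trend rises at least min_slope per interval."""
    return slope(samples) >= min_slope


# Invented readings: the host total barely moves while one agent climbs steadily
host_memory_pct = [95.0, 95.2, 95.1, 95.6, 96.0]
agent_memory_mb = [40, 55, 70, 85, 100]
print(slope(host_memory_pct), slope(agent_memory_mb))
```

Watching the individual process makes the growth obvious long before the host-level percentage looks unusual.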
  34. Theory of Knowledge
     • Scientific Method
     • Knowledge Must Have Theory
     • Theory Must Have Prediction
     • Prediction Must Have Tests
     • Aim-->Measure-->Change
     Notes:
     - (K) is the simplest but the hardest for most people to understand. Simply put, it is applying the scientific method to everything you do. Deming says you must have theory to have knowledge, you can’t have knowledge without prediction, and a prediction without a test is useless.
     - PDSA; others call it AMC: Aim, Measure (a. what process are you going to change, b. measure if the change worked), Change. You have to test any improvement to see if it worked, failed, or did nothing. Imagine someone starting a failover system with automation but not testing to see if it really worked (could never happen).
  35. Theory of Knowledge
     Notes:
     - Lens #1 (K): Theory cannot be an unmeasured guess. Whoever did the failover (automation or manual) apparently didn’t have a proper measure for success. They should have verified that they were actually using the new server (duh).
     - Lens #2: Measures without results are not fixes (throttling). They should have looked at the results.
     - Three potential outcomes: a) gets better, b) stays the same, c) gets worse.
     - What do you think happened?
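A "proper measure for success" for the failover could be as small as the check below: after the change, confirm that the service name really resolves to the replacement server. This is a sketch under assumptions, with a hypothetical function name, hostname, and addresses; the resolver is injectable so the check can be exercised without a network. To be meaningful it would have to run where the agents run, since that is where the stale record mattered in this story.

```python
import socket


def failover_took_effect(hostname, new_ip, resolve=socket.gethostbyname):
    """Return True only if hostname currently resolves to the replacement server."""
    try:
        current = resolve(hostname)
    except OSError:
        # Resolution failures (socket.gaierror is an OSError) count as "not verified"
        return False
    return current == new_ip


# Hypothetical usage with a stubbed resolver that still returns the dead server's address
stale = lambda name: "10.0.0.1"
print(failover_took_effect("fleet-monitor.example.internal", "10.0.0.2", resolve=stale))
```

The prediction ("clients now use the new server") gets an explicit test instead of being assumed from the failover itself having completed.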
  36. Theory of Psychology
     • Understanding Behavior
     • Understanding Tribes
     • Understanding Worldviews
     Notes:
     - (P) Another easy one, but the hardest to implement. Understanding behavior: why people do the things they do. Tribal behavior: things that are important to one group might not be important to other groups. Understanding human behavior (another lens factor).
     - Worldviews: imagine a server that has software on it from two totally different dev groups. Further imagine these two groups’ worldviews are far apart. One does agile CI, TDD, BDD, CD; the other has never even heard of those things (groupthink experiment)...
  37. Theory of Psychology
     Notes:
     - Lens #1 (P): We could argue that, because the fleet servers are managed by hardware guys and DNS by systems guys, maybe they are different cultural tribes and don’t understand the importance of each other. Maybe they don’t go to lunch together.
     - Lens #2 (P): Are the hardware (agent) developers doing TDD/CD like the EBS devs? Do they have the same theories? Do the EBS guys do CD smoke testing with the hardware monitoring agents?
     - Lens #3 (P): Tribal understanding of behavior differences between hardware guys and systems guys.
     - Lens #4 (P): Not understanding customer behavior. Customers increase their actions (API calls): testing services, QA, smoke tests.
  38. Amazon’s Outage 10/22/2012: Let’s Review
     Notes:
     - X1: a simple server failure
     - X2: the failover
     - X3: the DNS
     - X4: the memory leak
     - X5: bad TDD hygiene by FMS eng/dev
     - X6: the dreaded throttling...
     - X7: the customers respond with more usage
     - X8: EBS server failover
     - X9: the Twitter effect
     - The complexity was masked.
     - This was not an X->Y.
     - Too bad they had not read Deming...
  39. Amazon’s Outage 10/22/2012: Let’s Review (X->Y)
