Outages, PostMortems, and Human Error

This is a talk I gave at the Code As Craft Etsy tech talk series.

Published in: Technology, Business

  1. Outages, Post-Mortems, and Human Error 101. John Allspaw, SVP, Tech Ops
  2. Science, Time Travel, Mythbusting, Reading, Homework
  3. Here To Challenge You
  4. Resilience Engineering: Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  5. COMPLEX + DYNAMIC
  6. COMPLEX != COMPLICATED
  7. Fundamental Surprises
  8. E.T.T.O.: Efficiency-Thoroughness Trade-Off
  9. BLUNT end: Organizations, Policies, Procedures, Regulations (Resources & Constraints)
  10. BLUNT end: Organizations, Policies, Procedures, Regulations (Resources & Constraints). SHARP end: Operators (Slips, Mistakes, Lapses, Violations; Adjustments, Compensations, Recoveries, Improvisations)
  11. FUTURE + PAST
  12. Why Do Them?
  13. Why? Understand the FAILURE
  14. Why? Understand the SYSTEM
  15. Where “System” = Networks, Servers, Applications, Processes, People
  16. Where “System” = Networks, Servers, Applications, Processes, People
  17. PEOPLE
  18. Knowing What To Expect (Anticipation)
  19. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring)
  20. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response)
  21. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  22. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  23. Microphones are ON?
  24. Event Awareness: Code Deploys
  25. Timeline, RICH DATA
  26. Traces of Data, Timeline, RICH DATA
  27. Status Blog, Twitter Feed
  28. Flight Data Recorder
  29. Annotation Traces
  30. Investigation Basics
  31. Start?
  32. TTD: How?
  33. TTR
  34. Stable (“all clear”)
  35. Impact Time = TTR - Start
  36. SEVERITY
  37. SEV 1: A. Total loss of service; B. Severe degradation, effectively unusable; C. Loss of a critical feature
  38. SEV 2: A. Major degradation/feature loss for SUBSET of members; B. Minor degradation/feature loss for ALL members
  39. SEV 3: Noticeable non-critical feature loss or degradation
  40. SEV 4: No visible impact; loss of redundancy or capacity headroom
  41. SEV 5: No impact, but unexpected failure
  42. 5/11/2011 - Payments/Checkout system issue. Start: 4:10pm; TTD: 4:15pm; TTR: 4:30pm; Stable: 4:35pm; Total Impact: 20 min; Severity: 1
  43. Basic metrics; Timeline with details; Remediations/Observations
  44. Timeline: “normal” operation, Incident, PostMortem (axis: Time)
  45. How? Why? Prevention?
  46. How
  47. Crisis Patterns timeline (axis: Time): Problem Starts, PostMortem
  48. Crisis Patterns timeline (axis: Time): Problem Starts, Detection, PostMortem
  49. Crisis Patterns timeline (axis: Time): Problem Starts, Detection, Evaluation, PostMortem
  50. Crisis Patterns timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, PostMortem
  51. Crisis Patterns timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, Stable, PostMortem
  52. Crisis Patterns timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, PostMortem
  53. Crisis Patterns timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
  54. Crisis Patterns timeline (axes: Stress, Time): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
  71. Timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
  72. Timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
  73. Timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem; Internal and External Update Communications
  74. Crisis Patterns timeline (axis: Time): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear
  75. Crisis Patterns
  76. Crisis Patterns: Forced beyond learned roles
  77. Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see
  78. Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see; Cognitively and perceptively noisy
  79. Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see; Cognitively and perceptively noisy; Coordinative load increases exponentially
  80. Thematic Vagabonding (“butterfly minds”): NOT STUCK ENOUGH
  81. Goal Fixation (encystment): TOO STUCK
  82. Heroism: Non-communicating lone wolf-isms
  83. Distraction: Irrelevant noise in comm channels
  84. IMPROVISATION: REQUIREMENT for troubleshooting complex systems
  85. IMPROVISATION
  86. IMPROVISATION
  87. Why?
  88. Root Cause Analysis: “With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none.” Friedrich Nietzsche, Twilight of the Idols, or How to Philosophize with a Hammer
  89. Hindsight Bias: Knowledge of the outcome influences the analysis of the process. Makes steps towards failure appear foreseeable and obvious.
  90. After The Fact
  91. After The Fact
  92. Reality: Before and During
  93. Hindsight Bias: “Should have known better.” “All the signs were there, you just needed to pay attention.”
  94. Hindsight Bias. BEFORE accident: The future seems implausible. AFTER accident: Obviously clear: “how could they not see what mistake they were about to make?!”
  95. Hindsight Bias: “...people’s need to be right is stronger than their ability to be objective.” N. Crawford, American Psychological Association
  96. Outcome Bias: Judging a past decision based on its outcome.
  97. ID, Fishbone/Ishikawa, FMEA, Five Whys, Fault Tree, CED, CRT
  98. why? -> OUTAGE
  99. why? (because this) -> OUTAGE
  100. why? -> why? (because this) -> OUTAGE
  101. why? (because this) -> why? (because this) -> OUTAGE
  102. why? -> why? (because this) -> why? (because this) -> OUTAGE
  103. why? (because this) -> why? (because this) -> why? (because this) -> OUTAGE
  104. why? (because this) -> why? (because this) -> why? (because this) -> OUTAGE, but: WHY?
  105. Some Action -> Caused Some Things -> which caused other things to happen -> OUTAGE
  106. Sequence-Of-Events
  107. • Satisfyingly simple, easy to explain and document
  108. • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  109. NOT HELPFUL: • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  110. Epidemiological (adapted from Reason, 1990)
  111. (adapted from Reason, 1990)
  112. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen (adapted from Reason, 1990)
  113. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen; Cheese = Safety Barriers, Layers of Defense (adapted from Reason, 1990)
  114. (adapted from Reason, 1990)
  115. Code, Servers (adapted from Reason, 1990)
  116. Code, Servers, Schedule, Training (adapted from Reason, 1990)
  117. Unmonitored Disk Space (latent condition)
  118. Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition)
  119. Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition
  120. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition
  121. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure)
  122. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure)
  129. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure): FAILURE!
  130. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure)
  133. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure): NO FAILURE!
  134. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends)
  135. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  136. Better, but still NOT HELPFUL: • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  137. Multiple Contributors “each necessary but only jointly sufficient”
  138. Resultant Versus Emergent
  139. Systemic: Database, Router, Memcache, Webserver
  140. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap
  141. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding
  142. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design
  143. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car
  144. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training
  145. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down
  146. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity
  147. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article
  148. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties
  149. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  150. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  151. Systemic
  152. Systemic
  153. Systemic
  154. Systemic
  155. Systemic
  156. Systemic
  157. Systemic
  158. Functional Resonance: In isolation, components act within bounds. Interconnected, they produce emerging behaviors.
  159. Causes Are Constructed, Not Found. WYLFIWYF (What You Look For Is What You Find): Pre-conceived notions on “causes” and behaviors
  160. Contributors, not causes
  161. There is no root cause.
  162. LEARNING
  163. Quantifying Response: Time to detect? Time for escalation, internal notification? Time to notify the public? Time to graceful degradation? (feature off) Time to stable/resolve? Time to all clear?
  164. Qualifying Response: High signal:noise in comm channels? Troubleshooting fatigue? Troubleshooting handoff? All tools on-hand? Metrics visibility? Collaborative and skillful communication? Improvised tooling or solutions?
  165. All Together Now: • Start/TTD/TTR/Stable/etc. • Severity • DATA (graphs, IRC, etc.) • Description (timeline, etc.) • Observations (motivations, latent conditions, etc.) • Actions (remediation tickets, followup)
  166. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  167. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  168. Human Error: “...knowledge and error flow from the same mental sources, only success can tell one from the other.” Ernst Mach, 1905
  169. Human Error: Nobody comes to work to do a bad job.
  170. Human Error: Useless as a label and ending point.
  171. Human Error: Human error isn’t a cause, it’s an effect.
  172. Why did it make sense to the person at the time?
  173. Error Categories: • Slips • Lapses • Mismatches • Violations
  174. Error Categories
  175. First Stories: “Human error” seen as root cause. Counterfactuals: saying what they “should” have done. Prevention: be “more careful!”
  176. Second Stories: “Human error” seen as systemic vulnerabilities, deeper inside the organization. Digging into why it made sense for them to do what they did, at the time they did it. Prevention...
  177. Why did it make sense to the person at the time?
  178. Why did it make sense to the person at the time?
  179. Why did it make sense to the person at the time?
  180. Substitution Test: Could peers have made the same error under the same circumstances?
  181. Near Misses: Hey everybody - Don’t be like me. I tried to X, but because it was no good. It almost exploded everyone. So, don’t do: (details about X) Love, Joe
  182. WHERE to learn from?
  183. Two Propositions
  184. 100 deploys, 6 deploy-related issues
  185. 100 > 6
  186. Proposition #1: “Ways in which things go right are special cases of the ways in which things go wrong.”
  187. Proposition #1: Successes = failures gone wrong. Study the failures, generalize from that. Data sources: 6 out of 100
  188. Proposition #2: “Ways in which things go wrong are special cases of the ways in which things go right.”
  189. Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Data sources: 94 out of 100
  190. 94/100? OR 6/100?
  191. What and WHY Do Things Go RIGHT?
  192. Not just: why did we fail? But also: why did we succeed?
  193. Taking the New View: Recognize that human error is an attribution.
  194. Taking the New View: Pursue Second Stories.
  195. Taking the New View: Escape Hindsight Bias.
  196. Taking the New View: Understand work as performed at the sharp end.
  197. Taking the New View: Examine how changes (at all layers) will produce new vulnerabilities.
  198. Taking the New View: Use technology to support and enhance human expertise.
  199. Taking the New View: Tame complexity through new forms of feedback.
  200. Taking the New View: Realize that your systems are not inherently safe.
  201. Taking the New View: Human error is an inevitable by-product of strained complex systems.
  202. Taking the New View: Human error isn’t at the root of your safety problems.
  203. Taking the New View: Human error isn’t random.
  204. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  205. Just Culture
  206. Just Culture: Balancing accountability with learning.
  207. Intentional Malice
  208. Negligence Found Severity of the Outage
  209. Name, Blame, Shame
  210. Name, Blame, Shame
  211. Name, Blame, Shame. WHY? #!@%$&#
  212. “Must set an example!”
  213. “Has to be some fear that not doing one’s job correctly could lead to punishment.”
  214. “Must set an example!” “Has to be some fear that not doing one’s job correctly could lead to punishment.” Punishing Deterrents is a Losing Proposition
  215. Holding People Accountable != Blaming People
  216. Accountability and Learning Punishment For Errors
  217. No Bad Apples, Only Bad Theories of Error
  218. Name, Blame, Shame. WHY? #!@%$&#
  219. Signs Of Old View: “Gross Misconduct”, “Carelessness”, “Negligence”, “Egregious Behavior”, “Willful Violations”
  220. Discretionary Spaces
  221. Acceptable / Unacceptable
  222. Acceptable / Unacceptable
  223. Acceptable / Unacceptable
  224. Acceptable / Unacceptable
  225. Acceptable / Unacceptable (who draws this subjective line?)
  226. Increase Accountability By Supporting Learning
  227. Empower People: Let them own their own stories. Don’t make people pay penalties. Allow them to educate the organization.
  228. Reduce Uncertainty: Make it clear who defines acceptable behavior.
  229. Organizational Roots: Accountability = Responsibility + Requisite Authority
  230. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  231. (Thanks, Fellas) Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  232. Homework!
  233. THE END
  234. Photo Credits:
       • http://www.flickr.com/photos/51035644987@N01/2678090600/
       • http://www.flickr.com/photos/67196253@N00/2941655917/
       • http://www.flickr.com/photos/stirwise/417629641
       • http://www.flickr.com/photos/38383999@N06/3888057995/
       • http://www.flickr.com/photos/94443490@N00/361543080/
       • http://www.flickr.com/photos/7729940@N06/4333396494/
       • http://www.flickr.com/photos/cpstorm/167418602
       • http://www.flickr.com/photos/63474264@N00/4366221069/
       • http://www.flickr.com/photos/proimos/4199675334
       • http://www.flickr.com/photos/30475691@N07/2862060992/
       • http://www.flickr.com/photos/14663487@N00/797755046/
       • http://www.flickr.com/photos/25080113@N06/5361445631/
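
A note on slides 35 to 42: the basic incident metrics can be captured in a small record. The sketch below is one possible shape, not the format used in the talk. The class and field names (`Incident`, `detected`, `resolved`) are assumptions made for illustration, and TTD/TTR are treated as clock times, which is how slide 42 lists them. The numbers are the ones from the 5/11/2011 example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels, paraphrased from slides 37 to 41."""
    SEV1 = 1  # total loss of service, severe degradation, or loss of a critical feature
    SEV2 = 2  # major degradation for a subset of members, or minor degradation for all
    SEV3 = 3  # noticeable non-critical feature loss or degradation
    SEV4 = 4  # no visible impact; loss of redundancy or capacity headroom
    SEV5 = 5  # no impact, but unexpected failure


@dataclass
class Incident:
    # Hypothetical field names; each timestamp is a clock time, as on slide 42.
    title: str
    start: datetime
    detected: datetime  # "TTD" on the slide
    resolved: datetime  # "TTR" on the slide
    stable: datetime    # "all clear"
    severity: Severity

    def time_to_detect(self) -> timedelta:
        return self.detected - self.start

    def impact_time(self) -> timedelta:
        # Slide 35: Impact Time = TTR - Start
        return self.resolved - self.start


incident = Incident(
    title="Payments/Checkout system issue",
    start=datetime(2011, 5, 11, 16, 10),
    detected=datetime(2011, 5, 11, 16, 15),
    resolved=datetime(2011, 5, 11, 16, 30),
    stable=datetime(2011, 5, 11, 16, 35),
    severity=Severity.SEV1,
)

print(incident.impact_time())     # 0:20:00
print(incident.time_to_detect())  # 0:05:00
```

The 20-minute result matches the "Total Impact" on slide 42. The record covers only the "basic metrics" half of a postmortem; the timeline, observations, and remediations from slide 165 are free-form and are left out here.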

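
A note on slides 117 to 137: the phrase "each necessary but only jointly sufficient" can be made concrete with a toy model. The contributor names below come from the slides; modelling them as a set, and the `outage_occurs` function itself, are assumptions made purely for illustration. Real incidents are not this tidy, and the talk itself calls the layered picture "better, but still not helpful".

```python
# Contributors named on slide 129 (all present: FAILURE!).
REQUIRED = frozenset({
    "violation of known coding standards",
    "unmonitored disk space",
    "capacity miscalculation",
    "unit test in transition",
    "bug introduced",
    "external API call",
})


def outage_occurs(present):
    """Toy rule: the outage happens only when every contributor is present."""
    return REQUIRED <= set(present)


slide_129 = REQUIRED
# Slide 133 shows only four of the six contributors (NO FAILURE!).
slide_133 = REQUIRED - {
    "violation of known coding standards",
    "unmonitored disk space",
}

print(outage_occurs(slide_129))  # True
print(outage_occurs(slide_133))  # False

# Removing any single contributor prevents this outage,
# so no one of them can be singled out as "the" root cause.
print(all(not outage_occurs(REQUIRED - {c}) for c in REQUIRED))  # True
```

In this model every contributor passes the "would the outage have happened without it?" test equally, which is one way to read slide 161's "There is no root cause."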