Advanced PostMortem Fu and Human Error 101 (Velocity 2011)

Not sure that these slides will make too much sense without the video, but here they are.

Transcript

  • 1. Advanced PostMortem Fu & Human Error 101. John Allspaw, VP, Tech Ops. Velocity 2011
  • 2. We Want YOU etsy.com/careers
  • 3. Science, Time Travel, Mythbusting, Reading, Homework
  • 4. Here To Challenge You
  • 5. Resilience Engineering: Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  • 6. Complex, Dynamic
  • 7. Fundamental Surprises
  • 8. E.T.T.O. (Efficiency-Thoroughness Trade-Off): Efficiency, Thoroughness
  • 9. BLUNT end: Organizations, Policies, Procedures, Regulations. Resources & Constraints
  • 10. BLUNT end: Organizations, Policies, Procedures, Regulations. Resources & Constraints. SHARP end: Operators (Slips, Mistakes, Lapses, Violations; Adjustments, Compensations, Recoveries, Improvisations)
  • 11. FUTURE + PAST
  • 12. Why Do Them?
  • 13. Why? Understand the failure
  • 14. Why? Understand the system
  • 15. Where “System” = Networks, Servers, Applications, Processes, People
  • 16. Where “System” = Networks, Servers, Applications, Processes, People
  • 17. People
  • 18. (Anticipation) Knowing What To Expect
  • 19. (Anticipation) Knowing What To Expect; (Monitoring) Knowing What To Look For
  • 20. (Anticipation) Knowing What To Expect; (Monitoring) Knowing What To Look For; (Response) Knowing What To Do
  • 21. (Anticipation) Knowing What To Expect; (Monitoring) Knowing What To Look For; (Response) Knowing What To Do; (Learning) Knowing What Has Happened
  • 22. (Anticipation) Knowing What To Expect; (Monitoring) Knowing What To Look For; (Response) Knowing What To Do; (Learning) Knowing What Has Happened
  • 23. Microphones are ON?
  • 24. Event Awareness: Code Deploys
  • 25. TIMELINE IRC logging = Rich Data
  • 26. TIMELINE Traces of Data IRC logging = Rich Data
  • 27. IRC Logs Fed Into Solr
  • 28. Status Blog, Twitter Feed
  • 29. Flight Data Recorder
  • 30. Annotation Traces
  • 31. Investigation Basics
  • 32. Start?
  • 33. TTD (time to detect): How?
  • 34. TTR (time to resolve)
  • 35. Stable (“all clear”)
  • 36. Impact Time = TTR - Start
  • 37. SEVERITY
  • 38. Severity 1: A. Total loss of service; B. Severe degradation, effectively unusable; C. Loss of a critical feature
  • 39. Severity 2: A. Major degradation/feature loss for SUBSET of members; B. Minor degradation/feature loss for ALL members
  • 40. Severity 3: Noticeable non-critical feature loss or degradation
  • 41. Severity 4: No visible impact, loss of redundancy or capacity headroom
  • 42. Severity 5: No-impact but unexpected failure
  • 43. 5/11/2011 - Payments/Checkout system issue: Start 4:10pm, TTD 4:15pm, TTR 4:30pm, Stable 4:35pm, Total Impact 20 min, Severity 1
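
The timing terms on slides 32 to 36 and the worked example on slide 43 can be sketched in a few lines of Python. This is only an illustration: the class and function names (`IncidentTimes`, `at`) are hypothetical, and measuring the detection delay from Start is an assumption, since the slides only state Impact Time = TTR - Start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimes:
    """Clock times named on the slides: Start, TTD, TTR, Stable ("all clear")."""
    start: datetime     # when the problem started
    detected: datetime  # TTD: when the problem was detected
    resolved: datetime  # TTR: when the problem was resolved
    stable: datetime    # when "all clear" was declared

    def impact_time(self) -> timedelta:
        # Slide 36: Impact Time = TTR - Start
        return self.resolved - self.start

    def detection_delay(self) -> timedelta:
        # Assumption: measured from Start; the slides do not define this.
        return self.detected - self.start


def at(hhmm: str) -> datetime:
    """Hypothetical helper: a time of day on 5/11/2011, the date on slide 43."""
    return datetime.strptime(f"2011-05-11 {hhmm}", "%Y-%m-%d %H:%M")


# Slide 43: Start 4:10pm, TTD 4:15pm, TTR 4:30pm, Stable 4:35pm
payments = IncidentTimes(at("16:10"), at("16:15"), at("16:30"), at("16:35"))
print(payments.impact_time())  # 0:20:00, matching "Total Impact 20 min"
```

Keeping the raw timestamps and deriving the durations from them means the same record can later answer other timing questions without re-reading the timeline.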
  • 44. • Basic metrics • Timeline with details • Remediations/Observations
  • 45. “Normal” operation, Incident, PostMortem (Time)
  • 46. How? Why? Prevention?
  • 47. How
  • 48. Crisis Patterns: Problem Starts, PostMortem (Time)
  • 49. Crisis Patterns: Problem Starts, Detection, PostMortem (Time)
  • 50. Crisis Patterns: Problem Starts, Detection, Evaluation, PostMortem (Time)
  • 51. Crisis Patterns: Problem Starts, Detection, Evaluation, Response, PostMortem (Time)
  • 52. Crisis Patterns: Problem Starts, Detection, Evaluation, Response, Stable, PostMortem (Time)
  • 53. Crisis Patterns: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, PostMortem (Time)
  • 54. Crisis Patterns: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem (Time)
  • 55. Crisis Patterns: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem (Time); Stress
  • 72. Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem (Time)
  • 73. Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem (Time)
  • 74. Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem (Time). Internal and External Update Communications
  • 75. Crisis Patterns: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear (Time)
  • 76. Crisis Patterns
  • 77. Crisis Patterns: Forced beyond learned roles
  • 78. Crisis Patterns: Forced beyond learned roles • Actions whose consequences are both important and difficult to see
  • 79. Crisis Patterns: Forced beyond learned roles • Actions whose consequences are both important and difficult to see • Cognitively and perceptively noisy
  • 80. Crisis Patterns: Forced beyond learned roles • Actions whose consequences are both important and difficult to see • Cognitively and perceptively noisy • Coordinative load increases exponentially
  • 81. Thematic Vagabonding: “butterfly minds”. NOT STUCK ENOUGH
  • 82. Goal Fixation (encystment): TOO STUCK
  • 83. Heroism: Non-communicating lone wolf-isms
  • 84. Distraction: Irrelevant noise in comm channels
  • 85. IMPROVISATION: REQUIREMENT for troubleshooting complex systems
  • 86. IMPROVISATION
  • 87. IMPROVISATION
  • 88. Why?
  • 89. Root Cause Analysis: “With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none.” Friedrich Nietzsche, Twilight of the Idols, or How to Philosophize with a Hammer
  • 90. Hindsight Bias: Knowledge of the outcome influences the analysis of the process. Makes steps towards failure appear foreseeable and obvious
  • 91. After The Fact
  • 92. After The Fact
  • 93. Reality: Before and During
  • 94. Hindsight Bias: “Should have known better.” “All the signs were there, you just needed to pay attention.”
  • 95. Hindsight Bias. BEFORE accident: The future seems implausible. AFTER accident: Obviously clear: “how could they not see what mistake they were about to make”
  • 96. Hindsight Bias: “...people’s need to be right is stronger than their ability to be objective.” N. Crawford, American Psychological Association
  • 97. Outcome Bias: Judging a past decision based on its outcome.
  • 98. ID, Fishbone/Ishikawa, FMEA, Five Whys, Fault Tree, CED, CRT
  • 99. why? OUTAGE
  • 100. why? because this. OUTAGE
  • 101. why? why? because this. OUTAGE
  • 102. why? why? because this, because this. OUTAGE
  • 103. why? why? why? because this, because this. OUTAGE
  • 104. why? why? why? because this, because this, because this. OUTAGE
  • 105. why? why? why? because this, because this, because this. OUTAGE. but: WHY?
  • 106. Some Action caused Some Things, which caused other things to happen: OUTAGE
  • 107. Sequence-Of-Events
  • 108. • Satisfyingly simple, easy to explain and document
  • 109. • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  • 110. NOT HELPFUL • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  • 111. Epidemiological (adapted from Reason, 1990)
  • 112. (adapted from Reason, 1990)
  • 113. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen (adapted from Reason, 1990)
  • 114. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen. Cheese = Safety Barriers, Layers of Defense (adapted from Reason, 1990)
  • 115. (adapted from Reason, 1990)
  • 116. Code, Servers (adapted from Reason, 1990)
  • 117. Code, Servers, Schedule, Training (adapted from Reason, 1990)
  • 118. Unmonitored Disk Space (latent condition)
  • 119. Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition)
  • 120. Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition
  • 121. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition
  • 122. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure)
  • 123. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure)
  • 130. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure). FAILURE!
  • 131. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure)
  • 132. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure)
  • 133. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure)
  • 134. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure). NO FAILURE!
  • 135. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends)
  • 136. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  • 137. Better, but still NOT HELPFUL • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  • 138. Multiple Contributors “each necessary but only jointly sufficient”
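
As a toy illustration of “each necessary but only jointly sufficient”, the contributors from slides 118 to 134 can be modeled as a set of conditions where the outage occurs only when all of them are present. This is a deliberately simplified sketch: the function name `outage_occurs` and the all-or-nothing rule are assumptions made for the example, not a model presented in the talk.

```python
from typing import Mapping


def outage_occurs(contributors: Mapping[str, bool]) -> bool:
    """Toy rule (an assumption): the outage needs every contributor present."""
    return all(contributors.values())


# Contributor names taken from slides 118-130.
contributors = {
    "violation of known coding standards": True,
    "unmonitored disk space": True,
    "capacity miscalculation": True,
    "unit test in transition": True,
    "bug introduced": True,
    "external API call": True,
}

# Jointly sufficient: with all contributors present, the outage happens.
print(outage_occurs(contributors))  # True

# Each necessary: removing any single contributor prevents the outage.
prevented = [
    name
    for name in contributors
    if not outage_occurs({**contributors, name: False})
]
print(len(prevented) == len(contributors))  # True
```

In this toy model no single contributor can be singled out as the cause, since removing any one of them has the same effect.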
  • 139. Resultant Versus Emergent
  • 140. Systemic: Database, Router, Memcache, Webserver
  • 141. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap
  • 142. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding
  • 143. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design
  • 144. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car
  • 145. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training
  • 146. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down
  • 147. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity
  • 148. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article
  • 149. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties
  • 150. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  • 151. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  • 152. Systemic
  • 153. Systemic
  • 154. Systemic
  • 155. Systemic
  • 156. Systemic
  • 157. Systemic
  • 158. Systemic
  • 159. Functional Resonance: In isolation, components act within bounds. Interconnected, they produce emerging behaviors.
  • 160. Causes Are Constructed, Not Found. WYLFIWYF (What You Look For Is What You Find): Pre-conceived notions on “causes” and behaviors
  • 161. Contributors, not causes
  • 162. There is no root cause.
  • 163. LEARNING
  • 164. Quantifying Response: Time to detect? Time for escalation, internal notification? Time to notify the public? Time to graceful degradation? (feature off) Time to stable/resolve? Time to all clear?
  • 165. Qualifying Response: High signal:noise in comm channels? Troubleshooting fatigue? Troubleshooting handoff? All tools on-hand? Metrics visibility? Collaborative and skillful communication? Improvised tooling or solutions?
  • 166. All Together Now • Start/TTD/TTR/Stable/etc. • Severity • DATA (graphs, IRC, etc.) • Description (timeline, etc.) • Observations (motivations, latent conditions, etc.) • Actions (remediation tickets, followup)
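
One possible way to hold the sections listed on slide 166 is a small record type that also reports which sections are still empty. This is a sketch under stated assumptions: `PostMortemRecord`, its field names, and `missing_sections` are hypothetical names chosen for the example, not a format described in the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostMortemRecord:
    """One postmortem write-up; field names are illustrative, not from the talk."""
    title: str
    severity: int  # 1 (most severe) to 5, as on slides 38-42
    timings: str = ""  # Start/TTD/TTR/Stable summary
    data: List[str] = field(default_factory=list)  # graphs, IRC logs
    description: str = ""  # timeline narrative
    observations: List[str] = field(default_factory=list)  # motivations, latent conditions
    actions: List[str] = field(default_factory=list)  # remediation tickets, followup

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that are still empty."""
        sections = {
            "timings": self.timings,
            "data": self.data,
            "description": self.description,
            "observations": self.observations,
            "actions": self.actions,
        }
        return [name for name, value in sections.items() if not value]


# Draft using only the facts given on slide 43.
draft = PostMortemRecord(
    title="5/11/2011 - Payments/Checkout system issue",
    severity=1,
    timings="Start 4:10pm, TTD 4:15pm, TTR 4:30pm, Stable 4:35pm",
)
print(draft.missing_sections())  # ['data', 'description', 'observations', 'actions']
```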
  • 167. (Anticipation) Knowing What To Expect; (Monitoring) Knowing What To Look For; (Response) Knowing What To Do; (Learning) Knowing What Has Happened
  • 168. (Anticipation) Knowing What To Expect; (Monitoring) Knowing What To Look For; (Response) Knowing What To Do; (Learning) Knowing What Has Happened
  • 169. Human Error: “...knowledge and error flow from the same mental sources, only success can tell one from the other.” Ernst Mach, 1905
  • 170. Human Error: Nobody comes to work to do a bad job.
  • 171. Human Error: Useless as a label and ending point.
  • 172. Human Error: Human error isn’t a cause, it’s an effect.
  • 173. Why did it make sense to the person at the time?
  • 174. Error Categories • Slips • Lapses • Mismatches • Violations
  • 175. Error Categories
  • 176. First Stories: “Human error” seen as root cause. Counterfactuals: saying what they “should” have done. Prevention: be “more careful!”
  • 177. Second Stories: “Human error” seen as systemic vulnerabilities, deeper inside the organization. Digging into why it made sense for them to do what they did, at the time they did it. Prevention...
  • 178. Why did it make sense to the person at the time?
  • 179. Why did it make sense to the person at the time?
  • 180. Why did it make sense to the person at the time?
  • 181. Substitution Test: Could peers have made the same error under the same circumstances?
  • 182. WHERE to learn from ?
  • 183. Two Propositions
  • 184. 100 deploys, 6 deploy-related issues
  • 185. 100 > 6
  • 186. Proposition #1: “Ways in which things go right are special cases of the ways in which things go wrong.”
  • 187. Proposition #1: Successes = failures gone wrong. Study the failures, generalize from that. Data sources: 6 out of 100
  • 188. Proposition #2: “Ways in which things go wrong are special cases of the ways in which things go right.”
  • 189. Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Data sources: 94 out of 100
  • 190. 94/100? OR 6/100?
  • 191. What and WHY Do Things Go RIGHT?
  • 192. Not just: why did we fail? But also: why did we succeed?
  • 193. Near Misses: Hey everybody - Don’t be like me. I tried to X, but because it was no good. It almost exploded everyone. So, don’t do: (details about X) Love, Joe
  • 194. Taking the New View: Recognize that human error is an attribution.
  • 195. Taking the New View: Pursue Second Stories.
  • 196. Taking the New View: Escape Hindsight Bias.
  • 197. Taking the New View: Understand work as performed at the sharp end.
  • 198. Taking the New View: Examine how changes (at all layers) will produce new vulnerabilities.
  • 199. Taking the New View: Use technology to support and enhance human expertise.
  • 200. Taking the New View: Tame complexity through new forms of feedback.
  • 201. Taking the New View: Realize that your systems are not inherently safe.
  • 202. Taking the New View: Human error is an inevitable by-product of strained complex systems.
  • 203. Taking the New View: Human error isn’t at the root of your safety problems.
  • 204. Taking the New View: Human error isn’t random.
  • 205. (Anticipation) Knowing What To Expect; (Monitoring) Knowing What To Look For; (Response) Knowing What To Do; (Learning) Knowing What Has Happened
  • 206. Just Culture
  • 207. Just Culture: Balancing accountability with learning.
  • 208. Intentional Malice
  • 209. Negligence Found Severity of the Outage
  • 210. Name, Blame, Shame
  • 211. Name, Blame, Shame
  • 212. Name, Blame, Shame. WHY? #!@%$&#
  • 213. “Must set an example!”
  • 214. “Has to be some fear that not doing one’s job correctly could lead to punishment.”
  • 215. “Must set an example!” “Has to be some fear that not doing one’s job correctly could lead to punishment.” Punishing Deterrents is a Losing Proposition
  • 216. Holding People Accountable != Blaming People
  • 217. Accountability and Learning Punishment For Errors
  • 218. No Bad Apples, Only Bad Theories of Error
  • 219. Name, Blame, Shame. WHY? #!@%$&#
  • 220. Signs Of Old View: “Gross Misconduct”, “Carelessness”, “Negligence”, “Egregious Behavior”, “Willful Violations”
  • 221. Discretionary Spaces
  • 222. Acceptable / Unacceptable
  • 223. Acceptable / Unacceptable
  • 224. Acceptable / Unacceptable
  • 225. Acceptable / Unacceptable
  • 226. Acceptable / Unacceptable (who draws this subjective line?)
  • 227. Increase Accountability By Supporting Learning
  • 228. Empower People: Let them own their own stories. Don’t make people pay penalties. Allow them to educate the organization.
  • 229. Reduce Uncertainty: Make it clear who defines acceptable behavior.
  • 230. Organizational Roots: Accountability = Responsibility + Requisite Authority
  • 231. (Anticipation) Knowing What To Expect; (Monitoring) Knowing What To Look For; (Response) Knowing What To Do; (Learning) Knowing What Has Happened
  • 232. (Thanks, Fellas) Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  • 233. Homework!
  • 234. We Want YOU etsy.com/careers
  • 235. Photo Credits: http://www.flickr.com/photos/51035644987@N01/2678090600/ • http://www.flickr.com/photos/67196253@N00/2941655917/ • http://www.flickr.com/photos/stirwise/417629641 • http://www.flickr.com/photos/38383999@N06/3888057995/ • http://www.flickr.com/photos/94443490@N00/361543080/ • http://www.flickr.com/photos/7729940@N06/4333396494/ • http://www.flickr.com/photos/cpstorm/167418602 • http://www.flickr.com/photos/63474264@N00/4366221069/ • http://www.flickr.com/photos/proimos/4199675334 • http://www.flickr.com/photos/30475691@N07/2862060992/ • http://www.flickr.com/photos/14663487@N00/797755046/ • http://www.flickr.com/photos/25080113@N06/5361445631/
  • 236. THE END