Advanced PostMortem Fu and Human Error 101 (Velocity 2011)

  1. Advanced PostMortem Fu & Human Error 101. John Allspaw, VP, Tech Ops. Velocity 2011
  2. We Want YOU etsy.com/careers
  3. Science TimeTravel Mythbusting Reading Homework
  4. Here To Challenge You
  5. Resilience Engineering: Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  6. Complex, Dynamic
  7. Fundamental Surprises
  8. E.T.T.O.: Efficiency-Thoroughness Trade-Off
  9. BLUNT end: Organizations, Policies, Procedures, Regulations; Resources & Constraints
  10. BLUNT end: Organizations, Policies, Procedures, Regulations; Resources & Constraints. SHARP end: Operators; Slips, Mistakes, Lapses, Violations; Adjustments, Compensations, Recoveries, Improvisations
  11. FUTURE + PAST
  12. Why Do Them?
  13. Why? Understand the failure
  14. Why? Understand the system
  15. Where “System” = Networks, Servers, Applications, Processes, People
  16. Where “System” = Networks, Servers, Applications, Processes, People
  17. People
  18. Knowing What To Expect (Anticipation)
  19. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring)
  20. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response)
  21. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  22. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  23. Microphones are ON?
  24. Event Awareness Code Deploys
  25. TIMELINE IRC logging = Rich Data
  26. TIMELINE Traces of Data IRC logging = Rich Data
  27. IRC Logs Fed Into Solr
  28. Status Blog Twitter Feed
  29. Flight Data Recorder
  30. Annotation Traces
  31. Investigation Basics
  32. Start?
  33. TTD How?
  34. TTR
  35. Stable (“all clear”)
  36. Impact Time = TTR - Start
  37. SEVERITY
  38. Severity 1: (A) Total loss of service; (B) Severe degradation, effectively unusable; (C) Loss of a critical feature
  39. Severity 2: (A) Major degradation/feature loss for SUBSET of members; (B) Minor degradation/feature loss for ALL members
  40. Severity 3: Noticeable non-critical feature loss or degradation
  41. Severity 4: No visible impact; loss of redundancy or capacity headroom
  42. Severity 5: No impact, but unexpected failure
  43. 5/11/2011 - Payments/Checkout system issue. Start: 4:10pm; TTD: 4:15pm; TTR: 4:30pm; Stable: 4:35pm; Total Impact: 20 min; Severity: 1
  44. Basic metrics • Timeline with details • Remediations/Observations
  45. Timeline: “normal” operation, Incident, PostMortem (axis: Time)
  46. How? Why? Prevention?
  47. How
  48. Crisis Patterns (timeline): Problem Starts, PostMortem
  49. Crisis Patterns (timeline): Problem Starts, Detection, PostMortem
  50. Crisis Patterns (timeline): Problem Starts, Detection, Evaluation, PostMortem
  51. Crisis Patterns (timeline): Problem Starts, Detection, Evaluation, Response, PostMortem
  52. Crisis Patterns (timeline): Problem Starts, Detection, Evaluation, Response, Stable, PostMortem
  53. Crisis Patterns (timeline): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, PostMortem
  54. Crisis Patterns (timeline): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
  55. Crisis Patterns (timeline, annotated “Stress”): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
  72. Timeline: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
  73. Timeline: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem
  74. Timeline: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem; Internal and External Update Communications
  75. Crisis Patterns (timeline): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear
  76. Crisis Patterns
  77. Crisis Patterns • Forced beyond learned roles
  78. Crisis Patterns • Forced beyond learned roles • Actions whose consequences are both important and difficult to see
  79. Crisis Patterns • Forced beyond learned roles • Actions whose consequences are both important and difficult to see • Cognitively and perceptually noisy
  80. Crisis Patterns • Forced beyond learned roles • Actions whose consequences are both important and difficult to see • Cognitively and perceptually noisy • Coordinative load increases exponentially
  81. Thematic Vagabonding (“butterfly minds”): NOT STUCK ENOUGH
  82. Goal Fixation (encystment): TOO STUCK
  83. Heroism: Non-communicating lone wolf-isms
  84. Distraction: Irrelevant noise in comm channels
  85. IMPROVISATION: REQUIREMENT for troubleshooting complex systems
  86. IMPROVISATION
  87. IMPROVISATION
  88. Why?
  89. Root Cause Analysis: “With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none.” Friedrich Nietzsche, Twilight of the Idols, or How to Philosophize with a Hammer
  90. Hindsight Bias: Knowledge of the outcome influences the analysis of the process. Makes steps towards failure appear foreseeable and obvious.
  91. After The Fact
  92. After The Fact
  93. Reality: Before and During
  94. Hindsight Bias: “Should have known better”; “All the signs were there, you just needed to pay attention”
  95. Hindsight Bias: BEFORE accident: The future seems implausible. AFTER accident: Obviously clear: “how could they not see what mistake they were about to make”
  96. Hindsight Bias: “...people’s need to be right is stronger than their ability to be objective.” N. Crawford, American Psychological Association
  97. Outcome Bias: Judging a past decision based on its outcome.
  98. ID, Fishbone/Ishikawa, FMEA, Five Whys, Fault Tree, CED, CRT
  99. OUTAGE: why?
  100. OUTAGE: why? because this
  101. OUTAGE: why? because this; why?
  102. OUTAGE: why? because this; why? because this
  103. OUTAGE: why? because this; why? because this; why?
  104. OUTAGE: why? because this; why? because this; why? because this
  105. OUTAGE: why? because this; why? because this; why? because this; but: WHY?
  106. Some Action, which caused Some Things, caused other things to happen: OUTAGE
  107. Sequence-Of-Events
  108. • Satisfyingly simple, easy to explain and document
  109. • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  110. NOT HELPFUL: • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  111. Epidemiological (adapted from Reason, 1990)
  112. (adapted from Reason, 1990)
  113. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen (adapted from Reason, 1990)
  114. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen; Cheese = Safety Barriers, Layers of Defense (adapted from Reason, 1990)
  115. (adapted from Reason, 1990)
  116. Code, Servers (adapted from Reason, 1990)
  117. Code, Servers, Schedule, Training (adapted from Reason, 1990)
  118. Unmonitored Disk Space (latent condition)
  119. Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition)
  120. Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition
  121. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition
  122. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure)
  123. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure)
  130. Violation of known coding standards (latent condition); Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure): FAILURE!
  131. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure)
  134. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure): NO FAILURE!
  135. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends)
  136. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  137. Better, but still NOT HELPFUL • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  138. Multiple Contributors “each necessary but only jointly sufficient”
  139. Resultant Versus Emergent
  140. Systemic: Database, Router, Memcache, Webserver
  141. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap
  142. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding
  143. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design
  144. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car
  145. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training
  146. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down
  147. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity
  148. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article
  149. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties
  150. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  151. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  152. Systemic
  159. Functional Resonance: In isolation, components act within bounds. Interconnected, they produce emergent behaviors.
  160. Causes Are Constructed, Not Found. WYLFIWYF (What You Look For Is What You Find): Pre-conceived notions on “causes” and behaviors
  161. Contributors, not causes
  162. There is no root cause.
  163. LEARNING
  164. Quantifying Response: Time to detect? Time for escalation, internal notification? Time to notify the public? Time to graceful degradation? (feature off) Time to stable/resolve? Time to all clear?
  165. Qualifying Response: High signal:noise in comm channels? Troubleshooting fatigue? Troubleshooting handoff? All tools on-hand? Metrics visibility? Collaborative and skillful communication? Improvised tooling or solutions?
  166. All Together Now • Start/TTD/TTR/Stable/etc. • Severity • DATA (graphs, IRC, etc.) • Description (timeline, etc.) • Observations (motivations, latent conditions, etc.) • Actions (remediation tickets, followup)
  167. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  168. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  169. Human Error: “...knowledge and error flow from the same mental sources, only success can tell one from the other.” Ernst Mach, 1905
  170. Human Error: Nobody comes to work to do a bad job.
  171. Human Error: Useless as a label and ending point.
  172. Human Error: Human error isn’t a cause, it’s an effect.
  173. Why did it make sense to the person at the time?
  174. Error Categories • Slips • Lapses • Mistakes • Violations
  175. Error Categories
  176. First Stories: “Human error” seen as root cause. Counterfactuals: saying what they “should” have done. Prevention: be “more careful!”
  177. Second Stories: “Human error” seen as systemic vulnerabilities, deeper inside the organization. Digging into why it made sense for them to do what they did, at the time they did it. Prevention...
  178. Why did it make sense to the person at the time?
  179. Why did it make sense to the person at the time?
  180. Why did it make sense to the person at the time?
  181. Substitution Test: Could peers have made the same error under the same circumstances?
  182. WHERE to learn from?
  183. Two Propositions
  184. 100 deploys, 6 deploy-related issues
  185. 100 > 6
  186. Proposition #1: “Ways in which things go right are special cases of the ways in which things go wrong.”
  187. Proposition #1: Successes = failures gone wrong. Study the failures, generalize from that. Data sources: 6 out of 100
  188. Proposition #2: “Ways in which things go wrong are special cases of the ways in which things go right.”
  189. Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Data sources: 94 out of 100
  190. 94/100? OR 6/100?
  191. What and WHY Do Things Go RIGHT?
  192. Not just: why did we fail? But also: why did we succeed?
  193. Near Misses Hey everybody - Don’t be like me. I tried to X, but because it was no good. It almost exploded everyone. So, don’t do: (details about X) Love, Joe
  194. Taking the New View: Recognize that human error is an attribution.
  195. Taking the New View: Pursue Second Stories.
  196. Taking the New View: Escape Hindsight Bias.
  197. Taking the New View: Understand work as performed at the sharp end.
  198. Taking the New View: Examine how changes (at all layers) will produce new vulnerabilities.
  199. Taking the New View: Use technology to support and enhance human expertise.
  200. Taking the New View: Tame complexity through new forms of feedback.
  201. Taking the New View: Realize that your systems are not inherently safe.
  202. Taking the New View: Human error is an inevitable by-product of strained complex systems.
  203. Taking the New View: Human error isn’t at the root of your safety problems.
  204. Taking the New View: Human error isn’t random.
  205. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  206. Just Culture
  207. Just Culture: Balancing accountability with learning.
  208. Intentional Malice
  209. Negligence Found vs. Severity of the Outage
  210. Name, Blame, Shame
  211. Name, Blame, Shame
  212. Name WHY? Blame #!@%$&# Shame
  213. “Must set an example!”
  214. “Has to be some fear that not doing one’s job correctly could lead to punishment.”
  215. “Must set an example!” “Has to be some fear that not doing one’s job correctly could lead to punishment.” Overlay: Punishing Deterrents is a Losing Proposition
  216. Holding People Accountable != Blaming People
  217. Accountability and Learning vs. Punishment For Errors
  218. No Bad Apples, Only Bad Theories of Error
  219. Name WHY? Blame #!@%$&# Shame
  220. Signs Of Old View: “Gross Misconduct”, “Carelessness”, “Negligence”, “Egregious Behavior”, “Willful Violations”
  221. Discretionary Spaces
  222. Acceptable Unacceptable
  223. Acceptable Unacceptable
  224. Acceptable Unacceptable
  225. Acceptable Unacceptable
  226. Acceptable (who draws this subjective line?) Unacceptable
  227. Increase Accountability By Supporting Learning
  228. Empower People: Let them own their own stories. Don’t make people pay penalties. Allow them to educate the organization.
  229. Reduce Uncertainty: Make it clear who defines acceptable behavior.
  230. Organizational Roots: Accountability = Responsibility + Requisite Authority
  231. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  232. (Thanks, Fellas) Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  233. Homework!
  234. We Want YOU etsy.com/careers
  235. Photo Credits http://www.flickr.com/photos/51035644987@N01/2678090600/ • http://www.flickr.com/photos/67196253@N00/2941655917/ • http://www.flickr.com/photos/stirwise/417629641 • http://www.flickr.com/photos/38383999@N06/3888057995/ • http://www.flickr.com/photos/94443490@N00/361543080/ • http://www.flickr.com/photos/7729940@N06/4333396494/ • http://www.flickr.com/photos/cpstorm/167418602 • http://www.flickr.com/photos/63474264@N00/4366221069/ • http://www.flickr.com/photos/proimos/4199675334 • http://www.flickr.com/photos/30475691@N07/2862060992/ • http://www.flickr.com/photos/14663487@N00/797755046/ • http://www.flickr.com/photos/25080113@N06/5361445631/
  236. THE END
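
Slide 36 defines Impact Time = TTR - Start, and slide 43 gives one worked incident (Start 4:10pm, TTD 4:15pm, TTR 4:30pm, Stable 4:35pm, Total Impact 20 min, Severity 1). The arithmetic can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `Incident`, `Severity`, `time_to_detect`, `impact_time`, `time_to_stable` and the helper `at` are hypothetical and do not come from the deck, and the date 5/11/2011 is read here as May 11, 2011.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels paraphrased from slides 38-42 (names are hypothetical)."""
    SEV1 = 1  # total loss of service, effectively unusable, or loss of a critical feature
    SEV2 = 2  # major degradation for a subset of members, or minor degradation for all
    SEV3 = 3  # noticeable non-critical feature loss or degradation
    SEV4 = 4  # no visible impact; loss of redundancy or capacity headroom
    SEV5 = 5  # no impact, but an unexpected failure


@dataclass
class Incident:
    title: str
    start: datetime     # "Start": when the problem began
    detected: datetime  # "TTD": timestamp at which it was detected
    resolved: datetime  # "TTR": timestamp at which it was resolved
    stable: datetime    # "Stable": the "all clear" timestamp
    severity: Severity

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.start

    @property
    def impact_time(self) -> timedelta:
        # Slide 36: Impact Time = TTR - Start
        return self.resolved - self.start

    @property
    def time_to_stable(self) -> timedelta:
        return self.stable - self.start


def at(hhmm: str) -> datetime:
    """Build a timestamp on the incident date (assumed to be May 11, 2011)."""
    return datetime.strptime(f"2011-05-11 {hhmm}", "%Y-%m-%d %H:%M")


# The incident from slide 43.
checkout = Incident(
    title="Payments/Checkout system issue",
    start=at("16:10"),
    detected=at("16:15"),
    resolved=at("16:30"),
    stable=at("16:35"),
    severity=Severity.SEV1,
)

print(checkout.impact_time)  # 0:20:00, matching "Total Impact 20 min"
```

Keeping the raw timestamps and deriving the durations from them means the "Quantifying Response" questions on slide 164 (time to detect, time to stable) can be read off the same record without re-entering numbers.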