Advanced PostMortem Fu & Human Error 101, John Allspaw, VP, Tech Ops ...
We Want    YOU    etsy.com/careers
Science, Time Travel, Mythbusting, Reading, Homework
Here To Challenge You
Resilience Engineering: Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
Complex, Dynamic
Fundamental Surprises
E.T.T.O. (Efficiency-Thoroughness Trade-Off): Efficiency vs. Thoroughness
Organizations, Policies, Procedures,           Regulations                                       BLUNT       Resources & C...
Organizations, Policies, Procedures,                  Regulations                                                    BLUNT...
FUTURE  +PAST
Why Do Them?
Why? Understand the failure
Why? Understand the system
Where “System” =                     Networks                        Servers                   Applications               ...
Where “System” =                     Networks                        Servers                   Applications               ...
People
(Anticipation)    Knowing      What    To Expect
(Anticipation)    Knowing         Knowing      What            What    To Expect      To Look For                 (Monitor...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      What            What          W...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
Microphones are ON?
Event Awareness: Code Deploys
TIMELINE      IRC logging = Rich Data
TIMELINE       Traces of Data      IRC logging = Rich Data
IRC Logs Fed Into Solr
Status Blog, Twitter Feed
Flight Data Recorder
Annotation Traces
Investigation Basics
Start?
TTD: How?
TTR
Stable (“all clear”)
Impact Time = TTR - Start
SEVERITY
Severity 1: A. Total loss of service; B. Severe degradation, effectively unusable; C. Loss of a critical feature
Severity 2: A. Major degradation/feature loss for SUBSET of members; B. Minor degradation/feature loss for ALL members
Severity 3: Noticeable non-critical feature loss or degradation
Severity 4: No visible impact, loss of redundancy or capacity headroom
Severity 5: No-impact but unexpected failure
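One way to carry the five severity levels into tooling is a small enumeration. The sketch below is an illustration and not something shown in the talk: the names `Severity`, `DESCRIPTIONS`, and `is_member_visible` are hypothetical, and treating levels 1 to 3 as member-visible is a reading of the slide wording (levels 4 and 5 are described as having no visible impact).

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number = more severe, mirroring the slide numbering."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Descriptions paraphrase the severity slides.
DESCRIPTIONS = {
    Severity.SEV1: "Total loss of service, severe degradation, or loss of a critical feature",
    Severity.SEV2: "Major degradation for a subset of members, or minor degradation for all",
    Severity.SEV3: "Noticeable non-critical feature loss or degradation",
    Severity.SEV4: "No visible impact; loss of redundancy or capacity headroom",
    Severity.SEV5: "No impact, but unexpected failure",
}


def is_member_visible(sev: Severity) -> bool:
    # Assumption: levels 4 and 5 are the ones without visible impact.
    return sev <= Severity.SEV3


print(is_member_visible(Severity.SEV2), is_member_visible(Severity.SEV4))
```

Using an `IntEnum` keeps the levels comparable, so "at least as severe as Severity 2" is a plain `<=` check.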
5/11/2011 - Payments/Checkout system issue   Start                                     4:10pm   TTD                       ...
• Basic metrics
• Timeline with details
• Remediations/Observations
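The basic metrics can be derived from the recorded clock times. A minimal sketch, assuming the four times from the 5/11/2011 Payments/Checkout example (Start 4:10pm, TTD 4:15pm, TTR 4:30pm, Stable 4:35pm); the class name, field names, and the `at` helper are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimes:
    """Clock times recorded for one incident (field names are illustrative)."""
    start: datetime     # when the problem began
    detected: datetime  # the slide's "TTD" entry
    resolved: datetime  # the slide's "TTR" entry
    stable: datetime    # "all clear"

    def impact_time(self) -> timedelta:
        # Impact Time = TTR - Start, as defined on the slide.
        return self.resolved - self.start

    def detection_delay(self) -> timedelta:
        # How long the problem ran before anyone noticed.
        return self.detected - self.start


def at(clock: str) -> datetime:
    """Parse a clock time such as '4:10pm' on the day of the example incident."""
    return datetime.strptime(f"5/11/2011 {clock}", "%m/%d/%Y %I:%M%p")


incident = IncidentTimes(at("4:10pm"), at("4:15pm"), at("4:30pm"), at("4:35pm"))
print(incident.impact_time())      # 0:20:00
print(incident.detection_delay())  # 0:05:00
```

The 20-minute result matches the "Total Impact 20 min" figure recorded for that incident.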
“normal” operation, Incident, PostMortem (axis: Time)
How? Why? Prevention?
How
Crisis Patterns (timeline diagram, built up step by step and repeated three times): Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem along a Time axis, with Stress overlaid, plus Internal and External Update Communications
Crisis Patterns
Crisis Patterns: Forced beyond learned roles
Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see
Crisis PatternsForced beyond learned rolesActions whose consequences are both important anddifficult to seeCognitively and p...
Crisis PatternsForced beyond learned rolesActions whose consequences are both important anddifficult to seeCognitively and p...
Thematic Vagabonding (“butterfly minds”): NOT STUCK ENOUGH
Goal Fixation (encystment): TOO STUCK
Heroism: Non-communicating lone wolf-isms
Distraction: Irrelevant noise in comm channels
IMPROVISATION: REQUIREMENT for troubleshooting complex systems
IMPROVISATION
IMPROVISATION
Why?
Root Cause Analysis“With the unknown, one is confronted with danger,discomfort, and care; the first instinct is to abolisht...
Hindsight BiasKnowledge of the outcome influences theanalysis of the processMakes steps towards failure appearforeseeable a...
After The Fact
After The Fact
Reality: Before and During
Hindsight Bias: “Should have known better”; “All the signs were there, you just needed to pay attention”
Hindsight BiasBEFORE accident:        The future seems implausibleAFTER accident:   Obviously clear: “how could they not s...
Hindsight Bias“...people’s need to be rightis stronger than their abilityto be objective.”                          N. Cra...
Outcome Bias: Judging a past decision based on its outcome.
ID          Fishbone/Ishikawa     FMEA                         Five Whys                Fault Tree      CED      CRT
why?       OUTAGE
why?because  this             OUTAGE
why? why?  because    this            OUTAGE
why? why?because   because  this      this                    OUTAGE
why?     why? why?   because   because     this      this                       OUTAGE
why?       why? why?because   because   because  this      this      this                              OUTAGE
why?        why? why?because    because   because  this       this      this                               OUTAGE         ...
which                       causedSome       Caused       otherAction   Some Things   things                              ...
Sequence-Of-Events
• Satisfyingly simple, easy to explain and document
•   Satisfyingly simple, easy to explain and document•   Solves for a specific case•   Ignorant of surrounding circumstance...
NOT HELPFUL•   Satisfyingly simple, easy to explain and document•   Solves for a specific case•   Ignorant of surrounding c...
Epidemiological        (adapted from Reason, 1990)
(adapted from Reason, 1990)
Holes = Active/Latent Failures,Bad Things™ Waiting to Happen                  (adapted from Reason, 1990)
Holes = Active/Latent Failures,                            Bad Things™ Waiting to HappenCheese = Safety Barriers,   Layers...
(adapted from Reason, 1990)
Code, Servers (adapted from Reason, 1990)
Code, Servers, Schedule, Training (adapted from Reason, 1990)
Unmonitored Disk Space   (latent condition)
Capacity                     Unmonitored Disk Space  Miscalculation                        (latent condition)(latent condi...
Capacity                     Unmonitored Disk Space  Miscalculation                        (latent condition)(latent condi...
Violation of known coding     Capacity                standards         Unmonitored Disk Space  Miscalculation         (la...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Capacity  Miscalculation(latent condition)                                                            Unit Test In        ...
Capacity  Miscalculation(latent condition)                                                            Unit Test In        ...
Capacity  Miscalculation(latent condition)                                                            Unit Test In        ...
Capacity  Miscalculation(latent condition)                                                            Unit Test In        ...
• Better than dominoes, but still linear layers of “defense”• Helps uncover multiple contributors and latent failures (at ...
• Better than dominoes, but still linear layers of “defense”• Helps uncover multiple contributors and latent failures (at ...
Better, but still                                               NOT HELPFUL• Better than dominoes, but still linear layers...
Multiple Contributors “each necessary but only jointly sufficient”
Resultant Versus Emergent
Systemic: Database, Router, Memcache, Webserver
Systemic                DatabaseRouter                        Memcache    Webserver                            Feature    ...
Systemic                DatabaseRouter                        Memcache                                  Last Round    Webs...
Systemic                        Dashboard                                 Design                DatabaseRouter            ...
Systemic                          Dashboard                                    Design                   Database  Router  ...
Systemic                          Dashboard                                    Design                   Database  Router  ...
Systemic                           Dashboard                                     Design                    Database  Route...
Systemic                           Dashboard                                     Design                    Database  Route...
Systemic                               Dashboard                                         Design                    Databas...
Systemic Hiring                                        Dashboard                                         Design        Dif...
Systemic Hiring                          S3 is slow    Dashboard                                         Design        Dif...
Systemic Hiring                          S3 is slow    Dashboard                                         Design        Dif...
Systemic
Systemic
Systemic
Systemic
Systemic
Systemic
Systemic
Functional Resonance: In isolation, components act within bounds. Interconnected, they produce emerging behaviors.
Causes Are Constructed, Not Found. WYLFIWYF (What You Look For Is What You Find): pre-conceived notions on “causes” and behaviors
Contributors, not causes
There is no root cause.
LEARNING
Quantifying ResponseTime to detect?Time for escalation, internal notification?Time to notify the public?Time to graceful de...
Qualifying ResponseHigh signal:noise in comm channels?Troubleshooting fatigue?Troubleshooting handoff?All tools on-hand?Met...
All Together Now•   Start/TTD/TTR/Stable/etc.•   Severity•   DATA (graphs, IRC, etc.)•   Description (timeline, etc.)•   O...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
Human Error“...knowledge and error flow from thesame mental sources, only successcan tell one from the other.”             ...
Human Error: Nobody comes to work to do a bad job.
Human Error: Useless as a label and ending point.
Human Error: Human error isn’t a cause, it’s an effect.
Why did it make sense to the person at the time?
Error Categories: Slips, Lapses, Mismatches, Violations
Error Categories
First Stories“Human error” seen as root cause.Counterfactuals: saying what they “should”have done.Prevention: be “more car...
Second Stories“Human error” seen as systemic vulnerabilities,deeper inside the organization.Digging into why it made sense...
Why did it make sense to the person at the time?
Why did it make sense to the person at the time?
Why did it make sense to the person at the time?
Substitution Test: Could peers have made the same error under the same circumstances?
WHERE to learn from ?
Two Propositions
100 deploys, 6 deploy-related issues
100 > 6
Proposition #1: “Ways in which things go right are special cases of the ways in which things go wrong.”
Proposition #1: Successes = failures gone wrong. Study the failures, generalize from that. Data sources: 6 out of 100
Proposition #2: “Ways in which things go wrong are special cases of the ways in which things go right.”
Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Data sources: 94 out of 100
94/100? OR 6/100?
What and WHY Do Things Go RIGHT?
Not just: why did we fail? But also: why did we succeed?
Near MissesHey everybody -Don’t be like me. I tried to X, butbecause it was no good.It almost exploded everyone.So, don’t ...
Taking the New View: Recognize that human error is an attribution.
Taking the New View: Pursue Second Stories.
Taking the New View: Escape Hindsight Bias.
Taking the New View: Understand work as performed at the sharp end.
Taking the New View: Examine how changes (at all layers) will produce new vulnerabilities.
Taking the New View: Use technology to support and enhance human expertise.
Taking the New View: Tame complexity through new forms of feedback.
Taking the New View: Realize that your systems are not inherently safe.
Taking the New View: Human error is an inevitable by-product of strained complex systems.
Taking the New View: Human error isn’t at the root of your safety problems.
Taking the New View: Human error isn’t random.
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
Just Culture
Just Culture: Balancing accountability with learning.
Intentional Malice
Negligence  Found         Severity of the Outage
Name, Blame, Shame
Name, Blame, Shame
Name, Blame, Shame  WHY?  #!@%$&#
“Must set an example!”
“Has to be some fear that not doing one’s job correctly could lead to punishment.”
“Must set an example!”       Punishing       Deterrents is a not“Has to be some fear that            Losing coulddoing one...
Holding People Accountable          !=     Blaming People
Accountability    and  Learning                 Punishment For Errors
No Bad Apples, Only Bad Theories of Error
Name, Blame, Shame  WHY?  #!@%$&#
Signs Of Old View: “Gross Misconduct”, “Carelessness”, “Negligence”, “Egregious Behavior”, “Willful Violations”
Discretionary Spaces
Acceptable / Unacceptable
Acceptable / Unacceptable
Acceptable / Unacceptable
Acceptable / Unacceptable
Acceptable / (who draws this subjective line?) / Unacceptable
Increase Accountability By Supporting Learning
Empower People: Let them own their own stories. Don’t make people pay penalties. Allow them to educate the organization.
Reduce Uncertainty: Make it clear who defines acceptable behavior.
Organizational Roots: Accountability = Responsibility + Requisite Authority
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
(Thanks, Fellas) Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
Homework!
We Want    YOU    etsy.com/careers
•                                             Photo Credits    http://www.flickr.com/photos/51035644987@N01/2678090600/•   ...
THE END
Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
Not sure that these slides will make too much sense without the video, but here they are.

Published in: Technology

  1. 1. Advanced PostMortem Fu & Human Error 101, John Allspaw, VP, Tech Ops, Velocity 2011
  2. 2. We Want YOU etsy.com/careers
  3. 3. Science, Time Travel, Mythbusting, Reading, Homework
  4. 4. Here To Challenge You
  5. 5. Resilience Engineering: Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  6. 6. Complex, Dynamic
  7. 7. Fundamental Surprises
  8. 8. E.T.T.O. (Efficiency-Thoroughness Trade-Off): Efficiency vs. Thoroughness
  9. 9. Organizations, Policies, Procedures, Regulations BLUNT Resources & Constraints
  10. 10. Organizations, Policies, Procedures, Regulations BLUNT Resources & Constraints Slips Adjustments Operators Compensations Mistakes Lapses Recoveries Violations Improvisations SHARP
  11. 11. FUTURE +PAST
  12. 12. Why Do Them?
  13. 13. Why? Understand the failure
  14. 14. Why? Understand the system
  15. 15. Where “System” = Networks Servers Applications Processes People
  16. 16. Where “System” = Networks Servers Applications Processes People
  17. 17. People
  18. 18. (Anticipation) Knowing What To Expect
  19. 19. (Anticipation) Knowing Knowing What What To Expect To Look For (Monitoring)
  20. 20. (Anticipation) (Response) Knowing Knowing Knowing What What What To Expect To Look For To Do (Monitoring)
  21. 21. (Anticipation) (Response) Knowing Knowing Knowing Knowing What What What What To Expect To Look For To Do Has Happened (Monitoring) (Learning)
  22. 22. (Anticipation) (Response) Knowing Knowing Knowing Knowing What What What What To Expect To Look For To Do Has Happened (Monitoring) (Learning)
  23. 23. Microphones are ON?
  24. 24. Event Awareness: Code Deploys
  25. 25. TIMELINE IRC logging = Rich Data
  26. 26. TIMELINE Traces of Data IRC logging = Rich Data
  27. 27. IRC Logs Fed Into Solr
  28. 28. Status Blog, Twitter Feed
  29. 29. Flight Data Recorder
  30. 30. Annotation Traces
  31. 31. Investigation Basics
  32. 32. Start?
  33. 33. TTD: How?
  34. 34. TTR
  35. 35. Stable (“all clear”)
  36. 36. Impact Time = TTR - Start
  37. 37. SEVERITY
  38. 38. Severity 1: A. Total loss of service; B. Severe degradation, effectively unusable; C. Loss of a critical feature
  39. 39. Severity 2: A. Major degradation/feature loss for SUBSET of members; B. Minor degradation/feature loss for ALL members
  40. 40. Severity 3: Noticeable non-critical feature loss or degradation
  41. 41. Severity 4: No visible impact, loss of redundancy or capacity headroom
  42. 42. Severity 5: No-impact but unexpected failure
  43. 43. 5/11/2011 - Payments/Checkout system issue: Start 4:10pm, TTD 4:15pm, TTR 4:30pm, Stable 4:35pm, Total Impact 20 min, Severity 1
  44. 44. • Basic metrics • Timeline with details • Remediations/Observations
  45. 45. “normal” operation, Incident, PostMortem (axis: Time)
  46. 46. How? Why? Prevention?
  47. 47. How
  48. 48. Crisis Patterns: Problem Starts PostMortem Time
  49. 49. Crisis Patterns: Problem Starts Detection PostMortem Time
  50. 50. Crisis Patterns: Problem Starts Detection Evaluation PostMortem Time
  51. 51. Crisis Patterns: Problem Starts Detection Evaluation Response PostMortem Time
  52. 52. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Time
  53. 53. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Confirmation Time
  54. 54. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Confirmation All Clear Time
  55. 55. Crisis Patterns: Problem Starts Stress Detection Evaluation Response Stable PostMortem Confirmation All Clear Time
  56. 56. Crisis Patterns: Problem Starts PostMortem Time
  57. 57. Crisis Patterns: Problem Starts Detection PostMortem Time
  58. 58. Crisis Patterns: Problem Starts Detection Evaluation PostMortem Time
  59. 59. Crisis Patterns: Problem Starts Detection Evaluation Response PostMortem Time
  60. 60. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Time
  61. 61. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Confirmation Time
  62. 62. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Confirmation All Clear Time
  63. 63. Crisis Patterns: Problem Starts Stress Detection Evaluation Response Stable PostMortem Confirmation All Clear Time
  64. 64. Crisis Patterns: Problem Starts PostMortem Time
  65. 65. Crisis Patterns: Problem Starts Detection PostMortem Time
  66. 66. Crisis Patterns: Problem Starts Detection Evaluation PostMortem Time
  67. 67. Crisis Patterns: Problem Starts Detection Evaluation Response PostMortem Time
  68. 68. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Time
  69. 69. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Confirmation Time
  70. 70. Crisis Patterns: Problem Starts Detection Evaluation Response Stable PostMortem Confirmation All Clear Time
  71. 71. Crisis Patterns: Problem Starts Stress Detection Evaluation Response Stable PostMortem Confirmation All Clear Time
  72. 72. Problem Starts Detection Evaluation Response Stable PostMortem Confirmation All Clear Time
  73. 73. Problem Starts Detection Evaluation Response Stable PostMortem Confirmation All Clear Time
  74. 74. Problem Starts Detection Evaluation Response Stable PostMortem Confirmation All Clear Time Internal and External Update Communications
  75. 75. Crisis Patterns: Problem Starts Detection Evaluation Response Stable Confirmation All Clear Time
  76. 76. Crisis Patterns
  77. 77. Crisis Patterns: Forced beyond learned roles
  78. 78. Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see
  79. 79. Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see; Cognitively and perceptively noisy
  80. 80. Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see; Cognitively and perceptively noisy; Coordinative load increases exponentially
  81. 81. Thematic Vagabonding (“butterfly minds”): NOT STUCK ENOUGH
  82. 82. Goal Fixation (encystment): TOO STUCK
  83. 83. Heroism: Non-communicating lone wolf-isms
  84. 84. Distraction: Irrelevant noise in comm channels
  85. 85. IMPROVISATION: REQUIREMENT for troubleshooting complex systems
  86. 86. IMPROVISATION
  87. 87. IMPROVISATION
  88. 88. Why?
  89. 89. Root Cause Analysis: “With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none.” Friedrich Nietzsche, Twilight of the Idols, or How to Philosophize with a Hammer
  90. 90. Hindsight Bias: Knowledge of the outcome influences the analysis of the process. Makes steps towards failure appear foreseeable and obvious.
  91. 91. After The Fact
  92. 92. After The Fact
  93. 93. Reality: Before and During
  94. 94. Hindsight Bias: “Should have known better”; “All the signs were there, you just needed to pay attention”
  95. 95. Hindsight Bias: BEFORE accident: The future seems implausible. AFTER accident: Obviously clear: “how could they not see what mistake they were about to make”
  96. 96. Hindsight Bias: “...people’s need to be right is stronger than their ability to be objective.” N. Crawford, American Psychological Association
  97. 97. Outcome Bias: Judging a past decision based on its outcome.
  98. 98. ID Fishbone/Ishikawa FMEA Five Whys Fault Tree CED CRT
  99. 99. why? OUTAGE
  100. 100. why?because this OUTAGE
  101. 101. why? why? because this OUTAGE
  102. 102. why? why?because because this this OUTAGE
  103. 103. why? why? why? because because this this OUTAGE
  104. 104. why? why? why?because because because this this this OUTAGE
  105. 105. why? why? why?because because because this this this OUTAGE but: WHY?
  106. 106. Some Action Caused Some Things, which caused other things to happen: OUTAGE
  107. 107. Sequence-Of-Events
  108. 108. • Satisfyingly simple, easy to explain and document
  109. 109. • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  110. 110. NOT HELPFUL: • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  111. 111. Epidemiological (adapted from Reason, 1990)
  112. 112. (adapted from Reason, 1990)
  113. 113. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen (adapted from Reason, 1990)
  114. 114. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen; Cheese = Safety Barriers, Layers of Defense (adapted from Reason, 1990)
  115. 115. (adapted from Reason, 1990)
  116. 116. Code, Servers (adapted from Reason, 1990)
  117. 117. Code, Servers, Schedule, Training (adapted from Reason, 1990)
  118. 118. Unmonitored Disk Space (latent condition)
  119. 119. Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition)
  120. 120. Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition
  121. 121. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition
  122. 122. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure)
  123. 123. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  124. 124. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  125. 125. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  126. 126. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  127. 127. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  128. 128. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  129. 129. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  130. 130. Violation of known coding standards (latent condition), Capacity Miscalculation (latent condition), Unmonitored Disk Space (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure): FAILURE!
  131. 131. Capacity Miscalculation (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  132. 132. Capacity Miscalculation (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  133. 133. Capacity Miscalculation (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure)
  134. 134. Capacity Miscalculation (latent condition), Unit Test In Transition, Bug Introduced (active failure), External API call (active failure): NO FAILURE!
  135. 135. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends)
  136. 136. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  137. 137. Better, but still NOT HELPFUL: • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  138. 138. Multiple Contributors “each necessary but only jointly sufficient”
  139. 139. Resultant Versus Emergent
  140. 140. Systemic: Database, Router, Memcache, Webserver
  141. 141. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap
  142. 142. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding
  143. 143. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design
  144. 144. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car
  145. 145. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training
  146. 146. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down
  147. 147. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity
  148. 148. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article
  149. 149. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties
  150. 150. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  151. 151. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  152. 152. Systemic
  159. 159. Functional Resonance: In isolation, components act within bounds. Interconnected, they produce emerging behaviors.
  160. 160. Causes Are Constructed, Not Found. WYLFIWYF (What You Look For Is What You Find): pre-conceived notions on “causes” and behaviors
  161. 161. Contributors, not causes
  162. 162. There is no root cause.
  163. 163. LEARNING
  164. 164. Quantifying Response
      • Time to detect?
      • Time for escalation, internal notification?
      • Time to notify the public?
      • Time to graceful degradation? (feature off)
      • Time to stable/resolve?
      • Time to all clear?
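
The timing questions on the Quantifying Response slide can all be answered from one timestamped timeline. The following is only a minimal sketch: the event names, the timestamps, and the `minutes_from_start` helper are invented for illustration and are not taken from the talk.

```python
from datetime import datetime

# Hypothetical incident timeline; every event name and timestamp is invented.
timeline = {
    "start": datetime(2012, 3, 1, 14, 0),
    "detected": datetime(2012, 3, 1, 14, 7),
    "escalated": datetime(2012, 3, 1, 14, 12),
    "public_notice": datetime(2012, 3, 1, 14, 20),
    "feature_off": datetime(2012, 3, 1, 14, 25),
    "stable": datetime(2012, 3, 1, 15, 5),
    "all_clear": datetime(2012, 3, 1, 15, 30),
}


def minutes_from_start(events):
    """Return minutes elapsed between 'start' and every other event."""
    start = events["start"]
    return {
        name: (when - start).total_seconds() / 60
        for name, when in events.items()
        if name != "start"
    }


response_times = minutes_from_start(timeline)
for name, minutes in response_times.items():
    print(f"time to {name}: {minutes:.0f} min")
```

Measuring every interval from the same start keeps the numbers comparable across incidents; intervals between consecutive events could be derived from the same dictionary if handoffs are the interest.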
  165. 165. Qualifying Response
      • High signal:noise in comm channels?
      • Troubleshooting fatigue?
      • Troubleshooting handoff?
      • All tools on-hand?
      • Metrics visibility?
      • Collaborative and skillful communication?
      • Improvised tooling or solutions?
  166. 166. All Together Now
      • Start/TTD/TTR/Stable/etc.
      • Severity
      • DATA (graphs, IRC, etc.)
      • Description (timeline, etc.)
      • Observations (motivations, latent conditions, etc.)
      • Actions (remediation tickets, followup)
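
One way to keep the items on the All Together Now slide in a single place is a small record type. This is a sketch under assumptions: the `PostMortem` class, its field names, and the sample values are hypothetical, and `impact_minutes` simply applies the deck's earlier "Impact Time = TTR - Start".

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PostMortem:
    """Hypothetical container for the pieces listed on the slide."""
    start: datetime
    detected: datetime            # TTD
    resolved: datetime            # TTR
    stable: datetime              # "all clear"
    severity: str
    data: List[str] = field(default_factory=list)          # graphs, IRC logs
    description: str = ""                                   # timeline narrative
    observations: List[str] = field(default_factory=list)  # latent conditions
    actions: List[str] = field(default_factory=list)       # remediation tickets

    def impact_minutes(self) -> float:
        # Impact Time = TTR - Start, expressed in minutes.
        return (self.resolved - self.start).total_seconds() / 60


# Invented sample values, for illustration only.
record = PostMortem(
    start=datetime(2012, 3, 1, 14, 0),
    detected=datetime(2012, 3, 1, 14, 7),
    resolved=datetime(2012, 3, 1, 15, 5),
    stable=datetime(2012, 3, 1, 15, 30),
    severity="1",
    observations=["disk space was unmonitored (latent condition)"],
    actions=["add a disk space alert"],
)
print(record.impact_minutes())
```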
  167. 167. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  169. 169. Human Error: “...knowledge and error flow from the same mental sources, only success can tell one from the other.” Ernst Mach, 1905
  170. 170. Human Error: Nobody comes to work to do a bad job.
  171. 171. Human Error: Useless as a label and ending point.
  172. 172. Human Error: Human error isn’t a cause, it’s an effect.
  173. 173. Why did it make sense to the person at the time?
  174. 174. Error Categories
      • Slips
      • Lapses
      • Mismatches
      • Violations
  175. 175. Error Categories
  176. 176. First Stories
      • “Human error” seen as root cause.
      • Counterfactuals: saying what they “should” have done.
      • Prevention: be “more careful!”
  177. 177. Second Stories
      • “Human error” seen as systemic vulnerabilities, deeper inside the organization.
      • Digging into why it made sense for them to do what they did, at the time they did it.
      • Prevention...
  178. 178. Why did it make sense to the person at the time?
  181. 181. Substitution Test: Could peers have made the same error under the same circumstances?
  182. 182. WHERE to learn from?
  183. 183. Two Propositions
  184. 184. 100 deploys, 6 deploy-related issues
  185. 185. 100 > 6
  186. 186. Proposition #1: “Ways in which things go right are special cases of the ways in which things go wrong.”
  187. 187. Proposition #1: Successes = failures gone wrong. Study the failures, generalize from that. Data sources: 6 out of 100
  188. 188. Proposition #2: “Ways in which things go wrong are special cases of the ways in which things go right.”
  189. 189. Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Data sources: 94 out of 100
  190. 190. 94/100? OR 6/100?
  191. 191. What and WHY Do Things Go RIGHT?
  192. 192. Not just: why did we fail? But also: why did we succeed?
  193. 193. Near Misses: Hey everybody - Don’t be like me. I tried to X, but because it was no good. It almost exploded everyone. So, don’t do: (details about X). Love, Joe
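
The arithmetic behind the two propositions is small enough to spell out. A minimal sketch using the slide's own figures (100 deploys, 6 deploy-related issues); the variable names are illustrative only.

```python
# Figures from the slides: 100 deploys, 6 of them with deploy-related issues.
deploys = 100
deploy_related_issues = 6

# Proposition #1 studies only the failures; Proposition #2 studies the rest.
successful_deploys = deploys - deploy_related_issues
failure_share = deploy_related_issues / deploys
success_share = successful_deploys / deploys

print(f"Proposition #1 data sources: {deploy_related_issues} ({failure_share:.0%})")
print(f"Proposition #2 data sources: {successful_deploys} ({success_share:.0%})")
```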
  194. 194. Taking the New View: Recognize that human error is an attribution.
  195. 195. Taking the New View: Pursue Second Stories.
  196. 196. Taking the New View: Escape Hindsight Bias.
  197. 197. Taking the New View: Understand work as performed at the sharp end.
  198. 198. Taking the New View: Examine how changes (at all layers) will produce new vulnerabilities.
  199. 199. Taking the New View: Use technology to support and enhance human expertise.
  200. 200. Taking the New View: Tame complexity through new forms of feedback.
  201. 201. Taking the New View: Realize that your systems are not inherently safe.
  202. 202. Taking the New View: Human error is an inevitable by-product of strained complex systems.
  203. 203. Taking the New View: Human error isn’t at the root of your safety problems.
  204. 204. Taking the New View: Human error isn’t random.
  205. 205. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  206. 206. Just Culture
  207. 207. Just Culture: Balancing accountability with learning.
  208. 208. Intentional Malice
  209. 209. Negligence Found Severity of the Outage
  210. 210. Name, Blame, Shame
  212. 212. Name, Blame, Shame: WHY? #!@%$&#
  213. 213. “Must set an example!”
  214. 214. “Has to be some fear that not doing one’s job correctly could lead to punishment.”
  215. 215. “Must set an example!” “Has to be some fear that not doing one’s job correctly could lead to punishment.” Punishing Deterrents is a Losing Proposition
  216. 216. Holding People Accountable != Blaming People
  217. 217. Accountability and Learning Punishment For Errors
  218. 218. No Bad Apples, Only Bad Theories of Error
  219. 219. Name, Blame, Shame: WHY? #!@%$&#
  220. 220. Signs Of Old View
      • “Gross Misconduct”
      • “Carelessness”
      • “Negligence”
      • “Egregious Behavior”
      • “Willful Violations”
  221. 221. Discretionary Spaces
  226. 226. Acceptable / Unacceptable (who draws this subjective line?)
  227. 227. Increase Accountability By Supporting Learning
  228. 228. Empower People
      • Let them own their own stories.
      • Don’t make people pay penalties.
      • Allow them to educate the organization.
  229. 229. Reduce Uncertainty: Make it clear who defines acceptable behavior.
  230. 230. Organizational Roots: Accountability = Responsibility + Requisite Authority
  231. 231. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
  232. 232. (Thanks, Fellas) Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  233. 233. Homework!
  235. 235. Photo Credits
      • http://www.flickr.com/photos/51035644987@N01/2678090600/
      • http://www.flickr.com/photos/67196253@N00/2941655917/
      • http://www.flickr.com/photos/stirwise/417629641
      • http://www.flickr.com/photos/38383999@N06/3888057995/
      • http://www.flickr.com/photos/94443490@N00/361543080/
      • http://www.flickr.com/photos/7729940@N06/4333396494/
      • http://www.flickr.com/photos/cpstorm/167418602
      • http://www.flickr.com/photos/63474264@N00/4366221069/
      • http://www.flickr.com/photos/proimos/4199675334
      • http://www.flickr.com/photos/30475691@N07/2862060992/
      • http://www.flickr.com/photos/14663487@N00/797755046/
      • http://www.flickr.com/photos/25080113@N06/5361445631/
  236. 236. THE END