Outages, Post-Mortems, and Human Error 101 (John Allspaw, SVP, Tech Ops)
Science, Time Travel, Mythbusting, Reading, Homework
Here To Challenge You
Resilience Engineering: Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
COMPLEX + DYNAMIC
COMPLEX != COMPLICATED
Fundamental Surprises
E.T.T.O.: Efficiency-Thoroughness Trade-Off
Organizations, Policies, Procedures,           Regulations                                       BLUNT       Resources & C...
Organizations, Policies, Procedures,                  Regulations                                                    BLUNT...
FUTURE   + PAST
Why Do Them?
Why? Understand the FAILURE
Why? Understand the SYSTEM
Where “System” =              Networks                 Servers            Applications             Processes              ...
Where “System” =              Networks                 Servers            Applications             Processes              ...
PEOPLE
(Anticipation)    Knowing      What    To Expect
(Anticipation)    Knowing         Knowing      What            What    To Expect      To Look For                 (Monitor...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      What            What          W...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
Microphones are ON?
Event Awareness: Code Deploys
Timeline: RICH DATA
Traces of Data, Timeline: RICH DATA
Status Blog, Twitter Feed
Flight Data   Recorder
Annotation Traces
Investigation Basics
Start?
TTD (time to detect): How?
TTR (time to resolve)
Stable (“all clear”)
Impact Time = TTR - Start
SEVERITY
SEV 1: A. Total loss of service; B. Severe degradation, effectively unusable; C. Loss of a critical feature
SEV 2: A. Major degradation/feature loss for a SUBSET of members; B. Minor degradation/feature loss for ALL members
SEV 3: Noticeable non-critical feature loss or degradation
SEV 4: No visible impact; loss of redundancy or capacity headroom
SEV 5: No impact, but unexpected failure
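The five severity levels above map naturally onto an ordered enumeration. The following is a minimal Python sketch, not anything shown in the talk: the names `Severity`, `DESCRIPTIONS`, `is_visible_to_members`, and `worst` are hypothetical, and treating SEV 1-3 as member-visible is an assumption read off the slide wording (SEV 4 says "no visible impact", SEV 5 says "no-impact").

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the SEV slides; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Descriptions paraphrased from the SEV 1-5 slides.
DESCRIPTIONS = {
    Severity.SEV1: "Total loss of service, severe degradation, or loss of a critical feature",
    Severity.SEV2: "Major degradation for a subset of members, or minor degradation for all members",
    Severity.SEV3: "Noticeable non-critical feature loss or degradation",
    Severity.SEV4: "No visible impact; loss of redundancy or capacity headroom",
    Severity.SEV5: "No impact, but unexpected failure",
}


def is_visible_to_members(sev: Severity) -> bool:
    """Assumption: SEV 1-3 describe impact that members can see."""
    return sev <= Severity.SEV3


def worst(severities):
    """Return the most severe level in a collection (the smallest number)."""
    return min(severities)


if __name__ == "__main__":
    for sev in Severity:
        print(f"SEV {sev.value}: {DESCRIPTIONS[sev]}")
```

Using an ordered type means "most severe" is simply the minimum, which is convenient when one incident touches several systems with different levels.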
5/11/2011 - Payments/Checkout system issue   Start                                     4:10pm   TTD                       ...
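As a quick check of Impact Time = TTR - Start, the sketch below plugs in the clock times from the 5/11/2011 Payments/Checkout example (Start 4:10pm, TTD 4:15pm, TTR 4:30pm, Stable 4:35pm). It is illustrative Python, not tooling from the talk; the variable names and the `minutes` helper are hypothetical, and the extra intervals (time to detect, resolve-to-stable) are derived here rather than stated on the slide.

```python
from datetime import datetime, timedelta

# Clock times from the 5/11/2011 Payments/Checkout example, in 24-hour form.
DAY = (2011, 5, 11)
start = datetime(*DAY, 16, 10)   # Start  4:10pm
ttd = datetime(*DAY, 16, 15)     # TTD    4:15pm (problem detected)
ttr = datetime(*DAY, 16, 30)     # TTR    4:30pm (problem resolved)
stable = datetime(*DAY, 16, 35)  # Stable 4:35pm ("all clear")


def minutes(delta: timedelta) -> int:
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)


impact_time = ttr - start          # Impact Time = TTR - Start
time_to_detect = ttd - start       # derived: how long until detection
resolve_to_stable = stable - ttr   # derived: gap between resolution and all clear

print(minutes(impact_time))        # 20, matching "Total Impact 20 min"
print(minutes(time_to_detect))     # 5
print(minutes(resolve_to_stable))  # 5
```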
Basic metrics, Timeline with details, Remediations/Observations
Timeline (x-axis: Time): “normal” operation, Incident, PostMortem
How? Why? Prevention?
How
Crisis PatternsProblem Starts                           PostMortem                 Time
Crisis PatternsProblem Starts      Detection                           PostMortem                  Time
Crisis PatternsProblem Starts      Detection             Evaluation                                 PostMortem            ...
Crisis PatternsProblem Starts      Detection             Evaluation                     Response                          ...
Crisis PatternsProblem Starts      Detection             Evaluation                     Response                          ...
Crisis PatternsProblem Starts      Detection             Evaluation                     Response                          ...
Crisis PatternsProblem Starts      Detection             Evaluation                     Response                          ...
Crisis PatternsProblem Starts                          Stress      Detection             Evaluation                     Re...
Crisis PatternsProblem Starts                           PostMortem                 Time
Crisis PatternsProblem Starts      Detection                           PostMortem                  Time
Crisis PatternsProblem Starts      Detection   Evaluation                               PostMortem                  Time
Crisis PatternsProblem Starts      Detection   Evaluation                    Response                               PostMo...
Crisis PatternsProblem Starts      Detection   Evaluation                    Response                          Stable     ...
Crisis PatternsProblem Starts      Detection   Evaluation                    Response                          Stable     ...
Crisis PatternsProblem Starts      Detection   Evaluation                    Response                          Stable     ...
Crisis PatternsProblem Starts                 Stress      Detection   Evaluation                    Response              ...
Crisis PatternsProblem Starts                           PostMortem                 Time
Crisis PatternsProblem Starts      Detection                           PostMortem                  Time
Crisis PatternsProblem Starts      Detection          Evaluation                              PostMortem                  ...
Crisis PatternsProblem Starts      Detection          Evaluation               Response                                 Po...
Crisis PatternsProblem Starts      Detection          Evaluation               Response                                 St...
Crisis PatternsProblem Starts      Detection          Evaluation               Response                                 St...
Crisis PatternsProblem Starts      Detection          Evaluation               Response                                 St...
Crisis PatternsProblem Starts           Stress  Detection      Evaluation           Response                             S...
Problem Starts      Detection          Evaluation               Response                                 Stable           ...
Problem Starts      Detection          Evaluation               Response                                 Stable           ...
Problem Starts      Detection          Evaluation               Response                                 Stable           ...
Crisis PatternsProblem Starts      Detection          Evaluation               Response                                 St...
Crisis Patterns
Crisis Patterns: Forced beyond learned roles
Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see
Crisis PatternsForced beyond learned rolesActions whose consequences are both important anddifficult to seeCognitively and p...
Crisis PatternsForced beyond learned rolesActions whose consequences are both important anddifficult to seeCognitively and p...
Thematic Vagabonding (“butterfly minds”): NOT STUCK ENOUGH
Goal Fixation (encystment): TOO STUCK
Heroism: Non-communicating lone wolf-isms
Distraction: Irrelevant noise in comm channels
IMPROVISATION: REQUIREMENT for troubleshooting complex systems
IMPROVISATION
IMPROVISATION
Why?
Root Cause Analysis“With the unknown, one is confronted with danger,discomfort, and care; the first instinct is to abolisht...
Hindsight BiasKnowledge of the outcome influences theanalysis of the processMakes steps towards failure appearforeseeable a...
After The Fact
After The Fact
Reality: Before and During
Hindsight Bias: “Should have known better”; “All the signs were there, you just needed to pay attention”
Hindsight BiasBEFORE accident:        The future seems implausibleAFTER accident:   Obviously clear: “how could they not s...
Hindsight Bias“...people’s need to be rightis stronger than their abilityto be objective.”                          N. Cra...
Outcome Bias: Judging a past decision based on its outcome.
ID          Fishbone/Ishikawa     FMEA                         Five Whys                Fault Tree      CED      CRT
why?       OUTAGE
why?because  this             OUTAGE
why? why?  because    this            OUTAGE
why? why?because   because  this      this                    OUTAGE
why?     why? why?   because   because     this      this                       OUTAGE
why?       why? why?because   because   because  this      this      this                              OUTAGE
why?        why? why?because    because   because  this       this      this                               OUTAGE         ...
which                       causedSome       Caused       otherAction   Some Things   things                              ...
Sequence-Of-Events
• Satisfyingly simple, easy to explain and document
•   Satisfyingly simple, easy to explain and document•   Solves for a specific case•   Ignorant of surrounding circumstance...
NOT HELPFUL•   Satisfyingly simple, easy to explain and document•   Solves for a specific case•   Ignorant of surrounding c...
Epidemiological        (adapted from Reason, 1990)
(adapted from Reason, 1990)
Holes = Active/Latent Failures,Bad Things™ Waiting to Happen                  (adapted from Reason, 1990)
Holes = Active/Latent Failures,                            Bad Things™ Waiting to HappenCheese = Safety Barriers,   Layers...
(adapted from Reason, 1990)
Code   Servers           (adapted from Reason, 1990)
Code        ServersSchedule     Training   (adapted from Reason, 1990)
Unmonitored Disk Space   (latent condition)
Capacity                     Unmonitored Disk Space  Miscalculation                        (latent condition)(latent condi...
Capacity                     Unmonitored Disk Space  Miscalculation                        (latent condition)(latent condi...
Violation of known coding     Capacity                standards         Unmonitored Disk Space  Miscalculation         (la...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Violation of known coding     Capacity                          standards         Unmonitored Disk Space  Miscalculation  ...
Capacity  Miscalculation(latent condition)                                                            Unit Test In        ...
Capacity  Miscalculation(latent condition)                                                            Unit Test In        ...
Capacity  Miscalculation(latent condition)                                                            Unit Test In        ...
Capacity  Miscalculation(latent condition)                                                            Unit Test In        ...
• Better than dominoes, but still linear layers of “defense”• Helps uncover multiple contributors and latent failures (at ...
• Better than dominoes, but still linear layers of “defense”• Helps uncover multiple contributors and latent failures (at ...
Better, but still                                               NOT HELPFUL• Better than dominoes, but still linear layers...
Multiple Contributors “each necessary but only jointly sufficient”
Resultant Versus Emergent
Systemic                DatabaseRouter                     Memcache    Webserver
Systemic                DatabaseRouter                        Memcache    Webserver                            Feature    ...
Systemic                DatabaseRouter                        Memcache                                  Last Round    Webs...
Systemic                        Dashboard                                 Design                DatabaseRouter            ...
Systemic                          Dashboard                                    Design                   Database  Router  ...
Systemic                          Dashboard                                    Design                   Database  Router  ...
Systemic                           Dashboard                                     Design                    Database  Route...
Systemic                           Dashboard                                     Design                    Database  Route...
Systemic                               Dashboard                                         Design                    Databas...
Systemic Hiring                                        Dashboard                                         Design        Dif...
Systemic Hiring                          S3 is slow    Dashboard                                         Design        Dif...
Systemic Hiring                          S3 is slow    Dashboard                                         Design        Dif...
Systemic
Systemic
Systemic
Systemic
Systemic
Systemic
Systemic
Functional Resonance: In isolation, components act within bounds. Interconnected, they produce emerging behaviors.
Causes Are Constructed, Not Found. WYLFIWYF (What You Look For Is What You Find): pre-conceived notions on “causes” and behaviors
Contributors, not causes
There is no root cause.
LEARNING
Quantifying ResponseTime to detect?Time for escalation, internal notification?Time to notify the public?Time to graceful de...
Qualifying ResponseHigh signal:noise in comm channels?Troubleshooting fatigue?Troubleshooting handoff?All tools on-hand?Met...
All Together Now•   Start/TTD/TTR/Stable/etc.•   Severity•   DATA (graphs, IRC, etc.)•   Description (timeline, etc.)•   O...
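One way to keep those pieces together is a single record per incident. The sketch below is an assumption about how such a record might look in Python, not the format used in the talk; the class name `PostMortem`, its field names, and the `missing_sections` helper are all hypothetical, chosen to mirror the checklist on the "All Together Now" slide. The example instance reuses only the times and severity from the 5/11/2011 Payments/Checkout slide and leaves the other sections empty.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PostMortem:
    """Hypothetical container mirroring the 'All Together Now' checklist."""
    title: str
    start: datetime
    detected: datetime   # TTD clock time
    resolved: datetime   # TTR clock time
    stable: datetime     # "all clear"
    severity: int
    data: List[str] = field(default_factory=list)          # graphs, IRC logs, etc.
    description: str = ""                                   # timeline narrative
    observations: List[str] = field(default_factory=list)  # motivations, latent conditions
    actions: List[str] = field(default_factory=list)       # remediation tickets, follow-up

    def impact_minutes(self) -> int:
        """Impact Time = TTR - Start, in whole minutes."""
        return int((self.resolved - self.start).total_seconds() // 60)

    def missing_sections(self) -> List[str]:
        """Names of checklist sections that are still empty."""
        sections = {
            "data": self.data,
            "description": self.description,
            "observations": self.observations,
            "actions": self.actions,
        }
        return [name for name, value in sections.items() if not value]


example = PostMortem(
    title="Payments/Checkout system issue",
    start=datetime(2011, 5, 11, 16, 10),
    detected=datetime(2011, 5, 11, 16, 15),
    resolved=datetime(2011, 5, 11, 16, 30),
    stable=datetime(2011, 5, 11, 16, 35),
    severity=1,
)

print(example.impact_minutes())
print(example.missing_sections())
```

A helper like `missing_sections` is one possible nudge toward recording observations and actions, not just the basic metrics.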
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
Human Error“...knowledge and error flow from thesame mental sources, only successcan tell one from the other.”             ...
Human Error: Nobody comes to work to do a bad job.
Human Error: Useless as a label and ending point.
Human Error: Human error isn’t a cause, it’s an effect.
Why did it make sense to the person at the time?
Error Categories: • Slips • Lapses • Mismatches • Violations
Error Categories
First Stories“Human error” seen as root cause.Counterfactuals: saying what they “should”have done.Prevention: be “more car...
Second Stories“Human error” seen as systemic vulnerabilities,deeper inside the organization.Digging into why it made sense...
Why did it make sense to the person at the time?
Why did it make sense to the person at the time?
Why did it make sense to the person at the time?
Substitution Test: Could peers have made the same error under the same circumstances?
Near MissesHey everybody -Don’t be like me. I tried to X, butbecause it was no good.It almost exploded everyone.So, don’t ...
WHERE to learn from ?
Two Propositions
100 deploys6 deploy-related issues
100 > 6
Proposition #1: “Ways in which things go right are special cases of the ways in which things go wrong.”
Proposition #1Successes = failures gone wrongStudy the failures, generalize from that.                   6 100    data sou...
Proposition #2: “Ways in which things go wrong are special cases of the ways in which things go right.”
Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Data sources: 94 out of 100
94/100? OR 6/100?
What and WHY Do Things Go RIGHT?
Not just: why did we fail? But also: why did we succeed?
Taking the New View: Recognize that human error is an attribution.
Taking the New View: Pursue Second Stories.
Taking the New View: Escape Hindsight Bias.
Taking the New View: Understand work as performed at the sharp end.
Taking the New View: Examine how changes (at all layers) will produce new vulnerabilities.
Taking the New View: Use technology to support and enhance human expertise.
Taking the New View: Tame complexity through new forms of feedback.
Taking the New View: Realize that your systems are not inherently safe.
Taking the New View: Human error is an inevitable by-product of strained complex systems.
Taking the New View: Human error isn’t at the root of your safety problems.
Taking the New View: Human error isn’t random.
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
Just Culture
Just Culture: Balancing accountability with learning.
Intentional Malice
Negligence  Found         Severity of the Outage
Name, Blame, Shame
Name, Blame, Shame
Name, Blame, Shame: WHY? #!@%$&#
“Must set an example!”
“Has to be some fear that not doing one’s job correctly could lead to punishment.”
“Must set an example!”       Punishing       Deterrents is a not“Has to be some fear that            Losing coulddoing one...
Holding People Accountable          !=     Blaming People
Accountability    and  Learning                 Punishment For Errors
No Bad Apples, Only Bad Theories of Error
Name, Blame, Shame: WHY? #!@%$&#
Signs Of Old View: “Gross Misconduct”, “Carelessness”, “Negligence”, “Egregious Behavior”, “Willful Violations”
Discretionary Spaces
Acceptable / Unacceptable
Acceptable / Unacceptable
Acceptable / Unacceptable
Acceptable / Unacceptable
Acceptable / Unacceptable (who draws this subjective line?)
Increase Accountability By Supporting Learning
Empower People: Let them own their own stories. Don’t make people pay penalties. Allow them to educate the organization.
Reduce Uncertainty: Make it clear who defines acceptable behavior.
Organizational Roots: Accountability = Responsibility + Requisite Authority
(Anticipation)                   (Response)    Knowing         Knowing        Knowing      Knowing      What            Wh...
(Thanks, Fellas) Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
Homework!
THE END
•                                             Photo Credits    http://www.flickr.com/photos/51035644987@N01/2678090600/•   ...
This is a talk I gave at the Code As Craft Etsy tech talk series.

Published in: Technology, Business
Outages, PostMortems, and Human Error

  1. 1. Outages, Post-Mortems, and Human Error 101 (John Allspaw, SVP, Tech Ops)
  2. 2. Science, Time Travel, Mythbusting, Reading, Homework
  3. 3. Here To Challenge You
  4. 4. Resilience Engineering: Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  5. 5. COMPLEX +DYNAMIC
  6. 6. COMPLEX !=COMPLICATED
  7. 7. Fundamental Surprises
  8. 8. E.T.T.O. Efficiency Thoroughness
  9. 9. Organizations, Policies, Procedures, Regulations BLUNT Resources & Constraints
  10. 10. Organizations, Policies, Procedures, Regulations BLUNT Resources & Constraints Slips Adjustments OperatorsCompensations Mistakes Lapses Recoveries Violations Improvisations SHARP
  11. 11. FUTURE + PAST
  12. 12. Why Do Them?
  13. 13. Why? Understand the FAILURE
  14. 14. Why? Understand the SYSTEM
  15. 15. Where “System” = Networks Servers Applications Processes People
  16. 16. Where “System” = Networks Servers Applications Processes People
  17. 17. PEOPLE
  18. 18. (Anticipation) Knowing What To Expect
  19. 19. (Anticipation) Knowing Knowing What What To Expect To Look For (Monitoring)
  20. 20. (Anticipation) (Response) Knowing Knowing Knowing What What What To Expect To Look For To Do (Monitoring)
  21. 21. (Anticipation) (Response) Knowing Knowing Knowing Knowing What What What What To Expect To Look For To Do Has Happened (Monitoring) (Learning)
  22. 22. (Anticipation) (Response) Knowing Knowing Knowing Knowing What What What What To Expect To Look For To Do Has Happened (Monitoring) (Learning)
  23. 23. Microphones are ON?
  24. 24. Event Awareness: Code Deploys
  25. 25. Timeline: RICH DATA
  26. 26. Traces of Data, Timeline: RICH DATA
  27. 27. Status Blog, Twitter Feed
  28. 28. Flight Data Recorder
  29. 29. Annotation Traces
  30. 30. Investigation Basics
  31. 31. Start?
  32. 32. TTD (time to detect): How?
  33. 33. TTR (time to resolve)
  34. 34. Stable (“all clear”)
  35. 35. Impact Time = TTR - Start
  36. 36. SEVERITY
  37. 37. SEV 1: A. Total loss of service; B. Severe degradation, effectively unusable; C. Loss of a critical feature
  38. 38. SEV 2: A. Major degradation/feature loss for a SUBSET of members; B. Minor degradation/feature loss for ALL members
  39. 39. SEV 3: Noticeable non-critical feature loss or degradation
  40. 40. SEV 4: No visible impact; loss of redundancy or capacity headroom
  41. 41. SEV 5: No impact, but unexpected failure
  42. 42. 5/11/2011 - Payments/Checkout system issue: Start 4:10pm, TTD 4:15pm, TTR 4:30pm, Stable 4:35pm, Total Impact 20 min, Severity 1
  43. 43. Basic metrics, Timeline with details, Remediations/Observations
  44. 44. Timeline (x-axis: Time): “normal” operation, Incident, PostMortem
  45. 45. How? Why? Prevention?
  46. 46. How
  47-70. Crisis Patterns (timeline figure built up one label per slide; the same build runs three times, on slides 47-54, 55-62, and 63-70). Labels: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem, along a Time axis; a Stress label is added on the last slide of each build
  71-73. The same timeline without the Crisis Patterns title and without Stress; slide 73 adds Internal and External Update Communications
  74. 74. Crisis Patterns: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, Time
  75. 75. Crisis Patterns
  76-79. Crisis Patterns (one point added per slide): Forced beyond learned roles; Actions whose consequences are both important and difficult to see; Cognitively and perceptively noisy; Coordinative load increases exponentially
  80. 80. Thematic Vagabonding (“butterfly minds”): NOT STUCK ENOUGH
  81. 81. Goal Fixation (encystment): TOO STUCK
  82. 82. Heroism: Non-communicating lone wolf-isms
  83. 83. Distraction: Irrelevant noise in comm channels
  84. 84. IMPROVISATION: REQUIREMENT for troubleshooting complex systems
  85. 85. IMPROVISATION
  86. 86. IMPROVISATION
  87. 87. Why?
  88. 88. Root Cause Analysis: “With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none.” Friedrich Nietzsche, Twilight of the Idols, or How to Philosophize with a Hammer
  89. 89. Hindsight Bias: Knowledge of the outcome influences the analysis of the process. Makes steps towards failure appear foreseeable and obvious
  90. 90. After The Fact
  91. 91. After The Fact
  92. 92. Reality: Before and During
  93. 93. Hindsight Bias: “Should have known better”; “All the signs were there, you just needed to pay attention”
  94. 94. Hindsight Bias: BEFORE accident: The future seems implausible. AFTER accident: Obviously clear: “how could they not see what mistake they were about to make?!”
  95. 95. Hindsight Bias: “...people’s need to be right is stronger than their ability to be objective.” N. Crawford, American Psychological Association
  96. 96. Outcome Bias: Judging a past decision based on its outcome.
  97. 97. ID Fishbone/Ishikawa FMEA Five Whys Fault Tree CED CRT
  98-104. Chain of “why?” questions (figure, one link added per slide): starting from OUTAGE, each “why?” is answered with “because this”, three times over; slide 104 ends with “but: WHY?”
  105. 105. Some Action caused Some Things, which caused other things to happen: OUTAGE
  106. 106. Sequence-Of-Events
  107. 107. • Satisfyingly simple, easy to explain and document
  108. 108. • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  109. 109. NOT HELPFUL: • Satisfyingly simple, easy to explain and document • Solves for a specific case • Ignorant of surrounding circumstances • Too focused on components • Validates Hindsight and Outcome bias
  110. 110. Epidemiological (adapted from Reason, 1990)
  111. 111. (adapted from Reason, 1990)
  112. 112. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen (adapted from Reason, 1990)
  113. 113. Holes = Active/Latent Failures, Bad Things™ Waiting to Happen. Cheese = Safety Barriers, Layers of Defense (adapted from Reason, 1990)
  114. 114. (adapted from Reason, 1990)
  115. 115. Code, Servers (adapted from Reason, 1990)
  116. 116. Code, Servers, Schedule, Training (adapted from Reason, 1990)
  117-129. Figure built up one item per slide: Unmonitored Disk Space (latent condition); Capacity Miscalculation (latent condition); Unit Test In Transition; Violation of known coding standards (latent condition); Bug Introduced (active failure); External API call (active failure). Slide 129: FAILURE!
  130-133. Same figure with only Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure). Slide 133: NO FAILURE!
  134. 134. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends)
  135. 135. • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  136. 136. Better, but still NOT HELPFUL: • Better than dominoes, but still linear layers of “defense” • Helps uncover multiple contributors and latent failures (at sharp and blunt ends) • Doesn’t explain lineups or orientation of holes • Only identifies defects/gaps, nothing more • Still encourages judgements of decisions
  137. 137. Multiple Contributors “each necessary but only jointly sufficient”
  138. 138. Resultant Versus Emergent
  150. 150. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA’s car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
  158. 158. Functional Resonance: In isolation, components act within bounds. Interconnected, they produce emerging behaviors.
  159. 159. Causes Are Constructed, Not Found. WYLFIWYF (What You Look For Is What You Find): pre-conceived notions on “causes” and behaviors
  160. 160. Contributors, not causes
  161. 161. There is no root cause.
  162. 162. LEARNING
  163. 163. Quantifying Response
       Time to detect?
       Time for escalation, internal notification?
       Time to notify the public?
       Time to graceful degradation? (feature off)
       Time to stable/resolve?
       Time to all clear?
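The questions on this slide can all be answered from timestamps, if the timeline records when each milestone happened. Below is a minimal sketch in Python, assuming a hypothetical timeline keyed by event name. The event names and timestamps are invented for illustration and do not come from the talk.

```python
from datetime import datetime, timedelta

# Hypothetical incident timeline. Event names and times are illustrative only.
timeline = {
    "start": datetime(2011, 1, 1, 10, 0),
    "detected": datetime(2011, 1, 1, 10, 7),
    "escalated": datetime(2011, 1, 1, 10, 12),
    "public_notified": datetime(2011, 1, 1, 10, 20),
    "degraded_gracefully": datetime(2011, 1, 1, 10, 25),
    "stable": datetime(2011, 1, 1, 11, 5),
    "all_clear": datetime(2011, 1, 1, 11, 30),
}


def response_intervals(events):
    """Return the elapsed time from 'start' to every other recorded event."""
    start = events["start"]
    return {name: ts - start for name, ts in events.items() if name != "start"}


intervals = response_intervals(timeline)
for name, elapsed in intervals.items():
    print(f"time to {name}: {elapsed}")
```

Measuring every interval from the same start point keeps the numbers comparable across incidents. Other baselines, such as time from detection to resolution, may be just as useful.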
  164. 164. Qualifying Response
       High signal:noise in comm channels?
       Troubleshooting fatigue?
       Troubleshooting handoff?
       All tools on-hand?
       Metrics visibility?
       Collaborative and skillful communication?
       Improvised tooling or solutions?
  165. 165. All Together Now
       • Start/TTD/TTR/Stable/etc.
       • Severity
       • DATA (graphs, IRC, etc.)
       • Description (timeline, etc.)
       • Observations (motivations, latent conditions, etc.)
       • Actions (remediation tickets, followup)
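One way to keep these pieces together is a single record per incident. The following is a rough sketch, not a prescribed format. The class name `PostMortem`, its field names, and the sample values are all assumptions made for illustration. The `impact_time` property follows the "Impact Time = TTR - Start" definition given earlier in the deck.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class PostMortem:
    """Hypothetical container for the items listed on the slide."""
    start: datetime      # when the impact began
    detected: datetime   # TTD timestamp
    resolved: datetime   # TTR timestamp
    stable: datetime     # "all clear"
    severity: str        # e.g. "SEV 1A" or "SEV 2B"
    data: List[str] = field(default_factory=list)          # graphs, IRC logs, etc.
    description: str = ""                                   # narrative timeline
    observations: List[str] = field(default_factory=list)  # latent conditions, motivations
    actions: List[str] = field(default_factory=list)       # remediation tickets, followup

    @property
    def impact_time(self) -> timedelta:
        # Impact Time = TTR - Start
        return self.resolved - self.start


# Illustrative values only.
pm = PostMortem(
    start=datetime(2011, 1, 1, 10, 0),
    detected=datetime(2011, 1, 1, 10, 7),
    resolved=datetime(2011, 1, 1, 11, 5),
    stable=datetime(2011, 1, 1, 11, 30),
    severity="SEV 2B",
    observations=["capacity miscalculation (latent condition)"],
)
print(pm.impact_time)
```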
  166. 166. Knowing What To Expect (Anticipation) | Knowing What To Look For (Monitoring) | Knowing What To Do (Response) | Knowing What Has Happened (Learning)
  168. 168. Human Error: “...knowledge and error flow from the same mental sources, only success can tell one from the other.” Ernst Mach, 1905
  169. 169. Human Error: Nobody comes to work to do a bad job.
  170. 170. Human Error: Useless as a label and ending point.
  171. 171. Human Error: Human error isn’t a cause, it’s an effect.
  172. 172. Why did it make sense to the person at the time?
  173. 173. Error Categories
       • Slips
       • Lapses
       • Mismatches
       • Violations
  175. 175. First Stories
       “Human error” seen as root cause.
       Counterfactuals: saying what they “should” have done.
       Prevention: be “more careful!”
  176. 176. Second Stories
       “Human error” seen as systemic vulnerabilities, deeper inside the organization.
       Digging into why it made sense for them to do what they did, at the time they did it.
       Prevention...
  177. 177. Why did it make sense to the person at the time?
  180. 180. Substitution Test: Could peers have made the same error under the same circumstances?
  181. 181. Near Misses
       Hey everybody -
       Don’t be like me. I tried to X, but because it was no good.
       It almost exploded everyone.
       So, don’t do: (details about X)
       Love, Joe
  182. 182. WHERE to learn from?
  183. 183. Two Propositions
  184. 184. 100 deploys, 6 deploy-related issues
  185. 185. 100 > 6
  186. 186. Proposition #1: “Ways in which things go right are special cases of the ways in which things go wrong.”
  187. 187. Proposition #1: Successes = failures gone wrong. Study the failures, generalize from that. Data sources: 6 out of 100
  188. 188. Proposition #2: “Ways in which things go wrong are special cases of the ways in which things go right.”
  189. 189. Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Data sources: 94 out of 100
  190. 190. 94/100? OR 6/100?
  191. 191. What and WHY Do Things Go RIGHT?
  192. 192. Not just: why did we fail? But also: why did we succeed?
  193. 193. Taking the New View: Recognize that human error is an attribution.
  194. 194. Taking the New View: Pursue Second Stories.
  195. 195. Taking the New View: Escape Hindsight Bias.
  196. 196. Taking the New View: Understand work as performed at the sharp end.
  197. 197. Taking the New View: Examine how changes (at all layers) will produce new vulnerabilities.
  198. 198. Taking the New View: Use technology to support and enhance human expertise.
  199. 199. Taking the New View: Tame complexity through new forms of feedback.
  200. 200. Taking the New View: Realize that your systems are not inherently safe.
  201. 201. Taking the New View: Human error is an inevitable by-product of strained complex systems.
  202. 202. Taking the New View: Human error isn’t at the root of your safety problems.
  203. 203. Taking the New View: Human error isn’t random.
  204. 204. Knowing What To Expect (Anticipation) | Knowing What To Look For (Monitoring) | Knowing What To Do (Response) | Knowing What Has Happened (Learning)
  205. 205. Just Culture
  206. 206. Just Culture: Balancing accountability with learning.
  207. 207. Intentional Malice
  208. 208. Negligence Found Severity of the Outage
  209. 209. Name, Blame, Shame
  211. 211. Name, Blame, Shame. WHY? #!@%$&#
  212. 212. “Must set an example!”
  213. 213. “Has to be some fear that not doing one’s job correctly could lead to punishment.”
  214. 214. “Must set an example!” “Has to be some fear that not doing one’s job correctly could lead to punishment.” Overlay: Punishing Deterrents is a Losing Proposition
  215. 215. Holding People Accountable != Blaming People
  216. 216. Accountability and Learning Punishment For Errors
  217. 217. No Bad Apples, Only Bad Theories of Error
  218. 218. Name, Blame, Shame. WHY? #!@%$&#
  219. 219. Signs Of Old View
       “Gross Misconduct”
       “Carelessness”
       “Negligence”
       “Egregious Behavior”
       “Willful Violations”
  220. 220. Discretionary Spaces
  225. 225. Acceptable | (who draws this subjective line?) | Unacceptable
  226. 226. Increase Accountability By Supporting Learning
  227. 227. Empower People
       Let them own their own stories.
       Don’t make people pay penalties.
       Allow them to educate the organization.
  228. 228. Reduce Uncertainty: Make it clear who defines acceptable behavior.
  229. 229. Organizational Roots: Accountability = Responsibility + Requisite Authority
  230. 230. Knowing What To Expect (Anticipation) | Knowing What To Look For (Monitoring) | Knowing What To Do (Response) | Knowing What Has Happened (Learning)
  231. 231. (Thanks, Fellas) Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
  232. 232. Homework!
  233. 233. THE END
  234. 234. Photo Credits
       • http://www.flickr.com/photos/51035644987@N01/2678090600/
       • http://www.flickr.com/photos/67196253@N00/2941655917/
       • http://www.flickr.com/photos/stirwise/417629641
       • http://www.flickr.com/photos/38383999@N06/3888057995/
       • http://www.flickr.com/photos/94443490@N00/361543080/
       • http://www.flickr.com/photos/7729940@N06/4333396494/
       • http://www.flickr.com/photos/cpstorm/167418602
       • http://www.flickr.com/photos/63474264@N00/4366221069/
       • http://www.flickr.com/photos/proimos/4199675334
       • http://www.flickr.com/photos/30475691@N07/2862060992/
       • http://www.flickr.com/photos/14663487@N00/797755046/
       • http://www.flickr.com/photos/25080113@N06/5361445631/
