86. Crisis Patterns
Forced beyond learned roles
Actions whose consequences are both important and
difficult to see
Cognitively and perceptually noisy
Coordinative load increases exponentially
95. Root Cause Analysis
“With the unknown, one is confronted with danger,
discomfort, and care; the first instinct is to abolish
these painful states.
First principle: any explanation is better than
none.”
Friedrich Nietzsche
Twilight of the Idols, or
How to Philosophize with a Hammer
96. Hindsight Bias
Knowledge of the outcome influences the
analysis of the process
Makes steps towards failure appear
foreseeable and obvious
101. Hindsight Bias
BEFORE accident:
The future seems implausible
AFTER accident:
Obviously clear: “how could they not see what
mistake they were about to make?”
102. Hindsight Bias
“...people’s need to be right
is stronger than their ability
to be objective.”
N. Crawford
American Psychological Association
117. NOT HELPFUL
• Satisfyingly simple, easy to explain and document
• Solves for a specific case
• Ignorant of surrounding circumstances
• Too focused on components
• Validates Hindsight and Outcome bias
138. Capacity Miscalculation (latent condition)
Unmonitored Disk Space (latent condition)
Violation of known coding standards (latent condition)
Unit Test In Transition
Bug Introduced (active failure)
External API call (active failure)
FAILURE!
142. Capacity Miscalculation (latent condition)
Unit Test In Transition
Bug Introduced (active failure)
External API call (active failure)
NO FAILURE!
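The two outcomes can be read as a toy rule: the failure needs every contributor present at once, and removing any of them prevents it. The sketch below encodes only that reading. The all-or-nothing rule is an assumption drawn from the two cases shown on the slides, and the names `CONTRIBUTORS` and `outcome` are hypothetical.

```python
from typing import Iterable

# Contributors named on the slides (latent conditions and active failures).
CONTRIBUTORS = frozenset({
    "Capacity Miscalculation",
    "Unmonitored Disk Space",
    "Violation of known coding standards",
    "Unit Test In Transition",
    "Bug Introduced",
    "External API call",
})


def outcome(present: Iterable[str]) -> str:
    """Assumed rule: failure only when every contributor lines up."""
    if CONTRIBUTORS.issubset(present):
        return "FAILURE!"
    return "NO FAILURE!"


# First case: all contributors present.
all_present = set(CONTRIBUTORS)

# Second case: two latent conditions are absent.
two_removed = all_present - {
    "Unmonitored Disk Space",
    "Violation of known coding standards",
}

print(outcome(all_present))
print(outcome(two_removed))
```

This is deliberately linear, so it carries the same limitation as the model it illustrates.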
146. Better, but still
NOT HELPFUL
• Better than dominoes, but still linear layers of “defense”
• Helps uncover multiple contributors and latent failures (at
sharp and blunt ends)
• Doesn’t explain lineups or orientation of holes
• Only identifies defects/gaps, nothing more
• Still encourages judgements of decisions
160. Systemic
Database
Router
Memcache
Webserver
Last Round of Funding
Feature Roadmap
Dashboard Design
DBA’s car
No Eng Training
Email is down
Entire Ops Team Is At Velocity
Techcrunch Article
Hiring Difficulties
S3 is slow
173. Quantifying Response
Time to detect?
Time for escalation, internal notification?
Time to notify the public?
Time to graceful degradation? (feature off)
Time to stable/resolve?
Time to all clear?
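A minimal sketch of how these durations might be computed once the milestones of an incident have timestamps. The class name, field names, and example times are all hypothetical. Measuring every duration from the start of the incident is an assumption, not something the slides specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    # Field names are illustrative; one timestamp per question above.
    started: datetime
    detected: datetime
    escalated: datetime
    public_notified: datetime
    degraded_gracefully: datetime
    stable: datetime
    all_clear: datetime

    def durations(self) -> Dict[str, timedelta]:
        """Elapsed time from incident start to each milestone."""
        return {
            "time_to_detect": self.detected - self.started,
            "time_to_escalate": self.escalated - self.started,
            "time_to_notify_public": self.public_notified - self.started,
            "time_to_graceful_degradation": self.degraded_gracefully - self.started,
            "time_to_stable": self.stable - self.started,
            "time_to_all_clear": self.all_clear - self.started,
        }


# Made-up example times, for illustration only.
timeline = IncidentTimeline(
    started=datetime(2024, 1, 1, 12, 0),
    detected=datetime(2024, 1, 1, 12, 4),
    escalated=datetime(2024, 1, 1, 12, 6),
    public_notified=datetime(2024, 1, 1, 12, 15),
    degraded_gracefully=datetime(2024, 1, 1, 12, 20),
    stable=datetime(2024, 1, 1, 12, 50),
    all_clear=datetime(2024, 1, 1, 13, 30),
)

for name, elapsed in timeline.durations().items():
    print(f"{name}: {elapsed}")
```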
174. Qualifying Response
High signal:noise in comm channels?
Troubleshooting fatigue?
Troubleshooting handoff?
All tools on-hand?
Metrics visibility?
Collaborative and skillful communication?
Improvised tooling or solutions?
175. All Together Now
• Start/TTD/TTR/Stable/etc.
• Severity
• DATA (graphs, IRC, etc.)
• Description (timeline, etc.)
• Observations (motivations, latent conditions, etc.)
• Actions (remediation tickets, followup)
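The list above could be captured as one record per incident. This is a sketch under stated assumptions: the class name, field names, types, and example values are hypothetical, and `missing_sections` is an illustrative helper for spotting parts of a writeup that are still empty.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class PostmortemRecord:
    # Timing and severity
    start: datetime
    time_to_detect: timedelta
    time_to_resolve: timedelta
    severity: int
    # Supporting material: graphs, IRC logs, etc.
    data: List[str] = field(default_factory=list)
    # Narrative timeline of what happened
    description: str = ""
    # Motivations, latent conditions, etc.
    observations: List[str] = field(default_factory=list)
    # Remediation tickets and follow-up
    actions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of the free-form sections that are still empty."""
        sections = {
            "data": self.data,
            "description": self.description,
            "observations": self.observations,
            "actions": self.actions,
        }
        return [name for name, value in sections.items() if not value]


# Made-up example values, for illustration only.
record = PostmortemRecord(
    start=datetime(2024, 1, 1, 12, 0),
    time_to_detect=timedelta(minutes=4),
    time_to_resolve=timedelta(minutes=50),
    severity=2,
    description="Disk filled on the primary database host.",
    actions=["Add disk space alerting"],
)

print(record.missing_sections())
```

Keeping observations separate from actions mirrors the slide: what was learned is recorded on its own, not only as a list of tickets.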
176. Knowing What To Expect (Anticipation)
Knowing What To Look For (Monitoring)
Knowing What To Do (Response)
Knowing What Has Happened (Learning)
178. Human Error
“...knowledge and error flow from the
same mental sources, only success
can tell one from the other.”
Ernst Mach, 1905
185. First Stories
“Human error” seen as root cause.
Counterfactuals: saying what they “should”
have done.
Prevention: be “more careful!”
186. Second Stories
“Human error” seen as systemic vulnerabilities,
deeper inside the organization.
Digging into why it made sense for them to do what they
did, at the time they did it.
Prevention...
187. Why did it make sense
to the person
at the time?
201. Not just:
why did we fail?
But also:
why did we succeed?
202. Near Misses
Hey everybody -
Don’t be like me. I tried to X, but
because it was no good.
It almost exploded everyone.
So, don’t do: (details about X)
Love,
Joe
203. Taking the New View
Recognize that human error is
an attribution.
224. “Has to be some fear that not
doing one’s job correctly could
lead to punishment.”
225. “Must set an example!”
“Has to be some fear that not
doing one’s job correctly could
lead to punishment.”
Punishing Deterrents is a Losing Proposition