86. Crisis Patterns
Forced beyond learned roles
Actions whose consequences are both important and
difficult to see
Cognitively and perceptively noisy
Coordinative load increases exponentially
95. Root Cause Analysis
"With the unknown, one is confronted with danger,
discomfort, and care; the first instinct is to abolish
these painful states.
First principle: any explanation is better than
none."
Friedrich Nietzsche
Twilight of the Idols, or
How to Philosophize with a Hammer
96. Hindsight Bias
Knowledge of the outcome influences the
analysis of the process
Makes steps towards failure appear
foreseeable and obvious
101. Hindsight Bias
BEFORE accident:
The future seems implausible
AFTER accident:
Obviously clear: "how could they not see what
mistake they were about to make"
102. Hindsight Bias
"...people's need to be right
is stronger than their ability
to be objective."
N. Crawford
American Psychological Association
117. NOT HELPFUL
• Satisfyingly simple, easy to explain and document
• Solves for a specific case
• Ignorant of surrounding circumstances
• Too focused on components
• Validates Hindsight and Outcome bias
138. [Diagram labels]
Latent conditions: Capacity Miscalculation; Violation of known coding standards; Unmonitored Disk Space
Active failures: Bug Introduced; External API call
Also shown: Unit Test In Transition
FAILURE!
142. [Diagram labels]
Latent condition: Capacity Miscalculation
Active failures: Bug Introduced; External API call
Also shown: Unit Test In Transition
NO FAILURE!
146. Better, but still
NOT HELPFUL
• Better than dominoes, but still linear layers of "defense"
• Helps uncover multiple contributors and latent failures (at
sharp and blunt ends)
• Doesn't explain lineups or orientation of holes
• Only identifies defects/gaps, nothing more
• Still encourages judgements of decisions
160. Systemic
[Diagram labels]
Database, Router, Memcache, Webserver, S3 is slow, Email is down,
Dashboard Design, Feature Roadmap, Last Round of Funding,
Hiring Difficulties, No Eng Training, Entire Ops Team Is At Velocity,
Techcrunch Article, DBA's car
173. Quantifying Response
Time to detect?
Time for escalation, internal notification?
Time to notify the public?
Time to graceful degradation? (feature off)
Time to stable/resolve?
Time to all clear?
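
The questions above can be answered from a timestamped incident timeline. What follows is a minimal sketch, not a prescribed method: it assumes every interval is measured from the start of the incident, and the milestone names and timestamps are hypothetical values made up for illustration.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def response_intervals(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Return the elapsed time from incident start to each later milestone.

    Assumption: the timeline contains a "start" key, and every other key
    is a milestone reached at or after that moment.
    """
    start = timeline["start"]
    return {
        name: reached - start
        for name, reached in timeline.items()
        if name != "start"
    }


# Hypothetical incident; these names and times are illustrative only.
example_timeline = {
    "start": datetime(2024, 1, 1, 12, 0),
    "detected": datetime(2024, 1, 1, 12, 7),
    "escalated": datetime(2024, 1, 1, 12, 15),
    "public_notified": datetime(2024, 1, 1, 12, 30),
    "degraded_gracefully": datetime(2024, 1, 1, 12, 40),
    "stable": datetime(2024, 1, 1, 13, 20),
    "all_clear": datetime(2024, 1, 1, 14, 0),
}

if __name__ == "__main__":
    for milestone, elapsed in response_intervals(example_timeline).items():
        print(f"time to {milestone}: {elapsed}")
```

Recording the raw timestamps rather than the intervals keeps the option of measuring from a different reference point later, for example from detection instead of from start.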
174. Qualifying Response
High signal:noise in comm channels?
Troubleshooting fatigue?
Troubleshooting handoff?
All tools on-hand?
Metrics visibility?
Collaborative and skillful communication?
Improvised tooling or solutions?
175. All Together Now
• Start/TTD/TTR/Stable/etc.
• Severity
• DATA (graphs, IRC, etc.)
• Description (timeline, etc.)
• Observations (motivations, latent conditions, etc.)
• Actions (remediation tickets, followup)
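
One way to keep these items together is a single record per incident. The sketch below is only an illustration under stated assumptions: the class name, field names, and integer severity scale are hypothetical, and TTD and TTR are read here as "time to detect" and "time to resolve", both measured from start.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class PostmortemRecord:
    """Hypothetical container for the postmortem items listed above."""

    start: datetime
    detected: datetime
    resolved: datetime
    severity: int                      # assumed scale, e.g. 1 = most severe
    stable: Optional[datetime] = None  # may be unknown when the record is opened
    data: List[str] = field(default_factory=list)          # graphs, IRC logs, etc.
    description: str = ""                                   # timeline narrative
    observations: List[str] = field(default_factory=list)  # motivations, latent conditions
    actions: List[str] = field(default_factory=list)       # remediation tickets, followup

    @property
    def ttd(self) -> timedelta:
        """Time to detect, assumed to run from start to detection."""
        return self.detected - self.start

    @property
    def ttr(self) -> timedelta:
        """Time to resolve, assumed to run from start to resolution."""
        return self.resolved - self.start


# Hypothetical usage with made-up values.
record = PostmortemRecord(
    start=datetime(2024, 1, 1, 12, 0),
    detected=datetime(2024, 1, 1, 12, 5),
    resolved=datetime(2024, 1, 1, 13, 0),
    severity=2,
    observations=["disk space was not monitored (latent condition)"],
    actions=["open a ticket to add a disk space alert"],
)
```

Keeping observations and actions as separate lists mirrors the separation in the list above between what was learned and what will be done about it.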
176. (Anticipation) Knowing What To Expect
(Monitoring) Knowing What To Look For
(Response) Knowing What To Do
(Learning) Knowing What Has Happened
178. Human Error
"...knowledge and error flow from the
same mental sources, only success
can tell one from the other."
Ernst Mach, 1905
185. First Stories
"Human error" seen as root cause.
Counterfactuals: saying what they "should"
have done.
Prevention: be "more careful!"
186. Second Stories
"Human error" seen as systemic vulnerabilities,
deeper inside the organization.
Digging into why it made sense for them to do what they
did, at the time they did it.
Prevention...
187. Why did it make sense
to the person
at the time?
201. Not just:
why did we fail?
But also:
why did we succeed?
202. Near Misses
Hey everybody -
Don't be like me. I tried to X, but
because it was no good.
It almost exploded everyone.
So, don't do: (details about X)
Love,
Joe
203. Taking the New View
Recognize that human error is
an attribution.
224. "Has to be some fear that not
doing one's job correctly could
lead to punishment."
225. Punishing is a Losing Proposition
"Must set an example!"
Deterrents
"Has to be some fear that not
doing one's job correctly could
lead to punishment."