Crisis Patterns
Forced beyond learned roles
Actions whose consequences are both important and
difficult to see
Cognitively and perceptually noisy
Coordinative load increases exponentially
Root Cause Analysis
“With the unknown, one is confronted with danger,
discomfort, and care; the first instinct is to abolish
these painful states.
First principle: any explanation is better than
none.”
Friedrich Nietzsche
Twilight of the Idols, or
How to Philosophize with a Hammer
Hindsight Bias
Knowledge of the outcome influences the
analysis of the process
Makes steps towards failure appear
foreseeable and obvious
Hindsight Bias
BEFORE accident:
The future seems implausible
AFTER accident:
Obviously clear: “how could they not see what
mistake they were about to make”
Hindsight Bias
“...people’s need to be right
is stronger than their ability
to be objective.”
N. Crawford
American Psychological Association
• Satisfyingly simple, easy to explain and document
• Solves for a specific case
• Ignorant of surrounding circumstances
• Too focused on components
• Validates Hindsight and Outcome bias
NOT HELPFUL
[Diagram: contributors added one at a time as layers of defense, until the holes line up]
• Violation of known coding standards (latent condition)
• Capacity Miscalculation (latent condition)
• Unmonitored Disk Space (latent condition)
• Unit Test In Transition
• Bug Introduced (active failure)
• External API call (active failure)
FAILURE!
[Diagram: the same layers with two latent conditions removed]
• Capacity Miscalculation (latent condition)
• Unit Test In Transition
• Bug Introduced (active failure)
• External API call (active failure)
NO FAILURE!
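One simple reading of the two diagrams above, sketched in Python: the failure happens only when every contributor is present at once. The all-or-nothing rule and the function name `incident_occurs` are assumptions made for illustration; the slides only show one combination that fails and one that does not.

```python
# Contributors named on the FAILURE! slide.
ALL_CONTRIBUTORS = frozenset({
    "Violation of known coding standards",
    "Capacity Miscalculation",
    "Unmonitored Disk Space",
    "Unit Test In Transition",
    "Bug Introduced",
    "External API call",
})


def incident_occurs(present):
    """Assumed rule: the failure needs every contributor at the same time."""
    return ALL_CONTRIBUTORS <= set(present)


# The NO FAILURE! slide: two latent conditions are absent.
without_two = ALL_CONTRIBUTORS - {
    "Violation of known coding standards",
    "Unmonitored Disk Space",
}

print(incident_occurs(ALL_CONTRIBUTORS))  # all holes line up
print(incident_occurs(without_two))       # same active failures, no incident
```

Note that the active failures are identical in both cases; only the latent conditions differ.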
• Better than dominoes, but still linear layers of “defense”
• Helps uncover multiple contributors and latent failures (at
sharp and blunt ends)
• Doesn’t explain lineups or orientation of holes
• Only identifies defects/gaps, nothing more
• Still encourages judgements of decisions
Better, but still
NOT HELPFUL
Systemic
[Diagram: a web of interrelated contributors, added one at a time]
• Database
• Router
• Memcache
• Webserver
• Last Round of Funding
• Feature Roadmap
• Dashboard Design
• DBA’s car
• No Eng Training
• Email is down
• Entire Ops Team Is At Velocity
• Techcrunch Article
• Hiring Difficulties
• S3 is slow
Quantifying Response
Time to detect?
Time for escalation, internal notification?
Time to notify the public?
Time to graceful degradation? (feature off)
Time to stable/resolve?
Time to all clear?
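A minimal sketch of how these questions could be answered from a recorded timeline, assuming each milestone is stored as a timestamp. The milestone names and the example times are invented for illustration; the slide lists only the questions.

```python
from datetime import datetime, timedelta
from typing import Dict

# Hypothetical milestone names, one per question on the slide.
MILESTONES = [
    "detected",
    "escalated",
    "public_notified",
    "degraded",
    "stable",
    "all_clear",
]


def response_intervals(timeline: Dict[str, datetime]) -> Dict[str, timedelta]:
    """Elapsed time from incident start to each milestone that was recorded."""
    start = timeline["start"]
    return {
        "time_to_" + name: timeline[name] - start
        for name in MILESTONES
        if name in timeline
    }


# Invented example timeline; no public notification was recorded.
timeline = {
    "start": datetime(2024, 1, 1, 12, 0),
    "detected": datetime(2024, 1, 1, 12, 7),
    "escalated": datetime(2024, 1, 1, 12, 10),
    "degraded": datetime(2024, 1, 1, 12, 25),
    "stable": datetime(2024, 1, 1, 13, 5),
    "all_clear": datetime(2024, 1, 1, 13, 40),
}

for name, elapsed in response_intervals(timeline).items():
    print(name, elapsed)
```

Milestones that were never reached are skipped rather than reported as zero, so a missing answer stays visible as missing.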
Qualifying Response
High signal:noise in comm channels?
Troubleshooting fatigue?
Troubleshooting handoff?
All tools on-hand?
Metrics visibility?
Collaborative and skillful communication?
Improvised tooling or solutions?
All Together Now
• Start/TTD/TTR/Stable/etc.
• Severity
• DATA (graphs, IRC, etc.)
• Description (timeline, etc.)
• Observations (motivations, latent conditions, etc.)
• Actions (remediation tickets, followup)
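The list above could be captured as a simple record, sketched here in Python. The class name `Postmortem`, its field names, and the `missing_sections` helper are assumptions for illustration, not a format the slides prescribe.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Postmortem:
    severity: str
    # Milestone name -> timestamp string (start, TTD, TTR, stable, etc.)
    timeline: Dict[str, str] = field(default_factory=dict)
    # Links to graphs, chat logs, and other raw data
    data: List[str] = field(default_factory=list)
    description: str = ""
    # Motivations, latent conditions, and similar findings
    observations: List[str] = field(default_factory=list)
    # Remediation tickets and follow-up items
    actions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of the sections that have not been filled in yet."""
        sections = {
            "timeline": self.timeline,
            "data": self.data,
            "description": self.description,
            "observations": self.observations,
            "actions": self.actions,
        }
        return [name for name, value in sections.items() if not value]


draft = Postmortem(severity="high")
draft.actions.append("Add a disk space alert")
print(draft.missing_sections())
```

A check like `missing_sections` only confirms that each section has something in it; it says nothing about the quality of what was written.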
• Anticipation: Knowing What To Expect
• Monitoring: Knowing What To Look For
• Response: Knowing What To Do
• Learning: Knowing What Has Happened
Human Error
“...knowledge and error flow from the
same mental sources, only success
can tell one from the other.”
Ernst Mach, 1905
First Stories
“Human error” seen as root cause.
Counterfactuals: saying what they “should”
have done.
Prevention: be “more careful!”
Second Stories
“Human error” seen as systemic vulnerabilities,
deeper inside the organization.
Digging into why it made sense for them to do what they
did, at the time they did it.
Prevention...
Not just:
why did we fail?
But also:
why did we succeed?
Near Misses
Hey everybody -
Don’t be like me. I tried to X, but
it was no good.
It almost exploded everyone.
So, don’t do: (details about X)
Love,
Joe
Taking the New View
Recognize that human error is
an attribution.