Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
Crisis Patterns
• Forced beyond learned roles
• Actions whose consequences are both important and difficult to see
• Cognitively and perceptively noisy
• Coordinative load increases exponentially
Root Cause Analysis
“With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none.”
Friedrich Nietzsche, Twilight of the Idols, or How to Philosophize with a Hammer
Hindsight Bias
• Knowledge of the outcome influences the analysis of the process
• Makes steps towards failure appear foreseeable and obvious
Hindsight Bias
• BEFORE accident: The future seems implausible
• AFTER accident: Obviously clear: “how could they not see what mistake they were about to make”
Hindsight Bias
“...people’s need to be right is stronger than their ability to be objective.”
N. Crawford, American Psychological Association
NOT HELPFUL
• Satisfyingly simple, easy to explain and document
• Solves for a specific case
• Ignorant of surrounding circumstances
• Too focused on components
• Validates Hindsight and Outcome bias
[Diagram labels, in sequence]
• Violation of known coding standards (latent condition)
• Unmonitored Disk Space (latent condition)
• Capacity Miscalculation (latent condition)
• Unit Test In Transition
• Bug Introduced (active failure)
• External API call (active failure)
• FAILURE!
Better, but still NOT HELPFUL
• Better than dominoes, but still linear layers of “defense”
• Helps uncover multiple contributors and latent failures (at sharp and blunt ends)
• Doesn’t explain lineups or orientation of holes
• Only identifies defects/gaps, nothing more
• Still encourages judgements of decisions
Systemic
[Diagram labels: Hiring Difficulties; S3 is slow; Dashboard Design; Database; Router; Memcache; Webserver; Last Round of Funding; Feature Roadmap; Email is down; Entire Ops Team Is At Velocity; No Eng Training; Techcrunch Article; DBA’s car]
Quantifying Response
• Time to detect?
• Time for escalation, internal notification?
• Time to notify the public?
• Time to graceful degradation? (feature off)
• Time to stable/resolve?
• Time to all clear?
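The timing questions above can be answered from a handful of timestamps noted during an incident. The following is a minimal Python sketch, not something from the talk: the milestone names (`detected`, `escalated`, `public_notice`, `degraded`, `stable`, `all_clear`) and the example timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical milestone names, one per question on the slide.
MILESTONES = [
    "detected",       # time to detect
    "escalated",      # time for escalation / internal notification
    "public_notice",  # time to notify the public
    "degraded",       # time to graceful degradation (feature off)
    "stable",         # time to stable/resolve
    "all_clear",      # time to all clear
]


def response_times(events: dict[str, datetime]) -> dict[str, timedelta]:
    """Return elapsed time from incident start to each recorded milestone.

    Milestones missing from `events` are skipped rather than guessed.
    """
    start = events["start"]
    return {name: events[name] - start for name in MILESTONES if name in events}


# Made-up timestamps for illustration only.
example_events = {
    "start": datetime(2011, 6, 14, 10, 0),
    "detected": datetime(2011, 6, 14, 10, 7),
    "escalated": datetime(2011, 6, 14, 10, 12),
    "stable": datetime(2011, 6, 14, 11, 30),
}

example_times = response_times(example_events)
for name, elapsed in example_times.items():
    print(f"{name}: {elapsed}")
```

Measuring every milestone from the same start time keeps the numbers comparable across incidents; intervals between milestones can be derived by subtracting one entry from another.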
Qualifying Response
• High signal:noise in comm channels?
• Troubleshooting fatigue?
• Troubleshooting handoff?
• All tools on-hand?
• Metrics visibility?
• Collaborative and skillful communication?
• Improvised tooling or solutions?
All Together Now
• Start/TTD/TTR/Stable/etc.
• Severity
• DATA (graphs, IRC, etc.)
• Description (timeline, etc.)
• Observations (motivations, latent conditions, etc.)
• Actions (remediation tickets, followup)
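One way to keep these fields together is a small record type. The sketch below is an assumption about how such a record could look in Python, not the format used in the talk: the class name `PostMortem`, the field names, the integer severity scale, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PostMortem:
    """Hypothetical postmortem record mirroring the fields on the slide."""
    start: datetime
    time_to_detect: timedelta              # TTD
    time_to_resolve: timedelta             # TTR
    severity: int                          # assumed numeric scale
    description: str = ""                  # timeline, etc.
    data: list[str] = field(default_factory=list)          # graphs, IRC logs
    observations: list[str] = field(default_factory=list)  # latent conditions
    actions: list[str] = field(default_factory=list)       # remediation tickets

    def render(self) -> str:
        """Render the record as a plain-text report."""
        lines = [
            f"Start: {self.start.isoformat()}",
            f"TTD: {self.time_to_detect}",
            f"TTR: {self.time_to_resolve}",
            f"Severity: {self.severity}",
            f"Description: {self.description}",
        ]
        sections = (
            ("Data", self.data),
            ("Observations", self.observations),
            ("Actions", self.actions),
        )
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


# Made-up example values for illustration only.
example_report = PostMortem(
    start=datetime(2011, 6, 14, 10, 0),
    time_to_detect=timedelta(minutes=7),
    time_to_resolve=timedelta(hours=1, minutes=30),
    severity=2,
    description="Disk filled on a database host.",
    observations=["Disk space was unmonitored (latent condition)"],
    actions=["Add disk space alerting"],
).render()
print(example_report)
```

Observations and actions are kept as separate lists so that what was learned stays distinct from the follow-up work it produced.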
First Stories
• “Human error” seen as root cause.
• Counterfactuals: saying what they “should” have done.
• Prevention: be “more careful!”
Second Stories
• “Human error” seen as systemic vulnerabilities, deeper inside the organization.
• Digging into why it made sense for them to do what they did, at the time they did it.
• Prevention...
Near Misses
Hey everybody -
Don’t be like me. I tried to X, but it was no good. It almost exploded everyone.
So, don’t do: (details about X)
Love, Joe