What Is a Post-Mortem Anyways?
• Something you do when your company has badly screwed up
• E.g. your CEO demos your cloud storage system to an early prospective customer, and, when he runs a search, it shows other customers’ data (I have done this, it was not awesome)
• You get a bunch of people into a room and say: “How on earth did that happen? And how can we make sure it never, ever happens again?”
• That’s a Post-Mortem
• But, there’s a problem...
Human Beings Will Eff It Up
• Humans (unlike robots) feel this intense emotion called shame
• Shame will suggest (strongly) “Slow Down, Stop Making So Many Mistakes”
• Aka “Destroy your company by way of opportunity costs, immediately!”
• Has potential to be incredibly damaging to your startup
• And I have some bad news...
You Will Totally Experience Shame (I Still Do)
F.A.E.
This Emotional Experience Cannot Be Avoided
• I’ve run c. 50 post-mortems, have studied failure... and I still have this emotional reaction
• You will, too. And so will your team.
• Much more strongly than you realize right now
• This is the “Fundamental Attribution Error” (FAE), from psychology
• FAE = humans vastly underestimate the power of a situation on our behavior
Big Idea: Adopt Economic, Not Moral Mindset
$, FTW
What Does That Mean?
• Let me tell you a story...
After Every Axe Murdering...
• Have to, like, hire a new guy, train him on the machine, takes forever
• Questions we asked before are now somehow deeply wrong:
• “What if we just cut down on the rate, so there’s less axe murdering?”
• “Hey, we can train a pool of temps on all the machines, when someone gets killed, we’ll just swap some new guy in, bang, problem solved!”
• “How much is it really costing us, anyways?”
• These ideas seem obscene, not merely bad
Moral Mindset = Axe Murderer
“Search for villains, elevation of accusers, and mobilization of authority to mete out punishment”
(Pinker, The Blank Slate)
Moral Mindset, Key Words
• “Villains”, “Accusers”, “Authority”, “Punishment”
• I believe that most companies, in investigating outages, act much more like they’re looking for an axe murderer than trying to fix a broken machine
Challenge #1, As Person Running Post-Mortems
Get team out of moral mindset.
Note: this is not, in fact, easy.
Why It’s Hard
• Mindsets control how we interpret the world...
• ...including what people say to us
• So, when a team sitting there, fearing moral censure, hears you say “We’re not looking to blame anyone”, they just think you’re lying. How could you mean that, when the thing that happened was so terrible and wrong?
• The deep trick (and this is the point of this whole presentation, frankly) is that you have to take advantage of the thing that separates humans and robots...
Humor == Breaking Frames
• That’s what humor actually is -- something that stretches or breaks the mental frame that people are using to interpret a situation
• So, you use humor to break the frame, release people from the blame/fear/punishment of the moral mindset, and then refocus them on the economic challenges you’re facing
• The humor is, IMHO, not a nice-to-have. It’s absolutely central. I’ve seen smart, caring leaders get this one wrong, and finish their post-mortems with a room full of tense, closed-up team members (and no good ideas on the table)
• Talk has specific examples of this, but this is a central point
Place The Bad Thing on a Continuum
• Moral mindset is very absolutist: this bad thing is The Worst Thing Ever
• I like to say “Okay, well it’s pretty bad, let’s compare it to some things”
• Did we irretrievably lose customer data? (I’ve done that, not awesome)
• Did we almost get our customer fired by her boss? (also, not awesome)
• Did we send hundreds of emails to everyone on our customer’s mailing list... but the emails were all question marks? For a customer who was in the proofreading business? (done that, very much not awesome)
• People laugh, and then say “Okay, how bad was this, really?” Win.
More Stories of Actual Failures (Just For Fun)
• Did we break our allergies-to-medicines module, and risk having a doctor prescribe the wrong medication to someone?
• Did our internet-connected home thermostat system have a server crash, causing all the thermostats to set the temp to the default... of 85 degrees?
• Did our high-frequency trading program have flaws that led to our company losing 450 million dollars? (that is a tough one to beat, IMHO)
• Collect your own! It’s fun!
Tip 2: Mock Hindsight Bias To Its Face
“Let’s plan for a future where we’re all as stupid as we are today.”
How Hindsight Bias Shows up in Post-Mortems
• Someone says “Oh, yeah, I screwed that one up, I knew I had to run the deploy in that one order, and I just forgot. I’m really sorry, I won’t make that mistake again, totally my bad.”
• You have to utterly reject this. It’s pure hindsight bias (easy to see errors after the fact, very difficult in the moment).
• I say “It’s like we’re saying ‘I was stupid, this one time, and we’ll fix that problem by never being stupid again.’”
• Hence: “planning for a future where we’re as stupid as we are today”
• aka “Must create a system which is resilient to occasional bouts of really intense stupidity”.
You Will Find That Your Code is a Mess
• E.g. you’ve refactored, and rewritten in Python (or Node or something), and moved to the cloud, but this 5 Whys is making clear that your most important report is still run by a VisualCron job on a Windows server that never quite made it out of the office... and someone just tripped on the power cord
• Team will feel ashamed, you have to give them license to relish absurdity
• I often point out “There are two kinds of startups: the ones that achieve some modest traction on top of a pile of code of which they are vaguely ashamed... and the ones that go out of business. That’s it. No third kind.”
• Also sometimes it helps to just laugh: “It’s kind of amazing this works at all”
Three Axioms For Leading Post-Mortems
• Everyone involved acted in good faith
• Everyone involved is competent
• We’re doing this to find improvements
Axioms == Ground Truth From Which You Start
• If you don’t start with these as givens...
• ...you’ll find yourself seeing every incident as human error
• Whereas, if you can convince/trick yourself into such beliefs...
• ...you’ll find a thousand valuable improvements to make
• Or, to put it another way:
Restate the Problem To Include TTR
We broke the db access code.
Restate the Problem To Include TTR
We pushed a deploy...
which broke the db access code.
Restate the Problem To Include TTR
We pushed a deploy...
which broke the db access code...
and didn’t find out until customers complained.
Restate the Problem To Include TTR
We pushed a deploy...
which broke the db access code...
didn’t find out until customers complained...
and couldn’t fix it for three hours.
Redefining the Problem Is Very Valuable
• People tend to focus on a single mistake
• Broaden that, to include the full cycle back to restored service
• At what point was the triggering decision made?
• How long did it take to find out something was wrong?
• How long did it take to restore service?
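One way to make those three questions concrete is to write the incident down as three timestamps and subtract. The sketch below is only an illustration: the class name, field names, and example times are all hypothetical, and only the three-hour fix time echoes the example incident above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Three timestamps, one per question (all names here are hypothetical)."""
    triggered_at: datetime  # when the triggering decision was made, e.g. the deploy
    detected_at: datetime   # when someone found out something was wrong
    restored_at: datetime   # when service was restored

    def time_to_detect(self) -> timedelta:
        # How long the problem existed before anyone knew about it.
        return self.detected_at - self.triggered_at

    def time_to_restore(self) -> timedelta:
        # How long it took to restore service once the problem was known.
        return self.restored_at - self.detected_at

    def total_impact(self) -> timedelta:
        # The full cycle, from triggering decision back to restored service.
        return self.restored_at - self.triggered_at


# Made-up example values: a 09:00 deploy, noticed 40 minutes later,
# fixed three hours after that.
incident = IncidentTimeline(
    triggered_at=datetime(2024, 1, 15, 9, 0),
    detected_at=datetime(2024, 1, 15, 9, 40),
    restored_at=datetime(2024, 1, 15, 12, 40),
)

print("time to detect: ", incident.time_to_detect())
print("time to restore:", incident.time_to_restore())
print("total impact:   ", incident.total_impact())
```

Splitting the total this way shows how much of the outage went to not knowing versus not being able to fix it, which is the broadened view of the problem these questions are after.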
Handling a Fork in the Road
• Which is the Root Cause? DB access bug or monitoring failure?
• Answer: don’t care about “root causes”. They don’t exist (multiple things conspire for failures to happen). Also, kind of moral/blame-ish.
• Ask instead: if we made an incremental improvement in area A or area B, which would prevent the broadest class of problems going forward?
• Much better conversation. Answer here is clear: monitoring.
Require Small Steps From Your Team
• Team will tell you they have no option but to do Some Huge Thing
• You have to totally reject this, push for a small step
• e.g. “What’s the simplest, dumbest thing that will make it slightly better?”
• After some hemming and hawing, great, cheap ideas emerge
• Might be: small steps towards Huge Thing
• Or: installing data collection to prove Huge Thing is necessary
“Tooling” => Humans Solve Your Problems
• How do the humans currently do their jobs?
• What tools do they use?
• When you give them a new tool, do they actually use it?
• How badly did you just screw up their jobs?
• YOU MUST ITERATE
Here’s What’s Happening, Right Now
• Your systems are experiencing constant, small-scale failures... invisibly
• Your team is struggling to keep your systems running... but is so habituated to it that they don’t even realize that’s true
• Your smart people are spending their smart cycles just trying to work around the complexity in your system
• The business side is making plans which aren’t supported by your infrastructure
• Customers are getting ready to surprise you, and it won’t be fun
Do This
• Elect a Post-Mortem Boss (Man|Lady)
• Look for a Goldilocks incident
• Expect awkwardness
• THERE MUST BE FIXES
• Incrementally improve the incremental improvements
Read This
• How Complex Systems Fail, Richard Cook (SOOOOO GOOOD)
• How the Mind Works, Steven Pinker (moral instinct, much other awesome)
• Thinking, Fast and Slow, Daniel Kahneman
• The Field Guide to Understanding Human Error, Sidney Dekker
• Complications and Better, Atul Gawande (marvelous narratives)
• Kitchen Soap, blog by John Allspaw
Photo Credits, I
• “Wonderworks Upside Down Building”, by Andy Leonard, http://www.flickr.com/photos/rover75/3901166997/
• “Robot de Martillo”, by Luis Perez, http://www.flickr.com/photos/pe5pe/2454661748/
• “Helios-Factory floor”, http://commons.wikimedia.org/wiki/File:Helioshall2.jpg
• “old machine”, by Jun Aoyama, http://www.flickr.com/photos/jam343/1730140/
• “Axe Marks The Spot”, by Alan Levine, http://www.flickr.com/photos/cogdog/4461665810/
• “Failboat Has Arrived”, http://www.rotskyinstitute.com/rotsky/wp-content/uploads/2008/02/failboat2.jpg
Photo Credits, II
• “14 plugs but only 6 sockets”, by Jason Rogers, http://www.flickr.com/photos/restlessglobetrotter/2661016046/
• “shame in scranton”, by Shira Golding Evergreen, http://www.flickr.com/photos/boojee/3613772785/
• “tiny dollhouse steps”, by Yi-Tao “Timo” Lee, http://www.flickr.com/photos/timojazz/6235519218/
• “Computers can be stupid”, by Brent Moore, http://www.flickr.com/photos/brent_nashville/2634912345/
• “Robot Uprising”, http://gordonandthewhale.com/wp-content/uploads/2010/10/How-To-Survive-a-Robot-Uprising.jpeg
• “Shark”, by Steve Garner, http://www.flickr.com/photos/22032337@N02/8314569214/