We can work to prevent incidents, reduce their impact, and shorten their timelines. But they’re probably not going to disappear altogether anytime soon.
Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you labor over, the code you write and review, or the alerts and metrics you meticulously pore through.
3. Why Should We NOT Do That?
People are disincentivized from giving details
Practically guarantees a repeat of the problem
Prevents finding a solution (or at least a better state)
6. What are the causes of the downtimes we had in the past?
7. Postmortem
“Examine mistakes in a way that focuses on the situational aspects of a
failure’s mechanism and the decision-making process of individuals
proximate to the failure”
8. Second Story
First Stories
• Human error is seen as the cause of failure
• Saying what people should have done is a satisfying way to describe failure
• Telling people to be more careful will make the problem go away
Second Stories
• Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
• Saying what people should have done doesn’t explain why it made sense for them to do what they did
• Only by constantly seeking out its vulnerabilities can organizations enhance safety
10. Retrospective Prime Directive
"Regardless of what we discover, we understand and truly believe that
everyone did the best job they could, given what they knew at the
time, their skills and abilities, the resources available, and the situation
at hand."
Norm Kerth, Project Retrospectives: A Handbook for Team Review
11. How does that help us now?
What problems will it resolve if we do it?
12. Benefits
Create safety AND accountability
Understand why actions made sense (at the time)
Move away from the idea that “individuals are problems”
Create new “experts”
14. The Meeting
After the service is restored
Set the tone
Agree on the timeline
Agree on the list of contributing factors (technical or human)
List of actions – corrective or preventive
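The agenda above can be sketched as a small record that the meeting fills in. This is only an illustration, not a prescribed template: the class names, field names, and the `missing_sections` helper are assumptions made for this example and do not come from the slides.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List


class FactorKind(Enum):
    TECHNICAL = "technical"
    HUMAN = "human"


class ActionKind(Enum):
    CORRECTIVE = "corrective"
    PREVENTIVE = "preventive"


@dataclass
class TimelineEntry:
    at: datetime
    what_happened: str


@dataclass
class Factor:
    kind: FactorKind
    description: str


@dataclass
class Action:
    kind: ActionKind
    description: str


@dataclass
class PostmortemRecord:
    """Hypothetical container for what the meeting agrees on."""
    title: str
    service_restored: bool = False
    timeline: List[TimelineEntry] = field(default_factory=list)
    factors: List[Factor] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the agenda items that have not been filled in yet."""
        missing = []
        if not self.service_restored:
            missing.append("service restored")
        if not self.timeline:
            missing.append("timeline")
        if not self.factors:
            missing.append("contributing factors")
        if not self.actions:
            missing.append("actions")
        return missing
```

Calling `missing_sections()` at the end of the meeting gives a quick check that the timeline, the contributing factors, and the actions have all been agreed on.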
17. The 3 Rs
REGRET
• start out by acknowledging the problem and apologizing for what happened
• empathy for both the customers affected by the outage and for the person who was involved in the firefight
REASON
• this should include everything from initial incident detection to resolution
• the more details you can provide, the better
• there’s no point in hiding facts
REMEDY
• make sure your remediation items are SMART (Specific, Measurable, Agreeable, Realistic, Time-bound)
• it’s okay if you don’t know the remedy - but you should be committed to continuing the investigation
• you need to communicate (internally & with your customers) that you’re working on improving the situation and that you have an idea of what needs to be done to prevent future outages
Source: http://blog.travis-ci.com/2014-06-26-three-ingredients-to-a-great-postmortem/