Blameless Postmortem
Why Should We NOT Do That?
People are disincentivized from sharing details
Practically guarantees a repeat of the problem
Prevents finding a solution (or at least a better state)
Brain
Yerkes-Dodson Model
Source: https://en.wikipedia.org/wiki/Yerkes–Dodson_law
What are the causes of the downtimes we had in the past?
Postmortem
“Examine mistakes in a way that focuses on the situational aspects of a
failure’s mechanism and the decision-making process of individuals
proximate to the failure”
Second Story
First Stories vs. Second Stories
• First: Human error is seen as the cause of failure
  Second: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
• First: Saying what people should have done is a satisfying way to describe failure
  Second: Saying what people should have done doesn’t explain why it made sense for them to do what they did
• First: Telling people to be more careful will make the problem go away
  Second: Only by constantly seeking out its vulnerabilities can organizations enhance safety
Benefits
Create safety AND accountability
Retrospective Prime Directive
"Regardless of what we discover, we understand and truly believe that
everyone did the best job they could, given what they knew at the
time, their skills and abilities, the resources available, and the situation
at hand."
Norm Kerth, Project Retrospectives: A Handbook for Team Review
How does that help us now?
What problems will it resolve if we do it?
Benefits
Create safety AND accountability
Understand why actions made sense (at the time)
Move away from the idea that “individuals are problems”
Create new “experts”
What should we discuss?
When should we hold the meeting?
The Meeting
After the service is restored
Set the tone
Agree on the timeline
Agree on the list of contributing factors (technical or human)
List of actions – corrective or preventive
Who should be informed?
How?
The Report
Summary
Timeline
Analysis
Root causes
Lessons learned
Actions
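One way to keep reports consistent is to capture the six sections above in a small data structure and render them in the same order every time. The following is only a sketch: the class names, the fields, the "corrective"/"preventive" labels on actions, and the sample incident are all hypothetical and not part of the original slides.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    at: datetime  # when the event happened
    event: str    # what happened, stated without assigning blame


@dataclass
class Action:
    description: str
    kind: str                  # "corrective" or "preventive" (assumed labels)
    owner: str = "unassigned"


@dataclass
class PostmortemReport:
    summary: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    analysis: str = ""
    root_causes: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text, sections in a fixed order."""
        lines = ["Summary", self.summary, "", "Timeline"]
        # Sort entries so the timeline reads chronologically
        # regardless of the order in which they were recorded.
        for entry in sorted(self.timeline, key=lambda e: e.at):
            lines.append(f"{entry.at:%Y-%m-%d %H:%M} {entry.event}")
        lines += ["", "Analysis", self.analysis, "", "Root causes"]
        lines += [f"- {cause}" for cause in self.root_causes]
        lines += ["", "Lessons learned"]
        lines += [f"- {lesson}" for lesson in self.lessons_learned]
        lines += ["", "Actions"]
        lines += [
            f"- [{a.kind}] {a.description} (owner: {a.owner})"
            for a in self.actions
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    # Invented example data, for illustration only.
    report = PostmortemReport(
        summary="Checkout API unavailable for 30 minutes.",
        timeline=[
            TimelineEntry(datetime(2024, 1, 1, 10, 30), "Service restored"),
            TimelineEntry(datetime(2024, 1, 1, 10, 0), "Alert fired"),
        ],
        analysis="A config change removed a required setting.",
        root_causes=["Config changes are not validated before deploy"],
        lessons_learned=["Rollback took longer than expected"],
        actions=[Action("Add config validation to CI", "preventive")],
    )
    print(report.render())
```

Keeping the timeline as structured entries makes it easier to agree on it during the meeting, since each entry is a single dated fact that can be confirmed or corrected on its own.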
The 3 Rs
REGRET
• start out by acknowledging the problem and apologizing for what happened
• show empathy for both the customers affected by the outage and the person who was involved in the firefight
REASON
• this should include everything from initial incident detection to resolution
• the more details you can provide, the better
• there’s no point in hiding facts
REMEDY
• make sure your remediation items are SMART (Specific, Measurable, Agreeable, Realistic, Timebound)
• it’s okay if you don’t know the remedy - but you should be committed to continuing the investigation
• you need to communicate (internally & with your customers) that you’re working on improving the situation and that you have an idea of what needs to be done to prevent future outages
Source: http://blog.travis-ci.com/2014-06-26-three-ingredients-to-a-great-postmortem/
Next step
Conduct a one-hour postmortem the next time a downtime occurs
