(Blameless)
post-mortems
@jasonhand
It’s Not Your Fault
Jason Hand
DevOps
“Handyman”
jason@VictorOps.com
A little about me…
Dir. of Platform Support - AppDirect
Dir. of Technical Support - Standing Cloud
Dir. of Operational Systems - American Fasteners, Inc.
Hiker, climber, brewer, runner, biker, boarder, surfer,
painter, singer, reader, writer, picker, coder, racer,
camper, volunteer …. all the usual “Colorado 1-upper” crap.
Alternative names
Also known as:
(Note: Public & Internal)
Project Retrospectives
Post-mortem analysis
Post-project review
Project Analysis Review
Quality Improvement Review
Autopsy Review
Santayana Review
After Action Review
Touchdown Meeting
Post-mortem Defined: What?
A process intended to inform improvements by determining
aspects that were successful or unsuccessful.
Post-mortem Defined: When?
As soon as feasible after the incident is resolved.
Post-mortem Defined: Who?
Everybody
Post-mortem Defined: Why?
To communicate with your team
To understand what happened for learning and improving
Post-mortem Defined: How?
Talk about the incident timeline
Escalation steps
What was done to resolve the problem
Create a remediation plan
Make it available
The Three R’s (simple format)
Regret: Acknowledgement and apology
Reason: Initial incident detection to resolution, including the so-called “root causes.”
Remedy: Actionable remediation items
– Dave Zwieback, VP Engineering - Next Big Sound
Remedy: Use SMART recommendations
Specific
Measurable
Agreed Upon/Agreeable
Realistic
Timebound
Moving from Reaction to Action
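One way to apply the SMART criteria above is to check each remediation item before the post-mortem is closed. This is a minimal sketch, not something from the talk: the `RemediationItem` fields and the `smart_gaps` helper are hypothetical names, and "non-empty field" is only a rough stand-in for each criterion.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RemediationItem:
    """Hypothetical record for one remediation item (field names are assumptions)."""
    description: str      # Specific: what exactly will change
    success_metric: str   # Measurable: how completion is verified
    owner: str            # Agreed upon: who accepted the work
    realistic: bool       # Realistic: the team judged it achievable
    due: Optional[date]   # Timebound: when it should be done


def smart_gaps(item: RemediationItem) -> List[str]:
    """Return the SMART criteria that the item does not yet satisfy."""
    gaps = []
    if not item.description.strip():
        gaps.append("Specific")
    if not item.success_metric.strip():
        gaps.append("Measurable")
    if not item.owner.strip():
        gaps.append("Agreed Upon")
    if not item.realistic:
        gaps.append("Realistic")
    if item.due is None:
        gaps.append("Timebound")
    return gaps


# Illustrative data only: a vague item with no metric, owner, or due date.
vague = RemediationItem("Improve monitoring", "", "", True, None)
print(smart_gaps(vague))  # ['Measurable', 'Agreed Upon', 'Timebound']
```

An empty result does not prove an item is good; it only shows that nobody left a criterion blank.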
Blameless
image from “Across the Universe”
Cool story, bro
2011 - Hired to Standing Cloud
Cloud marketplace & automated deployment of apps
Build Support team
Provide Managed services
“Reprimanding bad apples may seem like a quick and rewarding fix, but it’s like peeing in your pants.
You feel relieved and perhaps even nice and warm for a little while, but then it gets cold and uncomfortable.
And you look like a fool.”
– Sidney Dekker
Quote first seen in J. Paul Reed’s “A Look at Looking in the Mirror”
What is a blameless post-mortem?
Team members are accountable but not responsible
Complete Transparency
Deeper look at circumstances
What happened and how to improve it (specific details)
Real conditions of failure in complex systems
“Your organization must continually affirm that individuals are NEVER the ‘root cause’ of outages.”
– Dave Zwieback
Paraphrased from “Fallible Humans” by Ian Malpass - DevOpsDays - Minneapolis
source: http://www.indecorous.com/fallible_humans/
ETTO (Efficiency-Thoroughness Trade-Off)
The trade-off between being efficient vs being thorough
“We can be thorough and really dig into the task at hand and understand it well but this takes time: it is inefficient.”
– Ian Malpass
Cause & Effect
There are many factors that played a part in the problem
source: http://xkcd.com
“may be”
Stress & Cognitive Bias
Yerkes-Dodson Model
source: The Human Side of Postmortems
Reduce Stress?
… build muscle memory
Simulate many types of problems and outages as “practice” …
Evaluative Threat
Being negatively judged plays a big role in stress
What is stress surface?
Variables of a situation:
Novel or unusual
Unpredictable
Lack of control over the situation
Negative judgement (evaluative threats)
ALSO:
Lack of sleep
Problems at home
Health
Relationships
Etc…
Capturing the Human-side
Ask questions
Stress Questionnaire
During the outage, how often have you felt or thought that:
The situation was novel or unusual?
The situation was unpredictable?
You were unable to control the situation?
Others could judge your actions negatively?
Scale: 0 = Never, 1 = Almost Never, 2 = Sometimes, 3 = Fairly Often, 4 = Very Often
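The questionnaire above could be recorded per responder with something like the sketch below. The four questions and the 0-4 scale come from the slide; summing the answers into a single total is an assumption made here for illustration, not a scoring method the talk specifies, and `summarize` is a hypothetical helper name.

```python
from typing import Dict, List

# Questions and scale as listed on the slide.
QUESTIONS = [
    "The situation was novel or unusual?",
    "The situation was unpredictable?",
    "You were unable to control the situation?",
    "Others could judge your actions negatively?",
]
SCALE = {0: "Never", 1: "Almost Never", 2: "Sometimes", 3: "Fairly Often", 4: "Very Often"}


def summarize(answers: List[int]) -> Dict[str, object]:
    """Validate one responder's answers and return labels plus a simple total.

    The total (sum of answers) is an assumed summary, not an official score.
    """
    if len(answers) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    for answer in answers:
        if answer not in SCALE:
            raise ValueError(f"answer {answer!r} is outside the 0-4 scale")
    return {
        "labels": [SCALE[a] for a in answers],
        "total": sum(answers),
        "max_total": 4 * len(QUESTIONS),
    }


# Illustrative answers only.
result = summarize([4, 3, 1, 0])
print(result["total"], "of", result["max_total"])  # 8 of 16
```

Keeping the per-question labels alongside the total preserves which variable drove the stress, which a single number would hide.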
Why we don’t punish
De-incentivized to give the details
Practically guarantees a repeat of the problem
Understand why actions made sense (at the time)
Create safety AND accountability
Move away from idea of “individuals are problems”
Create new “experts”
Promoting from within
Where do we start?
The basics:
• Document your timeline or log data
• Document conversations
• Leave room for notes
• Mean time to resolution / Time calculations
• Level of severity
• Archive it for historical retrieval
• Remediation. Make it actionable
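The time calculations in the basics above can be derived directly from a documented timeline. This is a rough sketch under stated assumptions: the timeline is a list of (timestamp, note) pairs, "time to resolution" is taken as the span from the earliest to the latest entry, and all names and sample data here are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Tuple

Timeline = List[Tuple[datetime, str]]


def time_to_resolution(timeline: Timeline) -> timedelta:
    """Span between the earliest and latest timeline entries."""
    stamps = [stamp for stamp, _note in timeline]
    return max(stamps) - min(stamps)


def mean_time_to_resolution(durations: List[timedelta]) -> timedelta:
    """Average resolution time across several incidents."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


# Illustrative timeline only; not data from a real incident.
timeline = [
    (datetime(2015, 1, 10, 2, 5), "Alert fired"),
    (datetime(2015, 1, 10, 2, 12), "On-call acknowledged"),
    (datetime(2015, 1, 10, 2, 35), "Service restored"),
]
this_incident = time_to_resolution(timeline)
print(this_incident)  # 0:30:00

# Combine with an earlier (made-up) 90-minute incident.
print(mean_time_to_resolution([this_incident, timedelta(minutes=90)]))  # 1:00:00
```

Storing the raw timeline rather than only the computed durations keeps the record useful for later review, since the notes explain what happened between the timestamps.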
Tools
Etsy’s Morgue
VictorOps Post-mortem Report
Internal Wiki
Seek the truth
Don’t blame others …
Don’t blame yourself
Thank You
Questions?
Resources
“The Human Side of Postmortems” - Dave Zwieback
“The Field Guide to Understanding Human Error” - Sidney Dekker
“A Look at Looking in the Mirror” - J. Paul Reed
“Fallible Humans” - Ian Malpass (http://www.indecorous.com/fallible_humans/)
“4 Questions to ask for an effective Technical Post Mortem” - Jeffrey O’Brien (http://www.maintenanceassistant.com/blog/4-questions-effective-technical-post-mortem/)
“Nine steps to IT post-mortem excellence” - Michael Krigsman (http://www.zdnet.com/blog/projectfailures/nine-steps-to-it-post-mortem-excellence/1069)
“Postmortem reviews: purpose and approaches in software engineering” - Torgeir Dingsøyr (http://www.uio.no/studier/emner/matnat/ifi/INF5180/v10/undervisningsmateriale/reading-materials/p08/post-mortems.pdf)
“Blameless PostMortems and a Just Culture” - John Allspaw (http://codeascraft.com/2012/05/22/blameless-postmortems/)
“What blameless really means” - Jessica Harllee (http://www.jessicaharllee.com/notes/what-blameless-really-means/)
“Each necessary, but only jointly sufficient” - John Allspaw (http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/)
