3. A little about me…
Dir. of Platform Support - AppDirect
Dir. of Technical Support - Standing Cloud
Dir. of Operational Systems - American Fasteners, Inc.
Hiker, climber, brewer, runner, biker, boarder, surfer,
painter, singer, reader, writer, picker, coder, racer,
camper, volunteer … all the usual “Colorado 1-upper” crap.
@jasonhand
4. Alternative names
Also known as:
(Note: Public & Internal)
Project Retrospectives
Post-mortem analysis
Post-project review
Project Analysis Review
Quality Improvement Review
Autopsy Review
Santayana Review
After Action Review
Touchdown Meeting
9. Post-mortem
Defined
Talk about the incident timeline
Escalation steps
What was done to resolve the problem
Create a remediation plan
Make it available
How?
10. The Three R’s (simple format)
Regret
Acknowledgement and apology
Reason
Initial incident detection to resolution, including the so-called “root causes.”
Remedy
Actionable remediation items
Dave Zwieback
VP Engineering - Next Big Sound
13. 2011 - Hired to Standing Cloud
Cool story, bro
Cloud marketplace & automated deployment of apps
Build Support team
Provide Managed services
15. – Sidney Dekker
“Reprimanding bad apples may
seem like a quick and rewarding
fix, but it’s like peeing in your
pants.
You feel relieved and perhaps even
nice and warm for a little while,
but then it gets cold and
uncomfortable.
And you look like a fool”
Quote first seen in J. Paul Reed’s “A Look at Looking in the Mirror”
16. What is a blameless post-mortem?
Team members are accountable but not responsible
Complete Transparency
Deeper look at circumstances
What happened and how to improve it (specific details)
Real conditions of failure in complex systems
17. – Dave Zwieback
“Your organization must
continually affirm that
individuals are NEVER the “root
cause” of outages.”
18. Paraphrased from “Fallible Humans” by Ian Malpass
- DevOpsDays - Minneapolis
source: http://www.indecorous.com/fallible_humans/
19. ETTO (Efficiency Thoroughness Trade Off)
The trade off between:
being efficient
vs
being thorough
20. - Ian Malpass
“We can be thorough and really
dig into the task at hand and
understand it well but this takes
time:
it is inefficient.”
21. Cause & Effect
There are many factors that played a part in the problem
source: http://xkcd.com
“may be”
27. What is stress surface?
Variables of a situation
Novel or unusual
Unpredictable
Controllable situation
Negative judgement
Evaluative threats
ALSO
Lack of sleep
Problems at home
Health
Relationships
Etc…
29. Stress Questionnaire
During the outage, how often have you felt or thought that:
The situation was novel or unusual?
The situation was unpredictable?
You were unable to control the situation?
Others could judge your actions negatively?
0 = Never 1 = Almost Never 2 = Sometimes
3 = Fairly Often 4 = Very Often
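The questionnaire could be tallied with a few lines of code. This is a minimal sketch only: the slide gives the four questions and the 0 to 4 scale, but not how answers are combined, so summing them into a 0 to 16 total is an assumption, and the names SCALE, QUESTIONS and stress_score are hypothetical.

```python
# Scale and questions are taken from the slide; everything else is illustrative.
SCALE = {
    0: "Never",
    1: "Almost Never",
    2: "Sometimes",
    3: "Fairly Often",
    4: "Very Often",
}

QUESTIONS = (
    "The situation was novel or unusual?",
    "The situation was unpredictable?",
    "You were unable to control the situation?",
    "Others could judge your actions negatively?",
)


def stress_score(answers):
    """Total one responder's answers, one answer per question.

    Assumption: answers are simply summed, giving a value from 0 to 16.
    The slide does not prescribe an aggregation rule.
    """
    if len(answers) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    for answer in answers:
        if answer not in SCALE:
            raise ValueError("answers must be integers from 0 to 4")
    return sum(answers)


if __name__ == "__main__":
    # One responder: Sometimes, Fairly Often, Very Often, Almost Never
    print(stress_score([2, 3, 4, 1]))
```

Recording the per-question answers alongside the total keeps the detail available for the post-mortem discussion, since the total alone hides which variable drove the stress.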
30. Why we don’t punish
Punishment de-incentivizes people to give the details
Punishment practically guarantees a repeat of the problem
Understand why actions made sense (at the time)
Create safety AND accountability
Move away from idea of “individuals are problems”
Create new “experts”
32. Promoting from within
Where do we start?
The basics:
• Document your timeline or log data
• Document conversations
• Leave room for notes
• Mean time to resolution / Time calculations
• Level of severity
• Archive it for historical retrieval
• Remediation. Make it actionable
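The basics listed on this slide could be captured in a small record type. The sketch below is illustrative only: the class and field names (PostMortem, TimelineEntry, severity and so on) are hypothetical, and it assumes the earliest timeline entry marks detection and the latest marks resolution, which the slides do not specify.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEntry:
    at: datetime  # when the event happened
    note: str     # what was observed or done


@dataclass
class PostMortem:
    title: str
    severity: int  # level of severity; the numbering scheme is an assumption
    timeline: list = field(default_factory=list)       # timeline or log data
    conversations: list = field(default_factory=list)  # documented conversations
    notes: str = ""                                    # room for notes
    remediation: list = field(default_factory=list)    # actionable items

    def log(self, at, note):
        """Append one event to the incident timeline."""
        self.timeline.append(TimelineEntry(at, note))

    def time_to_resolution(self):
        """Span between the earliest and latest timeline entries."""
        if len(self.timeline) < 2:
            raise ValueError("need at least a detection and a resolution entry")
        times = [entry.at for entry in self.timeline]
        return max(times) - min(times)


def mean_time_to_resolution(post_mortems):
    """Average time to resolution across archived post-mortems."""
    durations = [pm.time_to_resolution() for pm in post_mortems]
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    pm = PostMortem("Example outage", severity=2)
    pm.log(datetime(2015, 1, 1, 10, 0), "alert fired")
    pm.log(datetime(2015, 1, 1, 10, 45), "service restored")
    pm.remediation.append("Add alerting on queue depth")
    print(pm.time_to_resolution())  # 0:45:00
```

Keeping each post-mortem as a structured record makes it straightforward to archive and to compute time calculations across incidents later.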
36. Resources
“The Human Side of Postmortems” - Dave Zwieback
“The Field Guide to Understanding Human Error” - Sidney Dekker
“A Look at Looking in the Mirror” - J. Paul Reed
“Fallible Humans” - Ian Malpass (http://www.indecorous.com/fallible_humans/)
“4 Questions to ask for an effective Technical Post Mortem” - Jeffrey O’Brien (http://www.maintenanceassistant.com/blog/4-questions-effective-technical-post-mortem/)
“Nine steps to IT post-mortem excellence” - Michael Krigsman (http://www.zdnet.com/blog/projectfailures/nine-steps-to-it-post-mortem-excellence/1069)
“Postmortem reviews: purpose and approaches in software engineering” - Torgeir Dingsøyr (http://www.uio.no/studier/emner/matnat/ifi/INF5180/v10/undervisningsmateriale/reading-materials/p08/post-mortems.pdf)
“Blameless PostMortems and a Just Culture” - John Allspaw (http://codeascraft.com/2012/05/22/blameless-postmortems/)
“What blameless really means” - Jessica Harllee (http://www.jessicaharllee.com/notes/what-blameless-really-means/)
“Each necessary, but only jointly sufficient” - John Allspaw (http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/)