Retrospecting our Retrospectives
@allspaw
September 18th, 1980
Damascus, Arkansas
Dave Powell made a choice.
It was the kind of choice you
make when you're afraid of
getting in trouble.
He played dumb.
“Safety requires
prevention,
prevention requires
honesty, honesty
requires the absence
of fear”
“The Air Force officially blamed the
incident on human error. Every guy
that worked in those silos I know dropped
a wrench or a tool at some point in his
career.”
Errors as optical illusions
bit.ly/hindsightoinsight
Psychological purposes of accident investigation
From: Sidney Dekker - 2014
“postmortems are deeply social events, conducted in ritual
fashion” Stella Report
https://snafucatchers.github.io/#4_1_Capturing_the_value_of_anomalies_through_postmortems
“Postmortems are not magic”
Stella Report
You can make them better!
https://www.flickr.com/photos/95205391@N05/29248897560
Millions of Users + Most of
Microsoft
Data: Internal Microsoft engineering system activity during calendar year 2017
Microsoft Internal Engineering in VSTS by the Numbers
2.8m Pull requests
> 15m Git Pushes
42,000 Deployments per day
> 4m Builds per month
500m Test executions per day
500k Work items updated per day
5m Work items viewed per day
https://blogs.msdn.microsoft.com/vsoservice/?p=16295
In the earlier postmortem, we posed these two questions:
1. What caused the initial failure?
2. Why didn’t the system automatically recover given the reactive measures we already
have in place?
When doing a postmortem for a long running incident like this, it is usually the case that
what caused the start of the incident is the same thing that made the incident long
running… We found in this situation that the event that caused the bad state
and the reason for staying in that state were unrelated. That caused
us to follow the wrong path repeatedly.
A question for you
Is this the
system?
http://stella.report
https://snafucatchers.github.io/#3_1_Catching_the_Apache_SNAFU
The Ironies of Automation - Lisanne Bainbridge http://bit.ly/ironiesofautomation
http://stella.report
We spend a lot of time “below the line” http://stella.report
But “above the line” is where the work happens
http://stella.report
http://stella.report
The Ironies of Automation - Lisanne Bainbridge
http://bit.ly/ironiesofautomation
▪ The more advanced a control system is, so the more crucial may be the
contribution of the human operator.
What were you seeing?
What were you focused on?
What were you expecting to happen?
https://en.wikipedia.org/wiki/Gary_A._Klein
What were you trying to achieve?
Were there multiple goals at the same
time?
Was there time pressure?
Did you ask anyone for help? What
signal brought you to ask for support or
assistance? Were you able to contact
the people you needed to contact?
One. Dropped. Socket.
A. should have
B. should have never;
C. shouldn't have…
D. shouldn't have
E. shouldn't have punctured a hole in it.
“So many things that shouldn't have happened. There's
no one that thought that scenario out.”
“Should have” “Would not have happened”
Well-facilitated debriefings support recalibration
of mental models of how things really work and
break (vs what is on the wiki).
Why did it make sense at the time?
What we knew, what surprised us?
Learning from other domains
Defenses and barriers have their own special way of
creating other problems. And they may never help in
the future.
Resilience
Ability to respond
Ability to monitor
Ability to anticipate
Ability to learn
http://www.adaptivecapacitylabs.com/blog/2018/03/23/moving-past-shallow-incident-data/
@allspaw
“I think the other incident is related”
Accountability means inviting everyone to tell their story,
their “account”, then systematizing and distributing the
lessons in it, and using this to sponsor vicarious learning for
all
“…Look not at the individual pieces of
the event, but at the relationships
between those pieces.”
1. Product Owner
2. SWE
3. SRE
4. Human Factors
$$
Product Owner
SWE
Overwork/Burnout
Fall Asleep
SRE
Chaos Engineering
Accident boundaries
J. Rasmussen
Human Factors/Cognitive Systems Engineer
Helps us make sense of these boundaries,
Helps “design for human use”
You are the culture. I am the culture.
“If there was to be a renewal, it
would take all of us and all parts
of each of us”
Satya Nadella
What will you do on Monday?
Read and Discuss with a colleague:
Etsy Debriefing Facilitation Guide http://bit.ly/fdebriefing
Hindsight to Insight http://bit.ly/hindsighttoinsight
Common Ground and Coordination in Joint Activity
http://bit.ly/commongroundandcoordination
The Ironies of Automation http://bit.ly/ironiesofautomation
The Stella Report https://stella.report
How Complex Systems Fail http://bit.ly/complexsystemsfail
Thank you!
Community of researchers and practitioners
@ri_cook
@allspaw
@bergstrom_johan
@sidneydekkercom
@LauraMDMaguire
@caseyrosenthal
@nora_js
@jpaulreed
