This document summarizes a presentation given by Kevina Finn-Braun of Intuit and J. Paul Reed of Release Engineering Approaches at the DevOps Enterprise Summit 2016. The presentation argues for moving beyond traditional retrospectives to embrace complexity on the road to service ownership. For each phase of the incident lifecycle (detection, response, analysis, remediation, and prevention), it describes five levels of experience, from novice to expert, and the language and behaviors associated with each level.
Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership
1. KEVINA FINN-BRAUN
INTUIT
J. PAUL REED
RELEASE ENGINEERING APPROACHES
DEVOPS ENTERPRISE SUMMIT, 2016
BEYOND THE RETROSPECTIVE:
EMBRACING COMPLEXITY ON THE
ROAD TOWARDS SERVICE OWNERSHIP
2. KEVINA FINN-BRAUN
• Director of Product Infrastructure Service Management at Intuit
• Director of Site Reliability Service Management at Salesforce; Business Continuity at Yahoo
• Geeks out on group dynamics and behavior
• @kfinnbraun on
@jpaulreed@kfinnbraun #DOES2016
3. J. PAUL REED
• @jpaulreed on
• @shipshowpodcast alum
• Managing Partner, Release Engineering Approaches
• A “DevOps Consultant™”
• Master’s Candidate in Human Factors & Systems Safety
4. A QUICK RECAP FROM LAST DOES
“The Blameless Cloud: Bringing Actionable Retrospectives to SFDC”
DOES 2015 @jpaulreed@kfinnbraun
5. NEW MARCHING ORDERS
6. “SERVICE OWNERSHIP?”
7. IT’S JUST WHAT SFDC CALLED “DEVOPS”
(SSHHH, DON’T TELL ANYONE)
8. WHICH FLAVOR OF DEVOPS WOULD YOU LIKE?
12. “BUT HOW DO WE DO ‘THE DEVOPS?’”
• Learned helplessness?
• Uncontrollable bad event
• Perceived lack of control
• Generalized helpless behavior
• Actually: Structural blindness
13. MAKING SENSE OF SERVICE OWNERSHIP
14. WORKSHOP SURPRISES!
• Understanding teams’ local rationality is key
• Words have meaning; meanings are important; but they aren’t necessarily shared
• Teams must be given space to deliver on transformations
• Teams can be “retrospective blind”
15. DEVOPS & NUCLEAR MELTDOWNS?
17. A NEW ADVENTURE
QuickBooks
TurboTax
Mint
FY 2016: $4.7b revenue
8,000 employees worldwide
Founded: 1983
Improving the financial lives of over 45 million customers
IPO: 1993
22. SOME DIFFERENT CHALLENGES
• Intuit not “born in the cloud”
• “Incidents” meant something different
• No “Bermuda Blob”
• (No blob at all!)
• Different business lifecycle
23. BUT SIMILAR CHALLENGES, TOO
• Inconsistencies in operational responses
• Postmortems centered around “The Old View” of human error
• Some incidents & remediations got lost in the shuffle
• Surprising amount of (aggregated) service impact due to P3s/P4s
• “What, exactly, is an ‘incident?’”
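The point about aggregated P3/P4 impact can be illustrated with a small tally. This is only a sketch: the incident records, the idea of measuring impact in minutes, and every number below are hypothetical and are not data from the talk.

```python
from collections import defaultdict

# Hypothetical incident records: (priority, customer-impact minutes).
# These numbers are invented for illustration only.
INCIDENTS = [
    ("P1", 90),
    ("P2", 45),
    ("P3", 30), ("P3", 25), ("P3", 40), ("P3", 35),
    ("P4", 10), ("P4", 15), ("P4", 20), ("P4", 12), ("P4", 18),
]


def impact_by_priority(incidents):
    """Sum impact minutes per priority level."""
    totals = defaultdict(int)
    for priority, minutes in incidents:
        totals[priority] += minutes
    return dict(totals)


def low_priority_share(incidents, low=("P3", "P4")):
    """Fraction of total impact contributed by low-priority incidents."""
    totals = impact_by_priority(incidents)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return sum(totals.get(p, 0) for p in low) / total
```

In this invented data set, no single P3 or P4 comes close to the P1, yet together they account for more than half of the total impact minutes, which is the kind of surprise the slide describes.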
24. “BLAMELESS” “POSTMORTEMS”?
• Brené Brown, research sociologist, on vulnerability
• “Blame is a way to discharge pain and discomfort”
• Postmortem has a heavy connotation
• “Awesome postmortems?” Really?!
• More at: http://jpaulreed.com/blame-aware-postmortems
32. Incident Analysis
Novice
Language
• “Incidents are bad; my job is on the line.”
• “I’m getting sent to the principal’s office because of this outage.”
Behaviors
• Completes the post-incident “paperwork.”
• No formal retrospective/hallway retrospectives.
Beginner
Language
• “Let’s fix this as fast as possible.”
• “What’s the correct fix to avoid this specific issue in the future?”
Behaviors
• Some information (inconsistently) recorded.
• Jumps to a focus on why.
Competent
Language
• “Let’s review the timeline/incident report to answer that.”
• “We need to find the root cause of this incident.”
Behaviors
• Follows the prescribed format for retrospectives.
• Possesses and incorporates complete dataset for the incident into the retrospective.
Proficient
Language
• “Now that we’ve established what happened, how did it happen?”
• “How did these multiple factors influence our complex system?”
Behaviors
• Identifies inherent bias in self and others.
• Perspectives solicited from all involved team members/functional groups.
Expert
Language
• “How does our team/system contribute to our successes?”
• “What can we incorporate from this incident to better respond next time?”
Behaviors
• Able to facilitate retrospectives by healthily helping others address tendency to blame/personal & systemic bias.
• Retrospective outcomes are fed back into the system and prioritized.
@kfinnbraun / #DOES2016 / @jpaulreed
33. THE INCIDENT LIFECYCLE
• Incident Detection
• Incident Response
• Incident Analysis
• Incident Remediation
• Incident Prevention*
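One way to work with the lifecycle and the five experience levels programmatically is to encode both as enumerations and record a team's self-assessment per phase. The sketch below is an assumption about how that could look; the class names, the numeric ordering of levels, and the `weakest_phases` helper are hypothetical and are not the format used in the authors' repository.

```python
from enum import Enum, IntEnum


class Phase(Enum):
    """Lifecycle phases, in the order the talk walks through them."""
    DETECTION = "Incident Detection"
    RESPONSE = "Incident Response"
    ANALYSIS = "Incident Analysis"
    REMEDIATION = "Incident Remediation"
    PREVENTION = "Incident Prevention"


class Level(IntEnum):
    """Experience levels; the integer values are an illustrative ordering."""
    NOVICE = 1
    BEGINNER = 2
    COMPETENT = 3
    PROFICIENT = 4
    EXPERT = 5


def weakest_phases(assessment):
    """Return the phases (in lifecycle order) at the lowest assessed level."""
    lowest = min(assessment.values())
    return [phase for phase in Phase if assessment.get(phase) == lowest]


# Hypothetical self-assessment for one team.
example = {
    Phase.DETECTION: Level.COMPETENT,
    Phase.RESPONSE: Level.PROFICIENT,
    Phase.ANALYSIS: Level.BEGINNER,
    Phase.REMEDIATION: Level.COMPETENT,
    Phase.PREVENTION: Level.BEGINNER,
}
```

Calling `weakest_phases(example)` returns the Analysis and Prevention phases, which a team might use as a starting point for discussion rather than as a score.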
34. INCIDENT DETECTION
35. Novice, Beginner, Competent, Proficient, Expert
Language
• “Problems with our service are obvious; outages are obvious.”
• “Other teams will notify us of any problems.”
• “Most of the time, we’re the first to know when a service is impacted.”
• “We use historical data to guess at service level changes.”
• “We’ve detected service level transitions via monitoring and reduced MTTD.”
• “I know which specific code/infra change caused this service level change; here’s how I know…”
• “We prioritize feature requests and bug reports to monitoring hooks; monitoring is a 1st class citizen.”
• “We’ve decoupled code/infra deployment, because we can roll back/forward.”
• “We’re not paged anymore for changes automation can react to.”
Behaviors
• Manual and/or external outage notifications.
• No baseline metrics/service levels are broadly bucketed.
• External monitoring is in place to detect real time service transitions.
• Notifications are automated.
• External infra/API endpoints/outward-facing interfaces monitored/recorded.
• Historical data exists and has been used to establish graduated service baselines.
• Application internals report data to the monitoring system.
• Monitoring systems employ deep statistical methods to (dis)prove service anomalies.
• Monitoring output is reincorporated into operational behavior in an automated fashion.
• Anomalies no longer result in defined “incidents.”
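The step from "historical data has been used to establish service baselines" to "statistical methods to (dis)prove service anomalies" can be sketched with a very simple check. The z-score threshold used here is a stand-in chosen for brevity, not the method the speakers describe, and the function names and sample latencies are hypothetical.

```python
import statistics


def build_baseline(history):
    """Summarize historical samples as (mean, sample standard deviation)."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return statistics.mean(history), statistics.stdev(history)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical response-time samples in milliseconds.
latency_history = [100, 102, 98, 101, 99]
latency_baseline = build_baseline(latency_history)
```

With this baseline, a reading of 101 ms is within normal variation and a reading of 110 ms is flagged. A real monitoring system would likely need seasonality handling and more robust statistics than a single mean and standard deviation.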
36. INCIDENT RESPONSE
37. Novice, Beginner, Competent, Proficient, Expert
Language
• “Have you tried turning it off and turning it on again?”
• “Something is wrong with the X…”
• “I think X is familiar with Y; let’s find them.”
• “I think there’s a problem with the database, network, etc.”
• Standard Incident Management System language used.
• “The deployment caused the database to hang…”
• “The infrastructure on-calls: perform a system status & report back to the IC.”
• Entire team is familiar with standardized IMS language.
• Standardized IMS language is used/valued by the entire team.
• “What parts of the service did not ‘self-heal’ and need attention?”
Behaviors
• Team is event-focused; the team is “alarmed” by incidents.
• Inconsistent response once incident has commenced.
• Response based on “tribal knowledge.”
• Team is area-focused.
• Team is action-focused.
• Team has identified incident “responders,” and those people know their duties.
• Team is technology-focused.
• Incident response is an aspect of org and team “culture.”
• Incidents are embraced, but outside-business hours or repeated incidents are considered inhumane.
• Team is systems-focused.
38. INCIDENT ANALYSIS
39. Incident Analysis: Language and Behaviors by level (same matrix as slide 32)
40. INCIDENT REMEDIATION
41. Novice, Beginner, Competent, Proficient, Expert
Language
• “Let’s just file a ticket to track the issue.”
• “I am sure this is the issue; the fix will correct 100% of the occurrences.”
• “I’m pretty sure we already fixed this?”
• “We need an action plan to address the process gaps.”
• “This needs to be fixed in the next release and documented in our incident response docs.”
• “We need to look deeper than this specific incident to really address the problem.”
• “What can we learn from this incident?”
• “What other system aspects have we learned from this incident? How can we use that?”
• “While operating our system today, how did we actively create & sustain success?”
Behaviors
• Remediation activities (or lack thereof) contribute to a “break-fix” cycle.
• Discussions of the incident are aggressive/blameful.
• “Low hanging fruit” may be fixed, but not documented or incorporated into team behavior.
• More processes, more procedures, more rules.
• Issues of all sizes are actively managed.
• Issues have a priority and teams have bandwidth to address them.
• Completed issue remediation is valued by the org.
• Bandwidth exists to discuss, design and implement resiliency improvements.
• Remediation is not regarded as a separate activity & is culturally integrated into work.
• Resilience is considered in the design phase for new infra/software.
42. INCIDENT PREVENTION*
43. Novice, Beginner, Competent, Proficient, Expert
Language
• “Preventing future incidents is difficult because of lacking data.”
• “We can use predictive metrics to completely avoid future incidents.”
• “Our system has reasonable coverage of its metrics.”
• “We use metrics to inform attack/risk surface.”
• “We use trend analysis to raise ‘soft’ problems to operators.”
• “Old documentation is problematic and dealt with accordingly.”
• “When we started game days, it was a real mess.”
• “We now care less about specific incidents & more about crew formation.”
• “The team is excited about game days.”
• “Our crews care about their formation and dissolution.”
Behaviors
• Prevention efforts include documentation, process design, metrics collection.
• Retrospective focus is on static causes/effects.
• Retrospectives include discussions of active operator behaviors.
• Docs, process, metrics established, but < 100%.
• Preventative focus is on reviewing docs+process+metrics collection, but in a day-to-day context.
• Retrospectives focus on the response of the team to an incident.
• We actively inject failure into our systems on a known schedule, to drill.
• We review our response to induced failures.
• The crew formation/dissolution process is considered our primary role+responsibility in addressing and preventing operational failure.
• We actively inject failure at random intervals.
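The difference between injecting failure "on a known schedule" and "at random intervals" can be sketched as two small schedulers. This is a sketch under assumptions: the function names are hypothetical, the code only computes drill times and injects nothing, and drawing gaps from an exponential distribution is just one possible reading of "random intervals".

```python
import random
from datetime import datetime, timedelta


def fixed_schedule(start, interval, count):
    """Drill times on a known schedule: one every `interval` after `start`."""
    return [start + interval * i for i in range(1, count + 1)]


def random_schedule(start, mean_interval, count, rng=None):
    """Drill times at random intervals that average `mean_interval`."""
    rng = rng or random.Random()
    times = []
    current = start
    for _ in range(count):
        # expovariate takes a rate (1 / mean); the gap is in seconds here.
        gap = rng.expovariate(1.0 / mean_interval.total_seconds())
        current += timedelta(seconds=gap)
        times.append(current)
    return times


# Hypothetical starting point for a drill calendar.
drill_start = datetime(2016, 11, 7, 9, 0)
weekly_drills = fixed_schedule(drill_start, timedelta(days=7), 3)
surprise_drills = random_schedule(drill_start, timedelta(days=7), 5, rng=random.Random(42))
```

The fixed schedule lets a team prepare for each drill, while the random one exercises the response nobody saw coming. Passing a seeded `random.Random` keeps the random schedule reproducible for testing.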
44. HELP US MAKE IT BETTER!
https://github.com/preed/incident-lifecycle-model
45. FACILITATE TEAMS EXPLORING THEIR DISCRETIONARY SPACE
47. INCIDENT RESPONSE != INCIDENT MANAGEMENT
(YOUR INCIDENT VALUE STREAM MATTERS)
48. YOU ARE NEVER DONE.
49. YOU. ARE. NEVER. DONE.
50. AVENUES FOR COLLABORATION
• Take a look at the extended incident lifecycle model and your organization: see where it fits and doesn’t!
• (And then send us GitHub pull requests!)
• Compare your own (documented?) incident life cycle against your actual incident value stream; share what you find!