Beyond The Mean Time To Recover

BEYOND THE
MTTR (Mean Time To Recover)
1 — @jasonhand

@JASONHAND
JASON HAND
VICTOROPS

The most relevant
metric in evaluating the
effectiveness of
emergency response is
how quickly the
response team can
bring the system back
to health
-- that is, the MTTR.
— Benjamin Treynor Sloss

HIGH
AVAILABILITY &
RELIABILITY

99.999 %
UPTIME

PREDICT &
PREVENT

WHAT ABOUT
MTBF?
(Mean Time Between Failure)

COMPLEX
SYSTEMS

Failure cares not about the architecture
designs you slave over, the code you write
and review, or the alerts and metrics you
meticulously pore through ...
... Failure happens. This is a foregone
conclusion when working with complex
systems.
— John Allspaw (Former CTO Etsy)

AVAILABILITY =
MTBF/(MTBF + MTTR)
A commonly used measurement of how
often a system or service is available
compared to the total time it should be
usable. [5]
[5] Effective DevOps (Jennifer Davis, Katherine Daniels)
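
The availability formula above can be tried out with a few numbers. This is a minimal sketch: the function name and the MTBF/MTTR figures are hypothetical, chosen only to show how the ratio behaves.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a service is usable: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF + MTTR must be greater than zero")
    return mtbf_hours / total


# Hypothetical service: one failure every 720 hours, 1 hour to recover.
baseline = availability(720, 1)

# Two ways to improve it: recover twice as fast, or fail half as often.
faster_recovery = availability(720, 0.5)
fewer_failures = availability(1440, 1)

print(f"baseline:        {baseline:.4%}")
print(f"halve the MTTR:  {faster_recovery:.4%}")
print(f"double the MTBF: {fewer_failures:.4%}")
```

With these inputs, halving the time to recover yields the same availability as doubling the time between failures, which is one way to read the deck's emphasis on responding and recovering quickly.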

AVAILABILITY
& RELIABILITY:
THE RESULT OF A TEAM'S ABILITY TO...

RESPOND
& RECOVER
QUICKLY

HIGH-PERFORMING
ORGANIZATIONS
resolve production incidents 168 times faster than their
peers [7]
[7] State of DevOps

WHAT IS THE
ROI
OF DEVOPS?

HOW MUCH DID THAT
OUTAGE COST THE COMPANY?

LET'S CALCULATE

COST OF DOWNTIME =
Deployment frequency
× Change Failure Rate
× Mean Time To Recover (MTTR)
× Hourly Cost of Outage
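
The calculation on this slide multiplies four factors together. Here is a minimal sketch of that product; the function name and every input value are hypothetical placeholders, so substitute your own organization's figures.

```python
def cost_of_downtime(
    deploys_per_year: float,
    change_failure_rate: float,
    mttr_hours: float,
    hourly_outage_cost: float,
) -> float:
    """Annual cost of downtime as the product of the four factors on the slide."""
    if not 0 <= change_failure_rate <= 1:
        raise ValueError("change_failure_rate must be a fraction between 0 and 1")
    if min(deploys_per_year, mttr_hours, hourly_outage_cost) < 0:
        raise ValueError("inputs must be non-negative")
    # deploys * failure rate = expected failed changes per year;
    # each one costs (hours to recover) * (cost per hour of outage).
    return deploys_per_year * change_failure_rate * mttr_hours * hourly_outage_cost


# Hypothetical inputs: 100 deploys/year, 10% of changes fail,
# 2 hours to recover on average, 5,000 per hour of outage.
annual_cost = cost_of_downtime(100, 0.10, 2.0, 5_000)
print(f"Estimated annual cost of downtime: {annual_cost:,.0f}")
```

Because MTTR enters this product as a single average, the estimate inherits whatever distortion that average carries.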

DEPLOYMENT
FREQUENCY

DEPLOYMENT
FREQUENCY
How many times per year your own org
deploys changes [1]
High performers = 1,460/year
Low performers = 7/year
[1] State of DevOps Report (2016), Puppet Labs

CHANGE
FAILURE RATE

CHANGE
FAILURE RATE
Percentage of changes that cause an
outage in an organization
High performers = 0-15%
Low performers = 16-30%

MTTR & HOURLY COST OF OUTAGE

WHAT IS
MTTR EXACTLY?

MTTR (Mean Time To Recover)
How long does it generally take to
restore service when a service incident occurs
(e.g., unplanned outage, service
impairment)? [11]
[11] State of DevOps Report (2016)

BUT WHAT DOES
MEAN
MEAN?

ARITHMETIC MEAN
(ĂRˌĬTH-MĔTˈĬK MĒN)
n. The value obtained by dividing the sum of a set of
quantities by the number of quantities in the set
ALSO CALLED AVERAGE

AVERAGE (AS A METRIC)
TELLS YOU NOTHING ABOUT THE
ACTUAL INCIDENTS

LIMITATIONS OF
ARITHMETIC MEAN
In data sets that are skewed or where outliers are
present, calculating the arithmetic mean often provides
a misleading result. [3]
[3] Arithmetic Mean

DISTORTED
BIG PICTURE

AVERAGES...
The number of data points can vary
greatly depending on the complexity
and scale of systems.
Furthermore, averages assume there is
a normal event or that your data is a
normal distribution. [13]
— Richard Thaler
Average is a horrible metric for
optimizing performance
[13] Misbehaving: The Making of Behavioral Economics

MEAN TIME TO
WTF

REASONS INCLUDE:
Incidents are auto-resolving?
Engineers are re-routing alerts?
Incidents remain unresolved until postmortem?
Engineers are constantly hitting “resolve” because ...

IS DOWN
AGAIN!

SEV1 OR SEV2
OUTAGE
Just one extended outage can severely
tip the scale on data points and
normalization

WHAT ELSE SHOULD WE
MEASURE?

MEDIAN TIME TO RECOVER
More robust to outliers.
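
To see why the median is more robust, compare it with the mean on a small data set. This is a sketch with made-up recovery times: eight routine incidents plus one extended outage.

```python
from statistics import mean, median

# Hypothetical recovery times in minutes for nine incidents.
# The last value stands in for one extended SEV1 outage.
recovery_minutes = [12, 9, 15, 11, 8, 14, 10, 13, 480]

mttr = mean(recovery_minutes)            # pulled upward by the single outlier
median_ttr = median(recovery_minutes)    # the middle value, barely affected
mttr_without_outlier = mean(recovery_minutes[:-1])

print(f"Mean time to recover:   {mttr:.1f} min")
print(f"Median time to recover: {median_ttr:.1f} min")
print(f"Mean without the SEV1:  {mttr_without_outlier:.1f} min")
```

On this sample the mean comes out at roughly 63.6 minutes, even though eight of the nine incidents were resolved in 15 minutes or less. The median is 12 minutes, close to the 11.5-minute mean of the routine incidents.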

VOLUME
OF ALERTS
BY SEVERITY
AND TOTAL

TOTAL NUMBER
OF OUTAGES
determine total downtime
(Related: Service Level Agreements)
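
Total downtime can be set against the downtime an availability target allows. The sketch below uses the 99.999% figure from earlier in the deck; the outage durations and helper name are hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity


def downtime_budget_minutes(availability_target: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_target <= 1:
        raise ValueError("availability_target must be a fraction between 0 and 1")
    return (1 - availability_target) * MINUTES_PER_YEAR


# Hypothetical outages over one year, in minutes.
outage_minutes = [4, 22, 7]

total_downtime = sum(outage_minutes)
budget = downtime_budget_minutes(0.99999)
achieved = 1 - total_downtime / MINUTES_PER_YEAR

print(f"Outages this year:   {len(outage_minutes)}")
print(f"Total downtime:      {total_downtime} min")
print(f"Five-nines budget:   {budget:.2f} min")
print(f"Achieved availability: {achieved:.5%}")
```

Five nines leaves about 5.26 minutes of downtime per year, so in this example even the shortest outage uses most of the budget.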

NOISY HOSTS OR SERVICES

ALERT ACTIONABILITY

ALERT TYPES
INFO, WARNING, CRITICAL

ALERT
VOLUME / day
(pssst ... careful of avg)
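
The alert measurements on the preceding slides (volume by severity, noisy hosts, volume per day) can all be derived from the same alert log. This sketch assumes a hypothetical log of (day, host, severity) tuples; real alerting tools export richer records.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical alert log: (day, host, severity).
alerts = [
    ("2024-01-01", "web-3", "WARNING"),
    ("2024-01-01", "web-3", "WARNING"),
    ("2024-01-01", "db-1", "CRITICAL"),
    ("2024-01-02", "web-3", "WARNING"),
    ("2024-01-03", "web-3", "INFO"),
    ("2024-01-03", "web-3", "WARNING"),
    ("2024-01-03", "web-3", "WARNING"),
    ("2024-01-03", "cache-2", "INFO"),
    ("2024-01-03", "web-3", "CRITICAL"),
    ("2024-01-03", "web-3", "WARNING"),
    ("2024-01-03", "web-3", "WARNING"),
    ("2024-01-03", "db-1", "WARNING"),
]

by_severity = Counter(severity for _, _, severity in alerts)
by_host = Counter(host for _, host, _ in alerts)
per_day = Counter(day for day, _, _ in alerts)

noisiest_host, noisiest_count = by_host.most_common(1)[0]
mean_per_day = mean(per_day.values())
median_per_day = median(per_day.values())

print("Total alerts:", len(alerts))
print("By severity:", dict(by_severity))
print(f"Noisiest host: {noisiest_host} ({noisiest_count} alerts)")
print(f"Alerts per day: mean {mean_per_day:.1f}, median {median_per_day:.1f}")
```

In this sample one host accounts for 9 of the 12 alerts, and a single busy day pulls the mean alerts per day (4) above the median (3), which is the same caution about averages applied to alert volume.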

MTTA
MEAN TIME TO
ACKNOWLEDGE
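
Time to acknowledge is the gap between an alert firing and someone acknowledging it. The sketch below assumes hypothetical incident records with ISO 8601 `triggered` and `acknowledged` timestamps; the field names are illustrative and not taken from any particular tool.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incidents; the last one fired overnight and sat unacknowledged.
incidents = [
    {"triggered": "2024-03-01T02:14:00", "acknowledged": "2024-03-01T02:17:00"},
    {"triggered": "2024-03-02T11:30:00", "acknowledged": "2024-03-02T11:35:00"},
    {"triggered": "2024-03-04T16:02:00", "acknowledged": "2024-03-04T16:04:00"},
    {"triggered": "2024-03-07T03:40:00", "acknowledged": "2024-03-07T04:26:00"},
]


def minutes_to_acknowledge(incident: dict) -> float:
    """Minutes elapsed between the alert firing and its acknowledgement."""
    triggered = datetime.fromisoformat(incident["triggered"])
    acknowledged = datetime.fromisoformat(incident["acknowledged"])
    return (acknowledged - triggered).total_seconds() / 60


ack_minutes = [minutes_to_acknowledge(i) for i in incidents]
mtta = mean(ack_minutes)
median_tta = median(ack_minutes)

print(f"Mean time to acknowledge:   {mtta:.1f} min")
print(f"Median time to acknowledge: {median_tta:.1f} min")
```

Here the mean is 14 minutes while the median is 4, so MTTA deserves the same scrutiny as MTTR when one slow acknowledgement is in the mix.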

BIOMETRIC DATA

SLEEP DISRUPTION
(MEAN TIME TO SLEEP) [15]
[15] https://www.slideshare.net/lozzd/mean-time-to-sleep-quantifying-the-oncall-experience

YOUR CHALLENGE:
Examine your own
Mean Time To Recover
Discuss additional methods of
understanding data -
Deployment frequency
Change Failure Rate
Mean Time To Recover (MTTR)
Hourly Cost of Outage

INCENTIVE
STRUCTURES
Be mindful of incentive structures to
"encourage" a reduction of MTTR
We may believe that our efforts are
improving when the truth is they
aren't.

WHAT DOES
YOUR DOWNTIME COST?

FINAL THOUGHTS:
> MTTR is important but not by itself
> Identify noisiest alerts and address
them now
> Bring more into the fold (Devs on-call)
> Shift observability left
> Share the information
> Share the pain

CONTINUOUSLY
IMPROVE

BEYOND THE
MEAN TIME TO RECOVER
@JASONHAND
VICTOROPS

Abstract
Mean Time To Repair (MTTR) has long been the de facto metric for those tasked with the responsibility of uptime. It’s the cornerstone measurement of how well teams respond to service disruptions and a key performance metric that nearly everyone in IT should aim to improve consistently. Swift recovery provides far more benefit than attempts to engineer failure out of complex systems.
As important as MTTR has become, it is no more than an average of how long it took to manage individual incidents (from acknowledgement to resolution) over a period of time. The number of data points during that time can vary greatly depending on the complexity and scale of systems. Furthermore, averages assume there is a normal event or that your data follows a normal distribution. Anyone who has been on-call can attest that some incidents take longer to resolve than others, and that variance is something you shouldn’t ignore.
Within any time-series dataset there are high and low values hidden within the data. These outliers mean that whatever we think of the time it takes to recover from failure, good or bad, a few high values in our average can distort or hide the lower values, and vice versa. We may believe that our efforts to reduce the time it takes to recover from failure are working when the truth is they are not.
In this talk, we’ll discuss the metric of Mean Time To Repair as well as additional methods of understanding data related to building and maintaining reliable systems at scale. MTTR must be made a priority for any IT team that habitually follows old-view approaches to incident response; however, a deeper understanding of the data provides much higher fidelity regarding the true health of your systems and the teams that support them.

Additional resources:
VictorOps.com (http://www.victorops.com)
Kitchen Soap: Blogs by John Allspaw (http://www.kitchensoap.com)
Signalvnoise.com (https://m.signalvnoise.com/)
