BEYOND THE
MTTR (Mean Time To Recover)
1 — @jasonhand
@JASONHAND
JASON HAND
VICTOROPS
The most relevant
metric in evaluating the
effectiveness of
emergency response is
how quickly the
response team can
bring the system back
to health 
-- that is, the MTTR.
— Benjamin Treynor Sloss
HIGH
AVAILABILITY &
RELIABILITY
99.999 %
UPTIME
PREDICT &
PREVENT
WHAT ABOUT
MTBF?
(Mean Time Between Failures)
COMPLEX
SYSTEMS
FAILURE
Failure cares not about the architecture
designs you slave over, the code you write
and review, or the alerts and metrics you
meticulously pore through...
...Failure happens. This is a foregone
conclusion when working with complex
systems.
— John Allspaw (Former CTO Etsy)
AVAILABILITY =
MTBF/(MTBF + MTTR)
A commonly used measurement of how
often a system or service is available
compared to the total time it should be
usable. [5]
[5] Effective DevOps (Jennifer Davis, Katherine Daniels)
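As a rough illustration of the formula above, the sketch below computes availability from MTBF and MTTR and converts it into expected downtime per year. The function names and the sample figures (one failure every 30 days, one hour to recover) are assumptions made for this example, not values from the talk.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected minutes of downtime in a year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


# Hypothetical service: one failure every 30 days, 1 hour to recover.
baseline = availability(mtbf_hours=30 * 24, mttr_hours=1.0)
print(f"availability:     {baseline:.4%}")
print(f"downtime/year:    {downtime_minutes_per_year(baseline):.0f} minutes")

# Halving MTTR raises availability without changing how often failures occur.
faster = availability(mtbf_hours=30 * 24, mttr_hours=0.5)
print(f"with 30 min MTTR: {faster:.4%}")

# For comparison, "five nines" (99.999%) leaves roughly 5 minutes per year.
print(f"five nines budget: {downtime_minutes_per_year(0.99999):.2f} minutes")
```

Because MTTR sits in the denominator alongside MTBF, shortening recovery improves availability even when failures keep arriving at the same rate.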
AVAILABILITY
& RELIABILITY:
THE RESULT OF A TEAM'S ABILITY TO...
RESPOND
& RECOVER
QUICKLY
HIGH-PERFORMING
ORGANIZATIONS
resolve production incidents 168 times faster than their
peers. [7]
[7] State of DevOps
WHAT IS THE
ROI
OF DEVOPS?
HOW MUCH DID THAT
OUTAGE COST THE COMPANY?
LET'S CALCULATE
COST OF DOWNTIME =
Deployment frequency
× Change Failure Rate
× Mean Time To Recover (MTTR)
× Hourly Cost of Outage
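One reading of the slide above is that the four factors are multiplied together, which is how the 2016 State of DevOps Report estimates the yearly cost of downtime. A minimal sketch under that assumption; the function name and every input value are hypothetical, chosen only to show the arithmetic.

```python
def annual_cost_of_downtime(deploys_per_year, change_failure_rate,
                            mttr_hours, hourly_cost_of_outage):
    """Estimate yearly downtime cost as the product of the four factors.

    change_failure_rate is a fraction between 0 and 1 (0.10 means 10%).
    """
    failed_changes = deploys_per_year * change_failure_rate
    downtime_hours = failed_changes * mttr_hours
    return downtime_hours * hourly_cost_of_outage


# Hypothetical organization: 100 deploys a year, 10% of them cause an outage,
# 2 hours to recover each time, and an outage costs $5,000 per hour.
cost = annual_cost_of_downtime(
    deploys_per_year=100,
    change_failure_rate=0.10,
    mttr_hours=2.0,
    hourly_cost_of_outage=5_000.0,
)
print(f"estimated cost of downtime: ${cost:,.0f} per year")
```

Note that the MTTR fed into this estimate is itself an average, so one unusually long outage can dominate the result.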
DEPLOYMENT
FREQUENCY
How many times per year your own org
deploys changes [1]
High performers = 1,460/year
Low performers = 7/year
[1] State of DevOps Report (2016), Puppet Labs
CHANGE
FAILURE RATE
Percentage of changes that cause an
outage in an organization
High performers = 0-15%
Low performers = 16-30%
MTTR & HOURLY COST OF OUTAGE
WHAT IS
MTTR EXACTLY?
MTTR (Mean Time To Recover)
How long does it generally take to
restore service when a service incident
occurs (e.g., unplanned outage, service
impairment)? [11]
[11] State of DevOps Report (2016)
BUT WHAT DOES
MEAN
MEAN?
ARITHMETIC MEAN
(ĂRˌĬTH-MĔTˈĬK MĒN)
n. The value obtained by dividing the sum of a set of
quantities by the number of quantities in the set
ALSO CALLED AVERAGE
AVERAGE (AS A METRIC)
TELLS YOU NOTHING ABOUT THE
ACTUAL INCIDENTS
LIMITATIONS OF
ARITHMETIC MEAN
In data sets that are skewed or where outliers are
present, calculating the arithmetic mean often provides
a misleading result. [3]
[3] Arithmetic Mean
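To make the limitation concrete, the sketch below shows how a single outlier moves the arithmetic mean. The recovery times are invented for illustration and do not come from the talk.

```python
from statistics import mean

# Hypothetical recovery times, in minutes, for ten routine incidents.
typical = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10]

# The same period with one extended 8-hour outage added.
with_outage = typical + [480]

mean_typical = mean(typical)
mean_with_outage = mean(with_outage)

# Count how many incidents actually took longer than the reported mean.
above_mean = sum(1 for minutes in with_outage if minutes > mean_with_outage)

print(f"mean without the outage: {mean_typical:.1f} min")
print(f"mean with the outage:    {mean_with_outage:.1f} min")
print(f"incidents above the mean: {above_mean} of {len(with_outage)}")
```

In this made-up data set only the outage itself exceeds the mean, so the reported figure describes none of the routine incidents well.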
DISTORTED
BIG PICTURE
AVERAGES...
The number of data points can vary
greatly depending on the complexity
and scale of systems.
Furthermore, averages assume there is
a normal event or that your data is a
normal distribution. [13]
— Richard Thaler
Average is a horrible metric for
optimizing performance
[13] Misbehaving: The Making of Behavioral Economics
MEAN TIME TO
WTF
REASONS INCLUDE:
Incidents are auto-resolving?
Engineers are re-routing alerts?
Incidents remain unresolved until postmortem?
Engineers are constantly hitting “resolve” because ...
IS DOWN
AGAIN!
SEV1 OR SEV2
OUTAGE
Just one extended outage can severely
tip the scale on data points and
normalization
WHAT ELSE SHOULD WE
MEASURE?
MEDIAN TIME TO RECOVER
More robust to outliers.
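A small sketch of what reporting a median alongside the mean could look like, using Python's standard `statistics` module. The helper name `summarize_recovery`, the choice to add a 90th percentile, and the sample data are assumptions for illustration only.

```python
from statistics import mean, median, quantiles


def summarize_recovery(minutes):
    """Summarize recovery times with the mean, median, and 90th percentile."""
    # quantiles(..., n=10) returns the nine decile cut points; the last is p90.
    deciles = quantiles(minutes, n=10, method="inclusive")
    return {
        "mean": mean(minutes),
        "median": median(minutes),
        "p90": deciles[-1],
    }


# Hypothetical recovery times in minutes, including one 8-hour outage.
recovery_minutes = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10, 480]

summary = summarize_recovery(recovery_minutes)
for name, value in summary.items():
    print(f"{name:<6} {value:6.1f} min")
```

The median stays close to the routine incidents while the mean is pulled toward the outage, which is why the two are worth reading side by side.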
VOLUME
OF ALERTS
BY SEVERITY
AND TOTAL
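Counting alerts by severity and in total needs little more than a tally. The sketch below assumes a hypothetical alert feed of (severity, host) pairs and borrows the INFO, WARNING, and CRITICAL levels that the deck lists under alert types; the host names and counts are invented.

```python
from collections import Counter

# Hypothetical alert feed: (severity, source host) pairs.
alerts = [
    ("CRITICAL", "db-1"),
    ("WARNING", "web-3"),
    ("INFO", "web-3"),
    ("WARNING", "web-3"),
    ("CRITICAL", "db-1"),
    ("WARNING", "cache-2"),
    ("INFO", "web-1"),
]


def volume_by_severity(alert_feed):
    """Count alerts per severity level."""
    return Counter(severity for severity, _host in alert_feed)


counts = volume_by_severity(alerts)
total = sum(counts.values())

for severity in ("CRITICAL", "WARNING", "INFO"):
    share = counts[severity] / total
    print(f"{severity:<8} {counts[severity]:>3}  ({share:.0%})")
print(f"{'TOTAL':<8} {total:>3}")
```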
TOTAL NUMBER
OF OUTAGES
determine total downtime
(Related: Service Level Agreements)
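The sketch below sums individual outages into total downtime and compares it with what a service level agreement would allow. The 30-day period, the three outage durations, and the 99.9% target are all hypothetical.

```python
from datetime import timedelta


def total_downtime(outages):
    """Sum the durations of all outages in the reporting period."""
    return sum(outages, timedelta())


def achieved_availability(outages, period):
    """Fraction of the period during which the service was up."""
    return 1.0 - total_downtime(outages) / period


# Hypothetical month: three outages, measured against an assumed 99.9% SLA.
period = timedelta(days=30)
outages = [timedelta(minutes=45), timedelta(minutes=10), timedelta(minutes=5)]
sla_target = 0.999

downtime = total_downtime(outages)
allowed = period * (1.0 - sla_target)  # downtime budget implied by the SLA
achieved = achieved_availability(outages, period)
sla_met = downtime <= allowed

print(f"outages:        {len(outages)}")
print(f"total downtime: {downtime}")
print(f"allowed by SLA: {allowed}")
print(f"availability:   {achieved:.4%}")
print(f"SLA met:        {sla_met}")
```

Total downtime answers a different question than an average recovery time: it reflects both how many outages occurred and how long each one lasted.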
NOISY HOSTS OR SERVICES
ALERT ACTIONABILITY
ALERT TYPES
INFO, WARNING, CRITICAL
ALERT
VOLUME / DAY
(pssst... careful of avg)
ALERT TIMES
MTTA
MEAN TIME TO
ACKNOWLEDGE
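Time to acknowledge can be derived from two timestamps per incident: when the alert fired and when someone acknowledged it. A minimal sketch with invented timestamps; the tuple layout and the function name are assumptions, not the format of any particular alerting tool.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incidents as (triggered_at, acknowledged_at) pairs.
incidents = [
    (datetime(2017, 5, 1, 2, 14), datetime(2017, 5, 1, 2, 17)),
    (datetime(2017, 5, 3, 14, 0), datetime(2017, 5, 3, 14, 5)),
    (datetime(2017, 5, 7, 9, 30), datetime(2017, 5, 7, 9, 32)),
    (datetime(2017, 5, 9, 3, 10), datetime(2017, 5, 9, 4, 0)),
]


def minutes_to_acknowledge(incident_list):
    """Return the acknowledgement delay of each incident, in minutes."""
    return [
        (acked - triggered).total_seconds() / 60
        for triggered, acked in incident_list
    ]


delays = minutes_to_acknowledge(incidents)
mtta = mean(delays)
median_tta = median(delays)

print(f"delays (min): {delays}")
print(f"MTTA:         {mtta:.1f} min")
print(f"median:       {median_tta:.1f} min")
```

As with recovery times, one slow acknowledgement in the middle of the night can lift the mean well above what most incidents experienced.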
BIOMETRIC DATA
SLEEP DISRUPTION
(MEAN TIME TO SLEEP) [15]
[15] https://www.slideshare.net/lozzd/mean-time-to-sleep-quantifying-the-oncall-experience
YOUR CHALLENGE:
Examine your own
Mean Time To Recover
Discuss additional methods of
understanding data -
Deployment frequency
Change Failure Rate
Mean Time To Recover (MTTR)
Hourly Cost of Outage
INCENTIVE
STRUCTURES
Be mindful of incentive structures that
"encourage" a reduction of MTTR.
We may believe that our efforts are
working when the truth is they
aren't.
WHAT DOES
YOUR DOWNTIME COST?
DEVOPS
ROI
FINAL THOUGHTS:
> MTTR is important but not by itself
> Identify noisiest alerts and address
them now
> Bring more into the fold (Devs on-call)
> Shift observability left
> Share the information
> Share the pain
ABOVE
ALL...
CONTINUOUSLY
IMPROVE
BEYOND THE
MEAN TIME TO RECOVER
@JASONHAND
VICTOROPS
THANK
YOU
Abstract
Mean Time to Repair (MTTR) has long been the de facto metric for those tasked with the responsibility of uptime. It’s the
cornerstone measurement of how well teams respond to service disruptions and a key performance metric that nearly
everyone in IT should aim to improve consistently. Swift recovery provides far more benefits than attempts to engineer
failure out of complex systems.
As important as MTTR has become, it is no more than an average of how long it took to manage individual incidents
(from acknowledgement to resolution) over a period of time. The number of data points during that time can vary
greatly depending on the complexity and scale of systems. Furthermore, averages assume there is a normal
event or that your data follows a normal distribution. Anyone who has been on call can attest that some incidents take
longer to resolve than others, and that variance is something you shouldn’t ignore.
Within any time-series dataset there are in fact high and low values hidden within the data. These outliers may indicate
that, whether we think the time it takes to recover from failure is good, bad, or otherwise, high values in our average
are distorting or hiding lower values, and vice versa. We may believe that our efforts to reduce the time it takes to
recover from failure are in fact working when the truth is they’re not.
In this talk, we’ll discuss the metric of Mean Time To Repair as well as additional methods of understanding data related
to building and maintaining reliable systems at scale. MTTR must be made a priority for any IT team that habitually
follows old-view approaches to incident response; however, a deeper understanding of the data provides much higher
fidelity regarding the true health of your systems and the teams that support them.
Additional resources:
VictorOps.com (http://www.victorops.com)
Kitchen Soap: Blogs by John Allspaw (http://www.kitchensoap.com)
Signalvnoise.com (https://m.signalvnoise.com/)