The Dark Art of Building a Production Incident System
@Alois Reitbauer
Tech. Evangelist & Product Mgr., Compuware
No broken cables
No datacenter fires
Other things can happen
as well
Continuous deployments

Infrastructure changes
other “everyday” stuff
Scaling an incident system
How it feels to do what
we do
Do you alert?
Typical error rate of 3 percent at
10,000 transactions/min
During the night we now have 5
errors in 100 requests.
Do you alert?
Typical response time has been
around 300 ms.
Now we see response times up
to 600 ms.
We are good at fixing
problems, but not really
good at detecting them.
How can we get better?
It is all about statistics

Statistics is about
objectively lying to yourself
in a meaningful way.
How to design an incident
It looks really simple
How to calculate
this value?

Which metric
to pick?

How to get
this baseline?

How to define that
this happened?
Which metrics to pick?
Three types of metrics
Capacity Metrics
Define how much of a resource is used.
Discrete Metrics
Simple countable things, like errors or users.
Continuous Metrics
Metrics represented by a range of values at any
given time.
Capacity Metrics
Good for capacity planning, not so good for
production alerting
Connection Pools
better use
Connection acquisition time
Tells you whether anyone needed a connection
and did not get it.
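One way to capture connection acquisition time is to wrap the pool and time every acquire call. The sketch below is only an illustration: TimedPool is a hypothetical class built on Python's queue.Queue, not the API of any real connection pool, and the timeout value is an assumption.

```python
import queue
import time


class TimedPool:
    """Hypothetical minimal pool that records how long each acquire waited."""

    def __init__(self, connections):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self.wait_times = []  # seconds spent waiting, one entry per acquire

    def acquire(self, timeout=1.0):
        start = time.perf_counter()
        try:
            # Blocks until a connection is free; raises queue.Empty on timeout.
            conn = self._free.get(timeout=timeout)
        finally:
            # Record the wait even when nobody got a connection.
            self.wait_times.append(time.perf_counter() - start)
        return conn

    def release(self, conn):
        self._free.put(conn)
```

A wait time close to zero means the pool had spare connections. Growing wait times, or timeouts, mean that someone needed a connection and did not get it.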
CPU Usage
better use
Combination of Load Average and CPU usage
Even better, correlate them with the response times of
applications
Discrete Metrics
Pretty easy to track and analyze.
Continuous Metrics
Require some extra work as they are not that
easy to track.
Continuous Metrics – The hope

42
Continuous Metrics – The
reality
What the average tells us
What the median tells us
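A small sketch of why the two differ. The sample values are hypothetical, chosen so that most requests are fast and two are very slow.

```python
import statistics

# Hypothetical response times in ms: eight fast requests, two very slow ones.
samples = [100, 110, 120, 130, 140, 150, 160, 170, 2000, 3000]

mean = statistics.mean(samples)      # pulled upwards by the two outliers
median = statistics.median(samples)  # the value a typical request experiences

print(f"mean={mean} ms, median={median} ms")
```

Here the mean is 608 ms although eight out of ten requests finished in 170 ms or less. The median of 145 ms is much closer to what most users saw.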
How to get a baseline?
A baseline is not a number
Baselines define the range of a value combined
with a probability
Normal distribution as baseline
Mean: 500 ms
Std. Dev.: 100 ms
68 %: 400 ms – 600 ms
95 %: 300 ms – 700 ms
99.7 %: 200 ms – 800 ms
[Chart: normal distribution curve, x-axis from 0 to 900 ms]
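The ranges of such a baseline follow directly from the mean and the standard deviation. The sketch below uses NormalDist from Python's standard library with the mean of 500 ms and standard deviation of 100 ms from the slide. The helper name sigma_band is an assumption made for this illustration.

```python
from statistics import NormalDist

# Baseline from the slide: mean 500 ms, standard deviation 100 ms.
baseline = NormalDist(mu=500, sigma=100)


def sigma_band(dist, k):
    """Return (low, high, coverage) for the range mean +/- k standard deviations."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    coverage = dist.cdf(high) - dist.cdf(low)
    return low, high, coverage


for k in (1, 2, 3):
    low, high, coverage = sigma_band(baseline, k)
    print(f"{coverage:.1%} of values between {low:.0f} ms and {high:.0f} ms")
```

This only holds if response times really are normally distributed, which is exactly the assumption the following slides question.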
This can go really wrong

“Why alerts suck and monitoring solutions need to become better”
How this leads to false
alerts
Many false alerts

Aggressive Baseline
No alerts at all

Moderate Baseline
Find the right distribution
model
However, this can be really hard to impossible
Your distribution might look like
this
… or like this
or completely different
you never know …
How can we solve this
problem?
Normal distribution - again

50 percent of values are at or below μ

Median

97.7 percent of values are at or below μ + 2σ

97.7th percentile
The 50th and 90th percentiles
define normal behavior
without needing
to know anything about the
distribution model
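Percentiles can be read straight from sorted samples, with no distribution model involved. A minimal sketch using the nearest-rank method, which is one of several common percentile definitions. The sample values and the function name percentile are assumptions for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p percent of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in ms collected during normal operation.
history = [280, 310, 290, 300, 450, 320, 270, 305, 295, 600]

baseline = {"p50": percentile(history, 50), "p90": percentile(history, 90)}
print(baseline)
```

The resulting pair of values describes normal behavior: half of the requests finish within the p50 value, and nine out of ten within the p90 value.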
Median shows the real problem
How to define non-normal behavior?
Fortunately this is not the
problem we need to solve
We are only talking about missed expectations
Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?

Response Times
Is a certain increase in response time significant
enough to trigger an incident?
The error rate scenario
We have a typical error rate of 3 percent at
10,000 transactions/minute
During the night we now have 5 errors in 100
requests. Should we alert – or not?
What can we learn
Statistics is everywhere
Binomial Distribution
Tells us how likely it is to see n successes in a
certain number of trials
How many errors are ok?
Likelihood of at least n errors
18 % probability of seeing 5 or more errors, which is within
two standard deviations of the expected value. We do not
alert.
[Chart: likelihood of at least n errors (y-axis, 0 % to 100 %) for n = 1 to 19 (x-axis)]
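The probability for the error rate scenario can be recomputed with the binomial distribution. In the sketch below, the function name prob_at_least and the 5 percent alert threshold are assumptions made for illustration. The error rate of 3 percent and the 5 errors in 100 requests come from the scenario.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of at least k errors in n requests when each fails with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


typical_error_rate = 0.03   # from the scenario
requests, errors = 100, 5   # observed during the night

probability = prob_at_least(errors, requests, typical_error_rate)

# Assumed rule: alert only if the observation is very unlikely under the baseline.
ALERT_THRESHOLD = 0.05
should_alert = probability < ALERT_THRESHOLD

print(f"P(at least {errors} errors) = {probability:.1%}, alert: {should_alert}")
```

With these numbers the probability is about 18 percent, so seeing 5 errors in 100 requests is nothing unusual at a typical error rate of 3 percent.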
Response Time Example
Our median response time is 300 ms
and we measure
200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms
Percentile Drift
Detection
Did the median drift
significantly?
Check all values above 300 ms
200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms

7 values are higher than the median. Is this normal?

We can again use the Binomial Distribution
Applying the Binomial
Distribution
Each value has a 50 percent likelihood of being
above the median.
How likely is it that at least 7 out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
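The drift check can be sketched as a simple count followed by a binomial tail probability. The function name median_drift_probability is an assumption for illustration. The ten samples and the baseline median of 300 ms are the ones from the example.

```python
from math import comb


def median_drift_probability(samples, baseline_median):
    """Count samples above the baseline median and return (count, probability).

    The probability is the chance of seeing at least that many values above the
    median if the median has not moved, i.e. each sample is above it with p = 0.5.
    """
    n = len(samples)
    above = sum(1 for s in samples if s > baseline_median)
    probability = sum(comb(n, i) for i in range(above, n + 1)) / 2**n
    return above, probability


# Measured response times in ms from the example.
measured = [200, 500, 400, 150, 350, 350, 200, 400, 600, 600]

above, probability = median_drift_probability(measured, baseline_median=300)
print(f"{above} of {len(measured)} above the median, probability {probability:.1%}")
```

For these samples 7 of 10 values are above the median, which happens with a probability of about 17 percent even when nothing has changed.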
… and we are done!
How to calculate
this value?

Which metric
to pick?

How to get
this baseline?

How to define that
this happened?
This was just the beginning
There are many more useful things to learn about statistics,
probabilities, testing, …
Alois Reitbauer
alois.reitbauer@compuware.com
@AloisReitbauer
apmblog.compuware.com
Image Credits
http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
