The Dark Art of Building a Production Incident System
@Alois Reitbauer
Tech. Evangelist & Product Mgr., Compuware

Other things can happen
as well
Continuous deployments

Infrastructure changes
other “everyday” stuff

Do you alert?
Typical error rate of 3 percent at
10,000 transactions/min.
During the night we now have 5
errors in 100 requests.

Do you alert?
Typical response time has been
around 300 ms.
Now we see response times up
to 600 ms.

We are good at fixing
problems, but not really
good at detecting them.

It is all about statistics

It’s all about statistics

Statistics is about
objectively lying to yourself
in a meaningful way.

It looks really simple
How to calculate
this value?

Which metric
to pick?

How to get
this baseline?

How to define that
this happened?

Three types of metrics
Capacity Metrics
Define how much of a resource is used.
Discrete Metrics
Simple countable things, like errors or users.
Continuous Metrics
Metrics represented by a range of values at any
given time.

Capacity Metrics
Good for capacity planning, not so good for
production alerting

Better use
Connection acquisition time
Tells you, whether anyone needed a connection
and did not get it.

Better use
Combination of Load Average and CPU usage
Even better: correlate them with the response times of
applications
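As a rough sketch of what correlating a capacity metric with response times could look like, the snippet below computes a Pearson correlation coefficient by hand. The load and response-time samples are hypothetical values chosen for illustration, and `pearson` is a helper name introduced here, not something from the talk.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sample lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical samples taken at the same points in time (illustration only).
load_average = [0.5, 0.8, 1.2, 2.5, 4.0]
response_time_ms = [290, 300, 320, 450, 610]

r = pearson(load_average, response_time_ms)
print(f"correlation between load average and response time: {r:.2f}")
```

A coefficient close to 1 suggests that rising load goes along with slower responses. A coefficient near 0 suggests the capacity metric alone says little about what users experience.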

Discrete Metrics
Pretty easy to track and analyze.

Continuous Metrics
Require some extra work as they are not that
easy to track.

Continuous Metrics – The hope

42

Continuous Metrics – The reality

A baseline is not a number
Baselines define the range of a value combined
with a probability

Normal distribution as baseline
Mean: 500 ms
Std. Dev.: 100 ms
68 %: 400 ms to 600 ms
95 %: 300 ms to 700 ms
99.7 %: 200 ms to 800 ms
[Figure: normal distribution curve, x-axis from 0 to 900 ms]
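A minimal sketch of how the ranges of such a baseline follow from the mean and standard deviation, using Python's standard library. The helper name `central_range` is an assumption made for this example. The parameters are the ones from the slide.

```python
from statistics import NormalDist

# Baseline parameters taken from the slide: mean 500 ms, std. dev. 100 ms.
baseline = NormalDist(mu=500, sigma=100)

def central_range(dist, k):
    """Return (low, high, probability) for the band mean +/- k standard deviations."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high, p = central_range(baseline, k)
    print(f"{p:.1%} of values fall between {low:.0f} ms and {high:.0f} ms")
```

This only holds if response times really are normally distributed, which is the assumption the baseline rests on.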

This can go really wrong

“Why alerts suck and monitoring solutions need to become better”

How this leads to false
alerts

Many false alerts

Aggressive Baseline

No alerts at all

Moderate Baseline

Find the right distribution
model
However, this can be really hard or even impossible

Your distribution might look like
this

or completely different
you never know …

How can we solve this
problem?

Normal distribution - again

50 percent of values lie below μ

Median

About 97.7 percent of values lie below μ + 2σ

97th Percentile

The 50th and 90th percentiles
define normal behavior
without needing
to know anything about the
distribution model
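A small sketch of how the median and the 90th percentile can be read directly from measured values, without fitting any distribution model. It uses the nearest-rank definition of a percentile, which is one of several common definitions. The sample values and the helper name `percentile` are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in ms (illustration only).
response_times_ms = [210, 250, 280, 290, 300, 310, 330, 360, 420, 900]

median = percentile(response_times_ms, 50)
p90 = percentile(response_times_ms, 90)
print(f"median = {median} ms, 90th percentile = {p90} ms")
```

Note that the single 900 ms outlier moves neither value, whereas it would pull a mean-based baseline upward.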

How to define non-normal behavior?

Fortunately this is not the
problem we need to solve
We are only talking about missed expectations

Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?

Response Times
Is a certain increase in response time significant
enough to trigger an incident?

The error rate scenario
We have a typical error rate of 3 percent at
10,000 transactions/minute.
During the night we now have 5 errors in 100
requests. Should we alert – or not?

Binomial Distribution
Tells us how likely it is to see n successes in a
certain number of trials

How many errors are ok?
There is an 18 % probability of seeing 5 or more errors,
which is within 2 standard deviations. We do not alert.
[Chart: likeliness of at least n errors, in percent, for n = 1 to 19]
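The 18 percent figure can be reproduced with a few lines of Python. This is a sketch under the assumption that each of the 100 requests fails independently with the typical error rate of 3 percent. The function name `prob_at_least` and the 5 percent cut-off are hypothetical choices for this example, not values from the talk.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more errors in n requests."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Scenario from the slides: 5 errors in 100 requests at a 3 percent error rate.
p_tail = prob_at_least(5, 100, 0.03)

# Hypothetical rule: alert only if the observation is less than 5 percent likely.
ALERT_THRESHOLD = 0.05
should_alert = p_tail < ALERT_THRESHOLD

print(f"P(at least 5 errors) = {p_tail:.1%}, alert: {should_alert}")
```

With these inputs the probability is well above the cut-off, so seeing 5 errors in 100 requests is not unusual enough to raise an incident.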

Response Time Example
Our median response time is 300 ms
and we measure:
200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms

Did the median drift significantly?
Check all values above 300 ms:
200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms

7 values are higher than the median. Is this normal?

We can again use the Binomial Distribution

Applying the Binomial
Distribution
We have a 50 percent likelihood of seeing values
above the median.
How likely is it that at least 7 out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
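The same check can be sketched in Python using the ten measurements from the example. If the median is still 300 ms, each sample lies above it with probability 0.5, so the number of samples above the median follows a binomial distribution. The function name `median_drift_probability` is a hypothetical helper introduced for this sketch.

```python
from math import comb

def median_drift_probability(samples, baseline_median):
    """Count the samples above the baseline median and return that count together
    with the probability of seeing at least that many if the median is unchanged."""
    n = len(samples)
    above = sum(1 for s in samples if s > baseline_median)
    # With p = 0.5 every outcome has probability 1 / 2**n, so only counts are needed.
    tail = sum(comb(n, i) for i in range(above, n + 1)) / 2**n
    return above, tail

# Measurements from the slide, in ms.
measured_ms = [200, 500, 400, 150, 350, 350, 200, 400, 600, 600]

above, probability = median_drift_probability(measured_ms, 300)
print(f"{above} of {len(measured_ms)} samples above the median, probability {probability:.1%}")
```

The 17 percent is the chance of 7 or more samples above the median. The chance of exactly 7 would be lower, at roughly 12 percent.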

… and we are done!
How to calculate
this value?

Which metric
to pick?

How to get
this baseline?

How to define that
this happened?

This was just the beginning
There are many more useful things to learn about statistics,
probabilities, testing, …

Alois Reitbauer
alois.reitbauer@compuware.com
@AloisReitbauer
apmblog.compuware.com

Image Credits
http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
