16. Three types of metrics
Capacity Metrics
Define how much of a resource is used.
Discrete Metrics
Simple countable things, like errors or users.
Continuous Metrics
Metrics represented by a range of values at any given time.
29. A baseline is not a number
Baselines define the range of a value combined with a probability
30. Normal distribution as baseline
Mean: 500 ms
Std. Dev.: 100 ms
68 %: 400 ms – 600 ms (mean ± 1 Std. Dev.)
95 %: 300 ms – 700 ms (mean ± 2 Std. Dev.)
99.7 %: 200 ms – 800 ms (mean ± 3 Std. Dev.)
[Chart: normal distribution curve, x-axis from 0 to 900 ms]
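The ranges on this slide can be reproduced with a few lines of Python. This is a minimal sketch using the standard library's `statistics.NormalDist`; the helper name `coverage` is made up for illustration, and it assumes response times really are normally distributed, which real systems often violate.

```python
from statistics import NormalDist

# Baseline taken from the slide: mean 500 ms, standard deviation 100 ms.
baseline = NormalDist(mu=500, sigma=100)


def coverage(dist, k):
    """Return (low, high, probability) for the band mean +/- k standard deviations.

    Hypothetical helper, not part of any monitoring product.
    """
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    # Probability mass between low and high under the normal assumption.
    return low, high, dist.cdf(high) - dist.cdf(low)


for k in (1, 2, 3):
    low, high, prob = coverage(baseline, k)
    print(f"+/-{k} Std. Dev.: {low:.0f} ms to {high:.0f} ms covers {prob:.1%}")
```

A measurement outside the widest band is rare under this baseline, which is what makes a range plus a probability more useful than a single threshold number.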
31. This can go really wrong
“Why alerts suck and monitoring solutions need to become better”
44. Fortunately, this is not the problem we need to solve
We are only talking about missed expectations
45. Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?
Response Times
Is a certain increase in response time significant
enough to trigger an incident?
46. The error rate scenario
We have a typical error rate of 3 percent at
10,000 transactions/minute.
During the night we now see 5 errors in 100
requests. Should we alert or not?
50. How many errors are ok?
[Chart: likelihood of at least n errors in 100 requests, for n = 1 to 19]
There is an 18 % probability of seeing 5 or more
errors, and 5 errors is within 2 standard deviations
of the 3 errors we expect. We do not alert.
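The 18 % figure can be checked with the Binomial Distribution. The sketch below assumes that errors are independent and that each request fails with the baseline probability of 3 percent; the function name `prob_at_least` is chosen for illustration only.

```python
from math import comb, sqrt


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of P(X <= k - 1)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


# Values from the slides: 3 % baseline error rate, 5 errors in 100 requests.
n, p, observed = 100, 0.03, 5

tail = prob_at_least(observed, n, p)   # chance of 5 or more errors
expected = n * p                       # 3 errors expected
std_dev = sqrt(n * p * (1 - p))        # about 1.7 errors

print(f"P(at least {observed} errors) = {tail:.1%}")
print(f"expected {expected:.0f} errors, 2 Std. Dev. upper bound {expected + 2 * std_dev:.1f}")
```

With only 100 requests, 5 errors is still an unremarkable outcome for a 3 percent error rate; the same 5 percent rate sustained over 10,000 requests would be far outside the expected range.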
51. Response Time Example
Our median response time is 300 ms, and we measure:
200 ms 400 ms 350 ms 200 ms 600 ms
500 ms 150 ms 350 ms 400 ms 600 ms
53. Did the median drift significantly?
Check all values above 300 ms
200 ms 400 ms 350 ms 200 ms 600 ms
500 ms 150 ms 350 ms 400 ms 600 ms
7 values are higher than the median. Is this normal?
We can again use the Binomial Distribution
54. Applying the Binomial Distribution
Each sample has a 50 percent probability of being above the
median.
How likely is it that at least 7 out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
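The median check can be written out as a small sign test. This is a sketch using the ten samples from the slides; the 5 percent alert threshold is an assumption added for illustration, since the slides do not state one.

```python
from math import comb

baseline_median = 300  # ms, from the slide
samples = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]  # ms, from the slide

# Count samples above the baseline median.
above = sum(1 for s in samples if s > baseline_median)
n = len(samples)

# If the median did not drift, each sample is above it with probability 0.5,
# so the count follows Binomial(n, 0.5). Sum the tail from `above` up to n.
p_value = sum(comb(n, i) for i in range(above, n + 1)) / 2**n

ALERT_THRESHOLD = 0.05  # hypothetical cut-off, not taken from the slides
should_alert = p_value < ALERT_THRESHOLD

print(f"{above} of {n} samples above {baseline_median} ms, probability {p_value:.1%}")
print("alert" if should_alert else "no alert")
```

The test only looks at which side of the median each sample falls on, so it makes no assumption about how response times are distributed.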
55. How to calculate this value?
… and we are done!
Which metric to pick?
How to get this baseline?
How to define that this happened?
56. This was just the beginning
There are many more useful things to learn about statistics,
probabilities, testing, …