


- 1. The Dark Art of Building a Production Incident System @AloisReitbauer www.ruxit.com
- 2. No broken cables
- 3. No datacenter fires
- 4. Other things can happen as well: continuous deployments, infrastructure changes, other “everyday” stuff
- 5. Scaling an incident system
- 6. How it feels to do what we do
- 7. Do you alert? Typical error rate of 3 percent at 10,000 transactions/min. During the night we now have 5 errors in 100 requests.
- 8. Do you alert? Typical response time has been around 300 ms. Now we see response times up to 600 ms.
- 9. We are good at fixing problems, but not really good at detecting them.
- 10. How can we get better?
- 11. It’s all about statistics
- 12. Statistics is about objectively lying to yourself in a meaningful way.
- 13. How to design an incident
- 14. It looks really simple. Which metric to pick? How to calculate this value? How to get this baseline? How to define that this happened?
- 15. Which metrics to pick?
- 16. Three types of metrics. Capacity metrics: define how much of a resource is used. Discrete metrics: simple countable things, like errors or users. Continuous metrics: metrics represented by a range of values at any given time.
- 17. Capacity Metrics: good for capacity planning, not so good for production alerting
- 18. Connection Pools
- 19. Better use connection acquisition time: it tells you whether anyone needed a connection and did not get it.
- 20. CPU Usage
- 21. Better use a combination of load average and CPU usage; even better, correlate them with the response times of applications.
- 22. Discrete Metrics: pretty easy to track and analyze.
- 23. Continuous Metrics: require some extra work as they are not that easy to track.
- 24. Continuous Metrics – The hope 42
- 25. Continuous Metrics – The reality
- 26. What the average tells us
- 27. What the median tells us
- 28. How to get a baseline?
- 29. A baseline is not a number. Baselines define the range of a value combined with a probability.
- 30. Normal distribution as baseline. Mean: 500 ms, Std. Dev.: 100 ms. 68 %: 400 ms – 600 ms; 95 %: 300 ms – 700 ms; 99.7 %: 200 ms – 800 ms
- 31. This can go really wrong “Why alerts suck and monitoring solutions need to become better”
- 32. How this leads to false alerts
- 33. Many false alerts: aggressive baseline
- 34. No alerts at all: moderate baseline
- 35. Find the right distribution model. However, this can be really hard or even impossible.
- 36. Your distribution might look like this
- 37. … or like this
- 38. … or completely different. You never know …
- 39. How can we solve this problem?
- 40. Normal distribution, again. 50 percent of values lie below μ (the median); about 97.7 percent lie below μ + 2σ (roughly the 97th percentile).
- 41. The 50th and 90th percentiles define normal behavior without needing to know anything about the distribution model
- 42. Median shows the real problem
- 43. How to define non-normal behavior?
- 44. Fortunately, this is not the problem we need to solve. We are only talking about missed expectations.
- 45. Let’s look at two scenarios. Errors: Is a certain error rate likely to happen or not? Response times: Is a certain increase in response time significant enough to trigger an incident?
- 46. The error rate scenario. We have a typical error rate of 3 percent at 10,000 transactions/minute. During the night we now have 5 errors in 100 requests. Should we alert – or not?
- 47. What can we learn
- 48. Statistics is everywhere
- 49. Binomial Distribution: tells us how likely it is to see n successes in a certain number of trials
- 50. How many errors are ok? [Chart: likelihood of at least n errors, for n = 1 to 19] There is an 18 % probability of seeing 5 or more errors, which is within 2 standard deviations. We do not alert.
- 51. Response Time Example. Our median response time is 300 ms and we measure: 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
- 52. Percentile Drift Detection
- 53. Did the median drift significantly? Check all values above 300 ms: 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms. 7 values are higher than the median. Is this normal? We can again use the Binomial Distribution.
- 54. Applying the Binomial Distribution. Each value has a 50 percent likelihood of being above the median. How likely is it that at least 7 out of 10 samples are higher? The probability is 17 percent, so we should not alert.
- 55. How to calculate this value? … and we are done! Which metric to pick? How to get this baseline? How to define that this happened?
- 56. This was just the beginning. There are many more useful things about statistics, probabilities, testing, ….
- 57. Alois Reitbauer alois.reitbauer@ruxit.com @AloisReitbauer http://bit.ly/bostonwebperf
- 58. Image Credits: http://commons.wikimedia.org/wiki/File:Network_switches.jpg http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg http://commons.wikimedia.org/wiki/File:Estacaobras.jpg http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG http://commons.wikimedia.org/wiki/File:Dice_02138.JPG http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
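
The percentile-based baseline from slides 29 to 41 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the `percentile` helper, the nearest-rank method, and the synthetic log-normal response times are all invented for the example.

```python
import math
import random
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical, right-skewed response times in ms (assumption: log-normal,
# chosen only because real response times are rarely symmetric).
rng = random.Random(42)
response_times = [rng.lognormvariate(5.7, 0.6) for _ in range(10_000)]

mean = statistics.mean(response_times)
stdev = statistics.stdev(response_times)
median = percentile(response_times, 50)
p90 = percentile(response_times, 90)

# Range a normal model would call "95 % of requests". For skewed data the
# lower bound can be meaningless (it may even fall below zero).
normal_low, normal_high = mean - 2 * stdev, mean + 2 * stdev

print(f"mean={mean:.0f} ms  median={median:.0f} ms  p90={p90:.0f} ms")
print(f"normal-model range: {normal_low:.0f} ms to {normal_high:.0f} ms")
```

On skewed data like this the mean sits above the median, which is one reason the deck prefers the 50th and 90th percentiles: they describe what half and nine tenths of requests experience without assuming any distribution model.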

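Both scenarios in the deck (slides 46 to 54) reduce to the same question: how likely is it to see at least k "hits" in n trials if each hit has probability p? A minimal Python sketch of that binomial tail follows. The function names and the 5 percent alerting cutoff are assumptions made for illustration; the slides only say the observed counts are not unusual enough to alert.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    start = max(k, 0)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(start, n + 1))


# Assumed cutoff: alert only if the observation would occur by chance
# less than 5 % of the time. This value is not from the talk.
ALERT_THRESHOLD = 0.05


def should_alert(k, n, p, threshold=ALERT_THRESHOLD):
    """True if seeing k or more hits in n trials is unlikely under the baseline."""
    return prob_at_least(k, n, p) < threshold


# Error-rate scenario: baseline error rate 3 %, observed 5 errors in 100 requests.
error_tail = prob_at_least(5, 100, 0.03)

# Median-drift scenario: each sample is above the baseline median with
# probability 0.5; observed 7 of 10 samples above 300 ms.
samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]
above = sum(1 for s in samples_ms if s > 300)
drift_tail = prob_at_least(above, len(samples_ms), 0.5)

print(f"P(>=5 errors in 100)  = {error_tail:.3f}, alert: {should_alert(5, 100, 0.03)}")
print(f"P(>={above} of 10 above) = {drift_tail:.3f}, alert: {should_alert(above, 10, 0.5)}")
```

Both probabilities come out at roughly 18 and 17 percent, matching the figures on the slides, so neither observation would trigger an incident under the assumed cutoff.
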