The Dark Art of Production Alerting

Published in: Technology, Design
  • http://www.appdynamics.com/blog/2012/01/23/why-alerts-suck-and-monitoring-solutions-need-to-become-smarter/

    1. 1. The Dark Art of Building a Production Incident System @AloisReitbauer www.ruxit.com
    2. 2. No broken cables
    3. 3. No datacenter fires
    4. 4. Other things can happen as well: continuous deployments, infrastructure changes, other “everyday” stuff
    5. 5. Scaling an incident system
    6. 6. How it feels to do what we do
    7. 7. Do you alert? Typical error rate of 3 percent at 10,000 transactions/min. During the night we now have 5 errors in 100 requests.
    8. 8. Do you alert? Typical response time has been around 300 ms. Now we see response times up to 600 ms.
    9. 9. We are good at fixing problems, but not really good at detecting them.
    10. 10. How can we get better?
    11. 11. It’s all about statistics
    12. 12. Statistics is about objectively lying to yourself in a meaningful way.
    13. 13. How to design an incident
    14. 14. How to calculate this value? It looks really simple. Which metric to pick? How to get this baseline? How to define that this happened?
    15. 15. Which metrics to pick?
    16. 16. Three types of metrics. Capacity Metrics: define how much of a resource is used. Discrete Metrics: simple countable things, like errors or users. Continuous Metrics: metrics represented by a range of values at any given time.
    17. 17. Capacity Metrics: good for capacity planning, not so good for production alerting.
    18. 18. Connection Pools
    19. 19. Better use: connection acquisition time. It tells you whether anyone needed a connection and did not get it.
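
One way to capture connection acquisition time is to time the wait inside the pool itself. The following is a minimal sketch, not a real pool implementation: `TimedPool`, its methods, and the `"conn-1"` placeholder are hypothetical names introduced for illustration, and the 50 ms timeout is an arbitrary assumption.

```python
import queue
import time


class TimedPool:
    """Hypothetical fixed-size pool that records how long each acquire waits."""

    def __init__(self, connections):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self.acquire_times_ms = []

    def acquire(self, timeout=None):
        start = time.perf_counter()
        try:
            # Blocks until a connection is free; raises queue.Empty on timeout.
            return self._free.get(timeout=timeout)
        finally:
            # Record the wait even when the caller gave up without a connection.
            self.acquire_times_ms.append((time.perf_counter() - start) * 1000)

    def release(self, conn):
        self._free.put(conn)


pool = TimedPool(["conn-1"])      # placeholder object standing in for a connection
first = pool.acquire()            # a connection is free: the wait is close to zero
try:
    pool.acquire(timeout=0.05)    # the pool is exhausted: waits, then gives up
    starved = False
except queue.Empty:
    starved = True
pool.release(first)

print(starved, [round(t, 1) for t in pool.acquire_times_ms])
```

A pool that is fully used but never makes anyone wait shows short acquisition times, whereas a caller that was starved shows up as a long wait, which is the signal the slide asks for.
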
    20. 20. CPU Usage
    21. 21. Better use: a combination of load average and CPU usage. Even better: correlate them with the response times of applications.
    22. 22. Discrete Metrics: pretty easy to track and analyze.
    23. 23. Continuous Metrics: require some extra work as they are not that easy to track.
    24. 24. Continuous Metrics – The hope 42
    25. 25. Continuous Metrics – The reality
    26. 26. What the average tells us
    27. 27. What the median tells us
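
As a general illustration of how the average and the median can disagree, here is a minimal Python sketch. The response times are invented for illustration and are not taken from the talk: mostly fast requests plus two slow outliers.

```python
from statistics import mean, median

# Hypothetical response times in ms: typical requests plus two slow outliers.
response_times_ms = [280, 290, 300, 300, 310, 320, 330, 2500, 4000]

avg = mean(response_times_ms)    # pulled upward by the two outliers
med = median(response_times_ms)  # stays at the typical request

print(f"mean:   {avg:.0f} ms")
print(f"median: {med:.0f} ms")
```

With data like this, the average describes a response time that almost no request actually had, while the median still describes the typical request.
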
    28. 28. How to get a baseline?
    29. 29. A baseline is not a number Baselines define the range of a value combined with a probability
    30. 30. Normal distribution as baseline. Mean: 500 ms, Std. Dev.: 100 ms. 68 %: 400 ms to 600 ms. 95 %: 300 ms to 700 ms. 99.7 %: 200 ms to 800 ms.
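
The ranges on this slide follow from the mean and standard deviation alone, assuming the response times really are normally distributed. A minimal sketch using Python's standard library; the helper name `band` is an assumption made for illustration.

```python
from statistics import NormalDist

# Baseline from the slide: mean 500 ms, standard deviation 100 ms.
baseline = NormalDist(mu=500, sigma=100)


def band(k):
    """Return (low, high, coverage) for the range mean +/- k standard deviations."""
    low = baseline.mean - k * baseline.stdev
    high = baseline.mean + k * baseline.stdev
    coverage = baseline.cdf(high) - baseline.cdf(low)
    return low, high, coverage


for k in (1, 2, 3):
    low, high, coverage = band(k)
    print(f"{low:.0f} ms to {high:.0f} ms covers {coverage:.1%} of values")
```

The coverage figures only hold if the normality assumption does, which is the weakness the following slides address.
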
    31. 31. This can go really wrong: “Why alerts suck and monitoring solutions need to become smarter”
    32. 32. How this leads to false alerts
    33. 33. Aggressive baseline: many false alerts
    34. 34. Moderate baseline: no alerts at all
    35. 35. Find the right distribution model. However, this can be really hard, if not impossible.
    36. 36. Your distribution might look like this
    37. 37. … or like this
    38. 38. … or completely different, you never know …
    39. 39. How can we solve this problem?
    40. 40. Normal distribution, again. 50 percent of values are at or below μ (the median); about 97.7 percent are at or below μ + 2σ (about the 97th percentile).
    41. 41. The 50th and 90th percentiles define normal behavior without needing to know anything about the distribution model.
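
A percentile baseline can be read straight from observed samples, with no distribution model fitted. The sketch below uses Python's `statistics.quantiles`; the sample values and the helper name `percentile_baseline` are hypothetical and chosen only for illustration.

```python
from statistics import quantiles

# Hypothetical response times in ms, including a long tail.
samples_ms = [210, 250, 260, 280, 290, 300, 300, 310, 310, 320,
              330, 340, 350, 380, 400, 420, 450, 520, 700, 1900]


def percentile_baseline(samples):
    """Return (50th percentile, 90th percentile) of the samples."""
    # n=10 yields the nine decile cut points: 10th, 20th, ..., 90th percentile.
    deciles = quantiles(samples, n=10)
    return deciles[4], deciles[8]


p50, p90 = percentile_baseline(samples_ms)
print(f"50th percentile: {p50:.0f} ms")
print(f"90th percentile: {p90:.0f} ms")
```

The single 1900 ms outlier barely moves either value, which is why percentiles make a more robust description of normal behavior than mean and standard deviation.
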
    42. 42. Median shows the real problem
    43. 43. How to define non-normal behavior?
    44. 44. Fortunately, this is not the problem we need to solve. We are only talking about missed expectations.
    45. 45. Let’s look at two scenarios. Errors: is a certain error rate likely to happen or not? Response times: is a certain increase in response time significant enough to trigger an incident?
    46. 46. The error rate scenario. We have a typical error rate of 3 percent at 10,000 transactions/minute. During the night we now have 5 errors in 100 requests. Should we alert, or not?
    47. 47. What can we learn
    48. 48. Statistics is everywhere
    49. 49. Binomial Distribution: tells us how likely it is to see n successes in a certain number of trials.
    50. 50. How many errors are ok? [Chart: likelihood of at least n errors, for n = 1 to 19.] There is an 18 % probability of seeing 5 or more errors, which is within 2 standard deviations. We do not alert.
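
The 18 % figure can be reproduced by summing the binomial distribution for 100 requests at the typical 3 percent error rate. A minimal sketch using only the standard library; the helper name `prob_at_least` is an assumption made for illustration.

```python
from math import comb, sqrt


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), as 1 minus the sum for 0..k-1."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


requests = 100         # requests observed during the night
typical_rate = 0.03    # typical error rate from the slide
observed_errors = 5

tail = prob_at_least(observed_errors, requests, typical_rate)

# Expected error count and the "2 times standard deviation" upper bound.
expected = requests * typical_rate
std_dev = sqrt(requests * typical_rate * (1 - typical_rate))
upper_bound = expected + 2 * std_dev

print(f"P(at least {observed_errors} errors) = {tail:.1%}")
print(f"expected {expected:.0f} errors, 2-sigma upper bound {upper_bound:.1f}")
```

Five errors sit below the two-sigma bound, so the observed rate is consistent with the usual 3 percent and does not justify an alert.
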
    51. 51. Response Time Example. Our median response time is 300 ms and we measure: 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms.
    52. 52. Percentile Drift Detection
    53. 53. Did the median drift significantly? Check all values above 300 ms: 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms. 7 values are higher than the median. Is this normal? We can again use the Binomial Distribution.
    54. 54. Applying the Binomial Distribution. Each sample has a 50 percent likelihood of being above the median. How likely is it that at least 7 out of 10 samples are higher? The probability is 17 percent, so we should not alert.
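
The percentile drift check can be sketched the same way: count the samples above the baseline median, then ask how likely a count at least that large is when each sample has a 50 percent chance of exceeding it. The helper name `prob_at_least` and the 5 percent alert threshold are assumptions made for illustration; the talk does not name a threshold.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), as 1 minus the sum for 0..k-1."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


baseline_median_ms = 300
samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]

# By definition, a sample exceeds the median with probability 0.5.
above = sum(1 for s in samples_ms if s > baseline_median_ms)
tail = prob_at_least(above, len(samples_ms), 0.5)

ALERT_THRESHOLD = 0.05  # assumed significance level, not from the talk
should_alert = tail < ALERT_THRESHOLD

print(f"{above} of {len(samples_ms)} samples above the median")
print(f"P(at least {above} above) = {tail:.1%}, alert: {should_alert}")
```

The same check works for any percentile by replacing 0.5 with the share of samples expected above it, for example 0.1 for the 90th percentile.
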
    55. 55. How to calculate this value? … and we are done! Which metric to pick? How to get this baseline? How to define that this happened?
    56. 56. This was just the beginning. There are many more useful things to learn about statistics, probabilities, testing, …
    57. 57. Alois Reitbauer alois.reitbauer@ruxit.com @AloisReitbauer http://bit.ly/bostonwebperf
    58. 58. ImageCredits http://commons.wikimedia.org/wiki/File:Network_switches.jpg http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg http://commons.wikimedia.org/wiki/File:Estacaobras.jpg http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG http://commons.wikimedia.org/wiki/File:Dice_02138.JPG http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
