
The Dark Art of Building a Production Incident System


  1. The Dark Art of Building a Production Incident System @Alois Reitbauer Tech. Evangelist & Product Mgr., Compuware
  2. No broken cables
  3. No datacenter fires
  4. Other things can happen as well Continuous deployments Infrastructure changes other “everyday” stuff
  5. Scaling an incident system
  6. How it feels to do what we do
  7. Do you alert? Typical error rate of 3 percent at 10,000 transactions/min. During the night we now have 5 errors in 100 requests.
  8. Do you alert? Typical response time has been around 300 ms. Now we see response times up to 600 ms.
  9. We are good at fixing problems, but not really good at detecting them.
  10. How can we get better?
  11. It’s all about statistics
  12. Statistics is about objectively lying to yourself in a meaningful way.
  13. How to design an incident
  14. It looks really simple How to calculate this value? Which metric to pick? How to get this baseline? How to define that this happened?
  15. Which metrics to pick?
  16. Three types of metrics Capacity Metrics Define how much of a resource is used. Discrete Metrics Simple countable things, like errors or users. Continuous Metrics Metrics represented by a range of values at any given time.
  17. Capacity Metrics Good for capacity planning, not so good for production alerting
  18. Connection Pools
  19. Better use: connection acquisition time. It tells you whether anyone needed a connection and did not get it.
  20. CPU Usage
  21. Better use: a combination of load average and CPU usage; even better, correlate them with the response times of applications.
  22. Discrete Metrics Pretty easy to track and analyze.
  23. Continuous Metrics Require some extra work as they are not that easy to track.
  24. Continuous Metrics – The hope 42
  25. Continuous Metrics – The reality
  26. What the average tells us
  27. What the median tells us
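The contrast between what the average and the median tell you can be shown with a tiny example (the sample values below are hypothetical, not from the deck): a single outlier drags the mean far from what a typical user experiences, while the median stays put.

```python
import statistics

# Hypothetical response-time samples in ms: mostly fast, one slow outlier
samples = [120, 130, 110, 125, 115, 135, 120, 4000]

mean = statistics.mean(samples)      # dragged up by the single outlier
median = statistics.median(samples)  # stays near the typical value

print(mean)    # 606.875 ms -- looks like a major slowdown
print(median)  # 122.5 ms  -- most users are actually fine
```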
  28. How to get a baseline?
  29. A baseline is not a number Baselines define the range of a value combined with a probability
  30. Normal distribution as baseline Mean: 500 ms, Std. Dev.: 100 ms. 68% of values fall in 400–600 ms, 95% in 300–700 ms, 99.7% in 200–800 ms. [chart: normal curve over a 0–900 ms axis]
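The ranges on this slide can be reproduced with Python’s standard library (`statistics.NormalDist`, available since Python 3.8); the 500 ms mean and 100 ms standard deviation are the slide’s example figures.

```python
from statistics import NormalDist

# Baseline from the slide: mean 500 ms, standard deviation 100 ms
baseline = NormalDist(mu=500, sigma=100)

def coverage(lo, hi):
    """Probability that a response time falls within [lo, hi] ms."""
    return baseline.cdf(hi) - baseline.cdf(lo)

print(round(coverage(400, 600), 3))  # 0.683 (mu +/- 1 sigma)
print(round(coverage(300, 700), 3))  # 0.954 (mu +/- 2 sigma)
print(round(coverage(200, 800), 3))  # 0.997 (mu +/- 3 sigma)
```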
  31. This can go really wrong “Why alerts suck and monitoring solutions need to become better”
  32. How this leads to false alerts
  33. Aggressive baseline: many false alerts
  34. Moderate baseline: no alerts at all
  35. Find the right distribution model However, this can be really hard, or even impossible
  36. Your distribution might look like this
  37. … or like this
  38. or completely different you never know …
  39. How can we solve this problem?
  40. Normal distribution – again. 50 percent of values lie below μ, so μ is the median; about 97.7 percent lie below μ + 2σ, so μ + 2σ is roughly the 98th percentile.
  41. The 50th and 90th percentile define normal behavior without needing to know anything about the distribution model
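A minimal sketch of pulling the 50th and 90th percentile out of raw samples with the standard library’s `statistics.quantiles` (Python 3.8+); the sample values are borrowed from the response-time example later in the deck.

```python
import statistics

# Response-time samples in ms (from the deck's response-time example)
samples = [200, 500, 400, 150, 350, 350, 200, 400, 600, 600]

# n=100 yields the 99 percentile cut points P1..P99 (exclusive method)
pcts = statistics.quantiles(samples, n=100)
p50, p90 = pcts[49], pcts[89]

print(p50)  # 375.0 ms -- the median of the samples
print(p90)  # 600.0 ms -- the slow tail
```

No assumption about the underlying distribution is needed; the percentiles come straight from the observed data.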
  42. Median shows the real problem
  43. How to define non-normal behavior?
  44. Fortunately this is not the problem we need to solve We are only talking about missed expectations
  45. Let’s look at two scenarios Errors Is a certain error rate likely to happen or not? Response Times Is a certain increase in response time significant enough to trigger an incident?
  46. The error rate scenario We have a typical error rate of 3 percent at 10,000 transactions/minute. During the night we now have 5 errors in 100 requests. Should we alert – or not?
  47. What can we learn
  48. Statistics is everywhere
  49. Binomial Distribution Tells us how likely it is to see n successes in a certain number of trials
  50. How many errors are ok? There is an 18% probability of seeing 5 or more errors, which is within 2 standard deviations. We do not alert. [chart: likelihood of at least n errors, for n = 1…19]
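The 18% figure can be checked in a few lines of Python with `math.comb` (Python 3.8+), using the slide’s scenario: a 3% baseline error rate, 100 requests observed, 5 errors seen.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): sum the upper tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Slide scenario: 3% baseline error rate, 5 or more errors in 100 requests
p_alert = prob_at_least(5, 100, 0.03)
print(round(p_alert, 2))  # 0.18 -> an 18% chance, not unusual, so no alert
```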
  51. Response Time Example Our median response time is 300 ms and we measure: 200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms
  52. Percentile Drift Detection
  53. Did the median drift significantly? Check all values above 300 ms: 200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms. 7 values are higher than the median. Is this normal? We can again use the Binomial Distribution.
  54. Applying the Binomial Distribution We have a 50 percent likelihood of seeing values above the median. How likely is it that 7 out of 10 samples are higher? The probability is 17 percent, so we should not alert.
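The drift check described on these slides amounts to a sign test: under a stable median, each new sample lands above it with probability 0.5, so the count of samples above the old median follows a Binomial(n, 0.5) distribution. A sketch in Python using the deck’s sample values:

```python
from math import comb

samples = [200, 500, 400, 150, 350, 350, 200, 400, 600, 600]
baseline_median = 300  # ms, the typical median from the slide scenario

above = sum(1 for s in samples if s > baseline_median)  # 7 of 10

# P(at least `above` of 10 samples exceed a stable median), Binomial(10, 0.5)
p = sum(comb(10, k) for k in range(above, 11)) / 2**10

print(above)        # 7
print(round(p, 3))  # 0.172 -> about 17%, so we do not alert
```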
  55. … and we are done! How to calculate this value? Which metric to pick? How to get this baseline? How to define that this happened?
  56. This was just the beginning There are many more useful things about statistics, probabilities, testing, …
  57. Alois Reitbauer alois.reitbauer@compuware.com @AloisReitbauer apmblog.compuware.com
  58. Image Credits
     http://commons.wikimedia.org/wiki/File:Network_switches.jpg
     http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
     http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
     http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
     http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
     http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
     http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
     http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
