Monitoring 101


This presentation covers three topics related to monitoring. The first is a brief history of monitoring trends and a forecast of where they are heading. The second takes a fresh look at the inputs, outputs, and techniques for setting SLOs. The third sets out basic tenets one should always follow when monitoring systems.

Published in: Software


  1. Monitoring 101: The Basics
  2. Theo Schlossnagle, CEO @Circonus, Twitter: @postwait
  3. Agenda: Navigation Skills Tenets Service Level Objective Overview History
  4. “Monitoring is the action of observing and checking static and dynamic properties of a system.” - HEINRICH HARTMANN (
  5. Your System Is Larger Than Your “Systems”: Technology, Engineering, Operations, HR, Finance, Sales, Marketing
  6. Evolution of Web Monitoring. 1995: Synthetic web page loads every 15 minutes. 2000: Watching every web request to understand “real users”. 2015: Deep analysis on every transaction to ensure no user is left behind.
  7. Evolution of Database Monitoring. 1995: Synthetic queries to test performance. 2005: Watching every query request to understand “real performance”. 2015: Deep analysis of every transaction to understand overall system behavior.
  8. Evolution of Systems Monitoring. 1995: Looked at avwait lat (disks). 2015: Track the latency of every disk I/O operation performed.
  9. Monitoring Is Sophisticated. Increased Telemetry Volume: Advances in Time Series Databases to store trillions of samples in a billion streams. Advances in Stream Analytics to handle velocity at scale for real-time analysis and alerting. More Valuable Operational Questions: Data Science is the future. Increased volume mandates computer assistance where “ops dashboards” once worked. Most sophisticated modeling: stats, machine learning, AI, etc. Increased Organizational Velocity: Systems are decoupled, distributed and changing faster. Understanding overall systems behavior is like looking at sand dunes.
  10. Service Level Objectives: SLOs are what drive SREs
  11. SLO: usually based on percentiles. E.g. 95th percentile less than 10ms: “simply” 95% of all samples should be 10ms or less; the remaining 5% can be arbitrarily bad. Not “simple”: Calculated over what period of time (or worse, number of samples)? Why 95% and not 99% or 99.9% or 99.34860943%? Why 10ms? The tragedy of the not-a-histogram histogram. There are no right answers, and rarely good ones.
  12. Median Latency Over 5m Stepping Window
  13. Summary Histogram: 30 days and 36 million samples
  14. Time-series Histogram: 30 days and 36 million samples
  15. Time-series Histogram: 30 days and 36 million samples
  16. Time-series Histogram: 30 days and 36 million samples
  17. Average Latency Over 5m Stepping Window
  18. Summary Histogram: 2 days and 1.6 million samples
  19. Latency Over 5m Stepping Window
  20. Latency Over 5m Stepping Window
  21. p(95) Latency Over 5m Stepping Window
  22. p⁻¹(10ms) Latency Over 5m Stepping Window
  23. p⁻¹(50ms) Latency Over 5m Stepping Window
  24. Time Matters: The time quantum you use to assess is your minimum window of failure.
  25. Uncertainty Matters: You will certainly want to revise your goals, likely across every parameter.
  26. Histograms Matter: You cannot manage percentile-based SLOs at scale without histograms.
  27. #1: Do not measure rates. You can derive the rate of change over time at query time.
  28. #2: Monitor outside the tech stack. Your tech stack would not exist without happy customers and a sales pipeline. Monitor that which is important to the health of your organization.
  29. #3: Do not silo data. The behavior of the parts must be put in context. Correlating disparate systems and even business outcomes is critical.
  30. #4: Value observation of real work over the measurement of synthesized work.
  31. #5: Synthesize work to ensure function for business critical, low-volume events.
  32. #6: Percentiles are not histograms. For robust SLO management you need to store histograms for post-processing.
  33. #7: History is critical; not weeks or months, but years of detailed history. Capacity planning, retrospectives, comparative analysis, and modelling all rely on accurate, high-fidelity history.
  34. #8: Alerts require documentation. No ruleset should trigger an alert without: a human-readable explanation, a business impact description, a remediation procedure, and escalation documentation.
  35. #9: Be outside the blast radius. The purpose of monitoring is to detect changes in behavior and assist in answering operational questions.
  36. #10: Something is better than nothing. Don’t let perfect be the enemy of good. You have to start somewhere.
  37. Thank You!
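
The slides argue that percentile-based SLOs need stored histograms (slides 11, 26, and 32) and plot inverse percentiles such as the share of requests at or below 10ms (slides 22 and 23). The following is a minimal sketch of that idea, not the presenter's implementation. The bucket bounds in `BOUNDS_MS` and every function name are assumptions made for illustration. It keeps one histogram per time window, merges windows by adding bucket counts, and then answers both "what fraction was at or below the threshold?" and "which bucket bound covers the q-th percentile?" from the merged data.

```python
from bisect import bisect_left
from collections import Counter

# Assumed bucket upper bounds in milliseconds; a real system picks its own layout.
BOUNDS_MS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
OVERFLOW = len(BOUNDS_MS)  # bucket index for samples above the last bound


def bucket_index(latency_ms):
    """Index of the first bucket whose upper bound is >= latency_ms."""
    return bisect_left(BOUNDS_MS, latency_ms)


def build_histogram(latencies_ms):
    """Count samples per bucket for one time window."""
    return Counter(bucket_index(x) for x in latencies_ms)


def merge(histograms):
    """Merge windows by adding bucket counts (stored percentiles cannot be merged like this)."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


def fraction_at_or_below(hist, threshold_ms):
    """Inverse percentile: share of samples <= threshold_ms.

    The answer is exact only when the threshold is a bucket bound,
    so any other threshold is rejected in this sketch.
    """
    if threshold_ms not in BOUNDS_MS:
        raise ValueError("threshold must be one of the bucket bounds")
    total = sum(hist.values())
    if total == 0:
        return None
    limit = bucket_index(threshold_ms)
    good = sum(count for index, count in hist.items() if index <= limit)
    return good / total


def percentile_upper_bound(hist, q):
    """Smallest bucket bound b such that at least a fraction q of samples are <= b."""
    total = sum(hist.values())
    if total == 0:
        return None
    running = 0
    for index in sorted(hist):
        running += hist[index]
        if running >= q * total:
            return BOUNDS_MS[index] if index < OVERFLOW else float("inf")
    return float("inf")  # only reached if q > 1


def meets_slo(hist, threshold_ms, target_fraction):
    """True when at least target_fraction of samples are <= threshold_ms."""
    fraction = fraction_at_or_below(hist, threshold_ms)
    return fraction is not None and fraction >= target_fraction


if __name__ == "__main__":
    # Made-up latency samples (ms) for two windows.
    window_a = [3, 4, 8, 9, 7, 6, 5, 2, 40, 300]
    window_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    merged = merge([build_histogram(window_a), build_histogram(window_b)])
    print(fraction_at_or_below(merged, 10))
    print(meets_slo(merged, 10, 0.95))
```

Because the raw bucket counts are kept, the window length, the percentile, and the threshold can all be changed after the fact, which is the kind of goal revision slide 25 anticipates. A stored p95 value per window would not allow that.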