Monitoring with Percentiles


Learn how percentiles help engineers observe and improve their applications much more deeply. But the usage of percentiles in metrics and monitoring systems is often limited by what's supported in the monitoring tool, leaving lots of benefits on the table.

Published in: Technology

  1. Monitoring With Percentiles (Baron Schwartz)
  2. Introduction (#percentiles)
     ● My email is
     ● Now let’s learn as much as possible about percentiles in 25 minutes!
  3. What Are Percentiles?
     ● More generally, quantiles; percentiles are just one common type of quantile.
     ● Quantiles divide a distribution of values into ordered intervals that each hold an equal share of the values.
     ● Percentiles divide the distribution into 100 such intervals.
  5. What’s the 99.9th Percentile?
     ● It’s loose terminology, but we all know what we mean.
     ● Strictly speaking, it’d be the 999th permille.
  6. What Are Percentiles Good For?
     ● They show some measure of how extreme the outliers are.
     ● They help keep outliers from being obscured.
     ● They help hide the outliers so the bulk of the values isn’t obscured.
     ● They show “worst common case” behavior.
  8. Problem: Averages Hide Outliers
     ● Source:
  9. Problem: Outliers Skew Averages
     ● It’s hard to see the shape of the chart because the spikes cause the rest of the data to be scaled down near the axis.
     ● This is a chart of an average; how far out did the outlier really extend? Is the outlier itself being scaled down by the rest of the data?
     ● Net result: averages show us neither the outliers nor the bulk of the data.
     ● The average is neither robust nor representative.
  10. Definition of Average
      ● “Average (def): a random number that falls somewhere between the maximum and 1/2 the median. Most often used to ignore reality.” - Gil Tene
      ● Source:
  11. Is There a Robust, Representative Metric?
      ● You probably know that the median is “robust and representative.”
      ● It’s commonly used to represent “the common case.”
  12. Problem: Median Isn’t Most
      ● The median is the 50th percentile: the midpoint of the distribution.
      ● When it comes to performance, the median isn’t representative of most.
      ● We should care about “most people’s experience.”
      ● And we should also care about “some people’s experience.”
  13. Definition of Median
      ● “Median Server Response Time: The number that 99.9999999999% of page views can be worse than.” - Gil Tene
      ● Source:
      ● (This is possible because most page views issue multiple requests to backend servers.)
  14. The Median Is Too Coarse
      ● The median is too coarse, much more so than you’d expect.
      ● High quantiles are better for understanding typical experiences.
      ● You should care about the edge cases, i.e. the 99th percentile and higher.
      ● This helps you understand and design for the impact of outliers on your architecture, and the impact of your architecture and design choices on outliers.
      ● Design systems to “bend but not break” - @kellabyte
        ○ Source:
  15. How Do Percentiles Work?
      ● We’re typically dealing with measurements of highly variable quantities.
      ● These come from processes whose properties (i.e. models) are usually not knowable a priori, and usually not even stable, so you can’t assume things like “normally distributed.”
      ● Examples: response size in bytes, response latency in seconds.
      ● There are many more wrong ways to do percentiles than right ways.
  16. How Do You Compute Percentiles?
      ● Divide the possible range of values into partitions.
      ● Place each measurement into a partition and increment that partition’s count.
      ● Sum the counts of all partitions (e.g. “19,847 measurements”).
      ● Multiply the total by the desired quantile to get the rank N (e.g. for the 99.9th percentile, 19,847 × 0.999 ≈ 19,827).
      ● Find the partition that contains the Nth measurement (e.g. partition 1201).
      ● The upper boundary of that partition (e.g. 1822ms) is the result.
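The partition-and-count procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation: the function name `histogram_percentile`, the fixed 10 ms partition width, and the choice to round the rank to the nearest integer are all assumptions made for the example.

```python
from bisect import bisect_left


def histogram_percentile(values, upper_bounds, quantile):
    """Approximate a quantile from partition counts.

    upper_bounds must be sorted; partition i holds values that are
    <= upper_bounds[i] and greater than the previous boundary.
    """
    if not 0 < quantile <= 1:
        raise ValueError("quantile must be in (0, 1]")

    # Place each measurement into a partition and increment its count.
    counts = [0] * len(upper_bounds)
    for value in values:
        index = bisect_left(upper_bounds, value)
        if index == len(upper_bounds):
            raise ValueError(f"{value} exceeds the largest partition boundary")
        counts[index] += 1

    # Count the total, then multiply by the desired quantile to get the rank.
    total = sum(counts)
    if total == 0:
        raise ValueError("no measurements")
    rank = max(1, round(total * quantile))  # rounding rule is an assumption

    # Find the partition containing the rank-th measurement.
    seen = 0
    for bound, count in zip(upper_bounds, counts):
        seen += count
        if seen >= rank:
            return bound  # the upper boundary is the result


# Hypothetical data: latencies of 1..100 ms in 10 ms wide partitions.
latencies_ms = list(range(1, 101))
bounds_ms = list(range(10, 101, 10))
p50 = histogram_percentile(latencies_ms, bounds_ms, 0.50)
p95 = histogram_percentile(latencies_ms, bounds_ms, 0.95)
print(p50, p95)
```

Because the result is always a partition boundary, the answer is only as precise as the partitions are narrow: here the 95th percentile is reported as the boundary of the last 10 ms partition rather than the exact value.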
  17. Alternative Definitions
      ● The Nth percentile is approximately the maximum of all measurements except the worst (100 - N)%.
      ● So you can discard the worst (100 - N)% of the measurements and take the maximum of what remains.
      ● E.g. to get the 95th percentile, ignore the worst 5% and measure the max value of what remains.
      ● (This isn’t strictly correct, since it’s not based on an even partitioning of the value space into equal -iles.)
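The "discard the worst and take the max" shortcut might look like the following sketch. The function name `percentile_by_discard` and the decision to round the number of discarded measurements down are assumptions for illustration; as the slide notes, this is an approximation rather than a strict quantile definition.

```python
def percentile_by_discard(values, percentile):
    """Approximate the Nth percentile: drop the worst (100 - N)% and take the max."""
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(values)
    # Number of worst (largest) measurements to ignore, rounded down (an assumption).
    discard = int(len(ordered) * (100 - percentile) / 100)
    kept = ordered[:len(ordered) - discard]
    return max(kept)


# Hypothetical data: 20 measurements, so the worst 5% is exactly one value.
measurements = list(range(1, 21))
p95 = percentile_by_discard(measurements, 95)
print(p95)
```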
  18. How Monitoring Tools Implement Percentiles
      ● Vastly differently!
      ● Don’t assume your monitoring tool does it the right way, the one true way, or the same way any other tool does.
  19. Factors To Consider
      1. How are values measured?
      2. What’s the definition of “percentile” in use?
      3. How are values aggregated into metrics or other representations?
      4. How are metrics (assuming it’s metrics) emitted?
      5. How are metrics transmitted and stored?
      6. How are metrics retrieved?
      7. How are metrics displayed?
      8. How are metrics recomputed and transformed for long-term retention?
  20. Garbage In, Garbage Out
  21. StatsD and Graphite
      ● StatsD lets you compute percentiles in the aggregator itself before sending them to Graphite. The result is a “metric of the percentile.” More on this later.
      ● StatsD’s metrics such as upper_99 and mean_99 are confusing.
      ● It’s possible to track banded metrics in StatsD and Graphite, and then to compute percentiles from the bands later.
      ● Graphite itself has percentile functions for wildcard series.
  22. Datadog
      ● If you collect a histogram with Datadog’s DogStatsD, it emits metrics of min, max, median, and 95th percentile.
      ● The same caveats apply as for StatsD’s percentiles.
  23. Coda Hale’s Metrics
      ● Many typical/common metrics coming from this set of libraries are exponentially biased over time.
      ● See the library’s documentation on reservoirs.
      ● They’re also computed from statistically representative samples of the population. The sample is approximately representative, but still be aware it’s only a sample.
      ● Many, many products (e.g. Cassandra) use Coda Hale’s Metrics library.
        ○ NOTE: these are silently distorted values, not raw measurements.
  24. VividCortex
      ● We capture banded metrics. At the moment we only visualize them as rainbow charts.
  25. Honeycomb
      ● Based on the raw dataset.
      ● If the raw dataset is sampled instead of captured in full, the results are possibly skewed.
  26. Circonus
      ● High-resolution histograms using “llquantize()” type bucketing.
      ● I.e. “two base-ten significant digits of precision.”
  27. Advice
      For understandability and fidelity, you’re best off with:
      ● Raw data, not predigested or distorted.
      ● Definitely not exponentially decayed at the aggregator, if possible.
        ○ I.e. Coda Hale’s Metrics library probably distorts more than you’d be happy with if you really knew the truth about the underlying measurements.
      ● Banded or histogrammed data, which is better than a single metric of a percentile.
        ○ (More on this to come.)
      ● You have to research the underlying implementation yourself.
  28. How Can I Visualize Percentiles?
      ● There’s a variety of ways.
      ● Distributions of data contain a lot of information, so visualization is essential.
      ● You’re usually most interested in how the distribution changes over time.
      ● A few ways to visualize percentiles and distributions over intervals...
  29. Time Series Graphs
  30. Banded Metrics (Phusion Passenger; VividCortex)
  31. Histograms (Apex Ping)
  32. Heat Maps (Fastly)
  33. How Can I Describe Distributions?
      ● If you knew that your values fit a particular distribution…
      ● Then you’d be able to just record the distribution’s parameters.
      ● But that’s basically never the case.
      ● In practice, something equivalent to histograms ends up being necessary.
  34. Histogram Implementations
      ● HdrHistogram is the “canonical” implementation for many purposes.
        ○ Fast, flexible, and histograms can be merged together (e.g. downsampled).
      ● Many roll-your-own examples exist.
      ● For predefined ranges, good bucket values aren’t hard to choose.
      ● If you don’t know the values’ characteristics in advance, it’s harder.
        ○ Powers of two? Powers of… 1.05?
        ○ Linear buckets?
        ○ Log-linear buckets?
        ○ These can end up being equivalent to “achieve the desired number of significant digits in base 10.”
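One way to picture log-linear buckets that "achieve the desired significant digits in base 10" is to round each value up to a fixed number of significant digits and use the result as the bucket's upper boundary. The sketch below is only an illustration of that idea; `bucket_upper_bound` is a hypothetical helper and does not reproduce HdrHistogram's or any other library's actual bucketing.

```python
import math


def bucket_upper_bound(value, significant_digits=2):
    """Round a positive value up to the given number of base-10 significant digits.

    Buckets are linear within each power of ten and grow tenfold from one
    power of ten to the next, which is the log-linear layout.
    """
    if value <= 0:
        raise ValueError("value must be positive")
    exponent = math.floor(math.log10(value))
    step = 10 ** (exponent - significant_digits + 1)  # bucket width at this magnitude
    return math.ceil(value / step) * step


# Hypothetical latencies in milliseconds.
for latency_ms in (87, 1234, 1822):
    print(latency_ms, "->", bucket_upper_bound(latency_ms))
```

With two significant digits, each power of ten is split into at most 90 buckets, so the relative error stays bounded however large the values grow.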
  35. Banded Metrics
      ● Banded metrics can be essentially equivalent to histograms.
      ● One significant difference is that their cut points are static, unlike histograms, whose cut points may differ dynamically depending on the actual data in a range of time.
  36. Histograms to Quantiles
      ● You don't have to store 100 bands/buckets to get percentiles.
      ● Simply sum the bucket counts and find the cutoff rank, then find the bucket and its value as before.
      ● Again, this ends up being an approximation, not the strictly correct statistician’s dictionary definition of a quantile.
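As a rough sketch of why stored bands are enough, the example below keeps only three bands per minute, merges two minutes by adding their counts, and then reads a quantile off the merged counts. The band boundaries, the counts, and the helper name `quantile_from_buckets` are all made up for illustration.

```python
from collections import Counter


def quantile_from_buckets(bucket_counts, quantile):
    """Return the upper bound of the bucket that contains the cutoff rank."""
    total = sum(bucket_counts.values())
    if total == 0:
        raise ValueError("no measurements")
    rank = max(1, round(total * quantile))  # rounding rule is an assumption
    seen = 0
    for bound in sorted(bucket_counts):
        seen += bucket_counts[bound]
        if seen >= rank:
            return bound


# Hypothetical banded metrics: {upper bound in ms: number of requests}.
minute_1 = Counter({10: 700, 100: 250, 1000: 50})
minute_2 = Counter({10: 100, 100: 300, 1000: 600})

# Counts are additive, so merging two intervals is just a sum.
merged = minute_1 + minute_2

print(quantile_from_buckets(minute_1, 0.5))
print(quantile_from_buckets(minute_2, 0.5))
print(quantile_from_buckets(merged, 0.5))
```

The merged median is computed from the combined population, which is why storing the bands rather than a single precomputed percentile keeps later aggregation meaningful.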
  37. What Insights Can Percentiles Give You?
      ● How bad is your typical user’s experience? (Use a high percentile.)
      ● Are there occasional issues that mean you’re providing low-quality service overall? (High-quality service is consistently fast.)
      ● Is there a rare occurrence that’s going to escalate?
        ○ Note: this is equivalent in some ways to what VividCortex’s Adaptive Fault Detection does.
      ● In other words, monitoring at the edges helps you be more proactive by listening to the canaries in the coal mine.
  38. Percentile Pitfalls
      ● Percentile math can be confusing.
      ● Tools and their distortions can be confusing.
      ● The math isn’t commutative: the order in which you aggregate matters.
      ● A metric of a percentile doesn’t make sense over time.
        ○ You can’t take averages of percentiles.
        ○ You can’t downsample/resample over time.
      ● You can’t take percentiles of averages.
      ● (OK, you can, but the result has no defined meaning.)
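A small worked example makes the "you can't average percentiles" pitfall concrete. The data is invented: a quiet minute with 100 fast requests and a busy minute with 900 requests, 10% of them slow. The helper `nearest_rank` uses the nearest-rank definition of a percentile, which is one convention among several and is chosen here only for illustration.

```python
def nearest_rank(values, percentile):
    """Nearest-rank percentile for an integer percentile in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * percentile // 100))  # ceiling division
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
quiet_minute = [10] * 100
busy_minute = [10] * 810 + [500] * 90

p99_quiet = nearest_rank(quiet_minute, 99)
p99_busy = nearest_rank(busy_minute, 99)

# What a downsampler does when it averages two stored p99 data points.
averaged_p99 = (p99_quiet + p99_busy) / 2

# What the p99 actually is over the whole two-minute population.
combined_p99 = nearest_rank(quiet_minute + busy_minute, 99)

print(p99_quiet, p99_busy, averaged_p99, combined_p99)
```

The averaged value matches neither minute and understates the 99th percentile of the combined population, because the two minutes carry very different numbers of requests and the average of the stored percentiles knows nothing about that.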
  39. Percentile of an Average
      ● Q: “What’s the 99th percentile of this metric of buffer-pool-reads-per-second?”
      ● A: “It depends on what you mean.”
      ● It’s possible to imagine uses for this, but note that things-per-second is by definition an average (an aggregate!) with seconds as the denominator. It’s not a population.
  40. Percentile Pitfalls, Cont’d
      ● Computing percentiles can be computationally expensive.
      ● There are efficient online approximations if you’re interested.
        ○ Search for “streaming approximate quantiles.”
      ● The trouble is that they pre-digest the data and produce an approximate “percentile metric.”
      ● General rule of thumb for safety:
        ○ Don’t emit or store any time series metric that’s not robust when averaged over time.
        ○ In other words, no fractions or other derived metrics. They don’t work right.
        ○ This isn’t specific to percentiles; it’s just broad-based advice.
  41. Percentile Pitfalls, Cont’d Again
      ● Percentiles aren’t intuitive.
      ● The high percentiles happen to most of your users, not just some.
      ● The probability that any given user will not have a high-percentile experience with your app is vanishingly small.
      ● See again: page-loads.html
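The claim that high percentiles happen to most users follows from simple probability once a page view or session involves many backend requests. The sketch below assumes each request independently has the same chance of landing above the given percentile, which is a simplification; the function name `chance_of_hitting_tail` and the request counts are illustrative.

```python
def chance_of_hitting_tail(percentile, requests):
    """Probability that at least one of `requests` exceeds the given percentile.

    Assumes requests are independent and identically distributed, which real
    traffic only approximates.
    """
    return 1 - (percentile / 100) ** requests


# A single request rarely exceeds the 99th percentile...
print(chance_of_hitting_tail(99, 1))
# ...but a session that issues 100 requests probably sees at least one that does.
print(chance_of_hitting_tail(99, 100))
# And almost every 40-request page view sees something worse than the median.
print(chance_of_hitting_tail(50, 40))
```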
  42. Graphite and StatsD Percentile Pitfalls
      ● I’m not picking on Graphite and StatsD, but they’re especially fraught.
      ● There are a lot of ways to combine things wrongly with them.
      ● If you’re using them, you need to learn how to use them right.
  43. To Sum Up
      ● You need to examine the outliers, not just the bulk of the data.
      ● Percentiles are computed from a population. You can’t store a percentile itself; you have to store either the population or a representation of it (histograms or banded metrics).
      ● Almost all tools lack guard rails to keep you away from invalid uses of percentiles. There’s a moral hazard: you could lead others astray.
      ● A percentile is still just a single number. Distributions are better than simplifying to a single number.
      ● All measurements are wrong; some are useful anyway.
  44. Questions?
      Don’t Miss Our Next Webinar! What's New in MySQL 8.0 and PostgreSQL 9.6, Tuesday, October 25th, 2pm EDT.
      Features to be discussed include:
      ● New replication capabilities.
      ● More extensibility.
      ● Improved performance.
      ● Broader SQL implementation.
      ● Better observation and monitorability.
      ● Improved operability.
      Subscribe to our newsletter for details!