Slides from my talk at Monitorama PDX 2019. Histograms have the potential to give us tools to meet SLOs/SLAs, quantile measurements, and very rich heatmap displays for debugging. However, TSDB backends have not yet fulfilled that promise. This talk covers the concept of histograms as first-class citizens in storage. What does accuracy mean for histograms? How can we store and compress rich histograms for evaluation and querying at massive scale? How can we fix some of the issues with histograms in Prometheus, such as proper aggregation, bucketing, and avoiding clipping?

- 1. Rich Histograms at Scale: A New Hope Evan Chan @evanfchan http://github.com/filodb/FiloDB
- 2. This is not a contribution
- 3. What do we do with Histograms?
- 4. The Evolution of Histograms • Pre-aggregated percentiles: Statsd, Graphite, OpenTSDB • Histograms with buckets: Prometheus, InfluxDB • ???: Prometheus histograms, HDRHistogram, T-Digests
- 5. Overlaid Latency Quantiles
- 6. Now an incident happens…
- 7. Heatmaps: Rich Visuals
- 8. Grafana Heatmaps • Buckets scale to much more input data, but need TSDB support for histogram buckets • Time series: flexible, but Grafana needs to read ALL the raw data
- 9. Useful Histograms • Should be aggregatable • Should support quantiles, distributions, and other f(x) • Heatmaps: histograms over time • Should be accurate • Should scale and be efficient
- 10. Buckets and Accuracy • Max quantile error = bucket width / lower bound • Exponential buckets = consistent max quantile errors (good!) • Linear almost never makes sense • Your custom Prometheus histogram buckets likely have >100% error. Example for a (1000, 6E10) value range:

    Histogram Type  Max Error %  # Buckets
    Linear          100%         60,000,000
    Exponential     99.1%        26
    Linear          10%          600,000,000
    Exponential     10.0%        188
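The bucket-error arithmetic on this slide can be sketched in a few lines of Python (my own illustration, not the talk's code): for exponential buckets the worst-case relative quantile error is the same in every bucket (growth factor minus one), while for linear buckets the first, narrow-relative-to-width buckets dominate the error.

```python
# Sketch: worst-case relative quantile error per bucket, assuming the true
# value may fall anywhere in (lower, upper].
def max_quantile_error(lower, upper):
    return (upper - lower) / lower

# Exponential buckets: error is identical for every bucket (factor - 1).
expo = [1000 * 2.0**i for i in range(5)]
expo_errors = [max_quantile_error(lo, hi) for lo, hi in zip(expo, expo[1:])]

# Linear buckets: error shrinks as values grow, so low buckets dominate.
lin = [1000 + 500 * i for i in range(5)]
lin_errors = [max_quantile_error(lo, hi) for lo, hi in zip(lin, lin[1:])]
```

This is why covering a wide value range with linear buckets at a fixed error target needs tens of millions of buckets, as in the table above.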
- 11. Configuring your Histograms • Start with the range of values you need: (min, max) • Pick the desired max quantile error % • Think about trading off publish frequency for accuracy • # buckets = log(max/min) / log(1 + max_error) • Example: max error = 50%, range (1000, 6E10): numBuckets = Math.log(6E10/1000) / Math.log(1 + 0.50); exponentialBuckets(1000, 1 + 0.50, numBuckets)
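The slide's bucket-count formula can be checked directly. A minimal sketch (the function name is mine; rounding up guarantees the range is covered):

```python
import math

# Buckets needed to cover (min_val, max_val) with exponentially growing
# bounds and a target max quantile error, per the slide's formula:
#   # buckets = log(max/min) / log(1 + max_error)
def num_buckets(min_val, max_val, max_error):
    return math.ceil(math.log(max_val / min_val) / math.log(1 + max_error))

print(num_buckets(1000, 6e10, 0.10))  # 188 buckets for 10% max error
print(num_buckets(1000, 6e10, 0.02))  # 905 buckets for 2% max error
```

The resulting count would then be fed to an exponential-bucket constructor such as the `exponentialBuckets(start, factor, count)` call shown on the slide.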
- 12. Histograms at Scale
- 13. Histograms as First-Class Citizens • Modeling, transporting, and storing histograms holistically offers many benefits • Scalability: much better storage, network, and query speed • Proper aggregations • Better accuracy and features • Adaptable to better histogram designs in the future • Almost nobody is doing this yet
- 14. Prometheus Histogram Schema • 5 buckets plus sum and count per histogram = 7 series (Series1 through Series7). Four sample rows:

    __name__       le    v1  v2  v3  v4
    metric_sum           44  35  50  60
    metric_count          5   6  10  11
    metric_bucket  0.5    0   1   1   2
    metric_bucket  2.0    2   4   5   6
    metric_bucket  5.0    3   6   8  10
    metric_bucket  10.    5   6   9  11
    metric_bucket  25.    5   6  10  11
- 15. The Scale Problem with Histograms • My app: 100 metrics, 20 of them histograms • Assume a range of (1000, 6E10) • Notice how histograms dominate the time series!

    Max error %  Num buckets  Histogram Series  Other Series  Total Series
    50%          44           882               80            962
    10%          188          3762              80            3842
    2%           905          18102             80            18182
- 16. Mama we got a problem • Actual system: hundreds of millions of metrics, each with a 64-bucket histogram • Using Prometheus would lead to tens of billions of series
- 17. Prometheus: Raw Data • Seven records per scrape (metric_sum, metric_count, and five metric_bucket series with le = 0.5, 2.0, 5.0, 10., 25.), each repeating the Zone=us-west label; values: 44, 5, 0, 2, 3, 5, 5
- 18. Atomicity Issues • Prometheus export and scrape do not guarantee grouping of histogram buckets • Easy to get only part of a histogram • FiloDB is a distributed database: 7 records might end up on 7 different nodes! • Calculating histogram_quantile: talk to 7 nodes for every query!
- 19. Single Histogram Schema • 5 buckets, sum, and count per histogram, all in one series (Series1):

    __name__=metric  Sum  Count  Hist (0.5, 2.0, 5.0, 10., 25.)
    v1               44    5     0  2  3  5  5
    v2               35    6     1  4  6  6  6
    v3               50   10     1  5  8  9  10
    v4               60   11     2  6  10 11 11
- 20. Single Histogram Raw Data • One record: __name__=metric, Zone=us-west, Sum=44, Count=5, Hist (0.5, 2, 5, 10, 25) = (0, 2, 3, 5, 5) • One record instead of (n + 2): no distribution problem! • Labels only appear once • Savings proportional to the number of histogram buckets • 50x savings for 64 histogram buckets
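The layout difference between the two schemas can be sketched as plain records (field names here are my own illustration, not FiloDB's actual record format):

```python
# Series-per-bucket schema: n + 2 records, with the full label set
# repeated in every one of them.
labels = {"__name__": "http_req_latency", "zone": "us-west"}
per_bucket = (
    [{**labels, "kind": "sum", "value": 44.0},
     {**labels, "kind": "count", "value": 5.0}] +
    [{**labels, "le": le, "value": v}
     for le, v in zip([0.5, 2.0, 5.0, 10.0, 25.0], [0, 2, 3, 5, 5])]
)

# Single-histogram schema: one record carries sum, count, and all buckets,
# so labels appear once and the histogram is written and read atomically.
single = {**labels, "sum": 44.0, "count": 5.0,
          "buckets": {0.5: 0, 2.0: 2, 5.0: 3, 10.0: 5, 25.0: 5}}
```

With one record per histogram there is nothing to scatter across nodes, which is what removes the atomicity and fan-out problems from the previous slides.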
- 21. Much smaller network and disk usage • One time series vs. 66: a 50x network I/O reduction • The single-histogram schema in FiloDB uses < 0.2 bytes per histogram bucket • [Charts: network I/O in bytes per histogram, and storage cost in bytes per bucket, comparing series-per-bucket vs. series-per-histogram]
- 22. Optimizing Histograms: Compression • Delta encoding of increasing bucket values: (0, 2, 3, 5, 5) → (0, 2, 1, 2, 0); (1, 4, 6, 6, 6) → (1, 3, 2, 0, 0) • Compressed size about 4x-10x better than one time series per bucket (64 buckets; FiloDB) • 0.18 bytes/histogram bucket (range: 0.16 - 0.61)

    FiloDB SingleHistogram  0.18 bytes/bucket
    Prometheus              1.5 bytes/bucket
    Raw data                8 bytes/bucket
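The delta step on this slide is simple to demonstrate. A minimal sketch: because cumulative bucket counts only increase left to right, the deltas are small non-negative integers, which is what makes the downstream bit-packing so effective.

```python
# Delta-encode a row of cumulative bucket counts.
def delta_encode(cumulative):
    prev, out = 0, []
    for v in cumulative:
        out.append(v - prev)
        prev = v
    return out

# Inverse transform, recovering the cumulative counts exactly.
def delta_decode(deltas):
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

print(delta_encode([0, 2, 3, 5, 5]))  # [0, 2, 1, 2, 0], as on the slide
```

The actual on-disk packing of those deltas (to reach ~0.18 bytes per bucket) is FiloDB-specific and not shown here.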
- 23. Optimizing Histograms: Querying (64 Buckets) • histogram_quantile() is more than 100x faster than series-per-bucket • No need for group-by • Localized computation vs. jumping across 64 bucket time series • [Chart: histogram_quantile() QPS, series-per-bucket vs. series-per-histogram]
- 24. Rich Histograms: Usability and Correctness
- 25. Changing buckets… sum() • sum(rate(http_req_latency{…..}[5m])) by (le) • Different buckets lead to incorrect sums • [Diagram: bucket boundaries le = 2.5, 5, 10, 25, 50, 100, +Inf]
- 26. Holistic Histograms: Correct Sums • Adding histograms holistically allows us to track bucket changes and correctly sum them • [Diagram: bucket boundaries le = 2.5, 5, 10, 25, 50, 100, +Inf]
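One way the correct sum can work, sketched below (my own illustration, not FiloDB's actual algorithm): because bucket counts are cumulative, a histogram can be coarsened to any subset of its boundaries without error, so two histograms with different schemes can be summed over their shared boundaries. Summing per `le` blindly, as `sum(...) by (le)` does, mixes incompatible buckets instead.

```python
# Coarsen a cumulative histogram to a subset of its bucket boundaries.
# This is lossless for the boundaries kept, because counts are cumulative.
def coarsen(hist, boundaries):
    return {le: hist[le] for le in boundaries}

# Two scrapes of the "same" histogram after a bucket-scheme change:
# the new scheme added le=25 and le=100 boundaries.
old = {2.5: 1, 5: 3, 10: 6, 50: 9, float("inf"): 10}
new = {2.5: 0, 5: 2, 10: 5, 25: 7, 50: 8, 100: 9, float("inf"): 9}

# Sum only over the boundaries both schemes share.
shared = sorted(set(old) & set(new))
summed = {le: coarsen(old, shared)[le] + coarsen(new, shared)[le]
          for le in shared}
```

This kind of scheme reconciliation is only possible when the aggregation layer sees whole histograms, which is the point of the slide.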
- 27. histogram_quantile clipping • At 20:00, the quantile is clipped at the 2nd-to-last bucket boundary of 10.0
- 28. histogram_max_quantile • Client sends a max value at each time interval
- 29. histogram_max_quantile • Having a known max allows us to interpolate in the last bucket • Cannot interpolate to +Inf • https://github.com/filodb/FiloDB/pull/361 • [Diagram: buckets le = 2.5, 5, 10, 25, +Inf, with a known max of 40 and quantile 0.9]
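The clipping problem and the proposed fix can be sketched as follows (an illustrative implementation, not FiloDB's or Prometheus's exact code): when the requested rank lands in the +Inf bucket, standard interpolation has no upper bound to work with, so the result is clipped to the last finite boundary; a client-supplied max restores a usable upper bound.

```python
# Estimate a quantile from cumulative buckets ({le: count}, +Inf = total).
# With observed_max set, interpolate inside the +Inf bucket toward the max
# instead of clipping at the last finite boundary.
def histogram_quantile(q, buckets, observed_max=None):
    les = sorted(buckets)                 # +Inf sorts last
    rank = q * buckets[les[-1]]           # target rank within total count
    prev_le, prev_count = 0.0, 0
    for le in les:
        count = buckets[le]
        if count >= rank:
            if le == float("inf"):
                # Clipped case: use observed_max if the client sent one,
                # else fall back to the last finite bound.
                hi = observed_max if observed_max is not None else prev_le
            else:
                hi = le
            frac = (rank - prev_count) / max(count - prev_count, 1)
            return prev_le + (hi - prev_le) * frac
        prev_le, prev_count = le, count

b = {2.5: 2, 5: 5, 10: 8, float("inf"): 10}
print(histogram_quantile(0.9, b))                   # 10.0: clipped
print(histogram_quantile(0.9, b, observed_max=40))  # 25.0: interpolated
```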
- 30. Ad-Hoc Histograms • Just the quantile, min, and max from gauges is not that useful • Get a heat map of CPU use across k8s containers • histogram(2, 8, container_cpu_usage_seconds_total{….}) • Aggregate a histogram across gauges using the new histogram() function • Yes, Grafana can do heat maps from raw series, but you can only read so many raw time series. :)
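A guess at what such a histogram() function could do under the hood (the slide does not define the (2, 8) arguments; I am assuming they mean a starting bound of 2 and 8 exponential buckets, and the function name is illustrative): take one sample per gauge series and bucket the values on the fly.

```python
# Build an ad-hoc histogram from one sample per gauge series, using
# num_buckets exponential buckets starting at `start` (factor 2 assumed).
def adhoc_histogram(start, num_buckets, samples, factor=2.0):
    bounds = [start * factor**i for i in range(num_buckets)]
    counts = {le: 0 for le in bounds}
    counts[float("inf")] = 0
    for v in samples:
        for le in bounds:
            if v <= le:
                counts[le] += 1
                break
        else:
            counts[float("inf")] += 1      # larger than every finite bound
    return counts                          # per-bucket (non-cumulative)

# e.g. current CPU value for each of five containers
cpu_samples = [0.5, 3.0, 7.5, 12.0, 300.0]
hist = adhoc_histogram(2, 8, cpu_samples)
```

The win over Grafana's raw-series heatmaps is that the bucketing happens server-side, so only the bucket counts cross the wire, no matter how many gauge series feed in.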
- 31. Summary: Rich Histograms at Scale • Treat histograms as first-class citizens • Massive savings in storage and network I/O • Solve aggregation and other correctness issues • Move towards T-Digests and future formats
- 32. Thank you very much! Please reach out to help make useful histograms at scale a reality! @evanfchan http://github.com/filodb/FiloDB Monitorama slack: #talk-evan-chan
- 33. Example 2: Write size
- 34. Heatmap 2: Write Size
- 35. Histogram aggregation: Prometheus • Group-by is needed for summing histogram buckets due to the data model: a leaky abstraction • What if a dev changes the histogram scheme (# of buckets, etc.)? • Not possible to resolve scheme differences in Prometheus, since aggregation knows nothing about histograms: sum(rate(histogram_bucket{app="foo"}[5m])) by (le)
- 36. Histogram aggregation: FiloDB • No need for _bucket, but you need to select the histogram column • No need for group-by: histograms are natively understood and correct aggregations happen: sum(rate(histogram{app="foo",__col__="h"}[5m]))