Effective management of high volume numeric data with histograms

•

2 likes•778 views

Fred Moyer

Talk I gave at DataEngConf '18 on how we use histograms to manage high volume numeric data at Circonus

Data & Analytics

Effective management of high volume
numeric data with histograms
Fred Moyer @Circonus
DataEngConf SF ‘18

@phredmoyer
● Engineer to Engineer @circonus
● Recovering C and Perl programmer
● Geeking out on histograms since 2015

Pain driven development
● Observability tools caused a telemetry firehose
● Existing monitoring systems got washed away
● Average based metrics gave limited insight

“Effective Management”
● Performance AND scalability
● Avoid memory allocations, copies, locks, waits
● Persist data in size efficient structures

Histogram Basics
Sample Value
Number of
Samples
Median q(0.5)
Mode
q(0.9)
q(1)Mean

Histogram Types
● Fixed Bucket
● Approximate
● Linear
● Log Linear
● Cumulative

Fixed Bucket
User specified
bins/buckets
Number of
Samples
Sample Value

Approximate
Centroids indicating
data grouping
Number of
Samples
Sample Value

Linear
Evenly sized bins
Number of
Samples
Sample Value

Log Linear
Logarithmically
increasing bin sizes
Number of
Samples
Sample Value

Cumulative
Number of
Samples
Sample Value
Total Sample
Count

Open Source Log Linear
C - github.com/circonus-labs/libcircllhist
Go - github.com/circonus-labs/circonusllhist

Open Source Log Linear
Bin size increase
by 10x
90 bins

Bin data structure
Exponent
int8_t
1 byte
Value
int8_t
1 byte
Count
uint64_t
Max 8 bytes
Varbit encoded

Storage efficiency - 1 month
30 days of one minute histograms
30 days * 24 hours/day * 60 bins/hour * 300 bin
span * 10 bytes/bin * 1kB/1,024bytes * 1MB/1024kB
= 123.6 MB

Storage efficiency - 1 year
365 days of five minute histograms
365 days * 24 hours/day * 12 bins/hour * 300 bin
span * 10 bytes/bin * 1kB/1,024bytes * 1MB/1024kB
= 300.9 MB

Quantile calculation
1. Given a quantile q(X) where 0 < X < 1
2. Sum up the counts of all the bins, C
3. Multiply X * C to get count Q
4. Walk bins, sum bin boundary counts until > Q
5. Interpolate quantile value q(X) from bin

Linear interpolation
left_count=600 bin_count=200
right_count=800
left_value=1.0 right_value=1.1
Q = 700
X = 0.5
q(X) = 1.0+(700-600) /
(800-600)*0.1
q(X) = 1.05
q(X) = left_value+(Q-left_count) /
(right_count-left_count)*bin_width

Recap
● Several different types of histograms
● Highly space efficient
● O(1) and O(n) complexity calculating quantiles
● What other fun things can we do?

Inverse Quantiles
● What’s the 95th percentile latency?
○ q(0.95) = 10ms
● What percent of requests exceeded 10ms?
○ 5% for this data set; what about others?

Inverse Quantile calculation
1. Given a sample value X, locate its bin
2. Using the previous linear interpolation
equation, solve for Q given X

Inverse Quantile calculation
X = left_value+(Q-left_count) / (right_count-left_count)*bin_width
X-left_value = (Q-left_count) / (right_count-left_count)*bin_width
(X-left_value)/bin_width = (Q-left_count)/(right_count-left_count)
(X-left_value)/bin_width*(right_count-left_count) = Q-left_count
Q = (X-left_value)/bin_width*(right_count-left_count)+left_count

Linear interpolation
left_count=600 right_count=800
left_value=1.0 right_value=1.1
X = 1.05
Q = (1.05-1.0)/0.1*(800-600)+600
Q =(X-left_value)/bin_width *
(right_count-left_count)+left_count
Q = 700

Inverse Quantile calculation
1. Given a sample value X, locate its bin
2. Using the previous linear interpolation
equation, solve for Q given X
3. Sum the bin counts up to Q as Qleft
4. Inverse quantile qinv(X) = (Qtotal-Qleft)/Qtotal
5. For Qleft=700, Qtotal = 1,000, qinv(X) = 0.3
6. 30% of sample values exceeded X

Thank you!
Questions?
Bug me at the Circonus booth
Come to Office Hours
Tweet @phredmoyer or @circonus

What's hot

Aoa amortized analysisSalabat Khan

Algorithms - "heap sort"Ra'Fat Al-Msie'deen

Color filters for the dummiesHean Hong Leong

Amortized Analysis of Algorithms sathish sak

Web-based framework for online sketch-based image retrievalLukas Tencer

Algorithms - "quicksort"Ra'Fat Al-Msie'deen

Amortized analysisajmalcs

Radix SortFaiza Saleem

Quick sortAbdelrahman Saleh

The Case for Learned Index Structures宇傅

Coderstrust Presentation on C LanguageSrabon Sabbir

Linear searchAbdelrahman Saleh

2 optimizationMr Patrick NIYISHAKA

13 Amortized AnalysisAndres Mendez-Vazquez

Calculation Packagekarlocamacho

Scalability, basics, application to systems, teams and processesEduardo Ferro Aldama

Analisis dan evaluasi penerapan kebijakan perencanaan manajemen sumber daya d...Rama Renspandy

4 ce6a quantity surveying & valuationVikas Yadav

Group functionssehrishishaq1

What's hot (19)

Aoa amortized analysis

Algorithms - "heap sort"

Color filters for the dummies

Amortized Analysis of Algorithms

Web-based framework for online sketch-based image retrieval

Algorithms - "quicksort"

Amortized analysis

Radix Sort

Quick sort

The Case for Learned Index Structures

Coderstrust Presentation on C Language

Linear search

2 optimization

13 Amortized Analysis

Calculation Package

Scalability, basics, application to systems, teams and processes

Analisis dan evaluasi penerapan kebijakan perencanaan manajemen sumber daya d...

4 ce6a quantity surveying & valuation

Group functions

Similar to Effective management of high volume numeric data with histograms

Machine learning using matlab.pdfppvijith

Random number generationvinay126me

Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi

Location based sales forecast for superstoresThaiQuants

Silicon valleycodecamp2013Sanjeev Mishra

Probabilistic data structures. Part 2. CardinalityAndrii Gakhov

Towards Evaluating Size Reduction Techniques for Software Model CheckingAkos Hajdu

BIIntro.pptPerumalPitchandi

Presentation sim opencvAndres Fernandez

Quantum algorithms for pattern matching in genomic sequences - 2018-06-22Aritra Sarkar

Support Vector Machines ( SVM ) Mohammad Junaid Khan

Interpolation Missing values.pptxRushikeshGore18

Artificial Intelligence Course: Linear models ananth

Deep Learning Introduction - WeCloudDataWeCloudData

مدخل إلى تعلم الآلةFares Al-Qunaieer

Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Spark Summit

Machine learning Investigative Reporting NorthBaySolutions.pdfssusera5352a2

EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEHONGJOO LEE

QC-UNIT 2.pptkhan188474

Stochastic Order Level Inventory Model with Inventory Returns and Special Salestheijes

Similar to Effective management of high volume numeric data with histograms (20)

Machine learning using matlab.pdf

Random number generation

Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access

Location based sales forecast for superstores

Silicon valleycodecamp2013

Probabilistic data structures. Part 2. Cardinality

Towards Evaluating Size Reduction Techniques for Software Model Checking

BIIntro.ppt

Presentation sim opencv

Quantum algorithms for pattern matching in genomic sequences - 2018-06-22

Support Vector Machines ( SVM )

Interpolation Missing values.pptx

Artificial Intelligence Course: Linear models

Deep Learning Introduction - WeCloudData

مدخل إلى تعلم الآلة

Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...

Machine learning Investigative Reporting NorthBaySolutions.pdf

EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME

QC-UNIT 2.ppt

Stochastic Order Level Inventory Model with Inventory Returns and Special Sales

Recently uploaded

why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole

Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml

Presentation of project of business person who are successPratikSingh115843

Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56

DATA ANALYSIS using various data sets like shoping data set etclalithasri22

Insurance Churn Prediction Data Analysis ProjectBoston Institute of Analytics

Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo

2023 Survey Shows Dip in High School E-Cigarette UseBisnar Chase Personal Injury Attorneys

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics

IBEF report on the Insurance market in IndiaManalVerma4

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)

Role of Consumer Insights in business transformationAnnie Melnic

Data Analysis Project: Stroke PredictionBoston Institute of Analytics

Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646

Digital Marketing Plan, how digital marketing worksdeepakthakur548787

Recently uploaded (17)

why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...

Bank Loan Approval Analysis: A Comprehensive Data Analysis Project

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf

Presentation of project of business person who are success

Statistics For Management by Richard I. Levin 8ed.pdf

DATA ANALYSIS using various data sets like shoping data set etc

Insurance Churn Prediction Data Analysis Project

Digital Indonesia Report 2024 by We Are Social .pdf

2023 Survey Shows Dip in High School E-Cigarette Use

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...

IBEF report on the Insurance market in India

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...

Role of Consumer Insights in business transformation

Data Analysis Project: Stroke Prediction

Non Text Magic Studio Magic Design for Presentations L&P.pdf

Digital Marketing Plan, how digital marketing works

Effective management of high volume numeric data with histograms

1. Effective management of high volume numeric data with histograms Fred Moyer @Circonus DataEngConf SF ‘18

2. @phredmoyer ● Engineer to Engineer @circonus ● Recovering C and Perl programmer ● Geeking out on histograms since 2015

3. Pain driven development ● Observability tools caused a telemetry firehose ● Existing monitoring systems got washed away ● Average based metrics gave limited insight

4. “Effective Management” ● Performance AND scalability ● Avoid memory allocations, copies, locks, waits ● Persist data in size efficient structures

5. Histogram Basics Sample Value Number of Samples Median q(0.5) Mode q(0.9) q(1)Mean

6. Heatmap Basics

7. Histogram Types ● Fixed Bucket ● Approximate ● Linear ● Log Linear ● Cumulative

8. Fixed Bucket User specified bins/buckets Number of Samples Sample Value

9. Approximate Centroids indicating data grouping Number of Samples Sample Value

10. Linear Evenly sized bins Number of Samples Sample Value

11. Log Linear Logarithmically increasing bin sizes Number of Samples Sample Value

12. Cumulative Number of Samples Sample Value Total Sample Count

13. Custom Number of Samples Sample Value

14. Open Source Log Linear C - github.com/circonus-labs/libcircllhist Go - github.com/circonus-labs/circonusllhist

15. Open Source Log Linear

16. Open Source Log Linear Bin size increase by 10x 90 bins

17. Open Source Log Linear

18. Bin data structure Exponent int8_t 1 byte Value int8_t 1 byte Count uint64_t Max 8 bytes Varbit encoded

19. Storage efficiency - 1 month 30 days of one minute histograms 30 days * 24 hours/day * 60 bins/hour * 300 bin span * 10 bytes/bin * 1kB/1,024bytes * 1MB/1024kB = 123.6 MB

20. Storage efficiency - 1 year 365 days of five minute histograms 365 days * 24 hours/day * 12 bins/hour * 300 bin span * 10 bytes/bin * 1kB/1,024bytes * 1MB/1024kB = 300.9 MB

21. Quantile calculation 1. Given a quantile q(X) where 0 < X < 1 2. Sum up the counts of all the bins, C 3. Multiply X * C to get count Q 4. Walk bins, sum bin boundary counts until > Q 5. Interpolate quantile value q(X) from bin

22. Linear interpolation left_count=600 bin_count=200 right_count=800 left_value=1.0 right_value=1.1 Q = 700 X = 0.5 q(X) = 1.0+(700-600) / (800-600)*0.1 q(X) = 1.05 q(X) = left_value+(Q-left_count) / (right_count-left_count)*bin_width

23. Recap ● Several different types of histograms ● Highly space efficient ● O(1) and O(n) complexity calculating quantiles ● What other fun things can we do?

24. Inverse Quantiles ● What’s the 95th percentile latency? ○ q(0.95) = 10ms ● What percent of requests exceeded 10ms? ○ 5% for this data set; what about others?

25. Inverse Quantile calculation 1. Given a sample value X, locate its bin 2. Using the previous linear interpolation equation, solve for Q given X

26. Inverse Quantile calculation X = left_value+(Q-left_count) / (right_count-left_count)*bin_width X-left_value = (Q-left_count) / (right_count-left_count)*bin_width (X-left_value)/bin_width = (Q-left_count)/(right_count-left_count) (X-left_value)/bin_width*(right_count-left_count) = Q-left_count Q = (X-left_value)/bin_width*(right_count-left_count)+left_count

27. Linear interpolation left_count=600 right_count=800 left_value=1.0 right_value=1.1 X = 1.05 Q = (1.05-1.0)/0.1*(800-600)+600 Q =(X-left_value)/bin_width * (right_count-left_count)+left_count Q = 700

28. Inverse Quantile calculation 1. Given a sample value X, locate its bin 2. Using the previous linear interpolation equation, solve for Q given X 3. Sum the bin counts up to Q as Qleft 4. Inverse quantile qinv(X) = (Qtotal-Qleft)/Qtotal 5. For Qleft=700, Qtotal = 1,000, qinv(X) = 0.3 6. 30% of sample values exceeded X

29. Quantiles - Heatmap

30. Quantiles - q(0.9)

31. Inverse Quantiles - SLO

32. Inverse Quantiles - SLO

33. Inverse Quantiles - SLO

34. Anomalies

35. Thank you! Questions? Bug me at the Circonus booth Come to Office Hours Tweet @phredmoyer or @circonus

Editor's Notes

Hello! I’m Fred Moyer from Circonus, and today I’ll be talking about effective management of high volume numeric data with histograms.
There’s my twitter handle, it uses a ph instead of an f. I’m an Engineer at Circonus who does a lot of talking to other engineers. I’m a recovering C and Perl programmer (though I’ve been writing more C lately). I’ve been using histograms to monitor operational systems for about seven years, but I’ve been really diving into them over the past few years.
I want to start off by saying that everything I’m about to show you was motivated by pain. Circonus evolved from people working with large distributed systems who needed to diagnose why those systems weren’t behaving as they should. The observability tools that we employed to peer into those systems yielded a massive amount of operational telemetry. The existing monitoring tools at hand couldn’t handle the scale of that telemetry. Also, nearly all of those tools only worked with average values. As some of you probably know, using average based metrics doesn’t get you very far.
Just what does it mean to effectively manage billions of data points? Your system has to be both performant and scalable. How do you accomplish that? Not only do your algorithms have to be on point, but your implementation of them has to be efficient. You want to avoid allocating memory where possibly, avoid copying data (pass pointers around instead), avoid locks, and avoid waits. Lots of little optimizations that add up to being able to run your code as close to the metal as possible. You also need to be able to scale your data structures. They need to be as size efficient as possible, which means using strongly typed languages with the optimum choice of data types. We’ll look at this in a bit, but first let’s dive into histograms.
How many people here know what a histogram is? This histogram diagram shows a skewed histogram where the mode is near the minimum value q(0). The Y axis is the number of samples (or sample density), and the X axis shows the sample value. How many have used a histogram in practice? How many people know what a percentile or a quantile is? On this histogram we can see that the median is slightly left of the midpoint between the highest value and the lowest value. The mode is at a low sample value, so the median is below the mean, or average value. The 90th percentile is also called q(0.9), and is where 90 percent of the sample values are below it.
How many people here know what a heatmap is? This might look like the display in KITT from Knight Rider (2nd generation display), but this is a heatmap. A heatmap is essentially a series of histograms over time. How many have used a heatmap in practice? This heatmap represents web service request latency. The parts that are red are where the sample density is the highest. So we can that most of the latencies tend to concentrate around 500 nanoseconds. We can overlay quantiles onto this visualization, we’ll cover that in a bit.
Fixed bucket histograms require the user to specify the bucket or bin boundaries. The terms bucket and bin are generally used interchangeably to refer to the boundaries between histogram data elements. Wikipedia uses bin as the designation. Traits: Histogram bin sizes can be fine tuned for known data sets to achieve increased precision Cannot be merged with other types of histograms because the bin boundaries are likely uncommon Lesser experienced users will likely pick suboptimal bin sizes If you change your bin sizes, you can’t do calculations across older configurations
Approximate histograms such as the t-digest histogram (created by Ted Dunning) use approximations of values, such as this example shown which displays centroids. The number of samples on each side of the centroid is the same. The centroids indicate Traits: Space efficient High accuracy at extreme percentiles (95%, 99%+) Worst case errors ~10% at the median with small sample sizes Can be merged with other t-digest histograms
Linear histograms have one bin at even intervals, such as one bin per integer. Because the bins are all evenly sized, this type of histogram uses a large number of bins. Traits: Accuracy dependent on data distribution and bin size Low accuracy at fractional sample values (though this indicates improperly sized bins) Inefficient bin footprint at higher sample values
Log Linear histograms have bins at logarithmically increasing intervals Traits: High accuracy at all sample values Fits all ranges of data well Worst case bin error ~5%, but only with absurdly low sample density Bins are often subdivided into even slices for increased accuracy HDR histograms (high dynamic range) are a type of log linear histograms
Cumulative histograms are different from other types of histograms in that each successive bin contains the sum of the counts of previous bins. Traits: Final bin contains the total sample count, and as such is q(1) Used at Google for their Monarch monitoring system Easy to calculate bin quantile - just divide the count by the maximum
There’s nothing stopping you from putting together your own histogram implementation based on a combination of the ones I’ve just shown you. When preparing this presentation I researched cumulative histograms, and mistakenly interpreted the cumulative aspect to be cumulative over time, not over bin counts. So in effect I had created a custom histogram specification, one that summed bin counts over time. Traits: O(1) calculations for time based ranges - you don’t need to sum up all the intermediate histograms, just diff the time slices. For displaying N slices in a heatmap, you have 2N operations (diff each time interval) instead of N operations as you would with these other implementations. Copes better with data loss in the middle of time ranges
We’re going to take a look at what a programmatic implementation of a log linear histogram looks like. Circonus has released our implementation of log linear histograms, in both C and Golang. We’ll look at a few things: Why it scales Why it is fast How to calculate quantiles
First let’s examine what the visual representation of this histogram is, to get an idea of the structure. At first glance this looks like a linear histogram, but take a look at the 1 million point on the X axis.
You’ll notice a change in bin size by a factor of 10. Where there was a bin from 990k to 1M, the next bin spans 1M to 1.1M. Each power of 10 contains 90 bins evenly spaced. Why not 100 bins you might think? Because the lower bound isn’t zero. In the case of a lower bound of 1 and an upper bound of 10, 10-1 = 9, and spacing those in 0.1 increments yields 90 bins.
Here’s an alternate look at the bin size transition. You can see the transition to larger bins at the 1,000 sample value boundary.
The C implementation of each bin is a struct containing a value and exponent struct paired with the count of samples in that bin. This diagram shows the overall memory footprint of each bin. The value is one byte representing the value of the data sample times 10. The Exponent is the power of 10 of the bin, and ranges from -128 to +127. The sample count is an unsigned 64 bit integer. This field is variable bit encoded, and as such occupies a maximum of 8 bytes. Many of the existing time series data stores out there store a single data point as an average using the value as a uint64. That’s one value for every eight bytes - this structure can store virtually an infinite count of samples in this bin range with the same data storage requirements.
Let’s take a look at what the storage footprint is in practice for this type of bin data structure. In practice, we have not seen more than a 300 bin span for operational sample sets. Bins are not pre-allocated linearly, they are allocated as samples are recorded for that span, so the histogram data structure can have gaps between bins containing data. These calculations show that we can store 30 days of one minute distributions in a maximum space of 123.6 megabytes. Less than a second of disk read operations if that. Now, 30 days of one minute averages only takes about a third of a megabyte - but that data is essentially useless for any sort of analysis.
Let’s examine what a year’s worth of data looks like in five minute windows. The same calculation with different values yield a maximum of about 300 megabytes to represent a year’s worth of data in five minute windows. Note that this is invariant to the total number of samples; the biggest factors in the actual size of the data are the span of bins covered, and the compression factor per bin.
Let’s talk about performing quantile calculations. The quantile notation q of X is just a fancy way of specifying a percentile. The 90th percentile would be q of 0.9, the 95th would be q of 0.95, and the maximum would be q of 1. So say we wanted to calculate the median, q of 0.5. First we iterate over the histogram and sum up the counts in all of the bins. Remember that cumulative histogram we talked about earlier? Guess what - the far right bin already contains the total count, so if you are using a cumulative histogram variation, that part is already done for you and can be a small optimization for quantile calculation since you have an O(1) operation instead of O(n) Now we multiple that count by the quantile value. If we say our count is 1,000 and our median is 0.5, we get a Q of 500. Next iterate over the bins, summing the left and right bin boundary counts, until Q is between those counts. If the count Q matches the left bin boundary count, our quantile value is that bin boundary value. If Q falls in between the left and right boundary counts, we use linear interpolation to calculate the sample value.
Let’s go over the linear interpolation part of quantile calculation. Once we have walked the bins to where Q is located in a certain bin, we use the formula shown here. Little q of X is the left bin value plus big Q minus the left bin boundary, divided by the count differences of the left and ride sides of the bin, times the bin width. Using this approach we can determine the quantile for a log linear histogram to a high degree of accuracy. In terms of what error levels we can experience in terms of quantile calculation, with one sample in a bin it is possible to see a worst case 5% error if the value is 109.9 in a bin bounding 100-110; the reported value would be 105. The best case error for one sample is 0.5% for a value of 955 in a bin spanning 950-960. However, our use of histograms is geared towards very large sets of data. With bins that contain dozens, hundreds, or more samples, accuracy should be expected to 3 or 4 nines.
So let’s review what we’ve just gone over. We looked at a few different implementations of histograms. We got an overview of Circonus’ open source implementation of a log linear histogram. What data structures it uses to codify bin data. The algorithm used to calculate quantiles (minus some implementation specific optimizations). I highly recommend reading the code. The points I mentioned previously about avoiding locks, memory allocations, waits, and copies are a big part of making the calculations highly optimized. Good algorithms are still limited by how they are implemented on the metal itself. From what we’ve looked at, most folks should be able to implement their own time series data store based on the open source library here, if you are into that sort of thing. One problem to solve might be to collect all of the syscall latency data on your systems via eBPF, and compare the 99th percentile syscall latencies one you applied the Spectre patch which disabled CPU branch prediction. It could be you find that your 99th percentile syscall overhead has gone from 10 nanoseconds to 30! While percentiles are a good first step for analyzing your service level objectives, what if we could look at the change in the number of requests that exceeded that objective? Say if 10% of your syscall requests exceeded 10 nanoseconds before the spectre patch, but after patching 50% of those syscalls exceeded 10 nanoseconds?
We can also use histograms to calculate inverse quantiles. Humans can reason about thresholds more naturally than they can about quantiles. If my web service gets a surge in requests, and my 99th percentile response time doesn’t change, that not only means that I just got a bunch of angry users whose requests took too long, but even worse I don’t know by how much those requests exceeded that percentile. I don’t know how bad the bleeding is. Inverse quantiles allow me to set a threshold sample value, then calculate what percentage of values exceeded it.
To calculate the inverse quantile, we start with the target sample and work backwards towards the target count Q
Given the previous equation we had, we can use some middle school level algebra (well, it was when I was in school) and solve for Q
Solving for Q, we get 700, which is expected for our value of 1.05. This handles the
Now that we know Q, we can add up the counts to the left of it, subtract that from the total and then divide by the total to get the percentage of sample values which exceeded our sample value X. So if we are running a website, and we know from industry research that we’ll lose users if our requests take longer than three seconds, we can set X at 3 seconds, calculate the inverse quantile for our request times, and figure out what percentages of our users are getting mad at us. Let’s take a look at how we have been doing that in real time with Circonus.
This is a heatmap of one year’s worth of latency data for a web service. It contains about 300 million samples. Each histogram window in the heatmap is one day’s worth of data, five minute histograms are merged together to create that window. The overlay window shows the distribution for the day where the mouse is hovering over.
Here we added a 99th percentile overlay which you can see implemented as the green lines. It’s pretty easy to spot the monthly recurring rises in the percentile to around 10 seconds. Looks like a network timeout issue, those usually default to around 10 seconds. For most of the time the 99th percentile is relatively low, a few hundred millisecond.
Here we can see the inverse quantile shown for 500 milliseconds request times. As you can see, for most of the graph, at least 90% of the requests are finishing within the 500 millisecond service level objective. We can still see the monthly increases which we believe are related to network client timeouts, but when they spike, they only affect about 25% of requests - not great, but at least we know the extent of that our SLO is exceeded.
We can take that percentage of requests that violated our SLO of 500 milliseconds, and multiply them by the total number of requests to get the number of requests which exceeded 500 milliseconds. This has direct bearing on your business if each of these failed requests is costing you money. Note that we’ve dropped the range here to a month to get a closer look at the data presented by these calculations.
What is we sum up over time the number of requests that exceeded 500 millseconds? Here integrate the number of requests that exceeded the SLO over time, and plot that as the increasing line. You can clearly see now where things get bad with our service by the increase in slope of the blue line. What if we had a way to automatically detect when that is happening and then page whoever is on call?
Here we can see the red line on the graph is the output of a constant anomaly detection algorithm. When the number of SLO violations increases, the code identifies it as an anomaly and rates it on a 0-100 scale. There are several anomaly detection algorithms available out there, and most don’t use a lot of complicated machine learning, just statistics.

Effective management of high volume numeric data with histograms

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to Effective management of high volume numeric data with histograms

Similar to Effective management of high volume numeric data with histograms (20)

More from Fred Moyer

More from Fred Moyer (20)

Recently uploaded

Recently uploaded (17)

Effective management of high volume numeric data with histograms

Editor's Notes