Better service monitoring through histograms

•Download as PPTX, PDF•

2 likes•1,871 views

This document discusses using histograms and percentiles to better monitor service performance. It begins by noting the limitations of synthetic monitoring and outlines how real user data can provide a more accurate picture. Percentiles like the median and 90th percentile are explained as useful metrics for understanding performance. Histograms of request latency data over time are presented as a way to detect non-normal distributions that could indicate issues. Calculating alerting thresholds based on percentiles rather than averages is advocated to avoid missing multiple high samples. Examples are given of how percentile-based alerting can more effectively detect performance problems and avoid unnecessary alerts.

Software

Better service monitoring
through histograms
Fred Moyer - @phredmoyer
San Francisco Perl Mongers, 07-26-2016

Systems break while we sleep
How often are you woken up for false alarms?
Welcome

Synthetics
Easy to setup, but
not a real user

Synthetics
Stephen Falken: Uh, uh, General, what you see on these screens up
here is a fantasy; a computer-enhanced hallucination. Those blips
are not real missiles. They're phantoms. (War Games, 1983)

What threshold do you choose?
Threshold Alerting

“Alert me if requests take longer than 200 ms”
10,10,10,10,10,10,10,10,10,5000
Alerts on one outlier in 10
Threshold Alerting

“Alert if request average over one minute
is longer than 200 ms”
avg(10,10,210,210,210,210) = 143 (860/6)
Does not alert on multiple high samples
Threshold Alerting

‘average’ eq ‘arithmetic mean’
A=S/N
A = average
N = the number of terms
S = the sum of the numbers in the set
Math Refresher

median = midpoint of data set
The 50th percentile is 555 - q(0.5)
Value 111 222 333 444
555 666 777 888 999
Sample # 1 2 3 4 5 6 7 8 9
Math Refresher

90th percentile - 90% of samples below it
The 90th percentile is 1,000 - q(0.9)
Value 111 222 333 444 555 666 777 888 999
1,000 1,111
Sample # 1 2 3 4 5 6 7 8 9 10 11
Math Refresher

100th Percentile - the maximum value
The 100th percentile is 1,111 - q(1)
Value 111 222 333 444 555 666 777 888 999 1,000
1,111
Sample # 1 2 3 4 5 6 7 8 9 10 11
Math Refresher

Sample value
Number of
samples
Histogram

Sample value
Number of
samples
Normal Distribution

Sample value
Number of
samples
Normal Distribution
34% within
one sigma (σ)

Sample value
Number of
samples
Non-Normal Distribution

Non-Normal Distribution
Operations data groups at different points

Non-Normal Distribution
Users to the right of the red line are gone

Request latency
“We keep hearing from people that the
website is slow. But it is fine when we test it,
and the request latency graph is constant”
You are only looking at part of the picture.

Practical Percentiles
Bandwidth usage is often billed at 95th percentile usage
Record 5 minute data usage intervals
Sort samples by value of sample
Throw out the highest 5% of samples
Charge usage based on the remaining top sample, i.e. 300
MB transferred over 5 minutes = 1 MB/s rate billing

Practical Percentiles
If I measure 95th percentile per 5 minutes all
month long,
I CANNOT calculate 95th percentile over the
month.

Angry users
How many users are you pissing off?

“Alert me if request latency 90th percentile
over one minute is exceeded”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,10,10,5000] == 10
Alert IS NOT triggered
Do you want to be woken up for this? NO!

“Alert me if request latency 90th percentile
over one minute is exceeded”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,250,300] = ~270
Alert IS triggered
Do you want to be woken up for this? YES!

Who’s using this approach?
Google.com
Circonus.com
You?

Questions?
Thanks to Circonus.com for the tools and help
with the math
http://www.circonus.com/free-account/

Title: "Machine learning and Internet of Things, the future of medical prevention" Abstract: In this talk, Pierre Gutierrez, a data scientist at Dataiku, will discuss Dataiku's experiences using machine learning on IOT data. We will talk about the challenges processing and cleaning IoT data, and how to successfully train a model that can be deployed in production. We will illustrate our talk with two examples from our previous work. Creating algorithm for early epilepsy seizure detection based on wearable tech and Detecting people activity through sensor data.

The Dark of Building an Production Incident Syste

Alois Reitbauer

The document discusses building an effective production incident system using statistics. It explains that using the median and percentiles to define a baseline range captures normal system behavior better than trying to fit a specific distribution model. Two examples are provided: 1) Using the binomial distribution to determine if an error rate exceeds expectations. 2) Using percentiles to check if response times have drifted above the median without knowing the underlying distribution. The key is applying statistical methods to objectively determine what constitutes a normal range of values versus a problem requiring alerting.

Four ways to combat non actionable alerts

BigPanda

The math behind big systems analysis.

Theo Schlossnagle

Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Pr...

DevOps.com

Observability has never been more important: the complexity of microservices makes it harder and harder to answer basic questions about system behavior. The conventional wisdom claims that Metrics, Logging and Tracing are “the three pillars” of observability… yet software organizations check these three boxes and are still grasping at straws during emergencies. In this session we’ll illustrate the problem with the three pillars: metrics, logs, and traces are just data – they are the fuel, not the car.

Introduction to Data streaming - 05/12/2014

Raja Chiky

Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.

The Dark Art of Production Alerting

Alois Reitbauer

The document discusses how to build an effective incident detection system using statistics. It explains that a baseline is needed to determine what normal behavior looks like and how to define abnormal behavior that requires an alert. Key metrics like errors, response times, and percentiles are identified. The document provides examples of how to use statistical distributions like the binomial distribution to calculate the likelihood of an observed value and determine if it warrants an alert or is still within the expected range of normal behavior.

Brian Brazil is an engineer passionate about reliable software operations. He worked at Google SRE for 7 years and is the founder of Prometheus, an open source time series database designed for monitoring system and service metrics. Prometheus supports metric labeling, unified alerting and graphing, and is efficient, decentralized, reliable, and opinionated in how it encourages good monitoring practices.

HBaseCon 2015: Running ML Infrastructure on HBase

HBaseCon

Sift Science uses online, large-scale machine learning to detect fraud for thousands of sites and hundreds of millions of users in real-time. This talk describes how we leverage HBase to power an ML infrastructure including how we train and build models, store and update model parameters online, and provide real-time predictions. The central pieces of the machine learning infrastructure and the tradeoffs we made to maximize performance will also be covered.

Computer Vision for Measurement & FR

RekaNext Capital

This document discusses using computer vision and cameras for measurement applications. It begins by introducing the speaker and their background. It then discusses some of the challenges with computer vision accuracy, particularly when using cameras as contactless sensors outdoors. It provides examples of using video analytics to extract metadata like people counts and speed measurements. The document emphasizes that measurement accuracy depends on many factors like sensor calibration, installation, and environmental conditions.

Finding bad apples early: Minimizing performance impact

Arun Kejariwal

The big data era is characterized by the ever-increasing velocity and volume of data. In order to store and analyze the ever-growing data, the operational footprint of data stores and Hadoop have also grown over time. (As per a recent report from IDC, the spending on big data infrastructure is expected to reach $41.5 billion by 2018.) The clusters comprise several thousands of nodes. The high performance of such clusters is vital for delivering the best user experience and productivity of teams. The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack; hence, manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following: # Robustness against anomalies (note that anomalies may occur, for example, due to an ad-hoc heavyweight job on a Hadoop cluster) # Given the varying data characteristics of different services, no one model fits all. Consequently, we parameterized the threshold used for classification The proposed technique works well with both hourly and daily data, and has been in use in production by multiple services. This has not only eliminated manual investigation efforts, but has also mitigated the impact of slow nodes, which used to get detected after several weeks/months of lag! We shall walk the audience through how the techniques are being used with REAL data.

Convolutional Neural Network for Text Classification

Anaïs Addad

It Probably Works

Fastly

Probabilistic algorithms exist to solve problems that are either impossible or unrealistic (too expensive, too time consuming, etc.) to solve precisely. In an ideal world, you would never actually need to use probabilistic algorithms. For programmers who are not familiar with them, the concept can be positively nerve-racking: “How do I know it will actually work? What if it is inexplicably wrong? How can I debug it? Maybe we should just punt on this problem or buy a whole lot more servers. . .” However, for those who either deeply understand probability theory or at least have used and observed the behavior of probabilistic algorithms in large-scale production environments, these algorithms are not only acceptable but also worth using at any opportunity. This is because they can help solve problems, create systems that are less expensive and more predictable, and do things that could not be done otherwise.

How to not fail at security data analytics (by CxOSidekick)

Dinis Cruz

1. The document discusses the challenges of obtaining security-related data from different sources and transporting it to a central platform for analysis. It addresses questions about data volume, collection methods, filtering and formatting. 2. Setting up a security data pipeline involves determining what data to collect from various host systems, networks, and applications. Data must then be forwarded from collectors to a central platform while managing bandwidth, latency, and failures. 3. Collecting the right security-related data is vital for detecting threats and being able to investigate incidents. The document argues for collecting most available data by default and filtering out exceptions, rather than only collecting predefined types of data.

Handling Numeric Attributes in Hoeffding Trees

butest

Hoeffding trees are decision trees designed for data streams that can incrementally learn from examples using limited memory. This document evaluates methods for handling numeric attributes in Hoeffding trees, which is important for performance. It finds that approaches using more approximation, like maintaining 10 bins or Gaussian distributions, allow greater tree growth within memory limits and perform best in empirical tests, outperforming methods like exhaustive binary trees. Evaluation on different datasets and memory environments finds the 10-bin and Gaussian methods generally perform similarly well, with no clear winner, though increased approximation comes at a cost of slower training and prediction speeds.

A sentient network - How High-velocity Data and Machine Learning will Shape t...

Wenjing Chu

Semantics in Sensor Networks

Oscar Corcho

Design and Implementation of A Data Stream Management System

Erdi Olmezogullari

Calculating a Sample Size

Matt Hansen

This document provides information on calculating sample sizes using a sample size calculator. It defines sample size calculators, explains their purpose, and describes their key components. It then demonstrates how to use a sample size calculator by inputting values for three components to determine the fourth missing value. Finally, it provides examples of using a sample size calculator for scenarios involving polling for political elections, measuring call durations at a call center, and comparing the efficiencies of two systems.

Machine Learning Intro Session

Naveen Rajan

Real-time Classification of Malicious URLs on Twitter using Machine Activity ...

Pete Burnap

This document summarizes research on classifying malicious URLs on Twitter in real-time using machine activity data. The researchers collected data on URLs shared on Twitter during sporting events and used a honeypot to identify malicious ones. They built machine learning models to predict maliciousness based on metrics like CPU usage, network traffic, and processes when a URL was clicked. The best model was a multi-layer perceptron that achieved up to 72% accuracy. It showed network activity, CPU usage, and processes were predictive. Testing on a new dataset showed some independence between events. Using only 1% of training data caused a small 5% drop in performance, alleviating concerns over data requirements.

CSF Tips and Tricks 8MS Webinar

Aerialink

We know our 8MS users are made up of pros and power users, but even the pros get stumped every now and then! Over the years, our support team has heard all your calls and seen every kind of “weird error message” out there. Now, they want to bring these stories to light and offer some useful tips in all in one place. We’ve rounded up 10 of the trickiest issues that have stumped even our most seasoned 8MS users, along with best practices on how to resolve them. You already know Matt Noreen and Mike Gilbert as your “go-to” 8MS guys, now hear them on this interactive webinar, where you’ll get the chance to test your own knowledge with our mini quizzes! We'll be revealing a secret prize during the webinar for the most correct answers – so you won’t want to miss this!

Application Metrics (with Prometheus examples) #PHPDD18

Rafael Dohms

We all know not to poke at alien life forms in another planet, right? But what about metrics, do you know how to pick, measure and draw conclusions from them? In this talk we will cover various Site Reliability Engineering topics, such as SLIs and SLOs while we explore real life examples of defining and implementing metrics in a system with examples using Prometheus, an open-source system monitoring and alert platform, to demonstrate implementation. Let's get back to some real science.

Machine learning session6(decision trees random forrest)

Abhimanyu Dwivedi

2014 abic-talk

c.titus.brown

This document summarizes a keynote presentation about challenges in bioinformatics software development and proposed solutions. Some of the key points made include: 1) bioinformatics software development involves multiple disciplines including computer science, software engineering, statistics, and biology, each with different priorities; 2) there is a massive proliferation of bioinformatics software packages that leads to many difficult choices for researchers; 3) proposed solutions include developing software in a more modular and automated way, using common benchmarks and protocols to evaluate tools, and focusing on reproducibility and usability.

Subverting Machine Learning Detections for fun and profit

Ram Shankar Siva Kumar

This document discusses adversarial machine learning and how to attack machine learning algorithms. It provides examples of how naive Bayes, k-means clustering, and SVM algorithms can be subverted by manipulating input data or model parameters. Specifically, the naive Bayes algorithm's accuracy can be decreased by introducing benign words to messages. The k-means clustering algorithm's false negative rate can be increased by adding outlier points. And the SVM algorithm's decision boundary and predictions can be controlled. The document advocates for defenses like ensembling multiple models and using robust learning methods.

Reliable observability at scale: Error Budgets for 1,000+

Better service monitoring through histograms

Recommended

Recommended

More Related Content

Similar to Better service monitoring through histograms

Similar to Better service monitoring through histograms (20)

More from Fred Moyer

More from Fred Moyer (19)

Recently uploaded

Recently uploaded (20)

Better service monitoring through histograms

Editor's Notes