This document discusses using histograms and percentiles to better monitor service performance. It begins by noting the limitations of synthetic monitoring and outlines how real user data can provide a more accurate picture. Percentiles like the median and 90th percentile are explained as useful metrics for understanding performance. Histograms of request latency over time are presented as a way to detect non-normal distributions that could indicate issues. Calculating alerting thresholds from percentiles rather than averages is advocated, since averages can mask clusters of slow samples. Examples are given of how percentile-based alerting can detect performance problems more effectively while avoiding unnecessary alerts.
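As a rough illustration of the summary above, here is a minimal, hypothetical sketch of percentile-based thresholding; the latency window and the 200 ms threshold are invented for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (p in [0, 100])."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latency window (ms): mostly fast, a few slow outliers.
window = [20, 22, 21, 25, 24, 23, 22, 480, 510, 495]

mean = sum(window) / len(window)
p90 = percentile(window, 90)

# An average-based alert at 200 ms fires here (mean ~164 ms) but, with
# slightly fewer outliers, would stay silent while real users still suffer.
# A p90 threshold looks directly at the slow tail instead.
alert = p90 > 200
```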
Machine learning and Internet of Things, the future of medical prevention - Pierre Gutierrez
Title:
"Machine learning and Internet of Things, the future of medical prevention"
Abstract:
In this talk, Pierre Gutierrez, a data scientist at Dataiku, will discuss Dataiku's experiences using machine learning on IoT data. We will cover the challenges of processing and cleaning IoT data, and how to successfully train a model that can be deployed in production. We will illustrate the talk with two examples from our previous work: creating an algorithm for early epilepsy seizure detection based on wearable tech, and detecting people's activity through sensor data.
The Dark Art of Building a Production Incident System - Alois Reitbauer
The document discusses building an effective production incident system using statistics. It explains that using the median and percentiles to define a baseline range captures normal system behavior better than trying to fit a specific distribution model. Two examples are provided: 1) Using the binomial distribution to determine if an error rate exceeds expectations. 2) Using percentiles to check if response times have drifted above the median without knowing the underlying distribution. The key is applying statistical methods to objectively determine what constitutes a normal range of values versus a problem requiring alerting.
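The binomial check described above can be sketched in a few lines of stdlib Python. The 1% baseline error rate, the request counts, and the 0.01 alerting cutoff are assumptions for illustration, not values from the talk:

```python
from math import comb

def prob_at_least(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p): chance of seeing k or more errors
    in n requests if the true error rate is p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical baseline: 1% error rate. We just observed 8 errors in 200 requests.
p_value = prob_at_least(8, 200, 0.01)

# Alert only when the observation would be very unlikely under the baseline;
# the expected count is 2 errors, so 8 is far out in the tail.
alert = p_value < 0.01
```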
Many alerts place an unnecessary burden on Ops teams instead of helping them solve issues. This presentation describes the phenomenon and four ways to address it.
Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems - DevOps.com
Observability has never been more important: the complexity of microservices makes it harder and harder to answer basic questions about system behavior.
The conventional wisdom claims that Metrics, Logging and Tracing are “the three pillars” of observability… yet software organizations check these three boxes and are still grasping at straws during emergencies.
In this session we’ll illustrate the problem with the three pillars: metrics, logs, and traces are just data – they are the fuel, not the car.
Introduction to Data Streaming - 05/12/2014 - Raja Chiky
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
The document discusses how to build an effective incident detection system using statistics. It explains that a baseline is needed to determine what normal behavior looks like and how to define abnormal behavior that requires an alert. Key metrics like errors, response times, and percentiles are identified. The document provides examples of how to use statistical distributions like the binomial distribution to calculate the likelihood of an observed value and determine if it warrants an alert or is still within the expected range of normal behavior.
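One distribution-free way to implement the "drifted above the median" check mentioned above is a sign test: if behavior is unchanged, each new response time lands above the baseline median with probability 0.5, whatever the underlying distribution. A hedged sketch, with the window size, counts, and cutoff invented for illustration:

```python
from math import comb

def sign_test_p(above, n):
    """P[X >= above] for X ~ Binomial(n, 0.5): likelihood of seeing this many
    samples above the baseline median if behavior is unchanged."""
    return sum(comb(n, i) for i in range(above, n + 1)) / 2**n

# Hypothetical: the baseline median is 120 ms, and 16 of the last 20
# requests came in slower than that.
p_value = sign_test_p(16, 20)

# Under unchanged behavior this is a ~0.6% event, so flag a drift.
drifted = p_value < 0.01
```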
Introduction to Artificial Intelligence. Not complex, and relatively easy to follow. Because the deck is high-level (and has no voice-over), take some care with the simplified examples it uses.
Brian Brazil is an engineer passionate about reliable software operations. He worked at Google SRE for 7 years and is a core developer of Prometheus, an open source time series database designed for monitoring system and service metrics. Prometheus supports metric labeling, unified alerting and graphing, and is efficient, decentralized, reliable, and opinionated in how it encourages good monitoring practices.
HBaseCon 2015: Running ML Infrastructure on HBase - HBaseCon
Sift Science uses online, large-scale machine learning to detect fraud for thousands of sites and hundreds of millions of users in real-time. This talk describes how we leverage HBase to power an ML infrastructure including how we train and build models, store and update model parameters online, and provide real-time predictions. The central pieces of the machine learning infrastructure and the tradeoffs we made to maximize performance will also be covered.
This document discusses using computer vision and cameras for measurement applications. It begins by introducing the speaker and their background. It then discusses some of the challenges with computer vision accuracy, particularly when using cameras as contactless sensors outdoors. It provides examples of using video analytics to extract metadata like people counts and speed measurements. The document emphasizes that measurement accuracy depends on many factors like sensor calibration, installation, and environmental conditions.
Finding bad apples early: Minimizing performance impact - Arun Kejariwal
The big data era is characterized by the ever-increasing velocity and volume of data. In order to store and analyze the ever-growing data, the operational footprint of data stores and Hadoop have also grown over time. (As per a recent report from IDC, the spending on big data infrastructure is expected to reach $41.5 billion by 2018.) The clusters comprise several thousands of nodes. The high performance of such clusters is vital for delivering the best user experience and productivity of teams.
The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack; hence, manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following:
# Robustness against anomalies (note that anomalies may occur, for example, due to an ad-hoc heavyweight job on a Hadoop cluster)
# Given the varying data characteristics of different services, no one model fits all. Consequently, we parameterized the threshold used for classification
The proposed technique works well with both hourly and daily data, and has been in use in production by multiple services. This has not only eliminated manual investigation efforts, but has also mitigated the impact of slow nodes, which used to get detected after several weeks/months of lag!
We shall walk the audience through how the techniques are being used with REAL data.
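The abstract does not spell out the distance measure, so the following is only one plausible sketch of the idea: score each node by a robust, median/MAD-based distance from the cluster (robust to the anomaly jobs mentioned above), with the classification threshold left as the tunable parameter the authors describe. All node names and numbers are hypothetical.

```python
import statistics

def slow_nodes(metrics, threshold=3.5):
    """Flag nodes whose metric (e.g. mean task runtime) sits far above the
    cluster median, using a modified z-score built on the median absolute
    deviation (MAD), which outlier jobs barely perturb. `threshold` is the
    tunable classification parameter; 3.5 is a common default."""
    values = list(metrics.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return {node: v for node, v in metrics.items()
            if 0.6745 * (v - med) / mad > threshold}

# Hypothetical per-node average task runtimes (seconds): n6 is the bad apple.
runtimes = {"n1": 41, "n2": 43, "n3": 40, "n4": 42, "n5": 44, "n6": 95}
bad = slow_nodes(runtimes)
```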
Convolutional Neural Network for Text Classification - Anaïs Addad
Work under Prof. Nolan with a team of four to implement a convolutional neural network for text classification in TensorFlow, using a dataset of Amazon reviews.
Probabilistic algorithms exist to solve problems that are either impossible or unrealistic (too expensive, too time consuming, etc.) to solve precisely. In an ideal world, you would never actually need to use probabilistic algorithms. For programmers who are not familiar with them, the concept can be positively nerve-racking: “How do I know it will actually work? What if it is inexplicably wrong? How can I debug it? Maybe we should just punt on this problem or buy a whole lot more servers. . .”
However, for those who either deeply understand probability theory or at least have used and observed the behavior of probabilistic algorithms in large-scale production environments, these algorithms are not only acceptable but also worth using at any opportunity. This is because they can help solve problems, create systems that are less expensive and more predictable, and do things that could not be done otherwise.
How to not fail at security data analytics (by CxOSidekick) - Dinis Cruz
1. The document discusses the challenges of obtaining security-related data from different sources and transporting it to a central platform for analysis. It addresses questions about data volume, collection methods, filtering and formatting.
2. Setting up a security data pipeline involves determining what data to collect from various host systems, networks, and applications. Data must then be forwarded from collectors to a central platform while managing bandwidth, latency, and failures.
3. Collecting the right security-related data is vital for detecting threats and being able to investigate incidents. The document argues for collecting most available data by default and filtering out exceptions, rather than only collecting predefined types of data.
Handling Numeric Attributes in Hoeffding Trees - butest
Hoeffding trees are decision trees designed for data streams that can incrementally learn from examples using limited memory. This document evaluates methods for handling numeric attributes in Hoeffding trees, which is important for performance. It finds that approaches using more approximation, like maintaining 10 bins or Gaussian distributions, allow greater tree growth within memory limits and perform best in empirical tests, outperforming methods like exhaustive binary trees. Evaluation on different datasets and memory environments finds the 10-bin and Gaussian methods generally perform similarly well, with no clear winner, though increased approximation comes at a cost of slower training and prediction speeds.
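To make the bin-based approximation concrete, here is a minimal sketch of a fixed-budget per-attribute histogram of the kind such methods maintain; it is a simplification for illustration, not the paper's exact algorithm:

```python
class StreamBins:
    """Fixed-budget summary of one numeric attribute in a streaming learner:
    the first `n_bins` distinct values seed the bins, and every later value
    falls into the nearest existing bin, so memory stays constant no matter
    how long the stream runs (one simple approximation scheme)."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.centers = []   # representative value of each bin
        self.counts = []    # observations per bin

    def add(self, value):
        if len(self.centers) < self.n_bins and value not in self.centers:
            self.centers.append(value)
            self.counts.append(1)
        else:
            i = min(range(len(self.centers)),
                    key=lambda j: abs(self.centers[j] - value))
            self.counts[i] += 1

    def split_candidates(self):
        """Midpoints between adjacent bin centers, as candidate thresholds
        for the tree's split evaluation."""
        cs = sorted(self.centers)
        return [(a + b) / 2 for a, b in zip(cs, cs[1:])]
```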
A sentient network - How High-velocity Data and Machine Learning will Shape t... - Wenjing Chu
Dell's Distinguished Engineer Wenjing Chu discusses innovations in applying Machine Learning to solve challenges in Telco/Communication Services, and predicts that the future is a Sentient Network powered by Machine Learning that can handle real-time high-velocity data.
Design and Implementation of A Data Stream Management System - Erdi Olmezogullari
This presentation relates to my Master's thesis at Ozyegin University. We focused on data mining over real streaming (not binary) data, implementing the most popular data mining algorithm, Association Rule Mining (ARM), from scratch. From the thesis we published four national/international papers at conferences in areas such as cloud computing and big data.
This document provides information on calculating sample sizes using a sample size calculator. It defines sample size calculators, explains their purpose, and describes their key components. It then demonstrates how to use a sample size calculator by inputting values for three components to determine the fourth missing value. Finally, it provides examples of using a sample size calculator for scenarios involving polling for political elections, measuring call durations at a call center, and comparing the efficiencies of two systems.
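The calculation such a calculator performs for a proportion is the standard one below; z = 1.96 (95% confidence) and the worst-case p = 0.5 are the usual defaults, assumed here for illustration:

```python
import math

def sample_size(margin_of_error, confidence_z=1.96, p=0.5):
    """Minimum n for estimating a proportion within +/- margin_of_error.
    Uses the worst case p = 0.5; z = 1.96 corresponds to 95% confidence."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin_of_error**2)

# A political poll accurate to +/-3 points at 95% confidence:
n = sample_size(0.03)   # -> 1068 respondents
```

Given any three of the components (confidence level, margin of error, proportion, sample size), the fourth follows by rearranging the same formula.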
Real-time Classification of Malicious URLs on Twitter using Machine Activity ... - Pete Burnap
This document summarizes research on classifying malicious URLs on Twitter in real-time using machine activity data. The researchers collected data on URLs shared on Twitter during sporting events and used a honeypot to identify malicious ones. They built machine learning models to predict maliciousness based on metrics like CPU usage, network traffic, and processes when a URL was clicked. The best model was a multi-layer perceptron that achieved up to 72% accuracy. It showed network activity, CPU usage, and processes were predictive. Testing on a new dataset showed some independence between events. Using only 1% of training data caused a small 5% drop in performance, alleviating concerns over data requirements.
We know our 8MS users are made up of pros and power users, but even the pros get stumped every now and then! Over the years, our support team has heard all your calls and seen every kind of “weird error message” out there. Now, they want to bring these stories to light and offer some useful tips, all in one place.
We’ve rounded up 10 of the trickiest issues that have stumped even our most seasoned 8MS users, along with best practices on how to resolve them. You already know Matt Noreen and Mike Gilbert as your “go-to” 8MS guys, now hear them on this interactive webinar, where you’ll get the chance to test your own knowledge with our mini quizzes! We'll be revealing a secret prize during the webinar for the most correct answers – so you won’t want to miss this!
We all know not to poke at alien life forms on another planet, right? But what about metrics: do you know how to pick, measure, and draw conclusions from them? In this talk we will cover various Site Reliability Engineering topics, such as SLIs and SLOs, while we explore real-life examples of defining and implementing metrics in a system, using Prometheus, an open-source system monitoring and alerting platform, to demonstrate implementation. Let's get back to some real science.
Machine learning session 6 (decision trees, random forest) - Abhimanyu Dwivedi
Covers decision trees with examples; measures used for splitting, such as the Gini index, entropy, and information gain; pros, cons, and validation; and the basics of random forests with examples and uses.
This document summarizes a keynote presentation about challenges in bioinformatics software development and proposed solutions. Some of the key points made include: 1) bioinformatics software development involves multiple disciplines including computer science, software engineering, statistics, and biology, each with different priorities; 2) there is a massive proliferation of bioinformatics software packages that leads to many difficult choices for researchers; 3) proposed solutions include developing software in a more modular and automated way, using common benchmarks and protocols to evaluate tools, and focusing on reproducibility and usability.
This document discusses adversarial machine learning and how to attack machine learning algorithms. It provides examples of how naive Bayes, k-means clustering, and SVM algorithms can be subverted by manipulating input data or model parameters. Specifically, the naive Bayes algorithm's accuracy can be decreased by introducing benign words to messages. The k-means clustering algorithm's false negative rate can be increased by adding outlier points. And the SVM algorithm's decision boundary and predictions can be controlled. The document advocates for defenses like ensembling multiple models and using robust learning methods.
Reliable observability at scale: Error Budgets for 1,000+ - Fred Moyer
This document summarizes a presentation about implementing service level objectives (SLOs) and error budgets at scale. It discusses establishing service level indicators (SLIs) to define good and bad service, setting SLOs as targets for SLIs over time periods, and calculating error budgets as the complement of SLOs. The presentation provides examples of SLIs, SLOs, and error budgets for latency and availability. It also discusses challenges including variance from real users and different stakeholders' needs, and recommends approaches like flexible latency metrics and measuring as close to users as possible.
Practical service level objectives with error budgeting - Fred Moyer
This document summarizes Fred Moyer's presentation on practical service level objectives with error budgeting. It discusses how to calculate error budgets based on service level indicators, objectives, and agreements. It presents methods to calculate error budgets using log files by counting errors and slow requests, and using metrics like counters and histograms to track errors and response time distributions over time. Maintaining an error budget allows managing risk while releasing new code by setting a target for acceptable errors or slow requests.
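The error-budget arithmetic in the two summaries above reduces to a few lines; the SLO target and request counts below are hypothetical:

```python
def error_budget_remaining(slo_target, total_requests, bad_requests):
    """Fraction of the error budget still unspent. slo_target of 0.999 means
    at most 0.1% of requests may be bad (errors or over the latency
    threshold, counted from logs or metrics)."""
    allowed = (1 - slo_target) * total_requests
    return 1 - bad_requests / allowed

# Hypothetical month: 10M requests at a 99.9% SLO allows 10,000 bad
# requests; 4,000 bad so far leaves roughly 60% of the budget to spend
# on risky releases.
remaining = error_budget_remaining(0.999, 10_000_000, 4_000)
```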
Similar to Better service monitoring through histograms
This document summarizes a presentation about properly computing service level objectives (SLOs) using latency data. It discusses common mistakes like averaging percentiles, and better approaches like using histograms. Log linear histograms are recommended as they provide flexibility in choosing thresholds while being space efficient. Open source libraries like libcircllhist can be used to calculate SLOs from latency data stored in histograms.
The document discusses proper techniques for defining and measuring service level objectives (SLOs). It begins with an overview of SLOs, service level indicators (SLIs), and service level agreements (SLAs). It then describes a common mistake in averaging percentiles across data sets. The rest of the document discusses different methods for accurately computing SLOs using log data, counting requests, and histograms. It argues that histograms provide the most flexibility while avoiding issues with averaging percentiles.
The document discusses techniques for accurately calculating service level objectives (SLOs) based on latency. It begins with an overview of common SLO terminology. It then describes a common mistake where percentiles are incorrectly averaged across time windows. The document proceeds to examine approaches to computing SLOs using log data, request counting, and histograms. Histograms are identified as the most flexible technique since they allow thresholds to be chosen as needed and provide full statistical analysis of latency data.
This document discusses best practices for defining and measuring latency service level objectives (SLOs). It recommends computing SLOs directly from raw log data using histograms, which allow arbitrary percentiles to be derived and are better than averaging sample percentiles. Histograms can also be aggregated over time and used to count the number of requests above a latency threshold regardless of what the threshold was set to initially. Common histogram implementations like HDR-Histogram and t-digest are suggested.
Comprehensive Container Based Service Monitoring with Kubernetes and IstioFred Moyer
This document summarizes Fred Moyer's talk on comprehensive container-based service monitoring with Kubernetes and Istio. The talk covered Istio architecture and deployment, using the Istio sample bookinfo application, and monitoring the application with Istio metrics and Grafana dashboards. It also discussed Istio Mixer metrics adapters, math and statistics concepts like histograms and quantiles, and monitoring concepts like service level objectives, indicators, and agreements. The talk provided exercises for attendees to deploy sample applications and create custom metrics adapters.
Comprehensive container based service monitoring with kubernetes and istioFred Moyer
The document provides an overview of using Kubernetes and Istio to monitor microservices. It discusses using Istio to collect telemetry data on requests, including rate, errors, and duration. This data can be visualized in Grafana dashboards to monitor key performance indicators. Histograms are recommended to capture request durations as they allow calculating percentiles over time for service level indicators. An Istio metrics adapter is also described that sends telemetry data to Circonus for long-term storage and alerting.
This document provides an overview of key statistical concepts including:
1. The average (arithmetic mean) is calculated by summing all values and dividing by the number of samples.
2. The median is the middle value of a data set when values are sorted from lowest to highest.
3. The 90th percentile represents the value where 90% of values are below it.
4. Standard deviation measures how spread out values are from the average and 68% of values fall within one standard deviation of the average in a normal distribution.
Fred Moyer from Circonus presented on IRONdb and Grafana. IRONdb is a time series database that can replace existing TSDBs without changes to ingestion or visualizations. It provides scale, reliability, and ease of operations. IRONdb is distributed, replicated across multiple datacenters for reliability, and can store years of high-cardinality histogram and metric data. The upcoming IRONdb data source for Grafana will support histograms, stream tags, and Prometheus storage. Attendees could sign up for early access and preview accounts.
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseFred Moyer
The document discusses the process of logically sharding a growing PostgreSQL database. It describes the stages involved: diagnosing which tables are largest; evaluating options like account, geographic or hardware-based sharding; scoping the solution by separating tables between a main and marks database; implementing changes including managing transactions and configuration across databases; releasing the changes; and cleaning up afterwards. It emphasizes testing rollback processes, managing technical debt, and bringing empathy to understanding legacy code and configurations.
The document discusses differences between Perl and Go for Perl programmers. It covers Go topics like goroutines (threads), channels (queues), formatting code with gofmt, defining structs instead of hashes/objects, using slices instead of arrays, maps instead of hashes, error handling, importing packages instead of using Perl modules, writing tests with godoc instead of perldoc, and getting code with go get instead of cpanminus. It also provides Golang web resources for learning more.
Netfilter was used to solve performance and scalability issues with an existing captive portal solution. A netfilter module was developed that removed port numbers from HTTP requests, allowing most static content to be fetched directly from origin servers rather than through a proxy. This avoided proxying all traffic and achieved better performance than alternatives like Tinyproxy. The netfilter solution worked well technically but did not prove viable long-term for business reasons.
This document discusses Apache::Dispatch, a lightweight abstraction layer for mod_perl applications. It maps URIs to application resources via method handlers, providing the power of mod_perl handlers with a painless migration. The document reviews how Apache::Dispatch works, provides examples of configuration, method handlers, and testing with Apache::Test. It also covers additional Apache::Dispatch features like pre/post-dispatch handlers, inheritance, autoloading, and filtering.
This document discusses the Data::FormValidator module, which provides a simplified way to validate form data in Perl. It allows defining validation profiles that specify required and optional fields, as well as custom and built-in constraint methods. The module takes request parameters, runs validation according to the profile, and returns results that can be easily integrated into templates to display error messages.
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
SMS API Integration in Saudi Arabia| Best SMS API ServiceYara Milbes
Discover the benefits and implementation of SMS API integration in the UAE and Middle East. This comprehensive guide covers the importance of SMS messaging APIs, the advantages of bulk SMS APIs, and real-world case studies. Learn how CEQUENS, a leader in communication solutions, can help your business enhance customer engagement and streamline operations with innovative CPaaS, reliable SMS APIs, and omnichannel solutions, including WhatsApp Business. Perfect for businesses seeking to optimize their communication strategies in the digital age.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Crescat
Crescat is industry-trusted event management software, built by event professionals for event professionals. Founded in 2017, we have three key products tailored for the live event industry.
Crescat Event for concert promoters and event agencies. Crescat Venue for music venues, conference centers, wedding venues, concert halls and more. And Crescat Festival for festivals, conferences and complex events.
With a wide range of popular features such as event scheduling, shift management, volunteer and crew coordination, artist booking and much more, Crescat is designed for customisation and ease-of-use.
Over 125,000 events have been planned in Crescat and with hundreds of customers of all shapes and sizes, from boutique event agencies through to international concert promoters, Crescat is rigged for success. What's more, we highly value feedback from our users and we are constantly improving our software with updates, new features and improvements.
If you plan events, run a venue or produce festivals and you're looking for ways to make your life easier, then we have a solution for you. Try our software for free or schedule a no-obligation demo with one of our product specialists today at crescat.io
Hand Rolled Applicative User ValidationCode KataPhilip Schwarz
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather, to provide a small, rough-and ready exercise to reinforce your muscle-memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeAftab Hussain
Understanding variable roles in code has been found to be helpful by students
in learning programming -- could variable roles help deep neural models in
performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
Atelier - Innover avec l’IA Générative et les graphes de connaissancesNeo4j
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Allez au-delà du battage médiatique autour de l’IA et découvrez des techniques pratiques pour utiliser l’IA de manière responsable à travers les données de votre organisation. Explorez comment utiliser les graphes de connaissances pour augmenter la précision, la transparence et la capacité d’explication dans les systèmes d’IA générative. Vous partirez avec une expérience pratique combinant les relations entre les données et les LLM pour apporter du contexte spécifique à votre domaine et améliorer votre raisonnement.
Amenez votre ordinateur portable et nous vous guiderons sur la mise en place de votre propre pile d’IA générative, en vous fournissant des exemples pratiques et codés pour démarrer en quelques minutes.
Most important New features of Oracle 23c for DBAs and Developers. You can get more idea from my youtube channel video from https://youtu.be/XvL5WtaC20A
OpenMetadata Community Meeting - 5th June 2024OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed about the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j
Dr. Jesús Barrasa, Head of Solutions Architecture for EMEA, Neo4j
Découvrez les dernières innovations de Neo4j, et notamment les dernières intégrations cloud et les améliorations produits qui font de Neo4j un choix essentiel pour les développeurs qui créent des applications avec des données interconnectées et de l’IA générative.
How Can Hiring A Mobile App Development Company Help Your Business Grow?ToXSL Technologies
ToXSL Technologies is an award-winning Mobile App Development Company in Dubai that helps businesses reshape their digital possibilities with custom app services. As a top app development company in Dubai, we offer highly engaging iOS & Android app solutions. https://rb.gy/necdnt
Transform Your Communication with Cloud-Based IVR SolutionsTheSMSPoint
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
When it is all about ERP solutions, companies typically meet their needs with common ERP solutions like SAP, Oracle, and Microsoft Dynamics. These big players have demonstrated that ERP systems can be either simple or highly comprehensive. This remains true today, but there are new factors to consider, including a promising new contender in the market that’s Odoo. This blog compares Odoo ERP with traditional ERP systems and explains why many companies now see Odoo ERP as the best choice.
What are ERP Systems?
An ERP, or Enterprise Resource Planning, system provides your company with valuable information to help you make better decisions and boost your ROI. You should choose an ERP system based on your company’s specific needs. For instance, if you run a manufacturing or retail business, you will need an ERP system that efficiently manages inventory. A consulting firm, on the other hand, would benefit from an ERP system that enhances daily operations. Similarly, eCommerce stores would select an ERP system tailored to their needs.
Because different businesses have different requirements, ERP system functionalities can vary. Among the various ERP systems available, Odoo ERP is considered one of the best in the ERp market with more than 12 million global users today.
Odoo is an open-source ERP system initially designed for small to medium-sized businesses but now suitable for a wide range of companies. Odoo offers a scalable and configurable point-of-sale management solution and allows you to create customised modules for specific industries. Odoo is gaining more popularity because it is built in a way that allows easy customisation, has a user-friendly interface, and is affordable. Here, you will cover the main differences and get to know why Odoo is gaining attention despite the many other ERP systems available in the market.
4. Synthetics
Stephen Falken: Uh, uh, General, what you see on these screens up
here is a fantasy; a computer-enhanced hallucination. Those blips
are not real missiles. They're phantoms. (War Games, 1983)
9. “Alert me if requests take longer than 200 ms”
10,10,10,10,10,10,10,10,10,5000
Alerts on one outlier in 10
Threshold Alerting
10. “Alert if request average over one minute
is longer than 200 ms”
avg(10,10,210,210,210,210) = 143 (860/6)
Does not alert on multiple high samples
Threshold Alerting
11. ‘average’ eq ‘arithmetic mean’
A=S/N
A = average
N = the number of terms
S = the sum of the numbers in the set
Math Refresher
12. median = midpoint of data set
The 50th percentile is 555 - q(0.5)
Value    111 222 333 444 555 666 777 888 999
Sample #   1   2   3   4   5   6   7   8   9
Math Refresher
13. 90th percentile - 90% of samples below it
The 90th percentile is 1,000 - q(0.9)
Value    111 222 333 444 555 666 777 888 999 1,000 1,111
Sample #   1   2   3   4   5   6   7   8   9    10    11
Math Refresher
14. 100th Percentile - the maximum value
The 100th percentile is 1,111 - q(1)
Value    111 222 333 444 555 666 777 888 999 1,000 1,111
Sample #   1   2   3   4   5   6   7   8   9    10    11
Math Refresher
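The math refresher above can be sketched in a few lines of Python. This is a minimal stdlib-only example using the slides' 11-sample data set and a nearest-rank quantile (the smallest value with at least q·N samples at or below it); the `quantile` helper is an illustration, not a standard-library function.

```python
import math
import statistics

# The 11 sample values from slides 13 and 14
values = [111, 222, 333, 444, 555, 666, 777, 888, 999, 1000, 1111]

average = sum(values) / len(values)   # A = S / N
median = statistics.median(values)    # q(0.5): 6th of 11 sorted samples
                                      # (slide 12 used 9 samples, hence 555 there)

def quantile(sorted_samples, q):
    """Nearest-rank quantile: smallest value with >= q*N samples at or below it."""
    rank = max(1, math.ceil(q * len(sorted_samples)))
    return sorted_samples[rank - 1]

print(average)                 # 646.0
print(median)                  # 666
print(quantile(values, 0.9))   # 1000 -- the 90th percentile
print(quantile(values, 1.0))   # 1111 -- the 100th percentile, the maximum
```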
22. Request latency
“We keep hearing from people that the
website is slow. But it is fine when we test it,
and the request latency graph is constant”
You are only looking at part of the picture.
25. Practical Percentiles
Bandwidth usage is often billed at 95th percentile usage
Record 5 minute data usage intervals
Sort samples by value of sample
Throw out the highest 5% of samples
Charge usage based on the remaining top sample, i.e. 300 MB transferred over 5 minutes = 1 MB/s rate billing
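The billing steps above can be sketched as follows. The usage samples are hypothetical, and the nearest-rank method shown is one common way providers compute the 95th percentile.

```python
import math

def p95_billing_rate(samples_mb):
    """Drop the top 5% of 5-minute usage samples; bill on the highest remaining one."""
    ordered = sorted(samples_mb)
    keep = math.ceil(0.95 * len(ordered))   # rank of the 95th-percentile sample
    top_sample_mb = ordered[keep - 1]
    return top_sample_mb / 300              # MB per 5-minute interval -> MB/s

# 100 intervals: steady 300 MB usage plus a few bursts that get thrown out
samples = [300] * 95 + [900, 1200, 1500, 2000, 5000]
print(p95_billing_rate(samples))  # 1.0 MB/s -- the bursts above p95 are not billed
```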
26. Practical Percentiles
If I measure 95th percentile per 5 minutes all
month long,
I CANNOT calculate 95th percentile over the
month.
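A tiny demonstration of why this is so: the average of per-window 95th percentiles is not the 95th percentile of the combined data. The two windows below are hypothetical, and `p95` uses the nearest-rank method.

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

window_a = [10] * 18 + [500, 1000]   # a bad minute: p95 = 500
window_b = [10] * 20                 # a quiet minute: p95 = 10

averaged = (p95(window_a) + p95(window_b)) / 2
combined = p95(window_a + window_b)

print(averaged)   # 255.0 -- what "averaging percentiles" would report
print(combined)   # 10    -- the real p95 of all 40 samples
```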
29. “Alert me if request latency 90th percentile
over one minute exceeds 200 ms”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,10,10,10,5000] == 10
Alert IS NOT triggered
Do you want to be woken up for this? NO!
30. “Alert me if request latency 90th percentile
over one minute exceeds 200 ms”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,250,300] = ~270
Alert IS triggered
Do you want to be woken up for this? YES!
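The alerting rule from these two slides can be sketched as below. This uses the nearest-rank quantile; the slides appear to interpolate for the second data set (~270 rather than 300), but the alert decision is the same either way.

```python
import math

def q(samples, quantile):
    """Nearest-rank quantile of a sample set."""
    ordered = sorted(samples)
    return ordered[max(1, math.ceil(quantile * len(ordered))) - 1]

def should_alert(latencies_ms, threshold_ms=200):
    """Alert when the 90th-percentile latency over the window exceeds the threshold."""
    return q(latencies_ms, 0.9) > threshold_ms

one_outlier = [10] * 9 + [5000]                  # one slow request among ten
many_slow = [10, 10, 10, 10, 10, 10, 250, 300]   # a quarter of requests are slow

print(should_alert(one_outlier))  # False -- don't wake anyone for a single outlier
print(should_alert(many_slow))    # True  -- a real latency shift
```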
A synthetic is basically a bot check against your system. One of the benefits (perhaps the only benefit) of the synthetic is that it’s more highly available than the application you are monitoring.
The responses from synthetic requests don't tell you anything meaningful about how actual users experience your application.
What am I looking at here? This is a time series graph of response times from synthetic login checks against a website. The results are remarkably consistent, as they should be.
It gives you the viewpoint of one user - a computer somewhere dispatches a request over the same network route to your server. It records several metrics about how your application responds; time to start the ssl connection, time to the first byte served, average request time...
Those metrics are not only useless (unless anyone here runs a service just for one user… in that case, kudos), they lie to you. These are LIES. They falsely represent the health of your application. All you really get is a binary: is the service up, or is the service down?
Your user base will likely have a distribution of ages, genders, devices, network connections.
The synthetic check used an external user agent, but you can use collection tools like statsd or log analysis to record request times for real users. This is better than only using a synthetic check, but this technique still has a number of shortcomings. The first is that collection data is averaged over an interval (generally 10 seconds to a minute).
So if Cyndi, Bobby, and Mike are all shopping at your website at the same time, you only see the average of their request times over a given interval. Bobby might be having a great experience on gig-e while Cyndi struggles on 10 megabit, but with Mike in between on 100 megabit, the average looks like one middle-of-the-road experience and both extremes disappear from view.
The second shortcoming of a time series average value graph is spike erosion, an artifact of downsampling. Spike erosion is what you see when you zoom in on specific areas of a time series graph: as you zoom in, the data is averaged over intervals closer to the actual collection intervals. As you can see on this graph, when we zoom into a 2 hour view of the graph we just looked at, the maximum value we see is now 2,000 milliseconds instead of 500 milliseconds - four times what the zoomed-out view showed.
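Spike erosion is easy to reproduce. The sketch below uses hypothetical one-second latency samples: averaging them into one-minute display buckets shrinks a 2,000 ms spike to roughly 130 ms.

```python
# An hour of 1-second samples at 100 ms, with one 2,000 ms spike
raw = [100] * 3600
raw[1800] = 2000

def downsample(samples, width):
    """Average consecutive samples into buckets 'width' samples wide."""
    return [sum(samples[i:i + width]) / width
            for i in range(0, len(samples), width)]

print(max(raw))                   # 2000 -- what the zoomed-in view shows
print(max(downsample(raw, 60)))   # ~131.7 -- what the hour-wide averaged view shows
```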
If you alert based on values you get from the graphs I’ve shown, what value do you alert on? As you’ve seen, avoiding false positives with a static threshold is nearly impossible.
A static threshold is flawed here because a single outlier sample will trigger the alert; the common workaround is to alert on an average instead.
But averaging fails in the opposite direction: with 200 ms as the “too slow” limit, four of the six samples (66% of the population) are over 200 ms, yet no alert is thrown. This is the workaround people use to avoid the outlier in the previous slide.
The 0th quantile, q(0), is the minimum - the first element of the sorted data set.
A histogram is one of the seven basic tools of quality. The Y axis indicates the number of samples, and the X axis indicates the sample value. One use of a histogram that you may have seen is plotting human height against the number of people who are that tall.
Human height follows what is called a normal distribution (also known as a Gaussian distribution). The majority of the population tends to group around one value and tapers off at the high and low sample values. With a perfect normal distribution, the arithmetic mean (the average) and the median are one and the same.
The mode is also equal to the median. You’ve most likely heard the term standard deviation before. With a normal distribution, 68% of the values lie within one standard deviation on either side of the mean, 95% within 2 standard deviations, and 99.7% within 3 sigma. The smaller the standard deviation, the closer the data is to the mean; the larger one sigma is, the farther the data spreads from the mean. It is important to note that these metrics only make sense for a normal distribution, where there is a single mode.
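The one/two/three-sigma figures quoted above can be checked directly with the standard library's `statistics.NormalDist` (available in Python 3.8+):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    within = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} sigma: {within:.1%}")
# within 1 sigma: 68.3%
# within 2 sigma: 95.4%
# within 3 sigma: 99.7%
```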
This is a non normal distribution. In this example, there are large numbers of samples grouped at the highest and lowest sample values. Because there are two distinct peaks, this is called a bimodal distribution (or multi-modal distribution). In a multimodal distribution like this, standard deviation and multi-sigma values are useless.
This is another non-normal distribution. As you can see, it only has one mode, and is a skewed distribution. Standard deviation has little to no meaning here.
Here is a histogram of web page request time. The higher the bar, the more users are affected. This is a highly skewed distribution - notice the grouping between the spike at ~150 milliseconds, and the long tail past there. There’s another smaller spike at ~25 ms, so this is mostly a bimodal distribution.
In terms of website performance, people will generally get angry if request times take longer than 250 milliseconds. So what we see here is a bunch of users who are getting acceptable response times, and a long tail of pissed off users.
People on left side are having a great experience, people on right side are leaving the site.
Note that this is for a time slice, say 5 minutes. What does this look like if we integrate over time?
Heat maps are visual representations of histograms over time windows; they give you a visualization of data distributions over time. With heat maps, you can add percentile overlays to show the 50th, 95th, or any other percentile across time slices.
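A percentile overlay is computed one value per time window, from that window's latency distribution. The sketch below uses hypothetical randomly generated latencies for three 5-minute windows and a nearest-rank quantile helper.

```python
import math
import random

random.seed(42)  # reproducible hypothetical data

def q(samples, quantile):
    """Nearest-rank quantile of a sample set."""
    ordered = sorted(samples)
    return ordered[max(1, math.ceil(quantile * len(ordered))) - 1]

# Three 5-minute windows of 500 simulated latencies each (ms)
windows = [[random.gauss(150, 30) for _ in range(500)] for _ in range(3)]

p50_overlay = [q(w, 0.50) for w in windows]   # the median line across time
p95_overlay = [q(w, 0.95) for w in windows]   # the 95th-percentile line
print(p50_overlay)
print(p95_overlay)
```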
A percentile is a barrier dividing the data set in two: for the 95th percentile, 95% of the samples lie to its left and the remaining 5% to its right. There is a caveat when the barrier lands exactly on a data point: counting from the right and including the barrier value gives you >= 95% of the data set, while counting from the left of the barrier gives you <= 95%. If you have just two samples, the median is any value between them, and samples sitting on the barrier are counted on both sides. These are bespoke things you probably didn’t know about histograms; for the purpose of our examples, we’ll avoid these edge cases. One example: in a histogram of a single value, the ⅓ quantile and the ⅔ quantile are equal, so the two portions they define add up to more than 100% (everything is measured twice).
Percentiles cannot be averaged. You have to calculate them from the raw usage data. There are several monitoring solutions out there that will let you average percentiles - this is flat out WRONG
What’s your SLA? If you set your 95th percentile target at 250 ms and you meet your SLA, you’re still pissing off 5% of your users. They’re going to your competitor. Let’s try to calculate how many users you are screwing.
Take the number of requests outside your 95 percentile (the 5th percent inverse quantile), and integrate that over time to get a cumulative number of users that you’ve screwed. Multiply that times the dollar value of each lost request - that’s how much money you’re losing.
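The arithmetic above can be sketched as follows. Both the latency samples and the dollar value per lost request are hypothetical assumptions for illustration.

```python
SLA_MS = 250               # the 95th-percentile latency target
VALUE_PER_REQUEST = 0.05   # assumed dollar value of one lost request

# Hypothetical latency samples (ms) for three collection intervals
intervals = [
    [120, 180, 300, 90, 260],
    [110, 140, 150, 270, 310, 400],
    [100, 130, 170],
]

# Count requests beyond the threshold per interval, integrated over time
slow_requests = sum(sum(1 for ms in window if ms > SLA_MS)
                    for window in intervals)

print(slow_requests)                       # 5 requests over the SLA threshold
print(slow_requests * VALUE_PER_REQUEST)   # $0.25 of lost value
```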
Circonus.com allows you to set percentile based alerts, so that you’ll be alerted if users start getting pissed off. Here is a percentile based alert - you can expand that to alert based on number of users pissed off per hour. Or even translate that to a dollar value using CAQL (circonus analytics query language). So you can say ‘alert me if we are losing more than $500 worth of users per hour’. This is something you’ll never be able to do with threshold based alerting. Thus, you can set a limit that is essentially normalized to traffic loads, say holiday sale surges.