Better service monitoring
through histograms
Fred Moyer - @phredmoyer
San Francisco Perl Mongers, 07-26-2016
Systems break while we sleep
How often are you woken up for false alarms?
Welcome
Synthetics
Easy to set up, but
not a real user
Synthetics
Stephen Falken: Uh, uh, General, what you see on these screens up
here is a fantasy; a computer-enhanced hallucination. Those blips
are not real missiles. They're phantoms. (WarGames, 1983)
Real Users
These are your
users, right?
Real data
Real Users
500 ms is really 2,000 ms
Spike Erosion
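A minimal sketch of spike erosion, assuming a graph that averages consecutive points when zoomed out. The series and the `downsample` helper are hypothetical; the values are chosen so that a 2,000 ms spike shows up as 500 ms after averaging, as on the slide.

```python
def downsample(series, factor):
    """Average consecutive groups of `factor` points, as a zoomed-out graph might."""
    return [
        sum(series[i:i + factor]) / len(series[i:i + factor])
        for i in range(0, len(series), factor)
    ]

# Hypothetical per-minute request latencies (ms) with a single 2,000 ms spike.
raw = [125, 125, 125, 125, 2000, 125, 125, 125, 125, 125]

zoomed_out = downsample(raw, 5)

print(max(raw))         # 2000
print(max(zoomed_out))  # 500.0
```

The spike is still in the raw data; it only disappears from the picture once each plotted point is an average over a wider interval.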
What threshold do you choose?
Threshold Alerting
“Alert me if requests take longer than 200 ms”
10,10,10,10,10,10,10,10,10,5000
Alerts on one outlier in 10
Threshold Alerting
“Alert if request average over one minute
is longer than 200 ms”
avg(10,10,210,210,210,210) = 143 (860/6)
Does not alert on multiple high samples
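The two threshold rules above can be sketched as follows. The function names are hypothetical; the sample windows and the 200 ms threshold are the ones on the slides.

```python
from statistics import mean

THRESHOLD_MS = 200  # threshold quoted on the slides

def alert_on_any_sample(samples, threshold=THRESHOLD_MS):
    """Fire if any single request is slower than the threshold."""
    return any(s > threshold for s in samples)

def alert_on_average(samples, threshold=THRESHOLD_MS):
    """Fire if the arithmetic mean of the window is above the threshold."""
    return mean(samples) > threshold

outlier_window = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
slow_window = [10, 10, 210, 210, 210, 210]

print(alert_on_any_sample(outlier_window))  # True: one outlier in 10 pages you
print(round(mean(slow_window), 1))          # 143.3
print(alert_on_average(slow_window))        # False: four slow requests go unnoticed
```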
Threshold Alerting
‘average’ eq ‘arithmetic mean’
A=S/N
A = average
N = the number of terms
S = the sum of the numbers in the set
Math Refresher
median = midpoint of data set
The 50th percentile is 555 - q(0.5)
Value     111  222  333  444  555  666  777  888  999
Sample #    1    2    3    4    5    6    7    8    9
Math Refresher
90th percentile - 90% of samples below it
The 90th percentile is 1,000 - q(0.9)
Value     111  222  333  444  555  666  777  888  999  1,000  1,111
Sample #    1    2    3    4    5    6    7    8    9     10     11
Math Refresher
100th Percentile - the maximum value
The 100th percentile is 1,111 - q(1)
Value     111  222  333  444  555  666  777  888  999  1,000  1,111
Sample #    1    2    3    4    5    6    7    8    9     10     11
Math Refresher
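The three percentile examples above can be reproduced with a nearest-rank quantile. This is a sketch under that assumption: the slides do not say which quantile definition they use, and interpolating definitions can give different answers on small samples. The `q` name follows the slides' notation.

```python
import math

def q(quantile, samples):
    """Nearest-rank quantile: the sample at rank ceil(quantile * N) in sorted order."""
    ordered = sorted(samples)
    if quantile <= 0:
        return ordered[0]  # the 0th quantile is the first element
    rank = math.ceil(quantile * len(ordered))
    return ordered[rank - 1]

nine = [111, 222, 333, 444, 555, 666, 777, 888, 999]
eleven = nine + [1000, 1111]

print(q(0.5, nine))    # 555
print(q(0.9, eleven))  # 1000
print(q(1, eleven))    # 1111
```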
Histogram [figure: number of samples (Y axis) vs. sample value (X axis)]
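A histogram is just a count of samples per value range. A minimal sketch, assuming fixed-width bins; the latencies and the 50 ms bin width are hypothetical.

```python
from collections import Counter

BIN_WIDTH_MS = 50  # hypothetical bin width

def histogram(samples, bin_width):
    """Count samples per bin; each key is the lower edge of a bin."""
    return Counter((s // bin_width) * bin_width for s in samples)

# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 18, 25, 27, 140, 148, 152, 155, 159, 310, 480]

counts = histogram(latencies_ms, BIN_WIDTH_MS)

# Print one row per bin: the value range, then one '#' per sample in that bin.
for lower_edge in sorted(counts):
    upper_edge = lower_edge + BIN_WIDTH_MS - 1
    print(f"{lower_edge:>4}-{upper_edge:<4} {'#' * counts[lower_edge]}")
```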
Normal Distribution [figure: number of samples (Y axis) vs. sample value (X axis)]
Normal Distribution [figure: same axes, annotated "34% within one sigma (σ)"]
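The 34% figure can be checked against the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library (3.8+); the `fraction_within` helper is a hypothetical name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# One side of the mean only: roughly the 34% shown on the slide.
print(round(fraction_within(1) / 2, 4))  # 0.3413
```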
Non-Normal Distribution [figure: number of samples (Y axis) vs. sample value (X axis)]
Non-Normal Distribution [figure: number of samples (Y axis) vs. sample value (X axis)]
Non-Normal Distribution
Operations data groups at different points
Non-Normal Distribution
Users to the right of the red line are gone
Request latency
“We keep hearing from people that the
website is slow. But it is fine when we test it,
and the request latency graph is constant”
You are only looking at part of the picture.
Heat Map
Histograms over time windows
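The data behind a heat map can be sketched as one histogram per time window. The event list, window size, and bin width below are hypothetical; a real heat map would render each count as a colored cell.

```python
from collections import Counter, defaultdict

def heat_map(events, window_seconds, bin_width_ms):
    """Group (timestamp, latency) events into one latency histogram per time window."""
    grid = defaultdict(Counter)
    for timestamp, latency_ms in events:
        window = (timestamp // window_seconds) * window_seconds
        latency_bin = (latency_ms // bin_width_ms) * bin_width_ms
        grid[window][latency_bin] += 1
    return grid

# Hypothetical (seconds since start, latency in ms) pairs.
events = [(0, 20), (30, 25), (59, 160), (61, 150), (90, 155), (119, 900)]

grid = heat_map(events, window_seconds=60, bin_width_ms=50)
for window in sorted(grid):
    print(window, dict(sorted(grid[window].items())))
# 0 {0: 2, 150: 1}
# 60 {150: 2, 900: 1}
```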
Percentiles
Practical Percentiles
Bandwidth usage is often billed at 95th percentile usage
Record 5 minute data usage intervals
Sort samples by value of sample
Throw out the highest 5% of samples
Charge usage based on the remaining top sample, i.e. 300
MB transferred over 5 minutes = 1 MB/s rate billing
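The billing steps above can be sketched as follows. The usage figures and the function name are hypothetical; they are chosen so the remaining top sample is 300 MB, matching the slide's 1 MB/s result.

```python
def billable_rate_mb_per_s(interval_mb, interval_seconds=300, discard_percent=5):
    """Sort the interval samples, drop the highest 5%, and bill on the top one left."""
    ordered = sorted(interval_mb)
    discard = len(ordered) * discard_percent // 100  # number of samples thrown out
    kept = ordered[:len(ordered) - discard]
    return kept[-1] / interval_seconds

# Hypothetical MB transferred in each of twenty 5-minute intervals.
usage_mb = [120] * 10 + [200] * 8 + [300, 900]

print(billable_rate_mb_per_s(usage_mb))  # 1.0
```

With twenty samples, 5% is exactly one sample, so the 900 MB burst is discarded and the 300 MB interval sets the rate.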
Practical Percentiles
If I measure 95th percentile per 5 minutes all
month long,
I CANNOT calculate 95th percentile over the
month.
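A small sketch of why per-window percentiles cannot be combined by averaging. It uses a nearest-rank quantile (an assumption, the slides do not name a definition) and two hypothetical windows of very different size.

```python
import math

def q(quantile, samples):
    """Nearest-rank quantile: the sample at rank ceil(quantile * N) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]

quiet_window = [10] * 1000  # 1,000 fast requests (10 ms each)
slow_window = [1000] * 10   # 10 slow requests (1,000 ms each)

average_of_p95s = (q(0.95, quiet_window) + q(0.95, slow_window)) / 2
true_p95 = q(0.95, quiet_window + slow_window)

print(average_of_p95s)  # 505.0
print(true_p95)         # 10
```

Averaging treats both windows as equally important even though one holds 100 times more requests. Computing the percentile from the combined raw samples gives the correct answer.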
Angry users
How many users are you pissing off?
Angry users
“Alert me if request latency 90th percentile
over one minute is exceeded”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,10,10,5000] == 10
Alert IS NOT triggered
Do you want to be woken up for this? NO!
“Alert me if request latency 90th percentile
over one minute is exceeded”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,250,300] = ~270
Alert IS triggered
Do you want to be woken up for this? YES!
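The two percentile alerts above can be sketched with a nearest-rank quantile. Several things here are assumptions: the 200 ms limit is carried over from the earlier threshold slides, the first window is the ten-sample one from the threshold example, and nearest-rank returns 300 for the second window where the slide shows ~270 (presumably from an interpolating estimator). The alert decision is the same either way.

```python
import math

def q(quantile, samples):
    """Nearest-rank quantile: the sample at rank ceil(quantile * N) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]

def p90_alert(samples, threshold_ms=200):
    """Fire when the 90th percentile of the window exceeds the threshold."""
    return q(0.9, samples) > threshold_ms

one_outlier = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
many_slow = [10, 10, 10, 10, 10, 10, 250, 300]

print(q(0.9, one_outlier), p90_alert(one_outlier))  # 10 False
print(q(0.9, many_slow), p90_alert(many_slow))      # 300 True
```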
Percentile based alerting
Who’s using this approach?
Google.com
Circonus.com
You?
Questions?
Thanks to Circonus.com for the tools and help
with the math
http://www.circonus.com/free-account/

Editor's Notes

  1. A synthetic is basically a bot check against your system. One of the benefits (perhaps the only benefit) of the synthetic is that it’s more highly available than the application you are monitoring. The responses from synthetic requests don’t tell you anything meaningful about how actual users experience your application.
  2. What am I looking at here? This is a time series graph of response times from synthetic login checks against a website. The results are remarkably consistent, as they should be. It gives you the viewpoint of one user - a computer somewhere dispatches a request over the same network route to your server. It records several metrics about how your application responds: time to start the SSL connection, time to the first byte served, average request time... Those metrics are not only useless (unless anyone here runs a service just for one user… in that case, kudos), they lie to you. These are LIES. They falsely represent the health of your application. It’s a binary - is the service up, or is the service down? That’s all you get.
  3. Your user base will likely have a distribution of ages, genders, devices, network connections.
  4. The synthetic check used an external user agent, but you can use collection tools like statsd or log analysis to record request times for real users. This is better than only using a synthetic check, but this technique still has a number of shortcomings. The first is that collection data is averaged over an interval (generally 10 seconds to a minute). So if Cyndi, Bobby, and Mike are all shopping at your website at the same time, you only see the average of their request times over a given interval. Mike might be having a great experience because his office network is 100 megabit, while Bobby is on gig-e and Cyndi is on 10 megabit, but you’ll see only one averaged view of the website user experience.
  5. The second shortcoming of a time series average value graph is spike erosion, also known as downsampling. Spike erosion is what you uncover when you zoom in on specific areas of a time series graph. As you zoom in, the data is averaged over intervals closer to the actual collection intervals. As you can see on this graph, when we zoom into a 2 hour view of the graph we just looked at, the maximum value we see now is 2,000 milliseconds instead of 500 milliseconds. That’s four times higher (a 300% increase)!
  6. I don’t like this image - find a better one. If you alert based on values you get from the graphs I’ve shown, what value do you alert on? As you’ve seen, avoiding false positives is impossible.
  7. Correct this since one sample will trigger alert, use average alert instead
  8. 200 ms is too slow, so we take an average. 66% of the samples are over 200 ms, yet no alert is thrown. This is the solution people use to avoid the outlier in the previous slide.
  9. 0th quantile is first element
  10. 0th quantile is first element
  11. A histogram is one of the seven basic tools of quality. The Y axis indicates the number of samples, and the X axis indicates the sample value. One use of a histogram that you may have seen is plotting human height vs number of people who are that tall.
  12. Human height follows what is called a normal distribution (also known as a Gaussian distribution). The majority of the population tends to group around one value, and tapers off at the high and low sample values. With a perfect normal distribution, the arithmetic mean (the average) and the median are one and the same.
  13. The mode is also equal to the median. You’ve most likely heard the term standard deviation before. With a normal distribution, 68% of the values lie within one standard deviation of the mean (34% on each side), 95% within two standard deviations, and 99.7% within three. The smaller the standard deviation, the closer the data is to the mean; the larger it is, the farther the data spreads from the mean. It is important to note that these metrics only make sense for a normal distribution, where there is a single mode.
  14. This is a non-normal distribution. In this example, there are large numbers of samples grouped at the highest and lowest sample values. Because there are two distinct peaks, this is called a bimodal distribution (or multi-modal distribution). In a multimodal distribution like this, standard deviation and multi-sigma values are useless.
  15. This is another non-normal distribution. As you can see, it only has one mode, and is a skewed distribution. Standard deviation has little to no meaning here.
  16. Here is a histogram of web page request time. The higher the bar, the more users are affected. This is a highly skewed distribution - notice the grouping between the spike at ~150 milliseconds, and the long tail past there. There’s another smaller spike at ~25 ms, so this is mostly a bimodal distribution. In terms of website performance, people will generally get angry if requests take longer than 250 milliseconds. So what we see here is a bunch of users who are getting acceptable response times, and a long tail of pissed off users. People on the left side are having a great experience; people on the right side are leaving the site.
  17. Here is a histogram of web page request time. The higher the bar, the more users are affected. This is a highly skewed distribution - notice the grouping between the spike at ~150 milliseconds, and the long tail past there. There’s another smaller spike at ~25 ms, so this is mostly a bimodal distribution. In terms of website performance, people will generally get angry if requests take longer than 250 milliseconds. So what we see here is a bunch of users who are getting acceptable response times, and a long tail of pissed off users. People on the left side are having a great experience; people on the right side are leaving the site. Note that this is for a time slice, say 5 minutes. What does this look like if we integrate over time?
  18. Heat maps are visual representations of histograms over time windows. They give you a visualization of data distributions over time.
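As a rough sketch of the data behind a heat map: bucket each sample by time window and by latency bin, and the count in each cell becomes the color intensity. The function name, the 5-minute window, the 50 ms bin width, and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def heatmap_counts(samples, window_s=300, bin_ms=50):
    """Map each time window to a histogram (bin lower edge -> sample count)."""
    grid = defaultdict(Counter)
    for timestamp_s, latency_ms in samples:
        window = int(timestamp_s // window_s) * window_s  # start of the time slice
        bucket = int(latency_ms // bin_ms) * bin_ms       # lower edge of the latency bin
        grid[window][bucket] += 1
    return grid

# Hypothetical (timestamp in seconds, latency in ms) pairs.
samples = [(0, 20), (10, 30), (200, 160), (310, 140), (320, 900), (599, 155)]
grid = heatmap_counts(samples)

for window in sorted(grid):
    print(window, dict(sorted(grid[window].items())))
```

Each row of the output is one histogram; stacking the rows side by side along the time axis gives the heat map.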
  19. With heat maps, you can add percentile overlays to show the 50th, 95th, and any other percentile over the time slices.
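A percentile overlay is just one percentile value per time slice, drawn on top of the heat map. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions and an assumption here; the slice labels and latencies are invented.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies (ms) for two 5-minute slices.
slices = {
    "12:00": [10, 12, 15, 20, 150, 160, 170, 180, 200, 900],
    "12:05": [11, 13, 14, 25, 140, 150, 155, 165, 400, 2000],
}

# One (p50, p95) point per slice; plotted over time these form the overlay lines.
overlay = {t: (percentile(v, 50), percentile(v, 95)) for t, v in slices.items()}
for t, (p50, p95) in overlay.items():
    print(f"{t}  p50={p50} ms  p95={p95} ms")
```

Note how the median barely moves between the two slices while the 95th percentile more than doubles, which is the kind of change an overlay makes visible.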
  20. A percentile is a barrier that divides the data set into two sets: for the 95th percentile, 95% of the samples are to the left and the remaining 5% are to the right. There is a caveat when the barrier hits in the middle of the data points. If you measure on the right including the barrier, you get >= 95th percentile of the whole data set; if you measure to the left of the barrier, <= 95%. Samples on the barrier are counted twice. If you have two samples, the median is every value between those two samples. If you see a histogram where the ⅓ quantile and ⅔ quantile are equal in value, they add up to > 100%. A histogram of one value is one example (everything is measured twice). Compare the data sets 1,2 and 1,2,3. Have a slide that says: bespoke things you probably didn’t know about histograms. For the purpose of our examples, we’ll avoid these edge cases.
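The two-sample caveat can be seen with Python's standard `statistics` module, which offers three medians precisely because an even-sized data set has no single middle sample. This is only an illustration of the 1,2 versus 1,2,3 example, not a full treatment of the edge cases above.

```python
from statistics import median, median_low, median_high

for data in ([1, 2], [1, 2, 3]):
    # median_low / median_high return actual samples; median interpolates between them.
    print(data, "low:", median_low(data), "mid:", median(data), "high:", median_high(data))
```

For 1,2 every value from 1 to 2 splits the data in half, so the three functions disagree; for 1,2,3 they all return 2.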
  21. Percentiles cannot be averaged. You have to calculate them from the raw usage data. There are several monitoring solutions out there that will let you average percentiles - this is flat out WRONG.
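A small worked example of why averaging goes wrong. The two hosts and their latencies are invented, and the nearest-rank percentile is an assumed definition: one busy, fast host and one quiet, slow host.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of the raw samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical raw latencies in ms.
host_a = [10] * 90 + [20] * 10   # 100 fast requests
host_b = [1000] * 10             # 10 slow requests

# Wrong: average the per-host 95th percentiles.
averaged_p95 = (percentile(host_a, 95) + percentile(host_b, 95)) / 2

# Right: compute the 95th percentile from all raw samples together.
true_p95 = percentile(host_a + host_b, 95)

print(f"average of per-host p95s: {averaged_p95} ms")
print(f"p95 of the raw data:      {true_p95} ms")
```

The averaged figure (510 ms) is a latency no request ever had, and it understates the real 95th percentile (1,000 ms) by about half.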
  22. What’s your SLA? If you set your 95th percentile at 250 ms, and you meet your SLA, you’re pissing off 5% of your users. They’re going to your competitor. Let’s try to calculate how many users you are screwing.
  23. Take the number of requests outside your 95th percentile (the 5th percent inverse quantile), and integrate that over time to get a cumulative number of users that you’ve screwed. Multiply that by the dollar value of each lost request - that’s how much money you’re losing.
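The calculation above can be sketched as follows. The 250 ms threshold comes from the SLA example in the talk; the per-slice latencies, the function name, and the dollar value per lost request are hypothetical.

```python
SLA_MS = 250                # threshold from the SLA example
DOLLARS_PER_REQUEST = 0.50  # hypothetical value of one lost request

def requests_over_sla(latencies_ms, sla_ms=SLA_MS):
    """Count the requests in one time slice that were slower than the SLA."""
    return sum(1 for latency in latencies_ms if latency > sla_ms)

# Hypothetical raw latencies (ms), one list per time slice.
slices = [
    [120, 180, 260, 900],
    [100, 240, 250, 300, 1200],
    [90, 110],
]

# "Integrate over time": sum the per-slice counts.
total_slow = sum(requests_over_sla(s) for s in slices)
dollars_lost = total_slow * DOLLARS_PER_REQUEST

print(f"{total_slow} slow requests, roughly ${dollars_lost:.2f} lost")
```

Because the count is taken from raw samples in each slice, it scales with traffic: a busy slice with the same fraction of slow requests contributes more lost users than a quiet one.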
  24. Circonus.com allows you to set percentile-based alerts, so that you’ll be alerted if users start getting pissed off. Here is a percentile-based alert - you can expand that to alert based on the number of users pissed off per hour, or even translate that to a dollar value using CAQL (Circonus Analytics Query Language). So you can say ‘alert me if we are losing more than $500 worth of users per hour’. This is something you’ll never be able to do with threshold-based alerting. Thus, you can set a limit that is essentially normalized to traffic loads, say holiday sale surges.