Latency SLOs Done Right
Heinrich Hartmann
SRECon 2019, Dublin
@heinrichhartmann
For any API you care about ...
- How many requests in August were served within 100ms?
- How many requests in August were served within 150ms?
- How many requests in August were served within 180ms?
- How many requests on June 16th, 9:12-9:35, were served within 100ms?
Hi, I am Heinrich
- Data Scientist at Circonus
- PhD in Mathematics
- Talks about #StatisticsForEngineers
- Lives in Stemwede, Germany
Agenda
- Why monitor Latency?
- What is an SLO?
- Three methods to calculate Latency SLOs
- Method 1
- Method 2
- Method 3
- Conclusion
Why monitor Latency?
Latency is a key performance indicator for any API.
Four Golden Signals from the SRE Book:
> “Latency, Traffic, Errors, and Saturation”
This later became the RED Method:
Requests, Errors, Duration.
What is an SLO?
Service Level Indicator (SLI)
A metric that quantifies the quality or reliability of your service.
Typically of the form “Good Events / Valid Events * 100”.
Service Level Objective (SLO)
A target value for the service level, as measured by an SLI,
that sets expectations about how the service will perform.
Service Level Agreement (SLA)
What happens if a published SLO is not met?
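As a minimal sketch of the "Good Events / Valid Events * 100" form, the SLI and the SLO check can be written as two small functions. The function names and the event counts are illustrative assumptions, not part of the talk.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events * 100."""
    if valid_events <= 0:
        raise ValueError("need at least one valid event")
    return 100 * good_events / valid_events


def slo_met(good_events: int, valid_events: int, target_percent: float) -> bool:
    """An SLO is met when the measured SLI reaches the target value."""
    return sli_percent(good_events, valid_events) >= target_percent


# Hypothetical counts for one reporting period.
print(sli_percent(9_950, 10_000))    # 99.5
print(slo_met(9_950, 10_000, 99.0))  # True
```

The SLA is deliberately left out: it describes the consequence of a missed SLO, not a computation.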
Availability
SLI
Once a minute, ssh into the target host; report 1 if it’s working, 0 if not.
SLO
Uptime of 99.9%, i.e. SLI = 1 for 99.9% of the time over the last month.
SLA
If we don’t meet our SLO in one month, you will get exactly one cake.
Latency SLOs Example from SRECon 2018*
SLI
“The proportion of valid requests that were served in < 1s”
SLO
“99% of valid requests in the past 28 days served in < 1s”
SLA
-
(*) SRECon 2018 : Developing Effective ... Fong-Jones, Bennett, Quinlan, Stockman, Thorne @ Google
Script, p16, we replaced “100ms” by “1s” for the example.
Was the SLO met?
SLO: 99% of valid requests in the past 28 days served in <1s
Source: Developing Effective ... Fong-Jones, Bennett, Quinlan, Stockman, Thorne @ Google, Script, p26
What if I told you, 99.5% of all
requests occurred here?
Was the SLO met?
SLO: 99% of valid requests in the past 6h served in <50ms
Source: Circonus, internal
Percentile Metrics can’t be used for SLOs
For SLOs we need to compute percentiles over ...
(1) multiple weeks of data
(2) multiple nodes (potentially).
But: Percentiles can’t be aggregated.
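A small numeric sketch shows why. The two nodes and their latencies are made-up values, and the nearest-rank percentile definition is one common choice assumed here for illustration.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(samples)
    rank = -(-q * len(ordered) // 100)  # ceil(q * n / 100) using integer arithmetic
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
node_a = [10] * 990 + [100] * 10  # busy node: 1000 requests, mostly 10 ms
node_b = [500] * 10               # nearly idle node: 10 requests, all 500 ms

# Aggregating the per-node percentiles by averaging them ...
averaged_p90 = (percentile(node_a, 90) + percentile(node_b, 90)) / 2

# ... versus computing the percentile over all requests.
true_p90 = percentile(node_a + node_b, 90)

print(averaged_p90)  # 255.0
print(true_p90)      # 10
```

The averaged value ignores that node_b served only 1% of the traffic, so it is off by more than an order of magnitude. The same effect appears when averaging percentiles over time windows with different request rates.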
Percentiles can’t be aggregated
Long story of discussions following my Monitorama 2016 talk:
Source: https://twitter.com/danslimmon/status/748295758827315201
Percentiles can’t be aggregated
Averaging API Latency Percentiles over 24 hours
Latency SLOs Done Right
Task
Count all requests over $period served faster than $threshold.
Three valid Methods
(1) Log data
(2) Counter Metrics
(3) Histogram Metrics
(1) Latency SLOs via Log Data
SELECT count(*) FROM logs
WHERE
time >= $period_start AND time < $period_end
AND
latency < $threshold
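The query above can be tried against an in-memory SQLite database. This is only a sketch: the table layout (ISO 8601 timestamps as text, latency in seconds), the sample rows, and the helper name `count_fast_requests` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (time TEXT, latency REAL)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [
        ("2018-08-01T10:00:00", 0.040),
        ("2018-08-15T12:30:00", 0.120),  # too slow
        ("2018-08-31T23:59:59", 0.090),
        ("2018-09-01T00:00:00", 0.030),  # outside the period
    ],
)


def count_fast_requests(conn, period_start, period_end, threshold):
    """Count requests in [period_start, period_end) served faster than threshold."""
    row = conn.execute(
        "SELECT count(*) FROM logs "
        "WHERE time >= ? AND time < ? AND latency < ?",
        (period_start, period_end, threshold),
    ).fetchone()
    return row[0]


fast = count_fast_requests(
    conn, "2018-08-01T00:00:00", "2018-09-01T00:00:00", 0.100
)
print(fast)  # 2
```

ISO 8601 timestamps sort lexicographically, which is why plain string comparison is enough to select the period here.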
(1) Latency SLOs via Log Data
+ Correct, Clean, Easy
- You need to keep all your log data for months
- ... which can get very expensive.
You can do this with: ssh+awk, ELK, Splunk, Honeycomb, etc.
(2) Latency SLOs with Counter Metrics
- Pick a latency $threshold, e.g. 1s
- Add a metric that counts how many requests were faster than $threshold
- ... store as e.g. “aws-eu.www22.GET./.lt_1s”
- ... for each node and endpoint you care about
- Sum / Integrate these metrics across nodes, endpoints and time:
find("aws-eu.www*.GET.*.lt_1s") | stats:sum() | integrate()
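The steps above can be sketched in a few lines of in-process code. The metric naming follows the example on the slide; the companion `.count` metric for the total number of requests, the helper names, and the sample requests are assumptions added for illustration. A real setup would emit these counters to a metrics store and sum them there.

```python
from collections import Counter

THRESHOLD_S = 1.0  # latency threshold, chosen upfront

counters = Counter()  # metric name -> cumulative count


def record_request(node, endpoint, latency_s):
    """Update the counters for one served GET request."""
    prefix = f"{node}.GET.{endpoint}"
    counters[f"{prefix}.count"] += 1      # hypothetical total-requests metric
    if latency_s < THRESHOLD_S:
        counters[f"{prefix}.lt_1s"] += 1  # requests faster than the threshold


# Hypothetical traffic on two nodes and two endpoints.
for node, endpoint, latency_s in [
    ("aws-eu.www21", "/", 0.2),
    ("aws-eu.www21", "/", 1.4),
    ("aws-eu.www22", "/", 0.3),
    ("aws-eu.www22", "/login", 0.9),
]:
    record_request(node, endpoint, latency_s)


def total(suffix):
    """Sum all counters with the given suffix across nodes and endpoints."""
    return sum(value for name, value in counters.items() if name.endswith(suffix))


fast_requests = total(".lt_1s")
all_requests = total(".count")
print(100 * fast_requests / all_requests)  # 75.0
```

Because counters are plain sums, they can be added across nodes, endpoints and time without losing correctness, which is exactly what percentiles do not allow.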
(2) Latency SLOs with Counter Metrics
[Figure: slow requests, 9.5 K / 106 K = 8.9%. Slide credit: @phredmoyer, #ObservabilitySummit]
(2) Latency SLOs via Counter Metrics
+ Easy, Correct
+ Cost effective
+ Full flexibility in choosing aggregation intervals
- Need to choose (multiple) latency thresholds upfront
You can do this with:
Prometheus (“Histograms”), Graphite, DataDog, VividCortex
HDR Histograms
HDR Histogram Data Structures
Store Latency distribution in sparse, log-linear histogram data structures:
- Total of 46,081 bins that cover the decimal float range ±10^±128 (HDR)
- One bin for each decimal float number with two significant digits (log-linear):
10,11,12, ... 100,110,120, ..., 1000, 1100, 1200,...
- Only store bins which have entries (sparse encoding)
- Typical size: 300 bytes per histogram, even with >100K entries
Open Source Implementations:
- hdrhistogram.org (Tene @ Azul Systems)
- github.com/circonus-labs/libcircllhist (Circonus, IronDB)
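The log-linear binning and sparse encoding described above can be illustrated with a short sketch. This is not the API of HdrHistogram or libcircllhist; the function `bin_of`, the (mantissa, exponent) representation, and the sample values are assumptions chosen to show the idea for positive values only.

```python
from collections import Counter
from decimal import Decimal


def bin_of(value):
    """Map a positive value to its two-significant-digit bin.

    Returns (mantissa, exponent) with 10 <= mantissa <= 99; the bin covers
    [mantissa * 10**exponent, (mantissa + 1) * 10**exponent).
    """
    d = Decimal(str(value))
    if d <= 0:
        raise ValueError("this sketch only handles positive values")
    exponent = d.adjusted() - 1          # scale so two digits sit before the point
    mantissa = int(d.scaleb(-exponent))  # truncate to two significant digits
    return mantissa, exponent


# Sparse encoding: only bins that received samples are stored.
histogram = Counter()
for latency_ms in [123, 127, 129.9, 45.6, 0.0456]:
    histogram[bin_of(latency_ms)] += 1

print(bin_of(123))     # (12, 1)  -> bin [120, 130)
print(bin_of(0.0456))  # (45, -3) -> bin [0.045, 0.046)
print(len(histogram))  # 3
```

Five samples spanning more than three orders of magnitude occupy only three bins, and every sample is located to within two significant digits.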
HDR Histogram Metric
(3) Latency SLOs with Histograms
1. Capture latency information of all relevant APIs as histogram metrics.
2. Aggregate latency histograms over nodes, endpoints and time.
Get total latency distribution over SLO timeframe (weeks, months).
3. Count samples in bins below the thresholds to compute SLOs.
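The three steps above can be sketched with plain dictionaries standing in for histogram metrics. The bin lower bounds (in milliseconds), the per-node sample counts, and the helper `fraction_below` are illustrative assumptions, not output of a real histogram library.

```python
from collections import Counter

# Step 1: one histogram per node, stored as {bin lower bound in ms: sample count}.
node_a = Counter({10: 600, 30: 250, 70: 100, 150: 50})
node_b = Counter({20: 700, 45: 200, 60: 80, 1000: 20})

# Step 2: merging histograms is bin-wise addition, across nodes and across time.
merged = node_a + node_b


def fraction_below(histogram, threshold_ms):
    """Step 3: share of samples in bins that lie entirely below the threshold.

    Exact only when the threshold coincides with a bin boundary.
    """
    below = sum(count for lower, count in histogram.items() if lower < threshold_ms)
    return below / sum(histogram.values())


print(fraction_below(merged, 50))   # 0.875
print(fraction_below(merged, 100))  # 0.965
```

The same merged histogram answers the question for any threshold after the fact, which is the flexibility counter metrics lack.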
(3) Latency SLOs with Histogram Metrics
89.3% of all requests in
August 2018 were
served within 50ms.
Demo
(3) Latency SLOs with Histograms
+ Full flexibility in choosing thresholds
+ Full flexibility in choosing aggregation intervals and levels
+ Cost effective (300 bytes per histogram value)
- Need HDR Histogram Instrumentation
- Need HDR Histogram Metric Store
You can do this with: Circonus, IronDB + Graphite / Grafana,
Google internal tooling.
(3) Alternative Mergeable Summaries
There are alternative “Mergeable Quantile Summaries” in the literature.
Available practical options include:
- circllhist -- Schlossnagle @ Circonus 2013
- HDR Histograms -- Tene @ Azul Systems 2015
- t-digest -- Dunning @ MapR, Ertl @ Dynatrace 2015
- DD-Sketch (Histogram) -- Masson @ DataDog 2019
Mergeable Summaries - Percentile Comparison
Aggregation Method | Accuracy: p90 (error %) | Performance(*): insert / merge / p90 (sec)
raw                | 35.774 (-)              | - / - / -
circllhist         | 35.773 (0.001%)         | 0.86 / 0.000262 / 0.000005
HDR Histogram      | 35.775 (0.000%)         | 3.59 / 0.003... / 0.003...
t-digest           | 35.803 (0.029%)         | 97.00 / 1.900... / 0.003...
DD-sketch          | 35.519 (0.256%)         | 2.39 / 0.003... / 0.000037
(*) This is a quick benchmark of the most popular Python libraries, performed on a 2015 MBP.
Notebook at https://github.com/HeinrichHartmann/Statistics-for-Engineers/
Conclusion
Percentile Metrics are not suitable for implementing Latency SLOs.
Histogram Metrics allow you to easily calculate arbitrary Latency SLOs.
If you don’t have Histogram metrics, you can do the following:
1. Use log data available over the last days to determine sensible latency
thresholds for your service.
2. Add counter metrics for the thresholds
3. Aggregate counter metrics as needed

Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptxTOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
TOP 10 B TECH COLLEGES IN JAIPUR 2024.pptx
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
AIR POLLUTION lecture EnE203 updated.pdf
AIR POLLUTION lecture EnE203 updated.pdfAIR POLLUTION lecture EnE203 updated.pdf
AIR POLLUTION lecture EnE203 updated.pdf
 
sieving analysis and results interpretation
sieving analysis and results interpretationsieving analysis and results interpretation
sieving analysis and results interpretation
 
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 

Latency SLOs Done Right @ SREcon EMEA 2019

  • 1. Latency SLOs Done Right Heinrich Hartmann SRECon 2019, Dublin @heinrichhartmann
  • 2. For any API you care about ... How many requests in August were served within 100ms? @heinrichhartmann
  • 3. For any API you care about ... How many requests in August were served within 150ms? @heinrichhartmann
  • 4. For any API you care about ... How many requests in August were served within 180ms? @heinrichhartmann
  • 5. For any API you care about ... How many requests on June 16th 9:12-9:35 were served within 100ms? @heinrichhartmann
  • 6. - Data Scientist at Circonus - PhD in Mathematics - Talks about #StatisticsForEngineers - Lives in Stemwede, Germany @heinrichhartmann Hi, I am Heinrich
  • 7. Agenda @heinrichhartmann - Why monitor Latency? - What is an SLO? - Three methods to calculate Latency SLOs - Method 1 - Method 2 - Method 3 - Conclusion
  • 8. Why monitor Latency? Latency is a key performance indicator for any API. Four Golden Signals from the SRE Book: > “Latency, Traffic, Errors, and Saturation” This later became the RED Method: Requests, Errors, Duration.
  • 9. What is an SLO? Service Level Indicator (SLI) A metric that quantifies the quality or reliability of your service. Typically of the form “Good Events / Valid Events * 100”. Service Level Objective (SLO) A target value for the service level, as measured by an SLI, that sets expectations about how the service will perform. Service Level Agreement (SLA) What happens if a published SLO is not met? @heinrichhartmann
  • 10. Availability SLI Once a minute, ssh into the target host, report 1 if it’s working 0 if not. SLO Uptime 99.9%, i.e. SLI = 1 over the last month 99.9% of the time. SLA If we don’t meet our SLO in one month, you will get exactly one cake. @heinrichhartmann
  • 11. Latency SLOs Example from SRECon 2018* SLI “The proportion of valid requests that were served within < 1s” SLO “99% of valid requests in the past 28 days served in < 1s” SLA - @heinrichhartmann (*) SRECon 2018 : Developing Effective ... Fong-Jones, Bennett, Quinlan, Stockman, Thorne @ Google Script, p16, we replaced “100ms” by “1s” for the example.
  • 12. Was the SLO met? @heinrichhartmann SLO: 99% of valid requests in the past 28 days served in <1s Source: Developing Effective ... Fong-Jones, Bennett, Quinlan, Stockman, Thorne @ Google, Script, p26
  • 13. Was the SLO met? @heinrichhartmann SLO: 99% of valid requests in the past 28 days served in <1s Source: Developing Effective ... Fong-Jones, Bennett, Quinlan, Stockman, Thorne @ Google, Script, p26 What if I told you, 99.5% of all requests occurred here?
  • 14. Was the SLO met? @heinrichhartmann SLO: 99% of valid requests in the past 6h served in <50ms Source: Circonus, internal
  • 15. Percentile Metrics can’t be used for SLOs @heinrichhartmann For SLOs we need to compute percentiles over ... (1) multiple weeks of data (2) multiple nodes (potentially). But: Percentiles can’t be aggregated.
  • 16. Percentiles can’t be aggregated @heinrichhartmann Long story of discussions following my Monitorama 2016 talk: Source: https://twitter.com/danslimmon/status/748295758827315201
  • 17. Percentiles can’t be aggregated @heinrichhartmann
  • 18. Averaging API Latency Percentiles over 24 hours @heinrichhartmann
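The claim that percentiles can't be aggregated can be checked with a small numeric example. The following is a minimal sketch with made-up latencies (the two windows and all values are assumptions, not data from the talk): it averages the p99 of a busy hour and a quiet hour and compares the result with the p99 over all requests.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-q * len(ordered) // 100))  # ceil(q * n / 100) in integer math
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds for two one-hour windows.
busy_hour = [10] * 950 + [2000] * 50   # 1000 requests, 5% slow
quiet_hour = [10] * 10                 # 10 requests, all fast

# Averaging the per-window percentiles ignores how many requests each window had.
averaged_p99 = (percentile(busy_hour, 99) + percentile(quiet_hour, 99)) / 2

# The percentile over all requests has to be computed from the merged samples.
true_p99 = percentile(busy_hour + quiet_hour, 99)

print(averaged_p99)  # 1005.0
print(true_p99)      # 2000
```

The averaged value gives the quiet hour the same weight as the busy hour, although it carries only 1% of the traffic, so it matches neither window nor the period as a whole.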
  • 19. Latency SLOs Done Right @heinrichhartmann Task Count all requests over $period served faster than $threshold. Three valid Methods (1) Log data (2) Counter Metrics (3) Histogram Metrics
  • 20. (1) Latency SLOs via Log Data SELECT count(*) FROM logs WHERE time in $period AND latency < $threshold @heinrichhartmann
  • 21. (1) Latency SLOs via Log Data @heinrichhartmann + Correct, Clean, Easy - You need to keep all your log data for months - ... which can get very expensive. You can do this with: ssh+awk, ELK, Splunk, Honeycomb, etc.
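The query on the log-data slides can be sketched in Python as follows. The record layout (timestamp, latency in seconds) and the sample values are assumptions made for illustration; a real setup would read from whatever log store is in use.

```python
from datetime import datetime

def latency_sli(records, start, end, threshold_s):
    """Fraction of requests in [start, end) served faster than threshold_s.

    records: iterable of (timestamp, latency_seconds) tuples (hypothetical schema).
    Returns None if no request falls into the period.
    """
    latencies = [lat for ts, lat in records if start <= ts < end]
    if not latencies:
        return None
    good = sum(1 for lat in latencies if lat < threshold_s)
    return good / len(latencies)

# Made-up log records for illustration.
logs = [
    (datetime(2019, 8, 1, 9, 12), 0.120),
    (datetime(2019, 8, 1, 9, 13), 1.700),
    (datetime(2019, 8, 2, 14, 0), 0.045),
    (datetime(2019, 8, 3, 8, 30), 0.300),
    (datetime(2019, 9, 1, 0, 5), 2.500),  # outside the period
]

sli = latency_sli(logs, datetime(2019, 8, 1), datetime(2019, 9, 1), threshold_s=1.0)
print(f"{sli:.0%} of requests served within 1s")
```

Because every request is kept, both the period and the threshold can be chosen after the fact, which is the flexibility that makes this method correct but expensive.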
  • 22. (2) Latency SLOs with Counter Metrics @heinrichhartmann - Pick a latency $threshold, e.g. 1s - Add a metric that counts how many requests were faster than $threshold - ... store as e.g. "aws-eu.www22.GET./.lt_1s" - ... for each node and endpoint you care about - Sum / Integrate these metrics across nodes, endpoints and time: find("aws-eu.www*.GET.*.lt_1s") | stats:sum() | integrate()
  • 23. (2) Latency SLOs with Counter Metrics #ObservabilitySummit@phredmoyer 9.5 K / 106 K = 8.9% Slow Requests
  • 24. (2) Latency SLOs via Counter Metrics @heinrichhartmann + Easy, Correct + Cost effective + Full flexibility in choosing aggregation intervals - Need to choose (multiple) latency thresholds upfront You can do this with: Prometheus (“Histograms”), Graphite, DataDog, VividCortex
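The counter-metric method can be sketched as follows. This is a simplified, single-process illustration: the node and endpoint names are hypothetical, and the additional counter for all requests is an assumption, added because the ratio needs a denominator. In practice each node would export its own counters and the metric store would do the summation.

```python
from collections import Counter

THRESHOLD_S = 1.0   # latency threshold, fixed upfront

total = Counter()   # all requests per (node, endpoint)
fast = Counter()    # requests faster than THRESHOLD_S per (node, endpoint)

def observe(node, endpoint, latency_s):
    """Update both counters for one served request."""
    key = (node, endpoint)
    total[key] += 1
    if latency_s < THRESHOLD_S:
        fast[key] += 1

def fraction_fast(node_prefix):
    """Sum the counters over all matching nodes and return fast / total."""
    keys = [k for k in total if k[0].startswith(node_prefix)]
    valid = sum(total[k] for k in keys)
    if valid == 0:
        return None
    return sum(fast[k] for k in keys) / valid

# Hypothetical traffic.
observe("aws-eu.www22", "GET./", 0.2)
observe("aws-eu.www22", "GET./", 1.4)
observe("aws-eu.www23", "GET./", 0.3)
observe("aws-us.www01", "GET./", 3.0)

print(fraction_fast("aws-eu."))
```

Counts, unlike percentiles, add up correctly across nodes and time. The price is that the threshold is baked into the metric, so a new threshold needs a new counter.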
  • 26. HDR Histogram Data Structures @heinrichhartmann Store Latency distribution in sparse, log-linear histogram data structures: - Total of 46,081 bins that cover the decimal float range ±10^±128 (HDR) - One bin for each decimal float number with two significant digits (log-linear): 10, 11, 12, ..., 100, 110, 120, ..., 1000, 1100, 1200, ... - Only store bins which have entries (sparse encoding) - Typical size 300 bytes per histogram, even with >100K entries Open Source Implementations: - hdrhistogram.org (Tene @ Azul Systems) - github.com/circonus-labs/libcircllhist (Circonus, IronDB)
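The log-linear binning described on this slide can be illustrated with a short sketch. It is not the circllhist or HdrHistogram implementation: it assumes non-negative integer values (for example latencies in microseconds), identifies each bin by its lower edge, and uses a plain `Counter` as the sparse storage.

```python
from collections import Counter

def bin_lower_bound(value):
    """Lower edge of the log-linear bin for a non-negative integer value.

    Keeps the two leading digits and zeroes the rest, e.g. 1234 -> 1200.
    """
    scale = 1
    while value >= 100:
        value //= 10
        scale *= 10
    return value * scale

# Hypothetical latencies in microseconds.
latencies_us = [10, 11, 105, 109, 1234, 1299, 98765]

# Sparse encoding: only bins that received samples appear as keys.
histogram = Counter(bin_lower_bound(v) for v in latencies_us)
print(sorted(histogram.items()))
# [(10, 1), (11, 1), (100, 2), (1200, 2), (98000, 1)]
```

The bin width grows with the magnitude of the value, so a bin is never wider than 10% of its lower edge while the number of bins stays small.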
  • 28. (3) Latency SLOs with Histograms @heinrichhartmann 1. Capture latency information of all relevant APIs as histogram metrics. 2. Aggregate latency histograms over nodes, endpoints and time. Get total latency distribution over SLO timeframe (weeks, months). 3. Count samples in bins below the thresholds to compute SLOs.
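The three steps above can be sketched as follows. This is an illustration under simplifying assumptions, not the API of any histogram library: latencies are integer microseconds, bins are identified by their lower edge with two significant digits, and the threshold is assumed to lie on a bin boundary so that the count is exact.

```python
from collections import Counter

def bin_lower_bound(value):
    """Lower edge of the two-significant-digit bin, e.g. 48500 -> 48000."""
    scale = 1
    while value >= 100:
        value //= 10
        scale *= 10
    return value * scale

def to_histogram(latencies_us):
    """Step 1: capture latencies as a sparse histogram."""
    return Counter(bin_lower_bound(v) for v in latencies_us)

def merge(histograms):
    """Step 2: aggregate histograms by adding bin counts."""
    merged = Counter()
    for h in histograms:
        merged.update(h)
    return merged

def fraction_below(histogram, threshold_us):
    """Step 3: share of samples in bins below a threshold on a bin boundary."""
    total = sum(histogram.values())
    good = sum(n for lower, n in histogram.items() if lower < threshold_us)
    return good / total

# Hypothetical per-node latencies in microseconds.
node_a = to_histogram([12_000, 48_500, 51_000])
node_b = to_histogram([30_000, 75_000])

overall = merge([node_a, node_b])
print(fraction_below(overall, 50_000))  # share of requests served within 50ms
```

Since merging only adds bin counts, the same stored histograms can answer the question for any threshold and any aggregation period after the fact.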
  • 29. (3) Latency SLOs with Histogram Metrics @heinrichhartmann 89.3% of all requests in August 2018 were served within 50ms. Demo
  • 30. (3) Latency SLOs with Histograms @heinrichhartmann + Full flexibility in choosing thresholds + Full flexibility in choosing aggregation intervals and levels + Cost effective (300 bytes per histogram value) - Need HDR Histogram Instrumentation - Need HDR Histogram Metric Store You can do this with: Circonus, IronDB + Graphite / Grafana, Google internal tooling.
  • 31. (3) Alternative Mergeable Summaries @heinrichhartmann There are alternative “Mergeable Quantile Summaries” in the literature. Available practical options include: - circllhist -- Schlossnagle @ Circonus 2013 - HDR Histograms -- Tene @ Azul Systems 2015 - t-digest -- Dunning @ MapR, Ertl @ Dynatrace 2015 - DDSketch (Histogram) -- Masson @ DataDog 2019
  • 32. Mergeable Summaries - Percentile Comparison @heinrichhartmann
      Aggregation Method   Accuracy p90 (error %)   Performance(*) insert / merge / p90 (sec)
      raw                  35.774 (-)               - / - / -
      circllhist           35.773 (0.001%)          0.86 / 0.000262 / 0.000005
      HDR Histogram        35.775 (0.000%)          3.59 / 0.003... / 0.003...
      t-digest             35.803 (0.029%)          97.00 / 1.900... / 0.003...
      DD-sketch            35.519 (0.256%)          2.39 / 0.003... / 0.000037
      (*) This is a quick benchmark of the most popular Python libraries, performed on a 2015 MBP. Notebook at https://github.com/HeinrichHartmann/Statistics-for-Engineers/
  • 33. Conclusion @heinrichhartmann Percentile Metrics are not suitable for implementing Latency SLOs. Histogram Metrics allow you to easily calculate arbitrary Latency SLOs. If you don’t have Histogram metrics, you can do the following: 1. Use log data available over the last days to determine sensible latency thresholds for your service. 2. Add counter metrics for the thresholds 3. Aggregate counter metrics as needed