Statistics for Engineers

Heinrich Hartmann
Heinrich HartmannData Scientist. Mathematician. Husband. / Email: Heinrich at HeinrichHartmann dot com
Monitorama PDX, June 29th 2016
Statistics for Engineers
Heinrich Hartmann, Circonus
@HeinrichHartman
Hi, I am Heinrich
· Lives in Munich, EU
· Refugee from Academia (Ph.D.)
· Analytics Lead at Circonus,
Monitoring and Analytics Platform
heinrich.hartmann@circonus.com
@HeinrichHartman(n)
@HeinrichHartman
#StatsForEngineers has been around for a while
[1] Statistics for Engineers @ ACM Queue
[2] Statistics for Engineers Workshop Material @ GitHub
[3] Spike Erosion @ circonus.com
[4] T. Schlossnagle - The Problem with Math @ circonus.com
[5] T. Schlossnagle - Percentages are not People @ circonus.com
[6] W. Vogels - Service Level Agreements in Amazon’s Dynamo/Sec. 2.2
[7] G. Schlossnagle - API Performance Monitoring @ Velocity Bejing 2015
Upcoming
[8] 3h workshop “Statistics for Engineers” @ SRECon 2016 in Dublin
@HeinrichHartman
A tale of API Monitoring
@HeinrichHartman
“Attic” - a furniture webstore
· Attic is a (fictional) furniture webstore
· Web API serving their catalog
· Loses money if requests take too long
Monitoring Goals
1. Measure user experience / quality of service
2. Determine (financial) implications of service degradation
3. Define sensible SLA-targets for the Dev- and Ops-teams
@HeinrichHartman
{1} External Monitoring
@HeinrichHartman
{1} External API Monitoring
Method
1. Make a synthetic request every minute
2. Measure and store request latency
Good for
· Measure Availability
· Alert on outages
Bad for
· Measuring user experience
Latencies of synthetic requests over time
@HeinrichHartman
<!> Spike Erosion </!>
· On long time ranges, aggregated / rolled-up data is commonly displayed
· This practice “erodes” latency spikes heavily!
· Store all data and use alternative aggregation methods (min/max) to get full picture, cf. [3].
1d max
all samples as
Heatmap / ‘dirt’
@HeinrichHartman
{2} Log Analysis
@HeinrichHartman
Method
Write to log file:
- time of completion,
- request latency,
and further metadata.
Discussion
· Rich information source for all kinds of analysis
· Easy instrumentation (printf)
· Slow. Long delay (minutes) before data is indexed and becomes accessible for analysis
· Expensive. Not feasibile for high volume APIs
{2} Log Analysis
Internal view of an API - “UML” version.
@HeinrichHartman
Numerical Digest: The Request-Latency Chart
a concise visualization of the API usage
Latency on the y-
axis
time the request was completed
@HeinrichHartman
Construction of the Request-Latency Chart (RLC)
Request Latency UML Diagram Request Latency Chart
@HeinrichHartman
Math view on APIs
(A) Latency
distribution
(B) Arrival/Completion times
(C) Queuing theory
@HeinrichHartman
“Requests are People”
If you care about your users, you care about their requests.
Every single one.
@HeinrichHartman
{3} Monitoring Latency Averages
@HeinrichHartman
{3} What are latency mean values?
reporting period
@HeinrichHartman
{3} Mean Request Latency Monitoring
Method
1. Select a reporting period (e.g. 1 min)
2. For each period report the mean latency
Pro/Con
+ Measure requests by actual people
+ Cheap to collect store and analyze
- Easily skewed by outliers at the high end
(complex, long running requests)
- ... and the low end (cached responses)
“Measuring the average latency is like measuring the average temperature in a hospital.”
-- Dogan @ Optimizely
@HeinrichHartman
{3} Mean Request Latency in practice
@HeinrichHartman
{3} Mean Request Latency - Robust Variants
1. Median Latency
- Sort latency values in reporting period
- The median is the ‘central’ value.
2. Truncated Means
- Take out min and max latencies in reporting
period (k-times).
- Then compute the mean value
3. Collect Deviation Measures
- Avoid standdard deviations, use
- Use Mean absolute deviation Construction of the median latency
@HeinrichHartman
{4} Percentile Monitoring
@HeinrichHartman
{4} What are Percentiles?
@HeinrichHartman
{4} Percentile Monitoring
Method
1. Select a reporting period (e.g. 1 min)
2. For each reporting period measure the 50%, 90%, 99%, 99.9% latency percentile
3. Alert when percentiles are over a threshold value
Pro/Con
+ Measure requests by actual people
+ Cheap to collect store and analyze
+ Robust to Outliers
- Up-front choice of percentiles needed
- Can not be aggregated
{5} How it looks in practice
Latency percentiles 50,90,99 computed over 1m reporting periods
<!> Percentiles can’t be aggregated </!>
The median of two medians is NOT the total median.
If you store percentiles you need to:
A. Keep all your data. Never take average rollups!
B. Store percentiles for all aggregation levels separately, e.g.
○ per Node / Rack / DC
○ per Endpoint / Service
C. Store percentiles for all reporting periods you are interested in, e.g. per min / h / day
D. Store all percentiles you will ever be interested in, e.g. 50, 75, 90, 99, 99.9
Further Reading: [4] T. Schlossnagle - The Problem with Math @ circonus.com
@HeinrichHartman
{5} API Monitoring with Histograms
{5} API Monitoring with Histograms
Method
1. Divide latency scale into bands
2. Divide the time scale into reporting periods
3. Count the number of samples in each
latency band x reporting period
Discussion
· Summary of full RLC, with reduced precision
· Extreme compression compared to logs
· Percentiles, averages, medians, etc. can be derived
· Aggregation across time and nodes trivial
· Allows more meaningful metrics
latency
sample count
time
{5} Histogram Monitoring in Practice
Histograms can be visualized as heatmaps.
Aggregate data from all nodes
serving “web-api”
.. across windows of 10min.
{5} Histogram Monitoring in Practice
All kinds of metrics can be derived from histograms
@HeinrichHartman
{6} The search for meaningful metrics
{6} Users offended per minute
{6} Total users offended so far
@HeinrichHartman
Takeaways
· Don’t trust line graphs (at least on large scale)
· Don’t aggregate percentiles. Aggregate histograms.
· Keep your data
· Strive for meaningful metrics
1 of 32

Recommended

Scalable Online Analytics for Monitoring by
Scalable Online Analytics for MonitoringScalable Online Analytics for Monitoring
Scalable Online Analytics for MonitoringHeinrich Hartmann
3K views42 slides
Circonus: Design failures - A Case Study by
Circonus: Design failures - A Case StudyCirconus: Design failures - A Case Study
Circonus: Design failures - A Case StudyHeinrich Hartmann
294 views45 slides
Latency SLOs Done Right @ SREcon EMEA 2019 by
Latency SLOs Done Right @ SREcon EMEA 2019Latency SLOs Done Right @ SREcon EMEA 2019
Latency SLOs Done Right @ SREcon EMEA 2019Heinrich Hartmann
614 views33 slides
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par... by
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...
Real Time Experiment Analytics at Pinterest with Apache Flink - Ben Liu & Par...Flink Forward
230 views30 slides
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F... by
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...Flink Forward
1.4K views48 slides
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset... by
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...Flink Forward
1.1K views19 slides

More Related Content

What's hot

Debunking Common Myths in Stream Processing by
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingKostas Tzoumas
607 views41 slides
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati... by
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...Flink Forward
306 views25 slides
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ... by
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...Jonas Traub
811 views15 slides
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ... by
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...Flink Forward
150 views28 slides
Robust Stream Processing With Apache Flink by
Robust Stream Processing With Apache FlinkRobust Stream Processing With Apache Flink
Robust Stream Processing With Apache FlinkJamie Grier
588 views25 slides
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces... by
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward
1.1K views19 slides

What's hot(20)

Debunking Common Myths in Stream Processing by Kostas Tzoumas
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
Kostas Tzoumas607 views
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati... by Flink Forward
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward306 views
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ... by Jonas Traub
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
Jonas Traub811 views
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ... by Flink Forward
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Flink Forward150 views
Robust Stream Processing With Apache Flink by Jamie Grier
Robust Stream Processing With Apache FlinkRobust Stream Processing With Apache Flink
Robust Stream Processing With Apache Flink
Jamie Grier588 views
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces... by Flink Forward
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward1.1K views
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an... by Flink Forward
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Flink Forward942 views
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac... by Ververica
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Ververica 775 views
Making Sense of Streaming Sensor Data: How Uber Detects on Trip Car Crashes -... by Flink Forward
Making Sense of Streaming Sensor Data: How Uber Detects on Trip Car Crashes -...Making Sense of Streaming Sensor Data: How Uber Detects on Trip Car Crashes -...
Making Sense of Streaming Sensor Data: How Uber Detects on Trip Car Crashes -...
Flink Forward418 views
Stream Loops on Flink - Reinventing the wheel for the streaming era by Paris Carbone
Stream Loops on Flink - Reinventing the wheel for the streaming eraStream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming era
Paris Carbone805 views
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us... by Flink Forward
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward5K views
Insight Demo by reza-asad
Insight DemoInsight Demo
Insight Demo
reza-asad204 views
Insight Recent Demo by reza-asad
Insight Recent DemoInsight Recent Demo
Insight Recent Demo
reza-asad334 views
Continuously Updating Query Results over Real-Time Linked Data by Ruben Taelman
Continuously Updating Query Results over Real-Time Linked DataContinuously Updating Query Results over Real-Time Linked Data
Continuously Updating Query Results over Real-Time Linked Data
Ruben Taelman1.3K views
Using Kafka to integrate DWH and Cloud Based big data systems by confluent
Using Kafka to integrate DWH and Cloud Based big data systemsUsing Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systems
confluent468 views
Monitoring Flink with Prometheus by Maximilian Bode
Monitoring Flink with PrometheusMonitoring Flink with Prometheus
Monitoring Flink with Prometheus
Maximilian Bode1.6K views
Continuous Self-Updating Query Results over Dynamic Linked Data by Ruben Taelman
Continuous Self-Updating Query Results over Dynamic Linked DataContinuous Self-Updating Query Results over Dynamic Linked Data
Continuous Self-Updating Query Results over Dynamic Linked Data
Ruben Taelman1.1K views
Ceilometer lsf-intergration-openstack-summit by Tim Bell
Ceilometer lsf-intergration-openstack-summitCeilometer lsf-intergration-openstack-summit
Ceilometer lsf-intergration-openstack-summit
Tim Bell1.2K views
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete... by Flink Forward
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
Flink Forward1.3K views
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea... by Ververica
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Ververica 1.6K views

Viewers also liked

OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm by
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin ParmOSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin ParmNETWAYS
2.1K views66 slides
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015 by
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015DevOpsDays Tel Aviv
5.2K views58 slides
Prometheus (Monitorama 2016) by
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Brian Brazil
4.5K views23 slides
Production testing through monitoring by
Production testing through monitoringProduction testing through monitoring
Production testing through monitoringLeon Fayer
7K views58 slides
2016 metrics-as-culture by
2016 metrics-as-culture2016 metrics-as-culture
2016 metrics-as-cultureNicole Forsgren
3.8K views21 slides
SREcon 2016 Performance Checklists for SREs by
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
206.4K views79 slides

Viewers also liked(20)

OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm by NETWAYS
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin ParmOSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
NETWAYS2.1K views
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015 by DevOpsDays Tel Aviv
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
DevOpsDays Tel Aviv5.2K views
Prometheus (Monitorama 2016) by Brian Brazil
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)
Brian Brazil4.5K views
Production testing through monitoring by Leon Fayer
Production testing through monitoringProduction testing through monitoring
Production testing through monitoring
Leon Fayer7K views
SREcon 2016 Performance Checklists for SREs by Brendan Gregg
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
Brendan Gregg206.4K views
Monitoring As a Service by James Turnbull
Monitoring As a ServiceMonitoring As a Service
Monitoring As a Service
James Turnbull22.9K views
Monitoring Challenges - Monitorama 2016 - Monitoringless by Adrian Cockcroft
Monitoring Challenges - Monitorama 2016 - MonitoringlessMonitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - Monitoringless
Adrian Cockcroft8.6K views
DevOpsDays Amsterdam - Monitoring at Service Provider Scale by Chris Jackson
DevOpsDays Amsterdam - Monitoring at Service Provider ScaleDevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
Chris Jackson2.1K views
Bruce Lawson: Progressive Web Apps: the future of Apps by brucelawson
Bruce Lawson: Progressive Web Apps: the future of AppsBruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of Apps
brucelawson6.2K views
Microservices Workshop All Topics Deck 2016 by Adrian Cockcroft
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016
Adrian Cockcroft33.4K views
Julieta Sieyra - CV Especialidades by Julieta Sieyra
Julieta Sieyra - CV EspecialidadesJulieta Sieyra - CV Especialidades
Julieta Sieyra - CV Especialidades
Julieta Sieyra501 views
Building Scalable Systems: an asynchronous approach by Theo Schlossnagle
Building Scalable Systems: an asynchronous approachBuilding Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approach
Theo Schlossnagle2.8K views
Monitorama: How monitoring can improve the rest of the company by Jeff Weinstein
Monitorama: How monitoring can improve the rest of the companyMonitorama: How monitoring can improve the rest of the company
Monitorama: How monitoring can improve the rest of the company
Jeff Weinstein6.7K views
Grokking Grok: Monitorama PDX 2015 by GregMefford
Grokking Grok: Monitorama PDX 2015Grokking Grok: Monitorama PDX 2015
Grokking Grok: Monitorama PDX 2015
GregMefford4.3K views
2014 devops conferences by David Lutz
2014 devops conferences2014 devops conferences
2014 devops conferences
David Lutz3.6K views

Similar to Statistics for Engineers

OSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann by
OSMC 2016 | Friends and foes in API Monitoring by Heinrich HartmannOSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann
OSMC 2016 | Friends and foes in API Monitoring by Heinrich HartmannNETWAYS
41 views46 slides
OSMC 2016 - Friends and foes by Heinrich Hartmann by
OSMC 2016 - Friends and foes by Heinrich HartmannOSMC 2016 - Friends and foes by Heinrich Hartmann
OSMC 2016 - Friends and foes by Heinrich HartmannNETWAYS
63 views46 slides
How to Measure Latency by
How to Measure LatencyHow to Measure Latency
How to Measure LatencyScyllaDB
1.7K views29 slides
Lego-like building blocks of Storm and Spark Streaming Pipelines by
Lego-like building blocks of Storm and Spark Streaming PipelinesLego-like building blocks of Storm and Spark Streaming Pipelines
Lego-like building blocks of Storm and Spark Streaming PipelinesDataWorks Summit/Hadoop Summit
960 views15 slides
Webinar Monitoring in era of cloud computing by
Webinar Monitoring in era of cloud computingWebinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computingCREATE-NET
1.1K views32 slides
Shikha fdp 62_14july2017 by
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Dr. Shikha Mehta
214 views37 slides

Similar to Statistics for Engineers(20)

OSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann by NETWAYS
OSMC 2016 | Friends and foes in API Monitoring by Heinrich HartmannOSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann
OSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann
NETWAYS41 views
OSMC 2016 - Friends and foes by Heinrich Hartmann by NETWAYS
OSMC 2016 - Friends and foes by Heinrich HartmannOSMC 2016 - Friends and foes by Heinrich Hartmann
OSMC 2016 - Friends and foes by Heinrich Hartmann
NETWAYS63 views
How to Measure Latency by ScyllaDB
How to Measure LatencyHow to Measure Latency
How to Measure Latency
ScyllaDB1.7K views
Webinar Monitoring in era of cloud computing by CREATE-NET
Webinar Monitoring in era of cloud computingWebinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computing
CREATE-NET1.1K views
Lean Six Sigma methodology by Ramiro Cid
Lean Six Sigma methodologyLean Six Sigma methodology
Lean Six Sigma methodology
Ramiro Cid10K views
5.1 mining data streams by Krish_ver2
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver214.4K views
Product Engineer Certified Lean Six Sigma Black Belt by IASSC by HAKKACHE Mohamed
Product Engineer Certified Lean Six Sigma Black Belt by IASSCProduct Engineer Certified Lean Six Sigma Black Belt by IASSC
Product Engineer Certified Lean Six Sigma Black Belt by IASSC
HAKKACHE Mohamed181 views
Apache Flink: Real-World Use Cases for Streaming Analytics by Slim Baltagi
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
Slim Baltagi12.5K views
Digital and data journey demystified: how it all works by Michal Hodinka
Digital and data journey demystified: how it all worksDigital and data journey demystified: how it all works
Digital and data journey demystified: how it all works
Michal Hodinka52 views
Christian Kreuzfeld – Static vs Dynamic Stream Processing by Flink Forward
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Flink Forward7.6K views
Optimizing connected system performance md&amp;m-anaheim-sandhi bhide 02-07-2017 by sandhibhide
Optimizing connected system performance md&amp;m-anaheim-sandhi bhide 02-07-2017Optimizing connected system performance md&amp;m-anaheim-sandhi bhide 02-07-2017
Optimizing connected system performance md&amp;m-anaheim-sandhi bhide 02-07-2017
sandhibhide74 views
Big Data Architectures @ JAX / BigDataCon 2016 by Guido Schmutz
Big Data Architectures @ JAX / BigDataCon 2016Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016
Guido Schmutz1.5K views
Data Stream Algorithms in Storm and R by Radek Maciaszek
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
Radek Maciaszek1.8K views
Revealing Differences in Designer‘s and Users‘Perspectives by Sebastian Feuerstack
Revealing Differences in Designer‘s and Users‘PerspectivesRevealing Differences in Designer‘s and Users‘Perspectives
Revealing Differences in Designer‘s and Users‘Perspectives
Self-Tuning Data Centers by Reza Rahimi
Self-Tuning Data CentersSelf-Tuning Data Centers
Self-Tuning Data Centers
Reza Rahimi962 views

More from Heinrich Hartmann

Linux System Monitoring with eBPF by
Linux System Monitoring with eBPFLinux System Monitoring with eBPF
Linux System Monitoring with eBPFHeinrich Hartmann
981 views20 slides
Geometric Aspects of LSA by
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSAHeinrich Hartmann
360 views8 slides
Geometric Aspects of LSA by
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSAHeinrich Hartmann
571 views15 slides
Seminar on Complex Geometry by
Seminar on Complex GeometrySeminar on Complex Geometry
Seminar on Complex GeometryHeinrich Hartmann
869 views5 slides
Seminar on Motivic Hall Algebras by
Seminar on Motivic Hall AlgebrasSeminar on Motivic Hall Algebras
Seminar on Motivic Hall AlgebrasHeinrich Hartmann
524 views8 slides
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS by
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONSGROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONSHeinrich Hartmann
302 views7 slides

More from Heinrich Hartmann(19)

GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS by Heinrich Hartmann
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONSGROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS
Heinrich Hartmann302 views
Hecke Curves and Moduli spcaes of Vector Bundles by Heinrich Hartmann
Hecke Curves and Moduli spcaes of Vector BundlesHecke Curves and Moduli spcaes of Vector Bundles
Hecke Curves and Moduli spcaes of Vector Bundles
Heinrich Hartmann611 views
Dimension und Multiplizität von D-Moduln by Heinrich Hartmann
Dimension und Multiplizität von D-ModulnDimension und Multiplizität von D-Moduln
Dimension und Multiplizität von D-Moduln
Heinrich Hartmann383 views
Local morphisms are given by composition by Heinrich Hartmann
Local morphisms are given by compositionLocal morphisms are given by composition
Local morphisms are given by composition
Heinrich Hartmann313 views
Cusps of the Kähler moduli space and stability conditions on K3 surfaces by Heinrich Hartmann
Cusps of the Kähler moduli space and stability conditions on K3 surfacesCusps of the Kähler moduli space and stability conditions on K3 surfaces
Cusps of the Kähler moduli space and stability conditions on K3 surfaces
Heinrich Hartmann619 views

Recently uploaded

NET Conf 2023 Recap by
NET Conf 2023 RecapNET Conf 2023 Recap
NET Conf 2023 RecapLee Richardson
10 views71 slides
SAP Automation Using Bar Code and FIORI.pdf by
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdfVirendra Rai, PMP
23 views38 slides
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdfDr. Jimmy Schwarzkopf
19 views29 slides
Business Analyst Series 2023 - Week 3 Session 5 by
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5DianaGray10
248 views20 slides
"Running students' code in isolation. The hard way", Yurii Holiuk by
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk Fwdays
11 views34 slides
Five Things You SHOULD Know About Postman by
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About PostmanPostman
33 views43 slides

Recently uploaded(20)

SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10248 views
"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays11 views
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman33 views
Future of AR - Facebook Presentation by ssuserb54b561
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
ssuserb54b56114 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma39 views
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
Powerful Google developer tools for immediate impact! (2023-24) by wesley chun
Powerful Google developer tools for immediate impact! (2023-24)Powerful Google developer tools for immediate impact! (2023-24)
Powerful Google developer tools for immediate impact! (2023-24)
wesley chun10 views
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson85 views

Statistics for Engineers

  • 1. Monitorama PDX, June 29th 2016 Statistics for Engineers Heinrich Hartmann, Circonus
  • 2. @HeinrichHartman Hi, I am Heinrich · Lives in Munich, EU · Refugee from Academia (Ph.D.) · Analytics Lead at Circonus, Monitoring and Analytics Platform heinrich.hartmann@circonus.com @HeinrichHartman(n)
  • 3. @HeinrichHartman #StatsForEngineers has been around for a while [1] Statistics for Engineers @ ACM Queue [2] Statistics for Engineers Workshop Material @ GitHub [3] Spike Erosion @ circonus.com [4] T. Schlossnagle - The Problem with Math @ circonus.com [5] T. Schlossnagle - Percentages are not People @ circonus.com [6] W. Vogels - Service Level Agreements in Amazon’s Dynamo/Sec. 2.2 [7] G. Schlossnagle - API Performance Monitoring @ Velocity Bejing 2015 Upcoming [8] 3h workshop “Statistics for Engineers” @ SRECon 2016 in Dublin
  • 4. @HeinrichHartman A tale of API Monitoring
  • 5. @HeinrichHartman “Attic” - a furniture webstore · Attic is a (fictional) furniture webstore · Web API serving their catalog · Loses money if requests take too long Monitoring Goals 1. Measure user experience / quality of service 2. Determine (financial) implications of service degradation 3. Define sensible SLA-targets for the Dev- and Ops-teams
  • 7. @HeinrichHartman {1} External API Monitoring Method 1. Make a synthetic request every minute 2. Measure and store request latency Good for · Measure Availability · Alert on outages Bad for · Measuring user experience Latencies of synthetic requests over time
  • 8. @HeinrichHartman <!> Spike Erosion </!> · On long time ranges, aggregated / rolled-up data is commonly displayed · This practice “erodes” latency spikes heavily! · Store all data and use alternative aggregation methods (min/max) to get full picture, cf. [3]. 1d max all samples as Heatmap / ‘dirt’
  • 10. @HeinrichHartman Method Write to log file: - time of completion, - request latency, and further metadata. Discussion · Rich information source for all kinds of analysis · Easy instrumentation (printf) · Slow. Long delay (minutes) before data is indexed and becomes accessible for analysis · Expensive. Not feasibile for high volume APIs {2} Log Analysis Internal view of an API - “UML” version.
  • 11. @HeinrichHartman Numerical Digest: The Request-Latency Chart a concise visualization of the API usage Latency on the y- axis time the request was completed
  • 12. @HeinrichHartman Construction of the Request-Latency Chart (RLC) Request Latency UML Diagram Request Latency Chart
  • 13. @HeinrichHartman Math view on APIs (A) Latency distribution (B) Arrival/Completion times (C) Queuing theory
  • 14. @HeinrichHartman “Requests are People” If you care about your users, you care about their requests. Every single one.
  • 16. @HeinrichHartman {3} What are latency mean values? reporting period
  • 17. @HeinrichHartman {3} Mean Request Latency Monitoring Method 1. Select a reporting period (e.g. 1 min) 2. For each period report the mean latency Pro/Con + Measure requests by actual people + Cheap to collect store and analyze - Easily skewed by outliers at the high end (complex, long running requests) - ... and the low end (cached responses) “Measuring the average latency is like measuring the average temperature in a hospital.” -- Dogan @ Optimizely
  • 18. @HeinrichHartman {3} Mean Request Latency in practice
  • 19. @HeinrichHartman {3} Mean Request Latency - Robust Variants 1. Median Latency - Sort latency values in reporting period - The median is the ‘central’ value. 2. Truncated Means - Take out min and max latencies in reporting period (k-times). - Then compute the mean value 3. Collect Deviation Measures - Avoid standdard deviations, use - Use Mean absolute deviation Construction of the median latency
  • 22. @HeinrichHartman {4} Percentile Monitoring Method 1. Select a reporting period (e.g. 1 min) 2. For each reporting period measure the 50%, 90%, 99%, 99.9% latency percentile 3. Alert when percentiles are over a threshold value Pro/Con + Measure requests by actual people + Cheap to collect store and analyze + Robust to Outliers - Up-front choice of percentiles needed - Can not be aggregated
  • 23. {5} How it looks in practice Latency percentiles 50,90,99 computed over 1m reporting periods
  • 24. <!> Percentiles can’t be aggregated </!> The median of two medians is NOT the total median. If you store percentiles you need to: A. Keep all your data. Never take average rollups! B. Store percentiles for all aggregation levels separately, e.g. ○ per Node / Rack / DC ○ per Endpoint / Service C. Store percentiles for all reporting periods you are interested in, e.g. per min / h / day D. Store all percentiles you will ever be interested in, e.g. 50, 75, 90, 99, 99.9 Further Reading: [4] T. Schlossnagle - The Problem with Math @ circonus.com
  • 26. {5} API Monitoring with Histograms Method 1. Divide latency scale into bands 2. Divide the time scale into reporting periods 3. Count the number of samples in each latency band x reporting period Discussion · Summary of full RLC, with reduced precision · Extreme compression compared to logs · Percentiles, averages, medians, etc. can be derived · Aggregation across time and nodes trivial · Allows more meaningful metrics latency sample count time
  • 27. {5} Histogram Monitoring in Practice Histograms can be visualized as heatmaps. Aggregate data from all nodes serving “web-api” .. across windows of 10min.
  • 28. {5} Histogram Monitoring in Practice All kinds of metrics can be derived from histograms
  • 29. @HeinrichHartman {6} The search for meaningful metrics
  • 30. {6} Users offended per minute
  • 31. {6} Total users offended so far
  • 32. @HeinrichHartman Takeaways · Don’t trust line graphs (at least on large scale) · Don’t aggregate percentiles. Aggregate histograms. · Keep your data · Strive for meaningful metrics