SlideShare a Scribd company logo
1 of 76
Download to read offline
Brought to you
by
Misery Metrics and
Consequences
Gil Tene
CTO & co-Founder, Azul
Systems @giltene
Misery Metrics
and
Consequences
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
Agenda
Some intro stuff
How much of monitoring is simply broken
How it’s so bad that there is no fixing it
lights and tunnels
appreciating misery
Understanding Latency
and
Application Responsiveness
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
(Micro?)services
and
Response Time
Behavior
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
How NOT to
Measure Latency
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
The “Oh S@%#!”
talk
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
How I learned
to
stop worrying
and
love Misery
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
How I learned
to
stop worrying
and
love Misery
Gil Tene, CTO & co-Founder, Azul Systems
@giltene
(Micro)services
Source: “Visualizing and Tracking your Microservices”, AppDynamics blog
Source: “Architecting for Speed: How agile innovators accelerate growth through microservices”, Jesper Nordstrom & Tomas Betzholtzblog, 3gamma.com
Source: “Microservices at Amazon”, Alan Ho (@karlunho), apigee developer blog, image from circa 2009
This is a Microservice
This is a Microservices
Architecture
?
?
?
Source: “Netflix OSS, Spring Cloud, or Kubernetes? How About All of Them!”, Christian Posta, DZone
Remember: all I'm offering is the truth.
We like to look at pretty charts…
95%’lie
The “We only want to show good things” chart
What you find when you dig deeper into the 9s
What you find when you dig deeper into the 9s
What (TF) does the Average
of the 95%’lie mean?
What (TF) does the Average
of the 95%’lie mean?
Lets do the same with 100%’ile; Suppose we have a set of
100%’ile values for each minute:
[1, 0, 3, 1, 601, 4, 2, 8, 0, 3, 3, 1, 1, 0, 2]
“The average 100%’ile over the past 15 minutes was 42”
Same nonsense applies to averaging any other %’lie
#LatencyTipOfTheDay:
You can't average percentiles.
Period.
What percentile levels do YOU monitor?
99%’lie: a good indicator, right?
What are the chances of a single web page view
experiencing >99%’lie latency of:
- A single search engine node?
- A single Key/Value store node?
- A single Database node?
- A single CDN request?
- A single (micro)service call?
#LatencyTipOfTheDay:
MOST page loads will experience worse
than the 99%'lie server response
So why are we measuring it?
Gauging user experience
Example: If a typical user session involves 5 page loads,
averaging 40 resources per page.
-How many of our users will NOT experience something
worse than the 95%’lie of http requests?
Answer: ~0.003%
-How many of our users will NOT experience something
worse than the 99%’lie of http requests?
Answer: ~13%
Gauging user experience
Example: If a typical user session involves 5 page loads,
averaging 40 resources per page.
-What http response percentile will be experienced by the
95%’ile of users?
Answer: ~99.97%
-What http response percentile will be experienced by the
99%’ile of users
Answer: ~99.995%
Why don’t we have response time
or latency stats with multiple 9s
in them???
You can’t average percentiles…
And you also can’t get an hour’s
99.999%’lie out of lots
of 10 second interval 99%’lie
reports…
Why don’t we have response time
or latency stats with multiple 9s
in them???
You can’t average percentiles…
And you also can’t get an hour’s 99.999%’lie
out of lots
of 10 second interval 99%’lie reports…
Check out HdrHistogram
It lets you have nice things….
Why don’t we have response time or
latency stats with multiple 9s in
them???
0% 90% 99% 99.99% 99.999% 99.9999%
140
120
100
80
60
40
20
0
Hiccup
Duration
(msec)
99.9%
Percentile
Hiccups by Percentile Distribution
Hiccups by Percentile SLA
The coordinated omission
problem
An accidental conspiracy.
.
The lie in the 99%’lies
Coordinated Omission in Monitoring Code
Long operations only get measured once
delays outside of timing window do not get measured at all
How bad can this get?
Avg. is 1
msec over 1st
100 sec
System
Stalled for
100 Sec
System easily
handles 100
requests/sec
Responds to
each in
1msec
How would you characterize this
system?
Elapsed Time
~75%‘ile is 50
sec
~50%‘ile is 1
msec
99.99%‘ile is
~100sec
Avg. is 50 sec.
over next 100
sec
Overall Average response time is ~25
sec.
Measurement in practice
System
Stalled for
100 Sec
System easily
handles 100
requests/sec
Responds to
each in
1msec
What actually gets
measured?
Elapsed Time
75%‘lie is 1
msec
99.99%‘lie is 1
msec
50%‘ile is 1
msec
10,000
measurements
@ 1 msec each
1
measurement
@ 100 sec
Overall Average is 10.9 msec
(!!!)
Proper measurement
System
Stalled for
100 Sec
Elapsed
Time
System easily
handles 100
requests/sec
Responds to
each in
1msec
10,000
results
Varying
linearly from
100 sec
to 10 msec
10,000
results @ 1
msec each
~50%‘ile is 1
msec
~75%‘ile is 50
sec
99.99%‘ile is
~100sec
Proper measurement
System
Stalled for
100 Sec
Elapsed
Time
System easily
handles 100
requests/sec
Responds to
each in
1msec
10,000
results
Varying
linearly from
100 sec
to 10 msec
10,000
results @ 1
msec each
~50%‘ile is 1
msec
~75%‘ile is 50
sec
99.99%‘ile is
~100sec
Coordinated
Omission
1 msec 1 msec
1 result
@ 100
sec
“Better” can look “Worse”
System easily
handles 100
requests/sec
Responds to
each in
1msec
10,000 @
1msec
10,000 @ 5
msec
System
Slowed for
100 Sec
Still easily
handles 100
requests/sec
50%‘ile is 1
msec
99.99%‘lie is
~5msec
Responds to
each in 5
msec
Elapsed Time
75%‘lie is
2.5msec
Service Time vs. Response
Time
Response Time vs. Service Time @60K/sec
How “real” people react
How does the world
keep working ?!?!?
Luckily, I think that there is
a rainbow at the end of the
tunnel…
How does the world
even keep working
?!?!?
Something we are
doing actually works…
Something we are
doing actually works…
3000 BC
Graduated with honors
Witch Doctor Academy
Timeout
s
Retries Misses
Abandoned
Shopping Carts
Loss of
Engagemen
t
Failed
Queries
Angry
Phone Calls
It is nonsense to ask:
“was my 99%’lie a timeout?”
Instead Ask:
“How many timeouts
did I have?”
&
“What % of my
requests timed out?”
TransactionLoad
Success Rate
Looking at misery
Focus on measuring Misery
@gilten
Love
Misery
Love
Misery
Don’t aim to make a specific %’ile acceptable
Instead, aim for an acceptable %
of misery
final #LatencyTipOfTheDay:
Brought to you
by
Gil Tene
@gilten
e

More Related Content

What's hot

M|18 Deep Dive: InnoDB Transactions and Write Paths
M|18 Deep Dive: InnoDB Transactions and Write PathsM|18 Deep Dive: InnoDB Transactions and Write Paths
M|18 Deep Dive: InnoDB Transactions and Write PathsMariaDB plc
 
Windows Sistemlerde Yetkili Hesapları İzleyerek Yanal Hareketleri Yakalama
Windows Sistemlerde Yetkili Hesapları İzleyerek Yanal Hareketleri YakalamaWindows Sistemlerde Yetkili Hesapları İzleyerek Yanal Hareketleri Yakalama
Windows Sistemlerde Yetkili Hesapları İzleyerek Yanal Hareketleri YakalamaBGA Cyber Security
 
No Easy Breach DerbyCon 2016
No Easy Breach DerbyCon 2016No Easy Breach DerbyCon 2016
No Easy Breach DerbyCon 2016Matthew Dunwoody
 
Understanding Application Threat Modelling & Architecture
 Understanding Application Threat Modelling & Architecture Understanding Application Threat Modelling & Architecture
Understanding Application Threat Modelling & ArchitecturePriyanka Aash
 
DNS spoofing/poisoning Attack
DNS spoofing/poisoning AttackDNS spoofing/poisoning Attack
DNS spoofing/poisoning AttackFatima Qayyum
 
Purple Teaming with ATT&CK - x33fcon 2018
Purple Teaming with ATT&CK - x33fcon 2018Purple Teaming with ATT&CK - x33fcon 2018
Purple Teaming with ATT&CK - x33fcon 2018Christopher Korban
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Threat Hunting
Threat HuntingThreat Hunting
Threat HuntingSplunk
 
TCP/IP Ağlarda Parçalanmış Paketler ve Etkileri
TCP/IP Ağlarda Parçalanmış Paketler ve EtkileriTCP/IP Ağlarda Parçalanmış Paketler ve Etkileri
TCP/IP Ağlarda Parçalanmış Paketler ve EtkileriBGA Cyber Security
 
ITIL - IAM (Access Management)
ITIL - IAM (Access Management)ITIL - IAM (Access Management)
ITIL - IAM (Access Management)Josep Bardallo
 
ProxySQL High Availability (Clustering)
ProxySQL High Availability (Clustering)ProxySQL High Availability (Clustering)
ProxySQL High Availability (Clustering)Mydbops
 
DerbyCon 2019 - Kerberoasting Revisited
DerbyCon 2019 - Kerberoasting RevisitedDerbyCon 2019 - Kerberoasting Revisited
DerbyCon 2019 - Kerberoasting RevisitedWill Schroeder
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
BATbern48_How Zero Trust can help your organisation keep safe.pdf
BATbern48_How Zero Trust can help your organisation keep safe.pdfBATbern48_How Zero Trust can help your organisation keep safe.pdf
BATbern48_How Zero Trust can help your organisation keep safe.pdfBATbern
 
Pen Testing Explained
Pen Testing ExplainedPen Testing Explained
Pen Testing ExplainedRand W. Hirt
 
【de:code 2020】 もうセキュリティはやりたくない!! 第 5 弾 ~Microsoft の xDR で攻撃者を追え!!~​
【de:code 2020】 もうセキュリティはやりたくない!! 第 5 弾  ~Microsoft の xDR で攻撃者を追え!!~​【de:code 2020】 もうセキュリティはやりたくない!! 第 5 弾  ~Microsoft の xDR で攻撃者を追え!!~​
【de:code 2020】 もうセキュリティはやりたくない!! 第 5 弾 ~Microsoft の xDR で攻撃者を追え!!~​日本マイクロソフト株式会社
 
.LNK Tears of the Kingdom
.LNK Tears of the Kingdom.LNK Tears of the Kingdom
.LNK Tears of the KingdomMITRE ATT&CK
 
Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2Chartbeat
 
Splunk Overview
Splunk OverviewSplunk Overview
Splunk OverviewSplunk
 

What's hot (20)

M|18 Deep Dive: InnoDB Transactions and Write Paths
M|18 Deep Dive: InnoDB Transactions and Write PathsM|18 Deep Dive: InnoDB Transactions and Write Paths
M|18 Deep Dive: InnoDB Transactions and Write Paths
 
Windows Sistemlerde Yetkili Hesapları İzleyerek Yanal Hareketleri Yakalama
Windows Sistemlerde Yetkili Hesapları İzleyerek Yanal Hareketleri YakalamaWindows Sistemlerde Yetkili Hesapları İzleyerek Yanal Hareketleri Yakalama
Windows Sistemlerde Yetkili Hesapları İzleyerek Yanal Hareketleri Yakalama
 
No Easy Breach DerbyCon 2016
No Easy Breach DerbyCon 2016No Easy Breach DerbyCon 2016
No Easy Breach DerbyCon 2016
 
Understanding Application Threat Modelling & Architecture
 Understanding Application Threat Modelling & Architecture Understanding Application Threat Modelling & Architecture
Understanding Application Threat Modelling & Architecture
 
DNS spoofing/poisoning Attack
DNS spoofing/poisoning AttackDNS spoofing/poisoning Attack
DNS spoofing/poisoning Attack
 
Purple Teaming with ATT&CK - x33fcon 2018
Purple Teaming with ATT&CK - x33fcon 2018Purple Teaming with ATT&CK - x33fcon 2018
Purple Teaming with ATT&CK - x33fcon 2018
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Threat Hunting
Threat HuntingThreat Hunting
Threat Hunting
 
TCP/IP Ağlarda Parçalanmış Paketler ve Etkileri
TCP/IP Ağlarda Parçalanmış Paketler ve EtkileriTCP/IP Ağlarda Parçalanmış Paketler ve Etkileri
TCP/IP Ağlarda Parçalanmış Paketler ve Etkileri
 
ITIL - IAM (Access Management)
ITIL - IAM (Access Management)ITIL - IAM (Access Management)
ITIL - IAM (Access Management)
 
ProxySQL High Availability (Clustering)
ProxySQL High Availability (Clustering)ProxySQL High Availability (Clustering)
ProxySQL High Availability (Clustering)
 
DerbyCon 2019 - Kerberoasting Revisited
DerbyCon 2019 - Kerberoasting RevisitedDerbyCon 2019 - Kerberoasting Revisited
DerbyCon 2019 - Kerberoasting Revisited
 
SIEM Başarıya Giden Yol
SIEM Başarıya Giden YolSIEM Başarıya Giden Yol
SIEM Başarıya Giden Yol
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitations
 
BATbern48_How Zero Trust can help your organisation keep safe.pdf
BATbern48_How Zero Trust can help your organisation keep safe.pdfBATbern48_How Zero Trust can help your organisation keep safe.pdf
BATbern48_How Zero Trust can help your organisation keep safe.pdf
 
Pen Testing Explained
Pen Testing ExplainedPen Testing Explained
Pen Testing Explained
 
【de:code 2020】 もうセキュリティはやりたくない!! 第 5 弾 ~Microsoft の xDR で攻撃者を追え!!~​
【de:code 2020】 もうセキュリティはやりたくない!! 第 5 弾  ~Microsoft の xDR で攻撃者を追え!!~​【de:code 2020】 もうセキュリティはやりたくない!! 第 5 弾  ~Microsoft の xDR で攻撃者を追え!!~​
【de:code 2020】 もうセキュリティはやりたくない!! 第 5 弾 ~Microsoft の xDR で攻撃者を追え!!~​
 
.LNK Tears of the Kingdom
.LNK Tears of the Kingdom.LNK Tears of the Kingdom
.LNK Tears of the Kingdom
 
Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2Tuning TCP and NGINX on EC2
Tuning TCP and NGINX on EC2
 
Splunk Overview
Splunk OverviewSplunk Overview
Splunk Overview
 

Similar to Misery Metrics & Consequences

How NOT to Measure Latency
How NOT to Measure LatencyHow NOT to Measure Latency
How NOT to Measure LatencyC4Media
 
MeasureWorks - Velocity Europe - Real World Rum
MeasureWorks - Velocity Europe - Real World RumMeasureWorks - Velocity Europe - Real World Rum
MeasureWorks - Velocity Europe - Real World RumMeasureWorks
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Brian Brazil
 
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on AzureVSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on AzureRene Van Osnabrugge
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Coradiant
CoradiantCoradiant
Coradiantgigamon
 
When down is not good enough. SRE On Azure
When down is not good enough. SRE On AzureWhen down is not good enough. SRE On Azure
When down is not good enough. SRE On AzureRene Van Osnabrugge
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Brian Brazil
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaKeet Sugathadasa
 
Performance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and AppsPerformance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and AppsNinefold
 
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday season
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday seasonMeasureWorks - Shoppingtoday - 5 must-do's for the holiday season
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday seasonMeasureWorks
 
Hidden Costs of Chasing the Mythical 'Five Nines'
Hidden Costs of Chasing the Mythical 'Five Nines'Hidden Costs of Chasing the Mythical 'Five Nines'
Hidden Costs of Chasing the Mythical 'Five Nines'DevOpsDays DFW
 
DockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadDockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadKevin Crawley
 
MeasureWorks - Social Mentions as a Performance KPI
MeasureWorks - Social Mentions as a Performance KPIMeasureWorks - Social Mentions as a Performance KPI
MeasureWorks - Social Mentions as a Performance KPIMeasureWorks
 
Understanding Latency Response Time and Behavior
Understanding Latency Response Time and BehaviorUnderstanding Latency Response Time and Behavior
Understanding Latency Response Time and BehaviorzuluJDK
 
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and ToolsIntelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and ToolsAzul Systems Inc.
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteAlois Reitbauer
 
10 impact of uncertainty in matching
10 impact of uncertainty in matching10 impact of uncertainty in matching
10 impact of uncertainty in matchingRishi Mathur
 
Web Performance BootCamp 2013
Web Performance BootCamp 2013Web Performance BootCamp 2013
Web Performance BootCamp 2013Daniel Austin
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 

Similar to Misery Metrics & Consequences (20)

How NOT to Measure Latency
How NOT to Measure LatencyHow NOT to Measure Latency
How NOT to Measure Latency
 
MeasureWorks - Velocity Europe - Real World Rum
MeasureWorks - Velocity Europe - Real World RumMeasureWorks - Velocity Europe - Real World Rum
MeasureWorks - Velocity Europe - Real World Rum
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on AzureVSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Coradiant
CoradiantCoradiant
Coradiant
 
When down is not good enough. SRE On Azure
When down is not good enough. SRE On AzureWhen down is not good enough. SRE On Azure
When down is not good enough. SRE On Azure
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
 
Performance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and AppsPerformance hosting on Ninefold for Spree Stores and Apps
Performance hosting on Ninefold for Spree Stores and Apps
 
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday season
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday seasonMeasureWorks - Shoppingtoday - 5 must-do's for the holiday season
MeasureWorks - Shoppingtoday - 5 must-do's for the holiday season
 
Hidden Costs of Chasing the Mythical 'Five Nines'
Hidden Costs of Chasing the Mythical 'Five Nines'Hidden Costs of Chasing the Mythical 'Five Nines'
Hidden Costs of Chasing the Mythical 'Five Nines'
 
DockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadDockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is Dead
 
MeasureWorks - Social Mentions as a Performance KPI
MeasureWorks - Social Mentions as a Performance KPIMeasureWorks - Social Mentions as a Performance KPI
MeasureWorks - Social Mentions as a Performance KPI
 
Understanding Latency Response Time and Behavior
Understanding Latency Response Time and BehaviorUnderstanding Latency Response Time and Behavior
Understanding Latency Response Time and Behavior
 
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and ToolsIntelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
 
10 impact of uncertainty in matching
10 impact of uncertainty in matching10 impact of uncertainty in matching
10 impact of uncertainty in matching
 
Web Performance BootCamp 2013
Web Performance BootCamp 2013Web Performance BootCamp 2013
Web Performance BootCamp 2013
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 

More from ScyllaDB

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...ScyllaDB
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptxScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDBScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesScyllaDB
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsScyllaDB
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101ScyllaDB
 

More from ScyllaDB (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & Tradeoffs
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101
 

Recently uploaded

Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Recently uploaded (20)

Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Misery Metrics & Consequences

  • 1. Brought to you by Misery Metrics and Consequences Gil Tene CTO & co-Founder, Azul Systems @giltene
  • 2. Misery Metrics and Consequences Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 3. Agenda Some intro stuff How much of monitoring is simply broken How it’s so bad that there is no fixing it lights and tunnels appreciating misery
  • 4. Understanding Latency and Application Responsiveness Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 5. (Micro?)services and Response Time Behavior Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 6. How NOT to Measure Latency Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 7. The “Oh S@%#!” talk Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 8. How I learned to stop worrying and love Misery Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 9. How I learned to stop worrying and love Misery Gil Tene, CTO & co-Founder, Azul Systems @giltene
  • 11. Source: “Visualizing and Tracking your Microservices”, AppDynamics blog
  • 12. Source: “Architecting for Speed: How agile innovators accelerate growth through microservices”, Jesper Nordstrom & Tomas Betzholtzblog, 3gamma.com
  • 13. Source: “Microservices at Amazon”, Alan Ho (@karlunho), apigee developer blog, image from circa 2009
  • 14. This is a Microservice
  • 15. This is a Microservices Architecture
  • 16.
  • 17.
  • 18.
  • 19.
  • 20. ? ? ?
  • 21. Source: “Netflix OSS, Spring Cloud, or Kubernetes? How About All of Them!”, Christian Posta, DZone
  • 22.
  • 23.
  • 24.
  • 25.
  • 26. Remember: all I'm offering is the truth.
  • 27. We like to look at pretty charts… 95%’lie The “We only want to show good things” chart
  • 28.
  • 29. What you find when you dig deeper into the 9s
  • 30. What you find when you dig deeper into the 9s
  • 31.
  • 32.
  • 33.
  • 34. What (TF) does the Average of the 95%’lie mean?
  • 35. What (TF) does the Average of the 95%’lie mean? Lets do the same with 100%’ile; Suppose we have a set of 100%’ile values for each minute: [1, 0, 3, 1, 601, 4, 2, 8, 0, 3, 3, 1, 1, 0, 2] “The average 100%’ile over the past 15 minutes was 42” Same nonsense applies to averaging any other %’lie
  • 37. What percentile levels do YOU monitor?
  • 38. 99%’lie: a good indicator, right? What are the chances of a single web page view experiencing >99%’lie latency of: - A single search engine node? - A single Key/Value store node? - A single Database node? - A single CDN request? - A single (micro)service call?
  • 39.
  • 40.
  • 41. #LatencyTipOfTheDay: MOST page loads will experience worse than the 99%'lie server response So why are we measuring it?
  • 42. Gauging user experience Example: If a typical user session involves 5 page loads, averaging 40 resources per page. -How many of our users will NOT experience something worse than the 95%’lie of http requests? Answer: ~0.003% -How many of our users will NOT experience something worse than the 99%’lie of http requests? Answer: ~13%
  • 43. Gauging user experience Example: If a typical user session involves 5 page loads, averaging 40 resources per page. -What http response percentile will be experienced by the 95%’ile of users? Answer: ~99.97% -What http response percentile will be experienced by the 99%’ile of users Answer: ~99.995%
  • 44. Why don’t we have response time or latency stats with multiple 9s in them???
  • 45. You can’t average percentiles… And you also can’t get an hour’s 99.999%’lie out of lots of 10 second interval 99%’lie reports… Why don’t we have response time or latency stats with multiple 9s in them???
  • 46. You can’t average percentiles… And you also can’t get an hour’s 99.999%’lie out of lots of 10 second interval 99%’lie reports… Check out HdrHistogram It lets you have nice things…. Why don’t we have response time or latency stats with multiple 9s in them??? 0% 90% 99% 99.99% 99.999% 99.9999% 140 120 100 80 60 40 20 0 Hiccup Duration (msec) 99.9% Percentile Hiccups by Percentile Distribution Hiccups by Percentile SLA
  • 47. The coordinated omission problem An accidental conspiracy. . The lie in the 99%’lies
  • 48. Coordinated Omission in Monitoring Code Long operations only get measured once delays outside of timing window do not get measured at all
  • 49. How bad can this get? Avg. is 1 msec over 1st 100 sec System Stalled for 100 Sec System easily handles 100 requests/sec Responds to each in 1msec How would you characterize this system? Elapsed Time ~75%‘ile is 50 sec ~50%‘ile is 1 msec 99.99%‘ile is ~100sec Avg. is 50 sec. over next 100 sec Overall Average response time is ~25 sec.
  • 50. Measurement in practice System Stalled for 100 Sec System easily handles 100 requests/sec Responds to each in 1msec What actually gets measured? Elapsed Time 75%‘lie is 1 msec 99.99%‘lie is 1 msec 50%‘ile is 1 msec 10,000 measurements @ 1 msec each 1 measurement @ 100 sec Overall Average is 10.9 msec (!!!)
  • 51. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
  • 52. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec Coordinated Omission 1 msec 1 msec 1 result @ 100 sec
  • 53. “Better” can look “Worse” System easily handles 100 requests/sec Responds to each in 1msec 10,000 @ 1msec 10,000 @ 5 msec System Slowed for 100 Sec Still easily handles 100 requests/sec 50%‘ile is 1 msec 99.99%‘lie is ~5msec Responds to each in 5 msec Elapsed Time 75%‘lie is 2.5msec
  • 54. Service Time vs. Response Time
  • 55. Response Time vs. Service Time @60K/sec
  • 57.
  • 58. How does the world keep working ?!?!?
  • 59. Luckily, I think that there is a rainbow at the end of the tunnel…
  • 60. How does the world even keep working ?!?!?
  • 61. Something we are doing actually works…
  • 62. Something we are doing actually works…
  • 63.
  • 64. 3000 BC Graduated with honors Witch Doctor Academy
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70. Timeout s Retries Misses Abandoned Shopping Carts Loss of Engagemen t Failed Queries Angry Phone Calls
  • 71. It is nonsense to ask: “was my 99%’lie a timeout?” Instead Ask: “How many timeouts did I have?” & “What % of my requests timed out?”
  • 75. Love Misery Don’t aim to make a specific %’ile acceptable Instead, aim for an acceptable % of misery final #LatencyTipOfTheDay:
  • 76. Brought to you by Gil Tene @gilten e