MEASURE TO FAIL
WHY DO CUSTOMERS COMPLAIN WHEN THE GRAPHS SAY THE APPLICATION IS FAST?
SURVEY
• Do you…
• Use graphite?
• Feed it with Coda Hale/Dropwizard metrics?
• Modify their source? Use nonstandard options?
• Graph average? Median?
• Percentiles?
• Know the term “cargo cult”?
CARGO CULT
During the Middle Ages there were all kinds of
crazy ideas, such as that a piece of
rhinoceros horn would increase potency. Then a
method was discovered for separating the
ideas- which was to try one to see if it worked,
and if it didn't work, to eliminate it. This method
became organized, of course, into science. And
it developed very well, so that we are now in the
scientific age. It is such a scientific age, in fact,
that we have difficulty in understanding how
witch doctors could ever have existed, when
nothing that they proposed ever really worked-or
very little of it did.
Richard Feynman
From a Caltech commencement address
given in 1974
MEASURING CORRECTLY IS IMPORTANT
• You get what you measure
• Predictable is better than fast
• One page display requires multiple calls (static and
dynamic resources)
• Multiple microservices are called to generate response
• Each user will do hundreds of displays of your
webpages
WHY DO THIS?
• Every 100 ms increase in load time of Amazon.com
decreased sales by 1% [1]
• Increasing web search latency 100 to 400 ms reduces
the daily searches per user by 0.2% to 0.6%.
Furthermore, users do fewer searches the longer they
are exposed. For longer delays, the loss of searches
persists for a time even after latency returns to
previous levels. [2]
[1] Kohavi and Longbotham 2007
[2] Brutlag 2009
WHAT METRICS CAN WE USE?
graphite.send(prefix(name, "max"), ...);
graphite.send(prefix(name, "mean"), ...);
graphite.send(prefix(name, "min"), ...);
graphite.send(prefix(name, "stddev"), ...);
graphite.send(prefix(name, "p50"), ...);
graphite.send(prefix(name, "p75"), ...);
graphite.send(prefix(name, "p95"), ...);
graphite.send(prefix(name, "p98"), ...);
graphite.send(prefix(name, "p99"), ...);
graphite.send(prefix(name, "p999"), ...);
DON’T LOOK AT MEAN
• 1000 queries - 0ms latency, 100 queries - 5s latency
• Average is ~454.5ms
• 1000 queries - 1ms latency, 100 queries - 5s latency
• Average is 455ms
• Does not help to quantify lags users will experience
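A minimal sketch of the arithmetic behind these bullets, using only the JDK. The class and method names are made up for illustration. With the sample counts from the slide, the two scenarios give means of roughly 454.5 ms and 455.5 ms, while about 9% of requests take 5 s in both.

```java
import java.util.Arrays;

class MeanLatencyDemo {
    /** Builds a sample set: fastCount requests at fastMillis plus slowCount requests at slowMillis. */
    static long[] samples(int fastCount, long fastMillis, int slowCount, long slowMillis) {
        long[] all = new long[fastCount + slowCount];
        Arrays.fill(all, 0, fastCount, fastMillis);
        Arrays.fill(all, fastCount, all.length, slowMillis);
        return all;
    }

    /** Arithmetic mean in milliseconds; NaN for an empty sample set. */
    static double meanMillis(long[] latenciesMillis) {
        return Arrays.stream(latenciesMillis).average().orElse(Double.NaN);
    }

    /** Fraction of requests slower than the given threshold. */
    static double fractionSlowerThan(long[] latenciesMillis, long thresholdMillis) {
        long slow = Arrays.stream(latenciesMillis).filter(v -> v > thresholdMillis).count();
        return (double) slow / latenciesMillis.length;
    }

    public static void main(String[] args) {
        long[] scenario = samples(1000, 1, 100, 5000);
        System.out.println("mean ms: " + meanMillis(scenario));
        System.out.println("share of requests over 1 s: " + fractionSlowerThan(scenario, 1000));
    }
}
```

The mean sits between two groups of requests and describes neither of them.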
PLOTTING MEAN IS FOR SHOWING OFF TO MANAGEMENT
MAYBE MEDIAN THEN?
• What is the probability of an end user encountering
latency worse than the median?
• Remember: usually multiple requests are needed to
respond to an API call (e.g. N microservices, N
resource requests per page)
PROBABILITY OF EXPERIENCING LATENCY BETTER THAN MEDIAN
AS A FUNCTION OF THE NUMBER OF MICROSERVICES INVOLVED
WHICH PERCENTILE IS RELEVANT TO YOU?
• Is the 99th percentile a demanding constraint?
• In an application serving 1000 qps, latency worse than that happens ten
times per second.
• A user who needs to navigate through several web pages will most
probably experience it
• What is the probability of encountering latency better than the 99th percentile?
PROBABILITY OF EXPERIENCING LATENCY BETTER THAN 99TH PERCENTILE
AS A FUNCTION OF THE NUMBER OF MICROSERVICES INVOLVED
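The curves on both probability charts can be approximated with one line of arithmetic. The sketch below assumes request latencies are independent of each other, which is a simplification, so treat the numbers as a rough guide. The class and method names are hypothetical.

```java
class PercentileOddsDemo {
    /**
     * Probability that every one of `calls` requests is faster than the given quantile
     * (0.5 for the median, 0.99 for the 99th percentile), assuming independent latencies.
     */
    static double allFasterThan(double quantile, int calls) {
        return Math.pow(quantile, calls);
    }

    public static void main(String[] args) {
        int[] callCounts = {1, 2, 5, 10, 50, 100};
        for (int calls : callCounts) {
            System.out.printf("calls=%d median=%.4f p99=%.4f%n",
                    calls, allFasterThan(0.5, calls), allFasterThan(0.99, calls));
        }
    }
}
```

With 5 calls the chance of staying below the median on all of them is about 3%. With 100 calls the chance of never seeing latency worse than the 99th percentile is under 37%.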
DO NOT AVERAGE PERCENTILES
Example scenario:
1. Load balancer splits traffic unevenly (ELB anyone?)
2. Server S1 has 1 qps over measured time with 95%’ile == 1ms
3. Server S2 has 100 qps over measured time with 95%’ile == 10s
4. Average is ~5s.
5. What does that tell us?
6. Did we satisfy SLA if it says “95%’ile must be below 8s”?
7. Actual 95%’ile is ~10s
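The scenario above can be reproduced with a small JDK-only sketch. The sample values are assumptions chosen to match the slide (60 seconds of traffic, S1 always at 1 ms, S2 with 10% of requests at 10 s), and the nearest-rank percentile used here is one common definition, not necessarily the one your metrics library uses.

```java
import java.util.Arrays;

class AveragedPercentilesDemo {
    /** Nearest-rank percentile: the value at rank ceil(quantile * n) in sorted order. */
    static long percentile(long[] samples, double quantile) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(quantile * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** S1: 1 qps for 60 s, every request takes 1 ms (assumed values). */
    static long[] serverOne() {
        long[] s = new long[60];
        Arrays.fill(s, 1);
        return s;
    }

    /** S2: 100 qps for 60 s, 90% at 100 ms and 10% at 10 s (assumed values). */
    static long[] serverTwo() {
        long[] s = new long[6000];
        Arrays.fill(s, 0, 5400, 100);
        Arrays.fill(s, 5400, 6000, 10_000);
        return s;
    }

    static long[] merge(long[] a, long[] b) {
        long[] all = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, all, a.length, b.length);
        return all;
    }

    public static void main(String[] args) {
        long p1 = percentile(serverOne(), 0.95);
        long p2 = percentile(serverTwo(), 0.95);
        System.out.println("averaged 95%'ile ms: " + (p1 + p2) / 2.0);
        System.out.println("actual 95%'ile ms:   " + percentile(merge(serverOne(), serverTwo()), 0.95));
    }
}
```

The averaged figure lands near 5 s, while the percentile computed over all requests is 10 s.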
“If there's no meaning in it,” said the King, “that
saves a world of trouble, you know, as we
needn't try to find any.”
– ALICE'S ADVENTURES IN WONDERLAND
metricRegistry.timer("myapp.responseTime");
Standard timer will over- or under-report actual
percentiles at will.
Green line represents actual MAX values.
Blue line represents metric reported from Timer class
Green line represents request rate
TIMER, TIMER NEVER CHANGES…
• Timer values decay exponentially
• giving artificial smoothing of values for server behaviour that
may be long gone
• Timer that is not updated does not decay
• If Timer is not updated (e.g. subprocess failed and we
stopped sending requests to it) its values will remain constant
• Check this post for potential solutions:
taint.org/2014/01/16/145944a.html
TIMER’S HISTOGRAM RESERVOIR
• Backing storage for Timer’s data
• Contains a “statistically representative reservoir of a data stream”
• Default is ExponentiallyDecayingReservoir which has many
drawbacks and is the source of most inaccuracies observed
throughout this presentation
• Others include
• UniformReservoir, SlidingTimeWindowReservoir,
SlidingWindowReservoir
EXPONENTIALLY DECAYING RESERVOIR
• Assumes normal distribution of recorded values
• Stores 1028 random samples by default
• Many statistical tools applied in computer systems
monitoring will assume normal distribution
• Be suspicious of such tools
• Why is that a bad idea?
NORMAL DISTRIBUTION - WHY SO USEFUL?
• Central limit theorem
• Chebyshev's inequality
CALCULATE 95%’ILE BASED ON MEAN AND STD. DEV.
• IFF latency values were
distributed normally then
we could calculate any
percentile based on mean
and standard deviation
• Lookup into standard
normal (Z) table
• 95%’ile is located 1.65 std.
dev. from mean
• Result is 11.65 ms
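The lookup above boils down to mean + 1.65 × standard deviation. A minimal sketch follows. The inputs (mean 10 ms, standard deviation 1 ms) are assumptions, not values stated in the transcript, picked because they reproduce the result on the slide.

```java
class NormalPercentileDemo {
    /** z-score used on the slide for the 95th percentile of a standard normal distribution. */
    static final double Z_95 = 1.65;

    /** Percentile estimate that only holds if latencies really are normally distributed. */
    static double percentile95(double meanMillis, double stdDevMillis) {
        return meanMillis + Z_95 * stdDevMillis;
    }

    public static void main(String[] args) {
        // Assumed inputs: mean 10 ms, standard deviation 1 ms.
        System.out.println(percentile95(10.0, 1.0));
    }
}
```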
Latency profile resembling normal distribution…
Add spikes due to young gen GC pauses
Add spikes due to old gen GC pauses
Add spikes due to calling other services (like DB)
Add spikes due to: lost tcp packet retransmission,
disk swapping, kernel bookkeeping etc.
NORMAL DISTRIBUTION - WHY NOT APPLICABLE?
• The value of the normal distribution
is practically zero when the value x
lies more than a few standard
deviations away from the mean.
• It may not be an appropriate model
when one expects a significant
fraction of outliers
• […] other statistical inference
methods that are optimal for
normally distributed variables often
become highly unreliable when
applied to such data. [1]
[1] All quotes on this slide from Wikipedia
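To see how far the normal model can drift once outliers appear, here is a sketch with an assumed, deliberately spiky profile: 980 requests at 10 ms and 20 pauses of 2 s. The z-scores 1.65 and 2.33 are the usual table values for the 95th and 99th percentiles; everything else is made up for illustration.

```java
import java.util.Arrays;

class SpikyLatencyDemo {
    /** 980 requests at 10 ms plus 20 pauses of 2 s (assumed data). */
    static long[] spikySamples() {
        long[] s = new long[1000];
        Arrays.fill(s, 0, 980, 10);
        Arrays.fill(s, 980, 1000, 2000);
        return s;
    }

    static double mean(long[] v) {
        return Arrays.stream(v).average().orElse(Double.NaN);
    }

    /** Population standard deviation. */
    static double stdDev(long[] v) {
        double m = mean(v);
        double sumSq = 0;
        for (long x : v) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / v.length);
    }

    /** Percentile predicted by a normal model: mean + z * standard deviation. */
    static double normalEstimate(long[] v, double z) {
        return mean(v) + z * stdDev(v);
    }

    /** Nearest-rank percentile computed from the samples themselves. */
    static long percentile(long[] v, double quantile) {
        long[] sorted = v.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(quantile * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] s = spikySamples();
        System.out.println("95%'ile normal model: " + normalEstimate(s, 1.65) + " actual: " + percentile(s, 0.95));
        System.out.println("99%'ile normal model: " + normalEstimate(s, 2.33) + " actual: " + percentile(s, 0.99));
    }
}
```

On this data the model overestimates the 95th percentile (about 510 ms against an actual 10 ms) and underestimates the 99th (about 700 ms against an actual 2 s).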
HDRHISTOGRAM
• Supports recording and analysis of sampled data across
configurable range with configurable accuracy
• Provides compact representation of data while retaining
high resolution
• Allows configurable tradeoffs between space and accuracy
• Very fast, allocation free, not thread safe for maximum
speed (thread safe versions available)
• Created by Gil Tene of Azul Systems
RECORDER
• Uses HdrHistogram to store values
• Supports concurrent recording of values
• Recording is lock free but also wait free on most
architectures (that support lock xadd)
• Reading is not lock free but does not stall writers (writer-
reader phaser)
• Check out Marshall Pierce’s library for using it as a
Reservoir implementation
JMH benchmarks (from my laptop, caveat emptor!)
SOLUTIONS
• Instantiate Timer with custom reservoir
• new ExponentiallyDecayingReservoir(LARGE_NUMBER)
• new SlidingTimeWindowReservoir(1, MINUTES)
• new HdrHistogramResetOnSnapshotReservoir()
• Only last one is safe and accurate and will not report stale values
if no updates were made
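Why a reset-on-snapshot reservoir cannot report stale values is easiest to see in a toy version. The class below is a hypothetical JDK-only stand-in for the idea, not the Dropwizard or HdrHistogram API: each snapshot hands back only what was recorded since the previous one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a reset-on-snapshot reservoir; not a real library class. */
class ResetOnSnapshotSamples {
    private List<Long> current = new ArrayList<>();

    /** Records one measured value in the current reporting interval. */
    synchronized void update(long value) {
        current.add(value);
    }

    /** Returns the values recorded since the previous snapshot and starts a fresh interval. */
    synchronized List<Long> snapshotAndReset() {
        List<Long> interval = current;
        current = new ArrayList<>();
        return interval;
    }

    public static void main(String[] args) {
        ResetOnSnapshotSamples samples = new ResetOnSnapshotSamples();
        samples.update(5);
        samples.update(7);
        System.out.println("first report:  " + samples.snapshotAndReset());
        // No updates arrived before the second report, so nothing old is repeated.
        System.out.println("second report: " + samples.snapshotAndReset());
    }
}
```

A real implementation would store values in a histogram rather than a list; the list only keeps the sketch short.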
SMOKING BENCHMARKING IS THE LEADING CAUSE OF STATISTICS IN THE WORLD
COORDINATED OMISSION
• When the load driver is plotting with the system under test to
deceive you
• Most tools do this
• Most benchmarks do this
• Yahoo Cloud Serving Benchmark had that problem [1]
[1] Recently fixed by Nitsan Wakart, see
psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html
– CREATED WITH GIL TENE'S HDRHISTOGRAM PLOTTING SCRIPT
Effects on benchmarks at high percentiles are
spectacular
COORDINATED OMISSION SOLUTIONS
1. Ignore the problem!
Perfectly fine for non-interactive systems where only
throughput matters
COORDINATED OMISSION SOLUTIONS
2. Correct it mathematically in sampling mechanism
HdrHistogram can correct CO with these methods
(choose one!):
histogram.recordValueWithExpectedInterval(
value,
expectedIntervalBetweenSamples
);
histogram.copyCorrectedForCoordinatedOmission(
expectedIntervalBetweenSamples
);
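Roughly what an expected-interval correction does, as a JDK-only sketch: when a recorded value is longer than the expected interval between samples, it also records the values that the requests missed during the stall would have seen. This is a simplified illustration of the idea with hypothetical names, not HdrHistogram's actual code.

```java
import java.util.ArrayList;
import java.util.List;

class ExpectedIntervalCorrectionSketch {
    /**
     * Returns the measured value plus synthetic samples for requests that should have been
     * issued, one per expected interval, while the measured request was still in flight.
     */
    static List<Long> recordWithExpectedInterval(long value, long expectedInterval) {
        List<Long> recorded = new ArrayList<>();
        recorded.add(value);
        if (expectedInterval <= 0) {
            return recorded; // no correction possible without a positive interval
        }
        for (long missing = value - expectedInterval;
             missing >= expectedInterval;
             missing -= expectedInterval) {
            recorded.add(missing);
        }
        return recorded;
    }

    public static void main(String[] args) {
        // A 1000 ms stall while requests were expected every 10 ms.
        System.out.println(recordWithExpectedInterval(1000, 10).size() + " samples recorded");
        // A value below the expected interval is recorded once, unchanged.
        System.out.println(recordWithExpectedInterval(5, 10));
    }
}
```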
COORDINATED OMISSION SOLUTIONS
3. Correct it on load driver side
by noticing pauses between sent requests.
A newly issued request gets a timer that starts
counting from the time it should have been sent but wasn't
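A simulated single-threaded load driver shows the difference. All parameters are assumptions: 1000 requests intended every 10 ms, 1 ms service time, and one 1 s stall at request 50. The clock is simulated, so nothing actually sleeps.

```java
import java.util.Arrays;

class LoadDriverCorrectionDemo {
    static final int REQUESTS = 1000;
    static final long INTERVAL_MS = 10;    // intended gap between requests
    static final long SERVICE_MS = 1;      // normal service time
    static final int STALLED_REQUEST = 50; // request that hits the pause
    static final long STALL_MS = 1000;

    /** Returns {naive, corrected} latencies; corrected ones count from the intended send time. */
    static long[][] simulate() {
        long[] naive = new long[REQUESTS];
        long[] corrected = new long[REQUESTS];
        long now = 0; // simulated clock in milliseconds
        for (int i = 0; i < REQUESTS; i++) {
            long intendedStart = i * INTERVAL_MS;
            long actualStart = Math.max(now, intendedStart); // driver blocks until previous reply
            long service = (i == STALLED_REQUEST) ? STALL_MS : SERVICE_MS;
            long end = actualStart + service;
            naive[i] = end - actualStart;
            corrected[i] = end - intendedStart;
            now = end;
        }
        return new long[][] {naive, corrected};
    }

    /** Nearest-rank percentile. */
    static long percentile(long[] v, double quantile) {
        long[] sorted = v.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(quantile * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[][] result = simulate();
        System.out.println("naive 99%'ile ms:     " + percentile(result[0], 0.99));
        System.out.println("corrected 99%'ile ms: " + percentile(result[1], 0.99));
    }
}
```

The naive timer sees one slow request out of a thousand and reports a 99th percentile of 1 ms. Counting from the intended send time exposes the queue that built up behind the stall, and the 99th percentile rises to hundreds of milliseconds.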
COORDINATED OMISSION SOLUTIONS
4. Fail the test
for hard real-time systems where a pause causes
human casualties (brakes, pacemakers, Phalanx system)
COORDINATED OMISSION
• Mathematical solutions can overcorrect when load driver
has pauses (e.g. GC).
• Do not account for the fact that server after pause has no
work to do instead of N more requests waiting to be
executed
• In real world it might have never recovered
• Most tools ignore the problem
• Notable exception: Twitter Iago
“Do not bend to the tyranny of reality”
– LOAD DRIVER MOTTO
SUMMARY
• Measure what is meaningful not just what is measurable
• Set SLA before testing and creating dashboards
• Do not trust Timer class, use custom reservoirs, HdrHistogram,
Recorder, never trust EWMA for request rate
• Do not average percentiles unless you need a random number
generator
• Do not plot averages unless you just want to look good on dashboards
• When load testing be aware of coordinated omission
SOURCES, THANK YOUS AND RECOMMENDED FOLLOW-UPS
• Coda Hale for great metrics library
• Gil Tene
• latencytipoftheday.blogspot.de
• www.infoq.com/presentations/latency-pitfalls
• github.com/HdrHistogram/HdrHistogram
• Nitsan Wakart
• psy-lob-saw.blogspot.de/2015/03/fixing-ycsb-coordinated-omission.html
• and whole blog
• Martin Thompson et al.
• groups.google.com/forum/#!forum/mechanical-sympathy
RECOMMENDED
Great introduction to statistics
and queueing theory.
Performance Modeling and
Design of Computer Systems:
Queueing Theory in Action
Prof. Mor Harchol-Balter

4Developers 2015: Measure to fail - Tomasz Kowalczewski

Philip Schwarz
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 

Recently uploaded (20)

Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 

4Developers 2015: Measure to fail - Tomasz Kowalczewski

  • 1. MEASURE TO FAIL: Why do customers complain when the charts say the application is fast?
  • 2-8. SURVEY • Do you… • Use Graphite? • Feed it with Coda Hale/Dropwizard metrics? • Modify their source? Use nonstandard options? • Graph the average? The median? • Percentiles? • Know the term “cargo cult”?
  • 9. CARGO CULT “During the Middle Ages there were all kinds of crazy ideas, such as that a piece of rhinoceros horn would increase potency. Then a method was discovered for separating the ideas, which was to try one to see if it worked, and if it didn't work, to eliminate it. This method became organized, of course, into science. And it developed very well, so that we are now in the scientific age. It is such a scientific age, in fact, that we have difficulty in understanding how witch doctors could ever have existed, when nothing that they proposed ever really worked, or very little of it did.” Richard Feynman, from a Caltech commencement address given in 1974
  • 10. MEASURING CORRECTLY IS IMPORTANT • You get what you measure • Predictable is better than fast • Displaying one page requires multiple calls (static and dynamic resources) • Multiple microservices are called to generate a response • Each user will display your web pages hundreds of times
  • 11. WHY DO THIS? • Every 100 ms increase in load time of Amazon.com decreased sales by 1% [1] • Increasing web search latency 100 to 400 ms reduces the daily searches per user by 0.2% to 0.6%. Furthermore, users do fewer searches the longer they are exposed. For longer delays, the loss of searches persists for a time even after latency returns to previous levels. [2] • [1] Kohavi and Longbotham 2007 • [2] Brutlag 2009
  • 12. WHAT METRICS CAN WE USE? graphite.send(prefix(name, "max"), ...); graphite.send(prefix(name, "mean"), ...); graphite.send(prefix(name, "min"), ...); graphite.send(prefix(name, "stddev"), ...); graphite.send(prefix(name, "p50"), ...); graphite.send(prefix(name, "p75"), ...); graphite.send(prefix(name, "p95"), ...); graphite.send(prefix(name, "p98"), ...); graphite.send(prefix(name, "p99"), ...); graphite.send(prefix(name, "p999"), ...);
  • 13. DON’T LOOK AT MEAN • 1000 queries at 0 ms latency, 100 queries at 5 s latency • Average is about 454.5 ms • 1000 queries at 1 ms latency, 100 queries at 5 s latency • Average is about 455 ms • The mean does not help to quantify the lags users will experience
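To make the point concrete, here is a minimal Java sketch that builds the scenario of 1000 queries at 1 ms plus 100 queries at 5 s and compares the mean with the median and the 95th percentile. The class name, helper names, and the nearest-rank percentile definition are illustrative assumptions, not anything from the talk.

```java
import java.util.Arrays;

class MeanVersusPercentiles {
    // Arithmetic mean of all samples, in milliseconds.
    static double mean(long[] samplesMs) {
        double sum = 0;
        for (long s : samplesMs) {
            sum += s;
        }
        return sum / samplesMs.length;
    }

    // Nearest-rank percentile: the smallest sample such that
    // at least p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
    }

    // 1000 queries at 1 ms followed by 100 queries at 5 s (5000 ms).
    static long[] scenario() {
        long[] samples = new long[1100];
        Arrays.fill(samples, 0, 1000, 1L);
        Arrays.fill(samples, 1000, 1100, 5000L);
        return samples;
    }

    public static void main(String[] args) {
        long[] s = scenario();
        System.out.println("mean ms = " + mean(s));
        System.out.println("p50 ms  = " + percentile(s, 50));
        System.out.println("p95 ms  = " + percentile(s, 95));
    }
}
```

The mean comes out at roughly 455 ms, a latency that no request actually had: the median is 1 ms and the 95th percentile is 5 s.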
  • 14. PLOTTING MEAN IS FOR SHOWING OFF TO MANAGEMENT
  • 15. MAYBE MEDIAN THEN? • What is the probability of an end user encountering latency worse than the median? • Remember: usually multiple requests are needed to respond to an API call (e.g. N microservices, N resource requests per page)
  • 16. PROBABILITY OF EXPERIENCING LATENCY BETTER THAN MEDIAN AS A FUNCTION OF THE NUMBER OF MICROSERVICES INVOLVED (chart)
  • 17. WHICH PERCENTILE IS RELEVANT TO YOU? • Is the 99th percentile a demanding constraint? • In an application serving 1000 qps, latency worse than that happens ten times per second • A user who needs to navigate through several web pages will most probably experience it • What is the probability of encountering latency better than the 99th percentile?
  • 18. PROBABILITY OF EXPERIENCING LATENCY BETTER THAN 99TH PERCENTILE AS A FUNCTION OF THE NUMBER OF MICROSERVICES INVOLVED (chart)
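The two charts can be approximated with a few lines of Java. This sketch assumes the N calls behind one user action are statistically independent, which is a simplification; the class and method names are hypothetical.

```java
class TailExposure {
    // Probability that all `calls` independent requests finish faster than
    // the given quantile (0.5 = median, 0.99 = 99th percentile).
    static double probabilityAllBetterThan(double quantile, int calls) {
        return Math.pow(quantile, calls);
    }

    // Smallest number of independent requests for which the chance of
    // staying below the quantile on every request drops to 50% or less.
    static int callsUntilLikelyWorse(double quantile) {
        return (int) Math.ceil(Math.log(0.5) / Math.log(quantile));
    }

    public static void main(String[] args) {
        int[] callCounts = {1, 2, 5, 10, 50, 100};
        for (int n : callCounts) {
            System.out.println(n + " calls: better than median on all = "
                    + probabilityAllBetterThan(0.5, n)
                    + ", better than p99 on all = "
                    + probabilityAllBetterThan(0.99, n));
        }
        System.out.println("calls until p99 is likely exceeded: "
                + callsUntilLikelyWorse(0.99));
    }
}
```

Under this assumption, five calls already leave only about a 3% chance that every one of them beats the median, and after 69 calls a user is more likely than not to have seen at least one response slower than the 99th percentile.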
  • 19. DO NOT AVERAGE PERCENTILES Example scenario: 1. Load balancer splits traffic unevenly (ELB anyone?) 2. Server S1 has 1 qps over the measured time with 95%’ile == 1 ms 3. Server S2 has 100 qps over the measured time with 95%’ile == 10 s 4. Average is ~5 s. 5. What does that tell us? 6. Did we satisfy the SLA if it says “95%’ile must be below 8 s”? 7. The actual 95%’ile is ~10 s
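The scenario can be reproduced with a small Java sketch. The 60-second window and the shape of S2's distribution (94% of requests at 100 ms, 6% at 10 s) are assumptions chosen only so that S2's 95th percentile is 10 s; all names are illustrative.

```java
import java.util.Arrays;

class AveragedPercentiles {
    // Nearest-rank percentile over a sorted copy of the samples.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
    }

    // `fastCount` samples at fastMs followed by (total - fastCount) samples at slowMs.
    static long[] samples(int total, int fastCount, long fastMs, long slowMs) {
        long[] out = new long[total];
        Arrays.fill(out, 0, fastCount, fastMs);
        Arrays.fill(out, fastCount, total, slowMs);
        return out;
    }

    // Concatenates the raw samples of two servers.
    static long[] merge(long[] a, long[] b) {
        long[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        long[] s1 = samples(60, 60, 1L, 1L);            // S1: 1 qps for 60 s, all 1 ms
        long[] s2 = samples(6000, 5640, 100L, 10000L);  // S2: 100 qps for 60 s (assumed shape)
        double averaged = (percentile(s1, 95) + percentile(s2, 95)) / 2.0;
        System.out.println("average of per-server p95, ms = " + averaged);
        System.out.println("p95 of merged samples, ms     = " + percentile(merge(s1, s2), 95));
    }
}
```

The averaged figure is about 5 s and would pass an 8 s SLA check, while the percentile computed over all requests is 10 s. Percentiles have to be computed from merged raw data or merged histograms, not from per-server percentiles.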
  • 20. “If there's no meaning in it,” said the King, “that saves a world of trouble, you know, as we needn't try to find any.” (Alice's Adventures in Wonderland)
  • 21-22. metricRegistry.timer("myapp.responseTime"); The standard Timer will over- or under-report actual percentiles at will. The green line represents actual MAX values.
  • 23. The blue line represents the metric reported by the Timer class. The green line represents the request rate.
  • 24. TIMER, TIMER NEVER CHANGES… • Timer values decay exponentially • giving artificial smoothing of values for server behaviour that may be long gone • A Timer that is not updated does not decay • If a Timer is not updated (e.g. a subprocess failed and we stopped sending requests to it), its values will remain constant • Check this post for potential solutions: taint.org/2014/01/16/145944a.html
  • 25. TIMER’S HISTOGRAM RESERVOIR • Backing storage for the Timer’s data • Contains a “statistically representative reservoir of a data stream” • The default is ExponentiallyDecayingReservoir, which has many drawbacks and is the source of most inaccuracies observed throughout this presentation • Others include • UniformReservoir, SlidingTimeWindowReservoir, SlidingWindowReservoir
  • 26. EXPONENTIALLY DECAYING RESERVOIR • Assumes normal distribution of recorded values • Stores 1028 random samples by default • Many statistical tools applied in computer systems monitoring will assume normal distribution • Be suspicious of such tools • Why is that a bad idea?
  • 27. NORMAL DISTRIBUTION - WHY SO USEFUL? • Central limit theorem • Chebyshev's inequality
  • 28. CALCULATE 95%’ILE BASED ON MEAN AND STD. DEV. • If (and only if) latency values were distributed normally, we could calculate any percentile from the mean and standard deviation • Look it up in the standard normal (Z) table • The 95%’ile is located 1.65 std. dev. above the mean • Result is 11.65 ms
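A short Java sketch of this calculation, and of how it breaks once the data is not normal. The 11.65 ms result is consistent with a mean of 10 ms and a standard deviation of 1 ms; those two inputs are inferred, since the chart is not part of the transcript. The spiky sample set (900 requests at 10 ms, 100 at 200 ms) is invented for illustration.

```java
import java.util.Arrays;

class NormalAssumption {
    static final double Z_95 = 1.65; // z-score for the 95th percentile, as used above

    static double mean(long[] samplesMs) {
        double sum = 0;
        for (long s : samplesMs) {
            sum += s;
        }
        return sum / samplesMs.length;
    }

    // Population standard deviation.
    static double stdDev(long[] samplesMs) {
        double m = mean(samplesMs);
        double squares = 0;
        for (long s : samplesMs) {
            squares += (s - m) * (s - m);
        }
        return Math.sqrt(squares / samplesMs.length);
    }

    // 95th percentile under the assumption that latency is normally distributed.
    static double normalP95(double mean, double stdDev) {
        return mean + Z_95 * stdDev;
    }

    // 95th percentile read directly from the data (nearest rank).
    static long actualP95(long[] samplesMs) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.95 * sorted.length);
        return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
    }

    // Hypothetical latency profile with pauses: 900 x 10 ms, 100 x 200 ms.
    static long[] spikySamples() {
        long[] out = new long[1000];
        Arrays.fill(out, 0, 900, 10L);
        Arrays.fill(out, 900, 1000, 200L);
        return out;
    }

    public static void main(String[] args) {
        System.out.println("normal case, ms = " + normalP95(10.0, 1.0));
        long[] spiky = spikySamples();
        System.out.println("estimated p95, ms = " + normalP95(mean(spiky), stdDev(spiky)));
        System.out.println("actual p95, ms    = " + actualP95(spiky));
    }
}
```

For the spiky data the normal-based estimate is about 123 ms, a latency no request experienced, while the actual 95th percentile is 200 ms.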
  • 29. Latency profile resembling normal distribution…
  • 30. Add spikes due to young gen GC pauses
  • 31. Add spikes due to old gen GC pauses
  • 32. Add spikes due to calling other services (like DB)
  • 33. Add spikes due to: retransmission of lost TCP packets, disk swapping, kernel bookkeeping etc.
  • 34. NORMAL DISTRIBUTION - WHY NOT APPLICABLE? • The value of the normal distribution is practically zero when the value x lies more than a few standard deviations away from the mean. • It may not be an appropriate model when one expects a significant fraction of outliers • […] other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. [1] • [1] All quotes on this slide are from Wikipedia
  • 35. HDRHISTOGRAM • Supports recording and analysis of sampled data across a configurable range with configurable accuracy • Provides a compact representation of data while retaining high resolution • Allows configurable tradeoffs between space and accuracy • Very fast, allocation free, not thread safe for maximum speed (thread safe versions available) • Created by Gil Tene of Azul Systems
  • 36. RECORDER • Uses HdrHistogram to store values • Supports concurrent recording of values • Recording is lock free and also wait free on most architectures (those that support lock xadd) • Reading is not lock free but does not stall writers (writer-reader phaser) • Check out Marshall Pierce’s library for using it as a Reservoir implementation
  • 37. JMH benchmarks (from my laptop, caveat emptor!)
  • 38. SOLUTIONS • Instantiate Timer with a custom reservoir • new ExponentiallyDecayingReservoir(LARGE_NUMBER) • new SlidingTimeWindowReservoir(1, MINUTES) • new HdrHistogramResetOnSnapshotReservoir() • Only the last one is safe and accurate and will not report stale values if no updates were made
  • 39. SMOKING BENCHMARKING IS THE LEADING CAUSE OF STATISTICS IN THE WORLD
  • 40. COORDINATED OMISSION • When the load driver is plotting with the system under test to deceive you • Most tools do this • Most benchmarks do this • Yahoo Cloud Serving Benchmark had that problem [1] • [1] Recently fixed by Nitsan Wakart, see psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html
  • 42. Effects on benchmarks at high percentiles are spectacular (chart created with Gil Tene's HdrHistogram plotting script)
  • 43. COORDINATED OMISSION SOLUTIONS 1. Ignore the problem! Perfectly fine for a non-interactive system where only throughput matters
  • 44. COORDINATED OMISSION SOLUTIONS 2. Correct it mathematically in the sampling mechanism. HdrHistogram can correct CO with these methods (choose one!): histogram.recordValueWithExpectedInterval(value, expectedIntervalBetweenSamples); histogram.copyCorrectedForCoordinatedOmission(expectedIntervalBetweenSamples);
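The idea behind such a correction can be sketched without the library. The class below is a standalone illustration, not HdrHistogram's code: when one recorded value is longer than the expected interval between samples, it also records the latencies that the requests that should have been sent during the stall would have observed. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ExpectedIntervalCorrection {
    // Records valueMs and back-fills the samples omitted while the
    // load driver was blocked waiting for this one response.
    static void recordWithExpectedInterval(List<Long> samples, long valueMs, long expectedIntervalMs) {
        samples.add(valueMs);
        if (expectedIntervalMs <= 0) {
            return; // no correction requested
        }
        for (long missing = valueMs - expectedIntervalMs;
             missing >= expectedIntervalMs;
             missing -= expectedIntervalMs) {
            samples.add(missing);
        }
    }

    public static void main(String[] args) {
        List<Long> samples = new ArrayList<Long>();
        // One request stalled for 1000 ms while a request was expected every 10 ms.
        recordWithExpectedInterval(samples, 1000L, 10L);
        System.out.println("samples recorded for one stall: " + samples.size());
    }
}
```

A single 1000 ms stall at a 10 ms expected interval turns into 100 samples (1000, 990, ..., 10), so the stall carries weight in the high percentiles instead of counting as one slow request.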
  • 45. COORDINATED OMISSION SOLUTIONS 3. Correct it on the load driver side by noticing pauses between sent requests. A newly issued request will have a timer that starts counting from the time it should have been sent but wasn't
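A minimal simulation of that approach, assuming a single-threaded load driver that intends to send one request every fixed interval and blocks on each response. All names are illustrative; the service times are made-up inputs.

```java
class IntendedStartTiming {
    // Returns two arrays: [0] latencies measured from the actual send time,
    // [1] latencies measured from the time each request should have been sent.
    static long[][] simulate(long intervalMs, long[] serviceTimesMs) {
        int n = serviceTimesMs.length;
        long[] naive = new long[n];
        long[] corrected = new long[n];
        long previousFinish = 0;
        for (int i = 0; i < n; i++) {
            long intendedSend = i * intervalMs;
            // The driver cannot send before the previous response has arrived.
            long actualSend = Math.max(intendedSend, previousFinish);
            long finish = actualSend + serviceTimesMs[i];
            naive[i] = finish - actualSend;
            corrected[i] = finish - intendedSend;
            previousFinish = finish;
        }
        return new long[][] {naive, corrected};
    }

    public static void main(String[] args) {
        long[][] result = simulate(10L, new long[] {1, 100, 1, 1, 1});
        System.out.println("naive:     " + java.util.Arrays.toString(result[0]));
        System.out.println("corrected: " + java.util.Arrays.toString(result[1]));
    }
}
```

With one 100 ms stall, the naive measurement reports a single slow request and three 1 ms responses after it. Measured from the intended send times, those three requests took 91, 82 and 73 ms, which is what a user arriving on schedule would have experienced.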
  • 46. COORDINATED OMISSION SOLUTIONS 4. Fail the test. For hard real-time systems where a pause causes human casualties (brakes, pacemakers, Phalanx system)
  • 47. COORDINATED OMISSION • Mathematical solutions can overcorrect when the load driver has pauses (e.g. GC) • They do not account for the fact that after a pause the server has no work to do, instead of N more requests waiting to be executed • In the real world it might never have recovered • Most tools ignore the problem • Notable exception: Twitter Iago
  • 48. “Do not bend to the tyranny of reality” (load driver motto)
  • 49. SUMMARY • Measure what is meaningful, not just what is measurable • Set the SLA before testing and creating dashboards • Do not trust the Timer class; use custom reservoirs, HdrHistogram, Recorder; never trust EWMA for request rate • Do not average percentiles unless you need a random number generator • Do not plot averages unless you just want to look good on dashboards • When load testing, be aware of coordinated omission
  • 50. SOURCES, THANK YOUS AND RECOMMENDED FOLLOW UPS • Coda Hale for the great metrics library • Gil Tene • latencytipoftheday.blogspot.de • www.infoq.com/presentations/latency-pitfalls • github.com/HdrHistogram/HdrHistogram • Nitsan Wakart • psy-lob-saw.blogspot.de/2015/03/fixing-ycsb-coordinated-omission.html • and the whole blog • Martin Thompson et al. • groups.google.com/forum/#!forum/mechanical-sympathy
  • 51. RECOMMENDED Great introduction to statistics and queueing theory: Performance Modeling and Design of Computer Systems: Queueing Theory in Action, Prof. Mor Harchol-Balter