
How NOT to Measure Latency

Understanding Java application responsiveness and latency is critical not only for delivering good application behavior but also for maintaining profitability and containing risk. But when measurements present false or misleading information, even the best analysis can lead to wrong operational decisions and a poor application experience. This presentation discusses common pitfalls encountered in measuring and characterizing latency, and ways to address them using some recently open-sourced tools.

Twitter: @AzulSystems
Web: azul.com

How NOT to Measure Latency

  1. How NOT to Measure Latency. Matt Schuetze, Product Management Director, Azul Systems. QCon NY, Brooklyn, New York, 6/12/2015.
  2. Understanding Latency and Application Responsiveness.
  3. The Oh $@%T! talk.
  4. About me: Matt Schuetze. Product Management Director at Azul Systems; translate Voice of Customer into Zing and Zulu requirements and work items; sing the praises of Azul efforts through product launches; Azul alternate on the JCP executive committee, co-lead of the Detroit Java User Group; stand on the shoulders of giants and admit it.
  5. Philosophy and motivation: what do we actually care about, and why?
  6. Latency behavior. Latency: the time it took one operation to happen. Each operation occurrence has its own latency. What we care about is how latency behaves, and behavior is a lot more than "the common case was X".
  7. We like to look at charts: the "We only want to show good things" chart, cut off at the 95%'ile.
  8. What do you care about? Do you: care about latency in your system? Care about the worst case? Care about the 99.99%'ile? Only care about the fastest thing in the day? Only care about the best 50%? Only need 90% of operations to meet requirements?
  9. We like to rant about latency.
  10. "Outliers", "averages" and other nonsense: the 99%'ile is ~60 µsec, but the mean is ~210 µsec. We nicknamed these spikes "hiccups".
  11. Dispelling standard deviation.
  12. Dispelling standard deviation: mean = 0.06 msec; std. deviation (σ) = 0.21 msec; 99.999%'ile = 38.66 msec. In a normal distribution the 99.999%'ile falls within 4.5 σ of the mean; here it is ~184 σ (!!!) away. These are NOT normal distributions.
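Spelling out the slide's arithmetic: the observed 99.999%'ile sits at

    (38.66 − 0.06) / 0.21 ≈ 184

standard deviations above the mean, whereas the 99.999%'ile of a true normal distribution lies only about 4.3 σ out (Φ⁻¹(0.99999) ≈ 4.26). That gap is the point: σ-based reasoning assumes a shape these latency distributions simply do not have.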
  13. Is the 99%'ile "rare"?
  14. Cumulative probability: what are the chances of a single web page view experiencing the 99%'ile latency of a single search engine node? A single key/value store node? A single database node? A single CDN request?
  15. (chart only)
  16. Which HTTP response time metric is more "representative" of user experience: the 95%'ile or the 99.9%'ile?
  17. Gauging user experience. Example: a typical user session involves 5 page loads, averaging 40 resources per page. How many of our users will NOT experience something worse than the 95%'ile? Answer: ~0.003%. How many of our users will experience at least one response longer than the 99.9%'ile? Answer: ~18%.
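The arithmetic behind those answers, assuming for simplicity that the 200 requests in a session are independent:

    public class SessionOdds {
        public static void main(String[] args) {
            int requests = 5 * 40; // 5 page loads x 40 resources each = 200 requests

            // Chance that ALL 200 requests beat the 95%'ile:
            double allBeat95 = Math.pow(0.95, requests);         // ~0.000035 -> "~0.003%"

            // Chance that at least one request is worse than the 99.9%'ile:
            double anyOver999 = 1.0 - Math.pow(0.999, requests); // ~0.181    -> "~18%"

            System.out.printf("all <= 95%%'ile:  %.4f%%%n", allBeat95 * 100);
            System.out.printf("any > 99.9%%'ile: %.1f%%%n", anyOver999 * 100);
        }
    }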
  18. Classic look at response time behavior: response time as a function of load. Average? Max? Median? 90%'ile? 99.9%'ile? (Source: IBM CICS server documentation, "understanding response times".)
  19. Hiccups are strongly multi-modal. They don't look anything like a normal distribution; they are a complete shift from one mode/behavior to another. Mode A: "good". Mode B: "somewhat bad". Mode C: "terrible", ... The real world is not a gentle, smooth curve; mode transitions are "phase changes".
  20. Proven ways to deal with hiccups: actually characterizing latency. (Chart: response time percentile plot line against requirements.)
  21. Comparing behavior: different throughputs, configurations, or other parameters on one graph.
  22. Shameless bragging.
  23. Comparing behaviors, actual: a latency-sensitive messaging distribution application, HotSpot vs. Zing.
  24. Zing: a standards-compliant JVM for Linux/x86 servers. Eliminates garbage collection as a concern for enterprise applications in Java, Scala, or any JVM language. Very wide operating range: used in both low latency and large scale enterprise application spaces. Decouples scale metrics (transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.) from response time concerns. Leverages elastic memory for resilient operation.
  25. What is Zing good for? If you have a server-based Java application, you are running on Linux, and you are using more than ~300MB of memory (up to as high as 1TB), then Zing will likely deliver superior behavior metrics.
  26. Where Zing shines. Low latency: eliminate behavior blips down to the sub-millisecond-units level. Machine-to-machine "stuff": support higher *sustainable* throughput (one that meets SLAs) for messaging, queues, market data feeds, fraud detection, analytics. Human response times: eliminate user-annoying response time blips (multi-second and even fraction-of-a-second blips will be completely gone); support larger memory JVMs *if needed* (e.g. larger virtual user counts, larger cache, in-memory state, or consolidating multiple instances). "Large" data and in-memory analytics: make batch stuff "business real time" and gain super-efficiencies; Cassandra, Spark, Solr, DataGrid, any large dataset in fast motion.
  27. An accidental conspiracy: the coordinated omission problem.
  28. The coordinated omission problem. Common load testing example: each "client" issues requests at a certain rate and measures/logs the response time for each request. So what's wrong with that? It only works if ALL responses fit within the issuing interval; otherwise the client implicitly coordinates with the system under test by "automatically backing off". Begin audience participation exercise now…
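A minimal sketch of the mistake and of one common correction (illustrative names only; issueRequest() stands in for whatever the load generator actually calls):

    import java.util.concurrent.locks.LockSupport;

    public class CoordinatedOmissionDemo {
        static final long INTERVAL_NS = 1_000_000L; // intended rate: 1 request per msec

        // WRONG: measures from the actual send time. A long stall is recorded
        // in only ONE sample, and no requests are even attempted while the
        // system is stalled, so the bad period is massively under-sampled.
        static void naive(Runnable issueRequest, long[] samples) {
            for (int i = 0; i < samples.length; i++) {
                long start = System.nanoTime();
                issueRequest.run();
                samples[i] = System.nanoTime() - start;
                LockSupport.parkNanos(INTERVAL_NS); // paces only AFTER the response: back-off
            }
        }

        // BETTER: measures from the intended send time on a fixed schedule,
        // so a stall is charged to every request that should have run during it.
        static void corrected(Runnable issueRequest, long[] samples) {
            long intended = System.nanoTime();
            for (int i = 0; i < samples.length; i++) {
                long wait = intended - System.nanoTime();
                if (wait > 0) LockSupport.parkNanos(wait);
                issueRequest.run();
                samples[i] = System.nanoTime() - intended;
                intended += INTERVAL_NS;
            }
        }
    }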
  29. Is coordinated omission rare? It is MUCH more common than you may think...
  30. JMeter makes this mistake... and so do other tools! (Charts: before correction vs. after correcting for omission.)
  31. Real world coordinated omission effects: before correction vs. after correction, wrong by 7x.
  32. Real world coordinated omission effects: uncorrected data.
  33. Real world coordinated omission effects: uncorrected data vs. data corrected for coordinated omission.
  34. Why I care: real world coordinated omission effects, Zing vs. the "other" JVM. A ~2500x difference in reported percentile levels for the problem that Zing eliminates.
  35. How "real" people react.
  36. Suggestions. Whatever your measurement technique is, test it: run your measurement method against artificial systems that create hypothetical pause scenarios, and see if your reported results agree with how you would describe that system's behavior. Don't waste time analyzing until you establish sanity. Don't ever use or derive from standard deviation. Always measure max time, and consider what it means. Be suspicious. Measure %'iles. Lots of them.
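One way to act on the first suggestion (an illustrative sketch, not from the talk): point the measurement tool at a fake system whose worst case you already know, then check the report against that known truth.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Fake "system under test": instant responses except for one scripted
    // 10-second stall. A sane measurement tool pointed at this must report
    // a ~10-second max and a visibly fat tail; if it doesn't, fix the tool
    // before analyzing anything else.
    public class KnownPauseStub implements Runnable {
        private final AtomicBoolean stallPending = new AtomicBoolean(true);
        private final long startMs = System.currentTimeMillis();

        @Override public void run() {
            if (System.currentTimeMillis() - startMs > 30_000      // after 30s of load...
                    && stallPending.compareAndSet(true, false)) {
                try {
                    Thread.sleep(10_000);                          // ...the one known pause
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }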
  37. HdrHistogram.
  38. HdrHistogram: if you want to be able to produce charts like this, then you need both good dynamic range and good resolution.
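A minimal HdrHistogram recording sketch (the library behind hdrhistogram.com; the calls are from its published Java API, but treat this as a sketch rather than the talk's own code):

    import org.HdrHistogram.Histogram;

    public class LatencyRecording {
        public static void main(String[] args) {
            // Track values from 1 ns up to 1 hour with 3 significant decimal
            // digits of resolution: wide dynamic range AND fine resolution.
            Histogram histogram = new Histogram(3_600_000_000_000L, 3);

            long latencyNanos = 742_000; // one measured operation (made-up value)
            histogram.recordValue(latencyNanos);

            // When the load runs at a known fixed rate, this variant back-fills
            // the samples that coordinated omission would otherwise drop:
            histogram.recordValueWithExpectedInterval(latencyNanos, 1_000_000 /* 1 ms */);

            System.out.println("99.9%'ile (ns): " + histogram.getValueAtPercentile(99.9));
            System.out.println("max (ns): " + histogram.getMaxValue());

            // Full percentile distribution, with values scaled to microseconds:
            histogram.outputPercentileDistribution(System.out, 1000.0);
        }
    }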
  39. (chart only)
  40. (chart only)
  41. Shape of constant latency: 10K fixed line latency.
  42. Shape of Gaussian latency: 10K fixed line latency with added Gaussian noise (std. dev. = 5K).
  43. Shape of random latency: 10K fixed line latency with added Gaussian noise (std. dev. = 5K) vs. random noise (+5K).
  44. Shape of stalling latency: 10K fixed base, stall magnitude of 50K, stall likelihood = 0.00005 (interval = 100).
  45. Shape of queuing latency: 10K fixed base, occasional bursts of 500 msgs, handling time = 100, burst likelihood = 0.00005.
  46. Shape of multi-modal latency: 10K mode0; 70K mode1 (likelihood 0.01); 180K mode2 (likelihood 0.00001).
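Charts with these shapes can be synthesized from tiny latency generators; for example, the stalling case above might look like this (a sketch using the slide's parameters; class and method names are illustrative):

    import java.util.Random;

    public class StallingLatencySource {
        private static final Random rnd = new Random();

        // "10K fixed base, stall magnitude of 50K, stall likelihood = 0.00005"
        static long nextLatency() {
            long base = 10_000;
            if (rnd.nextDouble() < 0.00005) {
                return base + 50_000; // the rare stall riding on the fixed base
            }
            return base;
        }
    }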
  47. And this is what the real(?) world sometimes looks like…
  48. Real world "deductive reasoning".
  49. http://www.jhiccup.org
  50. jHiccup.
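jHiccup runs beside your application and records the platform's own pauses ("hiccups") as HdrHistogram logs. From memory of the project's README, the agent form of the invocation is roughly the following; treat the exact syntax as an assumption and check jhiccup.org:

    # Attach jHiccup as a Java agent to an existing application (hedged example):
    java -javaagent:jHiccup.jar -jar MyApplication.jar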
  51. Discontinuity in Java execution.
  52. Examples.
  53. 1GB live set in an 8GB heap; same app, same HotSpot, different GC: Oracle HotSpot ParallelGC vs. Oracle HotSpot G1.
  54. 1GB live set in an 8GB heap; same app, different JVM/GC: Oracle HotSpot CMS vs. Zing pauseless GC.
  55. 1GB live set in an 8GB heap; same app, different JVM/GC, drawn to scale: Oracle HotSpot CMS vs. Zing pauseless GC.
  56. Low latency trading application: Oracle HotSpot (pure NewGen) vs. Zing pauseless GC.
  57. Low latency trading application: Oracle HotSpot (pure NewGen) vs. Zing pauseless GC.
  58. Low latency trading application, drawn to scale: Oracle HotSpot (pure NewGen) vs. Zing pauseless GC.
  59. Q & A: www.azul.com, www.jhiccup.com, www.hdrhistogram.com, @schuetzematt
