Understanding Application
Hiccups
and what you can do about them
An introduction to the Open Source jHiccup tool
Gil Tene,...
About me: Gil Tene
co-founder, CTO
@Azul Systems
Have been working on
“think different” GC
approaches since 2002
Created P...
About Azul
Vega

We make scalable Virtual
Machines
Have built “whatever it takes
to get job done” since 2002
3 generations...
A classic look at response
time behavior
Key Assumption: Response time is a function of load
Average?
Max?
Median?
90%?
99...
Common fallacies
Computers run application code continuously
CPUs stop processing application code for all sorts of reason...
Interpreting Results

Response time over time

From this graph it is possible to identify peaks in the response time over ...
Application Hiccups
Where do they come from?
Usually a factor of outside of the individual transaction work
E.g: Queueing,...
What happened here?
Hiccups&by&Time&Interval&

“Hiccups”

Max"per"Interval"

99%"

99.90%"

99.99%"

Max"

Hiccup&Dura*on&...
Pitfall: Calulating %’iles “naïvely”
Common Example:
build/buy simple load tester to measure throughput
issue requests one...
Example of naïve %’ile
System easily handles
100 requests/sec
Responds to each
in 1msec

System Stalled
for 100 Sec

How w...
Proper measurement
System easily handles
100 requests/sec

System Stalled
for 100 Sec

Responds to each
in 1msec

10,000 r...
jHiccup
A tool for capturing and displaying platform hiccups
Records any observed non-continuity of the underlying platfor...
what jHiccup measures
jHiccup measures the platform, not the application
Experiences whatever delays application threads w...
The anatomy of Hiccup Charts

©2011 Azul Systems, Inc.
Telco App Example
Max Time per

Hiccups&by&Time&Interval&

interval

Max"per"Interval"

99%"

99.90%"

99.99%"

Max"

Hicc...
Hiccups&by&Time&Interval&
Max"per"Interval"

Hiccup&Dura*on&(msec)&

1200"

99%"

99.90%"

99.99%"

Max"

1000"
800"
600"
...
How jHiccup works
Background thread, measures time to do “nothing”
Sleeps for 1 milliseconds, allocates 1 object, records ...
Some fun with jHiccup
Examples

©2011 Azul Systems, Inc.
Idle App on Quiet System

Idle App on Busy System

Hiccups&by&Time&Interval&
Max"per"Interval"

99%"

99.90%"

Hiccups&by&...
Idle App on Quiet System

Idle App on Dedicated System

Hiccups&by&Time&Interval&
Max"per"Interval"

99%"

99.90%"

Hiccup...
A 1GB Java Cache under load
Hiccups&by&Time&Interval&
Max"per"Interval"

99%"

99.90%"

99.99%"

Max"

Hiccup&Dura*on&(mse...
Telco App
Hiccups&by&Time&Interval&
Max"per"Interval"

99%"

99.90%"

99.99%"

Max"

Hiccup&Dura*on&(msec)&

1800"
1600"
1...
Fun with jHiccup
Obviously, we can use it for
comparison purposes

©2011 Azul Systems, Inc.
Zing 5, 1GB in an 8GB heap

Oracle HotSpot CMS, 1GB in an 8GB heap

Hiccups&by&Time&Interval&
Max"per"Interval"

99%"

99....
Zing 5, 1GB in an 8GB heap

Oracle HotSpot CMS, 1GB in an 8GB heap

Hiccups&by&Time&Interval&
Max"per"Interval"

99%"

99....
Oracle HotSpot CMS, 4GB in a 18GB heap

Oracle HotSpot CMS, 1GB in an 8GB heap

Hiccups&by&Time&Interval&
Max"per"Interval...
Q&A
http:/
/www.azulsystems.com
http:/
/www.azulsystems.com/dev_resources/jhiccup

©2011 Azul Systems, Inc.	

	

	

	

	

...
Upcoming SlideShare
Loading in …5
×

Understanding Application Hiccups - and What You Can Do About Them

897 views

Published on

In this presentation, Gil Tene (CTO, Azul Systems) will introduce simple, non-obstrusive methods for measuring and characterizing platform "hiccups" during application execution. Using the new jHiccup open source tool, Gil will demonstrate and chart commonly observed behaviors of idle, mostly idle, and busy systems, as well as common workload types that experience outliers due to garbage collection pauses and other runtime-induced delays. After demonstrating how simple, non-obtrusive measurement can establish a clear "best case" baseline for any expected application responsiveness, Gil will discuss the important things to look for in such measurements, as well as the common pitfalls experienced early in characterization attempts.

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
897
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
11
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Understanding Application Hiccups - and What You Can Do About Them

  1. 1. Understanding Application Hiccups and what you can do about them An introduction to the Open Source jHiccup tool Gil Tene, CTO & co-Founder, Azul Systems ©2011 Azul Systems, Inc.
  2. 2. About me: Gil Tene co-founder, CTO @Azul Systems Have been working on “think different” GC approaches since 2002 Created Pauseless & C4 core GC algorithms (Tene, Wolf) A Long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc... ©2011 Azul Systems, Inc. * working on real-world trash compaction issues, circa 2004
  3. 3. About Azul Vega We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega) Now Pure software for commodity x86 (Zing) C4 “Industry firsts” in Garbage collection, elastic memory, Java virtualization, memory scale ©2011 Azul Systems, Inc.
  4. 4. A classic look at response time behavior Key Assumption: Response time is a function of load Average? Max? Median? 90%? 99.9% source: IBM CICS server docuementation, “understanding response times” ©2011 Azul Systems, Inc.
  5. 5. Common fallacies Computers run application code continuously CPUs stop processing application code for all sorts of reasons e.g: Interrupts. Scheduling of other work, swapping, etc. Modern system architectures add more: Power management, Virtualization (cross-image context switching, physical VM motion), Garbage Collection, etc. Response time can be measured as work units/time. Response time exhibits a normal distribution Leading to attempts to represent with average + std. deviation ©2011 Azul Systems, Inc.
  6. 6. Interpreting Results Response time over time From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown. When we measure behavior over time, we often see: “Hiccups” source: ZOHO QEngine White Paper: performance testing report analysis ©2011 Azul Systems, Inc.
  7. 7. Application Hiccups Where do they come from? Usually a factor of outside of the individual transaction work E.g: Queueing, accumulated work, platform inconsistency Do they matter? That depends. What are your end-user’s expectations? Hiccups often dominate response time behavior How can/should we measure them? Average? Max? 99.9%? Mean with Std. deviation? Hiccup magnitude is often not a function of load ©2011 Azul Systems, Inc.
  8. 8. What happened here? Hiccups&by&Time&Interval& “Hiccups” Max"per"Interval" 99%" 99.90%" 99.99%" Max" Hiccup&Dura*on&(msec)& 9000" 8000" 7000" 6000" 5000" 4000" 3000" 2000" 1000" 0" 0" 20" 40" 60" 80" 100" 120" 140" 160" 180" &Elapsed&Time&(sec)& Source: Gil running an idle program and suspending it five times in the middle ©2011 Azul Systems, Inc.
  9. 9. Pitfall: Calulating %’iles “naïvely” Common Example: build/buy simple load tester to measure throughput issue requests one by one at a certain rate measure and log response time for each request results log used to produce histograms, percentiles, etc. So what’s wrong with that? works well only when all responses fit within in rate interval technique includes “automatic backoff” and coordination But requirements interested in random, uncoordinated requests Bad how bad can this get, really? ©2011 Azul Systems, Inc.
  10. 10. Example of naïve %’ile System easily handles 100 requests/sec Responds to each in 1msec System Stalled for 100 Sec How would you characterize this system? 10,000 @ 1msec Naïve results: 1 @ 100 second 1 msec Elapsed Time Naïve characterization: 99.99% below 1 sec !!!
  11. 11. Proper measurement System easily handles 100 requests/sec System Stalled for 100 Sec Responds to each in 1msec 10,000 results @ 1 msec each 10,000 results Varying linearly from 100 sec to 10 msec Elapsed Time Proper characterization: 50% below 1 second
  12. 12. jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple, consistent format Simple, non-intrusive As simple as adding the word “jHiccup” to your java launch line % jHiccup java myflags myApp Adds a background thread that samples time @ 1000/sec Open Source released to the public domain, creative commons CC0 ©2011 Azul Systems, Inc.
  13. 13. what jHiccup measures jHiccup measures the platform, not the application Experiences whatever delays application threads would see Highlights inconsistency in platform execution of “nothing” No attempt to identify cause (e.g. scheduling, GC, swapping, etc.) An application cannot behave better than it’s platform If the platform stalls, the application stalled as well. jHiccup provides a lower bound for “best application behavior” Useful control measurements Comparing jHiccup results for an idle application running on the same platform, at the same time, narrows down behavior to process-specific artifacts (-c option) ©2011 Azul Systems, Inc.
  14. 14. The anatomy of Hiccup Charts ©2011 Azul Systems, Inc.
  15. 15. Telco App Example Max Time per Hiccups&by&Time&Interval& interval Max"per"Interval" 99%" 99.90%" 99.99%" Max" Hiccup&Dura*on&(msec)& 140" 120" 100" 80" 60" 40" 20" 0" 0" 500" 1000" 1500" 2000" 2500" &Elapsed&Time&(sec)& Optional SLA Hiccups&by&Percen*le&Distribu*on& plotting Hiccup Hiccup&Dura*on&(msec)& 140" 120" duration at 100" percentile 80" 60" levels 40" 20" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& Hiccups"by"Percen?le" ©2011 Azul Systems, Inc. 99.99%" SLA" 99.999%" 99.9999%"
  16. 16. Hiccups&by&Time&Interval& Max"per"Interval" Hiccup&Dura*on&(msec)& 1200" 99%" 99.90%" 99.99%" Max" 1000" 800" 600" 400" 200" 0" 0" 200" 400" 600" 800" 1000" 1200" 1400" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccup&Dura*on&(msec)& 1200" 1000" 600" 400" 200" 0" ©2011 Azul Systems, Inc. Max=967.68& 800" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%"
  17. 17. How jHiccup works Background thread, measures time to do “nothing” Sleeps for 1 milliseconds, allocates 1 object, records time gap Collects detailed “floating point” histogram of observed times Constant, small memory footprint, virtually no cpu load No effect on application threads Outputs two files A time-based log with 1 line per interval (default 5 sec interval). A percentile distribution log of accumulated histogram data Spreadsheet imports files and plots data A standard, simple chart format ©2011 Azul Systems, Inc.
  18. 18. Some fun with jHiccup
  19. 19. Examples ©2011 Azul Systems, Inc.
  20. 20. Idle App on Quiet System Idle App on Busy System Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 20" 15" 10" 5" 0" Max" 50" 40" 30" 20" 10" 100" 200" 300" 400" 500" 600" 700" 800" 900" 0" 100" 200" &Elapsed&Time&(sec)& 300" 400" 500" 600" 700" 800" 900" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccups&by&Percen*le&Distribu*on& 25" 60" Max=22.336& 20" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 99.99%" 0" 0" 15" 10" 5" 0" 99.90%" 60" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 25" 99%" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%" 50" Max=49.728& 40" 30" 20" 10" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%"
  21. 21. Idle App on Quiet System Idle App on Dedicated System Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 20" 15" 10" 5" 0" 99.99%" Max" 0.4" 0.35" 0.3" 0.25" 0.2" 0.15" 0.1" 0.05" 0" 0" 100" 200" 300" 400" 500" 600" 700" 800" 900" 0" 100" 200" &Elapsed&Time&(sec)& 300" 400" 500" 600" 700" 800" 900" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccups&by&Percen*le&Distribu*on& 25" 0.45" Max=22.336& 20" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 99.90%" 0.45" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 25" 99%" 15" 10" 5" 0.4" Max=0.411& 0.35" 0.3" 0.25" 0.2" 0.15" 0.1" 0.05" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%"
  22. 22. A 1GB Java Cache under load Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" Hiccup&Dura*on&(msec)& 4000" 3500" 3000" 2500" 2000" 1500" 1000" 500" 0" 0" 500" 1000" 1500" 2000" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccup&Dura*on&(msec)& 4000" 3500" Max=3448.832& 3000" 2500" 2000" 1500" 1000" 500" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%" 99.9999%"
  23. 23. Telco App Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" Hiccup&Dura*on&(msec)& 1800" 1600" 1400" 1200" 1000" 800" 600" 400" 200" 0" 0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccup&Dura*on&(msec)& 1800" 1600" Max=1665.024& 1400" 1200" 1000" 800" 600" 400" 200" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%"
  24. 24. Fun with jHiccup
  25. 25. Obviously, we can use it for comparison purposes ©2011 Azul Systems, Inc.
  26. 26. Zing 5, 1GB in an 8GB heap Oracle HotSpot CMS, 1GB in an 8GB heap Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 20" 15" 10" 5" 0" Max" 12000" 10000" 8000" 6000" 4000" 2000" 500" 1000" 1500" 2000" 2500" 3000" 3500" 0" 500" 1000" &Elapsed&Time&(sec)& 1500" 2000" 2500" 3000" 3500" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccups&by&Percen*le&Distribu*on& 14000" 20" Hiccup&Dura*on&(msec)& 25" Hiccup&Dura*on&(msec)& 99.99%" 0" 0" Max=20.384& 15" 10" 5" 0" 99.90%" 14000" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 25" 99%" 0%" 90%" 99%" & 99.9%" 99.99%" & Percen*le& 99.999%" 99.9999%" Max=13156.352& 12000" 10000" 8000" 6000" 4000" 2000" 0" 0%" 90%" 99%" &99.9%" & Percen*le& 99.99%" 99.999%"
  27. 27. Zing 5, 1GB in an 8GB heap Oracle HotSpot CMS, 1GB in an 8GB heap Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 12000" 99.99%" Max" 12000" 10000" 8000" 6000" 4000" 2000" 0" 10000" 8000" 6000" 4000" 2000" 0" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" 0" 500" 1000" &Elapsed&Time&(sec)& 1500" 2000" 2500" 3000" 3500" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccups&by&Percen*le&Distribu*on& 14000" 12000" 12000" Hiccup&Dura*on&(msec)& 14000" Hiccup&Dura*on&(msec)& 99.90%" 14000" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 14000" 99%" 10000" 8000" 6000" 4000" 2000" 0" 0%" 90%" Max=20.384& 99%" & 99.9%" 99.99%" & Percen*le& 99.999%" 99.9999%" Max=13156.352& 10000" 8000" 6000" 4000" 2000" 0" 0%" 90%" 99%" &99.9%" & Percen*le& 99.99%" 99.999%"
  28. 28. Oracle HotSpot CMS, 4GB in a 18GB heap Oracle HotSpot CMS, 1GB in an 8GB heap Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 40000" 99.99%" Max" 12000" 35000" 30000" 25000" 20000" 15000" 10000" 5000" 0" 10000" 8000" 6000" 4000" 2000" 0" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" 0" 500" 1000" &Elapsed&Time&(sec)& 2000" 2500" 3000" 3500" Hiccups&by&Percen*le&Distribu*on& 45000" 14000" 40000" Hiccup&Dura*on&(msec)& Max=40173.568& 35000" 30000" 25000" 20000" 15000" 10000" 5000" 0" 1500" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccup&Dura*on&(msec)& 99.90%" 14000" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 45000" 99%" 0%" 90%" & 99.9%" 99%" & Percen*le& 99.99%" 99.999%" Max=13156.352& 12000" 10000" 8000" 6000" 4000" 2000" 0" 0%" 90%" 99%" &99.9%" & Percen*le& 99.99%" 99.999%"
  29. 29. Q&A http:/ /www.azulsystems.com http:/ /www.azulsystems.com/dev_resources/jhiccup ©2011 Azul Systems, Inc.

×