• Share
  • Email
  • Embed
  • Like
  • Private Content
Understanding Application Hiccups - and What You Can Do About Them
 

Understanding Application Hiccups - and What You Can Do About Them

on

  • 413 views

In this presentation, Gil Tene (CTO, Azul Systems) will introduce simple, non-obstrusive methods for measuring and characterizing platform "hiccups" during application execution. Using the new jHiccup ...

In this presentation, Gil Tene (CTO, Azul Systems) will introduce simple, non-obstrusive methods for measuring and characterizing platform "hiccups" during application execution. Using the new jHiccup open source tool, Gil will demonstrate and chart commonly observed behaviors of idle, mostly idle, and busy systems, as well as common workload types that experience outliers due to garbage collection pauses and other runtime-induced delays. After demonstrating how simple, non-obtrusive measurement can establish a clear "best case" baseline for any expected application responsiveness, Gil will discuss the important things to look for in such measurements, as well as the common pitfalls experienced early in characterization attempts.

Statistics

Views

Total Views
413
Views on SlideShare
413
Embed Views
0

Actions

Likes
1
Downloads
6
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Understanding Application Hiccups - and What You Can Do About Them Understanding Application Hiccups - and What You Can Do About Them Presentation Transcript

    • Understanding Application Hiccups and what you can do about them An introduction to the Open Source jHiccup tool Gil Tene, CTO & co-Founder, Azul Systems ©2011 Azul Systems, Inc.
    • About me: Gil Tene co-founder, CTO @Azul Systems Have been working on “think different” GC approaches since 2002 Created Pauseless & C4 core GC algorithms (Tene, Wolf) A Long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc... ©2011 Azul Systems, Inc. * working on real-world trash compaction issues, circa 2004
    • About Azul Vega We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega) Now Pure software for commodity x86 (Zing) C4 “Industry firsts” in Garbage collection, elastic memory, Java virtualization, memory scale ©2011 Azul Systems, Inc.
    • A classic look at response time behavior Key Assumption: Response time is a function of load Average? Max? Median? 90%? 99.9% source: IBM CICS server docuementation, “understanding response times” ©2011 Azul Systems, Inc.
    • Common fallacies Computers run application code continuously CPUs stop processing application code for all sorts of reasons e.g: Interrupts. Scheduling of other work, swapping, etc. Modern system architectures add more: Power management, Virtualization (cross-image context switching, physical VM motion), Garbage Collection, etc. Response time can be measured as work units/time. Response time exhibits a normal distribution Leading to attempts to represent with average + std. deviation ©2011 Azul Systems, Inc.
    • Interpreting Results Response time over time From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown. When we measure behavior over time, we often see: “Hiccups” source: ZOHO QEngine White Paper: performance testing report analysis ©2011 Azul Systems, Inc.
    • Application Hiccups Where do they come from? Usually a factor of outside of the individual transaction work E.g: Queueing, accumulated work, platform inconsistency Do they matter? That depends. What are your end-user’s expectations? Hiccups often dominate response time behavior How can/should we measure them? Average? Max? 99.9%? Mean with Std. deviation? Hiccup magnitude is often not a function of load ©2011 Azul Systems, Inc.
    • What happened here? Hiccups&by&Time&Interval& “Hiccups” Max"per"Interval" 99%" 99.90%" 99.99%" Max" Hiccup&Dura*on&(msec)& 9000" 8000" 7000" 6000" 5000" 4000" 3000" 2000" 1000" 0" 0" 20" 40" 60" 80" 100" 120" 140" 160" 180" &Elapsed&Time&(sec)& Source: Gil running an idle program and suspending it five times in the middle ©2011 Azul Systems, Inc.
    • Pitfall: Calulating %’iles “naïvely” Common Example: build/buy simple load tester to measure throughput issue requests one by one at a certain rate measure and log response time for each request results log used to produce histograms, percentiles, etc. So what’s wrong with that? works well only when all responses fit within in rate interval technique includes “automatic backoff” and coordination But requirements interested in random, uncoordinated requests Bad how bad can this get, really? ©2011 Azul Systems, Inc.
    • Example of naïve %’ile System easily handles 100 requests/sec Responds to each in 1msec System Stalled for 100 Sec How would you characterize this system? 10,000 @ 1msec Naïve results: 1 @ 100 second 1 msec Elapsed Time Naïve characterization: 99.99% below 1 sec !!!
    • Proper measurement System easily handles 100 requests/sec System Stalled for 100 Sec Responds to each in 1msec 10,000 results @ 1 msec each 10,000 results Varying linearly from 100 sec to 10 msec Elapsed Time Proper characterization: 50% below 1 second
    • jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple, consistent format Simple, non-intrusive As simple as adding the word “jHiccup” to your java launch line % jHiccup java myflags myApp Adds a background thread that samples time @ 1000/sec Open Source released to the public domain, creative commons CC0 ©2011 Azul Systems, Inc.
    • what jHiccup measures jHiccup measures the platform, not the application Experiences whatever delays application threads would see Highlights inconsistency in platform execution of “nothing” No attempt to identify cause (e.g. scheduling, GC, swapping, etc.) An application cannot behave better than it’s platform If the platform stalls, the application stalled as well. jHiccup provides a lower bound for “best application behavior” Useful control measurements Comparing jHiccup results for an idle application running on the same platform, at the same time, narrows down behavior to process-specific artifacts (-c option) ©2011 Azul Systems, Inc.
    • The anatomy of Hiccup Charts ©2011 Azul Systems, Inc.
    • Telco App Example Max Time per Hiccups&by&Time&Interval& interval Max"per"Interval" 99%" 99.90%" 99.99%" Max" Hiccup&Dura*on&(msec)& 140" 120" 100" 80" 60" 40" 20" 0" 0" 500" 1000" 1500" 2000" 2500" &Elapsed&Time&(sec)& Optional SLA Hiccups&by&Percen*le&Distribu*on& plotting Hiccup Hiccup&Dura*on&(msec)& 140" 120" duration at 100" percentile 80" 60" levels 40" 20" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& Hiccups"by"Percen?le" ©2011 Azul Systems, Inc. 99.99%" SLA" 99.999%" 99.9999%"
    • Hiccups&by&Time&Interval& Max"per"Interval" Hiccup&Dura*on&(msec)& 1200" 99%" 99.90%" 99.99%" Max" 1000" 800" 600" 400" 200" 0" 0" 200" 400" 600" 800" 1000" 1200" 1400" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccup&Dura*on&(msec)& 1200" 1000" 600" 400" 200" 0" ©2011 Azul Systems, Inc. Max=967.68& 800" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%"
    • How jHiccup works Background thread, measures time to do “nothing” Sleeps for 1 milliseconds, allocates 1 object, records time gap Collects detailed “floating point” histogram of observed times Constant, small memory footprint, virtually no cpu load No effect on application threads Outputs two files A time-based log with 1 line per interval (default 5 sec interval). A percentile distribution log of accumulated histogram data Spreadsheet imports files and plots data A standard, simple chart format ©2011 Azul Systems, Inc.
    • Some fun with jHiccup
    • Examples ©2011 Azul Systems, Inc.
    • Idle App on Quiet System Idle App on Busy System Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 20" 15" 10" 5" 0" Max" 50" 40" 30" 20" 10" 100" 200" 300" 400" 500" 600" 700" 800" 900" 0" 100" 200" &Elapsed&Time&(sec)& 300" 400" 500" 600" 700" 800" 900" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccups&by&Percen*le&Distribu*on& 25" 60" Max=22.336& 20" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 99.99%" 0" 0" 15" 10" 5" 0" 99.90%" 60" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 25" 99%" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%" 50" Max=49.728& 40" 30" 20" 10" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%"
    • Idle App on Quiet System Idle App on Dedicated System Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 20" 15" 10" 5" 0" 99.99%" Max" 0.4" 0.35" 0.3" 0.25" 0.2" 0.15" 0.1" 0.05" 0" 0" 100" 200" 300" 400" 500" 600" 700" 800" 900" 0" 100" 200" &Elapsed&Time&(sec)& 300" 400" 500" 600" 700" 800" 900" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccups&by&Percen*le&Distribu*on& 25" 0.45" Max=22.336& 20" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 99.90%" 0.45" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 25" 99%" 15" 10" 5" 0.4" Max=0.411& 0.35" 0.3" 0.25" 0.2" 0.15" 0.1" 0.05" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%"
    • A 1GB Java Cache under load Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" Hiccup&Dura*on&(msec)& 4000" 3500" 3000" 2500" 2000" 1500" 1000" 500" 0" 0" 500" 1000" 1500" 2000" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccup&Dura*on&(msec)& 4000" 3500" Max=3448.832& 3000" 2500" 2000" 1500" 1000" 500" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%" 99.9999%"
    • Telco App Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" Hiccup&Dura*on&(msec)& 1800" 1600" 1400" 1200" 1000" 800" 600" 400" 200" 0" 0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccup&Dura*on&(msec)& 1800" 1600" Max=1665.024& 1400" 1200" 1000" 800" 600" 400" 200" 0" 0%" 90%" 99%" & 99.9%" & Percen*le& 99.99%" 99.999%"
    • Fun with jHiccup
    • Obviously, we can use it for comparison purposes ©2011 Azul Systems, Inc.
    • Zing 5, 1GB in an 8GB heap Oracle HotSpot CMS, 1GB in an 8GB heap Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 20" 15" 10" 5" 0" Max" 12000" 10000" 8000" 6000" 4000" 2000" 500" 1000" 1500" 2000" 2500" 3000" 3500" 0" 500" 1000" &Elapsed&Time&(sec)& 1500" 2000" 2500" 3000" 3500" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccups&by&Percen*le&Distribu*on& 14000" 20" Hiccup&Dura*on&(msec)& 25" Hiccup&Dura*on&(msec)& 99.99%" 0" 0" Max=20.384& 15" 10" 5" 0" 99.90%" 14000" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 25" 99%" 0%" 90%" 99%" & 99.9%" 99.99%" & Percen*le& 99.999%" 99.9999%" Max=13156.352& 12000" 10000" 8000" 6000" 4000" 2000" 0" 0%" 90%" 99%" &99.9%" & Percen*le& 99.99%" 99.999%"
    • Zing 5, 1GB in an 8GB heap Oracle HotSpot CMS, 1GB in an 8GB heap Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 12000" 99.99%" Max" 12000" 10000" 8000" 6000" 4000" 2000" 0" 10000" 8000" 6000" 4000" 2000" 0" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" 0" 500" 1000" &Elapsed&Time&(sec)& 1500" 2000" 2500" 3000" 3500" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccups&by&Percen*le&Distribu*on& 14000" 12000" 12000" Hiccup&Dura*on&(msec)& 14000" Hiccup&Dura*on&(msec)& 99.90%" 14000" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 14000" 99%" 10000" 8000" 6000" 4000" 2000" 0" 0%" 90%" Max=20.384& 99%" & 99.9%" 99.99%" & Percen*le& 99.999%" 99.9999%" Max=13156.352& 10000" 8000" 6000" 4000" 2000" 0" 0%" 90%" 99%" &99.9%" & Percen*le& 99.99%" 99.999%"
    • Oracle HotSpot CMS, 4GB in a 18GB heap Oracle HotSpot CMS, 1GB in an 8GB heap Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" Hiccups&by&Time&Interval& 99.99%" Max" Max"per"Interval" 40000" 99.99%" Max" 12000" 35000" 30000" 25000" 20000" 15000" 10000" 5000" 0" 10000" 8000" 6000" 4000" 2000" 0" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" 0" 500" 1000" &Elapsed&Time&(sec)& 2000" 2500" 3000" 3500" Hiccups&by&Percen*le&Distribu*on& 45000" 14000" 40000" Hiccup&Dura*on&(msec)& Max=40173.568& 35000" 30000" 25000" 20000" 15000" 10000" 5000" 0" 1500" &Elapsed&Time&(sec)& Hiccups&by&Percen*le&Distribu*on& Hiccup&Dura*on&(msec)& 99.90%" 14000" Hiccup&Dura*on&(msec)& Hiccup&Dura*on&(msec)& 45000" 99%" 0%" 90%" & 99.9%" 99%" & Percen*le& 99.99%" 99.999%" Max=13156.352& 12000" 10000" 8000" 6000" 4000" 2000" 0" 0%" 90%" 99%" &99.9%" & Percen*le& 99.99%" 99.999%"
    • Q&A http:/ /www.azulsystems.com http:/ /www.azulsystems.com/dev_resources/jhiccup ©2011 Azul Systems, Inc.