This document discusses benchmarking Java applications. It covers challenges such as warmup time, garbage collection, and the Java time APIs. It also discusses designing experiments by clearly stating questions and formulating hypotheses. Statistical methods for benchmarking are presented, including averages, standard deviation, confidence intervals, and hypothesis testing.
Benchmarking Java Applications
Micro vs Macro Benchmarks

1. Micro: Repeatable measurement of a specific section of code.
   Macro: Repeatable measurement of whole-application (or partial) performance from the user's point of view.
2. Micro: Abstracted from VM warmup, garbage collection, and other side effects.
   Macro: Abstracted from the performance overhead caused by monitoring tools.
Challenges with Benchmarks
Garbage Collection
Use the Serial garbage collector: -XX:+UseSerialGC
Explicitly set the min and max heap sizes to the same value
Explicitly set the young generation size
Invoke System.gc() multiple times prior to the benchmark
Statistical Methods
Confidence Interval of the Difference of Means:

(x̄1 − x̄2) ± t(α/2, n1 + n2 − 2) · s_p · √(1/n1 + 1/n2)

where x̄1 − x̄2 estimates the true difference in averages, α = 1 − confidence level, n1 and n2 are the sizes of sample 1 and sample 2, and s_p is the pooled standard deviation.
Let us now talk about Benchmarking Java Applications
This slide describes the differences between micro and macro benchmarks.
This topic has three sections, namely:
Challenges with benchmarks
Design of Experiments
Use of Statistical methods
Let us first look at the challenges with benchmarks
This slide lists the challenges and issues with benchmarking Java applications.
Let's start with warmup time.
This slide shows an example of code that warms up the code under test prior to measurement.
By default, the HotSpot Server VM executes a block of Java bytecode 10,000 times before the HotSpot Server JIT compiler produces native machine code for that block of bytecode. The HotSpot Client VM begins producing native machine code at 1,500 iterations.
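As an illustration of the warm-up idea described above, here is a minimal sketch (class and method names are my own, not the slide's code) that runs the code under test well past the JIT compilation threshold before timing it:

```java
// Minimal warm-up harness sketch (illustrative names; not the slide's exact code).
// The idea: run the code under test well past the JIT compilation threshold
// before starting the timed measurement.
public class WarmupExample {

    // The operation being benchmarked; trivial placeholder here.
    static long workload(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i * 31L;
        }
        return sum;
    }

    public static void main(String[] args) {
        long sink = 0;

        // Warm-up phase: exceed the Server VM's ~10,000-invocation threshold
        // so the JIT has compiled and optimized workload() before we measure.
        for (int i = 0; i < 20_000; i++) {
            sink += workload(1_000);
        }

        // Measurement phase.
        long start = System.nanoTime();
        for (int i = 0; i < 20_000; i++) {
            sink += workload(1_000);
        }
        long elapsed = System.nanoTime() - start;

        // Use the result so the JIT cannot eliminate the work as dead code.
        System.out.println("elapsed ns = " + elapsed + ", sink = " + sink);
    }
}
```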
It is important to check for compilation activity during the measurement interval when executing a benchmark.
Use this VM option to check for compilation activity during the measurement interval.
This slide shows a small portion of output produced by -XX:+PrintCompilation on a micro-benchmark
A way to ensure the HotSpot JIT compiler has reached a steady state, finished its optimizations, and generated optimized code for a benchmark is to execute a run of the benchmark with the HotSpot VM command line option -XX:+PrintCompilation, along with instrumenting the benchmark to indicate when it has completed the warm-up period.
-XX:+PrintCompilation causes the JVM to print a line for each method as it optimizes or deoptimizes.
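A minimal sketch of that instrumentation idea (names are illustrative, not the slide's code): print a marker when warm-up ends, so any -XX:+PrintCompilation lines appearing after the marker reveal compilation activity inside the measurement interval.

```java
// Run with, for example:
//   java -XX:+PrintCompilation WarmupMarkerExample
// Compilation messages printed after the marker line below indicate that the JIT
// was still compiling during the measurement interval.
public class WarmupMarkerExample {

    static long workload(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int i = 0; i < 20_000; i++) {
            sink += workload(1_000);       // warm-up phase
        }

        // Marker: warm-up complete, measurement starts now.
        System.err.println(">>> warm-up complete, starting measurement");

        long start = System.nanoTime();
        for (int i = 0; i < 20_000; i++) {
            sink += workload(1_000);       // measured phase
        }
        System.out.println("elapsed ns = " + (System.nanoTime() - start) + ", sink = " + sink);
    }
}
```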
Next let us look at Garbage Collection
This slide lists steps to minimize the impact of GC on benchmark results.
Garbage collectors consume CPU cycles and can pause application threads during execution of a benchmark. It is important to tune the garbage collector prior to executing the benchmark. For microbenchmarks, the steps above can reduce the GC impact on measurements, as sketched below.
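Here is a minimal sketch of those steps (heap and young generation sizes are illustrative, not recommendations), combining the command-line flags from the slide with explicit System.gc() calls before the measurement interval:

```java
// Illustrative launch command (flag values are examples only):
//   java -XX:+UseSerialGC -Xms1g -Xmx1g -Xmn512m GcQuietBenchmark
// Fixed min/max heap and young generation sizes keep heap resizing out of the
// measurement; the serial collector keeps GC behavior simple and repeatable.
public class GcQuietBenchmark {

    public static void main(String[] args) {
        // Request full collections before measuring so garbage left over from setup
        // is less likely to trigger a GC inside the measurement interval.
        for (int i = 0; i < 3; i++) {
            System.gc();
        }

        long start = System.nanoTime();
        long sink = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sink += Integer.toString(i).length();   // allocates, exercising the heap
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("elapsed ns = " + elapsed + ", sink = " + sink);
    }
}
```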
Let's now talk about the Java time APIs.
This slide talks about guidelines to follow when using the Java time APIs in benchmarking programs.
The Java millisecond and nanosecond APIs are precise but not necessarily accurate, since the reported values depend on the underlying operating system. For benchmarking purposes it is therefore advisable to measure an interval that is sufficiently large relative to a nanosecond.
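A small sketch of that guideline (iteration count and workload are illustrative): time enough work that the elapsed interval is large compared to the timer's granularity.

```java
import java.util.concurrent.TimeUnit;

// Time a block large enough that timer granularity and OS-dependent accuracy of
// System.nanoTime() are negligible relative to the measured interval.
public class TimerUsageExample {

    public static void main(String[] args) {
        long sink = 0;

        long start = System.nanoTime();
        // Run enough iterations that the elapsed time is many milliseconds,
        // not a handful of nanoseconds.
        for (int i = 0; i < 5_000_000; i++) {
            sink += Long.numberOfTrailingZeros(i);
        }
        long elapsedNanos = System.nanoTime() - start;

        System.out.printf("elapsed = %d ns (%d ms), sink = %d%n",
                elapsedNanos, TimeUnit.NANOSECONDS.toMillis(elapsedNanos), sink);
    }
}
```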
Next we will talk about optimization of dead code.
JVM optimization of unreachable (dead) code can skew benchmark results. Good practices are to keep the computation non-trivial and to store and consume the results of the computation outside the measurement interval.
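A minimal sketch of that practice (names are illustrative): the result is published to a volatile field and printed after the measurement, so the JIT cannot eliminate the loop as dead code.

```java
// Avoiding dead-code elimination: keep the computation non-trivial and publish its
// result to a field that is read outside the measurement interval.
public class DeadCodeGuardExample {

    // Result is stored in a field the JIT cannot prove is unused.
    static volatile double result;

    public static void main(String[] args) {
        long start = System.nanoTime();
        double acc = 0.0;
        for (int i = 1; i <= 2_000_000; i++) {
            acc += Math.sqrt(i);       // non-trivial work
        }
        result = acc;                  // consume the result so the loop cannot be removed
        long elapsed = System.nanoTime() - start;

        // Read the result after the measurement interval.
        System.out.println("elapsed ns = " + elapsed + ", result = " + result);
    }
}
```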
Next we will look at Inlining
For microbenchmarks, the VM can treat inlined methods as dead code if the method's return value is not used. It is advisable to use the following VM options to check the JVM's behavior in such cases.
The slide image displays output of the PrintInlining VM option, showing methods that are inlined by the JVM.
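As a small illustration (class and method names are my own), here is a hot, small method whose inlining can be observed with the diagnostic inlining output:

```java
// Run with, for example:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InliningExample
// (PrintInlining is a diagnostic option and must be unlocked first.)
public class InliningExample {

    // Small method; a good inlining candidate once it becomes hot.
    static int add(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sink += add(i, i + 1);   // the return value is used, so the call is not dead code
        }
        System.out.println("sink = " + sink);
    }
}
```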
Next we will talk about deoptimization
Use this VM option to check for deoptimization of code during execution of a benchmark. Deoptimization during benchmark execution can skew benchmark results, so it is important to track this while running microbenchmarks.
This slide shows that JVM deoptimization during a benchmark run can be detected with the -XX:+PrintCompilation option. (This can happen when an aggressive optimization decision is undone because an earlier assumption, for example about a method argument, turns out to be incorrect.)
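A minimal sketch of a situation that can trigger such a deoptimization (names are illustrative): a call site first sees only one receiver type, allowing aggressive optimization, then a second type appears and the earlier assumption is invalidated.

```java
// Run with -XX:+PrintCompilation and watch for deoptimization messages after the
// second implementation type starts being used.
public class DeoptExample {

    interface Shape { double area(); }

    static class Square implements Shape {
        public double area() { return 4.0; }
    }

    static class Circle implements Shape {
        public double area() { return Math.PI; }
    }

    static double total(Shape s, int n) {
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += s.area();           // may be optimized assuming only Square has been seen
        }
        return sum;
    }

    public static void main(String[] args) {
        double sink = 0;
        sink += total(new Square(), 200_000);   // warm up with a single receiver type
        sink += total(new Circle(), 200_000);   // a new type can undo the earlier assumption
        System.out.println("sink = " + sink);
    }
}
```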
Next we will look at tips to design good experiments
This slide lists the factors to keep in mind when designing experiments to benchmark applications
Next we will talk about design of experiments
This slide talks about the steps to design a good experiment
1. Clearly state the question that the experiment is trying to answer. (Example: Does a 100 MB increase in young generation space result in a 1% improvement on a specific benchmark on a specific hardware configuration?)
2. Formulate a hypothesis. (Example: The improvement from the change is at least 10%.)
3. Use statistical techniques to validate the hypothesis (e.g., confidence interval tests).
Next we will look at use of statistical methods for benchmarks
Compute the average of the metric for both the baseline (before the change) and the specimen (after the change).
The baseline average is the sum of all baseline observations divided by the number of baseline executions.
The specimen average is the sum of all specimen observations divided by the number of specimen executions.
The baseline's or specimen's variability can be evaluated by computing a sample standard deviation (s).
Calculate a confidence interval to estimate the true average of the baseline and specimen observations, using the t-value for a given value of alpha and n − 1 degrees of freedom, as shown in the figure. Here alpha = 1 − the chosen confidence level.
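A minimal Java sketch of these statistics (data values and the t-value are illustrative; the t-value for the chosen alpha and n − 1 degrees of freedom would come from a t-table, since the JDK has no Student's t-distribution):

```java
// Sample mean, sample standard deviation, and a confidence interval for the true mean.
public class SampleStats {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation s (divides by n - 1).
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    // Confidence interval: mean +/- t * s / sqrt(n).
    static double[] confidenceInterval(double[] xs, double tValue) {
        double m = mean(xs);
        double half = tValue * stdDev(xs) / Math.sqrt(xs.length);
        return new double[] { m - half, m + half };
    }

    public static void main(String[] args) {
        double[] baseline = { 101.2, 99.8, 100.5, 102.1, 98.9 };   // illustrative data
        double t = 2.776;   // t-value for alpha = 0.05, 4 degrees of freedom
        double[] ci = confidenceInterval(baseline, t);
        System.out.printf("mean = %.2f, s = %.2f, 95%% CI = [%.2f, %.2f]%n",
                mean(baseline), stdDev(baseline), ci[0], ci[1]);
    }
}
```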
This slide shows the formula to compute a confidence interval for the true difference of sample means.
The value of the pooled standard deviation (s_p) in the previous slide is calculated as s_p = √(((n1 − 1)·s1² + (n2 − 1)·s2²) / (n1 + n2 − 2)), where s1 and s2 are the standard deviations of sample 1 and sample 2.
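A minimal Java sketch of the pooled standard deviation and the confidence interval of the difference of means (data values and the t-value are illustrative; the t-value for alpha and n1 + n2 − 2 degrees of freedom would come from a t-table):

```java
// Pooled standard deviation and confidence interval of the difference of two means.
public class DifferenceOfMeans {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double variance(double[] xs) {
        double m = mean(xs), ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return ss / (xs.length - 1);
    }

    // s_p = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))
    static double pooledStdDev(double[] a, double[] b) {
        int n1 = a.length, n2 = b.length;
        return Math.sqrt(((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2));
    }

    // (mean1 - mean2) +/- t * s_p * sqrt(1/n1 + 1/n2)
    static double[] differenceCI(double[] a, double[] b, double tValue) {
        double diff = mean(a) - mean(b);
        double half = tValue * pooledStdDev(a, b) * Math.sqrt(1.0 / a.length + 1.0 / b.length);
        return new double[] { diff - half, diff + half };
    }

    public static void main(String[] args) {
        double[] baseline = { 101.2, 99.8, 100.5, 102.1, 98.9 };   // illustrative data
        double[] specimen = {  97.4, 98.1,  96.9,  98.8, 97.2 };
        double t = 2.306;   // t-value for alpha = 0.05, 8 degrees of freedom
        double[] ci = differenceCI(baseline, specimen, t);
        System.out.printf("difference CI = [%.2f, %.2f]%n", ci[0], ci[1]);
    }
}
```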
In this approach, a hypothesis, more formally known as a null hypothesis, is formulated based on a problem statement, that is, what you want to know. Then data is collected and a t-statistic is calculated based on the collected observations. The t-statistic is compared to a value obtained from a Student's t-distribution for a given α (alpha) and degrees of freedom. Alpha is the risk level at which you are willing to incorrectly reject the null hypothesis when it is actually true, known in statistical terms as a Type I error.
This slide shows the formula for calculation of the t-value for hypothesis tests.
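A minimal Java sketch of the t-statistic for this two-sample test (data values are illustrative; the critical value for the chosen alpha and n1 + n2 − 2 degrees of freedom would be looked up in a t-table):

```java
// t = (mean1 - mean2) / (s_p * sqrt(1/n1 + 1/n2)), with s_p the pooled standard deviation.
public class TStatistic {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double variance(double[] xs) {
        double m = mean(xs), ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return ss / (xs.length - 1);
    }

    static double tStatistic(double[] a, double[] b) {
        int n1 = a.length, n2 = b.length;
        double pooled = Math.sqrt(((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2));
        return (mean(a) - mean(b)) / (pooled * Math.sqrt(1.0 / n1 + 1.0 / n2));
    }

    public static void main(String[] args) {
        double[] baseline = { 101.2, 99.8, 100.5, 102.1, 98.9 };   // illustrative data
        double[] specimen = {  97.4, 98.1,  96.9,  98.8, 97.2 };
        double t = tStatistic(baseline, specimen);
        // If |t| exceeds the critical value, reject the null hypothesis of equal means.
        System.out.println("t = " + t + "  (compare against critical t for alpha, 8 df)");
    }
}
```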