Introduction to Java Profiling

I N T R O D U C T I O N T O
J AVA P R O F I L I N G
Jerry Yoakum
Expedia Affiliate Network

A G E N D A
• When to profile
• Profiler Sampling
• Profiler Instrumentation
• Where to Start
• Examples
• Micro vs Macro Benchmarking

W H E N T O P R O F I L E
• When a performance issue is unclear.
• To proactively check that an application is performing as expected.
• To turbo-charge an application?

“We should forget about small efficiencies,
say about 97% of the time; premature
optimization is the root of all evil.”
– D O N A L D K N U T H
The point that Knuth is trying to make is that in the end, you should write clean, straightforward code that is simple to read and understand. In this context, “optimizing”
is understood to mean employing algorithmic and design changes that complicate program structure but provide better performance. Those kinds of optimizations indeed
are best left undone until such time as the profiling of a program shows that there is a large benefit from performing them.

if (LOG.isTraceEnabled()) {
    LOG.trace(String.format("X: %s and Y: %s",
            calcX(), calcY()));
}
B E S T P R A C T I C E S A R E N O T
P R E M AT U R E O P T I M I Z AT I O N S
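The guarded logging call above can be sketched as a runnable example. This is only an illustration under assumptions: it swaps in java.util.logging (so FINEST stands in for trace), and calcX() and the expensiveCalls counter are hypothetical names added to make the cost visible.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class GuardedLogging {
    private static final Logger LOG = Logger.getLogger(GuardedLogging.class.getName());

    // Counts how often the (hypothetical) expensive calculation actually runs.
    static int expensiveCalls = 0;

    static int calcX() {
        expensiveCalls++;
        return 42;
    }

    // Unguarded: calcX() runs and the string is built even when FINEST is off.
    static void unguarded() {
        LOG.finest(String.format("X: %s", calcX()));
    }

    // Guarded: calcX() runs only when the level is enabled.
    static void guarded() {
        if (LOG.isLoggable(Level.FINEST)) {
            LOG.finest(String.format("X: %s", calcX()));
        }
    }

    // Supplier variant: the lambda is evaluated only when the level is enabled.
    static void lazy() {
        LOG.finest(() -> String.format("X: %s", calcX()));
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // FINEST (trace-like) logging is disabled
        unguarded();
        System.out.println("calls after unguarded: " + expensiveCalls);
        guarded();
        lazy();
        System.out.println("calls after guarded and lazy: " + expensiveCalls);
    }
}
```

With the level set to INFO, only the unguarded call pays for calcX(); the guarded and supplier-based calls skip it entirely, and the code stays easy to read.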

P R E M AT U R E O P T I M I Z AT I O N S I N C L U D E …
• Manually inlining methods.
• Writing directly in bytecode.
• Allocating public variables and using them as global memory 
throughout an application.
• And anything else that makes the code unduly difficult to 
work with.

T O O L S !
• vmstat
• iostat
“Performance analysis is all about visibility—knowing what is going on inside of an application, and in the application’s environment. Visibility is all about tools. And so
performance tuning is all about tools.”

O V E R L O A D E D
M A C H I N E
• $ vmstat 1
• ‘r’ column is the run queue length
• the number of all threads that are
running or that could run if there were
an available CPU
• if the run queue length is too high for
any significant period of time, it is an
indication that the machine is
overloaded

V M S TAT E X A M P L E F O R A L O W U S A G E S Y S T E M
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 867632 38568 165348 0 0 453 20 236 271 3 5 91 1 0
0 0 0 867632 38568 165348 0 0 0 0 161 247 0 1 99 0 0
0 0 0 867632 38568 165348 0 0 0 0 140 240 0 1 99 0 0
0 0 0 867632 38568 165348 0 0 0 0 152 255 0 1 99 0 0
1 0 0 867632 38568 165348 0 0 0 0 147 240 0 1 99 0 0

V M S TAT E X A M P L E F O R A B U S Y S Y S T E M
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
12 0 82596 130020 130816 524228 0 0 0 0 2696 4644 84 12 4 0 0
12 0 83288 149288 129784 517476 32 692 32 692 3722 4536 85 14 1 0 0
14 0 83288 130248 129784 522520 0 0 0 0 2644 5128 87 13 0 0 0
0 2 83288 142548 129788 521936 64 0 64 40 1653 2748 53 8 20 20 0
13 0 86720 127480 125384 519344 32 3436 32 3436 4421 4671 76 12 6 5 0
17 1 87336 141932 124548 515632 64 616 64 632 3110 4302 87 13 1 0 0
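The run queue check can also be scripted. The sketch below averages the 'r' column from the two vmstat samples above and compares it with a CPU count. The CPU count of 4 and the "average run queue greater than CPU count" rule are assumptions made for illustration, not thresholds given in the slides.

```java
import java.util.Arrays;
import java.util.List;

class RunQueueCheck {
    // Data rows copied from the low-usage vmstat example.
    static final List<String> IDLE = Arrays.asList(
        "r b swpd free buff cache si so bi bo in cs us sy id wa st",
        "0 0 0 867632 38568 165348 0 0 453 20 236 271 3 5 91 1 0",
        "0 0 0 867632 38568 165348 0 0 0 0 161 247 0 1 99 0 0",
        "0 0 0 867632 38568 165348 0 0 0 0 140 240 0 1 99 0 0",
        "0 0 0 867632 38568 165348 0 0 0 0 152 255 0 1 99 0 0",
        "1 0 0 867632 38568 165348 0 0 0 0 147 240 0 1 99 0 0");

    // Data rows copied from the busy vmstat example.
    static final List<String> BUSY = Arrays.asList(
        "r b swpd free buff cache si so bi bo in cs us sy id wa st",
        "12 0 82596 130020 130816 524228 0 0 0 0 2696 4644 84 12 4 0 0",
        "12 0 83288 149288 129784 517476 32 692 32 692 3722 4536 85 14 1 0 0",
        "14 0 83288 130248 129784 522520 0 0 0 0 2644 5128 87 13 0 0 0",
        "0 2 83288 142548 129788 521936 64 0 64 40 1653 2748 53 8 20 20 0",
        "13 0 86720 127480 125384 519344 32 3436 32 3436 4421 4671 76 12 6 5 0",
        "17 1 87336 141932 124548 515632 64 616 64 632 3110 4302 87 13 1 0 0");

    // Averages the first column ("r"), skipping header lines.
    static double averageRunQueue(List<String> rows) {
        return rows.stream()
                .map(String::trim)
                .filter(row -> !row.isEmpty() && Character.isDigit(row.charAt(0)))
                .mapToInt(row -> Integer.parseInt(row.split("\\s+")[0]))
                .average()
                .orElse(0.0);
    }

    // Assumed rule of thumb: more runnable threads than CPUs means overload.
    static boolean looksOverloaded(List<String> rows, int cpus) {
        return averageRunQueue(rows) > cpus;
    }

    public static void main(String[] args) {
        int cpus = 4; // hypothetical CPU count for the sampled machine
        System.out.println("idle overloaded? " + looksOverloaded(IDLE, cpus));
        System.out.println("busy overloaded? " + looksOverloaded(BUSY, cpus));
    }
}
```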

Examine Disk IO with iostat -xm 5
for a non-busy system
avg-cpu: %user %nice %system %iowait %steal %idle
22.84 0.00 1.00 0.01 0.00 76.14
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
vda 0.01 15.67 0.04 4.42 0.00 0.08 36.28 0.01 2.27 0.22 0.10
dm-0 0.00 0.00 0.77 0.56 0.00 0.00 8.00 0.01 4.89 0.36 0.05
dm-1 0.00 0.00 0.05 20.09 0.00 0.08 8.03 0.12 5.73 0.05 0.10

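for a busy system
avg-cpu: %user %nice %system %iowait %steal %idle
86.20 0.00 13.50 0.00 0.10 0.20
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
vda 30.00 2.40 8.20 1.00 0.15 0.01 36.00 0.05 5.78 3.04 2.80
dm-0 0.00 0.00 0.20 3.20 0.00 0.01 8.00 0.05 15.53 4.00 1.36
dm-1 0.00 0.00 38.00 0.00 0.15 0.00 8.00 0.17 4.49 0.38 1.44
Is %idle low?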

for a busy system
avg-cpu: %user %nice %system %iowait %steal %idle
16.20 0.00 83.50 0.00 0.10 0.20
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
vda 30.00 2.40 8.20 1.00 0.15 0.01 36.00 0.05 5.78 3.04 2.80
dm-0 0.00 0.00 0.20 3.20 0.00 0.01 8.00 0.05 15.53 4.00 1.36
dm-1 0.00 0.00 38.00 0.00 0.15 0.00 8.00 0.17 4.49 0.38 1.44
Is %system higher than %user?

for a busy system
avg-cpu: %user %nice %system %iowait %steal %idle
16.20 0.00 83.50 0.00 0.10 0.20
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
vda 30.00 2.40 8.20 1.00 0.15 0.01 36.00 0.05 5.78 3.04 2.80
dm-0 0.00 0.00 0.20 3.20 0.00 0.01 8.00 0.05 35.53 4.00 81.36
dm-1 0.00 0.00 38.00 0.00 0.15 0.00 8.00 0.17 4.49 0.38 1.44
Is a device being used more than others?

for a busy system
avg-cpu: %user %nice %system %iowait %steal %idle
16.20 0.00 83.50 0.00 0.10 0.20
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
vda 30.00 2.40 8.20 1.0 0.15 0.01 36.00 0.05 5.78 3.04 2.80
dm-0 0.00 0.00 0.20 63.2 0.00 0.01 8.00 0.05 35.53 4.00 81.36
dm-1 0.00 0.00 38.00 0.0 0.15 0.00 8.00 0.17 4.49 0.38 1.44
Are the w/s high while the wMB/s is low?

for a busy system
avg-cpu: %user %nice %system %iowait %steal %idle
16.20 0.00 83.50 0.00 0.10 0.20
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
vda 30.00 2.40 8.20 1.0 0.15 0.01 36.00 0.05 5.78 3.04 2.80
dm-0 0.00 0.00 0.20 63.2 0.00 0.01 8.00 0.05 35.53 4.00 81.36
dm-1 0.00 0.00 38.00 0.0 0.15 0.00 8.00 0.17 4.49 0.38 1.44
Is await high for a device?

P R O F I L E R S A M P L I N G
• Sampling-based profilers are the most common kind of profiler.
• Because of their relatively low profile, sampling profilers introduce fewer
measurement artifacts.
• Different sampling profilers behave differently; each may be better for a
particular application.
Sampling profilers probe the program counter at regular intervals using operating system interrupts. Sampling profilers are less accurate but facilitate a near-normal
execution time.
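The idea of sampling can be sketched in plain Java. This toy sampler is an assumption-laden illustration, not how a real profiler works: instead of probing the program counter through operating system interrupts, it calls Thread.getStackTrace() on a fixed interval and counts which method is on top of the stack. TinySampler, busyWork() and the 5 ms interval are hypothetical choices.

```java
import java.util.HashMap;
import java.util.Map;

class TinySampler {
    static volatile double sink; // keeps the busy loop's result observable

    // Records the top stack frame of `target` every `intervalMs` until it ends.
    static Map<String, Integer> sample(Thread target, long intervalMs)
            throws InterruptedException {
        Map<String, Integer> hits = new HashMap<>();
        while (target.isAlive()) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                String method = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(method, 1, Integer::sum);
            }
            Thread.sleep(intervalMs);
        }
        return hits;
    }

    // CPU-bound work that runs for roughly 300 ms.
    static void busyWork() {
        double acc = 0;
        long end = System.nanoTime() + 300_000_000L;
        while (System.nanoTime() < end) {
            acc += Math.sqrt(acc + 1);
        }
        sink = acc;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(TinySampler::busyWork, "worker");
        worker.start();
        Map<String, Integer> hits = sample(worker, 5);
        hits.forEach((method, count) -> System.out.println(count + " samples in " + method));
    }
}
```

The sampled thread runs at close to full speed because the sampler only looks at it every few milliseconds; the price is that short-lived methods may never show up in the counts.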

S A M P L I N G
main()
prog()
s()
con()

S A M P L I N G
S A F E P O I N T S
Sampling profilers in Java can only take the sample of
a thread when the thread is at a safepoint—essentially,
whenever it is allocating memory.

P R O F I L E R I N S T R U M E N TAT I O N
• Instrumented profilers yield more information about an application, but
can possibly have a greater effect on the application than a sampling
profiler.
• Instrumented profilers should be set up to instrument small sections of the
code—a few classes or packages. That limits their impact on the
application’s performance.
An instrumented profiler adds instructions to the code to gather data about what was executed, when, for how long, etc.

I N S T R U M E N TAT I O N I M PA C T
Instrumented code may change the execution profile.
For example, the JVM will inline small methods so that no method invocation is needed when the small-method code is executed. The compiler makes that decision
based on the size of the code; depending on how the code is instrumented, it may no longer be eligible to be inlined. This may cause the instrumented profiler to
overestimate the contribution of certain methods. And inlining is just one example of a decision that the compiler makes based on the layout of the code; in general, the
more the code is instrumented (changed), the more likely it is that its execution profile will change.
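What instrumentation adds can be sketched by hand. The example below is a simplified assumption of what a profiler injects: a timing wrapper around a small method. Real instrumented profilers rewrite bytecode instead; timed(), square() and the counters are hypothetical names.

```java
import java.util.function.LongSupplier;

class ManualInstrumentation {
    static long calls = 0;
    static long totalNanos = 0;

    // The "instrumentation": extra instructions wrapped around the real work.
    static long timed(LongSupplier work) {
        long start = System.nanoTime();
        try {
            return work.getAsLong();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    // A deliberately small method, the kind the JIT would normally inline.
    static long square(long n) {
        return n * n;
    }

    // Sums the squares of 0..n-1, timing every call to square().
    static long run(int n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            final long value = i;
            sum += timed(() -> square(value));
        }
        return sum;
    }

    public static void main(String[] args) {
        long sum = run(1_000_000);
        System.out.println("sum=" + sum + ", calls=" + calls + ", nanos=" + totalNanos);
    }
}
```

Each call to square() is a single multiplication, yet it is charged for two System.nanoTime() calls and the bookkeeping around them, so the recorded time says more about the wrapper than about the method.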

I N S T R U M E N T E D
main()
prog()
s()
con()

I N S T R U M E N T E D
main()
prog()
s()
con()
The thing to notice is that the instrumentation overhead is potentially greater than the cost of con() itself, but because the instrumentation is added to con(), that method
appears to have a greater impact.

P R O F I L E T H E C P U F I R S T
• CPU time is the first thing to examine when looking at performance of an
application.
• The goal in optimizing code is to drive the CPU usage up (for a shorter
period of time), not down.
• Understand why CPU usage is low before diving in and attempting to tune
an application.
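One way to see why CPU usage is low is to compare the CPU time a thread consumed with the wall-clock time that passed. The sketch below uses the standard ThreadMXBean for that; the spin() and sleep() tasks and the 200 ms durations are hypothetical stand-ins for CPU-bound and waiting code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class CpuVersusWall {
    static volatile long sink; // keeps the spin loop's work observable

    // Fraction of wall-clock time the current thread spent on the CPU.
    static double cpuFraction(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported()) {
            return Double.NaN; // this JVM cannot report per-thread CPU time
        }
        long cpuStart = threads.getCurrentThreadCpuTime();
        long wallStart = System.nanoTime();
        task.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = threads.getCurrentThreadCpuTime() - cpuStart;
        return wall == 0 ? 0.0 : (double) cpu / wall;
    }

    // CPU-bound: busy for about 200 ms.
    static void spin() {
        long acc = 0;
        long end = System.nanoTime() + 200_000_000L;
        while (System.nanoTime() < end) {
            acc++;
        }
        sink = acc;
    }

    // Waiting: idle for about 200 ms, as if blocked on I/O or a lock.
    static void sleep() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("spin  cpu/wall: " + cpuFraction(CpuVersusWall::spin));
        System.out.println("sleep cpu/wall: " + cpuFraction(CpuVersusWall::sleep));
    }
}
```

A fraction near 1 means the code is using the CPU it was given; a fraction near 0 means the thread is waiting on something else, and that wait is what needs to be understood before tuning.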

P R O F I L E T H E C P U F I R S T
In the heat of battle, it can be tough to choose your targets. I’m sympathetic to that. You see lots of garbage collections with a big heap, you want to profile the memory
right away! But I’m asking you… no, I’m begging you. For the love of Java. People. Profile the CPU. The CPU. This CPU right here! Profile the CPU first!

L I M I T WA S T E E X A M P L E
static volatile Long value = 0L;
…
private static void waste() {
    for (Long count = 0L; count < 500_000_000; count++) {
        value += count;
    }
}

S TA R T L I M I T WA S T E W I T H A G E N T AT TA C H E D
$ java -agentpath:libyjpagent.jnilib LimitWaste
[YourKit Java Profiler 2015 build 15042]
Log file: /Users/jyoakum/.yjp/log/LimitWaste-4096.log
Press enter to continue.

Y O U R K I T J AVA P R O F I L E R

Y O U R K I T - C H O O S E A P P L I C AT I O N

Y O U R K I T - S TA R T S W I T H S TA C K T E L E M E T RY

Y O U R K I T - S TA R T S A M P L I N G

C O N T I N U E P R O C E S S I N G O F L I M I T WA S T E
124999999750000000 after 7827.359 ms
Press enter to finish.

Y O U R K I T - S T O P S A M P L I N G

Y O U R K I T - A N A LY Z E C A L L T R E E

L I M I T A L L O C AT I O N WA S T E E X A M P L E
…
private static void waste() {
    // Every step of the loop boxes a new Long, both for the counter and the sum.
    for (Long count = 0L;
            count < 500_000_000;
            count = Long.valueOf(count + 1)) {
        value = Long.valueOf(value + count);
    }
}

Y O U R K I T - P E R F C H A R T F O R G C

Y O U R K I T - P E R F C H A R T F O R A L L O C AT I O N

…
private static void lessWaste() {
    for (long count = 0; count < 500_000_000; count++) {
        value = Long.valueOf(value + count);
    }
}

L I M I T WA S T E I M P R O V E D
124999999750000000 after 14833.461 ms
124999999750000000 after 8551.391 ms

Y O U R K I T - L I M I T WA S T E I M P R O V E D

…
private static void haste() {
    long fastValue = 0L;
    for (long count = 0; count < 500_000_000; count++) {
        fastValue += count;
    }
    value = fastValue;
}

L I M I T WA S T E - M A K E H A S T E
124999999750000000 after 14833.461 ms
124999999750000000 after 8551.391 ms
124999999750000000 after 266.119 ms
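The progression from waste() to haste() can be reproduced in one self-contained class. This sketch assumes a much smaller loop count (5,000,000 instead of 500,000,000) so that it finishes quickly; boxedSum() and primitiveSum() are hypothetical names, and timings will vary by machine and JVM.

```java
class BoxedVersusPrimitive {
    static volatile Long value = 0L;

    // Mirrors waste(): boxed counter and a boxed, volatile accumulator.
    static long boxedSum(long n) {
        value = 0L;
        for (Long count = 0L; count < n; count++) {
            value += count; // unboxes, adds, then boxes a new Long
        }
        return value;
    }

    // Mirrors haste(): primitive local accumulator, one volatile write at the end.
    static long primitiveSum(long n) {
        long fastValue = 0L;
        for (long count = 0; count < n; count++) {
            fastValue += count;
        }
        value = fastValue;
        return fastValue;
    }

    public static void main(String[] args) {
        long n = 5_000_000L; // assumed smaller than the slide's 500_000_000

        long start = System.nanoTime();
        long boxed = boxedSum(n);
        long boxedMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        long primitive = primitiveSum(n);
        long primitiveMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(boxed + " after " + boxedMs + " ms (boxed)");
        System.out.println(primitive + " after " + primitiveMs + " ms (primitive)");
    }
}
```

Both methods return the same sum, n(n-1)/2; the difference shows up only in the elapsed time and in the allocation and garbage collection activity a profiler reports.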

Y O U R K I T - L I M I T WA S T E - M A K E H A S T E

T H R E A D P R O F I L I N G
• Thread profiling is concerned with examining the different thread states.
• If threads are blocked most of the time then execution power is reduced.
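Thread states can also be inspected from inside the JVM. The sketch below counts live threads by state using Thread.getAllStackTraces(), and forces one thread into the BLOCKED state so there is something to see. The class name and the demo lock are hypothetical; a profiler gives the same information over time rather than as a single snapshot.

```java
import java.util.EnumMap;
import java.util.Map;

class ThreadStateCount {
    // Takes a snapshot of all live threads, grouped by state.
    static Map<Thread.State, Integer> countStates() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }

    // Holds a lock while another thread tries to take it, then counts states.
    static Map<Thread.State, Integer> demo() throws InterruptedException {
        Object lock = new Object();
        Thread contender = new Thread(() -> {
            synchronized (lock) {
                // nothing to do; the point is waiting for the lock
            }
        }, "blocked-demo");

        Map<Thread.State, Integer> counts;
        synchronized (lock) {
            contender.start();
            while (contender.getState() != Thread.State.BLOCKED) {
                Thread.sleep(1);
            }
            counts = countStates();
        }
        contender.join();
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        demo().forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```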

T H R E A D P R O F I L I N G E X A M P L E
ExecutorService execSvc = Executors.newFixedThreadPool(200);
for (int i = 0; i < 1000; i++) {
    execSvc.execute(new SortingThread());
}
execSvc.shutdown();
execSvc.awaitTermination(5, TimeUnit.MINUTES);

class SortingThread implements Runnable {
    @Override
    public void run() {
        System.out.println("starting...");
        int arraySize = 300_000;
        int[] bigArray = new int[arraySize];
        // populate the array with random numbers
        for (int i = 0; i < arraySize; i++) {
            bigArray[i] = ThreadLocalRandom.current().nextInt(50_000);
        }
        Arrays.sort(bigArray);
        System.out.println("finished!");
    }
}

$ java -agentpath:libyjpagent.jnilib ThreadExample
Log file: /Users/jyoakum/.yjp/log/ThreadExample-90362.log
starting…
…
finished!
Complete after 9041.103 ms

T H R E A D P R O F I L I N G E X A M P L E - Y O U R K I T
The key thing to notice here is that the percentage of time under run() adds up to only 56%, leaving 43% unaccounted for…

T H R E A D P R O F I L I N G E X A M P L E - Y O U R K I T

T H R E A D P R O F I L I N G E X A M P L E - J M C
• JMC (Java Mission Control)
• Low overhead - built into the JVM
• Commercial feature that requires license agreements for production use

$ java -XX:+UnlockCommercialFeatures \
       -XX:+FlightRecorder \
       ThreadExample
starting…
…
finished!
Complete after 4965.916 ms

T H R E A D P R O F I L I N G E X A M P L E - S M A L L E R P O O L
• Originally used a pool size of 200 threads.
• Using a pool size of 40 threads results in nearly the same run time and
some other benefits.

Before, we had multiple threads blocked. Now we are waiting to create threads.

Before, we used nearly 256 MB of heap. Now we use just over 128 MB of heap.
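The pool-size comparison can be repeated with a small harness. This sketch makes several assumptions: it runs 200 tasks rather than 1000 to keep the run short, drops the println calls, and adds a third pool sized to the number of available processors. runWithPool() and the completed counter are hypothetical names, and the timings depend on the machine.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolSizeComparison {
    static final AtomicInteger completed = new AtomicInteger();

    // Same work as SortingThread: fill a 300,000-element array and sort it.
    static void sortRandomArray() {
        int[] bigArray = ThreadLocalRandom.current().ints(300_000, 0, 50_000).toArray();
        Arrays.sort(bigArray);
        completed.incrementAndGet();
    }

    // Runs `tasks` sorting jobs on a fixed pool and returns elapsed milliseconds.
    static long runWithPool(int poolSize, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(PoolSizeComparison::sortRandomArray);
        }
        pool.shutdown();
        if (!pool.awaitTermination(5, TimeUnit.MINUTES)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int size : new int[] {200, 40, cores}) {
            System.out.println("pool of " + size + ": " + runWithPool(size, 200) + " ms");
        }
    }
}
```

Because the tasks are CPU-bound, threads beyond the number of cores mostly add memory for stacks and live arrays without shortening the run.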

M I C R O B E N C H M A R K S
public void doTest() {
    double d;
    long then = System.currentTimeMillis();
    for (int i = 0; i < nLoops; i++) {
        d = fib(15);
    }
    long now = System.currentTimeMillis();
    System.out.println(
            "Elapsed time: " + (now - then));
}

private double fib(int n) {
    if (n < 0) {
        throw new IllegalArgumentException(
                "Must be >= 0");
    }
    if (n == 0) { return 0.0d; }
    if (n == 1) { return 1.0d; }
    double d = fib(n - 2) + fib(n - 1);
    if (Double.isInfinite(d)) {
        throw new ArithmeticException("Overflow");
    }
    return d;
}

M I C R O B E N C H M A R K S M U S T U S E T H E I R R E S U LT S
A smart compiler will end up executing this code:
long then = System.currentTimeMillis();
long now = System.currentTimeMillis();
System.out.println("Elapsed time: " + (now - then));
Avoid compiler optimizations:
• Read each result.
• Use volatile instance variables.
There is a way around that particular issue: ensure that each result is read, not simply written. In practice, changing the definition of d from a local variable to an instance
variable (declared with the volatile keyword) will allow the performance of the method to be measured.
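Applied to the fib() microbenchmark, the volatile fix might look like the sketch below. It is one possible arrangement, not the author's code: the FibBenchmark class, the result field and lastResult() are hypothetical names, and System.nanoTime() replaces currentTimeMillis() for finer resolution.

```java
class FibBenchmark {
    // Volatile instance field: every write must really happen, so the JIT
    // cannot discard the calls to fib() as dead code.
    private volatile double result;
    private final int nLoops;

    FibBenchmark(int nLoops) {
        this.nLoops = nLoops;
    }

    // Runs the loop and returns the elapsed time in nanoseconds.
    long doTest() {
        long then = System.nanoTime();
        for (int i = 0; i < nLoops; i++) {
            result = fib(15);
        }
        long now = System.nanoTime();
        return now - then;
    }

    // Reading the result keeps it in use after the loop as well.
    double lastResult() {
        return result;
    }

    static double fib(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("Must be >= 0");
        }
        if (n == 0) { return 0.0d; }
        if (n == 1) { return 1.0d; }
        return fib(n - 2) + fib(n - 1);
    }

    public static void main(String[] args) {
        FibBenchmark benchmark = new FibBenchmark(10_000);
        long nanos = benchmark.doTest();
        System.out.println("Elapsed time: " + nanos / 1_000_000 + " ms, fib(15) = "
                + benchmark.lastResult());
    }
}
```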

WA R M - U P P E R I O D
For microbenchmarks, a warm-up period is
required; otherwise, the microbenchmark
is measuring the performance of
compilation rather than the code it is
attempting to measure.
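A warm-up period can be added with a small helper that runs the code untimed before measuring it. The sketch below is a hand-rolled illustration; the iteration counts and the names WarmUp, timeNanos() and measure() are assumptions, and a benchmark harness would handle this more rigorously.

```java
class WarmUp {
    static volatile double sink; // volatile so the measured work is not discarded

    // Times `iterations` runs of the work, in nanoseconds.
    static long timeNanos(Runnable work, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            work.run();
        }
        return System.nanoTime() - start;
    }

    // Runs the work untimed first so the JIT can compile it, then measures.
    static long measure(Runnable work, int warmupIterations, int measuredIterations) {
        timeNanos(work, warmupIterations); // result deliberately ignored
        return timeNanos(work, measuredIterations);
    }

    public static void main(String[] args) {
        Runnable work = () -> sink = Math.sqrt(sink + 2);
        long cold = timeNanos(work, 1_000_000);
        long warm = measure(work, 5_000_000, 1_000_000);
        System.out.println("cold: " + cold + " ns, after warm-up: " + warm + " ns");
    }
}
```

The cold figure includes interpretation and compilation time, so only the second number says something about the code being measured.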

M A C R O B E N C H M A R K S
No test can give comparable results
to examining an application in production.
The best thing to use to measure performance of an application is the application itself, in conjunction with any external resources it uses. If the application normally
checks the credentials of a user by making LDAP calls, it should be tested in that mode. Stubbing out the LDAP calls may make sense for module-level testing, but the
application must be tested in its full configuration.

S U M M A RY
• When to profile
• Profiler Sampling
• Profiler Instrumentation
• Where to Start
• Examples
• Micro vs Macro Benchmarking
Yes, it is the same slide as the agenda slide.
