Micro-Benchmarking Considered Harmful

Copyright © 2017, Oracle and/or its affiliates. All rights reserved.
Thomas Wuerthinger
@thomaswue
Senior Research Director, Oracle Labs
Keynote at 8th ACM/SPEC International Conference on Performance Engineering
April 2017

My Background
• Working on various optimizing compilers
– HotSpot client compiler
– V8 Crankshaft optimizing compiler
– Maxine Research VM
• Since 2011 at Oracle Labs
– Graal compiler: a new high-tier compiler for Java
– Truffle: practical partial evaluation for high-performance dynamic language interpreters
– Group of ~50 researchers attempting to push the boundaries of managed language runtimes together
with university research collaborators
We are looking for passionate compiler engineers, researchers, and interns in Zurich, Prague,
Linz, or the Bay Area! Mail to thomas.wuerthinger@oracle.com or DM @thomaswue

Which of those Java statements executes faster?
if (x instanceof A) {
}
if (x instanceof B) {
}
It depends…

It depends – Part I
• Is it a final leaf class?
– Direct comparison with constant type read from object header
• Is it a class?
– Direct super-type check available for class hierarchies up to a specific depth (8 by default
on HotSpot)
• Is it an interface?
– Secondary super-type check caches last checked type
– Worst case loop over list of super types of object
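The three categories above can be made concrete with a small hierarchy. This is only a sketch: the type names (Shape, Circle, Polygon, Square, Drawable) are hypothetical and not from the talk, and the comments describe what a JIT compiler such as HotSpot may do for each check, not what it is guaranteed to emit.

```java
// Hypothetical types, introduced only for illustration.
interface Drawable {}
class Shape {}
final class Circle extends Shape implements Drawable {}   // final leaf class
class Polygon extends Shape {}                            // non-final class
class Square extends Polygon implements Drawable {}

class InstanceofKinds {
    // Final leaf class: may compile to one comparison against the type in the object header.
    static boolean isCircle(Object x) { return x instanceof Circle; }

    // Non-final class: may use the direct super-type check (bounded hierarchy depth).
    static boolean isPolygon(Object x) { return x instanceof Polygon; }

    // Interface: may need the secondary super-type check, in the worst case a loop.
    static boolean isDrawable(Object x) { return x instanceof Drawable; }

    public static void main(String[] args) {
        Object[] samples = { new Circle(), new Polygon(), new Square(), "text" };
        for (Object x : samples) {
            System.out.println(x.getClass().getSimpleName() + ": "
                + isCircle(x) + " " + isPolygon(x) + " " + isDrawable(x));
        }
    }
}
```

All three methods look identical at the source level, which is the point: the cost category is decided by the declaration of the checked type, not by the statement itself.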

It depends – Part II
• Static information on checked object can be available
• Check can completely fold
– Made redundant by an enclosing or preceding check
• Check can turn into different category
– Interface check can turn into class check
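A minimal sketch of checks that can fold, assuming hypothetical types Animal, Dog, and Puppy that do not appear in the talk. Whether a given compiler actually folds these checks depends on inlining and on the optimizations it runs; the comments only describe what is possible.

```java
// Hypothetical types, introduced only for illustration.
interface Animal {}
class Dog implements Animal {}
final class Puppy extends Dog {}

class FoldingChecks {
    // The static type of d is Dog, and Dog implements Animal, so the
    // interface check can be reduced to a null check.
    static boolean staticallyKnown(Dog d) {
        return d instanceof Animal;
    }

    // The inner check repeats the enclosing check on the same value,
    // so a compiler can fold it to true and drop the "return 2" path.
    static int dominated(Object x) {
        if (x instanceof Dog) {
            if (x instanceof Dog) {   // redundant: implied by the enclosing check
                return 1;
            }
            return 2;                 // dead once the inner check folds
        }
        return 0;
    }
}
```

A micro-benchmark of `dominated` alone would measure one type check, while the same inner check extracted into its own method and benchmarked separately would measure a full check that the real program never executes.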

It depends – Part III
• If the check does not fold, there can be profiling information available
• List of concrete classes whose instances were observed
– HotSpot uses TypeProfileWidth option (default=2)
• Turns into cascade of direct checks
– Deoptimization to the interpreter and reprofiling triggered if all direct checks fail
– Cascade can be optimized if some static information on object is available
• Profile pollution from different callers can further increase unpredictability
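The cascade of direct checks can be pictured at the source level as follows. This is a hand-written approximation, not compiler output: the node types are hypothetical, the profile is assumed to have recorded exactly two classes (AddNode and MulNode), and a counter plus a fallback call stands in for deoptimization, which a real VM performs by transferring execution to the interpreter.

```java
// Hypothetical types, introduced only for illustration.
interface Node {}
final class AddNode implements Node {}
final class MulNode implements Node {}
final class ConstNode implements Node {}

class ProfiledCascade {
    // Counts how often the cascade missed (stand-in for deoptimization events).
    static int slowPathTaken = 0;

    // The check as written in the source program.
    static boolean original(Object x) {
        return x instanceof Node;
    }

    // Approximate shape of the specialized code, assuming the profile
    // observed only AddNode and MulNode at this check.
    static boolean specialized(Object x) {
        if (x == null) return false;
        Class<?> c = x.getClass();
        if (c == AddNode.class) return true;   // first profiled class
        if (c == MulNode.class) return true;   // second profiled class
        slowPathTaken++;                       // real VM: deoptimize and reprofile
        return original(x);
    }
}
```

The sketch makes the input dependence visible: a workload that only ever passes AddNode pays for one pointer comparison, while a workload that introduces a third class takes the slow path.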

It depends – Part IV
• Global assumptions about current state of class hierarchy
– Non-final leaf class can be treated as final
– Interface with single implementor changed into class check
• Assumption is registered for the compiled code
– Class loading can cause deoptimization
– Threads stopped at safepoint and execution transferred to interpreter

It depends – Part V
• Once the approximate low-level operations to be performed are known, there is still large
machine-dependent variability
– Branch prediction availability
– Memory bandwidth
– Cache behavior

Example Assumptions in Other Languages
• JavaScript
  Array.prototype[100] = 42;
  console.log([1, 2, 3][100]);
• Ruby
  Fixnum.send :define_method, :+ do |other|
    self - other
  end
  puts 44 + 2
• Let’s talk about R…
  x <- c(1, 2)
  `[<-` <- function(x, i, j, ..., value) { 42 }
  x[1] <- 100
  print(x)
  print(length(x))

Solution for which statement executes faster?
if (x instanceof A) {
}
if (x instanceof B) {
}
Dependent on properties of A, B, dynamic values of x,
surrounding code, potentially *any* loaded code, and the
hardware it is running on… so basically almost anything...

What about profilers?
• Attribution of performance in state-of-the-art Java profilers is based on
highly inaccurate program location information
• Data more accurate than per compilation unit is fake
• Compilation units can be very large
– Can contain 1000s of inlined methods
– Compilers perform aggressive code motion mixing the code of those methods
Method profilers are (often) lying to you!

Micro-benchmarking to the rescue?
• Extract small patterns into compilation units
• Accurately measure those snippets of code
Accurate measurement, but conclusions extending to
performance in a larger context practically impossible

Complex Interactions Between Program Snippets
• Performance of combination of two snippets is difficult to predict
• Examples of positive combination effects
– Global value numbering of expressions
– Read/write elimination
– Tail duplication opportunities for shared conditions
• Examples of negative combination effects
– Memory kills
– Register kills or pressure
– Prohibited optimizations based on code size trade-offs (e.g., loop unrolling, inlining)

Micro-Benchmark Example – Part I
int foo(int x) {
return (a % b) - (x * b);
}
int bar(int x) {
return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x;
}
T(foo + bar) < T(foo) + T(bar)
T(x) … time for executing code x as part of a long-running loop
Shared expression (x*b) makes combined code slightly faster on most platforms.
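A hand-rolled harness along the following lines could be used to compare the three timings. It is a sketch under stated assumptions: the slide does not show how a and b are declared, so they are assumed here to be static int fields with arbitrary values, and the helper names (runFoo, runBar, runBoth, time, sink) are hypothetical. Naive System.nanoTime timing is itself subject to every effect discussed in this talk, so a dedicated harness such as JMH would be the better tool for real measurements.

```java
import java.util.function.IntUnaryOperator;

class CombinedSnippets {
    // Assumed declarations; the slide does not show how a and b are defined.
    static int a = 1234567;
    static int b = 89;

    // Written to after each run so the loop results stay observable.
    static volatile int sink;

    static int foo(int x) { return (a % b) - (x * b); }
    static int bar(int x) { return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x; }

    // Each snippet on its own inside a long-running loop.
    static int runFoo(int n) { int s = 0; for (int i = 0; i < n; i++) s += foo(i); return s; }
    static int runBar(int n) { int s = 0; for (int i = 0; i < n; i++) s += bar(i); return s; }

    // Both snippets in one loop body, so the compiler can share (x * b).
    static int runBoth(int n) { int s = 0; for (int i = 0; i < n; i++) s += foo(i) + bar(i); return s; }

    // Returns elapsed nanoseconds for one invocation of the given loop.
    static long time(IntUnaryOperator loop, int n) {
        long start = System.nanoTime();
        int result = loop.applyAsInt(n);
        long elapsed = System.nanoTime() - start;
        sink = result;
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        for (int i = 0; i < 10; i++) {   // warm-up rounds so the loops get compiled
            time(CombinedSnippets::runFoo, n);
            time(CombinedSnippets::runBar, n);
            time(CombinedSnippets::runBoth, n);
        }
        long tFoo = time(CombinedSnippets::runFoo, n);
        long tBar = time(CombinedSnippets::runBar, n);
        long tBoth = time(CombinedSnippets::runBoth, n);
        System.out.printf("T(foo)=%d ns, T(bar)=%d ns, T(foo + bar)=%d ns%n", tFoo, tBar, tBoth);
    }
}
```

The reported numbers will vary with hardware, JVM version, and flags, which is consistent with the argument of the talk; the repository referenced at the end of the deck contains the author's own version of the experiment.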

Micro-Benchmark Example – Part II
Programmer decides to “optimize” method bar and replaces it with a new version bar’.
int bar(int x) {
return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x;
}
int bar’(int x) {
return (x * b) % 100 == 0 ? bar’(x + 1) : x;
}
T(bar’) < T(bar)
Micro-benchmarking confirms that the new version runs faster.

Micro-Benchmark Example – Part III
T(foo + bar’) > T(foo + bar)
Suddenly the combination of foo and bar’ runs significantly slower.
Reason: The recursion introduces a new kill point, so the loop-invariant expression (a % b) can no
longer be moved out of the loop.
int bar’(int x) {
return (x * b) % 100 == 0 ? bar’(x + 1) : x;
}
int foo(int x) {
return (a % b) - (x * b);
}
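The effect of moving (a % b) out of the loop can be sketched by writing both loop shapes by hand. This is an illustration, not compiler output: a and b are assumed to be static int fields with arbitrary values, and the method names asWritten and hoisted are hypothetical.

```java
class HoistingSketch {
    // Assumed declarations; the slides do not show how a and b are defined.
    static int a = 1234567;
    static int b = 89;

    // The loop as the programmer wrote it: (a % b) appears in every iteration.
    static int asWritten(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (a % b) - (i * b);
        }
        return sum;
    }

    // The shape a compiler may produce when nothing in the loop body can
    // write to a or b: the division happens once, before the loop.
    static int hoisted(int n) {
        int invariant = a % b;
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += invariant - (i * b);
        }
        return sum;
    }
}
```

Once the loop body contains a call the compiler does not fully inline, such as the recursive bar’, the compiler has to assume the call may write to a or b. The hoisted form is then no longer a legal transformation, and the integer division stays inside the loop.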

Conclusion: Sum can be Bigger or Smaller Than Parts
int foo(int x) {
return (a % b) - (x * b);
}
int bar(int x) {
return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x;
}
int bar’(int x) {
return (x * b) % 100 == 0 ? bar’(x + 1) : x;
}
T(bar’) < T(bar)
T(foo + bar’) > T(foo + bar)
T(foo + bar) < T(foo) + T(bar)
T(foo + bar’) > T(foo) + T(bar’)
Try yourself! github.com/thomaswue/micro-bench-harmful
Prominent real-world example: HashMap#put implementation change from JDK7 to JDK8

Performance Advice?
Recent slide deck from a browser vendor:
• Highly runtime-specific
– Even same runtime with different version yields different results
• Does not extend to larger context (profile, surrounding code, …)

Why optimize at all?
• High-level programming abstractions have increased the delta between
optimized and unoptimized code
• Factor of up to ~100x possible for Java with inlining, escape analysis, and other
profile-guided as well as traditional compiler optimizations
• Factor of up to ~1000x for languages like R or Ruby
• Abstractions help build more complex programs by overcoming the human
mind bottleneck
• There will be even more abstractions in the future, not fewer…

Conclusions
• Quantitative timing-based performance metrics have serious downsides
– Results dependent on hardware, runtime version, runtime global state, surrounding
code, program snippet interactions, program input, …
– Micro-benchmarking can easily lead to optimizations of individual operations that slow
down the overall program
• Qualitative performance metrics should get more attention
– Characterize program snippet performance in terms of general properties that are
relevant for performance (e.g., memory kill locations, logic complexity, profiling state)
– Less useful for a specific problem instance, but more generally applicable and more
robust in terms of changes to the program, its input, or its surrounding environment
– In particular advisable for often reused program snippets (e.g., libraries)

Q/A
Graal projects on GitHub: github.com/graalvm
Micro-benchmark example on GitHub:
github.com/thomaswue/micro-bench-harmful
@thomaswue
