Copyright	©	2017, Oracle	and/or	its	affiliates.	All	rights	reserved.		|
Micro-Benchmarking	
Considered	Harmful
Thomas	Wuerthinger
@thomaswue
Senior	Research	Director,	Oracle	Labs
Keynote	at	8th	ACM/SPEC	International	Conference	on	Performance	Engineering
April	2017
My Background
• Working	on	various	optimizing	compilers
– HotSpot	client	compiler
– V8	Crankshaft	optimizing	compiler
– Maxine	Research	VM
• Since	2011	at	Oracle	Labs
– Graal compiler:	a	new	high-tier	compiler	for	Java
– Truffle:	practical	partial	evaluation	for	high-performance	dynamic	language	interpreters
– Group	of	~50	researchers attempting	to	push	the	boundaries	of	managed	language	runtimes	together	
with	university	research	collaborators
We are looking for passionate compiler engineers, researchers, and interns in Zurich, Prague, Linz, or the Bay Area! Mail to thomas.wuerthinger@oracle.com or DM @thomaswue
Which of these Java statements executes faster?
It	depends…
if (x instanceof A) {
}
if (x instanceof B) {
}
It	depends	– Part	I
• Is	it	a	final	leaf	class?
– Direct	comparison	with	constant	type	read	from	object	header
• Is	it	a	class?
– Direct super-type check available for class hierarchies up to a specific depth (8 by default on HotSpot)
• Is	it	an	interface?
– Secondary	super-type	check	caches	last	checked	type
– Worst	case	loop	over	list	of	super	types	of	object
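The three categories above can be made concrete with a small sketch. The types `Shape`, `Polygon`, and `Square` are hypothetical and exist only for illustration. Which machine-level check a given JIT actually emits for each line depends on the runtime and its version, so the comments describe the category from the slide, not a guaranteed outcome.

```java
class TypeCheckKinds {
    // Hypothetical types, one per category from the slide.
    interface Shape {}                              // interface
    static class Polygon implements Shape {}        // non-final class in a hierarchy
    static final class Square extends Polygon {}    // final leaf class

    // Counts how many of the three checks succeed for x.
    static int matches(Object x) {
        int n = 0;
        if (x instanceof Square) n++;   // final leaf class: candidate for an exact type comparison
        if (x instanceof Polygon) n++;  // class: candidate for a depth-based super-type check
        if (x instanceof Shape) n++;    // interface: candidate for a secondary super-type check
        return n;
    }

    public static void main(String[] args) {
        System.out.println(matches(new Square()));   // prints 3
        System.out.println(matches(new Polygon()));  // prints 2
        System.out.println(matches("text"));         // prints 0
    }
}
```

All three lines look identical at the source level, which is the point: the cost category is decided by properties of the checked type, not by the syntax.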
It	depends	– Part	II
• Static	information	on	checked	object	can	be	available
• Check	can	completely	fold
– Made	redundant	by	encapsulating	check	or	preceding	check
• Check	can	turn	into	different	category
– Interface	check	can	turn	into	class	check
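A minimal sketch of a check that a compiler may fold, assuming hypothetical classes `Base` and `Leaf`. The inner check is implied by the enclosing one, so an optimizing compiler is free to treat it as constant; whether a particular compiler does so is not guaranteed by this example.

```java
class FoldingSketch {
    // Hypothetical hierarchy used only for illustration.
    static class Base {}
    static final class Leaf extends Base {}

    static int describe(Object x) {
        if (x instanceof Leaf) {
            // Here x is statically known to be a Leaf, and every Leaf is a Base,
            // so this inner check is redundant and may fold to "true".
            if (x instanceof Base) {
                return 2;
            }
            return 1; // never reached: every Leaf is also a Base
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(describe(new Leaf())); // prints 2
        System.out.println(describe(new Base())); // prints 0
    }
}
```

Measured in isolation, the inner `instanceof` would have a cost; in this context it may cost nothing at all.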
It	depends	– Part	III
• If	the	check	does	not	fold,	there	can	be	profiling	information	available
• List	of	concrete	classes	whose	instances	were	observed
– HotSpot uses the TypeProfileWidth option (default=2)
• Turns	into	cascade	of	direct	checks
– Deoptimization to the interpreter and reprofiling triggered if all direct checks fail
– Cascade	can	be	optimized	if	some	static	information	on	object	is	available
• Profile	pollution	from	different	callers	can	further	increase	unpredictability
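The cascade of direct checks can be pictured in plain Java. This is a conceptual model written by hand, not what HotSpot or Graal literally generates. The classes `Dog`, `Cat`, `Fish` and the counter `slowPathCount` are hypothetical, and the slow path here simply falls back to the generic check, where a real runtime would deoptimize and reprofile.

```java
class ProfileCascadeSketch {
    interface Animal {}
    static final class Dog implements Animal {}
    static final class Cat implements Animal {}
    static final class Fish implements Animal {}

    // Stand-in for "deoptimize and reprofile": counts how often the cascade missed.
    static int slowPathCount = 0;

    // Models code specialized for a profile that observed only Dog and Cat
    // (a profile width of 2, matching the default mentioned above).
    static boolean isAnimalSpecialized(Object x) {
        Class<?> c = (x == null) ? null : x.getClass();
        if (c == Dog.class) return true;   // first profiled class: direct comparison
        if (c == Cat.class) return true;   // second profiled class: direct comparison
        slowPathCount++;                   // all direct checks failed
        return x instanceof Animal;        // generic check as the fallback
    }

    public static void main(String[] args) {
        System.out.println(isAnimalSpecialized(new Dog()));  // prints true
        System.out.println(isAnimalSpecialized(new Fish())); // prints true
        System.out.println(slowPathCount);                   // prints 1
    }
}
```

The same source-level check is fast for the profiled classes and slow for everything else, so its measured cost depends on which objects flowed through it earlier.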
It	depends	– Part	IV
• Global	assumptions	about	current	state	of	class	hierarchy
– Non-final	leaf	class	can	be	treated	as	final
– Interface with single implementor changed into class check
• Assumption	is	registered	for	the	compiled	code
– Class	loading	can	cause	deoptimization
– Threads	stopped	at	safepoint and	execution	transferred	to	interpreter
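A hand-written illustration of the single-implementor assumption, with hypothetical names `Codec` and `Utf8Codec`. The second method models what a compiler could emit while `Utf8Codec` is the only loaded implementor and has no subclasses. It is a sketch of the idea, not actual compiler output.

```java
class SingleImplementorSketch {
    interface Codec {}
    static class Utf8Codec implements Codec {} // assumed to be the only implementor loaded so far

    // The check as written in the source program.
    static boolean isCodec(Object x) {
        return x instanceof Codec;
    }

    // Equivalent only under the global assumption "Utf8Codec is the sole implementor
    // and a leaf class". Loading a second implementor or a subclass would invalidate
    // that assumption, and a real runtime would then deoptimize the compiled code.
    static boolean isCodecUnderAssumption(Object x) {
        return x != null && x.getClass() == Utf8Codec.class;
    }

    public static void main(String[] args) {
        Object[] samples = { new Utf8Codec(), "text", null };
        for (Object s : samples) {
            System.out.println(isCodec(s) == isCodecUnderAssumption(s)); // prints true three times
        }
    }
}
```

The two methods agree only while the class hierarchy stays as it is, which is why code loaded later, anywhere in the program, can change the cost of this check.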
It	depends	– Part	V
• Once the approximate low-level operations to be performed are known, there is still large machine-dependent variability
– Branch	prediction	availability
– Memory	bandwidth
– Cache	behavior
Example	Assumptions	in	Other	Languages
• JavaScript
Array.prototype[100] = 42;
console.log([1, 2, 3][100]);
• Ruby
Fixnum.send :define_method, :+ do |other|
  self - other
end
puts 44 + 2
• Let’s talk about R…
x <- c(1, 2)
`[<-` <- function(x, i, j, ..., value) { 42 }
x[1] <- 100
print(x)
print(length(x))
Solution	for	which	statement	executes	faster?	
Dependent	on	properties	of	A,	B,	dynamic	values	of	x,	
surrounding	code,	potentially	*any*	loaded	code,	and	the	
hardware	it	is	running	on…	so	basically	almost	anything...
if (x instanceof A) {
}
if (x instanceof B) {
}
What about profilers?
• Attribution of performance in state-of-the-art Java profilers is based on highly inaccurate program location information
• Data	more	accurate	than	per	compilation	unit	is	fake
• Compilation	units	can	be	very	large
– Can contain 1000s of inlined methods
– Compilers	perform	aggressive	code	motion	mixing	the	code	of	those	methods
Method	profilers	are	(often)	lying	to	you!
Micro-benchmarking to the rescue?
• Extract	small	patterns	into	compilation	units
• Accurately	measure	those	snippets	of	code
Accurate measurement, but extending conclusions to performance in a larger context is practically impossible
Complex	Interactions	Between	Program	Snippets
• Performance	of	combination	of	two	snippets	is	difficult	to	predict
• Examples	of	positive	combination	effects
– Global value numbering of expressions
– Read/write	elimination
– Tail	duplication	opportunities	for	shared	conditions
• Examples	of	negative	combination	effects
– Memory	kills
– Register	kills	or	pressure
– Prohibited	optimizations	based	on	code	size	trade-offs	(e.g.,	loop	unrolling,	inlining)
Micro-Benchmark	Example	– Part	I
int foo(int x) {
    return (a % b) - (x * b);
}
int bar(int x) {
    return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x;
}
T(foo	+	bar)	<	T(foo)	+	T(bar)
T(x)	…	time	for	executing	code	x	as	part	of	a	long-running	loop
Shared	expression	(x*b)	makes	combined	code	slightly	faster	on	most	platforms.
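To see where the saving can come from, the combined code can be written out by hand with the shared product computed once. This is a sketch of the effect of inlining followed by value numbering, not actual compiler output. The field values `a = 17` and `b = 5` and the method names `combined` and `combinedShared` are assumptions made for illustration.

```java
class SharedExpressionSketch {
    static int a = 17, b = 5; // assumed values; the slides do not specify them

    static int foo(int x) { return (a % b) - (x * b); }
    static int bar(int x) { return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x; }

    // foo and bar executed together, as written in the source.
    static int combined(int x) {
        return foo(x) + bar(x);
    }

    // The same computation after inlining both methods by hand and
    // reusing x * b, which both bodies need.
    static int combinedShared(int x) {
        int xb = x * b;                                      // computed once, used twice
        int f = (a % b) - xb;                                // body of foo
        int g = xb % 100 == 0 ? (int) Math.sin(x + 1) : x;   // body of bar
        return f + g;
    }

    public static void main(String[] args) {
        System.out.println(combined(40) == combinedShared(40)); // prints true
    }
}
```

Neither `foo` nor `bar` benefits from this when benchmarked alone, because the sharing opportunity only exists once both appear in the same compilation unit.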
Micro-Benchmark	Example	– Part	II
int bar(int x) {
    return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x;
}
T(bar’)	<	T(bar)
T(x)	…	time	for	executing	code	x	as	part	of	a	long-running	loop
The programmer decides to “optimize” method bar and replace it with a new version.
Micro-benchmarking confirms that the new version runs faster.
int bar’(int x) {
    return (x * b) % 100 == 0 ? bar’(x + 1) : x;
}
Micro-Benchmark	Example	– Part	III
T(foo	+	bar’)	>	T(foo	+	bar)
T(x)	…	time	for	executing	code	x	as	part	of	a	long-running	loop
Suddenly, the combination of foo and bar’ runs significantly slower.
Reason: Recursion introduces a new kill point, and the loop-invariant expression (a % b) can no longer be moved out of the loop.
int bar’(int x) {
    return (x * b) % 100 == 0 ? bar’(x + 1) : x;
}
int foo(int x) {
    return (a % b) - (x * b);
}
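The lost optimization can be shown by performing it by hand. In the sketch below, `barPrime` stands in for bar’ (an apostrophe is not valid in a Java identifier), and `a = 17`, `b = 5`, `loop`, and `loopHoisted` are assumptions made for illustration. Hoisting `a % b` is valid here only because nothing in the loop writes `a` or `b`; a compiler that treats the recursive call as an opaque call cannot prove this and must recompute the expression in every iteration.

```java
class HoistingSketch {
    static int a = 17, b = 5; // assumed values; the slides do not specify them

    static int foo(int x) { return (a % b) - (x * b); }

    // Stand-in for bar’ from the slide.
    static int barPrime(int x) { return (x * b) % 100 == 0 ? barPrime(x + 1) : x; }

    // The long-running loop as written: a % b is evaluated inside foo on every iteration.
    static long loop(int n) {
        long sum = 0;
        for (int x = 1; x <= n; x++) {
            sum += foo(x) + barPrime(x);
        }
        return sum;
    }

    // The same loop with the invariant moved out by hand.
    static long loopHoisted(int n) {
        int invariant = a % b; // valid only because barPrime never writes a or b
        long sum = 0;
        for (int x = 1; x <= n; x++) {
            sum += (invariant - x * b) + barPrime(x);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(loop(10_000) == loopHoisted(10_000)); // prints true
    }
}
```

Both loops compute the same result; the difference is purely in how much work each iteration performs, which is exactly the kind of effect a micro-benchmark of bar’ alone cannot reveal.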
Conclusion:	Sum	can	be	Bigger	or	Smaller	Than	Parts
int foo(int x) {
    return (a % b) - (x * b);
}
int bar(int x) {
    return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x;
}
int bar’(int x) {
    return (x * b) % 100 == 0 ? bar’(x + 1) : x;
}
T(bar’)	<	T(bar)
T(foo	+	bar’)	>	T(foo	+	bar)
T(foo	+	bar)	<	T(foo)	+	T(bar)
T(foo	+	bar’)	>	T(foo)	+	T(bar’)
Try yourself! github.com/thomaswue/micro-bench-harmful
Prominent	real	world	example:	HashMap#put implementation	change	from	JDK7	to	JDK8
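For readers who want to experiment with T(…) themselves, a deliberately naive timing loop might look like the following. This is a separate sketch, not the code from the repository above. The names `time` and `sink`, the values of `a` and `b`, and the iteration count are all assumptions. No expected timings are given, because, as argued throughout, they depend on the hardware, the runtime, and the surrounding code.

```java
import java.util.function.IntUnaryOperator;

class NaiveTimingSketch {
    static int a = 17, b = 5; // assumed values; the slides do not specify them
    static int sink;          // consumes results so the loops are not trivially dead code

    static int foo(int x) { return (a % b) - (x * b); }
    static int bar(int x) { return (x * b) % 100 == 0 ? (int) Math.sin(x + 1) : x; }

    // Returns elapsed nanoseconds for running body in a loop, after one warm-up pass.
    static long time(IntUnaryOperator body, int iterations) {
        for (int x = 0; x < iterations; x++) {
            sink += body.applyAsInt(x); // warm-up pass
        }
        long start = System.nanoTime();
        for (int x = 0; x < iterations; x++) {
            sink += body.applyAsInt(x); // measured pass
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        System.out.println("T(foo)       = " + time(NaiveTimingSketch::foo, n) + " ns");
        System.out.println("T(bar)       = " + time(NaiveTimingSketch::bar, n) + " ns");
        System.out.println("T(foo + bar) = " + time(x -> foo(x) + bar(x), n) + " ns");
    }
}
```

Note that this harness itself illustrates the problem: all three measurements share one call site inside `time`, so the type profile at `body.applyAsInt` is polluted by earlier runs and the order of the measurements can change the results.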
Performance	Advice?
Recent	slide	deck	from	
a	browser	vendor:
• Highly	runtime-specific
– Even	same	runtime	with	different	version	yields	different	results
• Does	not	extend	to	larger	context	(profile,	surrounding	code,	…)
Why	optimize	at	all?
• High-level programming abstractions increased the performance delta between optimized and unoptimized code
• Factor	up	to	~100x	possible	for	Java	with	inlining,	escape	analysis	and	other	
profile-guided	as	well	as	traditional	compiler	optimizations
• Factor	up	to	~1000x	for	languages	like	R	or	Ruby
• Abstractions help build more complex programs, overcoming the bottleneck of the human mind
• There will be even more abstractions in the future, not fewer…
Conclusions
• Quantitative	timing-based	performance	metrics	have	serious	downsides
– Results	dependent	on	hardware,	runtime	version,	runtime	global	state,	surrounding	
code,	program	snippet	interactions,	program	input,	…
– Micro-benchmarking	can	easily	lead	to	optimizations	of	individual	operations	that	slow	
down	the	overall	program
• Qualitative	performance	metrics	should	get	more	attention
– Characterize	program	snippet	performance	in	terms	of	general	properties	that	are	
relevant	for	performance	(e.g.,	memory	kill	locations,	logic	complexity,	profiling	state)
– Less	useful	for	specific	problem	instance,	but	more	generally	applicable	and	more	
robust	in	terms	of	changes	to	the	program,	its	input,	or	its	surrounding	environment
– In	particular	advisable	for	often	reused	program	snippets	(e.g.,	libraries)
Q/A
Graal projects on GitHub: github.com/graalvm
Micro-benchmark example on GitHub: github.com/thomaswue/micro-bench-harmful
@thomaswue
