The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations

The Truth, the Whole Truth, and
Nothing but the Truth
A Pragmatic Guide to Assessing Empirical Evaluations
Stephen M. Blackburn, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney
Jose ́ Nelson Amaral, Tim Brecht, Lubomr Bulej, Cliff Click, Lieven Eeckhout,
Sebastian Fischmeister, Daniel Frampton, Laurie J. Hendren, Michael Hind,
Antony L. Hosking, Richard E. Jones, Tomas Kalibera, Nathan Keynes,
Nathaniel Nystrom, and Andreas Zeller
Programming Languages Mentoring Workshop, SantaBarbara, June 2016

Rob Walsh
Anneke
Kathryn McKinley
Ron Morrison
Eliot Moss
Robin Stanton
Dave Thomas
Generosity
Perseverance
Mr Beach
Graham Hellistrand

1
1.1
1.2
1.3
1.4
1.5
1.6
1 1.5 2 2.5 3
16
18
20
22
24
30MB 40MB 50MB 60MB 70MB 80MB 90MB
NormalizedTime
Time(sec)
Heap size relative to minimum heap size
Heap size
Depth First
Breadth First
Partial DF 2 Children
OOR
(a) jython Total Time
1
1.1
1.2
1.3
1.4
1.5
1.6
1 1.5 2 2.5 3
12
13
14
15
16
17
18
15MB 20MB 25MB 30MB 35MB 40MB 45MB 50MB 55MB 60MB 65MB
NormalizedTime
Time(sec)
Heap size
Depth First
Breadth First
OOR
(b) db Total Time
1
1.1
1.2
1.3
1.4
1.5
1.6
1 1.5 2 2.5 3
6.5
7
7.5
8
8.5
9
9.5
10
NormalizedTime
Time(sec)
Heap size
Depth First
Breadth First
OOR
(c) javac Total Time
1
1.1
1.2
1.3
1.4
1.5
1 1.5 2 2.5 3
15
16
17
18
19
20
21
22
NormalizedMutatorTime
MutatorTime(sec)
Heap size
Depth First
Breadth First
OOR
(d) jython Mutator Time
1
1.1
1.2
1.3
1.4
1.5
1 1.5 2 2.5 3
11
12
13
14
15
16
17
15MB 20MB 25MB 30MB 35MB 40MB 45MB 50MB 55MB 60MB 65MB
MutatorTime(sec)
Heap size
Depth First
Breadth First
OOR
(e) db Mutator Time
0.96
0.98
1
1.02
1.04
1.06
1.08
1.1
1 1.5 2 2.5 3
5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
MutatorTime(sec)
Heap size
Depth First
Breadth First
OOR
(f) javac Mutator Time
3.5
4
4.5
5
120
140
160
torL2Misses
ses(10^6)
Heap size
Depth First
Breadth First
OOR
1.6
1.8
2
200
220
15MB20MB25MB30MB35MB40MB45MB50MB55MB60MB65MB
torL2Misses
ses(10^6)
Heap size
Depth First
Breadth First
OOR
1.6
1.8
2
26
28
30
32
torL2Misses
ses(10^6)
Heap size
Depth First
Breadth First
OOR

hypotheses, evaluations, claims

a block-based heap
structure could allow us
to get the best of
compacting, free list,
and copying collectors
hypothesis
Abstract
Programmers are increasingly choosing managed languages for
modern applications, which tend to allocate many short-to-medium
lived small objects. The garbage collector therefore directly deter-
mines program performance by making a classic space-time trade-
off that seeks to provide space efficiency, fast reclamation, and mu-
tator performance. The three canonical tracing garbage collectors:
semi-space, mark-sweep, and mark-compact each sacrifice one ob-
jective. This paper describes a collector family, called mark-region,
and introduces opportunistic defragmentation, which mixes copy-
ing and marking in a single pass. Combining both, we implement
immix, a novel high performance garbage collector that achieves all
three performance objectives. The key insight is to allocate and re-
claim memory in contiguous regions, at a coarse block grain when
possible and otherwise in groups of finer grain lines. We show
that immix outperforms existing canonical algorithms, improving
total application performance by 7 to 25% on average across 20
benchmarks. As the mature space in a generational collector, im-
mix matches or beats a highly tuned generational collector, e.g. it
improves jbb2000 by 5%. These innovations and the identification
of a new family of collectors open new opportunities for garbage
collector design.
Categories and Subject Descriptors D.3.4 [Programming Lan-
guages]: Processors—Memory management (garbage collection)
General Terms Algorithms, Experimentation, Languages, Performance,
Measurement
Keywords Fragmentation, Free-List, Compact, Mark-Sweep, Semi-
Space, Mark-Region, Immix, Sweep-To-Region, Sweep-To-Free-List
1. Introduction
Modern applications are increasingly written in managed lan-
guages and make conflicting demands on their underlying mem-
ory managers. For example, real-time applications demand pause-
time guarantees, embedded systems demand space efficiency, and
servers demand high throughput. In seeking to satisfy these de-
efficiency, fast collection, and mutator performance through con-
tiguous allocation of contemporaneous objects.
Figure 1 starkly illustrates this dichotomy for full heap versions
of mark-sweep (MS), semi-space (SS), and mark-compact (MC)
implemented in MMTk [12], running on a Core 2 Duo. It plots
the geometric mean of total time, collection time, mutator time,
and mutator cache misses as a function of heap size, normalized
to the best, for 20 DaCapo, SPECjvm98, and SPECjbb2000 bench-
marks, and shows 99% confidence intervals. The crossing lines in
Figure 1(a) illustrate the classic space-time trade-off at the heart of
garbage collection. Mark-compact is uncompetitive in this setting
due to its overwhelming collection costs. In smaller heap sizes, the
space and collector efficiency of mark-sweep perform best since
the overheads of garbage collection dominate total performance.
Figures 1(c) and 1(d) show that the primary advantage for semi-
space is 10% better mutator time compared with mark-sweep, due
to better cache locality. Once the heap size is large enough, garbage
collection time reduces, and the locality of the mutator dominates
total performance so semi-space performs best.
To explain this tradeoff, we need to introduce and slightly ex-
pand memory management terminology. A tracing garbage collec-
tor performs allocation of new objects, identification of live ob-
jects, and reclamation of free memory. The canonical collectors all
identify live objects the same way, by marking objects during a
transitive closure over the object graph.
Reclamation strategy dictates allocation strategy, and the litera-
ture identifies just three strategies: (1) sweep-to-free-list, (2) evacu-
ation, and (3) compaction. For example, mark-sweep collectors al-
locate from a free list, mark live objects, and then sweep-to-free-list
1.3
1.4
1.5
dTime
MS
MC
SS
8
16
32
64
CTime(log)
MS
MC
SS
claim evaluation
Immix: A Mark-Region Garbage Collector with
Space Efficiency, Fast Collection, and Mutator Performance ⇤
Stephen M. Blackburn
Australian National University
Steve.Blackburn@anu.edu.au
Kathryn S. McKinley
The University of Texas at Austin
mckinley@cs.utexas.edu
Abstract
Programmers are increasingly choosing managed languages for
modern applications, which tend to allocate many short-to-medium
lived small objects. The garbage collector therefore directly deter-
mines program performance by making a classic space-time trade-
off that seeks to provide space efficiency, fast reclamation, and mu-
tator performance. The three canonical tracing garbage collectors:
semi-space, mark-sweep, and mark-compact each sacrifice one ob-
jective. This paper describes a collector family, called mark-region,
and introduces opportunistic defragmentation, which mixes copy-
ing and marking in a single pass. Combining both, we implement
immix, a novel high performance garbage collector that achieves all
three performance objectives. The key insight is to allocate and re-
claim memory in contiguous regions, at a coarse block grain when
possible and otherwise in groups of finer grain lines. We show
that immix outperforms existing canonical algorithms, improving
total application performance by 7 to 25% on average across 20
benchmarks. As the mature space in a generational collector, im-
mix matches or beats a highly tuned generational collector, e.g. it
improves jbb2000 by 5%. These innovations and the identification
of a new family of collectors open new opportunities for garbage
collector design.
efficiency, fast collection, and mutator performance through con-
tiguous allocation of contemporaneous objects.
Figure 1 starkly illustrates this dichotomy for full heap versions
of mark-sweep (MS), semi-space (SS), and mark-compact (MC)
implemented in MMTk [12], running on a Core 2 Duo. It plots
the geometric mean of total time, collection time, mutator time,
and mutator cache misses as a function of heap size, normalized
to the best, for 20 DaCapo, SPECjvm98, and SPECjbb2000 bench-
marks, and shows 99% confidence intervals. The crossing lines in
Figure 1(a) illustrate the classic space-time trade-off at the heart of
garbage collection. Mark-compact is uncompetitive in this setting
due to its overwhelming collection costs. In smaller heap sizes, the
space and collector efficiency of mark-sweep perform best since
the overheads of garbage collection dominate total performance.
Figures 1(c) and 1(d) show that the primary advantage for semi-
space is 10% better mutator time compared with mark-sweep, due
to better cache locality. Once the heap size is large enough, garbage
collection time reduces, and the locality of the mutator dominates
total performance so semi-space performs best.
To explain this tradeoff, we need to introduce and slightly ex-
pand memory management terminology. A tracing garbage collec-
tor performs allocation of new objects, identification of live ob-
jects, and reclamation of free memory. The canonical collectors all
General Terms Algorithms, Experimentation, Languages, Performance,
Measurement
Keywords Fragmentation, Free-List, Compact, Mark-Sweep, Semi-
Space, Mark-Region, Immix, Sweep-To-Region, Sweep-To-Free-List
1. Introduction
Modern applications are increasingly written in managed lan-
guages and make conflicting demands on their underlying mem-
ory managers. For example, real-time applications demand pause-
time guarantees, embedded systems demand space efficiency, and
servers demand high throughput. In seeking to satisfy these de-
mands, the literature includes reference counting collectors and
three canonical tracing collectors: semi-space, mark-sweep, and
mark-compact. These collectors are typically used as building
blocks for more sophisticated algorithms. Since reference count-
ing is incomplete, we omit it from further consideration here. Un-
fortunately, the tracing collectors each achieve only two of: space
⇤ This work is supported by ARC DP0666059, NSF CNS-0719966, NSF CCF-
0429859, NSF EIA-0303609, DARPA F33615-03-C-4106, Microsoft, Intel, and IBM.
Any opinions, findings and conclusions expressed herein are the authors’ and do not
necessarily reflect those of the sponsors.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
PLDI’08, June 7–13, 2008, Tucson, Arizona, USA.
Copyright c 2008 ACM 978-1-59593-860-2/08/06...$5.00
transitive closure over the object graph.
Reclamation strategy dictates allocation strategy, and the litera-
ture identifies just three strategies: (1) sweep-to-free-list, (2) evacu-
ation, and (3) compaction. For example, mark-sweep collectors al-
locate from a free list, mark live objects, and then sweep-to-free-list
1
1.1
1.2
1.3
1.4
1.5
1 2 3 4 5 6
NormalizedTime
MS
MC
SS
(a) Total Time
0.5
1
2
4
8
16
32
64
1 2 3 4 5 6
NormalizedGCTime(log)
MS
MC
SS
(b) Garbage Collection Time
1
1.05
1.1
1.15
1.2
1 2 3 4 5 6
MS
MC
SS
(c) Mutator Time
0.5
1
2
4
1 2 3 4 5 6
NormalizedMutatorPerf(log)
MS
MC
SS
(d) Mutator L2 Misses
Figure 1. Performance Tradeoffs For Canonical Collectors: Geo-
metric Mean for 20 DaCapo and SPEC Benchmarks.

evaluation claim?
You are doing it completely wrong!
sins of exposition
sins of reasoning
inscrutabilityirreproducability
ignorance inappropriateness inconsistency

ignorance
• claim ignores elements of the
evaluation
• may be benign
• not benign when the ignored elements
counter the claim
• common idioms
• ignoring data points
• ignoring data distribution
evaluation
claim

ignorance: ignoring data points [Georges et al 2007]
mental setup. In a
m in general, there
m that affect over-
on-determinism is
machine (VM) that
M compilation and
-determinism and
utions of the same
being taken and,
compiled and op-
n. Another source
cheduling in time-
mean w/ 95% confidence i
9.0
9.5
10.0
10.5
11.0
11.5
12.0
12.5
CopyMS
GenCopy
GenMS
MarkSweep
executiontime(s)
best of 30
9.0
9.5
10.0
10.5
11.0
11.5
12.0
12.5
CopyMS
GenCopy
GenMS
MarkSweep
SemiSpace
executiontime(s)
Figure 1. An example illustrating the pitfall of pre
Java performance data analysis methods: the ‘best’ m
a
ere
er-
is
hat
nd
nd
me
nd,
op-
ce
me-
mean w/ 95% confidence interval
9.0
9.5
10.0
10.5
11.0
11.5
12.0
12.5
CopyMS
GenCopy
GenMS
MarkSweep
SemiSpace
executiontime(s)
best of 30
9.0
9.5
10.0
10.5
11.0
11.5
12.0
12.5
CopyMS
GenCopy
GenMS
MarkSweep
SemiSpace
executiontime(s)
Figure 1. An example illustrating the pitfall of prevalent
Java performance data analysis methods: the ‘best’ method

ignorance: ignoring bimodality of a distribution

‘In our experience, while the sin of ignorance
seems obvious and easy to avoid, in reality it is
far from that. Many factors in the
evaluation that seem irrelevant to a claim
may actually be critical to the soundness
of the claim.
As a community we need to work towards
identifying these factors.’

inappropriateness
• claim extends beyond the evaluation
• may be benign
• assume laws of physics
• assume prior (cited) work
• not benign when extending beyond these
• common idioms
• inappropriate metrics (energy != performance)
• inappropriate use of independent variables (don’t control for space in GC eval)
• inappropriate use of tools (such as biased sampling)
evaluation
claim

A B
1
1.1
1.2
1.3
1.4
1.5
1 2 3 4 5 6
NormalizedTime
SemiSpace
MarkSweep
inappropriateness: misuse of independent variables (ignoring heap size)
[Blackburn et al 2008]
1
1.1
1.2
1.3
1.4
1.5
1 2 3 4 5 6
NormalizedTime
SemiSpace
MarkSweep

inappropriateness: inappropriate (biased) sampling: hotness of methods under four
profilers [Mytkowicz et al 2010]
hprof jprofile xprof yourkit
0
5
10
15
20
JavaParser.jj_scan_token
NodeIterator.getPositionFromParent
DefaultNameStep.evaluate
●
● ●
●
●
●
● ●● ●
●
●
percentofoverallexecution
Figure 1. Disagreement in the hottest method for benchmark pmd
across four popular Java proﬁlers.
B
a
b
c
fo
jy
lu
p
Ta
as
tim
3.1
We

‘In our experience, while the sin of
inappropriateness seems obvious and easy to
avoid, in reality it is far from that. Many
factors that may be unaccounted for in an
evaluation may actually be important to
deriving a sound claim.
As a community, we need to work towards
identifying these factors.’

inconsistency
• claim and evaluation are
about different things
• compare two incompatible things
in the evaluation, but claim ignores the incompatibility (apples v oranges)
• common idioms
• inconsistent measurement contexts (night | day, 32 | 64 machines)
• inconsistent metrics (wall clock | cycles, retired instructions | issued instructions)
• inconsistent workloads
• inconsistent data analysis
evaluation claim

inconsistency: time of day/week makes a large difference to workload [gmail]
Time
RequestsperSecond

‘In our experience, while the sin of
inappropriateness seems obvious and easy to
avoid, in reality it is far from that. Many
artifacts that seem comparable may actually
be inconsistent.
As a community we need to work towards
identifying these artifacts.’

inscrutability
• description of the claim is inadequate
• authors often fail to explicitly identify their claim
• three common forms:
• omission (no claim: when this happens, the work is literally meaningless)
• ambiguity (‘improved performance’: latency? throughput? …)
• distortion (‘improved by 10%’: the GC? the whole program? …)
• claim is synthesized by the authors, so in principle, inscrutability
should never occur
claim

irreproducability
• the description of the evaluation is inadequate
• three common forms:
• omission (space constraints, not realizing factor is relevant, confidentiality)
• ambiguity (imprecise language, lack of detail, missing units)
• distortion (unambiguous, but incorrect, such as missing units
evaluation

evaluation claim?
sins of exposition
sins of reasoning
inscrutabilityirreproducability
ignorance inappropriateness inconsistency

our hope: we’re moving up Phil Amour’s orders of ignorance…
OOI3: I don’t know something and I do not have a
process to find out that I don’t know it
OOI2: I don’t know something but I do have a
process to find out that I don’t know it

Rejected
Often rejected
Safe bet
Often rejected
Rare
Novelty
Qualityofevaluation

PLDI 2015: 27/33 artifacts accepted from 58 accepted papers

assert(scope_of_claim == scope_of_eval)

Questions?
With special thanks to my co-authors:
Amer Diwan, Matthias Hauswirth, Peter F. Sweeney
Jose ́ Nelson Amaral, Tim Brecht, Lubomr Bulej, Cliff Click, Lieven Eeckhout,
Sebastian Fischmeister, Daniel Frampton, Laurie J. Hendren, Michael Hind,
Antony L. Hosking, Richard E. Jones, Tomas Kalibera, Nathan Keynes,
Nathaniel Nystrom, and Andreas Zeller

The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (8)

Similar to The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations

Similar to The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations (20)

Recently uploaded

Recently uploaded (20)

The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations