Ruby3x3: How are we going to measure 3x?
Matthew Gaudet
Developer on the Eclipse OMR project
Cross-platform components for building reliable, high performance language runtimes
github.com/eclipse/omr
@eclipseOMR
Ruby 3x3: The Goal
[Chart: Performance of Ruby 2.0 vs. a target Ruby 3.0 at 3x.]
Agenda
Let’s talk about benchmarking!
• Some definitions
• Some philosophy
• Some pitfalls
Ruby 3x3
• Some thoughts from me.
Benchmarking.
Art + Science
Definition
Benchmark:
A piece of computer code run in
order to gather measurements
for comparison.
Definition
Benchmark, by example:
• Comparing the execution time of different interpreters, or options.
• Comparing the execution time of algorithms.
• Comparing the accuracy of different machine learning algorithms.
The Art of Benchmarking: What do you run?
The Benchmark Continuum: Microbenchmark → Application Kernel → Full Application
Microbenchmarks
A very small program written to explore the performance of one aspect of the system under test.
Pros
• Often easy to set up and run.
• Targeted to a particular aspect.
• Fast acquisition of data.
Cons
• Exaggerates effects.
• Not typically generalizable.
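For instance, a minimal microbenchmark sketch in Ruby using the standard benchmark library. The workload (two ways of turning an Integer into a String) is invented for illustration:

require 'benchmark'

# A microbenchmark targets one tiny aspect of the system:
# here, Integer#to_s vs. String#% formatting.
N = 1_000_000

Benchmark.bm(12) do |bm|
  bm.report('Integer#to_s') { N.times { |i| i.to_s } }
  bm.report('format')       { N.times { |i| '%d' % i } }
end

Note how easy this is to set up and how narrowly it targets one aspect: exactly the pros (and cons) listed above.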
Full Applications
Benchmarking a whole application.
Pros
• Immediate and obvious real-world impact!
Cons
• Small effects can be swamped in natural application variance.
• Can be complicated to set up, or slow to run!
Application Kernel
A particular part of an application extracted for the express purpose of constructing a benchmark.
Pros
• Tight connection to real-world code.
• Typically more generalizable.
Cons
• Difficult to know how much of an application should be included vs. mocked.
Pitfalls in benchmark design
Un-Ruby-like code: code that looks like another language.
“You can write FORTRAN in any language.”
• Code that never produces garbage.
• Code without exceptions.
Pitfalls in benchmark design
Input data is a key part of many benchmarks: watch out for weird input data!
• Imagine an MP3 compressor benchmark whose inputs are:
1. Silence: weird, because most MP3s are not silence.
2. White noise: weird, because most MP3s have some structure.
• Either input reduces the generalizability of the results!
The Art of Benchmarking: What do you run?
What do you measure?
Time?
Throughput?
Latency?
Definition
Wall-clock time:
The measurement of elapsed time relative to a clock independent of the process being timed.
$ time sleep 1
real 0m1.003s
user 0m0.000s
sys 0m0.000s
Definition
CPU time:
The measurement of how much CPU the process actually used (user + sys).
$ time sleep 1
real 0m1.003s
user 0m0.000s
sys 0m0.000s
Definition
Throughput:
A count of operations that occur
per unit of time.
Definition
Latency:
The time it takes for a response to occur after a stimulus.
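As a hedged sketch of both metrics, assuming a hypothetical work method standing in for the operation under test:

require 'benchmark'

# work() is an invented stand-in for the operation being measured.
def work
  (1..100).reduce(:+)
end

ops = 100_000
elapsed = Benchmark.realtime { ops.times { work } }

puts format('throughput:   %d ops/s', ops / elapsed)
puts format('mean latency: %.2f µs/op', elapsed / ops * 1_000_000)

(For a web server, latency would be how long a request takes to be processed after it is received.)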
The Art of Benchmarking: What do you run?
What do you measure?
What do you report?
Raw Measurements?
Speedup?
Definition
Speedup:
A ratio computed between a
baseline and experimental time
measurement.
Speedup = T_baseline / T_experimental
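As a worked example (numbers invented for illustration): a baseline time of 12 s against an experimental time of 4 s gives a speedup of 12 / 4 = 3x; a ratio below 1.0 is a slowdown.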
The Science of Benchmarking
An aside on misleading with speedup.
“He who controls the baseline
controls the speedup”
An aside on misleading with speedup.
“Our parallelization system shows
linear speedup as the number of
threads increases”
An aside on misleading with speedup.
[Chart: reported speedup rising roughly linearly: 1x at 1 thread, 2x at 2 threads, 4x at 4 threads, 8x at 8 threads.]
An aside on misleading with speedup.
Measurement                    Time (s)
Original Sequential Program    10.0
Parallelized, 1 thread         100.0
Parallelized, 2 threads        50.0
Parallelized, 4 threads        25.0
Parallelized, 8 threads        12.5
This is the distinction between relative speedup (measured against the 1-thread parallel run) and absolute speedup (measured against the best sequential version).
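Working the numbers from the table: relative speedup at 8 threads is 100.0 / 12.5 = 8x, which looks linear; absolute speedup is 10.0 / 12.5 = 0.8x, meaning the “8x faster” parallel version is still slower than the original sequential program.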
How you measure affects what
you measure.
Both of these are valid benchmarks!

$ cat test.rb
...
puts Benchmark.measure {
  1_000_000.times {
    compute_foo()
  }
}
$ for i in `seq 1 10`; do ruby test.rb; done

vs.

...
10.times {
  puts Benchmark.measure {
    1_000_000.times {
      compute_foo()
    }
  }
}

But they’re going to measure (and may encourage the optimization of) two different things! The first includes process startup in every sample, and so rewards fast startup; the second measures ten in-process iterations, and so rewards warmed-up peak performance.
Definition
Warmup:
The time from application start until it hits peak performance.
[Chart: Time per Iteration (s) over 11 iterations: 100, 64, 69, 36, 25, 30, 25, 26, 25, 26, 25. Early iterations are slow; times settle near 25 s.]
When has warmup finished? It is hard to say precisely. Despite this, even knowing warmup exists is important: it allows us to choose methodologies that can accommodate the possibility!
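A minimal sketch of a harness that exposes warmup, assuming a hypothetical compute_foo workload: print a score per iteration, and watch for when the numbers stabilize.

require 'benchmark'

# compute_foo is an invented stand-in for the code under test.
def compute_foo
  Array.new(1_000) { |i| i * i }.sum
end

# Report every iteration separately: warmup shows up as the first
# few scores being slower than the eventual steady state.
20.times do |i|
  t = Benchmark.realtime { 10_000.times { compute_foo } }
  puts format('iteration %2d: %.3f s', i + 1, t)
end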
Definition
Run-to-run variance:
The observed effect that identical runs do not have identical times.
$ for i in `seq 1 5`; do ruby -I../../lib/ string-equal.rb --loopn 1 1000; done
1.347334558
1.348350632
1.30690478
1.314764977
1.323862345
Methodology:
An incomplete list of decisions that need to be made when
developing benchmarking methodology:
1. Does your methodology account for warmup?
2. How are you accounting for run-to-run variance?
3. How are you accounting for the effects of the garbage
collector?
Pitfalls in benchmark design
Accounting for warmup often means producing
intermediate scores, so you can see when they stabilize.
If you aren’t accounting for warmup, you may find
that you miss out on peak performance.
Pitfalls in benchmark design
Account for run to run variance by running multiple times,
and presenting confidence intervals!
Be sure your methodology doesn’t encourage wild variations in performance, though!
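One hedged way to do that, using the timings from the run-to-run variance slide above (the normal-approximation 95% interval is a common choice, not the only defensible one):

times = [1.347334558, 1.348350632, 1.30690478, 1.314764977, 1.323862345]

n    = times.size
mean = times.sum / n
var  = times.map { |t| (t - mean)**2 }.sum / (n - 1)  # sample variance
ci   = 1.96 * Math.sqrt(var / n)                      # 95% CI half-width

puts format('%.3f s ± %.3f s (95%% CI, n=%d)', mean, ci, n)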
Be aware: benchmarks can act weird!
Garbage Collector Impact

$ ruby -J-Xmx330m -J-Xms330m -I../../lib/ connected.rb --loopn 10 1
0.426412300002994
0.35442964400863275
0.3484781830047723
0.36281039800087456
0.3565745719970437
0.36179181998886634
0.31713732800562866
0.3365019329939969
0.305397536008968
0.3006619710067753

$ ruby -J-Xmx33m -J-Xms33m -I../../lib/ connected.rb --loopn 10 1
0.5431441880064085
0.8410410610085819
0.7975159170018742
0.8458756269974401
0.9974212259985507
1.0887025539996102
1.067053010003292
1.057003531997907
1.0708161939983256
1.0480617069988512

(Roughly a 3x degradation from shrinking the heap 10x.)
Garbage Collector Impact
Garbage collector impact can make benchmarks incredibly difficult to compare:
• The Ruby+OMR Preview uses the OMR GC technology, including a change to move off-heap data on-heap.
• A side effect is that it’s crazy difficult to compare against the default ruby: there’s an entirely different set of data on the heap!
If heap size adapts to machine memory, you’ll need to figure out how to lock it down to give good comparisons across machines.
[Diagram: a String with a malloc’d buffer vs. a String with an on-heap OMRBuffer.]
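On CRuby, one way to lock the heap down is via the GC’s environment variables; a sketch (the slot count is invented for illustration, and the -J flags above are not CRuby flags):

# Pin the initial heap so differently sized machines start from a
# comparable GC state. (600000 is illustrative, not a recommendation.)
$ RUBY_GC_HEAP_INIT_SLOTS=600000 ruby -I../../lib/ connected.rb --loopn 10 1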
Benchmarking: User Error
$ time ruby their_implementation.rb 100000
real 0m10.003s
user 0m08.001s
sys 0m02.007s
$ time ruby my_implementation.rb 10000
real 0m1.003s
user 0m0.801s
sys 0m0.206s
10x speedup!
(Look closely: the two runs used different input sizes, 100000 vs. 10000. The “10x speedup” is user error.)
Pro Tip: Use a harness that
keeps you out of the
benchmarking process.
Aim for reproducibility!
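One example of such a harness is the benchmark-ips gem (assuming it is installed, e.g. gem install benchmark-ips); the method bodies below are invented stand-ins for the two scripts above:

require 'benchmark/ips'

# Invented stand-ins for the two implementations on the slide.
def their_implementation(n)
  n.times { |i| i.to_s }
end

def my_implementation(n)
  n.times { |i| i.to_s(16) }
end

# The harness owns warmup and the measurement windows, keeping the
# human out of the benchmarking loop, and it reports error margins.
Benchmark.ips do |x|
  x.report('theirs') { their_implementation(100_000) }
  x.report('mine')   { my_implementation(100_000) }
  x.compare!  # relative comparison, with deviations
end

Crucially, both reports run the same input size, so the user error above cannot happen.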
[Chart: Time (s) per iteration. Times spike when the laptop is unplugged and power-saving mode kicks in, then recover when power returns.]
Other Hardware Effects to watch for!
TurboBoost (and similar): frequency scaling based on… the season… the location… the rack… the CPU temperature.
Even in the cloud! [1]
[1]: http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html
Software Pitfalls
What about your backup service?
Long sequence of benchmarks… do you have
automatic software updates installed?
Do your system administrators know you are
benchmarking?
What about your screensaver?
Paranoia is a matter of effect sizes.
• Hardware changes:
– Disable turbo boost.
– Disable hyperthreading.
• Krun tool:
– Set ulimit for heap and stack.
– Reboot the machine before execution.
– Monitor dmesg for unexpected output.
– Monitor the temperature of the machine.
– Disable p-states.
– Set the CPU governor to performance mode.
– Control the perf sample rate.
– Disable ASLR.
– Create a new user account for each run.
http://arxiv.org/pdf/1602.00602v1.pdf
Performance improvements compound!
3x is 10 increases of 11%,
or 25 increases of 4.5%,
or 100 increases of 1.1%.
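Checking the arithmetic: the per-step gain needed to compound to 3x over n steps is 3^(1/n) − 1 (so the “11%” is really about 11.6%, rounded on the slide):

# Per-step gain needed to compound to an overall 3x over n steps.
[10, 25, 100].each do |n|
  gain = (3**(1.0 / n) - 1) * 100
  puts format('%3d steps of %.1f%% each compound to 3x', n, gain)
end
# => 10 steps of 11.6%, 25 of 4.5%, 100 of 1.1%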
Ruby 3x3: The Process
[Chart: Performance rising release by release from Ruby 2.0 through the 2.x series to Ruby 3.0. Made-up data, for illustration only.]
Philosophizing
Philosophy
Benchmarks drive change.
– What you measure is what people try to change.
– What you don’t measure may not change how you want.
Squeezing a Water Balloon
Be sure to measure associated metrics to have a clear-headed view of tradeoffs.
For example, JIT compilation:
– Trades startup speed for peak speed.
– Trades footprint for speed.
Benchmarks age!
Benchmarks can be wrung of all their possible
performance at some point.
 Using the same benchmarks for too long can lead to
shortsighted decisions driven by old benchmarks.
Idiomatic code evolves in a language.
 Benchmark use of language features can help drive
adoption!
–Be sure to benchmark desirable new language features!
Benchmarking 3x3
https://twitter.com/tenderlove/status/765288219931881472
The Ruby community has some great starting points!
Recall: Benchmarks drive change
Thought: Choose 9 application kernels that
represent what we want from a future CRuby!
• Why 9?
• Too many benchmarks can diffuse effort.
• Also! 3x3 = 9!
¯\_(ツ)_/¯
Brainstorming on the nine?
1. Some CPU-intensive applications:
• OptCarrot, neural nets, Monte Carlo tree search, a PSD filter pipeline?
2. Some memory-intensive application:
• A large tree mutation benchmark?
3. A startup benchmark:
• time ruby -e "def foo; '100'; end; puts foo"?
4. Some web application framework benchmarks.
Choose a methodology that drives the change we want in CRuby.
Want great performance, but not huge warmup times?
– Only run 5 iterations, and score the last one? (A sketch of this follows below.)
Don’t want to deal with warmup?
– Don’t run iterations: score the first run!
I ♥ Error Bars
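A minimal sketch of the “score the last of 5 iterations” option, with an invented run_benchmark standing in for the real workload:

require 'benchmark'

ITERATIONS = 5

# run_benchmark is an invented stand-in for the workload under test.
def run_benchmark
  Array.new(50_000) { |i| i.to_s }.join.length
end

# Warmup is tolerated, but only up to four iterations of it:
# the score is the fifth iteration alone.
times = ITERATIONS.times.map { Benchmark.realtime { run_benchmark } }
puts format('score: %.3f s (last of %d iterations)', times.last, ITERATIONS)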
One last idea…
What about a more ambitious
choice?
Use the ecosystem!
Add a standard performance harness to RubyGems.
 Would allow VM developers to sample popular gems, and
run a perf suite written by gem authors.
 With effort, time and $$$, we could make broad statements
about performance impact on the gem ecosystem.
Use the ecosystem!
This doesn’t just help VM developers. Gem authors get:
1. Performance tracking, set up for them!
2. Easier performance reporting with VM developers.
Credits
Headache: https://en.wikipedia.org/wiki/Headache#/media/File:Cruikshank_-_The_Head_Ache.png
@MattStudies
magaudet@ca.ibm.com
For more on software systems evaluation, be sure to visit
The Evaluate Collaboratory @
http://evaluate.inf.usi.ch/
Editor's Notes
  1. OMR is a project trying to create reusable components for building or augmenting language runtimes. Should be some news soon, so follow us on twitter. Please, come talk to me about OMR! But, I’m not here to talk about OMR right now.
  2. That purple circle hides a big concept! Let’s dig into it.
  3. Benchmarking is this weird combination of art and science, that drives me mad. The problem is that benchmarks seem so objective and scientific, but are filled with judgement calls, and the science is hard!
  4. The art of benchmarking ends up being a long list of questions and decisions you have to ask yourself, filled with judgement calls. First off, what do you run?
  5. Sometimes this involves mocking up parts of the normal application flow in such a way as to keep the code isolated.
  6. Imagine how this perturbs the code paths that your interpreter is going to take.
  7. The art of benchmarking ends up being a long list of questions and decisions you have to ask yourself, filled with judgement calls. First off, what do you run?
  8. Lots of questions have to be asked when you are benchmarking. This is equally true of both application developers and those who are developing language runtimes!
  9. Often when we’
  10. CPU time can be pretty misleading in a lot of circumstances: Notice that sleep used almost no CPU time, because it didn’t do anything! But it spent a long time running! Can be important though if you’re on a platform that charges by CPU usage!
  12. For example, in a web server, latency would be how long it takes a request to be processed after the request is received.
  13. The art of benchmarking ends up being a long list of questions and decisions you have to ask yourself, filled with judgement calls. First off, what do you run?
  14. Lots of questions have to be asked when you are benchmarking. This is equally true of both application developers and those who are developing language runtimes!
  15. Typically, speedup is talking about a measurement on the same machine with a software change of some kind, though one can also compute speedups by changing hardware.
  16. Typically, speedup is talking about a measurement on the same machine with a software change of some kind.
  17. I used to be an academic, and I learned while I was there that it’s terribly easy to lie with speedup.
  18. Typically, speedup is talking about a measurement on the same machine with a software change of some kind, though one can also compute speedups by changing hardware.
  19. To abuse a quote from Dune,
  22. You’ll note even at 8 threads, the parallel program is slower than the original. Relative: Relative to 1 thread Absolute: Relative to the fastest sequential version!
  23. This point isn’t obvious to everyone.
  24. The first will try to encourage faster startup – if compute foo runs quickly, startup costs will dominate the run on the left side.
  25. Warmup can occur as code loading is happening, caches are warmed up, JIT compilation occurs, etc. Warmup is a really awkward term: many people understand what you mean, but it’s not got a great scientific definition.
  26. Warmup can occur as code loading is happening, caches are warmed up, JIT compilation occurs, operating system thread scheduling
  27. Reporting the minimum time for example.
  28. When trying to measure performance be aware that benchmarks can act weird! You’ll have to report with a methodology that can handle it!
  29. 3x degradation of performance by having too small a heap.
  30. Imagine you do your benchmark baseline on your couch at home, but then you get to work and find your change has made everything 3x faster!
  31. You benchmark 10 rubies…
  32. But we would like to be able to measure small changes….
  33. Faster code can come at the cost of increased warmup time, increased footprint, etc.
  34. Just because you’re the fastest C89 compiler today doesn’t matter if people are writing C11 code that looks different!
  35. At this point, we go to the wise tenderlove, who reminds us!
  36. Please… whatever you do though, account for some variance.
  37. I wanted to leave having some brainstorming