Tracing versus Partial Evaluation: Which Meta-Compilation Approach is Better for Self-Optimizing Interpreters?

Stefan Marr, Stéphane Ducasse
OOPSLA, October 28, 2015
Work Done At

Disclaimer
I am currently funded by Oracle Labs
* Würthinger, T.; Wimmer, C.; Wöß, A.; Stadler, L.; Duboscq, G.; Humer, C.; Richards, G.; Simon, D. & Wolczko, M., One VM to Rule Them All, in Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, ACM.

Compare Concrete Systems
Truffle + Graal with Partial Evaluation
RPython with Meta-Tracing
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
Oracle Labs

Selecting A Case Study
• On both Systems
• Self-Optimizing AST Interpreter

Represents Large Group of Dynamic Languages
• Dynamically Typed (Smalltalk)
• Classes (and everything is an Object)
• Closures (lambdas)
• Non-local Returns (almost exceptions)
• Set of Benchmarks
http://som-st.github.io

SOM[MT] versus SOM[PE]
Meta-Tracing versus Partial Evaluation
[Figure: the same example AST (nodes: if, cnt := 0, cnt := cnt + 1) shown twice, once under Meta-Tracing and once under Partial Evaluation]
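A minimal sketch of how the two approaches on this slide differ, written for a toy statement language rather than for SOM. Everything here is an assumption made for illustration: the tuple-based AST, the names `evaluate`, `execute`, `run_trace`, `specialize`, and the constant 3 in the condition. The tracing half records only the path that was actually executed and protects it with a guard; the other half is a hand-written stand-in for partial evaluation, turning the whole AST (both branches) into a closure with the node dispatch resolved up front.

```python
# Hypothetical mini-language (not SOM). Expressions and statements are tuples:
#   ("lit", n) | ("var", name) | ("add", a, b) | ("eq", a, b)
#   ("set", name, expr) | ("if", cond, then_stmt, else_stmt)

def evaluate(node, env):
    """Evaluate a side-effect-free expression."""
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "var":
        return env[node[1]]
    if kind == "add":
        return evaluate(node[1], env) + evaluate(node[2], env)
    if kind == "eq":
        return evaluate(node[1], env) == evaluate(node[2], env)
    raise ValueError(f"unknown expression: {kind}")

def execute(node, env, trace=None):
    """Interpret a statement; if `trace` is a list, record the path taken."""
    kind = node[0]
    if kind == "set":
        env[node[1]] = evaluate(node[2], env)
        if trace is not None:
            trace.append(("set", node[1], node[2]))
    elif kind == "if":
        taken = evaluate(node[1], env)
        if trace is not None:
            # Only the taken branch is recorded, protected by a guard.
            trace.append(("guard", node[1], taken))
        execute(node[2] if taken else node[3], env, trace)
    else:
        raise ValueError(f"unknown statement: {kind}")

def run_trace(trace, env):
    """Replay a linear trace; return False and leave env untouched if a guard fails."""
    scratch = dict(env)
    for op in trace:
        if op[0] == "guard":
            if evaluate(op[1], scratch) != op[2]:
                return False  # a real tracing JIT would resume the interpreter here
        else:
            scratch[op[1]] = evaluate(op[2], scratch)
    env.update(scratch)
    return True

def specialize(node):
    """Stand-in for partial evaluation: resolve node dispatch once, keep all branches."""
    kind = node[0]
    if kind == "lit":
        value = node[1]
        return lambda env: value
    if kind == "var":
        name = node[1]
        return lambda env: env[name]
    if kind in ("add", "eq"):
        left, right = specialize(node[1]), specialize(node[2])
        if kind == "add":
            return lambda env: left(env) + right(env)
        return lambda env: left(env) == right(env)
    if kind == "set":
        name, expr = node[1], specialize(node[2])
        def run_set(env):
            env[name] = expr(env)
        return run_set
    if kind == "if":
        cond, then_stmt, else_stmt = (specialize(child) for child in node[1:])
        def run_if(env):
            (then_stmt if cond(env) else else_stmt)(env)
        return run_if
    raise ValueError(f"unknown node: {kind}")

# if cnt = 3 then cnt := 0 else cnt := cnt + 1   (the constant 3 is made up)
PROGRAM = ("if", ("eq", ("var", "cnt"), ("lit", 3)),
           ("set", "cnt", ("lit", 0)),
           ("set", "cnt", ("add", ("var", "cnt"), ("lit", 1))))
```

Recording one run with `execute(PROGRAM, env, trace)` yields a two-entry trace (a guard plus the increment) that `run_trace` can replay until the guard fails, whereas `specialize(PROGRAM)` returns a single function that handles either branch without falling back.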

WHICH APPROACH IS FASTER FAST?
minimal amount of engineering to get good performance

Peak Performance of Basic Interpreters
[Figure: two panels, Compiled SOM[MT] (SOM[MT] on RPython) and Compiled SOM[PE] (SOM[PE] on Truffle). Y-axis: runtime normalized to Java 8 (compiled or interpreted), lower is better, ticks at 10 and 100. Benchmarks: Bounce, BubbleSort, DeltaBlue, Fannkuch, GraphSearch, Json, Mandelbrot, NBody, PageRank, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers]
Minimal SOM[MT]: 5.5x slower (min. 1.6x, max. 14x)
Minimal SOM[PE]: 170x slower (min. 60x, max. 600x)

WHICH APPROACH IS THE FASTEST?
best peak performance

Which Self-Optimizations Should a
Language Implementer Add?
• Type-specialize variables
• Type-specialize object fields
• Type-specialize collection storage
• Lower control structures from library
• Lower common library operations
• Inline caching
• Inline primitive operations
• Cache globals
• …
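To make the idea behind these optimizations concrete, here is a minimal sketch of a self-optimizing AST node that type-specializes an addition by rewriting itself at run time. It is not taken from either SOM implementation: the classes `Node`, `Root`, `Literal`, `ReadVar`, `UninitializedAdd`, `IntAdd`, and `GenericAdd` are hypothetical names chosen for this illustration.

```python
class Node:
    """Base class: nodes know their parent so they can replace themselves."""
    parent = None

    def adopt(self, child):
        child.parent = self
        return child

    def replace_with(self, new_node):
        # Swap this node for new_node in whichever parent field holds it.
        for field, value in list(vars(self.parent).items()):
            if value is self:
                setattr(self.parent, field, new_node)
        new_node.parent = self.parent
        return new_node

class Root(Node):
    def __init__(self, body):
        self.body = self.adopt(body)

    def execute(self, env):
        return self.body.execute(env)

class Literal(Node):
    def __init__(self, value):
        self.value = value

    def execute(self, env):
        return self.value

class ReadVar(Node):
    def __init__(self, name):
        self.name = name

    def execute(self, env):
        return env[self.name]

class BinaryNode(Node):
    def __init__(self, left, right):
        self.left = self.adopt(left)
        self.right = self.adopt(right)

class GenericAdd(BinaryNode):
    """Fallback that accepts any operand types."""
    def execute(self, env):
        return self.left.execute(env) + self.right.execute(env)

class IntAdd(BinaryNode):
    """Specialized for ints; rewrites to GenericAdd if the assumption breaks."""
    def execute(self, env):
        l, r = self.left.execute(env), self.right.execute(env)
        if type(l) is not int or type(r) is not int:
            self.replace_with(GenericAdd(self.left, self.right))
        return l + r

class UninitializedAdd(BinaryNode):
    """First execution observes the operand types and picks a specialization."""
    def execute(self, env):
        l, r = self.left.execute(env), self.right.execute(env)
        if type(l) is int and type(r) is int:
            self.replace_with(IntAdd(self.left, self.right))
        else:
            self.replace_with(GenericAdd(self.left, self.right))
        return l + r

root = Root(UninitializedAdd(ReadVar("a"), Literal(1)))
root.execute({"a": 1})            # body is rewritten to IntAdd
root.execute({"a": 1.5})          # assumption fails, body becomes GenericAdd
```

In Python the specialized node gains nothing by itself; the point of the sketch is only the rewriting pattern, which gives a meta-compiler a tree whose nodes encode the types observed so far.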

Peak Performance of Optimized Interpreter
[Figure: two panels, Compiled SOM[MT] and Compiled SOM[PE]. Y-axis: runtime normalized to Java 8, lower is better, ticks at 1, 4, 8, 12. Benchmarks: Bounce, BubbleSort, DeltaBlue, Fannkuch, GraphSearch, Json, Mandelbrot, NBody, PageRank, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers]
Optimized SOM[MT]: 3x slower (min. 1.5x, max. 11x), 2.4x speedup
Optimized SOM[PE]: 2.3x slower (min. 4%, max. 4.9x), 80x speedup

Optimization Impact on SOMPE
[Figure: speedup factor per optimization (higher is better, logarithmic scale, axis from 0.85 to 12.00). Optimizations shown: lower control structures, inline caching, cache globals, typed fields, lower common ops, array strategies, inline basic ops., typed vars, opt. local vars, baseline, min. escaping closures, typed args, catch-return nodes]

Implementation Sizes
RPython: From Minimal to Optimized, +57% LOC (from 3,455 LOC to 5,414 LOC)
Truffle: From Minimal to Optimized, +103% LOC (from 5,424 LOC to 11,037 LOC)
The Way I write Python
The Way I write Java
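The percentages on this slide follow directly from the line counts; the short check below recomputes them. The helper name `loc_growth_percent` is made up for this sketch.

```python
def loc_growth_percent(minimal_loc, optimized_loc):
    """Relative growth in lines of code, in percent."""
    return (optimized_loc - minimal_loc) / minimal_loc * 100

rpython_growth = loc_growth_percent(3455, 5414)   # about 56.7
truffle_growth = loc_growth_percent(5424, 11037)  # about 103.5

print(f"RPython: +{rpython_growth:.0f}% LOC")  # prints "RPython: +57% LOC"
print(f"Truffle: +{truffle_growth:.0f}% LOC")  # prints "Truffle: +103% LOC"
```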

WHICH APPROACH GIVES BETTER
STARTUP PERFORMANCE?
Considering the User-Perceived System Performance

Measuring “Whole Program” Runtime
• Process Start to Finish
• Overall Wall-clock time
• Normalized to Java
[Figure: Wall-Clock Behavior for Various Run Lengths: Aggregation over all Benchmarks. X-axis: iterations of benchmark in same process (ticks 0, 200, 400, 600). Y-axis: factor over Java for x iterations, i.e. the geometric mean of the wall-clock time for x iterations divided by the corresponding Java result (ticks 4, 8, 12, 16). Series: Java, SOM[MT] (legend: RTruffleSOM-jit-ex), SOM[PE] (legend: TruffleSOM-graal-n). Annotations: 8sec, 25sec, 46sec]
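The metric plotted here, a geometric mean of per-benchmark wall-clock times divided by the corresponding Java result, can be sketched as follows. The function names and the timing values are placeholders invented for this example; they are not measurements from the talk.

```python
import math

def geometric_mean(values):
    """Geometric mean, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def factor_over_java(vm_seconds, java_seconds):
    """Aggregate slowdown of a VM relative to Java across benchmarks.

    Both arguments map benchmark name -> wall-clock seconds for the same
    number of iterations, measured from process start to finish.
    """
    ratios = [vm_seconds[name] / java_seconds[name] for name in java_seconds]
    return geometric_mean(ratios)

# Placeholder timings (hypothetical, for illustration only).
java_seconds = {"Bounce": 2.0, "Sieve": 1.0}
vm_seconds = {"Bounce": 8.0, "Sieve": 1.0}

factor = factor_over_java(vm_seconds, java_seconds)  # ratios 4.0 and 1.0
```

A geometric mean is the usual choice for averaging normalized ratios, since a 4x slowdown on one benchmark and a 4x speedup on another then cancel to a factor of 1.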

Tracing vs. Partial Evaluation
• Peak performance seems similar
– No indications of conceptual limitations
• Startup Performance
– Unclear, tiered compilation?
• But, tracing is faster fast!
– Requires fewer optimizations
– Better ‘prototype’ performance

