Tracing versus Partial Evaluation: Which Meta-Compilation Approach is Better for Self-Optimizing Interpreters?
Stefan Marr, Stéphane Ducasse
OOPSLA, October 28, 2015
Work Done At
Disclaimer
I am currently funded by Oracle Labs.
* Würthinger, T.; Wimmer, C.; Wöß, A.; Stadler, L.; Duboscq, G.; Humer, C.; Richards, G.; Simon, D. & Wolczko, M., One VM to Rule Them All, in Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, ACM.
Compare Concrete Systems
• Truffle + Graal with Partial Evaluation (Oracle Labs)
• RPython with Meta-Tracing
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
Selecting A Case Study
• On both Systems
• Self-Optimizing AST Interpreter
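As a rough illustration of what "self-optimizing" means here, the sketch below shows an AST node that rewrites itself after observing its operand types. All class and method names (`Node`, `adopt`, `replace_with`, `UninitializedAdd`, `IntAdd`, `GenericAdd`) are hypothetical and chosen for this example; they are not taken from SOM, Truffle, or RPython.

```python
class Node:
    """Base class: every node knows its parent and the slot it occupies."""

    def __init__(self):
        self.parent = None
        self.slot = None

    def adopt(self, slot, child):
        # Install `child` in the named slot and record the back-pointer.
        child.parent = self
        child.slot = slot
        setattr(self, slot, child)
        return child

    def replace_with(self, new_node):
        # Self-optimization step: swap this node for another one in the tree.
        return self.parent.adopt(self.slot, new_node)


class Literal(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value

    def execute(self):
        return self.value


class Root(Node):
    def __init__(self, body):
        super().__init__()
        self.adopt("body", body)

    def execute(self):
        return self.body.execute()


class AddNode(Node):
    def __init__(self, left, right):
        super().__init__()
        self.adopt("left", left)
        self.adopt("right", right)


class UninitializedAdd(AddNode):
    def execute(self):
        left, right = self.left.execute(), self.right.execute()
        # First execution: pick a specialization based on observed types.
        if type(left) is int and type(right) is int:
            self.replace_with(IntAdd(self.left, self.right))
        else:
            self.replace_with(GenericAdd(self.left, self.right))
        return left + right


class IntAdd(AddNode):
    def execute(self):
        left, right = self.left.execute(), self.right.execute()
        if not (type(left) is int and type(right) is int):
            # Speculation failed: fall back to the generic node.
            self.replace_with(GenericAdd(self.left, self.right))
        return left + right


class GenericAdd(AddNode):
    def execute(self):
        return self.left.execute() + self.right.execute()


root = Root(UninitializedAdd(Literal(1), Literal(2)))
first = root.execute()
after_first = type(root.body).__name__
root.body.adopt("left", Literal(1.5))  # change the operand type
second = root.execute()
after_second = type(root.body).__name__
```

In this toy, the tree itself carries the speculation: after the first run the add node is an `IntAdd`, and it degrades to `GenericAdd` once a non-integer operand shows up.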
Represents Large Group of Dynamic Languages
• Dynamically Typed (Smalltalk)
• Classes (and everything is an Object)
• Closures (lambdas)
• Non-local Returns (almost exceptions)
• Set of Benchmarks
http://som-st.github.io
SOM[MT] versus SOM[PE]
Meta-Tracing (SOM[MT]) and Partial Evaluation (SOM[PE])
[Figure: the same example AST shown once per approach; node labels: if, cnt :=, +, cnt, 1, cnt :=, 0]
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
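The difference in compilation unit can be hinted at with a toy, assuming a tiny tuple-based AST. The program below is hypothetical and only loosely modelled on the node labels of the slide; `interpret` and `specialize` are names invented for this sketch. Real meta-tracing traces the interpreter's own execution, and real partial evaluation works on compiler IR, so this only mimics the shape of the result: a linear path with a guard versus a specialized whole tree.

```python
def interpret(node, env, trace):
    """Run the AST and record only the operations on the executed path."""
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "var":
        trace.append(("read", node[1]))
        return env[node[1]]
    if kind in ("add", "eq"):
        left = interpret(node[1], env, trace)
        right = interpret(node[2], env, trace)
        trace.append((kind,))
        return left + right if kind == "add" else left == right
    if kind == "assign":
        value = interpret(node[2], env, trace)
        env[node[1]] = value
        trace.append(("write", node[1]))
        return value
    if kind == "if":
        taken = interpret(node[1], env, trace)
        trace.append(("guard", taken))  # a trace follows one branch only
        return interpret(node[2] if taken else node[3], env, trace)
    raise ValueError(kind)


def specialize(node):
    """Turn the whole AST, both branches included, into nested closures."""
    kind = node[0]
    if kind == "lit":
        value = node[1]
        return lambda env: value
    if kind == "var":
        name = node[1]
        return lambda env: env[name]
    if kind == "add":
        left, right = specialize(node[1]), specialize(node[2])
        return lambda env: left(env) + right(env)
    if kind == "eq":
        left, right = specialize(node[1]), specialize(node[2])
        return lambda env: left(env) == right(env)
    if kind == "assign":
        name, expr = node[1], specialize(node[2])

        def assign(env):
            env[name] = expr(env)
            return env[name]

        return assign
    if kind == "if":
        cond, then, orelse = (specialize(child) for child in node[1:])
        return lambda env: then(env) if cond(env) else orelse(env)
    raise ValueError(kind)


# Hypothetical program: if cnt = 0 then cnt := cnt + 1 else cnt := 0
program = ("if", ("eq", ("var", "cnt"), ("lit", 0)),
           ("assign", "cnt", ("add", ("var", "cnt"), ("lit", 1))),
           ("assign", "cnt", ("lit", 0)))

trace = []
traced_env = {"cnt": 0}
interpret(program, traced_env, trace)

compiled = specialize(program)
other_env = {"cnt": 7}
compiled_result = compiled(other_env)
```

The recorded `trace` contains only the taken branch plus a guard, while `compiled` covers both branches of the `if`.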
WHICH APPROACH IS FASTER FAST?
minimal amount of engineering to get good performance
Peak Performance of Basic Interpreters
[Figure: per-benchmark runtime normalized to Java 8 (lower is better; axis ticks at 10 and 100), compiled SOM[MT] on RPython versus compiled SOM[PE] on Truffle. Benchmarks: Bounce, BubbleSort, DeltaBlue, Fannkuch, GraphSearch, Json, Mandelbrot, NBody, PageRank, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers.]
• Minimal SOM[MT]: 5.5x slower (min. 1.6x, max. 14x)
• Minimal SOM[PE]: 170x slower (min. 60x, max. 600x)
WHICH APPROACH IS THE FASTEST?
best peak performance
Which Self-Optimizations Should a
Language Implementer Add?
• Type-specialize variables
• Type-specialize object fields
• Type-specialize collection storage
• Lower control structures from library
• Lower common library operations
• Inline caching
• Inline primitive operations
• Cache globals
• …
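To make one item of this list concrete, here is a minimal sketch of inline caching at a message-send site. It assumes plain Python classes as stand-ins for SOM classes; `lookup` and `CachedSend` are hypothetical names for this example, and the cache is monomorphic (one class, one method), which is the simplest variant.

```python
def lookup(cls, selector):
    """Slow path: walk the class hierarchy to find the method."""
    for klass in cls.__mro__:
        if selector in vars(klass):
            return vars(klass)[selector]
    raise AttributeError(selector)


class CachedSend:
    """A send site that remembers the last receiver class and its method."""

    def __init__(self, selector):
        self.selector = selector
        self.cached_class = None
        self.cached_method = None
        self.misses = 0

    def execute(self, receiver, *args):
        cls = type(receiver)
        if cls is not self.cached_class:
            # Cache miss: do the full lookup once and remember the result.
            self.misses += 1
            self.cached_method = lookup(cls, self.selector)
            self.cached_class = cls
        return self.cached_method(receiver, *args)


class Animal:
    def speak(self):
        return "..."


class Dog(Animal):
    def speak(self):
        return "woof"


class Cat(Animal):
    pass  # inherits speak from Animal


send = CachedSend("speak")
results = [send.execute(Dog()), send.execute(Dog()), send.execute(Cat())]
```

The second `Dog` send hits the cache; the `Cat` send misses and replaces the cached entry. A polymorphic cache would keep several entries instead.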
Peak Performance of Optimized Interpreter
[Figure: per-benchmark runtime normalized to Java 8 (lower is better; axis ticks at 1, 4, 8, 12), compiled SOM[MT] on RPython versus compiled SOM[PE] on Truffle. Benchmarks: Bounce, BubbleSort, DeltaBlue, Fannkuch, GraphSearch, Json, Mandelbrot, NBody, PageRank, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers.]
• Optimized SOM[MT]: 3x slower (min. 1.5x, max. 11x); 2.4x speedup
• Optimized SOM[PE]: 2.3x slower (min. 4%, max. 4.9x); 80x speedup
Optimization Impact on SOM[PE]
[Figure: speedup factor per optimization (higher is better, logarithmic scale, axis from 0.85 to 12.00). Optimizations shown: lower control structures, inline caching, cache globals, typed fields, lower common ops, array strategies, inline basic ops., typed vars, opt. local vars, baseline, min. escaping closures, typed args, catch-return nodes.]
Implementation Sizes
• RPython, from Minimal to Optimized: +57% LOC (from 3,455 LOC to 5,414 LOC)
• Truffle, from Minimal to Optimized: +103% LOC (from 5,424 LOC to 11,037 LOC)
The Way I write Python
The Way I write Java
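The growth percentages can be re-derived from the LOC counts on the slide. The helper name `growth_percent` is made up for this check.

```python
def growth_percent(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100


rpython_growth = growth_percent(3455, 5414)   # RPython: minimal -> optimized
truffle_growth = growth_percent(5424, 11037)  # Truffle: minimal -> optimized
```

Rounded to whole percent, these give the +57% and +103% shown above.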
WHICH APPROACH GIVES BETTER
STARTUP PERFORMANCE?
Considering the User-Perceived System Performance
Measuring “Whole Program” Runtime
• Process Start to Finish
• Overall Wall-clock time
• Normalized to Java
[Figure: Wall-Clock Behavior for Various Run Lengths: Aggregation over all Benchmarks. x-axis: iterations of benchmark in same process (0 to 600); y-axis: factor over Java for x iterations (ticks at 4, 8, 12, 16), computed as the geometric mean of the wall-clock time for x iterations divided by the corresponding Java result. Legend: Java, RTruffleSOM-jit-ex, TruffleSOM-graal-n (labelled Java, SOM[MT], SOM[PE]). Annotations: 8sec, 25sec, 46sec.]
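The aggregation used for this kind of plot, a geometric mean of per-benchmark wall-clock times divided by the corresponding Java result, can be sketched as follows. The timings are invented for illustration and are not measurements from the talk; `geomean` and `factor_over_java` are hypothetical helper names.

```python
import math


def geomean(values):
    """Geometric mean, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def factor_over_java(vm_times, java_times):
    """Geometric mean of per-benchmark time ratios relative to Java."""
    return geomean(vm_times[bench] / java_times[bench] for bench in java_times)


# Made-up wall-clock times in milliseconds, one entry per benchmark.
java_ms = {"Bounce": 100.0, "Sieve": 50.0}
som_ms = {"Bounce": 400.0, "Sieve": 50.0}

factor = factor_over_java(som_ms, java_ms)
```

With ratios of 4 and 1, the factor is 2. The geometric mean is the usual choice for ratios because it treats a 2x slowdown and a 2x speedup symmetrically.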
CONCLUSIONS
Tracing vs. Partial Evaluation
• Peak performance seems similar
– No indications of conceptual limitations
• Startup Performance
– Unclear, tiered compilation?
• But, tracing is faster fast!
– Requires fewer optimizations
– Better ‘prototype’ performance


Editor's Notes

  • #8 It is about how to determine the compilation unit. Remember, the interpreter is implemented in one language, and the compilation works on the meta-level. The main idea is that we want to take the implementation, add information from the execution context, and use that to do very aggressive and speculative optimizations on the interpreter implementation. This avoids the need to write custom JIT compilers.
  • #10
        VM       type      BenchRatio.geomean  BenchRatio.min  BenchRatio.max
      1 Java     Compiled            1.000000        1.000000         1.00000
      2 SOM[MT]  Compiled            5.528967        1.565665        13.90805
      3 SOM[PE]  Compiled          176.488620       63.952457       606.62440
  • #12 Type-specialize function arguments; min. escaping closures; catch-return nodes; opt. local vars; min. escaping vars
  • #17
         Cores     time.ms      time.s      time.m
       1     1    2428.125    2.428125  0.04046875
       2     5    3617.917    3.617917  0.06029861
       3    10    4930.000    4.930000  0.08216667
       4    50   13810.625   13.810625  0.23017708
       5   100   24861.250   24.861250  0.41435417
       6   200   46516.250   46.516250  0.77527083
       7   400   89221.875   89.221875  1.48703125
       8   500  110605.417  110.605417  1.84342361
       9   750  164434.583  164.434583  2.74057639
      10  1000  217541.875  217.541875  3.62569792
      11  1250  270658.750  270.658750  4.51097917
      12  1500  325657.917  325.657917  5.42763194