Building Efficient and Highly Run-Time Adaptable Virtual Machines
1. Guido Chari, LaFHIS, UBA, CONICET, Argentina
Diego Garbervetsky, LaFHIS, UBA, Argentina
Stefan Marr, JKU, Linz, Austria
2. Fully-Reflective Execution Environments (FREE)
Every entity, at both the application and VM levels, must provide reflective capabilities for its observation and modification at run time.
MOP + Fine-Grained Scoping:
Global
Method
Object
4. Starting Point
Extremely slow Smalltalk interpreter!
Insights for optimizing reflective systems.
What are the fundamental performance overheads of a FREE? How can we optimize them as much as possible?
5. Difference with Standard VMs

def setX(arg)
    x = arg
    return x

4 operations: 1 arg read, 1 field write, 1 field read, 1 return.

def IntercessionHandling(frame, operation) {
    if (getGlobalMetaobject(operation))
        return delegateToGlobal(operation);
    if (frame.getMetaobject(operation))
        return delegateToFrame(operation);
    if (rcvr().getMetaobject(operation))
        return delegateToRcvr(operation);
    executeOperationInVM();
}

It is hard for the compiler to speculate on the meta behavior because it cannot guess the current metaobject for each scope on each subsequent IH execution.
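To make the flow above concrete, here is a minimal runnable sketch of an IH (all names are illustrative, not TruffleMATE's actual API): every operation first consults the global, frame, and receiver scopes for a metaobject redefining it, and only then falls back to the standard VM path.

```python
# Minimal sketch of intercession handling; names are illustrative only.

class Metaobject:
    """Maps redefined VM operation names to language-level handlers."""
    def __init__(self, handlers):
        self.handlers = handlers

    def lookup(self, operation):
        return self.handlers.get(operation)

class Scope:
    """A scoping level (global, frame, or receiver) that may carry a metaobject."""
    def __init__(self, metaobject=None):
        self.metaobject = metaobject

    def handler_for(self, operation):
        return self.metaobject.lookup(operation) if self.metaobject else None

def intercession_handling(global_scope, frame, rcvr, operation, vm_default, *args):
    # Scoping conditions: global takes precedence over frame, frame over receiver.
    for scope in (global_scope, frame, rcvr):
        handler = scope.handler_for(operation)
        if handler is not None:
            return handler(*args)      # delegate (with marshaling) to language level
    return vm_default(*args)           # no metaobject anywhere: standard VM semantics

# An immutability metaobject: field writes raise, as on the slide.
def _refuse_write(field, value):
    raise RuntimeError("writeException")

immutable = Metaobject({"fieldWrite": _refuse_write})
```

Every one of the four operations in `setX` would run through such a check first, which is exactly the ubiquitous overhead the next slide enumerates.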
6. Intercession
Handling (IH)
✤ Ubiquitous: every operation
must be intercepted.
✤ Complex: every interception
must consider scoping
conditions.
✤ Tests that depend on run-
time values and jeopardize
optimizations.
✤ Lookup and marshaling for
delegation to language-level.
7. Conjectures for Optimization
✤ Stable semantics: moderate dynamicity at run time.
✤ Low-local metavariability: IH sites behave like call sites: most are monomorphic, some polymorphic, few megamorphic.
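The second conjecture suggests caching the metaobjects observed at each IH site in a dispatch chain, just as inline caches do for call sites. A hedged sketch follows; the chain limit and layout are assumptions for illustration, not TruffleMATE's actual implementation:

```python
# Sketch of a per-IH-site dispatch chain (a polymorphic inline cache over
# metaobjects). The limit and structure are illustrative assumptions.

MEGAMORPHIC_LIMIT = 4  # beyond this many distinct metaobjects, stop caching

class DispatchChain:
    def __init__(self):
        self.entries = []          # list of (metaobject, handler) pairs
        self.megamorphic = False

    def dispatch(self, metaobject, operation, slow_lookup):
        if not self.megamorphic:
            for cached_mo, handler in self.entries:
                if cached_mo is metaobject:   # cheap identity guard
                    return handler
            handler = slow_lookup(metaobject, operation)
            self.entries.append((metaobject, handler))   # cache miss: chain grows
            if len(self.entries) > MEGAMORPHIC_LIMIT:
                self.megamorphic = True       # site too variable: give up caching
                self.entries = []
            return handler
        return slow_lookup(metaobject, operation)  # generic, unoptimized path
```

A monomorphic IH site stabilizes at one entry, so every later dispatch is a single identity test; a megamorphic site pays the slow lookup on every execution, foreshadowing the 18.5x result on slide 16.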
8. Optimization Model
✤ Aggressive and exhaustive speculation on the meta-model: speculate on each observed metaobject at every scoping condition for every IH site.
✤ Mitigate as much as possible the overhead of the speculation guards.
9. Speculate for Each Metaobject + Scope + IH Site

def setX(arg)
    x = arg
    return x

Metaobjects (0 = _NO_METAOBJECT):
1: def fieldWrite(field, value)
       throw writeException()
2: def fieldWrite(field, value)   (no-op)
   def returnValue(val)
       return val + 2
3: def fieldWrite(field, value)
       write in DB
   def fieldRead(field)
       read in DB

Observed metaobjects per IH site (Global / Frame / Rcvr):
Read arg: 0 / 0 / 0
Field write: 3,0 / 1,0 / 1,2,0

Observe the run-time behavior until it becomes stable.
10. Speculate for Each Metaobject + Scope + IH Site

def setX(arg)
    x = arg
    return x

(metaobjects 0-3 as on the previous slide)

Observed metaobjects per IH site (Global / Frame / Rcvr):
Return: 0 / 0 / 2,0
Field read: 3,0 / 0 / 0
11. JIT Compiling the Fast Path

def IntercessionHandling(frame) {
    // Argument read (no metaobjects observed)
    executeVMStandardArgRead();

    // Field write
    if (globalMO(writeField) == 3) write in DB;
    else if (frameMO(writeField) == 1 or rcvr.MO(writeField) == 1)
        throw writeException();
    else if (rcvr.MO(writeField) == 2) ;   // no-op
    else executeVMStandardWrite();

    // Field read
    if (globalMO(readField) == 3) read in DB;
    else executeVMStandardRead();

    // Return
    if (rcvr.MO(ret) == 2) return val + 2;
    else executeVMStandardReturn();
}

Global and frame scoping: optimizable.
Instance scoping: still needs to access memory.
12. Mitigating Speculation Guards: Metaobjects in Layouts

Shapes are usually needed in the context of a method anyway.

def setX(arg)
    x = arg
    return x

Storing the metaobject in the object's shape lets the guard compare the shape instead of fetching the metaobject:

if (rcvr.shape == 1) return val + 2;
else executeVMStandardReturn();

instead of:

if (rcvr.MO(ret) == 2) return val + 2;
else executeVMStandardReturn();

No extra memory access for instances.
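A minimal runnable sketch of the idea, assuming a Truffle-style shape model (all names illustrative): the metaobject becomes part of the shape, so the identity check a method already performs on the shape to access fields also validates the meta-level for free.

```python
# Sketch: embedding the metaobject in the object's shape. Because field access
# already loads and compares the shape, the same identity check also tells us
# which metaobject (if any) governs the receiver -- no second memory access.

class Shape:
    def __init__(self, field_names, metaobject=None):
        self.field_offsets = {name: i for i, name in enumerate(field_names)}
        self.metaobject = metaobject    # meta-level data piggybacks on the shape

class Obj:
    def __init__(self, shape, values):
        self.shape = shape
        self.storage = list(values)

def cached_return(rcvr, val, cached_shape, redefined_return):
    # Single guard: shape identity. If it holds, the cached metaobject's
    # returnValue redefinition is known to apply and can be inlined.
    if rcvr.shape is cached_shape:
        return redefined_return(val)    # fast path, e.g. val + 2
    return val                          # fall back to the standard VM return

# Shapes: a plain Point and a Point whose metaobject redefines returnValue.
plain = Shape(["x", "y"])
mo2_shape = Shape(["x", "y"], metaobject={"returnValue": lambda v: v + 2})

p1 = Obj(mo2_shape, [0, 0])
p2 = Obj(plain, [0, 0])
```

In this scheme, assigning a different metaobject to an instance would switch it to a new shape, so stale fast paths fail the guard and stay sound.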
15. Peak Performance of Using the VM Reflective Capabilities Exhaustively
Baseline: TruffleMATE
Overall mean peak performance overhead: 1x-3.4x
16. Breaking Assumptions
Mega 18.5x, Mono 1.10x
Severe performance degradation when assumptions do not hold.

i = 0
foreach (point in list)
    i += point.x
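A hedged re-creation of this benchmark's setup (the metaobject wiring is illustrative): in the monomorphic scenario every point shares one metaobject redefining the field read, so the loop's IH site observes a single metaobject; in the megamorphic scenario each point carries its own, so the site can no longer cache.

```python
# Sketch: count the distinct metaobjects the fieldRead IH site observes while
# summing point.x, in the mono- vs megamorphic scenarios. Names illustrative.

class Point:
    def __init__(self, x, metaobject):
        self.x = x
        self.metaobject = metaobject   # redefines fieldRead (identity here)

def sum_x(points):
    observed = []                      # stands in for the site's dispatch chain
    total = 0
    for p in points:
        if all(mo is not p.metaobject for mo in observed):
            observed.append(p.metaobject)    # cache miss: chain grows
        total += p.metaobject["fieldRead"](p, "x")
    return total, len(observed)

read_x = {"fieldRead": lambda obj, field: obj.x}

mono = [Point(1, read_x) for _ in range(100)]        # one shared metaobject
mega = [Point(1, dict(read_x)) for _ in range(100)]  # a distinct metaobject each
```

With 100 distinct metaobjects the site is hopelessly megamorphic, matching the 18.5x degradation above; with a single shared one it stays monomorphic, at 1.10x.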
17. Results
✤ Ran quite efficiently in most cases.
✤ Positive indicator for our optimization model.
✤ Still room for improvement.
✤ High local metavariability leads to severe performance degradation.
18. Open Paths
✤ Go deeper into the VM (memory, garbage collection).
✤ Would a reflective compiler enable significant improvements?
✤ Statistics such as code bloat, length of dispatch chains, etc.
TruffleMATE: https://github.com/charig/truffleMate
Editor's Notes
Present myself and coauthors
Next: background on FREE.
Common architecture: two clear levels, the application on top of a black-box VM providing semantics, security, optimizations, etc.
Next: Interaction between levels.
The VM interacts with the application: it must know its methods and instances, modify their fields, etc. But the app usually has no way of affecting the VM behavior, so the division is more like a one-way valve.
next: Our proposal, FREE.
FREE allows this two-level interaction. However, the application cannot do whatever it wants: a predefined API (a MOP) provides some control.
next: Summary of FREE, universal reflection + MOP + fine-grained scoping.
This is work about optimizing fully-reflective execution environments, a particular kind of VM we have been pushing in the last years. Summarizing, a FREE's main characteristic is universal reflection. The flavor of FREE we have been exploring mediates the interaction between the application and the VM reflective capabilities with a special kind of API, well known in reflective systems: a MOP. It also enables very fine-grained scoping for VM-level redefinitions, so ideally a single instance should be able to customize how it is reclaimed from memory by the GC. Next: Example.
Now a brief example illustrating a FREE. Consider three instances of Point: the first does not use reflective capabilities, while the other two have different metaobjects implementing immutability. Let's see how this works in the context of a very simple method that sets the x field and returns its value. For point 1, it just sets x to 2 and returns the value 2. For point 2, it leaves x untouched and throws an exception. For point 3, it leaves x untouched but returns 3, because the return value is also redefined. Next: Starting Point, a slow FREE at Onward!
VMs must run fast, and the general belief is that reflection is slow. A FREE is a reflective VM with complex scoping conditions. Onward! showed a FREE is feasible, but slow!
Next: Zero-overhead.
We presented a prototype of a FREE at Onward! last year, implemented as a Smalltalk interpreter. It did not include any kind of compilation or optimization and was prohibitively slow. At the same time we had Stefan's work providing precise insights for optimizing reflective operations with dispatch chains, mainly at the language level.
next: Starting Line, Slow.
Reflection at the language level can be fast, mainly when variability is controlled. For the VM, they consider only a tiny MOP, far from an integral approach to VM reflection, and only behavioral. 10% inherent overhead: could it be improved? What happens with more complex scoping conditions? There was no experimentation with redefining behavior application-wide, and no optimization model was presented.
Next: The Problem for slowness, IH.
So I return to the same method to illustrate the fundamental performance difference between a standard VM and a FREE. Considering only the execution of statements, we have 4 VM operations in a very simple method. A standard VM only has to decode each operation and execute the corresponding code. A FREE, instead, must always execute an IH first. Although subsequent executions of the IH may repeat the same tests, it is hard for a compiler to speculate on and optimize the meta-behavior: it cannot guess that metaobjects will not change across subsequent executions. Next: Summary of IH.
IH is ubiquitous and complex: a proliferation of tests depending on run-time values, and, in case of redefinitions, lookup and marshaling from the VM to the language level and back.
Next: Conjectures for optimization.
4.6x inherent overhead. This already seems quite good; evidently there is a lot of repetition and simple comparison in the IH. However, 4.6x is still prohibitive for a sensible approach competing with standard VMs. On top of that, actually using the reflective capabilities is prohibitive.
Next: The meta-variability problem.
FREEs enable freely changing metaobjects, so a compiler cannot make assumptions about the meta-level, which prevents optimizations. Example: with code that accesses the metaobject of an object multiple times, optimizations such as inlining may no longer be generally applicable, leading to a proliferation of indirections.
next: Conjectures for Opt: stable semantics + low-local metavar.
After our analysis of the fundamental performance overheads, we designed and implemented an optimization model for a FREE. We know from the experience of dynamic systems that they need to make assumptions and optimize for the most common patterns of execution, because it is hard to reason about and optimize frequently changing, variable behavior. When assumptions do not hold, they preserve semantics by switching to a non-optimized path. Analogously, our assumptions are: do not expect metaobjects to change often, and only in rare cases will many observed metaobjects make an IH site megamorphic.
Next: Optimization model -> Speculation.
So, based on the aforementioned assumptions, our optimization model rests on two main ideas: aggressive and exhaustive speculation, and, since this kind of speculation leads to a proliferation of guards to ensure sound semantics, mitigating their performance cost as much as possible.
Next: Implementation high-level example.
For the sake of time, I will go through a very simplified example of our model, just to give the big picture of our main insights; more details can be found in the paper. Let's return once more to the setX method. I incorporate a third metaobject that persists fields in a DB. Speculation depends on warming up, i.e., analyzing and caching particular values of the system's behavior for a long time and, once the system becomes stable, optimizing for the observed variability. Let's assume we already ran for a long time and arrived at a stable point with these observed values for each IH site. First statement, x = arg: the arg read observes no metaobject anywhere, while the field write observes a particular combination for each scope.
Next: Second statement, return x.
Second statement, return x: one return redefinition, in metaobject 2, and a field read for persistence in the DB, in metaobject 3.
Next: Compilation stage, compiling fast path VM execution.
Since we already know the operation and the metaobject, we know the method redefining it, so we can, for instance, inline the operation and save all the delegation overheads. In addition, the compiler can factor out guards, mainly the frame and global ones. But we still have guards that access memory and execute pointer-equality tests. A compiler may factor out some repetitions, but could we do better?
Next: Problem of guards with receivers.
The previous speculation resolves a lot, but the guards are still costly. For the global and frame scopes we can assume that, if we inline several subsequent IHs, the guards can be factored out. But there may be many more receivers, and accessing them costs more.
next: Metaobjects in Layouts.
It is common in OO platforms to represent objects as contiguous blocks of memory plus a shape describing their contents. Shapes are almost always needed anyway for the VM to execute methods: to access the class of instances for dispatching, to access state, etc. If we put the metaobject in the shape, we can cache the shape plus the metaobject but guard only on the shape, which is free. Next: Validation, Research Questions.
next: RQ1 Inherent (micro benchs) + brief presentation of TruffleSOM/TruffleMATE.
We took TruffleSOM, an efficient VM in the same order of performance as V8, and extended it to be a FREE for Smalltalk. We then compared the performance of running the same set of benchmarks on both VMs. In this experiment the benchmarks do not use the meta-model, so we are just measuring the fundamental cost of adding the IH ubiquitously. Results: our VM ran even slightly faster.
next: RQ1 + Macro.
DeltaBlue is more sensitive to the inlining parameters of the Graal compiler and also made heavy use of operations on primitive types.
next: RQ2 Individual + Mega/Mono.
Next: RQ3, Using the meta-level: Read-only + Tracing.
We also measured the performance overhead of scattering metaobjects through the whole application: we trace all the method activations of some micro- and macro-benchmarks. The baselines are the plain benchmarks, so the overheads include the cost of the counting logic in all activations. The results show the overheads are considerably low, ranging from 1x to 3.4x. The worst case is DeltaBlue, which, as already mentioned, presents some challenges for our compiler and also executes a lot of primitive operations.
Next: Breaking assumptions.
Finally, we validated whether our assumptions were sensible. We designed a benchmark that walks through a list of Point instances and accumulates the value of the x field. In one scenario all the Point instances have the same metaobject redefining the field read; in the other, each instance has a different metaobject. So we measure the difference between a monomorphic and a megamorphic IH site. The results confirm that breaking our assumption of low-local metavariability leads to severe performance degradation. Next: Conclusions.
Our experiments refute the general belief that reflection is slow for low-level components and open the door for more experimentation. Nonetheless, results such as those for DeltaBlue show that there is still room for improvement.
Next: Open Paths and Questions.
Compiler -> customization of compilation/inlining thresholds; adaptation of the structure and length of dispatch chains; deciding whether an optimization should be based on run-time or compile-time checks; explicitly disabling or lowering the degree of dynamicity supported by a MOP.
Finish