High Performance JavaScript: JITs in Firefox
Power of the Web
How did we get here?Making existing web tech faster and faster
Adding new technologies, like video, audio, integration and 2D+3D drawingThe Browser
Getting a Webpagewww.intel.com192.198.164.158
Getting a Webpagewww.intel.com192.198.164.158GET / HTTP/1.0(HTTP Request)
Getting a Webpage<html><head><title>Laptop, Notebooks, ...</title></head><script language="javascript">functiondoStuff() {...}</script><body>...(HTML, CSS, JavaScript)
Core Web ComponentsHTML provides content
CSS provides styling
JavaScript provides a programming interfaceNew Technologies2D Canvas
Video, Audio tags
WebGL (OpenGL + JavaScript)DOM(Source: http://www.w3schools.com/htmldom/default.asp)
CSSCascading Style SheetsSeparates presentation from content<b class=”warning”>This is red</b>.warning {color: red;text-decoration: underline;}
What is JavaScript?C-like syntax
De facto language of the web
Interacts with the DOM and the browserC-Like SyntaxBraces, semicolons, familiar keywordsif (x) {for (i=0; i<100; i++)print("Hello!");}
Displaying a WebpageLayout engine applies CSS to DOMComputes geometry of elementsJavaScript engine executes codeThis can change DOM, triggers “reflow”Finished layout is sent to graphics layer
The Result
JavaScriptInvented by Brendan Eich in 1995 for Netscape 2
Initial version written in 10 days
Anything from small browser interactions, complex apps (Gmail) to intense graphics (WebGL)The Result
Think LISP or Self, not JavaUntyped - no type declarationsMulti-Paradigm – objects, closures, first-class functionsHighly dynamic - objects are dictionaries
No Type DeclarationsProperties, variables, return values can be anything:functionf(a) {var x =72.3;if (a)        x = a +"string";return x;}
No Type DeclarationsProperties, variables, return values can be anything:if a is:   number: add;   string: concat;functionf(a) {var x =“hi”;if (a)        x = a +33.2;return x;}
FunctionalFunctions may be returned, passed as arguments:functionf(a) {returnfunction () {return a;    }}var m = f(5);print(m());
ObjectsObjects are dictionaries mapping strings to values
Properties may be deleted or added at any time!var point = { x : 5, y : 10 };deletepoint.x;point.z=12;
PrototypesEvery object can have a prototype objectIf a property is not found on an object, its prototype is searched instead… And the prototype’s prototype, etc..
NumbersJavaScript specifies that numbers are IEEE-754 64-bit floating point
Engines use 32-bit integers to optimize
Must preserve semantics: integers overflow to doublesNumbersvarx = 0x7FFFFFFF;x++;intx = 0x7FFFFFFF;x++;
JavaScript ImplementationsValues (Objects, strings, numbers)
Garbage collector
Runtime library
Execution engine (VM)ValuesThe runtime must be able to query a value’s type to perform the right computation.When storing values to variables or object fields, the type must be stored as well.
ValuesBoxed values required for variables, object fields
Unboxed values required for computationBoxed ValuesNeed a representation in C++
Easy idea: 96-bit struct
32-bits for type
64-bits for double, pointer, or integer
Too big! We have to pack better.Boxed ValuesWe could use pointers, and tag the low bitsDoubles would have to be allocated on the heapIndirection, GC pressure is badThere is a middle ground…
Values are 64-bit
Doubles are normal IEEE-754
How to pack non-doubles?
51 bits of NaN space!IA32, ARMNunboxing
NunboxingIA32, ARMTypePayload0x400c00000x00000000630Full value: 0x400c000000000000
Type is double because it’s not a NaN
Encodes: (Double, 3.5)NunboxingIA32, ARMTypePayload0x000000400xFFFF0001630Full value: 0xFFFF000100000040
Value is in NaN space
Type is 0xFFFF0001(Int32)
Value is (Int32, 0x00000040)Punboxingx86-64TypePayload630471111 1111 1111 1000 0000 .. 0000 0000 0000 0000 0100 0000Full value: 0xFFFF800000000040
Value is in NaN space
Bits >> 47 == 0x1FFFF == INT32
Bits 47-63 masked off to retrieve payload
Value(Int32, 0x00000040)* Some 64-bit OSes can restrict mmap() to 32-bits
Garbage CollectionNeed to reclaim memory without pausing user workflow, animations, etcNeed very fast object allocationConsider lots of small objects like points, vectorsReading and writing to the heap must be fast
Mark and SweepAll live objects are found via a recursive traversal of the root value setAll dead objects are added to a free listVery slow to traverse entire heapBuilding free lists can bring “dead” memory into cache
ImprovementsSweeping, freelist building have been moved off-thread
Marking can be incremental, interleaved with program executionGenerational GCObservation: most objects are short-livedAll new objects are bump-allocated into a newborn spaceOnce newborn space is full, live objects are moved to the tenured spaceNewborn space is then reset to empty
Generational GCInactiveActiveNewborn Space8MB8MBTenured Space
After Minor GCActiveInactiveNewborn Space8MB8MBTenured Space
BarriersIncremental marking, generational GC need read or write barriersRead barriers are much slowerAzul Systems has hardware read barrierWrite barrier seems to be well-predicted?
UnknownsHow much do cache effects matter?Memory GC has to touchLocality of objectsWhat do we need to consider to make GC fast?
Running JavaScriptInterpreter – Runs code, not fast
Basic JIT – Simple, untyped compiler
Advanced JIT – Typed compiler, lightspeedInterpreterGood for code that runs once
Giant switch loop
Handles all edge cases of JS semanticsInterpreterwhile (true) {switch (*pc) {case OP_ADD:            ...case OP_SUB:            ...case OP_RETURN:            ...    }    pc++;}
Interpretercase OP_ADD: {        Value lhs = POP();        Value rhs = POP();        Value result;if (lhs.isInt32() && rhs.isInt32()) {int left = rhs.toInt32();int right = rhs.toInt32();if (AddOverflows(left, right, left + right))                result.setInt32(left + right);elseresult.setNumber(double(left) + double(right));        } elseif (lhs.isString() || rhs.isString()) {            String *left = ValueToString(lhs);            String *right = ValueToString(rhs);            String *r = Concatenate(left, right);result.setString(r);        } else {double left = ValueToNumber(lhs);double right = ValueToNumber(rhs);result.setDouble(left + right);        }PUSH(result);break;
InterpreterVery slow!Lots of opcode and type dispatchLots of interpreter stack trafficJust-in-time compilation solves both
Just In TimeCompilation must be very fastCan’t introduce noticeable pausesPeople care about memory useCan’t JIT everythingCan’t create bloated codeMay have to discard code at any time
Basic JITJägerMonkey in Firefox 4Every opcode has a hand-coded template of assembly (registers left blank)Method at a time:Single pass through bytecode stream!Compiler uses assembly templates corresponding to each opcode
BytecodeGETARG 0 ; fetch xGETARG 1 ; fetch yADDRETURNfunctionAdd(x, y) {return x + y;}
Interpretercase OP_ADD: {        Value lhs = POP();        Value rhs = POP();        Value result;if (lhs.isInt32() && rhs.isInt32()) {int left = rhs.toInt32();int right = rhs.toInt32();if (AddOverflows(left, right, left + right))                result.setInt32(left + right);elseresult.setNumber(double(left) + double(right));        } elseif (lhs.isString() || rhs.isString()) {            String *left = ValueToString(lhs);            String *right = ValueToString(rhs);            String *r = Concatenate(left, right);result.setString(r);        } else {double left = ValueToNumber(lhs);double right = ValueToNumber(rhs);result.setDouble(left + right);        }PUSH(result);break;
Assembling ADDInlining that huge chunk for every ADD would be very slow
Observation:
Some input types much more common than othersInteger Math – Commonvar j =0;for (i=0; i<10000; i++) {    j +=i;}Weird Stuff – Rare!var j =12.3;for (i =0; i <10000; i++) {    j +=new Object() + i.toString();}
Assembling ADDOnly generate code for the easiest and most common cases
Large design space
Can consider integers common, or
Integers and doubles, or
Anything!Basic JIT - ADDif (arg0.type != INT32) gotoslow_add;if (arg1.type != INT32)gotoslow_add;
Basic JIT - ADDif (arg0.type != INT32) gotoslow_add;if (arg1.type != INT32)gotoslow_add;R0 = arg0.dataR1 = arg1.dataGreedy Register Allocator
Basic JIT - ADDif (arg0.type != INT32) gotoslow_add;if (arg1.type != INT32)gotoslow_add;R0 = arg0.data;R1 = arg1.data;R2 = R0 + R1;if (OVERFLOWED)gotoslow_add;
Slow Pathsslow_add:    Value result = runtime::Add(arg0, arg1);
InlineOut-of-lineif (arg0.type != INT32) gotoslow_add;if (arg1.type != INT32)gotoslow_add;R0 = arg0.data;R1 = arg1.data;R2 = R0 + R1;  if (OVERFLOWED)gotoslow_add;rejoin:slow_add:    Value result = Interpreter::Add(arg0, arg1);R2 = result.data;goto rejoin;
RejoiningInlineOut-of-line  if (arg0.type != INT32) gotoslow_add;  if (arg1.type != INT32)gotoslow_add;  R0 = arg0.data;  R1 = arg1.data;  R2 = R0 + R1;  if (OVERFLOWED)gotoslow_add;R3 = TYPE_INT32;rejoin:slow_add:    Value result = runtime::Add(arg0, arg1);    R2 = result.data;    R3 = result.type;goto rejoin;
Final CodeInlineOut-of-lineif (arg0.type != INT32) goto slow_add; if (arg1.type != INT32)goto slow_add;R0 = arg0.data;R1 = arg1.data;R2 = R0 + R1;  if (OVERFLOWED)goto slow_add;R3 = TYPE_INT32;rejoin:return Value(R3, R2);slow_add:    Value result = Interpreter::Add(arg0, arg1);R2 = result.data;R3 = result.type;goto rejoin;
ObservationsEven though we have fast paths, no type information can flow in between opcodesCannot assume result of ADD is integerMeans more branchesTypes increase register pressure
Basic JIT – Object Accessx.yis multiple dictionary searches!
How can we create a fast-path?
We could try to cache the lookup, but
We want it to work for >1 objectsfunctionf(x) {returnx.y;}
Object AccessObservationUsually, objects flowing through an access site look the sameIf we knew an object’s layout, we could bypass the dictionary lookup
Object Layoutvar obj = { x: 50,             y: 100,           z: 20    };
Object Layout
Object Layoutvarobj = { x: 50,             y: 100,           z: 20    };…varobj2 = { x: 78,             y: 93,             z: 600    };
Object Layout
Familiar ProblemsWe don’t know an object’s shape during JIT compilation... Or even that a property access receives an object!Solution: leave code “blank,” lazily generate it later
Inline Cachesfunctionf(x) {returnx.y;}
Inline CachesStart as normal, guarding on type and loading dataif (arg0.type != OBJECT)gotoslow_property;JSObject *obj = arg0.data;
No information yet: leave blank (nops). if (arg0.type != OBJECT)gotoslow_property;JSObject *obj = arg0.data;gotoproperty_ic;Inline Caches
Property ICsfunctionf(x) {returnx.y;}f({x: 5, y: 10, z: 30});
Property ICs
Property ICs
Property ICs
Inline CachesShape = S1, slot is 1 if (arg0.type != OBJECT)gotoslow_property;JSObject *obj = arg0.data;gotoproperty_ic;
Inline CachesShape = S1, slot is 1 if (arg0.type != OBJECT)gotoslow_property;JSObject *obj = arg0.data;gotoproperty_ic;
Inline CachesShape = SHAPE1, slot is 1  if (arg0.type != OBJECT)gotoslow_property;JSObject *obj = arg0.data;  if (obj->shape != SHAPE1)gotoproperty_ic;R0 = obj->slots[1].type;R1 = obj->slots[1].data;
PolymorphismWhat happens if two different shapes pass through a property access?
Add another fast-path...functionf(x) {returnx.y;}f({ x: 5, y: 10, z: 30});f({ y: 40});
PolymorphismMain Method if (arg0.type != OBJECT)gotoslow_path;JSObject *obj = arg0.data;if (obj->shape != SHAPE1)gotoproperty_ic;R0 = obj->slots[1].type;R1 = obj->slots[1].data;rejoin:
PolymorphismMain MethodGenerated Stub  if (arg0.type != OBJECT)gotoslow_path;JSObject *obj = arg0.data;  if (obj->shape != SHAPE0)gotoproperty_ic;  R0 = obj->slots[1].type;  R1 = obj->slots[1].data;rejoin:  stub:if (obj->shape != SHAPE2)gotoproperty_ic;R0 = obj->slots[0].type;R1 = obj->slots[0].data;goto rejoin;
PolymorphismMain MethodGenerated Stubif (arg0.type != OBJECT)    goto slow_path;  JSObject *obj = arg0.data;if (obj->shape != SHAPE1)goto stub;  R0 = obj->slots[1].type;  R1 = obj->slots[1].data;rejoin:  stub:if (obj->shape != SHAPE2)goto property_ic;R0 = obj->slots[0].type;R1 = obj->slots[0].data;goto rejoin;
Chain of Stubsif(obj->shape == S1) {    result = obj->slots[0];} else {if (obj->shape == S2) {        result = obj->slots[1];    } else {if (obj->shape == S3) {             result = obj->slots[2];        } else {if (obj->shape == S4) {                result = obj->slots[3];            } else {goto property_ic;            }        }    }}
Code MemoryGenerated code is patched from inside the method
Self-modifying, but single threaded
Code memory is always rwx
Concerned about protection-flipping expenseBasic JIT SummaryGenerates simple code to handle common cases
Inline caches adapt code based on runtime observations
Lots of guards, poor type informationOptimizing HarderSingle pass too limited: we want to perform whole-method optimizations
We could generate an IR, but without type information...
Slow paths prevent most optimizations.Code Motion?functionAdd(x, n) {varsum = 0;for (vari = 0; i < n; i++)        sum += x + n;    return sum;}The addition looks loop invariant.Can we hoist it?
Code Motion?functionAdd(x, n) {varsum = 0;vartemp0 = x + n;for (vari = 0; i < n; i++)        sum += temp0;    return sum;}Let’s try it.
Wrong Resultvar global =0;varobj= {valueOf: function () {return++global;    }}Add(obj, 10);Original code returns 155
Hoisted version returns 110!IdempotencyUntyped operations are usually not provably idempotentSlow paths are re-entrant, can have observable side effectsThis prevents code motion, redundancy elimination
Ideal ScenarioActual knowledge about types!
Remove slow paths

Intel JIT Talk

Editor's Notes

  • #2 I’m going to start off by showing my favorite demo, that I think really captures what we’re enabling by making the browser as fast possible.
  • #3 So this is Doom, translated - actually compiled - from C++ to JavaScript, being played in Firefox. In addition to JavaScript it’s using the new 2D drawing element called Canvas. This demo is completely native to the browser. There’s no Flash, it’s all open web technologies. And it’s managing to get a pretty good framerate, 30fps, despite being mostly unoptimized. This is a really cool demo for me because, three years ago, this kind of demo on the web was just impossible without proprietary extensions. Some of the technology just wasn’t there, and the rest of it was anywhere from ten to a hundred times slower. So we’ve come a long way.… Unfortunately all I have of the demo is a recording, it was playable online for about a day but iD Software requested that we take it down.
  • #4 To step back for a minute, I’m going to briefly overview how the browser and the major web technologies interact, and then jump into JavaScript which is the heart of this talk.
  • #5 This is a much less gory example, a major company’s website in a popular web browser. The process of getting from the intel.com URL to this display is actually pretty complicated.
  • #7 The browser then sends intel.com this short plaintext message which says, “give me the webpage corresponding to the top of your website’s hierarchy.”
  • #10 How does the browser get from all of this to the a fully rendered website?
  • #27 Now that we’re familiar with the language’s main features, let’s talk about implementing a JavaScript engine. A JavaScript implementation has four main pieces
  • #28 One of the challenges of a JavaScript engine is dealing with the lack of type declarations. In order to dispatch the correct computation, the runtime must be able to query a variable or field’s current type.
  • #45 We’re done with data representation and garbage collection, so it’s time for the really fun stuff: running JavaScript code, and making it run fast. One way to run code, but not make it run fast, is to use an interpreter.