JavaScript on the GPU
 

  • 10,508 views

Jarred Nicholls, David Foelber and Phil Strong experiment with running JavaScript on the GPU - see how the first iteration of the experiment went.



Upload Details

Uploaded as Adobe PDF

Usage Rights

© All Rights Reserved


  • var input = new Array(10); input.forEach(function(v, i) {input[i] = Math.pow((i + 1) / 1.23, 3);});
  • still on slide 45 but it's INSANEEEE! JavaScript is the future :D and I love it!

    JavaScript on the GPU: Presentation Transcript

    • If you don’t get this ref...shame on you
    • Jarred Nicholls @jarrednicholls jarred@webkit.org
    • Work @ Sencha, Web Platform Team. Doing webkitty things...
    • WebKit Committer
    • Co-Author, W3C Web Cryptography API
    • JavaScript on the GPU
    • What I’ll blabber about today: Why JavaScript on the GPU; Running JavaScript on the GPU; What’s to come...
    • Why JavaScript on the GPU?
    • Why JavaScript on the GPU? Better question: Why a GPU?
    • Why JavaScript on the GPU? Better question: Why a GPU? A: They’re fast! (well, at certain things...)
    • GPUs are fast b/c... Totally different paradigm from CPUs: data parallelism vs. task parallelism, stream processing vs. sequential processing. GPUs can divide-and-conquer. Hardware capable of a large number of “threads”, e.g. ATI Radeon HD 6770m: 480 stream processing units == 480 cores. Typically very high memory bandwidth. Many, many GigaFLOPs.
    • GPUs don’t solve all problems. Not all tasks can be accelerated by GPUs. Tasks must be parallelizable, i.e.: side effect free, homogeneous and/or streamable. Overall tasks will become limited by Amdahl’s Law.
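The slide's appeal to Amdahl's Law can be made concrete. A minimal sketch (the function name and the numbers are illustrative, not from the deck): even with hundreds of cores, the serial fraction of a task caps the speedup.

```c
#include <assert.h>

/* Amdahl's Law: maximum speedup when a fraction p of a task is
 * parallelizable across n processing units; the (1 - p) serial
 * remainder runs at the original speed no matter how big n gets. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

On a card like the deck's 480-stream-processor Radeon, a task that is 90% parallelizable still tops out below a 10x speedup, because the serial 10% dominates.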
    • Let’s find out...
    • ExperimentCode Name “LateralJS”
    • LateralJS - Our Mission: To make JavaScript a first-class citizen on all GPUs and take advantage of hardware accelerated operations & data parallelization.
    • Our Options: OpenCL vs. Nvidia CUDA
      OpenCL: AMD, Nvidia, Intel, etc.; a shitty version of C99; no dynamic memory; no recursion; no function pointers; terrible tooling; immature (arguably).
      Nvidia CUDA: Nvidia only; C++ (C for CUDA); dynamic memory; recursion; function pointers; great dev. tooling; more mature (arguably).
    • Why not a Static Compiler? We want full JavaScript support: object / prototype, closures, recursion, functions as objects, variable typing. Type inference limitations: reasonably limited to size and complexity of “kernel-esque” functions. Not nearly insane enough.
    • Why an Interpreter? We want it all baby - full JavaScript support! Most insane approach. Challenging to make it good, but holds a lot of promise.
    • OpenCL Headaches
    • Oh the agony... Multiple memory spaces - pointer hell. No recursion - all inlined functions. No standard libc libraries. No dynamic memory. No standard data structures - apart from vector ops. Buggy ass AMD/Nvidia compilers.
    • Multiple Memory Spaces, in the order of fastest to slowest:
      private - very fast; stream processor cache (~64KB); scoped to a single work item
      local - fast; ~= L1 cache on CPUs (~64KB); scoped to a single work group
      global - slow, by orders of magnitude; ~= system memory over a slow bus; all the VRAM on the card (MBs)
      constant - available to all work groups/items
    • Memory Space Pointer Hell
      global uchar* gptr = 0x1000;
      local uchar* lptr = (local uchar*) gptr; // FAIL!
      uchar* pptr = (uchar*) gptr; // FAIL! private is implicit
      0x1000 points to something different depending on the address space (global, local, private)!
    • Memory Space Pointer Hell: pointers must always be fully qualified. Macros to help ease the pain:
      #define GPTR(TYPE) global TYPE*
      #define CPTR(TYPE) constant TYPE*
      #define LPTR(TYPE) local TYPE*
      #define PPTR(TYPE) private TYPE*
    • No Recursion!?!?!? No call stack. All functions are inlined to the kernel function.
      uint factorial(uint n) {
        if (n <= 1)
          return 1;
        else
          return n * factorial(n - 1); // compile-time error
      }
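Since OpenCL C forbids recursion, a function like the factorial above has to be rewritten iteratively before it can live in a kernel. A sketch of the loop form (plain host C here, not the deck's actual kernel code):

```c
#include <assert.h>

/* Iterative factorial: same result as the recursive version on the
 * slide, but with no call stack, so the shape is legal in OpenCL C. */
unsigned int factorial(unsigned int n) {
    unsigned int result = 1;
    for (; n > 1; n--)
        result *= n;
    return result;
}
```

The same transformation (recursion to an explicit loop, sometimes with a manual stack) is what "all functions are inlined to the kernel" forces on any recursive algorithm.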
    • No standard libc libraries: memcpy? strcpy? strcmp? etc...
    • No standard libc libraries: implement our own.
      #define MEMCPY(NAME, DEST_AS, SRC_AS) \
        DEST_AS void* NAME(DEST_AS void*, SRC_AS const void*, uint); \
        DEST_AS void* NAME(DEST_AS void* dest, SRC_AS const void* src, uint size) { \
          DEST_AS uchar* cDest = (DEST_AS uchar*)dest; \
          SRC_AS const uchar* cSrc = (SRC_AS const uchar*)src; \
          for (uint i = 0; i < size; i++) cDest[i] = cSrc[i]; \
          return (DEST_AS void*)cDest; \
        }
      PTR_MACRO_DEST_SRC(MEMCPY, memcpy)
      Produces: memcpy_g, memcpy_l, memcpy_p, memcpy_gc, memcpy_gl, memcpy_gp, memcpy_lc, memcpy_lg, memcpy_lp, memcpy_pc, memcpy_pg, memcpy_pl
    • No dynamic memory. No malloc(). No free(). What to do...
    • Yes! Dynamic memory. Create a large buffer of global memory - our “heap”. Implement our own malloc() and free(). Create a handle structure - “virtual memory”. P(T, hnd) macro to get the current pointer address.
      GPTR(handle) hnd = malloc(sizeof(uint));
      GPTR(uint) ptr = P(uint, hnd);
      *ptr = 0xdeadbeef;
      free(hnd);
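The deck doesn't show how its malloc() over the global "heap" buffer is implemented. As a rough illustration only, here is a minimal bump allocator over a fixed byte buffer; the names and the absence of free()/handle machinery are my own simplifications, not Lateral's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 1024

/* A toy "heap": one fixed buffer plus a bump offset. Lateral
 * additionally layers handles ("virtual memory") and a real free()
 * on top; this sketch shows only allocation. */
typedef struct {
    unsigned char data[HEAP_SIZE];
    size_t offset;
} toy_heap;

void* toy_malloc(toy_heap* h, size_t size) {
    size = (size + 3) & ~(size_t)3;   /* round up to 4-byte alignment */
    if (h->offset + size > HEAP_SIZE)
        return NULL;                  /* out of heap space */
    void* p = &h->data[h->offset];
    h->offset += size;
    return p;
}
```

The handle indirection in the slide's P(T, hnd) macro matters because a compacting free() can move allocations; handles stay stable while the underlying offsets change.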
    • Ok, we get the point... FYL!
    • High-level Architecture (diagram): on the Host, an Esprima Parser running in V8, a Data Serializer & Marshaller, and a Device Mgr; on the GPUs, the Data Heap, a Stack-based Interpreter, and a Garbage Collector. The flow: eval(code) → build JSON AST → serialize AST (JSON => C structs) → ship to GPU to interpret → fetch result.
    • AST Generation
    • AST Generation: JavaScript Source → (Esprima in V8) → JSON AST (v8::Object) → Lateral AST (C structs)
    • Embed esprima.js: Resource Generator
      $ resgen esprima.js resgen_esprima_js.c
    • Embed esprima.js: resgen_esprima_js.c
      const unsigned char resgen_esprima_js[] = {
        0x2f, 0x2a, 0x0a, 0x20, 0x20, 0x43, 0x6f, 0x70,
        0x79, 0x72, 0x69, 0x67, 0x68, 0x74, 0x20, 0x28,
        0x43, 0x29, 0x20, 0x32,
        ...
        0x20, 0x3a, 0x20, 0x2a, 0x2f, 0x0a, 0x0a, 0
      };
    • Embed esprima.js: ASTGenerator.cpp
      extern const char resgen_esprima_js;
      void ASTGenerator::init()
      {
        HandleScope scope;
        s_context = Context::New();
        s_context->Enter();
        Handle<Script> script = Script::Compile(String::New(&resgen_esprima_js));
        script->Run();
        s_context->Exit();
        s_initialized = true;
      }
    • Build JSON AST, e.g.:
      ASTGenerator::esprimaParse("var xyz = new Array(10);");
    • Build JSON AST
      Handle<Object> ASTGenerator::esprimaParse(const char* javascript)
      {
        if (!s_initialized)
          init();
        HandleScope scope;
        s_context->Enter();
        Handle<Object> global = s_context->Global();
        Handle<Object> esprima = Handle<Object>::Cast(global->Get(String::New("esprima")));
        Handle<Function> esprimaParse = Handle<Function>::Cast(esprima->Get(String::New("parse")));
        Handle<String> code = String::New(javascript);
        Handle<Object> ast = Handle<Object>::Cast(esprimaParse->Call(esprima, 1, (Handle<Value>*)&code));
        s_context->Exit();
        return scope.Close(ast);
      }
    • Build JSON AST
      {
        "type": "VariableDeclaration",
        "declarations": [{
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "xyz" },
          "init": {
            "type": "NewExpression",
            "callee": { "type": "Identifier", "name": "Array" },
            "arguments": [{ "type": "Literal", "value": 10 }]
          }
        }],
        "kind": "var"
      }
    • Lateral AST structs (shared between Host and OpenCL)
      #ifdef __OPENCL_VERSION__
      #define CL(TYPE) TYPE
      #else
      #define CL(TYPE) cl_##TYPE
      #endif

      typedef struct ast_type_st {
        CL(uint) id;
        CL(uint) size;
      } ast_type;

      typedef struct ast_program_st {
        ast_type type;
        CL(uint) body;
        CL(uint) numBody;
      } ast_program;

      typedef struct ast_identifier_st {
        ast_type type;
        CL(uint) name;
      } ast_identifier;
    • Lateral AST structs: v8::Object => ast_type expanded
      ast_type* vd1_1_init_id = (ast_type*)astCreateIdentifier("Array");
      ast_type* vd1_1_init_args[1];
      vd1_1_init_args[0] = (ast_type*)astCreateNumberLiteral(10);
      ast_type* vd1_1_init = (ast_type*)astCreateNewExpression(vd1_1_init_id, vd1_1_init_args, 1);
      free(vd1_1_init_id);
      for (int i = 0; i < 1; i++) free(vd1_1_init_args[i]);
      ast_type* vd1_1_id = (ast_type*)astCreateIdentifier("xyz");
      ast_type* vd1_decls[1];
      vd1_decls[0] = (ast_type*)astCreateVariableDeclarator(vd1_1_id, vd1_1_init);
      free(vd1_1_id);
      free(vd1_1_init);
      ast_type* vd1 = (ast_type*)astCreateVariableDeclaration(vd1_decls, 1, "var");
      for (int i = 0; i < 1; i++) free(vd1_decls[i]);
    • Lateral AST structs: astCreateIdentifier
      ast_identifier* astCreateIdentifier(const char* str) {
        CL(uint) size = sizeof(ast_identifier) + rnd(strlen(str) + 1, 4);
        ast_identifier* ast_id = (ast_identifier*)malloc(size);
        // copy the string
        strcpy((char*)(ast_id + 1), str);
        // fill the struct
        ast_id->type.id = AST_IDENTIFIER;
        ast_id->type.size = size;
        ast_id->name = sizeof(ast_identifier); // offset
        return ast_id;
      }
    • Lateral AST structs: astCreateIdentifier(“xyz”)
      offset  field      value
      0       type.id    AST_IDENTIFIER (0x01)
      4       type.size  16
      8       name       12 (offset)
      12      str[0]     ‘x’
      13      str[1]     ‘y’
      14      str[2]     ‘z’
      15      str[3]     ‘\0’
    • Lateral AST structs: astCreateNewExpression
      ast_expression_new* astCreateNewExpression(ast_type* callee, ast_type** arguments, int numArgs) {
        CL(uint) size = sizeof(ast_expression_new) + callee->size;
        for (int i = 0; i < numArgs; i++)
          size += arguments[i]->size;
        ast_expression_new* ast_new = (ast_expression_new*)malloc(size);
        ast_new->type.id = AST_NEW_EXPR;
        ast_new->type.size = size;
        CL(uint) offset = sizeof(ast_expression_new);
        char* dest = (char*)ast_new;
        // copy callee
        memcpy(dest + offset, callee, callee->size);
        ast_new->callee = offset;
        offset += callee->size;
        // copy arguments
        if (numArgs) {
          ast_new->arguments = offset;
          for (int i = 0; i < numArgs; i++) {
            ast_type* arg = arguments[i];
            memcpy(dest + offset, arg, arg->size);
            offset += arg->size;
          }
        } else
          ast_new->arguments = 0;
        ast_new->numArguments = numArgs;
        return ast_new;
      }
    • Lateral AST structs: new Array(10)
      offset  field           value
      0       type.id         AST_NEW_EXPR (0x308)
      4       type.size       52
      8       callee          20 (offset)
      12      arguments       40 (offset)
      16      numArguments    1
      20      callee node     ast_identifier (“Array”)
      40      arguments node  ast_literal_number (10)
    • Lateral AST structs: shared across the Host and the OpenCL runtime (Host writes, Lateral reads). Constructed on Host as contiguous blobs. Easy to send to GPU: memcpy(gpu, ast, ast->size); - fast to send to GPU, a single buffer write. Simple to traverse w/ pointer arithmetic.
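Because every node stores its total size and refers to children by byte offsets rather than absolute pointers, the blob stays valid after a single buffer write to the GPU and can be walked with plain pointer arithmetic. A simplified host-side sketch: the `ast_type` header mirrors the slides, while `ast_node` and the `ast_child` helper are my own illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the deck's node header: a type id plus total node size. */
typedef struct {
    uint32_t id;
    uint32_t size;
} ast_type;

/* A node that stores one child at a byte offset, the way the Lateral
 * structs do (e.g. ast_identifier's `name` field). */
typedef struct {
    ast_type type;
    uint32_t child;   /* byte offset from the start of this node */
} ast_node;

/* Resolve an offset into a pointer. This works wherever the blob
 * lives - host RAM or GPU global memory - precisely because no
 * absolute pointers are ever stored. */
static const void* ast_child(const ast_node* n) {
    return (const uint8_t*)n + n->child;
}
```

This offset scheme is also what sidesteps the address-space pointer hell from earlier slides: offsets mean the same thing in every OpenCL memory space.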
    • Stack-based Interpreter
    • Building Blocks: JS Type Structs; Lateral State (AST Traverse Stack, Call/Exec Stack, Return Stack, Scope Stack); Heap; Symbol/Ref Table; AST Traverse Loop; Interpret Loop.
    • Kernels
      #include "state.h"
      #include "jsvm/asttraverse.h"
      #include "jsvm/interpreter.h"

      // Setup VM structures
      kernel void lateral_init(GPTR(uchar) lateral_heap) {
        LATERAL_STATE_INIT
      }

      // Interpret the AST
      kernel void lateral(GPTR(uchar) lateral_heap, GPTR(ast_type) lateral_ast) {
        LATERAL_STATE
        ast_push(lateral_ast);
        while (!Q_EMPTY(lateral_state->ast_stack, ast_q) || !Q_EMPTY(lateral_state->call_stack, call_q)) {
          while (!Q_EMPTY(lateral_state->ast_stack, ast_q))
            traverse();
          if (!Q_EMPTY(lateral_state->call_stack, call_q))
            interpret();
        }
      }
    • Let’s interpret... var x = 1 + 2;
    • var x = 1 + 2; (AST, with the AST / Call / Return stacks stepped through over the following slides)
      {
        "type": "VariableDeclaration",
        "declarations": [{
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "x" },
          "init": {
            "type": "BinaryExpression",
            "operator": "+",
            "left": { "type": "Literal", "value": 1 },
            "right": { "type": "Literal", "value": 2 }
          }
        }],
        "kind": "var"
      }
      [Slide animation: the VarDecl, VarDtor, Ident, Binary, and Literal nodes are pushed onto the AST stack and moved to the Call stack as they are traversed; the Return stack accumulates “x”, 1, and 2, the Binary op reduces the operands to 3, and the stacks drain as the result is bound to x.]
    • Benchmark
    • Benchmark: small loop of FLOPs
      var input = new Array(10);
      for (var i = 0; i < input.length; i++) {
        input[i] = Math.pow((i + 1) / 1.23, 3);
      }
    • Execution Time
      Lateral GPU (ATI Radeon 6770m): 116.571533 ms
      CL CPU (Intel Core i7 4x2.4GHz): 0.226007 ms
      V8 (Intel Core i7 4x2.4GHz): 0.090664 ms
    • What went wrong? Everything. Stack-based AST interpreter, no optimizations. Heavy global memory access, no optimizations. No data or task parallelism.
    • Stack-based Interpreter: slow as molasses. Memory hog, Eclipse style. Heavy memory access: “var x = 1 + 2;” == 30 stack hits alone! Too much dynamic allocation. No inline optimizations, just following the yellow brick AST. Straight up lazy. Replace with something better: a bytecode compiler on the Host, a bytecode register-based interpreter on the Device.
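To see where the stack hits for "var x = 1 + 2;" come from, here is a toy stack evaluator for just the Binary/Literal step (entirely my own sketch, far simpler than Lateral's interpreter): every literal is a push and every operator is two pops and a push, and in Lateral each of those hits lands in slow global memory.

```c
#include <assert.h>

/* A toy return stack, as used by a stack-based interpreter. */
#define STACK_MAX 64
typedef struct {
    int data[STACK_MAX];
    int top;
} vstack;

static void push(vstack* s, int v) { s->data[s->top++] = v; }
static int  pop(vstack* s)         { return s->data[--s->top]; }

/* Evaluate "left + right" the stack-machine way: push both literals,
 * then the Binary op pops twice, adds, and pushes the result. */
int eval_add(vstack* s, int left, int right) {
    push(s, left);
    push(s, right);
    int b = pop(s);
    int a = pop(s);
    push(s, a + b);
    return pop(s);
}
```

A register-based bytecode interpreter removes most of these stack round-trips, which is why the slide proposes it as the replacement.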
    • Too much global access. Everything is dynamically allocated to global memory. A register-based interpreter & bytecode compiler can make better use of local and private memory. Optimizing memory access yields crazy results:
      // 11.1207 seconds
      size_t tid = get_global_id(0);
      c[tid] = a[tid];
      while (b[tid] > 0) { // touch global memory on each loop
        b[tid]--;          // touch global memory on each loop
        c[tid]++;          // touch global memory on each loop
      }

      // 0.0445558 seconds!! HOLY SHIT!
      size_t tid = get_global_id(0);
      int tmp = a[tid];        // temp private variable
      for (int i = b[tid]; i > 0; i--)
        tmp++;                 // touch private variables on each loop
      c[tid] = tmp;            // touch global memory one time
    • No data or task parallelism. Everything being interpreted in a single “thread”. We have hundreds of cores available to us! Build in heuristics: identify side-effect free statements, break into parallel tasks - very magical.
      var input = new Array(10);
      for (var i = 0; i < input.length; i++) {
        input[i] = Math.pow((i + 1) / 1.23, 3);
      }
      becomes:
      input[0] = Math.pow((0 + 1) / 1.23, 3);
      input[1] = Math.pow((1 + 1) / 1.23, 3);
      ...
      input[9] = Math.pow((9 + 1) / 1.23, 3);
    • What’s in store: acceptable performance on all CL devices; a V8/Node extension to launch Lateral tasks; a high-level API to perform map-reduce, etc.; Lateral-cluster...mmmmm
    • Thanks! Jarred Nicholls @jarrednicholls jarred@webkit.org