JavaScript on the GPU

I experimented with running JavaScript on the GPU - see how the first iteration of the experiment went.

Transcript

  • 1. If you don’t get this ref...shame on you
  • 2. Jarred Nicholls | @jarrednicholls | jarred@webkit.org
  • 3. Work @ Sencha, Web Platform Team. Doing webkitty things...
  • 4. WebKit Committer
  • 5. Co-Author, W3C Web Cryptography API
  • 6. JavaScript on the GPU
  • 7. What I'll blabber about today: why JavaScript on the GPU; running JavaScript on the GPU; what's to come...
  • 8. Why JavaScript on the GPU?
  • 9. Why JavaScript on the GPU? Better question: Why a GPU?
  • 10. Why JavaScript on the GPU? Better question: Why a GPU? A: They’re fast! (well, at certain things...)
  • 11. GPUs are fast b/c... Totally different paradigm from CPUs: data parallelism vs. task parallelism, stream processing vs. sequential processing; GPUs can divide-and-conquer. Hardware capable of a large number of "threads", e.g. ATI Radeon HD 6770m: 480 stream processing units == 480 cores. Typically very high memory bandwidth. Many, many GigaFLOPs.
  • 12. GPUs don't solve all problems. Not all tasks can be accelerated by GPUs. Tasks must be parallelizable, i.e. side-effect free and homogeneous and/or streamable. Overall tasks will become limited by Amdahl's Law.
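    As a rough illustration of Amdahl's Law (not from the deck): if a fraction p of a task can be parallelized across N cores, the overall speedup is capped at 1 / ((1 - p) + p / N). A minimal C sketch with made-up numbers:
      #include <stdio.h>

      /* Amdahl's Law sketch: even with 480 cores, a 10% serial portion
         caps the overall speedup below 10x. Numbers are illustrative. */
      int main(void) {
          double p = 0.90;   /* parallelizable fraction           */
          int    n = 480;    /* e.g. stream processors on a 6770m */
          double speedup = 1.0 / ((1.0 - p) + p / n);
          printf("max speedup: %.2fx\n", speedup);   /* ~9.8x */
          return 0;
      }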
  • 13. Let’s find out...
  • 14. Experiment: Code Name "LateralJS"
  • 15. LateralJS. Our Mission: To make JavaScript a first-class citizen on all GPUs and take advantage of hardware accelerated operations & data parallelization.
  • 16. Our Options:
      OpenCL: AMD, Nvidia, Intel, etc.; a shitty version of C99; no dynamic memory; no recursion; no function pointers; terrible tooling; immature (arguably)
      Nvidia CUDA: Nvidia only; C++ (C for CUDA); dynamic memory; recursion; function pointers; great dev. tooling; more mature (arguably)
  • 18. Why not a Static Compiler? We want full JavaScript support: object / prototype, closures, recursion, functions as objects, variable typing. Type inference limitations. Reasonably limited to size and complexity of "kernel-esque" functions. Not nearly insane enough.
  • 19. Why an Interpreter? We want it all baby - full JavaScript support! Most insane approach. Challenging to make it good, but holds a lot of promise.
  • 20. OpenCL Headaches
  • 21. Oh the agony... Multiple memory spaces - pointer hell. No recursion - all inlined functions. No standard libc libraries. No dynamic memory. No standard data structures, apart from vector ops. Buggy ass AMD/Nvidia compilers.
  • 22. Multiple Memory Spaces, in order of fastest to slowest:
      private: very fast; stream processor cache (~64KB); scoped to a single work item
      local: fast; ~= L1 cache on CPUs (~64KB); scoped to a single work group
      global / constant: slow, by orders of magnitude; ~= system memory over a slow bus; available to all work groups/items; all the VRAM on the card (MBs)
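    A minimal OpenCL C sketch of the four address spaces side by side (illustrative only, not Lateral code; all names are made up):
      // OpenCL C sketch: the same data touching each address space.
      kernel void spaces_demo(global float* data,       // global: device VRAM, slow
                              constant float* coeffs,   // constant: read-only for all work items
                              local float* scratch) {   // local: shared within a work group
          private float tmp = data[get_global_id(0)];   // private: per-work-item storage
          scratch[get_local_id(0)] = tmp * coeffs[0];
          barrier(CLK_LOCAL_MEM_FENCE);                 // make local writes visible to the group
          data[get_global_id(0)] = scratch[get_local_id(0)];
      }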
  • 23. Memory Space Pointer Hell
      global uchar* gptr = 0x1000;
      local uchar* lptr = (local uchar*) gptr;   // FAIL!
      uchar* pptr = (uchar*) gptr;               // FAIL! private is implicit
      0x1000 points to something different depending on the address space (global vs. local vs. private)!
  • 24. Memory Space Pointer Hell: pointers must always be fully qualified. Macros to help ease the pain:
      #define GPTR(TYPE) global TYPE*
      #define CPTR(TYPE) constant TYPE*
      #define LPTR(TYPE) local TYPE*
      #define PPTR(TYPE) private TYPE*
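    A sketch of what code looks like once every pointer is qualified through these macros (the copy_to_local helper is made up for illustration):
      #define GPTR(TYPE) global TYPE*
      #define LPTR(TYPE) local TYPE*

      // Hypothetical helper: both parameters carry an explicit address space,
      // so the compiler knows this is a global -> local copy.
      void copy_to_local(LPTR(uchar) dst, GPTR(uchar) src, uint size) {
          for (uint i = 0; i < size; i++)
              dst[i] = src[i];
      }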
  • 25. No Recursion!?!?!? No call stack; all functions are inlined to the kernel function.
      uint factorial(uint n) {
          if (n <= 1) return 1;
          else return n * factorial(n - 1);   // compile-time error
      }
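    Since there is no call stack, a function like this has to be written without recursion; a minimal iterative rewrite (not from the slides) that would compile:
      uint factorial(uint n) {
          uint result = 1;
          for (uint i = 2; i <= n; i++)
              result *= i;   // plain loop, no call stack required
          return result;
      }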
  • 26. No standard libc libraries: memcpy? strcpy? strcmp? etc...
  • 27. No standard libc libraries: implement our own.
      #define MEMCPY(NAME, DEST_AS, SRC_AS)                                           \
          DEST_AS void* NAME(DEST_AS void*, SRC_AS const void*, uint);                \
          DEST_AS void* NAME(DEST_AS void* dest, SRC_AS const void* src, uint size) { \
              DEST_AS uchar* cDest = (DEST_AS uchar*)dest;                            \
              SRC_AS const uchar* cSrc = (SRC_AS const uchar*)src;                    \
              for (uint i = 0; i < size; i++)                                         \
                  cDest[i] = cSrc[i];                                                 \
              return (DEST_AS void*)cDest;                                            \
          }
      PTR_MACRO_DEST_SRC(MEMCPY, memcpy)
      Produces: memcpy_g, memcpy_gc, memcpy_lc, memcpy_pc, memcpy_l, memcpy_gl, memcpy_lg, memcpy_pg, memcpy_p, memcpy_gp, memcpy_lp, memcpy_pl
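    A hedged usage sketch (the kernel and buffer names are made up, and it assumes the _gl suffix means global destination / local source):
      // Flush a work group's local staging buffer back to global memory
      // using the memcpy_gl variant generated above.
      kernel void flush_results(global uchar* out, local uchar* staging, uint n) {
          memcpy_gl(out, staging, n);
      }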
  • 28. No dynamic memory: no malloc(), no free(). What to do...
  • 29. Yes! dynamic memory. Create a large buffer of global memory - our "heap". Implement our own malloc() and free(). Create a handle structure - "virtual memory". P(T, hnd) macro to get the current pointer address.
      GPTR(handle) hnd = malloc(sizeof(uint));
      GPTR(uint) ptr = P(uint, hnd);
      *ptr = 0xdeadbeef;
      free(hnd);
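    The slides don't show how malloc() is implemented over the heap buffer; one plausible minimal scheme (purely illustrative, not Lateral's actual allocator) is an atomic bump allocator:
      // OpenCL C sketch: bump allocation over a preallocated global "heap".
      // The first uint of the heap is the allocation cursor; a real allocator
      // would also need handles and free-list bookkeeping to support free().
      global uchar* bump_alloc(global uchar* heap, uint size) {
          global uint* cursor = (global uint*)heap;
          uint offset = atomic_add(cursor, size);   // reserve `size` bytes
          return heap + sizeof(uint) + offset;      // skip past the cursor itself
      }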
  • 30. Ok, we get the point... FYL!
  • 31.-35. High-level Architecture (diagram, built up across five slides). Components spanning the Host and the GPUs: V8, Esprima Parser, Stack-based Interpreter, Data Heap, Data Serializer & Marshaller, Garbage Collector, Device Mgr. Flow: eval(code) → build JSON AST → serialize AST (JSON => C structs) → ship to GPU to interpret → fetch result.
  • 36. AST Generation
  • 37. AST Generation: JavaScript Source → Esprima in V8 → JSON AST (v8::Object) → Lateral AST (C structs)
  • 38. Embed esprima.js: resource generator.
      $ resgen esprima.js resgen_esprima_js.c
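    A sketch of what a resgen-style tool might look like (illustrative; not the actual resgen used in the talk):
      /* Dump a file as a C byte array so it can be compiled into the binary. */
      #include <stdio.h>

      int main(int argc, char** argv) {
          if (argc < 2) return 1;
          FILE* in = fopen(argv[1], "rb");
          if (!in) return 1;
          printf("const unsigned char resgen_data[] = {\n");
          int c, n = 0;
          while ((c = fgetc(in)) != EOF) {
              printf("0x%02x, ", c);
              if (++n % 10 == 0) printf("\n");
          }
          printf("0};\n");   /* trailing 0 so the data doubles as a C string */
          fclose(in);
          return 0;
      }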
  • 39. Embed esprima.js: resgen_esprima_js.c
      const unsigned char resgen_esprima_js[] = {
          0x2f, 0x2a, 0x0a, 0x20, 0x20, 0x43, 0x6f, 0x70, 0x79, 0x72,
          0x69, 0x67, 0x68, 0x74, 0x20, 0x28, 0x43, 0x29, 0x20, 0x32,
          ...
          0x20, 0x3a, 0x20, 0x2a, 0x2f, 0x0a, 0x0a, 0
      };
  • 40. Embed esprima.js: ASTGenerator.cpp
      extern const char resgen_esprima_js;
      void ASTGenerator::init()
      {
          HandleScope scope;
          s_context = Context::New();
          s_context->Enter();
          Handle<Script> script = Script::Compile(String::New(&resgen_esprima_js));
          script->Run();
          s_context->Exit();
          s_initialized = true;
      }
  • 41. Build JSON AST, e.g.
      ASTGenerator::esprimaParse("var xyz = new Array(10);");
  • 42. Build JSON AST
      Handle<Object> ASTGenerator::esprimaParse(const char* javascript)
      {
          if (!s_initialized)
              init();
          HandleScope scope;
          s_context->Enter();
          Handle<Object> global = s_context->Global();
          Handle<Object> esprima = Handle<Object>::Cast(global->Get(String::New("esprima")));
          Handle<Function> esprimaParse = Handle<Function>::Cast(esprima->Get(String::New("parse")));
          Handle<String> code = String::New(javascript);
          Handle<Object> ast = Handle<Object>::Cast(esprimaParse->Call(esprima, 1, (Handle<Value>*)&code));
          s_context->Exit();
          return scope.Close(ast);
      }
  • 43. Build JSON AST
      {
        "type": "VariableDeclaration",
        "declarations": [{
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "xyz" },
          "init": {
            "type": "NewExpression",
            "callee": { "type": "Identifier", "name": "Array" },
            "arguments": [{ "type": "Literal", "value": 10 }]
          }
        }],
        "kind": "var"
      }
  • 44. Lateral AST structs: structs shared between Host and OpenCL.
      #ifdef __OPENCL_VERSION__
        #define CL(TYPE) TYPE
      #else
        #define CL(TYPE) cl_##TYPE
      #endif

      typedef struct ast_type_st {
          CL(uint) id;
          CL(uint) size;
      } ast_type;

      typedef struct ast_program_st {
          ast_type type;
          CL(uint) body;
          CL(uint) numBody;
      } ast_program;

      typedef struct ast_identifier_st {
          ast_type type;
          CL(uint) name;
      } ast_identifier;
  • 45. Lateral AST structs: v8::Object => ast_type, expanded
      ast_type* vd1_1_init_id = (ast_type*)astCreateIdentifier("Array");
      ast_type* vd1_1_init_args[1];
      vd1_1_init_args[0] = (ast_type*)astCreateNumberLiteral(10);
      ast_type* vd1_1_init = (ast_type*)astCreateNewExpression(vd1_1_init_id, vd1_1_init_args, 1);
      free(vd1_1_init_id);
      for (int i = 0; i < 1; i++) free(vd1_1_init_args[i]);
      ast_type* vd1_1_id = (ast_type*)astCreateIdentifier("xyz");
      ast_type* vd1_decls[1];
      vd1_decls[0] = (ast_type*)astCreateVariableDeclarator(vd1_1_id, vd1_1_init);
      free(vd1_1_id);
      free(vd1_1_init);
      ast_type* vd1 = (ast_type*)astCreateVariableDeclaration(vd1_decls, 1, "var");
      for (int i = 0; i < 1; i++) free(vd1_decls[i]);
  • 46. Lateral AST structs: astCreateIdentifier
      ast_identifier* astCreateIdentifier(const char* str) {
          CL(uint) size = sizeof(ast_identifier) + rnd(strlen(str) + 1, 4);
          ast_identifier* ast_id = (ast_identifier*)malloc(size);
          // copy the string
          strcpy((char*)(ast_id + 1), str);
          // fill the struct
          ast_id->type.id = AST_IDENTIFIER;
          ast_id->type.size = size;
          ast_id->name = sizeof(ast_identifier); // offset
          return ast_id;
      }
  • 47. Lateral AST structs: astCreateIdentifier("xyz") memory layout
      offset  field      value
      0       type.id    AST_IDENTIFIER (0x01)
      4       type.size  16
      8       name       12 (offset)
      12      str[0]     'x'
      13      str[1]     'y'
      14      str[2]     'z'
      15      str[3]     '\0'
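    For illustration, reading the name back out of this offset-based layout might look like the following (a sketch using the GPTR macro from slide 24; not actual Lateral code):
      // `name` is a byte offset from the start of the node rather than a pointer,
      // so the blob stays valid after being copied into a different address space.
      GPTR(char) ast_identifier_name(GPTR(ast_identifier) node) {
          return (GPTR(char))node + node->name;
      }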
  • 48. Lateral AST structs: astCreateNewExpression
      ast_expression_new* astCreateNewExpression(ast_type* callee, ast_type** arguments, int numArgs) {
          CL(uint) size = sizeof(ast_expression_new) + callee->size;
          for (int i = 0; i < numArgs; i++)
              size += arguments[i]->size;
          ast_expression_new* ast_new = (ast_expression_new*)malloc(size);
          ast_new->type.id = AST_NEW_EXPR;
          ast_new->type.size = size;
          CL(uint) offset = sizeof(ast_expression_new);
          char* dest = (char*)ast_new;
          // copy callee
          memcpy(dest + offset, callee, callee->size);
          ast_new->callee = offset;
          offset += callee->size;
          // copy arguments
          if (numArgs) {
              ast_new->arguments = offset;
              for (int i = 0; i < numArgs; i++) {
                  ast_type* arg = arguments[i];
                  memcpy(dest + offset, arg, arg->size);
                  offset += arg->size;
              }
          } else {
              ast_new->arguments = 0;
          }
          ast_new->numArguments = numArgs;
          return ast_new;
      }
  • 49. Lateral AST structs: new Array(10) memory layout
      offset  field          value
      0       type.id        AST_NEW_EXPR (0x308)
      4       type.size      52
      8       callee         20 (offset)
      12      arguments      40 (offset)
      16      numArguments   1
      20      callee node    ast_identifier ("Array")
      40      argument node  ast_literal_number (10)
  • 50. Lateral AST structs: shared across the Host and the OpenCL runtime (Host writes, Lateral reads). Constructed on the Host as contiguous blobs: easy to send to GPU (memcpy(gpu, ast, ast->size);), fast to send to GPU (single buffer write), simple to traverse with pointer arithmetic.
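    Host-side, that single buffer write could look roughly like this with the standard OpenCL API (a sketch; the context, queue, and blob are assumed to come from the caller, and ast_type mirrors slide 44):
      #include <CL/cl.h>

      /* Host-side view of the ast_type header from slide 44. */
      typedef struct { cl_uint id; cl_uint size; } ast_type;

      cl_mem upload_ast(cl_context ctx, cl_command_queue queue, ast_type* ast) {
          cl_int err;
          cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, ast->size, NULL, &err);
          clEnqueueWriteBuffer(queue, buf, CL_TRUE,   /* blocking write         */
                               0, ast->size, ast,     /* the whole blob at once */
                               0, NULL, NULL);
          return buf;
      }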
  • 51. Stack-based Interpreter
  • 52. Building Blocks: JS Type Structs, Lateral State, AST Traverse Stack, Call/Exec Stack, Return Stack, Scope Stack, Heap, Symbol/Ref Table, AST Traverse Loop, Interpret Loop.
  • 53. Kernels
      #include "state.h"
      #include "jsvm/asttraverse.h"
      #include "jsvm/interpreter.h"

      // Setup VM structures
      kernel void lateral_init(GPTR(uchar) lateral_heap) {
          LATERAL_STATE_INIT
      }

      // Interpret the AST
      kernel void lateral(GPTR(uchar) lateral_heap, GPTR(ast_type) lateral_ast) {
          LATERAL_STATE
          ast_push(lateral_ast);
          while (!Q_EMPTY(lateral_state->ast_stack, ast_q) || !Q_EMPTY(lateral_state->call_stack, call_q)) {
              while (!Q_EMPTY(lateral_state->ast_stack, ast_q))
                  traverse();
              if (!Q_EMPTY(lateral_state->call_stack, call_q))
                  interpret();
          }
      }
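    Dispatching these two kernels from the host might look roughly like this (a sketch; the program, queue, and buffers are assumed to exist already, and the single work item mirrors the single-threaded interpreter described later):
      #include <CL/cl.h>

      void run_lateral(cl_program program, cl_command_queue queue,
                       cl_mem heapBuf, cl_mem astBuf) {
          cl_int err;
          size_t one = 1;   // the whole AST is interpreted by a single work item
          cl_kernel init = clCreateKernel(program, "lateral_init", &err);
          cl_kernel interp = clCreateKernel(program, "lateral", &err);

          clSetKernelArg(init, 0, sizeof(cl_mem), &heapBuf);
          clEnqueueNDRangeKernel(queue, init, 1, NULL, &one, NULL, 0, NULL, NULL);

          clSetKernelArg(interp, 0, sizeof(cl_mem), &heapBuf);
          clSetKernelArg(interp, 1, sizeof(cl_mem), &astBuf);
          clEnqueueNDRangeKernel(queue, interp, 1, NULL, &one, NULL, 0, NULL, NULL);
          clFinish(queue);
      }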
  • 54. Let’s interpret... var x = 1 + 2;
  • 55. var x = 1 + 2; parses to a VariableDeclaration (kind "var") containing one VariableDeclarator, whose id is the Identifier "x" and whose init is a BinaryExpression "+" with left Literal 1 and right Literal 2. The AST, Call, and Return stacks start out empty.
  • 56.-67. (Build slides stepping through the stacks.) The traverse loop pushes VarDecl, then VarDtor, then Ident and Binary, then the two Literals onto the AST stack, moving visited nodes onto the Call stack. The interpret loop then pops the Call stack: the Identifier yields "x", the Literals yield 1 and 2, the BinaryExpression folds them into 3, and the VariableDeclarator binds x = 3, leaving all stacks empty.
  • 68. Benchmark
  • 69. Benchmark: small loop of FLOPs
      var input = new Array(10);
      for (var i = 0; i < input.length; i++) {
          input[i] = Math.pow((i + 1) / 1.23, 3);
      }
  • 70. Execution Time:
      Lateral GPU CL (ATI Radeon 6770m):   116.571533 ms
      CPU CL (Intel Core i7 4x2.4GHz):       0.226007 ms
      V8 (Intel Core i7 4x2.4GHz):           0.090664 ms
  • 72. What went wrong? Everything. Stack-based AST interpreter, no optimizations. Heavy global memory access, no optimizations. No data or task parallelism.
  • 73. Stack-based Interpreter: slow as molasses; memory hog, Eclipse style; heavy memory access ("var x = 1 + 2;" == 30 stack hits alone!); too much dynamic allocation; no inline optimizations, just following the yellow brick AST; straight up lazy. Replace with something better: bytecode compiler on the Host, bytecode register-based interpreter on the Device.
  • 74. Too much global access. Everything is dynamically allocated to global memory. A register based interpreter & bytecode compiler can make better use of local and private memory. Optimizing memory access yields crazy results:
      // 11.1207 seconds
      size_t tid = get_global_id(0);
      c[tid] = a[tid];
      while (b[tid] > 0) {   // touch global memory on each loop
          b[tid]--;          // touch global memory on each loop
          c[tid]++;          // touch global memory on each loop
      }

      // 0.0445558 seconds!! HOLY SHIT!
      size_t tid = get_global_id(0);
      int tmp = a[tid];                  // temp private variable
      for (int i = b[tid]; i > 0; i--)
          tmp++;                         // touch private variables on each loop
      c[tid] = tmp;                      // touch global memory one time
  • 75. No data or task parallelism. Everything being interpreted in a single "thread". We have hundreds of cores available to us! Build in heuristics: identify side-effect free statements, break into parallel tasks - very magical. The loop
      var input = new Array(10);
      for (var i = 0; i < input.length; i++) {
          input[i] = Math.pow((i + 1) / 1.23, 3);
      }
    breaks apart into independent statements:
      input[0] = Math.pow((0 + 1) / 1.23, 3);
      input[1] = Math.pow((1 + 1) / 1.23, 3);
      ...
      input[9] = Math.pow((9 + 1) / 1.23, 3);
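    Once the statements are known to be side-effect free, each one can map onto its own work item; roughly what that decomposition looks like as a hand-written OpenCL C kernel (a sketch, not generated Lateral code):
      // One work item per element: input[i] = pow((i + 1) / 1.23, 3)
      kernel void fill_input(global float* input) {
          size_t i = get_global_id(0);
          float x = (i + 1) / 1.23f;   // same expression the JS loop computes
          input[i] = pow(x, 3.0f);
      }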
  • 76. What's in store: acceptable performance on all CL devices; V8/Node extension to launch Lateral tasks; high-level API to perform map-reduce, etc.; Lateral-cluster...mmmmm
  • 77. Thanks! Jarred Nicholls | @jarrednicholls | jarred@webkit.org