JavaScript on the GPU


I experimented with running JavaScript on the GPU - see how the first iteration of the experiment went.



  1. If you don’t get this ref...shame on you
  2. Jarred Nicholls @jarrednicholls jarred@webkit.org
  3. Work @ Sencha Web Platform Team Doing webkitty things...
  4. WebKit Committer
  5. Co-Author W3C Web Cryptography API
  6. JavaScript on the GPU
  7. What I’ll blabber about today Why JavaScript on the GPU Running JavaScript on the GPU What’s to come...
  8. Why JavaScript on the GPU?
  9. Why JavaScript on the GPU? Better question: Why a GPU?
  10. Why JavaScript on the GPU? Better question: Why a GPU? A: They’re fast! (well, at certain things...)
  11. GPUs are fast b/c... Totally different paradigm from CPUs: data parallelism vs. task parallelism, stream processing vs. sequential processing - GPUs can divide-and-conquer. Hardware capable of a large number of “threads”, e.g. ATI Radeon HD 6770m: 480 stream processing units == 480 cores. Typically very high memory bandwidth. Many, many GigaFLOPs.
  12. GPUs don’t solve all problems. Not all tasks can be accelerated by GPUs: tasks must be parallelizable, i.e. side-effect free and homogeneous and/or streamable. Overall speedup is limited by Amdahl’s Law.
  13. Let’s find out...
  14. Experiment Code Name “LateralJS”
  15. LateralJS Our Mission To make JavaScript a first-class citizen on all GPUs and take advantage of hardware accelerated operations & data parallelization.
  16. Our Options: OpenCL vs. Nvidia CUDA. OpenCL: AMD, Nvidia, Intel, etc.; a shitty version of C99; no dynamic memory; no recursion; no function pointers; terrible tooling; immature (arguably). CUDA: Nvidia only; C++ (C for CUDA); dynamic memory; recursion; function pointers; great dev. tooling; more mature (arguably).
  17. Our Options: (same comparison, with OpenCL highlighted - it runs on all vendors’ hardware.)
  18. Why not a Static Compiler? We want full JavaScript support: object/prototype, closures, recursion, functions as objects, variable typing. Type inference limitations: reasonably limited to the size and complexity of “kernel-esque” functions. Not nearly insane enough.
  19. Why an Interpreter? We want it all baby - full JavaScript support! Most insane approach Challenging to make it good, but holds a lot of promise
  20. OpenCL Headaches
  21. Oh the agony... Multiple memory spaces - pointer hell No recursion - all inlined functions No standard libc libraries No dynamic memory No standard data structures - apart from vector ops Buggy ass AMD/Nvidia compilers
  22. Multiple Memory Spaces In order of fastest to slowest: private - very fast, stream processor cache (~64KB), scoped to a single work item; local - fast, ~= L1 cache on CPUs (~64KB), scoped to a single work group; constant - available to all work groups/items; global - slow, by orders of magnitude, ~= system memory over a slow bus, all the VRAM on the card (MBs).
  23. Memory Space Pointer Hell global uchar* gptr = 0x1000; local uchar* lptr = (local uchar*)gptr; // FAIL! uchar* pptr = (uchar*)gptr; // FAIL! (private is implicit). The same address 0x1000 points to something different depending on the address space - global, local, or private!
  24. Memory Space Pointer Hell Pointers must always be fully qualified Macros to help ease the pain #define GPTR(TYPE) global TYPE* #define CPTR(TYPE) constant TYPE* #define LPTR(TYPE) local TYPE* #define PPTR(TYPE) private TYPE*
  25. No Recursion!?!?!? No call stack All functions are inlined to the kernel function uint factorial(uint n) { if (n <= 1) return 1; else return n * factorial(n - 1); // compile-time error }
  26. No standard libc libraries memcpy? strcpy? strcmp? etc...
  27. No standard libc libraries Implement our own #define MEMCPY(NAME, DEST_AS, SRC_AS) DEST_AS void* NAME(DEST_AS void*, SRC_AS const void*, uint); DEST_AS void* NAME(DEST_AS void* dest, SRC_AS const void* src, uint size) { DEST_AS uchar* cDest = (DEST_AS uchar*)dest; SRC_AS const uchar* cSrc = (SRC_AS const uchar*)src; for (uint i = 0; i < size; i++) cDest[i] = cSrc[i]; return (DEST_AS void*)cDest; } PTR_MACRO_DEST_SRC(MEMCPY, memcpy) Produces memcpy_g memcpy_gc memcpy_lc memcpy_pc memcpy_l memcpy_gl memcpy_lg memcpy_pg memcpy_p memcpy_gp memcpy_lp memcpy_pl
  28. No dynamic memory No malloc() No free() What to do...
  29. Yes! dynamic memory Create a large buffer of global memory - our “heap” Implement our own malloc() and free() Create a handle structure - “virtual memory” P(T, hnd) macro to get the current pointer address GPTR(handle) hnd = malloc(sizeof(uint)); GPTR(uint) ptr = P(uint, hnd); *ptr = 0xdeadbeef; free(hnd);
  30. Ok, we get the point... FYL!
  31.–35. High-level Architecture (build slides). Components: V8 with its Data Heap and Esprima Parser on the Host, plus the Garbage Collector, Data Serializer & Marshaller, and Device Mgr; the Stack-based Interpreter runs on the GPUs. Flow: eval(code) → Esprima builds the JSON AST in V8 → serialize the AST (JSON => C structs) → ship to GPU to interpret → fetch the result back to the Host.
  36. AST Generation
  37. AST Generation pipeline: JavaScript Source → Esprima in V8 → JSON AST (v8::Object) → Lateral AST (C structs)
  38. Embed esprima.js Resource Generator $ resgen esprima.js resgen_esprima_js.c
  39. Embed esprima.js resgen_esprima_js.c const unsigned char resgen_esprima_js[] = { 0x2f, 0x2a, 0x0a, 0x20, 0x20, 0x43, 0x6f, 0x70, 0x79, 0x72, 0x69, 0x67, 0x68, 0x74, 0x20, 0x28, 0x43, 0x29, 0x20, 0x32, ... 0x20, 0x3a, 0x20, 0x2a, 0x2f, 0x0a, 0x0a, 0 };
  40. Embed esprima.js ASTGenerator.cpp extern const char resgen_esprima_js; void ASTGenerator::init() { HandleScope scope; s_context = Context::New(); s_context->Enter(); Handle<Script> script = Script::Compile(String::New(&resgen_esprima_js)); script->Run(); s_context->Exit(); s_initialized = true; }
  41. Build JSON AST e.g. ASTGenerator::esprimaParse( "var xyz = new Array(10);" );
  42. Build JSON AST Handle<Object> ASTGenerator::esprimaParse(const char* javascript) { if (!s_initialized) init(); HandleScope scope; s_context->Enter(); Handle<Object> global = s_context->Global(); Handle<Object> esprima = Handle<Object>::Cast(global->Get(String::New("esprima"))); Handle<Function> esprimaParse = Handle<Function>::Cast(esprima->Get(String::New("parse"))); Handle<String> code = String::New(javascript); Handle<Object> ast = Handle<Object>::Cast(esprimaParse->Call(esprima, 1, (Handle<Value>*)&code)); s_context->Exit(); return scope.Close(ast); }
  43. Build JSON AST { "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "xyz" }, "init": { "type": "NewExpression", "callee": { "type": "Identifier", "name": "Array" }, "arguments": [ { "type": "Literal", "value": 10 } ] } } ], "kind": "var" }
  44. Lateral AST structs - structs shared between Host and OpenCL: #ifdef __OPENCL_VERSION__ #define CL(TYPE) TYPE #else #define CL(TYPE) cl_##TYPE #endif typedef struct ast_type_st { CL(uint) id; CL(uint) size; } ast_type; typedef struct ast_program_st { ast_type type; CL(uint) body; CL(uint) numBody; } ast_program; typedef struct ast_identifier_st { ast_type type; CL(uint) name; } ast_identifier;
  45. Lateral AST structs v8::Object => ast_type expanded ast_type* vd1_1_init_id = (ast_type*)astCreateIdentifier("Array"); ast_type* vd1_1_init_args[1]; vd1_1_init_args[0] = (ast_type*)astCreateNumberLiteral(10); ast_type* vd1_1_init = (ast_type*)astCreateNewExpression(vd1_1_init_id, vd1_1_init_args, 1); free(vd1_1_init_id); for (int i = 0; i < 1; i++) free(vd1_1_init_args[i]); ast_type* vd1_1_id = (ast_type*)astCreateIdentifier("xyz"); ast_type* vd1_decls[1]; vd1_decls[0] = (ast_type*)astCreateVariableDeclarator(vd1_1_id, vd1_1_init); free(vd1_1_id); free(vd1_1_init); ast_type* vd1 = (ast_type*)astCreateVariableDeclaration(vd1_decls, 1, "var"); for (int i = 0; i < 1; i++) free(vd1_decls[i]);
  46. Lateral AST structs astCreateIdentifier ast_identifier* astCreateIdentifier(const char* str) { CL(uint) size = sizeof(ast_identifier) + rnd(strlen(str) + 1, 4); ast_identifier* ast_id = (ast_identifier*)malloc(size); // copy the string strcpy((char*)(ast_id + 1), str); // fill the struct ast_id->type.id = AST_IDENTIFIER; ast_id->type.size = size; ast_id->name = sizeof(ast_identifier); // offset return ast_id; }
  47. Lateral AST structs astCreateIdentifier(“xyz”) layout: offset 0, type.id = AST_IDENTIFIER (0x01); offset 4, type.size = 16; offset 8, name = 12 (offset); offsets 12–15, str = ‘x’, ‘y’, ‘z’, ‘\0’.
  48. Lateral AST structs astCreateNewExpression ast_expression_new* astCreateNewExpression(ast_type* callee, ast_type** arguments, int numArgs) { CL(uint) size = sizeof(ast_expression_new) + callee->size; for (int i = 0; i < numArgs; i++) size += arguments[i]->size; ast_expression_new* ast_new = (ast_expression_new*)malloc(size); ast_new->type.id = AST_NEW_EXPR; ast_new->type.size = size; CL(uint) offset = sizeof(ast_expression_new); char* dest = (char*)ast_new; // copy callee memcpy(dest + offset, callee, callee->size); ast_new->callee = offset; offset += callee->size; // copy arguments if (numArgs) { ast_new->arguments = offset; for (int i = 0; i < numArgs; i++) { ast_type* arg = arguments[i]; memcpy(dest + offset, arg, arg->size); offset += arg->size; } } else ast_new->arguments = 0; ast_new->numArguments = numArgs; return ast_new; }
  49. Lateral AST structs new Array(10) layout: offset 0, type.id = AST_NEW_EXPR (0x308); offset 4, type.size = 52; offset 8, callee = 20 (offset); offset 12, arguments = 40 (offset); offset 16, numArguments = 1; offset 20, callee node = ast_identifier (“Array”); offset 40, arguments node = ast_literal_number (10).
  50. Lateral AST structs Shared across the Host and the OpenCL runtime Host writes, Lateral reads Constructed on Host as contiguous blobs Easy to send to GPU: memcpy(gpu, ast, ast->size); Fast to send to GPU, single buffer write Simple to traverse w/ pointer arithmetic
  51. Stack-based Interpreter
  52. Building Blocks: JS Type Structs; Lateral State - AST Traverse Stack, Call/Exec Stack, Return Stack, Scope Stack, Heap, Symbol/Ref Table; AST Traverse Loop; Interpret Loop.
  53. Kernels #include "state.h" #include "jsvm/asttraverse.h" #include "jsvm/interpreter.h" // Setup VM structures kernel void lateral_init(GPTR(uchar) lateral_heap) { LATERAL_STATE_INIT } // Interpret the AST kernel void lateral(GPTR(uchar) lateral_heap, GPTR(ast_type) lateral_ast) { LATERAL_STATE ast_push(lateral_ast); while (!Q_EMPTY(lateral_state->ast_stack, ast_q) || !Q_EMPTY(lateral_state->call_stack, call_q)) { while (!Q_EMPTY(lateral_state->ast_stack, ast_q)) traverse(); if (!Q_EMPTY(lateral_state->call_stack, call_q)) interpret(); } }
  54. Let’s interpret... var x = 1 + 2;
  55. var x = 1 + 2; produces this Esprima AST: { "type": "VariableDeclaration", "declarations": [ { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "x" }, "init": { "type": "BinaryExpression", "operator": "+", "left": { "type": "Literal", "value": 1 }, "right": { "type": "Literal", "value": 2 } } } ], "kind": "var" }
  56.–67. (Build slides: the same AST shown while the AST, Call, and Return stacks evolve. Traversal pushes VarDecl onto the AST stack, expands it to VarDtor, then pushes Ident and the BinaryExpression's two Literals onto the AST stack with VarDtor and Binary on the Call stack. Interpretation then resolves Ident to “x”, pushes 1 and 2 onto the Return stack, reduces the BinaryExpression to 3, and finally VarDtor binds x = 3, leaving all stacks empty.)
  68. Benchmark
  69. Benchmark Small loop of FLOPs var input = new Array(10); for (var i = 0; i < input.length; i++) { input[i] = Math.pow((i + 1) / 1.23, 3); }
  70. Execution Time: Lateral GPU CL (ATI Radeon 6770m) - 116.571533ms; CPU CL (Intel Core i7 4x2.4Ghz) - 0.226007ms; V8 (Intel Core i7 4x2.4Ghz) - 0.090664ms.
  72. What went wrong? Everything. Stack-based AST interpreter with no optimizations; heavy global memory access with no optimizations; no data or task parallelism.
  73. Stack-based Interpreter: slow as molasses; a memory hog, Eclipse style - heavy memory access (“var x = 1 + 2;” == 30 stack hits alone!), too much dynamic allocation; no inline optimizations, just following the yellow brick AST; straight up lazy. Replace it with something better: a bytecode compiler on the Host and a register-based bytecode interpreter on the Device.
  74. Too much global access Everything is dynamically allocated in global memory. A register-based interpreter & bytecode compiler can make better use of local and private memory. Optimizing memory access yields crazy results:
      // 11.1207 seconds
      size_t tid = get_global_id(0);
      c[tid] = a[tid];
      while (b[tid] > 0) {    // touch global memory on each loop
          b[tid]--;           // touch global memory on each loop
          c[tid]++;           // touch global memory on each loop
      }
      // 0.0445558 seconds!! HOLY SHIT!
      size_t tid = get_global_id(0);
      int tmp = a[tid];                  // temp private variable
      for (int i = b[tid]; i > 0; i--)
          tmp++;                         // touch private variables on each loop
      c[tid] = tmp;                      // touch global memory one time
  75. No data or task parallelism Everything is being interpreted in a single “thread”, yet we have hundreds of cores available to us! Build in heuristics: identify side-effect-free statements and break them into parallel tasks - very magical. E.g.:
      var input = new Array(10);
      for (var i = 0; i < input.length; i++) {
          input[i] = Math.pow((i + 1) / 1.23, 3);
      }
      becomes
      input[0] = Math.pow((0 + 1) / 1.23, 3);
      input[1] = Math.pow((1 + 1) / 1.23, 3);
      ...
      input[9] = Math.pow((9 + 1) / 1.23, 3);
  76. What’s in store Acceptable performance on all CL devices V8/Node extension to launch Lateral tasks High-level API to perform map-reduce, etc. Lateral-cluster...mmmmm
  77. Thanks! Jarred Nicholls @jarrednicholls jarred@webkit.org