10. Why JavaScript on the GPU?
Better question:
Why a GPU?
A: They’re fast!
(well, at certain things...)
11. GPUs are fast b/c...
Totally different paradigm from CPUs
Data parallelism vs. Task parallelism
Stream processing vs. Sequential processing
GPUs can divide-and-conquer
Hardware capable of a large number of “threads”
e.g. ATI Radeon HD 6770M:
480 stream processing units == 480 cores
Typically very high memory bandwidth
Many, many GigaFLOPs
12. GPUs don’t solve all problems
Not all tasks can be accelerated by GPUs
Tasks must be parallelizable, i.e.:
Side effect free
Homogeneous and/or streamable
Overall speedup is bounded by Amdahl’s Law (quick refresher below)
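As a refresher, Amdahl’s Law bounds the speedup of a task in which only a fraction P of the work can be parallelized across N processors:

speedup = 1 / ((1 - P) + P / N)

With P = 0.9 on 480 cores, the ceiling is 1 / (0.1 + 0.9/480) ≈ 9.8x, nowhere near 480x.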
16. LateralJS
Our Mission
To make JavaScript a first-class citizen on all GPUs
and take advantage of hardware-accelerated
operations & data parallelism.
17. Our Options
OpenCL                    | Nvidia CUDA
--------------------------|---------------------------
AMD, Nvidia, Intel, etc.  | Nvidia only
A shitty version of C99   | C++ (“C for CUDA”)
No dynamic memory         | Dynamic memory
No recursion              | Recursion
No function pointers      | Function pointers
Terrible tooling          | Great dev. tooling
Immature (arguably)       | More mature (arguably)
19. Why not a Static Compiler?
We want full JavaScript support
Object / prototype
Closures
Recursion
Functions as objects
Variable typing
Type Inference limitations
Reasonably limited to the size and complexity of “kernel-esque” functions
Not nearly insane enough
21. Why an Interpreter?
We want it all baby - full JavaScript support!
Most insane approach
Challenging to make it good, but holds a lot of promise
24. Oh the agony...
Multiple memory spaces - pointer hell
No recursion - all inlined functions
No standard libraries (no libc)
No dynamic memory
No standard data structures - apart from vector ops
Buggy ass AMD/Nvidia compilers
26. Multiple Memory Spaces
In order of fastest to slowest:

space    | description
---------|--------------------------------------------------------------
private  | very fast; stream processor cache (~64KB); scoped to a single work item
local    | fast; ~= L1 cache on CPUs (~64KB); scoped to a single work group
global   | slow, by orders of magnitude; ~= system memory over a slow bus; all the VRAM on the card (MBs)
constant | available to all work groups/items
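A minimal sketch of how these spaces appear in OpenCL C (the kernel and buffer names here are made up for illustration):

kernel void spaces_demo(global int* in,       // global: big but slow
                        constant int* lut) {  // constant: visible to all work items
  local int tile[64];             // local: shared by one work group
  int x = in[get_global_id(0)];   // private: implicit for plain locals
  tile[get_local_id(0)] = x + lut[0];
  barrier(CLK_LOCAL_MEM_FENCE);   // sync the work group before sharing
}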
27. Memory Space Pointer Hell
global uchar* gptr = (global uchar*) 0x1000;
local uchar* lptr = (local uchar*) gptr;  // FAIL!
uchar* pptr = (uchar*) gptr;              // FAIL! private is implicit
0x1000 points to something different depending on the address space!
28. Memory Space Pointer Hell
Pointers must always be fully qualified
Macros to help ease the pain
#define GPTR(TYPE) global TYPE*
#define CPTR(TYPE) constant TYPE*
#define LPTR(TYPE) local TYPE*
#define PPTR(TYPE) private TYPE*
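The macros also keep signatures readable; for example (a made-up helper, not from the Lateral source):

// copy a byte from constant memory into global memory
void copy_byte(GPTR(uchar) dst, CPTR(uchar) src) {
  *dst = *src;
}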
29. No Recursion!?!?!?
No call stack
All functions are inlined to the kernel function
uint factorial(uint n) {
  if (n <= 1)
    return 1;
  else
    return n * factorial(n - 1); // compile-time error
}
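Since everything gets inlined, recursion has to be rewritten as iteration. A minimal sketch of the loop equivalent:

uint factorial(uint n) {
  uint result = 1;
  for (uint i = 2; i <= n; i++) // same math, no call stack required
    result *= i;
  return result;
}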
33. Yes! dynamic memory
Create a large buffer of global memory - our “heap”
Implement our own malloc() and free()
Create a handle structure - “virtual memory”
P(T, hnd) macro to get the current pointer address
GPTR(handle) hnd = malloc(sizeof(uint));
GPTR(uint) ptr = P(uint, hnd);
*ptr = 0xdeadbeef;
free(hnd);
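A minimal sketch of what the handle struct and P() macro could look like (the field names and the heap pointer are assumptions, not the actual Lateral code):

typedef struct {
  uint offset; // byte offset into the global “heap” buffer
  uint size;   // size of the allocation in bytes
} handle;

// assumes ‘heap’ is the GPTR(uchar) heap buffer passed into the kernel
#define P(TYPE, hnd) ((GPTR(TYPE)) (heap + (hnd)->offset))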
55. Lateral AST structs
Shared across the Host and the OpenCL runtime
Host writes, Lateral reads
Constructed on Host as contiguous blobs
Easy to send to GPU: memcpy(gpu, ast, ast->size);
Fast to send to GPU, single buffer write
Simple to traverse w/ pointer arithmetic
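A minimal sketch of a flattened node layout (the fields are illustrative; the real Lateral structs may differ):

typedef struct {
  uint type;     // node kind: literal, identifier, binary op, ...
  uint size;     // total bytes for this node plus all of its children
  uint num_kids; // children are packed contiguously after the header
} ast_node;

// next sibling = this node’s address plus its size:
// GPTR(ast_node) next = (GPTR(ast_node)) ((GPTR(uchar)) node + node->size);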
79. Stack-based Interpreter
Slow as molasses
Memory hog, Eclipse-style
Heavy memory access
“var x = 1 + 2;” == 30 stack hits alone!
Too much dynamic allocation
No inline optimizations, just following the yellow brick AST
Straight up lazy
Replace with something better!
Bytecode compiler on Host
Register-based bytecode interpreter on Device
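A minimal sketch of what a register-based encoding might look like (the opcode names and layout are made up for illustration):

typedef enum { OP_LOADK, OP_ADD, OP_STORE } opcode;

typedef struct {
  uchar op; // opcode
  uchar a;  // destination register
  uchar b;  // operand register / constant index
  uchar c;  // operand register / constant index
} instr;

// "var x = 1 + 2;" could compile down to three instructions:
//   LOADK r0, k0       (k0 = 1)
//   LOADK r1, k1       (k1 = 2)
//   ADD   r2, r0, r1   (operands stay in registers, not on a stack)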
81. Too much global access
Everything is dynamically allocated to global memory
A register-based interpreter & bytecode compiler can make better use of local and private memory
Optimizing memory access yields crazy results:

// 11.1207 seconds
size_t tid = get_global_id(0);
c[tid] = a[tid];
while (b[tid] > 0) { // touch global memory on each loop
  b[tid]--;          // touch global memory on each loop
  c[tid]++;          // touch global memory on each loop
}

// 0.0445558 seconds!! HOLY SHIT!
size_t tid = get_global_id(0);
int tmp = a[tid];                        // temp private variable
for (int i = b[tid]; i > 0; i--) tmp++;  // touch private variables on each loop
c[tid] = tmp;                            // touch global memory one time
82. No data or task parallelism
Everything being interpreted in a single “thread”
We have hundreds of cores available to us!
Build in heuristics
Identify side-effect free statements
Break into parallel tasks - very magical
var input = new Array(10);
for (var i = 0; i < input.length; i++) {
  input[i] = Math.pow((i + 1) / 1.23, 3);
}

...unrolls into ten independent, side-effect-free assignments:

input[0] = Math.pow((0 + 1) / 1.23, 3);
input[1] = Math.pow((1 + 1) / 1.23, 3);
...
input[9] = Math.pow((9 + 1) / 1.23, 3);
83. What’s in store
Acceptable performance on all CL devices
V8/Node extension to launch Lateral tasks
High-level API to perform map-reduce, etc.
Lateral-cluster...mmmmm