Speculative Execution of Parallel Programs
with Precise Exception Semantics on GPUs
LCPC 2013, Qualcomm
Akihiro Hayashi, Max Grossman,
Jisheng Zhao, Jun Shirako, Vivek Sarkar
Rice University, Houston, TX, USA
Background:
GPGPU and Java
The initial wave of programming models for GPGPU provided low-level APIs:
• CUDA (NVIDIA)
• OpenCL (Khronos)
→ Often faster than natively running C/C++ applications

High-level languages such as Java provide high-productivity features:
• Type safety
• Garbage collection
• Precise exception semantics
Motivation:
GPU Execution From Java

JNI glue code and OpenCL API calls (host side):

JNIEXPORT void JNICALL Java_Test(…) {
  void *aptr = (*env)->GetPrimitiveArrayCritical(env, arrays, 0);
  ...
  /* Create Buffer */
  Aobj = clCreateBuffer(context, …);
  /* Host to Device Communication */
  clEnqueueWriteBuffer(queue, Aobj, …);
  /* Kernel Compilation */
  /* Kernel Invocation */
  …
  (*env)->ReleasePrimitiveArrayCritical(env, arrays, aptr, 0);
}

OpenCL kernel (device side):

__kernel
void run(…) {
  int gid = …;
  A[gid] = …;
}

Utilizing the GPU from Java adds a non-trivial amount of work.
Related Work:
Rootbeer

Still requires special API invocations in addition to the computation body.

Rootbeer API (host side):

int[][] arrays = new int[N][M];
int[] result = new int[N];
... arrays initialization ...
List<Kernel> jobs = new ArrayList<Kernel>();
for (int i = 0; i < N; i++) {
  jobs.add(new ArraySumKernel(arrays[i], result, i));
}
Rootbeer rootbeer = new Rootbeer();
rootbeer.runAll(jobs);

Computation body (Kernel class):

class ArraySumKernel implements Kernel {
  private int[] source;
  private int[] ret;
  private int index;
  ArraySumKernel(int[] source, int[] ret, int index) {
    this.source = source;
    this.ret = ret;
    this.index = index;
  }
  public void gpuMethod() {
    int sum = 0;
    for (int i = 0; i < source.length; i++) {
      sum += source[i];
    }
    ret[index] = sum;
  }
}
Our Approach:
HJ-OpenCL Overview

Automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct, forall.

Built on top of Habanero-Java (HJ):
• “Habanero-Java: the New Adventures of Old X10”, Cavé et al. (PPPJ’11)
• “Accelerating Habanero-Java Programs with OpenCL Generation”, Hayashi et al. (PPPJ’13)
  – user-provided “safe” constructs

OpenCL acceleration with precise exception semantics is our primary contribution.
Overview of the Habanero-Java (HJ) Language

New language and implementation developed at Rice since 2007:
• Derived from the Java-based version of the X10 language (v1.5) in 2007
• HJ is currently an extension of Java 1.4
• All Java 5 & 6 libraries and classes can be called from HJ programs
Overview of the Habanero-Java (HJ) Language (Cont’d)

HJ’s parallel extensions are focused on task parallelism:
1. Dynamic task creation & termination: async, finish, force, forall, foreach
2. Collective and point-to-point synchronization: phaser, next
3. Mutual exclusion and isolation: isolated
4. Locality control (task and data distributions): places, here

Sequential HJ extensions added for convenience: extern, point, region, pointwise for, complex data type, array views.

Habanero-C and Habanero-Scala are also available with similar constructs. A small HJ-style example appears below.
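A small HJ-style sketch of these constructs (illustrative only; work1, work2, f, in, and out are hypothetical):

// finish waits for all transitively spawned asyncs
finish {
  async { work1(); } // child task
  async { work2(); } // runs in parallel with work1
}
// forall is a parallel loop over a rectangular region (with an implicit finish)
forall (point [i] : [0:N-1]) {
  out[i] = f(in[i]);
}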
HJ-OpenCL Implementation:
HJ-OpenCL Example

→ Programmers can utilize OpenCL by just replacing for with forall.

public class ArraySum {
  public static void main(String[] args) {
    int[] base = new int[N*M];
    int[] result = new int[N];
    int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]);
    ... initialization ...
    boolean isSafe = ...;
    forall (point [i] : [0:N-1]) {
      result[i] = 0;
      for (int j = 0; j < M; j++) {
        result[i] += arrays[i,j];
      }
    }
  }
}
The Compilation Flow

The HJ compiler translates an HJ program into three files:
1. .class files (bytecode) that run on the JVM
2. OpenCL_hjstub.c (JNI glue code)
3. OpenCLKernel.class (bytecode)

The APARAPI translator converts OpenCLKernel.class into an OpenCL kernel (Kernel.c). A C compiler then compiles the JNI glue code together with the kernel into a native library (.so, .dll, .dylib), which the JVM on the host loads through JNI to drive the OpenCL device.
APARAPI

Open-source project for data-parallel Java from AMD:
• https://code.google.com/p/aparapi/
• APARAPI converts Java bytecode to an OpenCL kernel at runtime

Restriction:
• Can only handle primitive types, not objects

→ We prepared a static version of APARAPI to reduce runtime compilation overhead; a sketch of the APARAPI programming model appears below.
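For context, a minimal sketch of the dynamic APARAPI programming model whose bytecode-to-OpenCL translation the static version reuses (the class name, array names, and size are illustrative):

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class VectorAdd {
  public static void main(String[] args) {
    final int n = 1024;
    final float[] a = new float[n], b = new float[n];
    final float[] sum = new float[n];
    Kernel kernel = new Kernel() {
      @Override public void run() {
        int gid = getGlobalId();     // one work-item per element
        sum[gid] = a[gid] + b[gid];
      }
    };
    kernel.execute(Range.create(n)); // run()'s bytecode is translated to OpenCL here
    kernel.dispose();
  }
}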
Acceleration vs. Exception Semantics

            | Safe? | High performance?
Java        | Yes   | No
OpenCL/CUDA | No    | Yes

→ Mix Java and OpenCL/CUDA!

Pictures borrowed from http://wot.motortrend.com/ and
http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html
Basic Idea:
Non-Speculative Exception Checking

Run the exception checking code (in parallel) on the multicore CPU first; the GPU runs only if no exception occurred. [Figure: timeline showing exception checking on the multicore CPU, followed by host-to-device data transfer, computation, and device-to-host data transfer on the GPU.]

Non-speculative exception checking code (the checking forall is the same as the original forall except that array stores become reads):

boolean excpFlag = false;
/* (1) Exception Checking Code on JVM */
try {
  forall (point [i] : [0:N-1]) {
    … = A[i];
  }
} catch (Exception e) {
  excpFlag = true;
}
/* (2) JNI Call */
if (!excpFlag) {
  openCL_Kernel();
} else {
  // Original Implementation on JVM
  forall() { … }
}
Proposed Idea:
Speculative Exception Checking

Run the exception checking code on the multicore CPU in parallel with speculative execution on the GPU. [Figure: timeline showing exception checking on the multicore CPU overlapping the GPU's host-to-device data transfer and computation, followed by device-to-host data transfer when no exception occurs.]

Speculative exception checking code (again, the same as the original forall except that array stores become reads):

boolean excpFlag = false;
/* (1) JNI Call: start GPU execution speculatively */
openCL_Kernel1();
/* (2) Exception Checking Code on JVM */
try {
  forall (point [i] : [0:N-1]) {
    … = A[i];
  }
} catch (Exception e) {
  excpFlag = true;
}
/* (3) JNI Call: commit or abort the speculative execution */
openCL_Kernel2(excpFlag);
if (excpFlag) {
  // Original Implementation on JVM
  forall() { … }
}
Opportunity for Optimization

Target exceptions that can occur during GPU execution (given the APARAPI restriction):
• ArrayIndexOutOfBoundsException
• ArithmeticException (division by zero)
• NullPointerException

What kind of “optimization”? At compile time, delete statements that cannot cause the above exceptions, to accelerate exception checking.
The Exception Checking Code Optimization Algorithm

Key idea: delete statements that do not feed array subscripts or the denominator of a division, taking control flow into account.

Before:
i = …;
X = …;
Y = …;
… = A[i] + X;
… = B[i] / Y;

After (i and Y must be kept, since they feed an array subscript and a denominator; the assignment to X is deleted because it feeds neither):
i = …;
Y = …;
… = A[i];
… = B[i] / Y;
Exception Checking Code Optimization Example

forall (point [i] : [0:N-1]) {
  A[Index[i]] = B[i] + C[i];
}

// (1) Original IR
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
$i5 = $i3 + $i4;
A[$i2] = $i5;

// (2) The array store becomes a dummy read; statements that feed
//     array subscripts are marked
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
$i5 = $i3 + $i4;
dummy = A[$i2];

// (3) Optimized code: the unmarked statement $i5 = $i3 + $i4 is deleted
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
dummy = A[$i2];

A source-level sketch of the resulting checking code appears below.
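For intuition, a source-level view of the checking code this optimization could produce for the forall above (a sketch under assumed naming, not the exact compiler output; excpFlag is the flag from the earlier checking-code listings):

try {
  forall (point [i] : [0:N-1]) {
    int t1 = Index[i]; // kept: may throw ArrayIndexOutOfBoundsException or NullPointerException
    int t2 = B[i];     // kept: the array accesses themselves can throw
    int t3 = C[i];
    int dummy = A[t1]; // the store A[t1] = … becomes a dummy read; the addition is deleted
  }
} catch (Exception e) {
  excpFlag = true;
}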
Benchmarks

Benchmark     | Suite      | Data Size          | Remarks
SparseMatmult | JGF        | N = 500,000        | Sparse matrix
Doitgen       | Polybench  | 128x128x128        |
Crypt         | JGF        | N = 50,000,000     |
Blackscholes  |            | 16,777,216 options |
MRIQ          | Parboil    | 64x64x64           |
MatMult       |            | 1024x1024          |
SAXPY         |            | N = 25,000x25,000  | Sparse matrix
GEMVER        | SparseBLAS | 10,000,000         | Sparse matrix
Platforms

             | AMD A10-5800K                | Westmere
CPU          | 4 cores (APU)                | 6 cores x 2 (Xeon 5660)
GPU          | Radeon HD 7660D, 384 cores   | NVIDIA Tesla M2050, 448 cores
Java Runtime | JRE (build 1.6.0_21-b06)     | JRE (build 1.6.0_25-b06)
JVM          | HotSpot 64-Bit Server VM (build 17.0-b11, mixed mode) | HotSpot 64-Bit Server VM (build 20.0-b11, mixed mode)
Experimental Methodologies

We tested execution in the following modes:
• Sequential Java
• HJ-OpenCL
  – No Checking (Unsafe)
  – Non-Speculative Exception Checking (Optimized/Unoptimized)
  – Speculative Exception Checking (Optimized/Unoptimized)
Result on AMD (A10-5800K):
No Checking vs. Speculative Checking

Up to 18% slowdown while maintaining Java exception semantics.

Speedup relative to sequential Java (higher is better):

Benchmark         | No checking (UNSAFE) | Optimized Speculative (SAFE, PROPOSED)
JGF-SparseMatMult | 2.4                  | 2.1
Polybench Doitgen | 0.2                  | 0.2
JGF-Crypt         | 3.6                  | 3.0
Black-Scholes     | 9.8                  | 8.6
MRIQ              | 21.3                 | 21.1
MatMult           | 12.6                 | 11.7
SAXPY             | 5.1                  | 4.5
GEMVER            | 9.9                  | 9.6
Result on AMD (A10-5800K):
Slowdown for Exception Checking

Speedup relative to no-checking (higher is better):

Benchmark         | Non-Spec., Unopt. | Spec., Unopt. | Non-Spec., Opt. | Spec., Opt.
JGF-SparseMatMult | 0.9               | 0.9           | 0.9             | 0.9
Polybench Doitgen | 1.0               | 1.0           | 1.0             | 1.0
JGF-Crypt         | 0.4               | 0.4           | 0.8             | 0.8
Black-Scholes     | 0.2               | 0.2           | 0.9             | 0.9
MRIQ              | 0.1               | 0.1           | 1.0             | 1.0
MatMult           | 0.9               | 0.9           | 0.9             | 0.9
SAXPY             | 0.9               | 0.9           | 0.9             | 0.9
GEMVER            | 0.9               | 0.9           | 1.0             | 1.0

Optimization yields a clear speedup for JGF-Crypt, Black-Scholes, and MRIQ; elsewhere, non-speculative and speculative checking show almost the same performance.
Analysis of Results on AMD:
Checking Code Optimization Issues

With speculative execution, optimization is effective only if the exception checking code is on the critical path:
• JGF-Crypt, BlackScholes, MRIQ, GEMVER

[Figure: timeline in which unoptimized exception checking on the multicore CPU outlasts the GPU's host-to-device data transfer, computation, and device-to-host data transfer, so the exception checking code is on the critical path.]
Analysis of Results on AMD:
Speculative Execution Issues

Speculation does not accelerate program execution much when GPU execution takes far longer than checking:
• JGF-SparseMatMult, Doitgen, MatMult, SAXPY

[Figure: non-speculative and speculative timelines side by side; checking on the multicore CPU is short relative to the GPU's host-to-device data transfer, computation, and device-to-host data transfer, so overlapping it saves little time.]
Result on Westmere:
No Checking vs. Speculative Checking

Up to 22% slowdown while maintaining Java exception semantics.

Speedup relative to sequential Java (log scale in the original chart; higher is better):

Benchmark         | No checking (UNSAFE) | Optimized Speculative (SAFE, PROPOSED)
JGF-SparseMatMult | 9.8                  | 7.7
Polybench Doitgen | 2.7                  | 2.8
JGF-Crypt         | 14.4                 | 13.6
Black-Scholes     | 37.8                 | 35.2
MRIQ              | 330.8                | 331.0
MatMult           | 43.1                 | 44.1
SAXPY             | 8.8                  | 7.6
GEMVER            | 22.3                 | 22.4
Result on Westmere:
Slowdown for Exception Checking

Both speculation and optimization are effective.

Speedup relative to no-checking (higher is better):

Benchmark         | Non-Spec., Unopt. | Spec., Unopt. | Non-Spec., Opt. | Spec., Opt.
JGF-SparseMatMult | 0.6               | 0.7           | 0.7             | 0.8
Polybench Doitgen | 0.9               | 1.0           | 0.9             | 1.0
JGF-Crypt         | 0.3               | 0.3           | 0.8             | 0.9
Black-Scholes     | 0.2               | 0.2           | 0.7             | 0.9
MRIQ              | 0.0               | 0.0           | 0.9             | 1.0
MatMult           | 0.8               | 1.0           | 0.8             | 1.0
SAXPY             | 0.8               | 0.9           | 0.8             | 0.9
GEMVER            | 0.9               | 1.0           | 1.0             | 1.0
Insights

Removal of redundant java.lang.Math methods from the exception checking code can enable significant performance improvements:
• JGF-Crypt, BlackScholes, MRIQ

Speculation is not effective on AMD due to:
• Insufficient processors in the GPU
• The lazy AMD OpenCL runtime

A sketch of the java.lang.Math removal is shown below.
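To illustrate (a hypothetical Black-Scholes-like loop; the names S, K, T, call, cnd, r, and sigma are illustrative, not the benchmark's actual code): Math.log and Math.sqrt cannot raise any of the three target exceptions and feed no array subscript, and the floating-point divisions cannot throw ArithmeticException, so the checking code keeps only the array accesses.

// Original loop body
for (int i = 0; i < n; i++) {
  double d = (Math.log(S[i] / K[i]) + r * T[i]) / (sigma * Math.sqrt(T[i]));
  call[i] = S[i] * cnd(d);
}

// Optimized checking code: only dummy reads of the accessed arrays remain
for (int i = 0; i < n; i++) {
  double dummy = S[i];
  dummy = K[i];
  dummy = T[i];
  dummy = call[i]; // the store call[i] = … becomes a dummy read
}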
Sample Timeline of the Black-Scholes Application on AMD

[Figure: per-stage timeline (0 to ~6.0E+08 ns) of the application stages: host-to-device transfers, kernel, and device-to-host transfers, each split into pending and running phases across three data transfers (dt1, dt2, dt3); a callout marks a stage that accounts for 40% of total execution time.]
Related Work:
High-Level Language to GPU Code

• Lime (PLDI’12): a JVM-compatible language
• Rootbeer: compiles Java bytecode to CUDA
• X10 and Chapel: provide a programming model for CUDA
• Sponge (ASPLOS’11): compiles StreamIt to CUDA

→ None of these approaches considers Java exception semantics.
Related Work:
Exception Semantics in Java

• Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00): generate exception-safe and exception-unsafe regions of code.
• Würthinger et al. (PPPJ’07): propose an algorithm on Static Single Assignment (SSA) form for the JIT compiler that eliminates unnecessary bounds checking.
• ABCD (PLDI’00): provides an array bounds checking elimination algorithm based on graph traversal over an extended SSA form.
• Jeffery et al. (Concurrency and Computation: Practice and Experience, ’09): propose a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler.
Summary:
HJ-OpenCL

• Programmers can utilize OpenCL by simply writing the forall construct
• Automatic generation of exception checking code on the JVM
• Accelerates Java programs while preserving precise exception semantics
• Performance improvement: up to 21x speedup on an AMD APU and up to 330x speedup on an NVIDIA GPU
Backup
“next” construct for global barrier synchronization on GPUs

Semantics:
• Wait until all threads reach the synchronization point

Note that OpenCL does not support an all-to-all barrier as a kernel language feature:
• The HJ compiler internally partitions the forall loop body into blocks separated by synchronization points
next construct (cont’d)

forall (point [i] : [0:n-1]) {
  method1(i);
  // synchronization point 1
  next;
  method2(i);
  // synchronization point 2
  next;
}

Execution: Thread0 runs method1(0) while Thread1 runs method1(1); both WAIT at the first next, then run method2(0) and method2(1), and WAIT again at the second next. A sketch of how this can map to OpenCL appears below.
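A sketch of the mapping (assumed, not the actual generated code; launchKernelAndWait and the block names are hypothetical). Returning to the host between kernel launches supplies the all-to-all barrier that OpenCL kernels lack:

// The forall body is split at each `next` into separate kernels.
launchKernelAndWait("forall_block1", n); // every iteration i runs method1(i)
// kernel completion acts as the global barrier for synchronization point 1
launchKernelAndWait("forall_block2", n); // every iteration i runs method2(i)
// kernel completion acts as the global barrier for synchronization point 2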
“ArrayView” for Supporting Contiguous Multidimensional Arrays

• An HJ ArrayView is backed by a one-dimensional Java array
• Enables a reduction of data transfer between host and device

[Figure: a 2-D Java array A[i][j] is a collection of separate row objects, while an HJ array view A[i, j] lays the same elements out contiguously in a single backing array; A[0][1] and A[0,1] denote the same logical element.]

A minimal sketch of the underlying idea follows.
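A minimal Java sketch of the idea (the class and its methods are hypothetical illustrations, not HJ's actual implementation): a 2-D view over one contiguous 1-D backing array, which can be shipped to the device in a single transfer.

// Hypothetical 2-D view over a contiguous 1-D backing array.
final class ArrayView2D {
  private final int[] base; // contiguous storage, transferable as one buffer
  private final int cols;

  ArrayView2D(int[] base, int cols) {
    this.base = base;
    this.cols = cols;
  }

  int get(int i, int j) { return base[i * cols + j]; }          // A[i, j]
  void set(int i, int j, int v) { base[i * cols + j] = v; }
}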
Speculative Exception Checking (Flowchart)

1. HJ runtime on the JVM: data transfer to the GPU & kernel invocation (speculative execution on the GPU begins)
2. Exception checking runs (in parallel) on the multicore CPU while the computation runs on the many-core GPU via the OpenCL runtime
3. Exception occurred?
   • No: data transfer from the GPU
   • Yes: GPU cleanup, then fall back to the original code on the JVM
Editor's Notes

  • #2 Thanks for your introduction. Hello, everyone. Welcome to my talk. This is the last talk of the LCPC workshop. My name is Akihiro Hayashi and I’m a postdoc at Rice University. Today I’ll be talking about speculative execution of parallel programs with precise exception semantics on GPUs.
  • #3 Let me first talk about the background. Programming models for GPGPU, such as CUDA and OpenCL, can enable significant improvements for certain classes of applications. If you use these programming models, applications can run faster than natively running C/C++ applications. These programming models provide a C-like kernel language and low-level APIs, including data transfer APIs and kernel invocation APIs, which are usually accessible from C/C++. On the other hand, high-level languages such as Java provide high-productivity features including type safety, garbage collection, and precise exception semantics.
  • #4 But if you want to utilize the GPU from Java, it requires programmers to write a non-trivial amount of application code. Here is an example. You can see three kinds of code here: JNI code, OpenCL API calls, and OpenCL kernel code. In the JNI code, you can see the declaration of the JNI prototype and the code that gets and releases the pointer to the original Java array. In the OpenCL API part, I omitted a lot of code, but programmers actually have to write a lot of code to offload a task onto the GPU, using memory allocation APIs, data transfer APIs, kernel compilation APIs, and kernel invocation APIs. When you take a look at the OpenCL kernel, you can see a C-like function; programmers write the kernel computation in an SPMD manner. All told, utilizing the GPU from Java adds a non-trivial amount of work.
  • #5 In past work, Rootbeer provides a high-level programming model for the GPU. In Rootbeer, the programmer prepares a special class that implements the Kernel interface, which allows the programmer to write the computation body by overriding the gpuMethod() method. We don’t need to write JNI and OpenCL code anymore; you can invoke the kernel on the GPU by using the Rootbeer API as shown on your left, which just adds the jobs to a queue. But it still requires the programmer to write a special class and call a special API to invoke the GPU.
  • #6 In our prior work, we proposed HJ-OpenCL, which performs automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct named forall. Our compiler automatically compiles forall constructs to OpenCL kernels. HJ-OpenCL is built on top of the Habanero-Java language. In this work we focus on maintaining Java’s exception semantics when we use the GPU from Java.
  • #7 Before explaining our methodologies, let me give you an overview of the Habanero-Java language. Habanero-Java is a new language developed at Rice since 2007. It’s derived from the Java-based version of the X10 language.
  • #8 HJ provides several parallel extensions focused on task parallelism. You can use the async statement to create a task, and you can use forall to express a parallel loop. If you want to do all-to-all or point-to-point synchronization, you can use phasers; a subset of phasers is available in Java 7. For mutual exclusion, you can use the isolated statement. HJ also allows you to set the affinity of a task with places. Habanero-C and Habanero-Scala are also available with similar constructs.
  • #9 Here is an HJ-OpenCL example. In HJ-OpenCL, you can express a GPU kernel by just replacing for with forall. Unlike Rootbeer, you don’t need to prepare a special class to use the GPU from Java.
  • #10 This is the compilation flow of HJ-OpenCL. The HJ compiler takes an HJ program and generates three kinds of files: the .class files that run on the JVM; OpenCL_hjstub.c, which consists of JNI glue code and OpenCL API calls; and a special class named OpenCLKernel.class. OpenCLKernel.class is passed to the APARAPI translator, which takes Java bytecode and generates an OpenCL kernel. A native C compiler takes OpenCL_hjstub.c and the OpenCL kernel generated by the APARAPI translator and produces a native dynamic library. This compilation is done automatically.
  • #11 Let me describe APARAPI in more detail. APARAPI is an open-source project for data-parallel Java. It compiles Java bytecode to OpenCL at runtime. There is a restriction with regard to kernel generation: APARAPI can only handle primitive types; it cannot handle object instances. We prepared a static version of APARAPI to reduce runtime compilation overhead.
  • #12 Let’s talk about exception semantics. As you know, Java is a safe language but not a high-performance one, because Java checks exceptions at runtime. On the other hand, OpenCL and CUDA are high-performance but not safe. This has some analogy to these pictures: consumer cars have several safety features, while an F1 machine is not safe but can move really fast. We want to mix high performance and safety.
  • #13 This is the basic idea. We run the exception checking code in parallel on the JVM first, and only in the case that no exception occurred do we invoke the GPU through a JNI call. In the code shown on your right, you can see the exception checking code, enclosed by a try-catch statement. Note that this code is the same as the original forall implementation except for array stores: we transform every array store into an array read to keep the program’s semantics. In the catch clause, true is assigned to excpFlag. If excpFlag is false, we invoke the GPU through a JNI call; otherwise we execute the original implementation on the JVM. That’s why we can maintain exception semantics.
  • #14 Additionally, we can run the exception checking code and the computation in parallel: we speculatively run the computation on the GPU. If no exception occurred, we get the data from the device; otherwise we run the original implementation on the JVM, as in the non-speculative scheme. This is our proposed methodology. The compiler automatically generates this code.
  • #15 If we focus on the exceptions that can possibly occur during GPU execution, we can optimize the exception checking code. Focusing on ArrayIndexOutOfBoundsException, ArithmeticException, and NullPointerException, we can simply delete, at compile time, the statements that will not cause these exceptions, to accelerate exception checking.
  • #16 OK, let’s talk about the optimization. As I mentioned before, our algorithm deletes statements that do not feed an array subscript or the denominator of a division, taking control flow into account. Here is an example. Some value is assigned to i, which is used in A[i] in a later statement, so we cannot delete that statement. Similarly, we cannot delete the assignment to Y because it feeds the denominator of a division. But we can delete the assignment to X because it feeds neither an array subscript nor the denominator of a division.