Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013), September 25-27, 2013, Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).
Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs
1. Speculative Execution of Parallel Programs
with Precise Exception Semantics on GPUs
LCPC2013, Qualcomm
Akihiro Hayashi, Max Grossman,
Jisheng Zhao, Jun Shirako, Vivek Sarkar
Rice University, Houston, TX, USA
2. Background:
GPGPU and Java
The initial wave of programming models
for GPGPU has provided low-level APIs:
CUDA (NVIDIA)
OpenCL (Khronos)
→ Often faster than native C/C++
applications running on the CPU
High-level languages such as Java
provide high-productivity features:
Type safety
Garbage Collection
Precise Exception Semantics
4. Related Work: RootBeer
The computation body is written by implementing
the Kernel interface and launched through the
RootBeer API
→ Still requires a special API invocation in
addition to the computation body
int[][] arrays = new int[N][M];
int[] result = new int[N];
/* ... arrays initialization ... */
List<Kernel> jobs = new ArrayList<Kernel>();
for (int i = 0; i < N; i++) {
    jobs.add(new ArraySumKernel(arrays[i], result, i));
}
Rootbeer rootbeer = new Rootbeer();
rootbeer.runAll(jobs);

class ArraySumKernel implements Kernel {
    private int[] source;
    private int[] ret;
    private int index;
    ArraySumKernel(int[] source, int[] ret, int index) {
        this.source = source; this.ret = ret; this.index = index;
    }
    public void gpuMethod() {
        int sum = 0;
        // iterate over this row's M elements (the extracted slide said N)
        for (int i = 0; i < source.length; i++) {
            sum += source[i];
        }
        ret[index] = sum;
    }
}
5. Our Approach:
HJ-OpenCL Overview
Automatic generation of OpenCL kernels
and JNI glue code from a parallel-for
construct, forall
Built on top of Habanero-Java (HJ)
“Habanero-Java: the New Adventures of Old X10”,
Cave et al. (PPPJ’11)
“Accelerating Habanero-Java Programs with
OpenCL Generation”, Hayashi et al. (PPPJ’13)
– user-provided “safe” constructs
OpenCL acceleration with precise
exception semantics
Our primary contribution
6. Overview of Habanero-Java
(HJ) Language
New language and implementation
developed at Rice since 2007
Derived from Java-based version of X10
language (v1.5) in 2007
HJ is currently an extension of Java 1.4
All Java 5 & 6 libraries and classes can be
called from HJ programs
7. Overview of Habanero-Java
(HJ) Language (Cont’d)
HJ’s parallel extensions are focused on task
parallelism
1. Dynamic task creation & termination:
async, finish, force, forall, foreach
2. Collective and point-to-point synchronization: phaser, next
3. Mutual exclusion and isolation: isolated
4. Locality control (task and data distributions): places, here
Sequential HJ extensions added for convenience
extern, point, region, pointwise for, complex data type,
array views
Habanero-C and Habanero-Scala are also available
with similar constructs
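A minimal flavor of the async/finish/forall constructs listed above (a sketch in HJ syntax for orientation; compute, f, in, and out are placeholder names, not from the paper):

// finish waits for all transitively spawned asyncs;
// forall is a parallel loop with an implicit finish.
finish {
    async { compute(0); }   // child task
    async { compute(1); }   // runs in parallel with the task above
}
forall (point [i] : [0:n-1]) {
    out[i] = f(in[i]);      // iterations over [0, n-1] run in parallel
}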
8. HJ-OpenCL Implementation:
HJ-OpenCL Example
→ Programmers can utilize OpenCL by just replacing for with forall
public class ArraySum {
    public static void main(String[] args) {
        int[] base = new int[N*M];
        int[] result = new int[N];
        int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]);
        /* ... initialization ... */
        boolean isSafe = ...;
        forall (point [i] : [0:N-1]) {
            result[i] = 0;
            for (int j = 0; j < M; j++) {
                result[i] += arrays[i,j];
            }
        }
    }
}
9. The Compilation Flow
The program is translated into three files:
- The HJ compiler takes an HJ program and generates .class files for the JVM (bytecode), OpenCL_hjstub.c (JNI glue code), and OpenCLKernel.class (bytecode).
- The APARAPI translator takes OpenCLKernel.class and generates the OpenCL kernel (Kernel.c).
- A C compiler compiles OpenCL_hjstub.c and Kernel.c into a native library (.so, .dll, .dylib), which the JVM host loads via JNI to drive the OpenCL device.
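As a rough illustration of the generated JNI boundary (a sketch only; the class and method names here are hypothetical, not the compiler's actual output):

// Hypothetical shape of the generated bridge. The real glue code in
// OpenCL_hjstub.c allocates device buffers, transfers data, and launches
// the kernel produced by the APARAPI translator.
class OpenCLKernel {
    static {
        System.loadLibrary("hjstub");  // native library (.so/.dll/.dylib)
    }
    // One native entry point per generated kernel (illustrative signature)
    static native void runForall(int[] result, int[] base, int n, int m);
}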
10. APARAPI
Open Source Project for data parallel Java
from AMD
https://code.google.com/p/aparapi/
APARAPI converts Java bytecode to an
OpenCL kernel at runtime
Restriction:
can only handle primitive types, not objects
→ We prepared a static (ahead-of-time) version of APARAPI
to reduce runtime compilation overhead
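For reference, a minimal sketch of plain APARAPI usage (the standard com.amd.aparapi API; shown for background, this is not the HJ-generated path):

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class VectorAdd {
    public static void main(String[] args) {
        final int n = 1024;
        final float[] a = new float[n], b = new float[n], c = new float[n];
        // APARAPI translates the bytecode of run() into an OpenCL kernel;
        // only primitives and primitive arrays may be referenced inside it.
        Kernel kernel = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                c[i] = a[i] + b[i];
            }
        };
        kernel.execute(Range.create(n));  // falls back to a thread pool if no GPU
        kernel.dispose();
    }
}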
11. Acceleration vs. Exception Semantics

              Safe?   High performance?
Java          Yes     No
OpenCL/CUDA   No      Yes

→ Goal: mix Java’s safety with OpenCL/CUDA’s performance
(Pictures of a consumer car and an F1 car omitted; credits: http://wot.motortrend.com/ and http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html)
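Concretely, the safety gap the table refers to (an illustrative fragment, not from the slides):

public class SafetyGap {
    public static void main(String[] args) {
        int[] a = new int[4];
        // Java performs a runtime bounds check and throws precisely:
        int x = a[7];  // ArrayIndexOutOfBoundsException at this statement
        // The equivalent out-of-bounds read in an OpenCL/CUDA kernel is
        // undefined behavior: no exception is raised on the device.
        System.out.println(x);  // never reached
    }
}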
12. Basic Idea:
Non-Speculative Exception Checking

[Timeline: the exception checking code runs first, in parallel, on the multicore CPU; only if no exception occurred do the host-to-device data transfer, computation, and device-to-host data transfer run on the GPU. The checking code is the same as the original forall except for array stores.]

boolean excpFlag = false;
/* (1) Exception checking code on JVM */
try {
    forall (point [i] : [0:N-1]) {
        … = A[i];
    }
} catch (Exception e) {
    excpFlag = true;
}
/* (2) JNI call */
if (!excpFlag) {
    openCL_Kernel();
} else {
    // original implementation on JVM
    forall() {}
}
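The store-to-read rewrite mentioned above can be pictured as follows (a minimal sketch; the actual transformation is performed by the HJ compiler on the forall body):

// Original forall body: stores into B.
static void originalBody(int[] A, int[] B, int i) {
    B[i] = A[i] * 2;               // store into B[i]
}

// Checking version: every array store becomes a read, so the same
// null/bounds checks fire, but no state is mutated before the GPU
// results are committed (ArrayStoreException is the one case lost).
static void checkingBody(int[] A, int[] B, int i) {
    int tmp = A[i] * 2;            // same reads, same potential exceptions
    int probe = B[i];              // read triggers the store's null/bounds checks
}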
13. Proposed Idea:
Speculative Exception Checking

[Timeline: the GPU is launched speculatively, so the host-to-device data transfer and computation overlap with the exception checking code running in parallel on the multicore CPU; the device-to-host data transfer happens only if no exception occurred. The checking code is the same as the original forall except for array stores.]

boolean excpFlag = false;
/* (1) JNI call: speculative launch */
openCL_Kernel1();
/* (2) Exception checking code on JVM */
try {
    forall (point [i] : [0:N-1]) {
        … = A[i];
    }
} catch (Exception e) {
    excpFlag = true;
}
/* (3) JNI call: commit or discard results */
openCL_Kernel2(excpFlag);
if (excpFlag) {
    // original implementation on JVM
    forall() {}
}
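The same overlap can be sketched in plain Java (a conceptual sketch only: HJ emits this pattern via JNI calls, not an executor, and the openCLKernel*/forall methods below are hypothetical stand-ins):

import java.util.concurrent.*;

public class SpeculativeCheck {
    static void openCLKernel1() { /* launch transfers + kernel */ }
    static void openCLKernel2(boolean excp) { /* commit or discard */ }
    static void runCheckingForall() { /* store-free checking code */ }
    static void runOriginalForall() { /* original JVM fallback */ }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> gpu = pool.submit(SpeculativeCheck::openCLKernel1);
        boolean excpFlag = false;
        try {
            runCheckingForall();       // runs while the GPU computes
        } catch (Exception e) {
            excpFlag = true;
        }
        gpu.get();                     // join the speculative GPU run
        openCLKernel2(excpFlag);       // device-to-host copy only if safe
        if (excpFlag) {
            runOriginalForall();       // precise-exception fallback
        }
        pool.shutdown();
    }
}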
14. Opportunity for Optimization
Target exceptions that can occur
during GPU execution
(given the APARAPI restriction):
ArrayIndexOutOfBoundsException
ArithmeticException
– division by zero
NullPointerException
What kind of “optimization”?
Delete, at compile time, statements that cannot cause
the above exceptions, to accelerate exception checking
15. The Exception Checking Code
Optimization Algorithm
Key idea:
delete statements that do not feed array subscripts
or division denominators, taking control flow into account

Before:
i = …;
X = …;
Y = …;
… = A[i] + X;
… = B[i] / Y;

After (the assignment to X is deleted, since it feeds neither an
array subscript nor a division denominator; the value of X is
irrelevant to the checked exceptions):
i = …;
Y = …;
… = A[i] + X;
… = B[i] / Y;
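A simplified sketch of the statement-deletion step (my reconstruction under strong assumptions: straight-line code, one definition per variable; the paper's algorithm additionally considers control flow):

import java.util.*;

// Keep a statement iff it contains a subscript/denominator use itself,
// or it defines a variable that some kept statement still needs.
public class CheckingSlice {
    static class Stmt {
        final String def;            // variable defined, or null
        final Set<String> uses;      // all variables read
        final Set<String> checkUses; // variables used as subscripts/denominators
        Stmt(String def, Set<String> uses, Set<String> checkUses) {
            this.def = def; this.uses = uses; this.checkUses = checkUses;
        }
    }

    static List<Stmt> slice(List<Stmt> stmts) {
        Set<String> needed = new HashSet<>();
        LinkedList<Stmt> kept = new LinkedList<>();
        for (int k = stmts.size() - 1; k >= 0; k--) {  // walk backward
            Stmt s = stmts.get(k);
            boolean critical = !s.checkUses.isEmpty();
            boolean feeds = s.def != null && needed.contains(s.def);
            if (critical || feeds) {
                kept.addFirst(s);
                needed.addAll(s.checkUses);       // subscripts/denominators
                if (feeds) needed.addAll(s.uses); // full chain for needed defs
            }
            // otherwise: deleted, like "X = …" in the example above
        }
        return kept;
    }
}

On the slide's example this keeps the assignments to i and Y and deletes the assignment to X, matching the "After" code.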
18. Platforms

                AMD A10-5800K (APU)            Westmere
CPU             4 cores                        2 x 6-core Xeon 5660
GPU             Radeon HD 7660D (384 cores)    NVIDIA Tesla M2050 (448 cores)
Java runtime    JRE (build 1.6.0_21-b06)       JRE (build 1.6.0_25-b06)
JVM             HotSpot 64-Bit Server VM       HotSpot 64-Bit Server VM
                (build 17.0-b11, mixed mode)   (build 20.0-b11, mixed mode)
19. Experimental Methodologies
We tested execution in the following
modes:
Sequential Java
HJ-OpenCL
No Checking (Unsafe)
Non-Speculative Exception Checking
– Optimized/Unoptimized
Speculative Exception Checking
– Optimized/Unoptimized
20. Result on AMD (A10-5800K):
No Checking vs. Speculative Checking
Up to 18% slowdown while maintaining Java exception semantics

Speedup relative to sequential Java (higher is better):

Benchmark           No checking   Optimized speculative
                    (UNSAFE)      (SAFE, proposed)
JGF-SparseMatMult   2.4           2.1
Polybench Doitgen   0.2           0.2
JGF-Crypt           3.6           3.0
Black-Scholes       9.8           8.6
MRIQ                21.3          21.1
MatMult             12.6          11.7
SAXPY               5.1           4.5
GEMVER              9.9           9.6
21. Result on AMD (A10-5800K):
Slowdown for Exception Checking

Speedup relative to no-checking (higher is better):

Benchmark           Non-spec.,    Spec.,        Non-spec.,   Spec.,
                    unoptimized   unoptimized   optimized    optimized
JGF-SparseMatMult   0.9           0.9           0.9          0.9
Polybench Doitgen   1.0           1.0           1.0          1.0
JGF-Crypt           0.4           0.4           0.8          0.8
Black-Scholes       0.2           0.2           0.9          0.9
MRIQ                0.1           0.1           1.0          1.0
MatMult             0.9           0.9           0.9          0.9
SAXPY               0.9           0.9           0.9          0.9
GEMVER              0.9           0.9           1.0          1.0

Chart annotations: the optimization yields clear speedups (e.g., JGF-Crypt, Black-Scholes, MRIQ), while the speculative and non-speculative variants show almost the same performance.
22. Analysis of Results on AMD:
Checking Code Optimization Issues
In speculative execution, the optimization is effective only when
the exception checking code is on the critical path:
JGF-Crypt, Black-Scholes, MRIQ, GEMVER

[Timeline: with unoptimized checking, the exception checking code on the multicore CPU runs longer than the GPU's host-to-device transfer, computation, and device-to-host transfer, so the checking code is on the critical path.]
23. Analysis of Results on AMD:
Speculative Execution Issues
Speculation does not accelerate program execution much when GPU
execution takes considerably longer than the checking:
JGF-SparseMatMult, Doitgen, MatMult, SAXPY

[Timelines: in both the non-speculative and the speculative schedule, total time is dominated by the GPU's data transfers and computation, so overlapping the short checking phase barely changes the critical path.]
24. Result on Westmere:
No Checking vs. Speculative Checking
Up to 22% slowdown while maintaining Java exception semantics

Speedup relative to sequential Java (log scale in the original chart; higher is better):

Benchmark           No checking   Optimized speculative
                    (UNSAFE)      (SAFE, proposed)
JGF-SparseMatMult   9.8           7.7
Polybench Doitgen   2.7           2.8
JGF-Crypt           14.4          13.6
Black-Scholes       37.8          35.2
MRIQ                330.8         331.0
MatMult             43.1          44.1
SAXPY               8.8           7.6
GEMVER              22.3          22.4
26. Insights
Removal of redundant java.lang.Math
methods in the exception checking code
can enable significant performance improvements:
JGF-Crypt, Black-Scholes, MRIQ
Speculation is not effective on AMD due to:
insufficient processors in the GPU
the lazy AMD OpenCL runtime
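To illustrate the java.lang.Math point (a hypothetical body, not from the paper; the observation is that these Math calls cannot raise any of the three checked exceptions, so the checker may drop them):

// Original body (Black-Scholes-flavored, purely illustrative):
static double original(double[] S, int i) {
    return S[i] * Math.exp(-0.05) * Math.sqrt(2.0);
}

// Checking version: Math.exp/Math.sqrt cannot throw the targeted
// exceptions, so only the array access needs to remain.
static double checking(double[] S, int i) {
    return S[i];  // triggers the same null/bounds checks as the original
}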
27. Sample Timeline of the Black-Scholes
Application on AMD

[Timeline chart (time in ns, 0 to 6.0E+08) of application stages: host-to-device transfers dt1-dt3 and the kernel, each in pending and running states, followed by device-to-host transfers dt1-dt3, also pending and running. Callout: "Accounts for 40% of total execution time!"]
28. Related Work:
High-level language to GPU code
Lime (PLDI’12)
JVM-compatible language
RootBeer
Compiles Java bytecode to CUDA
X10 and Chapel
Provides programming model for CUDA
Sponge (ASPLOS’11)
Compiles StreamIt to CUDA
→ None of these approaches considers
Java Exception Semantics
29. Related Work:
Exception Semantics in Java
Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00)
Generate exception-safe and exception-unsafe regions of code.
Wurthinger et al. (PPPJ’07)
Propose an algorithm on Static Single Assignment (SSA) form
for the JIT compiler that eliminates unnecessary bounds checking.
ABCD (PLDI’00)
Provides an array bounds checking elimination algorithm
based on graph traversal over an extended SSA form.
Jeffery et al. (Concurrency and Computation:
Practice and Experience, ’09)
Propose a static annotation framework to reduce the
overhead of dynamic checking in the JIT compiler.
30. Summary:
HJ-OpenCL
Programmers can utilize OpenCL simply by
using the forall construct
Automatic generation of exception checking
code on the JVM
Accelerates Java programs while preserving precise
exception semantics
Performance improvement:
up to 21x speedup on the AMD APU
up to 330x speedup on the NVIDIA GPU
32. “next” construct for global
barrier synchronization on GPUs
Semantics:
wait until all threads reach the synchronization point
Note that OpenCL does not support an all-to-all
barrier as a kernel language feature
The HJ compiler internally partitions the forall
loop body into blocks separated by
synchronization points (see the sketch after the next slide's example)
33. next construct (cont’d)

forall (point [i] : [0:n-1]) {
    method1(i);
    // synchronization point 1
    next;
    method2(i);
    // synchronization point 2
    next;
}

Thread 0: method1(0); WAIT; method2(0); WAIT
Thread 1: method1(1); WAIT; method2(1); WAIT
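One plausible lowering of this pattern (a sketch of the idea only, not HJ's actual generated code; launchBlockKernel is a hypothetical stand-in for the generated JNI/OpenCL invocation):

// Each region between next barriers becomes its own kernel launch;
// returning to the host at kernel completion supplies the all-to-all
// barrier that an OpenCL kernel cannot express internally.
static void runPartitionedForall(int n) {
    launchBlockKernel("block0", n);  // executes method1(i) for all i
    // implicit global barrier: the first kernel has fully completed
    launchBlockKernel("block1", n);  // executes method2(i) for all i
}

static void launchBlockKernel(String block, int n) {
    /* JNI call into the generated glue code, which enqueues the kernel */
}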
34. “ArrayView” for Supporting
Contiguous Multidimensional Arrays
An HJ array view is backed by a one-dimensional Java array
Enables reduction of data transfers between
host and device

[Diagram: a Java array A[i][j] is an array of separate row arrays, while an HJ array view A[i, j] lays the same elements out contiguously; A[0][1] and A[0,1] denote the same logical element.]
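The mapping behind the view can be sketched as follows (row-major layout assumed, consistent with the int[N*M] backing array in the earlier example; helper names are illustrative):

// Element [i, j] of an N x M view lives at base[i * M + j] in the backing
// one-dimensional array, so the whole view can be shipped to the device
// in a single contiguous transfer.
static int flatten(int i, int j, int M) {
    return i * M + j;
}

static int read(int[] base, int i, int j, int M) {
    return base[flatten(i, j, M)];  // corresponds to A[i, j] in HJ
}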
35. Speculative Exception Checking

[Flowchart spanning the HJ runtime on the JVM (multi-core CPU), the OpenCL runtime, and the kernel on the device, over time:
1. Speculative execution on the GPU: data transfer to the GPU and kernel invocation; the kernel computes on the device.
2. In parallel, exception checking runs on the multi-core CPU.
3. Exception occurred?
   No → data transfer from the GPU.
   Yes → GPU cleanup and fall back to the original code.]
Editor's Notes
Thanks for your introduction.
Hello, everyone. Welcome to my talk. This is the last talk in LCPC workshop.
My name is Akihiro Hayashi and I’m a postdoc at Rice University.
Today I’ll be talking about speculative execution of parallel programs with precise exception semantics on GPUs.
Let me first talk about the background.
Programming models for GPGPU, such as CUDA and OpenCL, can enable significant improvements for certain classes of applications.
With these programming models, applications can run faster than native C/C++ applications.
These programming models provide a C-like kernel language and low-level APIs, including data transfer and kernel invocation APIs, which are usually accessible from C/C++.
On the other hand,
High-level languages such as Java provide high-productivity features including type safety, garbage collection and precise exception semantics.
But if you want to utilize the GPU from Java, it requires programmers to write a non-trivial amount of application code.
Here is an example.
You can see three kinds of code here.
These are JNI code, OpenCL APIs and OpenCL Kernel code.
In the JNI code, you can see the declaration of the JNI prototype and the code that gets/releases the pointer of the original Java array.
For the OpenCL APIs, I omitted a lot of code, but programmers actually have to write a great deal of it to offload the task onto the GPU, using memory allocation, data transfer, kernel compilation, and kernel invocation APIs.
When you take a look at the OpenCL kernel, you can see a C-like function.
Programmers write the kernel computation in an SPMD manner.
However, utilizing the GPU from Java adds a non-trivial amount of work.
In past work, Rootbeer provides a high-level programming model for the GPU.
In Rootbeer the programmer prepares a special class that implements the Kernel interface; this allows the programmer to write the computation body by overriding the method called gpuMethod().
We don’t need to write JNI and OpenCL code anymore.
You can invoke a kernel on the GPU by using the RootBeer API as shown on your left; it just adds the jobs to a queue.
But it still requires the programmer to write a special class and call a special API to invoke the GPU.
In our prior work,
we proposed HJ-OpenCL, which performs automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct named forall.
Our compiler automatically compiles forall constructs to OpenCL kernels.
HJ-OpenCL is built on top of the Habanero-Java language.
In this work we focus on maintaining Java’s exception semantics when we use the GPU from Java.
Before explaining our methodology, let me give you an overview of the Habanero-Java language.
Habanero-Java is a new language developed at Rice since 2007.
It is derived from the Java-based version of the X10 language.
HJ provides several parallel extensions focused on task parallelism.
You can use the async statement to create a task, and forall to express a parallel loop.
If you want to do all-to-all and point-to-point synchronization, you can use phasers. A subset of phasers is available in Java 7.
For mutual exclusion, you can use the isolated statement.
HJ allows you to set the affinity of a task with places.
Habanero-C and Habanero-Scala are also available with similar constructs.
Here is an HJ-OpenCL example.
In HJ-OpenCL, you can express a GPU kernel by just replacing for with forall.
Unlike RootBeer, you don’t need to prepare a special class to use the GPU from Java.
This is the compilation flow of HJ-OpenCL
The HJ compiler takes an HJ program and generates three kinds of files.
These are the .class files for the JVM, OpenCL_hjstub.c, which consists of JNI glue code and OpenCL API calls, and a special class named OpenCLKernel.class.
OpenCLKernel.class is passed to the APARAPI translator, which takes Java bytecode and generates the OpenCL kernel.
The native C compiler takes OpenCL_hjstub.c and the OpenCL kernel generated by the APARAPI translator and produces the native dynamic library.
This compilation is done automatically.
Let me describe the details of APARAPI.
APARAPI is an open source project for data parallel Java.
It compiles Java bytecode to OpenCL at runtime.
There is a restriction with regard to kernel generation:
APARAPI can only handle primitive types; it cannot handle object instances.
We prepared a static version of APARAPI to reduce runtime compilation overhead.
Let’s talk about the exception semantics.
As you know, Java is a safe language but not a high-performance one, because Java checks exceptions at runtime.
On the other hand, OpenCL and CUDA are not safe but are high-performance.
This has some analogy to these pictures.
Consumer cars have several safety features; an F1 machine is not safe but can move really fast.
We want to mix high performance and safety.
This is the basic idea.
Basically, we run the exception checking code in parallel on the JVM first.
In the case that no exception occurred, we invoke the GPU through a JNI call.
In the code shown on your right,
You can see the exception checking code, which is enclosed in a try-catch statement.
Note that this code is the same as the original forall implementation except for array stores.
We transform all array stores into array reads to keep the program semantics.
In the catch block, true is assigned to excpFlag.
In the case that excpFlag is false, we invoke the GPU through a JNI call.
Otherwise we execute the original implementation on the JVM.
That’s why we can maintain exception semantics.
Additionally, we can run the exception checking code and the computation in parallel.
We speculatively run the computation on the GPU. If no exception occurred, we get the data from the device.
Otherwise we run the original implementation on the JVM, as in the non-speculative case.
This is our proposed methodology.
The compiler automatically generates this code.
We focus on the exceptions that can possibly occur during GPU execution.
This lets us optimize the exception checking code:
if we only need to catch ArrayIndexOutOfBoundsException, ArithmeticException, and NullPointerException,
we can simply delete, at compile time, statements that will not cause these exceptions, to accelerate exception checking.
OK, let’s talk about the optimization.
As I mentioned before, our algorithm deletes statements that do not feed array subscripts or division denominators, taking control flow into account.
Here is an example.
Some value is assigned to i, which is used as the subscript of A[i] in a later statement.
So we cannot delete this statement.
Similarly, we cannot delete the assignment to Y because it feeds the denominator of a division.
But we can delete the assignment to X because it feeds neither an array subscript nor a denominator.