Speculative Execution of Parallel Programs
with Precise Exception Semantics on GPUs
LCPC 2013, Qualcomm
Akihiro Hayashi, Max Grossman,
Jisheng Zhao, Jun Shirako, Vivek Sarkar
Rice University, Houston, TX, USA
Background:
GPGPU and Java
The initial wave of programming models for GPGPU provided low-level APIs:
• CUDA (NVIDIA)
• OpenCL (Khronos)
→ Often faster than natively running C/C++ applications

High-level languages such as Java provide high-productivity features:
• Type safety
• Garbage collection
• Precise exception semantics
Motivation:
GPU Execution From Java

JNI glue code and OpenCL API calls (host side):

JNIEXPORT void JNICALL Java_Test(…) {
  void *aptr = (*env)->GetPrimitiveArrayCritical(env, arrays, 0);
  ...
  /* Create Buffer */
  Aobj = clCreateBuffer(context, …);
  /* Host to Device Communication */
  clEnqueueWriteBuffer(queue, Aobj, …);
  /* Kernel Compilation */
  /* Kernel Invocation */
  …
  (*env)->ReleasePrimitiveArrayCritical(env, arrays, aptr, 0);
}

OpenCL kernel (device side):

__kernel
void run(…) {
  int gid = …;
  A[gid] = …;
}

Utilizing the GPU from Java adds a non-trivial amount of work.
Related Work:
Rootbeer

Still requires special API invocations in addition to the computation body.

Rootbeer API (host side):

int[][] arrays = new int[N][M];
int[] result = new int[N];
... arrays initialization ...
List<Kernel> jobs = new ArrayList<Kernel>();
for (int i = 0; i < N; i++) {
  jobs.add(new ArraySumKernel(arrays[i], result, i));
}
Rootbeer rootbeer = new Rootbeer();
rootbeer.runAll(jobs);

Computation body (Kernel class):

class ArraySumKernel implements Kernel {
  private int[] source;
  private int[] ret;
  private int index;
  ArraySumKernel(int[] source, int[] ret, int index) {
    this.source = source;
    this.ret = ret;
    this.index = index;
  }
  public void gpuMethod() {
    int sum = 0;
    for (int i = 0; i < source.length; i++) {
      sum += source[i];
    }
    ret[index] = sum;
  }
}
Our Approach:
HJ-OpenCL Overview

Automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct, forall.

Built on top of Habanero-Java (HJ):
• “Habanero-Java: the New Adventures of Old X10”, Cavé et al. (PPPJ’11)
• “Accelerating Habanero-Java Programs with OpenCL Generation”, Hayashi et al. (PPPJ’13)
  – user-provided “safe” constructs

OpenCL acceleration with precise exception semantics is our primary contribution.
Overview of the Habanero-Java (HJ) Language

New language and implementation developed at Rice since 2007:
• Derived from the Java-based version of the X10 language (v1.5) in 2007
• HJ is currently an extension of Java 1.4
• All Java 5 & 6 libraries and classes can be called from HJ programs
Overview of the Habanero-Java (HJ) Language (Cont’d)

HJ’s parallel extensions are focused on task parallelism:
1. Dynamic task creation & termination: async, finish, force, forall, foreach
2. Collective and point-to-point synchronization: phaser, next
3. Mutual exclusion and isolation: isolated
4. Locality control (task and data distributions): places, here

Sequential HJ extensions added for convenience: extern, point, region, pointwise for, complex data type, array views.

Habanero-C and Habanero-Scala are also available with similar constructs. A small HJ-style example appears below.
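A small HJ-style sketch of these constructs (illustrative only; work1, work2, f, in, and out are hypothetical):

// finish waits for all transitively spawned asyncs
finish {
  async { work1(); } // child task
  async { work2(); } // runs in parallel with work1
}
// forall is a parallel loop over a rectangular region (with an implicit finish)
forall (point [i] : [0:N-1]) {
  out[i] = f(in[i]);
}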
HJ-OpenCL Implementation:
HJ-OpenCL Example

→ Programmers can utilize OpenCL by just replacing for with forall.

public class ArraySum {
  public static void main(String[] args) {
    int[] base = new int[N*M];
    int[] result = new int[N];
    int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]);
    ... initialization ...
    boolean isSafe = ...;
    forall (point [i] : [0:N-1]) {
      result[i] = 0;
      for (int j = 0; j < M; j++) {
        result[i] += arrays[i,j];
      }
    }
  }
}
The Compilation Flow

The HJ compiler translates an HJ program into three files:
1. .class files (bytecode) that run on the JVM
2. OpenCL_hjstub.c (JNI glue code)
3. OpenCLKernel.class (bytecode)

The APARAPI translator converts OpenCLKernel.class into an OpenCL kernel (Kernel.c). A C compiler then compiles the JNI glue code together with the kernel into a native library (.so, .dll, .dylib), which the JVM on the host loads through JNI to drive the OpenCL device.
APARAPI

Open-source project for data-parallel Java from AMD:
• https://code.google.com/p/aparapi/
• APARAPI converts Java bytecode to an OpenCL kernel at runtime

Restriction:
• Can only handle primitive types, not objects

→ We prepared a static version of APARAPI to reduce runtime compilation overhead; a sketch of the APARAPI programming model appears below.
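For context, a minimal sketch of the dynamic APARAPI programming model whose bytecode-to-OpenCL translation the static version reuses (the class name, array names, and size are illustrative):

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class VectorAdd {
  public static void main(String[] args) {
    final int n = 1024;
    final float[] a = new float[n], b = new float[n];
    final float[] sum = new float[n];
    Kernel kernel = new Kernel() {
      @Override public void run() {
        int gid = getGlobalId();     // one work-item per element
        sum[gid] = a[gid] + b[gid];
      }
    };
    kernel.execute(Range.create(n)); // run()'s bytecode is translated to OpenCL here
    kernel.dispose();
  }
}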
Acceleration vs. Exception Semantics

            | Safe? | High performance?
Java        | Yes   | No
OpenCL/CUDA | No    | Yes

→ Mix Java and OpenCL/CUDA!

Pictures borrowed from http://wot.motortrend.com/ and
http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html
Basic Idea:
Non-Speculative Exception Checking

Run the exception checking code (in parallel) on the multicore CPU first; the GPU runs only if no exception occurred. [Figure: timeline showing exception checking on the multicore CPU, followed by host-to-device data transfer, computation, and device-to-host data transfer on the GPU.]

Non-speculative exception checking code (the checking forall is the same as the original forall except that array stores become reads):

boolean excpFlag = false;
/* (1) Exception Checking Code on JVM */
try {
  forall (point [i] : [0:N-1]) {
    … = A[i];
  }
} catch (Exception e) {
  excpFlag = true;
}
/* (2) JNI Call */
if (!excpFlag) {
  openCL_Kernel();
} else {
  // Original Implementation on JVM
  forall() { … }
}
Proposed Idea:
Speculative Exception Checking

Run the exception checking code on the multicore CPU in parallel with speculative execution on the GPU. [Figure: timeline showing exception checking on the multicore CPU overlapping the GPU's host-to-device data transfer and computation, followed by device-to-host data transfer when no exception occurs.]

Speculative exception checking code (again, the same as the original forall except that array stores become reads):

boolean excpFlag = false;
/* (1) JNI Call: start GPU execution speculatively */
openCL_Kernel1();
/* (2) Exception Checking Code on JVM */
try {
  forall (point [i] : [0:N-1]) {
    … = A[i];
  }
} catch (Exception e) {
  excpFlag = true;
}
/* (3) JNI Call: commit or abort the speculative execution */
openCL_Kernel2(excpFlag);
if (excpFlag) {
  // Original Implementation on JVM
  forall() { … }
}
Opportunity for Optimization

Target exceptions that can occur during GPU execution (given the APARAPI restriction):
• ArrayIndexOutOfBoundsException
• ArithmeticException (division by zero)
• NullPointerException

What kind of “optimization”? At compile time, delete statements that cannot cause the above exceptions, to accelerate exception checking.
The Exception Checking Code Optimization Algorithm

Key idea: delete statements that do not feed array subscripts or the denominator of a division, taking control flow into account.

Before:
i = …;
X = …;
Y = …;
… = A[i] + X;
… = B[i] / Y;

After (i and Y must be kept, since they feed an array subscript and a denominator; the assignment to X is deleted because it feeds neither):
i = …;
Y = …;
… = A[i];
… = B[i] / Y;
Exception Checking Code Optimization Example

forall (point [i] : [0:N-1]) {
  A[Index[i]] = B[i] + C[i];
}

// (1) Original IR
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
$i5 = $i3 + $i4;
A[$i2] = $i5;

// (2) The array store becomes a dummy read; statements that feed
//     array subscripts are marked
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
$i5 = $i3 + $i4;
dummy = A[$i2];

// (3) Optimized code: the unmarked statement $i5 = $i3 + $i4 is deleted
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
dummy = A[$i2];

A source-level sketch of the resulting checking code appears below.
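For intuition, a source-level view of the checking code this optimization could produce for the forall above (a sketch under assumed naming, not the exact compiler output; excpFlag is the flag from the earlier checking-code listings):

try {
  forall (point [i] : [0:N-1]) {
    int t1 = Index[i]; // kept: may throw ArrayIndexOutOfBoundsException or NullPointerException
    int t2 = B[i];     // kept: the array accesses themselves can throw
    int t3 = C[i];
    int dummy = A[t1]; // the store A[t1] = … becomes a dummy read; the addition is deleted
  }
} catch (Exception e) {
  excpFlag = true;
}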
Benchmarks

Benchmark     | Suite      | Data Size          | Remarks
SparseMatmult | JGF        | N = 500,000        | Sparse matrix
Doitgen       | Polybench  | 128x128x128        |
Crypt         | JGF        | N = 50,000,000     |
Blackscholes  |            | 16,777,216 options |
MRIQ          | Parboil    | 64x64x64           |
MatMult       |            | 1024x1024          |
SAXPY         |            | N = 25,000x25,000  | Sparse matrix
GEMVER        | SparseBLAS | 10,000,000         | Sparse matrix
Platforms

             | AMD A10-5800K                | Westmere
CPU          | 4 cores (APU)                | 6 cores x 2 (Xeon 5660)
GPU          | Radeon HD 7660D, 384 cores   | NVIDIA Tesla M2050, 448 cores
Java Runtime | JRE (build 1.6.0_21-b06)     | JRE (build 1.6.0_25-b06)
JVM          | HotSpot 64-Bit Server VM (build 17.0-b11, mixed mode) | HotSpot 64-Bit Server VM (build 20.0-b11, mixed mode)
Experimental Methodologies

We tested execution in the following modes:
• Sequential Java
• HJ-OpenCL
  – No Checking (Unsafe)
  – Non-Speculative Exception Checking (Optimized/Unoptimized)
  – Speculative Exception Checking (Optimized/Unoptimized)
Result on AMD (A10-5800K):
No Checking vs. Speculative Checking

Up to 18% slowdown while maintaining Java exception semantics.

Speedup relative to sequential Java (higher is better):

Benchmark         | No checking (UNSAFE) | Optimized Speculative (SAFE, PROPOSED)
JGF-SparseMatMult | 2.4                  | 2.1
Polybench Doitgen | 0.2                  | 0.2
JGF-Crypt         | 3.6                  | 3.0
Black-Scholes     | 9.8                  | 8.6
MRIQ              | 21.3                 | 21.1
MatMult           | 12.6                 | 11.7
SAXPY             | 5.1                  | 4.5
GEMVER            | 9.9                  | 9.6
Result on AMD (A10-5800K):
Slowdown for Exception Checking

Speedup relative to no-checking (higher is better):

Benchmark         | Non-Spec., Unopt. | Spec., Unopt. | Non-Spec., Opt. | Spec., Opt.
JGF-SparseMatMult | 0.9               | 0.9           | 0.9             | 0.9
Polybench Doitgen | 1.0               | 1.0           | 1.0             | 1.0
JGF-Crypt         | 0.4               | 0.4           | 0.8             | 0.8
Black-Scholes     | 0.2               | 0.2           | 0.9             | 0.9
MRIQ              | 0.1               | 0.1           | 1.0             | 1.0
MatMult           | 0.9               | 0.9           | 0.9             | 0.9
SAXPY             | 0.9               | 0.9           | 0.9             | 0.9
GEMVER            | 0.9               | 0.9           | 1.0             | 1.0

Optimization yields a clear speedup for JGF-Crypt, Black-Scholes, and MRIQ; elsewhere, non-speculative and speculative checking show almost the same performance.
Analysis of Results on AMD:
Checking Code Optimization Issues

With speculative execution, optimization is effective only if the exception checking code is on the critical path:
• JGF-Crypt, BlackScholes, MRIQ, GEMVER

[Figure: timeline in which unoptimized exception checking on the multicore CPU outlasts the GPU's host-to-device data transfer, computation, and device-to-host data transfer, so the exception checking code is on the critical path.]
Analysis of Results on AMD:
Speculative Execution Issues

Speculation does not accelerate program execution much when GPU execution takes far longer than checking:
• JGF-SparseMatMult, Doitgen, MatMult, SAXPY

[Figure: non-speculative and speculative timelines side by side; checking on the multicore CPU is short relative to the GPU's host-to-device data transfer, computation, and device-to-host data transfer, so overlapping it saves little time.]
Result on Westmere:
No Checking vs. Speculative Checking

Up to 22% slowdown while maintaining Java exception semantics.

Speedup relative to sequential Java (log scale in the original chart; higher is better):

Benchmark         | No checking (UNSAFE) | Optimized Speculative (SAFE, PROPOSED)
JGF-SparseMatMult | 9.8                  | 7.7
Polybench Doitgen | 2.7                  | 2.8
JGF-Crypt         | 14.4                 | 13.6
Black-Scholes     | 37.8                 | 35.2
MRIQ              | 330.8                | 331.0
MatMult           | 43.1                 | 44.1
SAXPY             | 8.8                  | 7.6
GEMVER            | 22.3                 | 22.4
Result on Westmere:
Slowdown for Exception Checking

Both speculation and optimization are effective.

Speedup relative to no-checking (higher is better):

Benchmark         | Non-Spec., Unopt. | Spec., Unopt. | Non-Spec., Opt. | Spec., Opt.
JGF-SparseMatMult | 0.6               | 0.7           | 0.7             | 0.8
Polybench Doitgen | 0.9               | 1.0           | 0.9             | 1.0
JGF-Crypt         | 0.3               | 0.3           | 0.8             | 0.9
Black-Scholes     | 0.2               | 0.2           | 0.7             | 0.9
MRIQ              | 0.0               | 0.0           | 0.9             | 1.0
MatMult           | 0.8               | 1.0           | 0.8             | 1.0
SAXPY             | 0.8               | 0.9           | 0.8             | 0.9
GEMVER            | 0.9               | 1.0           | 1.0             | 1.0
Insights

Removal of redundant java.lang.Math methods from the exception checking code can enable significant performance improvements:
• JGF-Crypt, BlackScholes, MRIQ

Speculation is not effective on AMD due to:
• Insufficient processors in the GPU
• The lazy AMD OpenCL runtime

A sketch of the java.lang.Math removal is shown below.
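To illustrate (a hypothetical Black-Scholes-like loop; the names S, K, T, call, cnd, r, and sigma are illustrative, not the benchmark's actual code): Math.log and Math.sqrt cannot raise any of the three target exceptions and feed no array subscript, and the floating-point divisions cannot throw ArithmeticException, so the checking code keeps only the array accesses.

// Original loop body
for (int i = 0; i < n; i++) {
  double d = (Math.log(S[i] / K[i]) + r * T[i]) / (sigma * Math.sqrt(T[i]));
  call[i] = S[i] * cnd(d);
}

// Optimized checking code: only dummy reads of the accessed arrays remain
for (int i = 0; i < n; i++) {
  double dummy = S[i];
  dummy = K[i];
  dummy = T[i];
  dummy = call[i]; // the store call[i] = … becomes a dummy read
}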
Sample Timeline of the Black-Scholes Application on AMD

[Figure: per-stage timeline (0 to ~6.0E+08 ns) of the application stages: host-to-device transfers, kernel, and device-to-host transfers, each split into pending and running phases across three data transfers (dt1, dt2, dt3); a callout marks a stage that accounts for 40% of total execution time.]
Related Work:
High-Level Language to GPU Code

• Lime (PLDI’12): a JVM-compatible language
• Rootbeer: compiles Java bytecode to CUDA
• X10 and Chapel: provide a programming model for CUDA
• Sponge (ASPLOS’11): compiles StreamIt to CUDA

→ None of these approaches considers Java exception semantics.
Related Work:
Exception Semantics in Java

• Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00): generate exception-safe and exception-unsafe regions of code.
• Würthinger et al. (PPPJ’07): propose an algorithm on Static Single Assignment (SSA) form for the JIT compiler that eliminates unnecessary bounds checking.
• ABCD (PLDI’00): provides an array bounds checking elimination algorithm based on graph traversal over an extended SSA form.
• Jeffery et al. (Concurrency and Computation: Practice and Experience, ’09): propose a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler.
Summary:
HJ-OpenCL

• Programmers can utilize OpenCL by simply writing the forall construct
• Automatic generation of exception checking code on the JVM
• Accelerates Java programs while preserving precise exception semantics
• Performance improvement: up to 21x speedup on an AMD APU and up to 330x speedup on an NVIDIA GPU
Backup
“next” construct for global barrier synchronization on GPUs

Semantics:
• Wait until all threads reach the synchronization point

Note that OpenCL does not support an all-to-all barrier as a kernel language feature:
• The HJ compiler internally partitions the forall loop body into blocks separated by synchronization points
next construct (cont’d)

forall (point [i] : [0:n-1]) {
  method1(i);
  // synchronization point 1
  next;
  method2(i);
  // synchronization point 2
  next;
}

Execution: Thread0 runs method1(0) while Thread1 runs method1(1); both WAIT at the first next, then run method2(0) and method2(1), and WAIT again at the second next. A sketch of how this can map to OpenCL appears below.
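A sketch of the mapping (assumed, not the actual generated code; launchKernelAndWait and the block names are hypothetical). Returning to the host between kernel launches supplies the all-to-all barrier that OpenCL kernels lack:

// The forall body is split at each `next` into separate kernels.
launchKernelAndWait("forall_block1", n); // every iteration i runs method1(i)
// kernel completion acts as the global barrier for synchronization point 1
launchKernelAndWait("forall_block2", n); // every iteration i runs method2(i)
// kernel completion acts as the global barrier for synchronization point 2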
“ArrayView” for Supporting Contiguous Multidimensional Arrays

• An HJ ArrayView is backed by a one-dimensional Java array
• Enables a reduction of data transfer between host and device

[Figure: a 2-D Java array A[i][j] is a collection of separate row objects, while an HJ array view A[i, j] lays the same elements out contiguously in a single backing array; A[0][1] and A[0,1] denote the same logical element.]

A minimal sketch of the underlying idea follows.
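A minimal Java sketch of the idea (the class and its methods are hypothetical illustrations, not HJ's actual implementation): a 2-D view over one contiguous 1-D backing array, which can be shipped to the device in a single transfer.

// Hypothetical 2-D view over a contiguous 1-D backing array.
final class ArrayView2D {
  private final int[] base; // contiguous storage, transferable as one buffer
  private final int cols;

  ArrayView2D(int[] base, int cols) {
    this.base = base;
    this.cols = cols;
  }

  int get(int i, int j) { return base[i * cols + j]; }          // A[i, j]
  void set(int i, int j, int v) { base[i * cols + j] = v; }
}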
Speculative Exception Checking (Flowchart)

1. HJ runtime on the JVM: data transfer to the GPU & kernel invocation (speculative execution on the GPU begins)
2. Exception checking runs (in parallel) on the multicore CPU while the computation runs on the many-core GPU via the OpenCL runtime
3. Exception occurred?
   • No: data transfer from the GPU
   • Yes: GPU cleanup, then fall back to the original code on the JVM
Editor's Notes

  • #2 Thanks for your introduction. Hello, everyone. Welcome to my talk. This is the last talk of the LCPC workshop. My name is Akihiro Hayashi and I’m a postdoc at Rice University. Today I’ll be talking about speculative execution of parallel programs with precise exception semantics on GPUs.
  • #3 Let me first talk about the background. Programming models for GPGPU, such as CUDA and OpenCL, can enable significant improvements for certain classes of applications. If you use these programming models, applications can run faster than natively running C/C++ applications. These programming models provide a C-like kernel language and low-level APIs, including data transfer APIs and kernel invocation APIs, which are usually accessible from C/C++. On the other hand, high-level languages such as Java provide high-productivity features including type safety, garbage collection, and precise exception semantics.
  • #4 But if you want to utilize the GPU from Java, it requires programmers to write a non-trivial amount of application code. Here is an example. You can see three kinds of code here: JNI code, OpenCL API calls, and OpenCL kernel code. In the JNI code, you can see the declaration of the JNI prototype and the code that gets and releases the pointer to the original Java array. In the OpenCL API part, I omitted a lot of code, but programmers actually have to write a lot of code to offload a task onto the GPU, using memory allocation APIs, data transfer APIs, kernel compilation APIs, and kernel invocation APIs. When you take a look at the OpenCL kernel, you can see a C-like function; programmers write the kernel computation in an SPMD manner. All told, utilizing the GPU from Java adds a non-trivial amount of work.
  • #5 In past work, Rootbeer provides a high-level programming model for the GPU. In Rootbeer, the programmer prepares a special class that implements the Kernel interface, which allows the programmer to write the computation body by overriding the gpuMethod() method. We don’t need to write JNI and OpenCL code anymore; you can invoke the kernel on the GPU by using the Rootbeer API as shown on your left, which just adds the jobs to a queue. But it still requires the programmer to write a special class and call a special API to invoke the GPU.
  • #6 In our prior work, we proposed HJ-OpenCL, which performs automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct named forall. Our compiler automatically compiles forall constructs to OpenCL kernels. HJ-OpenCL is built on top of the Habanero-Java language. In this work we focus on maintaining Java’s exception semantics when we use the GPU from Java.
  • #7 Before explaining our methodologies, let me give you an overview of the Habanero-Java language. Habanero-Java is a new language developed at Rice since 2007. It’s derived from the Java-based version of the X10 language.
  • #8 HJ provides several parallel extensions focused on task parallelism. You can use the async statement to create a task, and you can use forall to express a parallel loop. If you want to do all-to-all or point-to-point synchronization, you can use phasers; a subset of phasers is available in Java 7. For mutual exclusion, you can use the isolated statement. HJ also allows you to set the affinity of a task with places. Habanero-C and Habanero-Scala are also available with similar constructs.
  • #9 Here is an HJ-OpenCL example. In HJ-OpenCL, you can express a GPU kernel by just replacing for with forall. Unlike Rootbeer, you don’t need to prepare a special class to use the GPU from Java.
  • #10 This is the compilation flow of HJ-OpenCL. The HJ compiler takes an HJ program and generates three kinds of files: the .class files that run on the JVM; OpenCL_hjstub.c, which consists of JNI glue code and OpenCL API calls; and a special class named OpenCLKernel.class. OpenCLKernel.class is passed to the APARAPI translator, which takes Java bytecode and generates an OpenCL kernel. A native C compiler takes OpenCL_hjstub.c and the OpenCL kernel generated by the APARAPI translator and produces a native dynamic library. This compilation is done automatically.
  • #11 Let me describe APARAPI in more detail. APARAPI is an open-source project for data-parallel Java. It compiles Java bytecode to OpenCL at runtime. There is a restriction with regard to kernel generation: APARAPI can only handle primitive types; it cannot handle object instances. We prepared a static version of APARAPI to reduce runtime compilation overhead.
  • #12 Let’s talk about exception semantics. As you know, Java is a safe language but not a high-performance one, because Java checks exceptions at runtime. On the other hand, OpenCL and CUDA are high-performance but not safe. This has some analogy to these pictures: consumer cars have several safety features, while an F1 machine is not safe but can move really fast. We want to mix high performance and safety.
  • #13 This is the basic idea. We run the exception checking code in parallel on the JVM first, and only in the case that no exception occurred do we invoke the GPU through a JNI call. In the code shown on your right, you can see the exception checking code, enclosed by a try-catch statement. Note that this code is the same as the original forall implementation except for array stores: we transform every array store into an array read to keep the program’s semantics. In the catch clause, true is assigned to excpFlag. If excpFlag is false, we invoke the GPU through a JNI call; otherwise we execute the original implementation on the JVM. That’s why we can maintain exception semantics.
  • #14 Additionally, we can run the exception checking code and the computation in parallel: we speculatively run the computation on the GPU. If no exception occurred, we get the data from the device; otherwise we run the original implementation on the JVM, as in the non-speculative scheme. This is our proposed methodology. The compiler automatically generates this code.
  • #15 If we focus on the exceptions that can possibly occur during GPU execution, we can optimize the exception checking code. Focusing on ArrayIndexOutOfBoundsException, ArithmeticException, and NullPointerException, we can simply delete, at compile time, the statements that will not cause these exceptions, to accelerate exception checking.
  • #16 OK, let’s talk about the optimization. As I mentioned before, our algorithm deletes statements that do not feed an array subscript or the denominator of a division, taking control flow into account. Here is an example. Some value is assigned to i, which is used in A[i] in a later statement, so we cannot delete that statement. Similarly, we cannot delete the assignment to Y because it feeds the denominator of a division. But we can delete the assignment to X because it feeds neither an array subscript nor the denominator of a division.