Speculative Execution of Parallel Programs
with Precise Exception Semantics on GPUs
LCPC2013, Qualcomm
Akihiro Hayashi, Max Grossman,
Jisheng Zhao, Jun Shirako, Vivek Sarkar
Rice University, Houston, TX, USA
Background:
GPGPU and Java
The initial wave of programming models
for GPGPU has provided low-level APIs:
CUDA (NVIDIA)
OpenCL (Khronos)
→ GPU programs are often faster than
C/C++ applications running natively on the CPU
High-level languages such as Java
provide high-productivity features:
Type safety
Garbage Collection
Precise Exception Semantics
Motivation:
GPU Execution From Java
JNI glue code (calls the OpenCL APIs):
JNIEXPORT void JNICALL Java_Test(…) {
    void *aptr = (env)->GetPrimitiveArrayCritical(arrays, 0);
    ...
    /* Create Buffer */
    Aobj = clCreateBuffer(context, …);
    /* Host to Device Communication */
    clEnqueueWriteBuffer(queue, Aobj, …);
    /* Kernel Compilation */
    /* Kernel Invocation */
    …
    (env)->ReleasePrimitiveArrayCritical(arrays, aptr, 0);
}
OpenCL kernel:
__kernel
void run(…) {
    int gid = …;
    A[gid] = …;
}
Utilizing the GPU from Java adds a non-trivial
amount of work
Related Work:
RootBeer
Still requires special API invocation in
addition to computation body
RootBeer API:
int[][] arrays = new int[N][M];
int[] result = new int[N];
... arrays initialization ...
List<Kernel> jobs =
    new ArrayList<Kernel>();
for (int i = 0; i < N; i++) {
    jobs.add(
        new ArraySumKernel(
            arrays[i], result, i)
    );
}
Rootbeer rootbeer = new Rootbeer();
rootbeer.runAll(jobs);
Computation Body:
class ArraySumKernel
        implements Kernel {
    private int[] source;
    private int[] ret;
    private int index;
    // Constructor matching the call above
    public ArraySumKernel(int[] source, int[] ret, int index) {
        this.source = source;
        this.ret = ret;
        this.index = index;
    }
    public void gpuMethod() {
        int sum = 0;
        // Sum one row; its length is source.length
        for (int i = 0; i < source.length; i++) {
            sum += source[i];
        }
        ret[index] = sum;
    }
}
Our Approach:
HJ-OpenCL Overview
Automatic generation of OpenCL kernels
and JNI glue code from a parallel-for
construct forall
Built on top of Habanero-Java (HJ)
“Habanero-Java: the New Adventures of Old X10”
Cave et al. (PPPJ’11)
“Accelerating Habanero-Java Programs with
OpenCL Generation” Hayashi et al. (PPPJ’13)
– User provided “Safe” constructs
OpenCL acceleration with precise
exception semantics
Our primary contribution
Overview of Habanero-Java
(HJ) Language
New language and implementation
developed at Rice since 2007
– Derived from the Java-based version of the X10
language (v1.5) in 2007
– HJ is currently an extension of Java 1.4
All Java 5 & 6 libraries and classes can be
called from HJ programs
Overview of Habanero-Java
(HJ) Language (Cont’d)
– HJ’s parallel extensions are focused on task
parallelism
1. Dynamic task creation & termination:
async, finish, force, forall, foreach
2. Collective and point-to-point synchronization: phaser, next
3. Mutual exclusion and isolation: isolated
4. Locality control (task and data distributions): places, here
– Sequential HJ extensions added for convenience
  – extern, point, region, pointwise for, complex data type,
array views
– Habanero-C and Habanero-Scala are also available
with similar constructs
HJ OpenCL
Implementation
HJ-OpenCL Example
→ Programmers can utilize OpenCL by just replacing for with forall
public class ArraySum {
    public static void main(String[] args) {
        int[] base = new int[N*M];
        int[] result = new int[N];
        int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]);
        ... initialization ...
        boolean isSafe = ...;
        forall(point [i] : [0:N-1]) {
            result[i] = 0;
            for(int j=0; j<M; j++) {
                result[i] += arrays[i,j];
            }
        }
    }
}
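For readers without an HJ toolchain, the forall loop above can be approximated in plain Java. This is only an illustrative sketch: the class name ArraySumForall and the method rowSums are hypothetical, the array view is replaced by explicit row-major indexing, and a parallel stream stands in for forall. It is not the code the HJ compiler generates.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative plain-Java analogue of the forall loop; not code generated by HJ.
final class ArraySumForall {
    // Sums each row of an n x m matrix stored row-major in one flat array.
    static int[] rowSums(int[] base, int n, int m) {
        int[] result = new int[n];
        // Every index i runs as an independent task, like one forall iteration.
        IntStream.range(0, n).parallel().forEach(i -> {
            int sum = 0;
            for (int j = 0; j < m; j++) {
                sum += base[i * m + j]; // corresponds to arrays[i,j] in the view
            }
            result[i] = sum; // each task writes a distinct element
        });
        return result;
    }

    public static void main(String[] args) {
        int[] base = {1, 2, 3, 4, 5, 6}; // a 2 x 3 matrix
        System.out.println(Arrays.toString(rowSums(base, 2, 3))); // prints [6, 15]
    }
}
```

Because each iteration writes only result[i], the iterations are independent, which is the property that makes such a loop a candidate for GPU offloading.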
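The compilation flow
The HJ compiler translates an HJ program into three files:
– .class files on JVM (bytecode)
– OpenCL_hjstub.c (JNI glue code), which the C compiler builds into a native library (.so, .dll, .dylib)
– OpenCLKernel.class (bytecode), which the APARAPI translator converts into the OpenCL kernel (Kernel.c)
At run time, the JVM on the host calls the native library through JNI, and the native library drives the device through OpenCL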
APARAPI
Open-source project from AMD for data-parallel Java
– https://code.google.com/p/aparapi/
– APARAPI converts Java bytecode to
an OpenCL kernel at runtime
Restriction
– Can only handle primitive types, not objects
→ We prepared a static version of APARAPI
to reduce runtime compilation overhead
Acceleration vs. Exception Semantics
              Safe?   High Performance?
Java          Yes     No
OpenCL/CUDA   No      Yes
Java + OpenCL/CUDA: Mix!
Pictures borrowed from http://wot.motortrend.com/ and
http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html
Basic Idea:
Non-Speculative Exception Checking
Non-speculative exception checking code
(the checking code is the same as the original forall except ArrayStore):
boolean excpFlag = false;
/* (1) Exception Checking Code on JVM */
try {
    forall (point [i]:[0:N-1]) {
        … = A[i];
    }
} catch (Exception e) {
    excpFlag = true;
}
/* (2) JNI Call */
if (!excpFlag) {
    openCL_Kernel();
} else {
    // Original Implementation on JVM
    forall() {}
}
Timeline: the multicore CPU first runs exception checking (in parallel); if no exception occurs, the GPU then performs the host-to-device data transfer, the computation, and the device-to-host data transfer.
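The non-speculative scheme can be sketched in plain Java as follows. Everything named here is an assumption for illustration: gpuKernel is a hypothetical stand-in for the JNI call, the loop body out[i] = 2 * a[index[i]] is made up, and a parallel stream replaces forall. The sketch only shows the control flow: check first, then either launch the kernel or rerun the original loop so that the exception surfaces at the faulting iteration.

```java
import java.util.stream.IntStream;

final class NonSpeculativeCheck {
    // Hypothetical stand-in for the JNI call that would launch the OpenCL kernel.
    static void gpuKernel(int[] a, int[] index, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = 2 * a[index[i]];
        }
    }

    // Original loop on the JVM; it throws exactly where sequential Java would.
    static void jvmOriginal(int[] a, int[] index, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = 2 * a[index[i]];
        }
    }

    // Returns true when the kernel path was taken.
    static boolean run(int[] a, int[] index, int[] out) {
        boolean excpFlag = false;
        try {
            // (1) Checking code: the reads that may throw, without the stores.
            IntStream.range(0, out.length).parallel().forEach(i -> {
                int dummy = a[index[i]];
            });
        } catch (RuntimeException e) {
            excpFlag = true;
        }
        // (2) Launch the kernel only when the check found no exception.
        if (!excpFlag) {
            gpuKernel(a, index, out);
            return true;
        }
        // Otherwise rerun the original code, which raises the exception
        // at the faulting iteration with all earlier stores already done.
        jvmOriginal(a, index, out);
        return false;
    }
}
```

With a = {1, 2, 3} and index = {0, 7, 1}, the call throws ArrayIndexOutOfBoundsException after out[0] has been written and before out[1] is touched, which is the state sequential Java would leave behind.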
Proposed Idea:
Speculative Exception Checking
Speculative exception checking code
(the checking code is the same as the original forall except ArrayStore):
boolean excpFlag = false;
/* (1) JNI Call */
openCL_Kernel1();
/* (2) Exception Checking Code on JVM */
try {
    forall (point [i]:[0:N-1]) {
        … = A[i];
    }
} catch (Exception e) {
    excpFlag = true;
}
/* (3) JNI Call */
openCL_Kernel2(excpFlag);
if (excpFlag) {
    // Original Implementation on JVM
    forall() {}
}
Timeline: the GPU starts the host-to-device data transfer and the computation while the multicore CPU runs exception checking (in parallel); if no exception occurs, the device-to-host data transfer follows.
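A plain-Java sketch of the speculative variant, under the same assumptions as any such illustration: gpuKernel is a hypothetical stand-in for the two JNI calls, the loop body is made up, and CompletableFuture models the asynchronous kernel. The sketch assumes non-null arrays with out.length equal to index.length. The kernel writes to a private buffer, so discarding its result on an exception leaves the Java heap untouched.

```java
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

final class SpeculativeCheck {
    // Hypothetical stand-in for the kernel. It writes to a private "device"
    // buffer and, like unchecked device code, never throws: a bad index
    // simply produces a garbage value (-1 here).
    static int[] gpuKernel(int[] a, int[] index) {
        int[] deviceOut = new int[index.length];
        for (int i = 0; i < index.length; i++) {
            int k = index[i];
            deviceOut[i] = (k >= 0 && k < a.length) ? 2 * a[k] : -1;
        }
        return deviceOut;
    }

    // Original loop on the JVM; it throws exactly where sequential Java would.
    static void jvmOriginal(int[] a, int[] index, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = 2 * a[index[i]];
        }
    }

    // Returns true when the speculative result was committed.
    static boolean run(int[] a, int[] index, int[] out) {
        // (1) Start the kernel speculatively on another thread.
        CompletableFuture<int[]> kernel =
                CompletableFuture.supplyAsync(() -> gpuKernel(a, index));
        // (2) Run the exception checking code on the JVM in the meantime.
        boolean excpFlag = false;
        try {
            IntStream.range(0, out.length).parallel().forEach(i -> {
                int dummy = a[index[i]];
            });
        } catch (RuntimeException e) {
            excpFlag = true;
        }
        // (3) Wait for the kernel, then commit or discard its result.
        int[] deviceOut = kernel.join();
        if (!excpFlag) {
            System.arraycopy(deviceOut, 0, out, 0, out.length);
            return true;
        }
        jvmOriginal(a, index, out); // raises the exception precisely
        return false;
    }
}
```

The design point is that the commit (the copy back into out) happens only after the check has passed, so a misbehaving kernel run can never become visible to the Java program.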
Opportunity for Optimization
Target: exceptions that may occur
during GPU execution
(limited by the APARAPI restriction)
ArrayIndexOutOfBoundsException
ArithmeticException
– Division by zero
NullPointerException
What kind of “Optimization”?
At compile time, delete statements that cannot cause the
above exceptions, in order to accelerate exception
checking
The exception checking code
optimization algorithm
Key Idea
Delete statements that do not contribute to
array subscripts or to the denominators of
division statements, taking control
flow into account
Before
i = …;
X = …;
Y = …;
… = A[i] + X;
… = B[i] / Y;
After (X = …; is deleted: X contributes to neither a subscript nor a denominator)
i = …;
Y = …;
… = A[i] + X;
… = B[i] / Y;
Exception Checking Code
Optimization Example
forall (point [i]:[0:N-1]) {
    A[Index[i]] = B[i] + C[i];
}
// IR
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
$i5 = $i3 + $i4;
A[$i2] = $i5;
// IR after replacing the array store with a dummy read
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
$i5 = $i3 + $i4;
dummy = A[$i2];
// IR after mark: $i5 = $i3 + $i4 is not marked and will be deleted
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
dummy = A[$i2];
// Optimized code (after delete)
$i2 = Index[i];
$i3 = B[i];
$i4 = C[i];
dummy = A[$i2];
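The mark and delete steps can be sketched as a backward pass over straight-line IR. This is a simplified illustration, not the authors' implementation: the Stmt class and the slice method are hypothetical, control flow is ignored, array stores are assumed to have been turned into dummy reads already, and a statement counts as a potential exception site when it reads some variable as a subscript or denominator. Set.of and List.of require Java 9 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CheckSlicer {
    // One three-address statement of the (hypothetical) IR.
    static final class Stmt {
        final String text;          // printable form
        final String def;           // variable written, or null
        final Set<String> uses;     // all variables read
        final Set<String> critical; // variables read as subscript or denominator

        Stmt(String text, String def, Set<String> uses, Set<String> critical) {
            this.text = text;
            this.def = def;
            this.uses = uses;
            this.critical = critical;
        }
    }

    // Keeps statements that may throw or that feed a subscript or denominator.
    static List<String> slice(List<Stmt> body) {
        Set<String> needed = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (int k = body.size() - 1; k >= 0; k--) {
            Stmt s = body.get(k);
            boolean mayThrow = !s.critical.isEmpty();
            boolean feedsCheck = s.def != null && needed.contains(s.def);
            if (!mayThrow && !feedsCheck) {
                continue; // unmarked statement: delete it
            }
            kept.add(s.text);
            if (feedsCheck) {
                needed.remove(s.def);   // this definition satisfies the need
                needed.addAll(s.uses);  // its inputs are needed in turn
            }
            needed.addAll(s.critical);  // subscripts and denominators are needed
        }
        Collections.reverse(kept);
        return kept;
    }

    public static void main(String[] args) {
        List<Stmt> ir = List.of(
            new Stmt("$i2 = Index[i];", "$i2", Set.of("i"), Set.of("i")),
            new Stmt("$i3 = B[i];", "$i3", Set.of("i"), Set.of("i")),
            new Stmt("$i4 = C[i];", "$i4", Set.of("i"), Set.of("i")),
            new Stmt("$i5 = $i3 + $i4;", "$i5", Set.of("$i3", "$i4"), Set.of()),
            new Stmt("dummy = A[$i2];", "dummy", Set.of("$i2"), Set.of("$i2")));
        slice(ir).forEach(System.out::println);
    }
}
```

On the IR of the example, the pass drops only $i5 = $i3 + $i4, matching the optimized code shown on the slide. A production version would also need to treat accesses with constant subscripts and null array references as exception sites.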
Benchmarks
Benchmark              Data Size             Remarks
SparseMatmult (JGF)    N = 500,000           Sparse Matrix
Doitgen (Polybench)    128x128x128
Crypt (JGF)            N = 50,000,000
Blackscholes           16,777,216 options
MRIQ (Parboil)         64x64x64
MatMult                1024x1024
SAXPY                  N = 25,000x25,000     Sparse Matrix
GEMVER (SparseBLAS)    10,000,000            Sparse Matrix
Platforms
               AMD A10-5800K                  Westmere
CPU            4 cores (APU)                  6 cores x 2, Xeon 5660
GPU            Radeon HD 7660D, 384 cores     NVIDIA Tesla M2050, 448 cores
Java Runtime   JRE (build 1.6.0_21-b06)       JRE (build 1.6.0_25-b06)
JVM            HotSpot 64-Bit Server VM       HotSpot 64-Bit Server VM
               (build 17.0-b11, mixed mode)   (build 20.0-b11, mixed mode)
Experimental Methodologies
We tested execution in the following
modes:
Sequential Java
HJ-OpenCL
No Checking (Unsafe)
Non-Speculative Exception Checking
– Optimized/Unoptimized
Speculative Exception Checking
– Optimized/Unoptimized
Result On AMD(A10-5800K)
No Checking vs. Speculative Checking
– Up to 18% slowdown while maintaining Java exception
semantics
Speedup relative to Sequential Java (higher is better)
Benchmark            no checking: UNSAFE    Optimized Speculative: SAFE, PROPOSED
JGF-SparseMatMult    2.4                    2.1
Polybench Doitgen    0.2                    0.2
JGF-Crypt            3.6                    3.0
Black-Scholes        9.8                    8.6
MRIQ                 21.3                   21.1
MatMult              12.6                   11.7
SAXPY                5.1                    4.5
GEMVER               9.9                    9.6
Result On AMD(A10-5800K)
Slowdown for Exception Checking
Speedup relative to No-Checking (higher is better)
Benchmark            Non-Speculative,   Speculative,    Non-Speculative,   Speculative,
                     Unoptimized        Unoptimized     Optimized          Optimized
JGF-SparseMatMult    0.9                0.9             0.9                0.9
Polybench Doitgen    1.0                1.0             1.0                1.0
JGF-Crypt            0.4                0.4             0.8                0.8
Black-Scholes        0.2                0.2             0.9                0.9
MRIQ                 0.1                0.1             1.0                1.0
MatMult              0.9                0.9             0.9                0.9
SAXPY                0.9                0.9             0.9                0.9
GEMVER               0.9                0.9             1.0                1.0
Chart annotations: “Speedup by Optimization” and “Almost Same Performance”
Analysis of Results on AMD:
Checking code optimization issues
In speculative execution, optimization is
effective only if the exception checking code is
on the critical path
– JGF-Crypt, BlackScholes, MRIQ, GEMVER
Timeline: unoptimized exception checking (in parallel) on the multicore CPU takes longer than the GPU's host-to-device data transfer, computation, and device-to-host data transfer, so the exception checking code is on the critical path.
Analysis of Results on AMD:
Speculative Execution Issues
Speculation does not accelerate program execution much
because GPU execution takes much longer than checking
– JGF-SparseMatMult, Doitgen, MatMult, SAXPY
Timelines (two cases): checking on the multicore CPU is short compared with the GPU's host-to-device data transfer, computation, and device-to-host data transfer.
Result On Westmere
No Checking vs. Speculative Checking
– Up to 22% slowdown while maintaining Java exception
semantics
Speedup relative to Sequential Java (log scale, higher is better)
Benchmark            no checking: UNSAFE    Optimized Speculative: SAFE, PROPOSED
JGF-SparseMatMult    9.8                    7.7
Polybench Doitgen    2.7                    2.8
JGF-Crypt            14.4                   13.6
Black-Scholes        37.8                   35.2
MRIQ                 330.8                  331.0
MatMult              43.1                   44.1
SAXPY                8.8                    7.6
GEMVER               22.3                   22.4
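As a worked check of the slowdown figure, the sketch below computes the relative loss in speedup from two chart labels. It assumes that the first data label of each series in the Westmere chart (9.8 for no checking, 7.7 for optimized speculative checking) belongs to JGF-SparseMatMult; the class and method names are hypothetical. Since both values are speedups over the same sequential baseline, their ratio equals the ratio of the two execution times.

```java
final class Slowdown {
    // Relative loss in speedup of the checked version versus the unchecked
    // version, in percent. Both arguments are speedups over the same baseline.
    static double percent(double speedupNoCheck, double speedupChecked) {
        return (1.0 - speedupChecked / speedupNoCheck) * 100.0;
    }

    public static void main(String[] args) {
        // Labels read from the chart; they are rounded to one decimal place.
        System.out.printf("%.1f%%%n", percent(9.8, 7.7));
    }
}
```

The result is about 21.4%, in line with the "up to 22%" on the slide once the rounding of the chart labels to one decimal place is taken into account.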
Result On Westmere
Slowdown for Exception Checking
– Both speculation and optimization are effective
Speedup relative to No-Checking (higher is better)
Benchmark            Non-Speculative,   Speculative,    Non-Speculative,   Speculative,
                     Unoptimized        Unoptimized     Optimized          Optimized
JGF-SparseMatMult    0.6                0.7             0.7                0.8
Polybench Doitgen    0.9                1.0             0.9                1.0
JGF-Crypt            0.3                0.3             0.8                0.9
Black-Scholes        0.2                0.2             0.7                0.9
MRIQ                 0.0                0.0             0.9                1.0
MatMult              0.8                1.0             0.8                1.0
SAXPY                0.8                0.9             0.8                0.9
GEMVER               0.9                1.0             1.0                1.0
Insights
Removing redundant java.lang.Math
method calls from the exception checking code
can enable significant performance
improvements
JGF-Crypt, BlackScholes, MRIQ
Speculation is not effective on AMD
due to
Insufficient processors in GPU
Lazy AMD OpenCL Runtime
Sample timeline of the Black-Scholes
application on AMD
Figure: timeline chart. x-axis: Time (ns), from 0.0E+00 to 6.0E+08; y-axis: Application Stage, covering the pending and running phases of the host-to-device transfers (dt1 to dt3), the kernel, and the device-to-host transfers (dt1 to dt3).
Annotation: Accounts for 40% of total execution time!
Related Work:
High-level language to GPU code
Lime (PLDI’12)
JVM compatible language
RootBeer
Compiles Java bytecode to CUDA
X10 and Chapel
Provides programming model for CUDA
Sponge (ASPLOS’11)
Compiles StreamIt to CUDA
→ None of these approaches considers
Java Exception Semantics
Related Work:
Exception Semantics in Java
– Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00)
  – Generate exception-safe and exception-unsafe regions of code.
– Würthinger et al. (PPPJ’07)
  – Propose an algorithm on Static Single Assignment (SSA) form for the JIT compiler that eliminates unnecessary bounds checking.
– ABCD (PLDI’00)
  – Provides an array bounds checking elimination algorithm based on graph traversal on an extended SSA form.
– Jeffery et al. (Concurrency and Computation: Practice and Experience, ’09)
  – Propose a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler.
Summary:
HJ-OpenCL
Programmers can utilize OpenCL just by
using the “forall” construct
Automatic generation of exception checking
code on JVM
Accelerating Java programs with precise
exception semantics
Performance improvement
Up to 21x speedup on AMD APU
Up to 330x speedup on NVIDIA GPU
Backup
“next” construct for global
barrier synchronization on GPUs
Semantics
– Wait until all threads reach the synchronization
point
Note that OpenCL does not support an all-to-all
barrier as a kernel language feature
– The HJ compiler internally partitions the forall
loop body into blocks separated by
synchronization points
next construct (cont’d)
forall (point [i]:[0:n-1]) {
    method1(i);
    // synchronization point 1
    next;
    method2(i);
    // synchronization point 2
    next;
}
Thread0        Thread1
method1(0);    method1(1);
WAIT
method2(0);    method2(1);
WAIT
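The partitioning idea can be illustrated in plain Java by splitting the loop at the synchronization point and running each block as its own parallel loop; the end of one parallel loop then acts as the all-to-all barrier. This is a sketch of the general technique only: the class NextBarrier and the two loop bodies are hypothetical, and the slides do not state how the HJ compiler schedules the resulting blocks.

```java
import java.util.stream.IntStream;

final class NextBarrier {
    static int[] run(int n) {
        int[] a = new int[n];
        int[] b = new int[n];
        // Block 1: everything before the first `next`.
        IntStream.range(0, n).parallel().forEach(i -> a[i] = i + 1);
        // The parallel loop has finished here, so every a[i] is written:
        // this point plays the role of the barrier.
        // Block 2: code after `next` may read values written by other iterations.
        IntStream.range(0, n).parallel().forEach(i -> b[i] = a[i] + a[(i + 1) % n]);
        return b;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4))); // prints [3, 5, 7, 5]
    }
}
```

Without the split, iteration i could read a[(i + 1) % n] before its neighbour had written it, which is exactly the hazard the next construct rules out.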
“ArrayView” for Supporting
Contiguous Multidimensional
array
– HJ ArrayView is backed by one-dimensional Java Array
– Enables reduction of data transfer between
host and device
Figure: a Java array A[i][j] is an array of separately allocated rows, whereas an HJ array view A[i, j] maps onto one contiguous one-dimensional array; the figure highlights A[0][1] and A[0,1].
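A minimal sketch of what a two-dimensional view over a one-dimensional Java array might look like. The class ArrayView2D and its methods are hypothetical and are not HJ's arrayView API; row-major layout is an assumption. Because all elements live in one backing array, a single contiguous buffer could be handed to the device instead of one buffer per row.

```java
final class ArrayView2D {
    private final int[] base;
    private final int offset;
    private final int rows;
    private final int cols;

    ArrayView2D(int[] base, int offset, int rows, int cols) {
        if (offset < 0 || rows < 0 || cols < 0
                || (long) offset + (long) rows * cols > base.length) {
            throw new IllegalArgumentException("view does not fit in the base array");
        }
        this.base = base;
        this.offset = offset;
        this.rows = rows;
        this.cols = cols;
    }

    // Maps the pair (i, j) to a position in the backing array (row-major).
    private int flatIndex(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new ArrayIndexOutOfBoundsException("[" + i + "," + j + "]");
        }
        return offset + i * cols + j;
    }

    int get(int i, int j) {
        return base[flatIndex(i, j)];
    }

    void set(int i, int j, int value) {
        base[flatIndex(i, j)] = value;
    }

    // The single contiguous array that would be transferred to the device.
    int[] backingArray() {
        return base;
    }
}
```

For example, a 2 x 3 view over {0, 1, 2, 3, 4, 5} returns 1 for get(0, 1) and 3 for get(1, 0), and a set through the view is visible in the backing array.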
Speculative Exception Checking
Flow over time across the HJ runtime on JVM (multi-core CPU), the OpenCL runtime, and the kernel on the device (many-core GPU):
1. Speculative execution on the GPU: data transfer to the GPU and kernel invocation, then computation
2. Meanwhile, exception checking (in parallel) on the JVM
3. Exception occurred?
   No: data transfer from the GPU
   Yes: GPU cleanup, then fall back to the original code

LLVM-based Communication Optimizations for PGAS ProgramsLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS ProgramsAkihiro Hayashi
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Akihiro Hayashi
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionAkihiro Hayashi
 
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...Akihiro Hayashi
 
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...Akihiro Hayashi
 

More from Akihiro Hayashi (9)

GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
 
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
 
Introduction to Polyhedral Compilation
Introduction to Polyhedral CompilationIntroduction to Polyhedral Compilation
Introduction to Polyhedral Compilation
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
LLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS ProgramsLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS Programs
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
 
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C...
 

Recently uploaded

SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 

Recently uploaded (20)

SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 

Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs

  • 1. Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs LCPC2013, Qualcomm Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar Rice University, Houston, TX, USA 1
  • 2. Background: GPGPU and Java The initial wave of programming models for GPGPU has provided low-level APIs: CUDA (NVIDIA) OpenCL (Khronos) →Often faster than natively running C/C++ applications High-level languages such as Java provide high-productivity features: Type safety Garbage Collection Precise Exception Semantics 2
  • 3. OpenCL Kernel JNI OpenCL APIs JNI Motivation: GPU Execution From Java JNIEXPORT void JNICALL_Java_Test (…) { void ∗aptr = (env)−>GetPrimitiveArrayCritical(arrays , 0); ... /∗ Create Buffer ∗/ Aobj = clCreateBuffer(context, …); /∗ Host to Device Communication ∗/ clEnqueueWriteBuffer(queue, Aobj, …); /∗ Kernel Compilation ∗/ /* Kernel Invocation */ … (env)−>ReleasePrimitiveArrayCritical(arrays, aptr, 0); __kernel void run(…) { int gid = …; A[gid] = …; } Utilizing GPU from Java adds non-trivial amount of work 3
  • 4. Computation Body RootBeer API Related Work: RootBeer Still requires special API invocation in addition to computation body 4 int[][] arrays = new int[N][M]; int[] result = new int[N]; ... arrays initialization ... List<Kernel> jobs = new ArrayList<Kernel>(); for(int i = 0; i < N; i++) { jobs.add( new ArraySumKernel( arrays[i], result, i) ); } Rootbeer rootbeer = new Rootbeer(); rootbeer.runAll(jobs); class ArraySumKernel implements Kernel { private int[] source; private int[] ret; private int index; public void gpuMethod() { int sum = 0; for(int i = 0; i< N; i++) { sum += source[i]; } ret[index] = sum; } }
  • 5. Our Approach: HJ-OpenCL Overview Automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct forall Built on the top of Habanero-Java(HJ) “Habanero-Java: the New Adventures of Old X10” Cave et al. (PPPJ’11) “Accelerating Habanero-Java Programs with OpenCL Generation” Hayashi et al. (PPPJ’13) – User provided “Safe” constructs OpenCL acceleration with precise exception semantics Our primary contribution 5
  • 6. Overview of Habanero-Java (HJ) Language New language and implementation developed at Rice since 2007: derived from the Java-based version of the X10 language (v1.5) in 2007; HJ is currently an extension of Java 1.4. All Java 5 & 6 libraries and classes can be called from HJ programs 6
  • 7. Overview of Habanero-Java (HJ) Language (Cont’d) HJ’s parallel extensions are focused on task parallelism 1. Dynamic task creation & termination: async, finish, force, forall, foreach 2. Collective and point-to-point synchronization: phaser, next 3. Mutual exclusion and isolation: isolated 4. Locality control (task and data distributions): places, here. Sequential HJ extensions added for convenience: extern, point, region, pointwise for, complex data type, array views. Habanero-C and Habanero-Scala are also available with similar constructs 7
  • 8. HJ OpenCL Implementation HJ-OpenCL Example →Programmers can utilize OpenCL by just replacing for with forall 8 public class ArraySum { public static void main(String[] args) { int[] base = new int[N*M]; int[] result = new int[N]; int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]); ... initialization ... boolean isSafe = ...; forall(point [i] : [0:N-1]) { result[i] = 0; for(int j=0; j<M; j++) { result[i] += arrays[i,j]; } } } }
  • 9. The compilation flow HJ Program .class files on JVM (bytecode) OpenCL_hjstub.c (JNI glue code) OpenCLKernel.class (bytecode) HJ Compiler C compiler APARAPI Translator OpenCL Kernel Kernel.c Native library (.so, .dll, .dylib) JVM Host JNI Device OpenCL Program is translated into three files 9
  • 10. APARAPI Open Source Project for data parallel Java from AMD: https://code.google.com/p/aparapi/ APARAPI converts Java bytecode to an OpenCL kernel at runtime. Restriction: can only handle primitive types, not objects 10 →we prepared a static version of APARAPI to reduce runtime compilation overhead
  • 11. Acceleration vs. Exception Semantics Safe? High Performance? Java Yes No OpenCL/CUDA No Yes 11 OpenCL/CUDAJava Mix! Pictures borrowed from http://wot.motortrend.com/ http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html
  • 12. Basic Idea: Non-Speculative Exception Checking 12 Exception Checking (in parallel) Multicore-CPU GPU Non-Speculative Exception Checking Code boolean excpFlag = false; /* (1) Exception Checking Code on JVM */ try { forall (point [i]:[0:N-1]) { … = A[i]; } } catch (Exception e) { excpFlag = true; } /* (2) JNI Call */ if (!excpFlag) { openCL_Kernel(); } else{ // Original Implementation on JVM forall() {} } Host to device Data Transfer Computation Device to host Data Transfer No Exception Time Same as original Forall except ArrayStore
  • 13. Proposed Idea: Speculative Exception Checking 13 Exception Checking (in parallel) Multicore-CPU GPU Speculative Exception Checking Code boolean excpFlag = false; /* (1) JNI Call 1 */ openCL_Kernel1(); /* (2) Exception Checking Code on JVM */ try { forall (point [i]:[0:N-1]) { … = A[i]; } } catch (Exception e) { excpFlag = true; } /* (3) JNI Call 2 */ openCL_Kernel2(excpFlag); if (excpFlag) { // Original Implementation on JVM forall() {} } Host to device Data Transfer Computation Device to host Data Transfer No Exception Time Same as original Forall except ArrayStore
  • 14. Opportunity for Optimization Target exceptions that may occur during GPU execution (limited by the APARAPI restriction) ArrayIndexOutOfBoundsException ArithmeticException – Divided by Zero NullPointerException What kind of “Optimization”? Delete, at compile time, statements which will not cause the above exceptions, to accelerate exception checking 14
  • 15. The exception checking code optimization algorithm Key Idea Delete statements which do not derive array subscripts and denominator of division statement by considering control flow 15 i = …; X = …; Y = …; … = A[i] + X; … = B[i] / Y; Before i = …; X = …; Y = …; … = A[i] + X; … = B[i] / Y; After
  • 16. Exception Checking Code Optimization Example forall (point [i]:[0:N-1]) { A[Index[i]] = B[i] + C[i]; } // IR $i2 = Index[i]; $i3 = B[i]; $i4 = C[i]; $i5 = $i3 + $i4; A[$i2] = $i5; // IR $i2 = Index[i]; $i3 = B[i]; $i4 = C[i]; $i5 = $i3 + $i4; dummy = A[$i2] mark // IR $i2 = Index[i]; $i3 = B[i]; $i4 = C[i]; delete dummy = A[$i2] // IR $i2 = Index[i]; $i3 = B[i]; $i4 = C[i]; dummy = A[$i2] Dummy read Optimized Code 16
  • 17. Benchmarks Benchmark Data Size Remarks SparseMatmult JGF N= 500,000 Sparse Matrix Doitgen Polybench 128x128x128 Crypt JGF N = 50,000,000 Blackscholes 16,777,216 options MRIQ Parboil 64x64x64 MatMult 1024x1024 SAXPY N= 25,000x 25,000 Sparse Matrix GEMVER SparseBLAS 10,000,000 Sparse Matrix 17
  • 18. Platforms AMD A10-5800K: CPU 4-cores (APU); GPU Radeon HD 7660D 384-cores; Java Runtime JRE (build 1.6.0_21-b06); JVM HotSpot 64-Bit Server VM (build 17.0-b11, mixed mode). Westmere: CPU 6-cores x 2 Xeon 5660; GPU NVIDIA Tesla M2050 448-cores; Java Runtime JRE (build 1.6.0_25-b06); JVM HotSpot 64-Bit Server VM (build 20.0-b11, mixed mode) 18
  • 19. Experimental Methodologies We tested execution in the following modes: Sequential Java HJ-OpenCL No Checking (Unsafe) Non-Speculative Exception Checking – Optimized/Unoptimized Speculative Exception Checking – Optimized/Unoptimized 19
  • 20. Result On AMD (A10-5800K) No Checking vs. Speculative Checking: up to 18% slowdown while maintaining Java exception semantics 20 [Bar chart: speedup relative to Sequential Java, higher is better. Benchmarks: JGF-SparseMatMult, Polybench Doitgen, JGF-Crypt, Black-Scholes, MRIQ, MatMult, SAXPY, GEMVER. No checking (UNSAFE): 2.4, 0.2, 3.6, 9.8, 21.3, 12.6, 5.1, 9.9. Optimized Speculative (SAFE, PROPOSED): 2.1, 0.2, 3.0, 8.6, 21.1, 11.7, 4.5, 9.6.]
  • 21. Result On AMD (A10-5800K) Slowdown for Exception Checking 21 [Bar chart: speedup relative to No-Checking, higher is better. Benchmarks: JGF-SparseMatMult, Polybench Doitgen, JGF-Crypt, Black-Scholes, MRIQ, MatMult, SAXPY, GEMVER. Non-Speculative, Unoptimized: 0.9, 1.0, 0.4, 0.2, 0.1, 0.9, 0.9, 0.9. Speculative, Unoptimized: 0.9, 1.0, 0.4, 0.2, 0.1, 0.9, 0.9, 0.9. Non-Speculative, Optimized: 0.9, 1.0, 0.8, 0.9, 1.0, 0.9, 0.9, 1.0. Speculative, Optimized: 0.9, 1.0, 0.8, 0.9, 1.0, 0.9, 0.9, 1.0. Annotations: Speedup by Optimization; Almost Same Performance.]
  • 22. Analysis of Results on AMD: Checking code optimization issues In speculative execution, optimization is effective only if the exception checking code is on the critical path: JGF-Crypt, BlackScholes, MRIQ, GEMVER 22 Unoptimized Exception Checking (in parallel) Multicore-CPU GPU Host to device Data Transfer Computation Device to host Data Transfer Time Exception Checking Code is on Critical path
  • 23. Analysis of Results on AMD: Speculative Execution Issues Speculation does not accelerate program execution much because GPU execution takes much longer than checking: JGF-SparseMatMult, Doitgen, MatMult, SAXPY 23 Checking Multicore-CPU GPU Host to device Data Transfer Computation Device to host Data Transfer Time Checking Multicore-CPU GPU Host to device Data Transfer Computation Device to host Data Transfer
  • 24. Result On Westmere No Checking vs. Speculative Checking: up to 22% slowdown while maintaining Java exception semantics 24 [Bar chart, log scale: speedup relative to Sequential Java, higher is better. Benchmarks: JGF-SparseMatMult, Polybench Doitgen, JGF-Crypt, Black-Scholes, MRIQ, MatMult, SAXPY, GEMVER. No checking (UNSAFE): 9.8, 2.7, 14.4, 37.8, 330.8, 43.1, 8.8, 22.3. Optimized Speculative (SAFE, PROPOSED): 7.7, 2.8, 13.6, 35.2, 331.0, 44.1, 7.6, 22.4.]
  • 25. Result On Westmere Slowdown for Exception Checking: both speculation and optimization are effective 25 [Bar chart: speedup relative to No-Checking, higher is better. Benchmarks: JGF-SparseMatMult, Polybench Doitgen, JGF-Crypt, Black-Scholes, MRIQ, MatMult, SAXPY, GEMVER. Non-Speculative, Unoptimized: 0.6, 0.9, 0.3, 0.2, 0.0, 0.8, 0.8, 0.9. Speculative, Unoptimized: 0.7, 1.0, 0.3, 0.2, 0.0, 1.0, 0.9, 1.0. Non-Speculative, Optimized: 0.7, 0.9, 0.8, 0.7, 0.9, 0.8, 0.8, 1.0. Speculative, Optimized: 0.8, 1.0, 0.9, 0.9, 1.0, 1.0, 0.9, 1.0.]
  • 26. Insights Removal of redundant java.lang.Math methods in exception checking code can enable significant performance improvements JGF-Crypt, BlackScholes, MRIQ Speculation is not effective on AMD due to Insufficient processors in GPU Lazy AMD OpenCL Runtime 26
  • 27. Sample timeline of the Black-Scholes application on AMD 27 [Timeline chart: x-axis Time (ns), y-axis Application Stage. Stages: transfer|h-d|pending|dt1 transfer|h-d|pending|dt2 transfer|h-d|pending|dt3 kernel|pending transfer|h-d|running|dt1 transfer|h-d|running|dt2 transfer|h-d|running|dt1 kernel|running transfer|d-h|pending|dt1 transfer|d-h|pending|dt2 transfer|d-h|pending|dt3 transfer|d-h|running|dt1 transfer|d-h|running|dt2 transfer|d-h|running|dt3. Annotation: Accounts for 40% of total execution time!]
  • 28. Related Work: High-level language to GPU code Lime (PLDI’12) JVM compatible language RootBeer Compiles Java bytecode to CUDA X10 and Chapel Provides programming model for CUDA Sponge (ASPLOS’11) Compiles StreamIt to CUDA → None of these approaches considers Java Exception Semantics 28
  • 29. Related Work: Exception Semantics in Java Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ‘00): Generates exception-safe and -unsafe regions of code. Wurthinger et al. (PPPJ’07): Proposes an algorithm on Static Single Assignment (SSA) form for the JIT compiler which eliminates unnecessary bounds checking. ABCD (PLDI’00): Provides an array bounds checking elimination algorithm, which is based on graph traversal on an extended SSA form. Jeffery et al. (In Concurrency and Computations: Practice and Experience, ‘09): Proposes a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler. 29
  • 30. Summary: HJ-OpenCL Programmer can utilize OpenCL by just putting “forall” construct Automatic generation of exception checking code on JVM Accelerating Java program with precise exception semantics Performance improvement Up to 21x speedup on AMD APU, up to 330x speedup on NVIDIA GPU 30
  • 32. “next” construct for global barrier synchronization on GPUs Semantics: wait until all threads reach the synchronization point. Note that OpenCL does not support all-to-all barrier as a kernel language feature: the HJ compiler internally partitions the forall loop body into blocks separated by synchronization points 32
  • 33. next construct (cont’d) 33 forall (point [i]:[0:n-1]) { method1(i); // synchronization point 1 next; method2(i); // synchronization point 2 next; } Thread0 method1(0); Thread1 method1(1); WAIT method2(0); method2(1); WAIT
  • 34. “ArrayView” for Supporting Contiguous Multidimensional array: HJ ArrayView is backed by a one-dimensional Java Array; enables reduction of data transfer between host and device Java Array A[i][j] HJ Array View A[i, j] 0 1 2 0 0 1 2 0 1 0 1 2 3 A[0][1] A[0,1] 34
  • 35. Speculative Exception Checking Speculative Execution of GPU Exception Checking (in parallel) Data transfer to GPU & kernel invocation Computation Data transfer from GPU GPU Cleanup Exception Occurred? Fall back to original code HJ Runtime on JVM OpenCL Runtime Kernel on Device No Yes Multi-core CPU Many-core GPUTime 35
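The speculative flow in the last slide can be sketched in plain Java. This is only an illustration under stated assumptions, not the HJ-OpenCL implementation: the class and method names are hypothetical, the "device" is simulated by an asynchronous task that writes to a private copy of the output array, and the loop body is the `A[Index[i]] = B[i] + C[i]` example from slide 16. The sketch assumes `a` is non-null and that `b` and `c` are at least as long as `index`.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

class SpeculativeCheckSketch {
    // Hypothetical stand-in for the device kernel. It writes only to a private
    // "device buffer" and silently skips bad subscripts instead of throwing,
    // to mimic device code that performs no Java-style checks.
    static int[] kernelOnDeviceBuffer(int[] dev, int[] index, int[] b, int[] c) {
        for (int i = 0; i < index.length; i++) {
            int idx = index[i];
            if (idx >= 0 && idx < dev.length) {
                dev[idx] = b[i] + c[i];
            }
        }
        return dev;
    }

    // Exception checking code: the same reads as the loop body, with the array
    // store replaced by a dummy read and the addition removed.
    static boolean checkRaisesException(int[] a, int[] index, int[] b, int[] c) {
        try {
            IntStream.range(0, index.length).parallel().forEach(i -> {
                int idx = index[i];
                int unusedB = b[i];
                int unusedC = c[i];
                int dummy = a[idx];
            });
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    // Original loop on the JVM; it raises the exception at the faulting iteration.
    static void originalLoop(int[] a, int[] index, int[] b, int[] c) {
        for (int i = 0; i < index.length; i++) {
            a[index[i]] = b[i] + c[i];
        }
    }

    static String run(int[] a, int[] index, int[] b, int[] c) {
        int[] dev = a.clone(); // simulated host-to-device transfer
        CompletableFuture<int[]> kernel =
            CompletableFuture.supplyAsync(() -> kernelOnDeviceBuffer(dev, index, b, c));
        // The check runs on the CPU while the simulated kernel is in flight.
        boolean excpFlag = checkRaisesException(a, index, b, c);
        if (!excpFlag) {
            int[] result = kernel.join(); // simulated device-to-host transfer
            System.arraycopy(result, 0, a, 0, a.length);
            return "gpu";
        }
        // Speculation failed: ignore the device buffer and rerun on the JVM.
        originalLoop(a, index, b, c);
        return "jvm";
    }

    public static void main(String[] args) {
        int[] a = new int[4];
        String path = run(a, new int[]{0, 1, 2, 3}, new int[]{1, 2, 3, 4}, new int[]{10, 20, 30, 40});
        System.out.println(path + " " + Arrays.toString(a));
    }
}
```

Because the simulated kernel never touches the host array, a failed speculation leaves `a` exactly as the JVM fallback produces it, which is the property the slide's "Fall back to original code" branch relies on.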

Editor's Notes

  1. Thanks for your introduction. Hello, everyone. Welcome to my talk. This is the last talk in the LCPC workshop. My name is Akihiro Hayashi and I’m a postdoc at Rice University. Today I’ll be talking about speculative execution of parallel programs with precise exception semantics on GPUs.
  2. Let me first talk about the background. Programming models for GPGPU, such as CUDA and OpenCL, can enable significant performance improvements for certain classes of applications. With these programming models, applications often run faster than native C/C++ applications. They provide a C-like kernel language and low-level APIs, including data transfer and kernel invocation APIs, which are usually accessible from C/C++. On the other hand, high-level languages such as Java provide high-productivity features including type safety, garbage collection and precise exception semantics.
  3. But if you want to utilize the GPU from Java, programmers have to write a non-trivial amount of application code. Here is an example. You can see three kinds of code here: JNI code, OpenCL API calls and OpenCL kernel code. In the JNI code, you can see the declaration of the JNI prototype and the code which gets and releases the pointer to the original Java array. In the OpenCL API part, I omitted a lot of code, but programmers actually have to write a lot of code to offload the task onto the GPU by using memory allocation APIs, data transfer APIs, kernel compilation APIs and kernel invocation APIs. When you take a look at the OpenCL kernel, you can see a C-like function. Programmers can write the kernel computation in an SPMD manner. However, I think utilizing the GPU from Java adds a non-trivial amount of work.
  4. In past work, Rootbeer provides a high-level programming model for GPUs. In Rootbeer, the programmer prepares a special class which implements the Kernel interface. It allows the programmer to write the computation body by overriding the method called gpuMethod(). We don’t need to write JNI and OpenCL code anymore. You can invoke the kernel on the GPU by using the Rootbeer API as shown on your left. It just adds the jobs to a queue. But it still requires the programmer to write a special class and use a special API to invoke the GPU.
  5. In our prior work, we proposed HJ-OpenCL, which performs automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct named forall. Our compiler automatically compiles forall constructs to OpenCL kernels. HJ-OpenCL is built on top of the Habanero-Java language. In this work we focus on maintaining Java’s exception semantics when we use the GPU from Java.
  6. Before explaining our methodologies, let me give you an overview of the Habanero-Java language. Habanero-Java is a new language developed at Rice since 2007. It’s derived from the Java-based version of the X10 language.
  7. HJ provides several parallel extensions focused on task parallelism. You can use the async statement to create a task, and you can also use forall to express a parallel loop. If you want to do all-to-all and point-to-point synchronization, you can use phasers. A subset of phasers is available in Java 7. For mutual exclusion, you can use the isolated statement. HJ allows you to set the affinity of a task with places. Habanero-C and Habanero-Scala are also available with similar constructs.
  8. Here is an HJ-OpenCL example. In HJ-OpenCL, you can express a GPU kernel by just replacing for with forall. Unlike Rootbeer, you don't need to prepare a special class to use the GPU from Java.
  9. This is the compilation flow of HJ-OpenCL. The HJ compiler takes an HJ program and generates three kinds of files: .class files that run on the JVM; OpenCL_hjstub.c, which consists of JNI glue code and OpenCL API calls; and a special class named OpenCLKernel.class. OpenCLKernel.class is passed to the APARAPI translator, which takes Java bytecode and generates an OpenCL kernel. The native static C compiler takes OpenCL_hjstub.c and the OpenCL kernel generated by the APARAPI translator and generates a native dynamic library. This compilation is done automatically.
  10. Let me describe APARAPI in more detail. APARAPI is an open source project for data-parallel Java. It compiles Java bytecode to OpenCL at runtime. There is a restriction with regard to kernel generation: APARAPI can only handle primitive types and cannot handle object instances. We prepared a static version of APARAPI to reduce the runtime compilation overhead.
  11. Let's talk about exception semantics. As you know, we can say Java is a safe language but not a high-performance language, because Java checks for exceptions at runtime. On the other hand, OpenCL and CUDA are not safe but are high-performance languages. This is analogous to these pictures. Consumer cars have several safety features. An F1 machine is not safe but can move really fast. We want to combine high performance and safety.
  12. This is the basic idea. We first run the exception checking code in parallel on the JVM, and if no exception occurs, we invoke the GPU through a JNI call. In the code shown on your right, you can see the exception checking code, which is enclosed in a try-catch statement. Note that this code is the same as the original forall implementation except for the array stores. We have to transform every array store into an array read so that the checking code does not modify program state. In the catch block, true is assigned to excpFlag. If excpFlag is false, we invoke the GPU through a JNI call. Otherwise, we execute the original implementation on the JVM, which raises the exception just as Java would. That's why we can maintain exception semantics.
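
The check-then-offload flow described above can be sketched in plain Java. This is only an illustration, not the code the HJ compiler generates: the loop body `out[i] = in[idx[i]] / div[i]`, the class name `CheckedOffload`, and the method `launchOnGpu` (a CPU stand-in for the JNI call) are all assumptions made for the example; only the name `excpFlag` comes from the talk.

```java
import java.util.stream.IntStream;

class CheckedOffload {
    // Exception checking code: same body as the loop, but the array store
    // out[i] = ... is turned into a read so that no program state changes.
    static boolean checkExceptions(int n, int[] out, int[] in, int[] idx, int[] div) {
        boolean excpFlag = false;
        try {
            IntStream.range(0, n).parallel().forEach(i -> {
                int value = in[idx[i]] / div[i]; // may throw, result is discarded
                int read = out[i];               // store replaced by a read
            });
        } catch (RuntimeException e) {
            excpFlag = true;
        }
        return excpFlag;
    }

    // Hypothetical stand-in for the generated JNI/OpenCL call. It runs on the
    // CPU here so that the sketch is runnable without a GPU.
    static void launchOnGpu(int n, int[] out, int[] in, int[] idx, int[] div) {
        for (int i = 0; i < n; i++) {
            out[i] = in[idx[i]] / div[i];
        }
    }

    // Returns true if the loop was offloaded. Otherwise the original loop runs
    // on the JVM and the exception is thrown from there.
    static boolean run(int n, int[] out, int[] in, int[] idx, int[] div) {
        if (!checkExceptions(n, out, in, idx, div)) {
            launchOnGpu(n, out, in, idx, div);
            return true;
        }
        for (int i = 0; i < n; i++) {
            out[i] = in[idx[i]] / div[i]; // original implementation on the JVM
        }
        return false;
    }
}
```

Because the fallback path re-executes the unmodified loop, any exception surfaces from ordinary Java code rather than from the device.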
  13. Additionally, we can run the exception checking code and the computation in parallel. We speculatively run the computation on the GPU. If no exception occurs, we get the data back from the device. Otherwise, we run the original implementation on the JVM, as in the non-speculative case. This is our proposed methodology. The compiler automatically generates this code.
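
A minimal sketch of the speculative variant, again in plain Java and under stated assumptions: the device is simulated by a `CompletableFuture` that computes into a separate buffer, and `gpuKernel`, `SpeculativeOffload`, and the loop body `out[i] = in[idx[i]] / div[i]` are hypothetical names and code chosen for illustration. The stand-in kernel assumes `idx` and `div` hold at least `n` entries.

```java
import java.util.concurrent.CompletableFuture;

class SpeculativeOffload {
    // Simulated device kernel: writes into its own "device" buffer and never
    // throws. Invalid accesses produce a garbage value instead of an exception.
    static int[] gpuKernel(int n, int[] in, int[] idx, int[] div) {
        int[] device = new int[n];
        for (int i = 0; i < n; i++) {
            boolean valid = idx[i] >= 0 && idx[i] < in.length && div[i] != 0;
            device[i] = valid ? in[idx[i]] / div[i] : -1;
        }
        return device;
    }

    // Exception checking code with the array store turned into a read.
    static boolean checkExceptions(int n, int[] out, int[] in, int[] idx, int[] div) {
        try {
            for (int i = 0; i < n; i++) {
                int value = in[idx[i]] / div[i];
                int read = out[i];
            }
            return false;
        } catch (ArrayIndexOutOfBoundsException | ArithmeticException | NullPointerException e) {
            return true;
        }
    }

    // Returns true if the speculative result was committed to the host array.
    static boolean run(int n, int[] out, int[] in, int[] idx, int[] div) {
        // Start the speculative computation, then check on the JVM meanwhile.
        CompletableFuture<int[]> gpu =
                CompletableFuture.supplyAsync(() -> gpuKernel(n, in, idx, div));
        boolean excpFlag = checkExceptions(n, out, in, idx, div);
        if (!excpFlag) {
            System.arraycopy(gpu.join(), 0, out, 0, n); // device-to-host copy
            return true;
        }
        // Speculation failed: the device buffer is never copied back, and the
        // original loop runs on the JVM, where the exception is thrown.
        for (int i = 0; i < n; i++) {
            out[i] = in[idx[i]] / div[i];
        }
        return false;
    }
}
```

The key design point in this sketch is that the speculative result lives in a separate buffer until the check succeeds, so a failed speculation leaves the host array untouched before the fallback runs.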
  14. If we focus on the exceptions that can possibly occur during GPU execution, we can optimize the exception checking code. If we restrict our attention to ArrayIndexOutOfBoundsException, ArithmeticException, and NullPointerException, the compiler can simply delete the statements that cannot cause these exceptions, at compile time, to accelerate exception checking.
  15. OK, let's talk about the optimization. As I mentioned before, our algorithm deletes statements that do not contribute to an array subscript or to the denominator of a division, taking control flow into account. Here is an example. A value is assigned to i, which is used in A[i] in a later statement, so we cannot delete this statement. Similarly, we cannot delete the assignment to Y, because this statement determines the denominator of a division. But we can delete the assignment to X, because it contributes to neither an array subscript nor the denominator of a division.
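
To make the deletion rule concrete, here is a small Java sketch. The loop body is hypothetical (it is not the code on the slide), and replacing `x / y` with `1 / y` is a choice made for this illustration: it is one way to keep the divide-by-zero check once the statement computing `x` has been removed.

```java
class SlicedCheck {
    // Hypothetical original loop body.
    static void original(int[] a, int[] b, int k, int n) {
        for (int j = 0; j < n; j++) {
            int i = b[j] + k;  // feeds the subscript of a[i]: must be kept
            int y = j - k;     // feeds the denominator: must be kept
            int x = j * j + 7; // feeds neither: can be deleted from the check
            a[i] = x / y;
        }
    }

    // Optimized exception checking code: the assignment to x is gone, the
    // store is turned into a read, and only the checks themselves remain.
    static boolean checkExceptions(int[] a, int[] b, int k, int n) {
        try {
            for (int j = 0; j < n; j++) {
                int i = b[j] + k;
                int y = j - k;
                int quotient = 1 / y; // throws exactly when y == 0
                int read = a[i];      // throws exactly when a[i] = ... would
            }
            return false;
        } catch (ArrayIndexOutOfBoundsException | ArithmeticException | NullPointerException e) {
            return true;
        }
    }
}
```

Since the check only sets a flag and the original loop is re-run on the JVM when the flag is set, the check does not need to reproduce which exception fires first, only whether one would occur.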