Speculative Execution of Parallel Programs
with Precise Exception Semantics on GPUs
LCPC 2013, Qualcomm Research Silicon Valley
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar (Rice University)

Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2013), September 25-27, 2013, Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).

Speaker notes

  • Thanks for your introduction.
    Hello, everyone. Welcome to my talk. This is the last talk in the LCPC workshop.
    My name is Akihiro Hayashi and I’m a postdoc at Rice University.

    Today I’ll be talking about speculative execution of parallel programs with precise exception semantics on GPUs.
  • Let me first talk about the background.

    Programming models for GPGPU, such as CUDA and OpenCL, can enable significant performance improvements for certain classes of applications.
    With these programming models, applications can run faster than natively running C/C++ applications.
    These models provide a C-like kernel language and low-level APIs, including data-transfer and kernel-invocation APIs, which are usually accessed from C/C++.

    On the other hand,
    high-level languages such as Java provide high-productivity features including type safety, garbage collection, and precise exception semantics.
  • But if you want to utilize the GPU from Java, programmers must write a non-trivial amount of application code.

    Here is an example.
    You can see three kinds of code here:
    JNI code, OpenCL API calls, and OpenCL kernel code.

    In the JNI code, you can see the declaration of the JNI prototype and the code that gets and releases the pointer to the original Java array.
    For the OpenCL APIs, I omitted a lot of code, but programmers actually have to write many calls to offload a task onto the GPU: memory allocation APIs, data transfer APIs, kernel compilation APIs, and kernel invocation APIs.

    When you look at the OpenCL kernel, you see a C-like function.
    Programmers write the kernel computation in an SPMD manner.

    In short, utilizing the GPU from Java adds a non-trivial amount of work; a minimal sketch of just the Java side follows below.
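
    For illustration only, here is a minimal sketch of just the Java side of such a hand-written binding. The class, method, and library names (NativeVecAdd, vecAdd, "vecadd") are hypothetical, and the C file with the JNI glue and OpenCL host calls is omitted:

        // Hypothetical Java-side declaration for a hand-written GPU binding.
        // The native library would contain the JNI glue code and the OpenCL
        // host API calls (buffer creation, transfers, kernel launch).
        public class NativeVecAdd {
            static {
                System.loadLibrary("vecadd"); // loads libvecadd.so / vecadd.dll
            }

            // Implemented in C as Java_NativeVecAdd_vecAdd(JNIEnv *, jclass, ...)
            public static native void vecAdd(float[] a, float[] b, float[] c);

            public static void main(String[] args) {
                float[] a = new float[1024], b = new float[1024], c = new float[1024];
                vecAdd(a, b, c); // crosses the JNI boundary into the OpenCL host code
            }
        }
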
  • In past work, RootBeer provides a high-level programming model for the GPU.
    In RootBeer, the programmer prepares a special class that implements the Kernel interface, which lets them write the computation body by overriding a method called gpuMethod().
    We no longer need to write JNI and OpenCL code,
    and you can invoke the kernel on the GPU by using the RootBeer API, as shown on the left; it just adds the jobs to a queue.

    But it still requires the programmer to write a special class and call a special API to invoke the GPU.
  • In our prior work,
    we proposed HJ-OpenCL, which performs automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct named forall.
    Our compiler automatically compiles forall constructs to OpenCL kernels.
    HJ-OpenCL is built on top of the Habanero-Java language.

    In this work, we focus on maintaining Java’s exception semantics when we use the GPU from Java.
  • Before explaining our methodology, let me give you an overview of the Habanero-Java language.
    Habanero-Java is a new language developed at Rice since 2007.
    It’s derived from the Java-based version of the X10 language.
  • HJ provides several parallel extensions focused on task parallelism; a minimal sketch follows below.
    You can use the async statement to create tasks, and you can use forall to express parallel loops.
    If you want to do all-to-all or point-to-point synchronization, you can use phasers; a subset of phasers is available in Java 7.
    For mutual exclusion, you can use the isolated statement.
    HJ also allows you to set the affinity of tasks with places.

    Habanero-C and Habanero-Scala are also available with similar constructs.
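
    As a minimal sketch (assuming HJ syntax as described in the PPPJ papers cited in the transcript below, not taken from the slides), the constructs compose like this:

        // HJ sketch: async/finish for tasks, forall for parallel loops,
        // isolated for mutual exclusion.
        public class HJDemo {
            static int counter = 0;
            public static void main(String[] args) {
                final int n = 8;
                final int[] a = new int[n];
                finish {                        // joins all tasks created inside
                    async { a[0] = 1; }         // dynamic task creation
                    async { a[1] = 2; }
                }
                forall (point [i] : [0:n-1]) {  // parallel loop with implicit join
                    isolated { counter++; }     // mutual exclusion across iterations
                }
            }
        }
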
  • Here is an HJ-OpenCL example.

    In HJ-OpenCL, you can express a GPU kernel by just replacing for with forall.
    Unlike RootBeer, you don’t need to prepare a special class to use the GPU from Java.

  • This is the compilation flow of HJ-OpenCL.

    The HJ compiler takes an HJ program and generates three kinds of files:
    .class files for the JVM, OpenCL_hjstub.c, which consists of JNI glue code and OpenCL API calls, and a special class file named OpenCLKernel.class.
    OpenCLKernel.class is passed to the APARAPI translator, which takes Java bytecode and generates the OpenCL kernel.
    The native C compiler takes OpenCL_hjstub.c and the OpenCL kernel generated by the APARAPI translator and produces a native dynamic library.

    This compilation is done automatically.
  • Let me describe APARAPI in more detail.

    APARAPI is an open source project for data-parallel Java.
    It compiles Java bytecode to OpenCL at runtime; standard usage looks like the sketch below.
    There is a restriction with regard to kernel generation:
    APARAPI can only handle primitive types; it cannot handle object instances.

    We prepared a static version of APARAPI to reduce the runtime compilation overhead.
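
    For reference, standard runtime APARAPI usage looks roughly like this (a sketch against the com.amd.aparapi API of that era; HJ-OpenCL's static variant performs the same bytecode-to-OpenCL translation ahead of time):

        import com.amd.aparapi.Kernel;
        import com.amd.aparapi.Range;

        public class VecAdd {
            public static void main(String[] args) {
                final int n = 1024;
                final float[] a = new float[n], b = new float[n], c = new float[n];
                Kernel kernel = new Kernel() {
                    @Override public void run() {
                        int i = getGlobalId();    // one OpenCL work-item per index
                        c[i] = a[i] + b[i];       // primitive types only, no objects
                    }
                };
                kernel.execute(Range.create(n));  // bytecode -> OpenCL at runtime
                kernel.dispose();
            }
        }
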
  • Let’s talk about exception semantics.

    As you know, Java is a safe language but not a high-performance one, because Java checks exceptions at runtime.
    On the other hand, OpenCL and CUDA are high-performance languages but not safe ones.

    There is an analogy to these pictures:
    a consumer car has several safety features, while an F1 car is not safe but can move really fast.
    We want to mix high performance and safety.
  • This is the basic idea.

    Basically, we run the exception-checking code in parallel on the JVM first,
    and if no exception occurred, we invoke the GPU through a JNI call.

    In the code shown on the right,
    you can see the exception-checking code, which is enclosed by a try-catch statement.
    Note that this code is the same as the original forall implementation except for array stores:
    we transform every array store into an array read to keep the checking code free of side effects while preserving the program's exception behavior.

    In the catch block, true is assigned to excpFlag.
    If excpFlag is false, we invoke the GPU through a JNI call;
    otherwise, we execute the original implementation on the JVM.

    That’s how we maintain exception semantics; a plain-Java sketch of the pattern follows below.
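
    A plain-Java sketch of that pattern, for a loop body like A[Index[i]] = B[i] + C[i]. The generated code actually uses HJ's forall and a JNI entry point; openCLKernel() here is a hypothetical stand-in:

        public class NonSpeculativeCheck {
            static void run(int[] A, int[] B, int[] C, int[] Index) {
                boolean excpFlag = false;
                try {
                    // (1) Checking clone: the array store is rewritten as a
                    // dummy read, so the check has no side effects but still
                    // raises bounds/null/arithmetic exceptions.
                    for (int i = 0; i < B.length; i++) {
                        int dummy = B[i] + C[i];
                        dummy = A[Index[i]];
                    }
                } catch (RuntimeException e) {
                    excpFlag = true;
                }
                if (!excpFlag) {
                    openCLKernel(A, B, C, Index);        // (2) JNI call to the GPU
                } else {
                    for (int i = 0; i < B.length; i++) { // original JVM fallback
                        A[Index[i]] = B[i] + C[i];
                    }
                }
            }
            static void openCLKernel(int[] A, int[] B, int[] C, int[] Index) {
                /* placeholder for the generated OpenCL host code */
            }
        }
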
  • Additionally, we can run the exception-checking code and the computation in parallel.
    We speculatively run the computation on the GPU; if no exception occurred, we get the data back from the device.
    Otherwise, we run the original implementation on the JVM, as in the non-speculative case.

    This is our proposed methodology, sketched below.
    The compiler automatically generates this code.
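
    A matching plain-Java sketch of the speculative variant. Here openCL_Kernel1/openCL_Kernel2 are hypothetical stand-ins for the two generated JNI entry points: the first enqueues transfers and the kernel asynchronously, the second joins the GPU and copies results back only when no exception occurred:

        public class SpeculativeCheck {
            static void run(int[] A, int[] B, int[] C, int[] Index) {
                boolean excpFlag = false;
                openCL_Kernel1(A, B, C, Index);  // (1) launch GPU work asynchronously
                try {
                    // (2) check on the CPU while the GPU computes
                    for (int i = 0; i < B.length; i++) {
                        int dummy = B[i] + C[i];
                        dummy = A[Index[i]];
                    }
                } catch (RuntimeException e) {
                    excpFlag = true;
                }
                openCL_Kernel2(excpFlag);        // (3) fetch results, or discard them
                if (excpFlag) {
                    for (int i = 0; i < B.length; i++) {
                        A[Index[i]] = B[i] + C[i];   // re-execute on the JVM
                    }
                }
            }
            static void openCL_Kernel1(int[] A, int[] B, int[] C, int[] Index) { }
            static void openCL_Kernel2(boolean excp) { }
        }
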
  • If we focus on the exceptions that can possibly occur during GPU execution,
    we can optimize the exception checking code.
    Because of the APARAPI restriction these are ArrayIndexOutOfBoundsException, ArithmeticException, and NullPointerException,
    so at compile time we can simply delete statements that will not cause these exceptions, to accelerate exception checking.
  • OK, let’s talk about the optimization.
    As I mentioned before, our algorithm deletes statements that do not derive an array subscript or the denominator of a division, taking control flow into account.

    Here is an example.
    Some value is assigned to i, which is later used in A[i],
    so we cannot delete that statement.
    Similarly, we cannot delete the assignment to Y, because that statement derives the denominator of a division.
    But we can delete the assignment to X, because it derives neither an array subscript nor a denominator; the resulting checking code is sketched below.
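
    In source form, the optimized checking clone of A[Index[i]] = B[i] + C[i] amounts to the following sketch (the compiler works on the IR, as slide 16 in the transcript shows):

        static void optimizedCheck(int[] A, int[] B, int[] C, int[] Index, int N) {
            for (int i = 0; i < N; i++) {
                int i2 = Index[i];   // kept: derives an array subscript
                int i3 = B[i];       // kept: the read itself can throw
                int i4 = C[i];       // kept: the read itself can throw
                // deleted: i5 = i3 + i4 (cannot raise a targeted exception)
                int dummy = A[i2];   // store replaced by a dummy bounds-checking read
            }
        }
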
  • Transcript of "Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs"

    1. Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs. LCPC 2013, Qualcomm. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. Rice University, Houston, TX, USA.
    2. Background: GPGPU and Java. The initial wave of programming models for GPGPU has provided low-level APIs: CUDA (NVIDIA) and OpenCL (Khronos) → often faster than natively running C/C++ applications. High-level languages such as Java provide high-productivity features: type safety, garbage collection, precise exception semantics.
    3. Motivation: GPU Execution From Java. Utilizing the GPU from Java adds a non-trivial amount of work: JNI glue code, OpenCL API calls, and an OpenCL kernel.

       // JNI + OpenCL host code
       JNIEXPORT void JNICALL Java_Test (…) {
           void *aptr = (env)->GetPrimitiveArrayCritical(arrays, 0);
           ...
           /* Create Buffer */
           Aobj = clCreateBuffer(context, …);
           /* Host to Device Communication */
           clEnqueueWriteBuffer(queue, Aobj, …);
           /* Kernel Compilation */
           /* Kernel Invocation */
           …
           (env)->ReleasePrimitiveArrayCritical(arrays, aptr, 0);
       }

       // OpenCL kernel
       __kernel void run(…) {
           int gid = …;
           A[gid] = …;
       }
    4. Related Work: RootBeer. Still requires special API invocation in addition to the computation body.

       // Computation body
       class ArraySumKernel implements Kernel {
           private int[] source;
           private int[] ret;
           private int index;
           public void gpuMethod() {
               int sum = 0;
               for (int i = 0; i < N; i++) {
                   sum += source[i];
               }
               ret[index] = sum;
           }
       }

       // RootBeer API
       int[][] arrays = new int[N][M];
       int[] result = new int[N];
       ... arrays initialization ...
       List<Kernel> jobs = new ArrayList<Kernel>();
       for (int i = 0; i < N; i++) {
           jobs.add(new ArraySumKernel(arrays[i], result, i));
       }
       Rootbeer rootbeer = new Rootbeer();
       rootbeer.runAll(jobs);
    5. Our Approach: HJ-OpenCL Overview. Automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct, forall. Built on top of Habanero-Java (HJ): “Habanero-Java: the New Adventures of Old X10”, Cave et al. (PPPJ’11); “Accelerating Habanero-Java Programs with OpenCL Generation”, Hayashi et al. (PPPJ’13), with user-provided “safe” constructs. OpenCL acceleration with precise exception semantics is our primary contribution.
    6. Overview of Habanero-Java (HJ) Language. New language and implementation developed at Rice since 2007. Derived from the Java-based version of the X10 language (v1.5) in 2007. HJ is currently an extension of Java 1.4; all Java 5 & 6 libraries and classes can be called from HJ programs.
    7. Overview of Habanero-Java (HJ) Language (Cont’d). HJ’s parallel extensions are focused on task parallelism: 1. dynamic task creation & termination: async, finish, force, forall, foreach; 2. collective and point-to-point synchronization: phaser, next; 3. mutual exclusion and isolation: isolated; 4. locality control (task and data distributions): places, here. Sequential HJ extensions added for convenience: extern, point, region, pointwise for, complex data type, array views. Habanero-C and Habanero-Scala are also available with similar constructs.
    8. HJ-OpenCL Implementation: HJ-OpenCL Example. Programmers can utilize OpenCL by just replacing for with forall:

       public class ArraySum {
           public static void main(String[] args) {
               int[] base = new int[N*M];
               int[] result = new int[N];
               int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]);
               ... initialization ...
               boolean isSafe = ...;
               forall (point [i] : [0:N-1]) {
                   result[i] = 0;
                   for (int j = 0; j < M; j++) {
                       result[i] += arrays[i,j];
                   }
               }
           }
       }
    9. The compilation flow. An HJ program is translated into three files: .class files for the JVM (bytecode), OpenCL_hjstub.c (JNI glue code), and OpenCLKernel.class (bytecode). The HJ compiler emits all three; the APARAPI translator turns OpenCLKernel.class into the OpenCL kernel (Kernel.c); a C compiler links OpenCL_hjstub.c and the kernel into a native library (.so, .dll, .dylib) that the JVM host calls through JNI to drive the OpenCL device.
    10. APARAPI. Open source project for data-parallel Java from AMD: https://code.google.com/p/aparapi/. APARAPI converts Java bytecode to an OpenCL kernel at runtime. Restriction: can only handle primitive types, not objects. → We prepared a static version of APARAPI to reduce runtime compilation overhead.
    11. Acceleration vs. Exception Semantics. Mix Java and OpenCL/CUDA!

                      Safe?   High Performance?
       Java           Yes     No
       OpenCL/CUDA    No      Yes

       (Pictures borrowed from http://wot.motortrend.com/ and http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html)
    12. Basic Idea: Non-Speculative Exception Checking. Exception checking runs first (in parallel) on the multicore CPU; only when no exception is found does the GPU run (host-to-device transfer, computation, device-to-host transfer). The checking code is the same as the original forall except for array stores:

       boolean excpFlag = false;
       /* (1) Exception checking code on the JVM */
       try {
           forall (point [i] : [0:N-1]) {
               … = A[i];
           }
       } catch (Exception e) {
           excpFlag = true;
       }
       /* (2) JNI call */
       if (!excpFlag) {
           openCL_Kernel();
       } else {
           // Original implementation on the JVM
           forall () {}
       }
    13. Proposed Idea: Speculative Exception Checking. Exception checking (in parallel) on the multicore CPU overlaps with the GPU’s host-to-device transfer and computation; the device-to-host transfer completes only when no exception is found. The checking code is again the same as the original forall except for array stores:

       boolean excpFlag = false;
       /* (1) JNI call 1 */
       openCL_Kernel1();
       /* (2) Exception checking code on the JVM */
       try {
           forall (point [i] : [0:N-1]) {
               … = A[i];
           }
       } catch (Exception e) {
           excpFlag = true;
       }
       /* (3) JNI call 2 */
       openCL_Kernel2(excpFlag);
       if (excpFlag) {
           // Original implementation on the JVM
           forall () {}
       }
    14. Opportunity for Optimization. Target exceptions that can possibly occur during GPU execution (due to the APARAPI restriction): ArrayIndexOutOfBoundsException; ArithmeticException (division by zero); NullPointerException. What kind of “optimization”? Delete statements that will not cause the above exceptions, to accelerate exception checking at compile time.
    15. The exception checking code optimization algorithm. Key idea: delete statements that do not derive array subscripts or denominators of division statements, taking control flow into account. Example (the original slide shows the code before and after, with deleted statements struck out):

       i = …;           // kept: i derives the subscripts in A[i] and B[i]
       X = …;           // deleted: derives neither a subscript nor a denominator
       Y = …;           // kept: Y derives the denominator in B[i] / Y
       … = A[i] + X;
       … = B[i] / Y;
    16. Exception Checking Code Optimization Example.

       forall (point [i] : [0:N-1]) {
           A[Index[i]] = B[i] + C[i];
       }

       // IR (original)
       $i2 = Index[i];
       $i3 = B[i];
       $i4 = C[i];
       $i5 = $i3 + $i4;
       A[$i2] = $i5;

       // IR (array store replaced by a dummy read)
       $i2 = Index[i];
       $i3 = B[i];
       $i4 = C[i];
       $i5 = $i3 + $i4;
       dummy = A[$i2];

       // IR (mark/delete)
       $i2 = Index[i];      // mark
       $i3 = B[i];          // mark
       $i4 = C[i];          // mark
       $i5 = $i3 + $i4;     // delete
       dummy = A[$i2];      // mark

       // IR (optimized code)
       $i2 = Index[i];
       $i3 = B[i];
       $i4 = C[i];
       dummy = A[$i2];
    17. Benchmarks.

       Benchmark        Suite        Data Size            Remarks
       SparseMatmult    JGF          N = 500,000          Sparse matrix
       Doitgen          Polybench    128x128x128
       Crypt            JGF          N = 50,000,000
       Blackscholes                  16,777,216 options
       MRIQ             Parboil      64x64x64
       MatMult                       1024x1024
       SAXPY                         N = 25,000 x 25,000  Sparse matrix
       GEMVER           SparseBLAS   10,000,000           Sparse matrix
    18. Platforms.

                       AMD A10-5800K                    Westmere
       CPU             4 cores (APU)                    6 cores x 2 (Xeon 5660)
       GPU             Radeon HD 7660D, 384 cores       NVIDIA Tesla M2050, 448 cores
       Java Runtime    JRE (build 1.6.0_21-b06)         JRE (build 1.6.0_25-b06)
       JVM             HotSpot 64-Bit Server VM         HotSpot 64-Bit Server VM
                       (build 17.0-b11, mixed mode)     (build 20.0-b11, mixed mode)
    19. Experimental Methodologies. We tested execution in the following modes: sequential Java; HJ-OpenCL with no checking (unsafe); non-speculative exception checking (optimized/unoptimized); speculative exception checking (optimized/unoptimized).
    20. Result on AMD (A10-5800K): No Checking vs. Speculative Checking. Up to 18% slowdown while maintaining Java exception semantics. Speedup relative to sequential Java (higher is better):

       Benchmark            No checking (unsafe)   Optimized speculative (safe, proposed)
       JGF-SparseMatMult    2.4                    2.1
       Polybench Doitgen    0.2                    0.2
       JGF-Crypt            3.6                    3.0
       Black-Scholes        9.8                    8.6
       MRIQ                 21.3                   21.1
       MatMult              12.6                   11.7
       SAXPY                5.1                    4.5
       GEMVER               9.9                    9.6
    21. Result on AMD (A10-5800K): Slowdown for Exception Checking. Speedup relative to no-checking (higher is better). Optimization gives a clear speedup on JGF-Crypt, Black-Scholes, and MRIQ; the speculative and non-speculative variants show almost the same performance:

       Benchmark            Non-spec, unopt   Spec, unopt   Non-spec, opt   Spec, opt
       JGF-SparseMatMult    0.9               0.9           0.9             0.9
       Polybench Doitgen    1.0               1.0           1.0             1.0
       JGF-Crypt            0.4               0.4           0.8             0.8
       Black-Scholes        0.2               0.2           0.9             0.9
       MRIQ                 0.1               0.1           1.0             1.0
       MatMult              0.9               0.9           0.9             0.9
       SAXPY                0.9               0.9           0.9             0.9
       GEMVER               0.9               0.9           1.0             1.0
    22. Analysis of Results on AMD: checking code optimization issues. In speculative execution, optimization is effective only if the exception checking code is on the critical path, as it is for JGF-Crypt, Black-Scholes, MRIQ, and GEMVER. (Timeline on the slide: the unoptimized exception checking on the multicore CPU outlasts the GPU’s host-to-device transfer and computation, so the checking code is on the critical path.)
    23. Analysis of Results on AMD: speculative execution issues. Speculation does not accelerate program execution much, because GPU execution is much longer than checking, as for JGF-SparseMatMult, Doitgen, MatMult, and SAXPY. (Timeline on the slide: the CPU-side checking finishes long before the GPU’s transfer and computation, so the overlap hides little.)
    24. Result on Westmere: No Checking vs. Speculative Checking. Up to 22% slowdown while maintaining Java exception semantics. Speedup relative to sequential Java (log scale on the slide; higher is better):

       Benchmark            No checking (unsafe)   Optimized speculative (safe, proposed)
       JGF-SparseMatMult    9.8                    7.7
       Polybench Doitgen    2.7                    2.8
       JGF-Crypt            14.4                   13.6
       Black-Scholes        37.8                   35.2
       MRIQ                 330.8                  331.0
       MatMult              43.1                   44.1
       SAXPY                8.8                    7.6
       GEMVER               22.3                   22.4
    25. Result on Westmere: Slowdown for Exception Checking. Both speculation and optimization are effective. Speedup relative to no-checking (higher is better):

       Benchmark            Non-spec, unopt   Spec, unopt   Non-spec, opt   Spec, opt
       JGF-SparseMatMult    0.6               0.7           0.7             0.8
       Polybench Doitgen    0.9               1.0           0.9             1.0
       JGF-Crypt            0.3               0.3           0.8             0.9
       Black-Scholes        0.2               0.2           0.7             0.9
       MRIQ                 0.0               0.0           0.9             1.0
       MatMult              0.8               1.0           0.8             1.0
       SAXPY                0.8               0.9           0.8             0.9
       GEMVER               0.9               1.0           1.0             1.0
    26. Insights. Removal of redundant java.lang.Math methods in the exception checking code can enable significant performance improvements: JGF-Crypt, Black-Scholes, MRIQ. Speculation is not effective on AMD due to insufficient processors in the GPU and the lazy AMD OpenCL runtime.
    27. Sample timeline of the Black-Scholes application on AMD. (Timeline chart over 0 to 6.0E+08 ns of the application stages: pending and running states of the host-to-device transfers, the kernel, and the device-to-host transfers. Callout: accounts for 40% of total execution time!)
    28. Related Work: High-level language to GPU code. Lime (PLDI’12): JVM-compatible language. RootBeer: compiles Java bytecode to CUDA. X10 and Chapel: provide a programming model for CUDA. Sponge (ASPLOS’11): compiles StreamIt to CUDA. → None of these approaches considers Java exception semantics.
    29. Related Work: Exception Semantics in Java. Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00): generate exception-safe and -unsafe regions of code. Wurthinger et al. (PPPJ’07): propose an algorithm on Static Single Assignment (SSA) form for the JIT compiler that eliminates unnecessary bounds checking. ABCD (PLDI’00): provides an array bounds checking elimination algorithm based on graph traversal over an extended SSA form. Jeffery et al. (Concurrency and Computation: Practice and Experience, ’09): propose a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler.
    30. Summary: HJ-OpenCL. The programmer can utilize OpenCL by just using the “forall” construct. Automatic generation of exception checking code on the JVM accelerates Java programs with precise exception semantics. Performance improvement: up to 21x speedup on an AMD APU and up to 330x speedup on an NVIDIA GPU.
    31. Backup.
    32. “next” construct for global barrier synchronization on GPUs. Semantics: wait until all threads reach the synchronization point. Note that OpenCL does not support an all-to-all barrier as a kernel language feature, so the HJ compiler internally partitions the forall loop body into blocks separated by synchronization points.
    33. next construct (cont’d). Every iteration waits at each next before any iteration proceeds (see the plain-Java emulation after this item):

       forall (point [i] : [0:n-1]) {
           method1(i);
           // synchronization point 1
           next;
           method2(i);
           // synchronization point 2
           next;
       }

       // Thread 0: method1(0); WAIT; method2(0); WAIT
       // Thread 1: method1(1); WAIT; method2(1); WAIT
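
       A plain-Java emulation of those semantics (this is not the actual GPU code generation; on the device the compiler instead splits the body into separate kernel launches):

           import java.util.concurrent.CyclicBarrier;

           public class NextDemo {
               public static void main(String[] args) throws InterruptedException {
                   final int n = 4;
                   final CyclicBarrier next = new CyclicBarrier(n);
                   Thread[] ts = new Thread[n];
                   for (int t = 0; t < n; t++) {
                       final int i = t;
                       ts[t] = new Thread(() -> {
                           method1(i);
                           await(next);   // synchronization point 1
                           method2(i);
                           await(next);   // synchronization point 2
                       });
                       ts[t].start();
                   }
                   for (Thread th : ts) th.join();
               }
               static void method1(int i) { System.out.println("m1 " + i); }
               static void method2(int i) { System.out.println("m2 " + i); }
               static void await(CyclicBarrier b) {
                   try { b.await(); } catch (Exception e) { throw new RuntimeException(e); }
               }
           }
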
    34. “ArrayView” for supporting contiguous multidimensional arrays. An HJ ArrayView is backed by a one-dimensional Java array, enabling a reduction of data transfer between host and device. (Figure on the slide: a Java array-of-arrays A[i][j] versus an HJ array view A[i, j] over a single contiguous backing array; element A[0][1] corresponds to A[0, 1]. A sketch of the flattening follows after this item.)
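
       A sketch of the row-major flattening behind an ArrayView: a logical A[i, j] over [0:N-1, 0:M-1] maps onto one contiguous Java array, so a single JNI transfer can move the whole 2-D view to the device:

           public class ArrayViewDemo {
               public static void main(String[] args) {
                   final int N = 3, M = 4;
                   int[] base = new int[N * M]; // one contiguous backing array
                   int i = 0, j = 1;
                   base[i * M + j] = 42;        // plays the role of "A[0, 1] = 42"
                   System.out.println(base[1]); // prints 42
               }
           }
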
    35. Speculative Exception Checking (flow). On the multi-core CPU, the HJ runtime on the JVM performs the data transfer to the GPU and the kernel invocation, then runs the exception checking (in parallel) while the many-core GPU computes; after the data transfer from the GPU and cleanup, the runtime asks whether an exception occurred: if yes, it falls back to the original code on the JVM; if no, the speculative GPU results stand.
