Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.

  • Hello, everyone. Welcome to my talk.
    My name is Akihiro Hayashi and I’m a postdoc at Rice University.

    Today I’ll be talking about accelerating Habanero-Java programs with OpenCL generation.
  • Let me first talk about the background.

    Programming models for GPGPU, such as CUDA and OpenCL,
    can enable significant performance and energy improvements for certain classes of applications.
    These programming models provide a C-like kernel language and low-level APIs, which are usually accessed from C/C++.

    On the other hand,
    high-level languages such as Java provide high-productivity features including type safety, garbage collection, and precise exception semantics.
  • But if you want to use a GPU from Java, you have to write a non-trivial amount of application code.

    Here is an example.
    There are two kinds of code here.

    The code shown on your left will be executed on the host, and the code shown on your right will be executed on the device.
    In the host code, you can see JNI calls that get and release the pointer to a Java array, and several OpenCL API calls for memory allocation, data transfer, kernel compilation, and kernel invocation.
    In the kernel code, you can see C-like kernel code written in an SPMD manner.

    As you can see, utilizing a GPU from Java adds a non-trivial amount of work.
  • In past work, Rootbeer provides a high-level programming model for GPUs.
    We don’t need to write JNI and OpenCL code anymore, but it requires the programmer to write a special class and call a special API to invoke the GPU.
  • In our approach, we propose HJ-OpenCL, which automatically generates OpenCL kernels and JNI glue code from a parallel-for construct named forall.
    HJ-OpenCL is built on top of the Habanero-Java language.
    It also provides a way to maintain precise exception semantics.
  • In HJ-OpenCL, we don’t need to prepare
  • Because it does not check exceptions at runtime.
  • Transcript

    • 1. Accelerating Habanero-Java Programs with OpenCL Generation
      Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar
      Rice University, Houston, Texas, USA
    • 2. Background: GPGPU and Java
      The initial wave of programming models for GPGPU has provided low-level APIs:
        CUDA (NVIDIA)
        OpenCL (Khronos)
        → Often faster than natively running application
      High-level languages such as Java provide high-productivity features:
        Type safety
        Garbage collection
        Precise exception semantics
    • 3. Motivation: GPU Execution From Java
      Host code (JNI and OpenCL):
        JNIEXPORT void JNICALL Java_Test(…) {
          void *aptr = (env)->GetPrimitiveArrayCritical(arrays, 0);
          ...
          /* Create Buffer */
          cl_mem Aobj = clCreateBuffer(context, ...);
          /* Host to Device Communication */
          clEnqueueWriteBuffer(queue, Aobj, ..., aptr, ...);
          /* Kernel Compilation */
          ...
          (env)->ReleasePrimitiveArrayCritical(arrays, aptr, 0);
        }
      Device code (OpenCL kernel):
        __kernel void run(…) {
          int gid = get_global_id(0);
          ...
      Utilizing a GPU from Java adds a non-trivial amount of work
    • 4. Related Work: RootBeer
      RootBeer API:
        public class ArraySum {
          public static void main(String[] args) {
            int[][] arrays = new int[N][M];
            int[] result = new int[N];
            ... arrays initialization ...
            List<Kernel> jobs = new ArrayList<Kernel>();
            for (int i = 0; i < N; i++) {
              jobs.add(new ArraySumKernel(arrays[i], result, i));
            }
            Rootbeer rootbeer = new Rootbeer();
            rootbeer.runAll(jobs);
          }
        }
      Computation body:
        class ArraySumKernel implements Kernel {
          private int[] source;
          private int[] ret;
          private int index;
          public ArraySumKernel(int[] source, int[] ret, int i) {
            this.source = source;
            this.ret = ret;
            this.index = i;
          }
          public void gpuMethod() {
            int sum = 0;
            for (int i = 0; i < source.length; i++) {
              sum += source[i];
            }
            ret[index] = sum;
          }
        }
      Requires special API invocation in addition to the computation body
    • 5. Our Approach: HJ-OpenCL Overview
      Automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct, forall
      Built on top of the Habanero-Java language (PPPJ’11)
      OpenCL acceleration with precise exception semantics
      Our primary contribution
    • 6. Overview of Habanero-Java (HJ) Language
      New language and implementation developed at Rice since 2007
        Derived from Java-based version of X10 language (v1.5) in 2007
        HJ is currently an extension of Java 1.4
        All Java 5 & 6 libraries and classes can be called from HJ programs
      HJ’s parallel extensions are focused on task parallelism
        1. Dynamic task creation & termination: async, finish, force, forall, foreach
        2. Collective and point-to-point synchronization: phaser, next
        3. Mutual exclusion and isolation: isolated
        4. Locality control (task and data distributions): places, here
      Sequential HJ extensions added for convenience
        extern, point, region, pointwise for, complex data type, array views
      Habanero-C and Habanero-Scala are also available with similar constructs
    • 7. HJ-OpenCL Example
      HJ-OpenCL implementation:
        public class ArraySum {
          public static void main(String[] args) {
            int[] base = new int[N*M];
            int[] result = new int[N];
            int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]);
            ... initialization ...
            boolean isSafe = ...;
            safe(isSafe) {   // safe construct for precise exception semantics
              forall(point [i] : [0:N-1]) {
                result[i] = 0;
                for (int j = 0; j < M; j++) {
                  result[i] += arrays[i,j];
                }
              }
            }
          }
        }
      → Programmers can utilize OpenCL by just putting forall
    • 8. The compilation flow
      The HJ compiler translates an HJ program into three files:
        .class files on JVM (bytecode)
        OpenCL_hjstub.c (JNI glue code) → C compiler → native library (.so, .dll, .dylib)
        OpenCLKernel.class (bytecode) → APARAPI translator → OpenCL kernel (Kernel.c)
      Diagram labels: JVM, Host, JNI, Device, OpenCL
    • 9. APARAPI
      Open source project for data-parallel Java
      https://code.google.com/p/aparapi/
      APARAPI converts Java bytecode to OpenCL at runtime
        Kernel kernel = new Kernel() {
          @Override public void run() {
            int i = getGlobalId();
            result[i] = intA[i] + inB[i];
          }
        };
        Range range = Range.create(result.length);
        kernel.execute(range);
      → We prepared a static version of APARAPI to reduce runtime overhead
    • 10. Code Generation Demo
    • 11. Acceleration vs. Exception Semantics
                      Safe?   High Performance?
        Java          Yes     No
        OpenCL/CUDA   No      Yes
      Picture is borrowed from http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html
    • 12. For Precise Exception Semantics on GPUs
      “safe” language construct
        safe (cond) { … }
      Programmers specify the safe condition
      Can be useful for testing too
    • 13. Safe construct for exception semantics
        safe (cond) { … }
      Asserts that no exception will be thrown inside the body
      HJ implementation:
        boolean no_excp = …;
        safe (no_excp) {
          // mapped to GPU
          forall () { … }
        }
      Generated code:
        boolean no_excp = …;
        if (no_excp) {
          OpenCL_exec(); // JNI
        } else {
          forall() {} // On JVM
        }
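The dispatch on slide 13 can be mimicked in plain Java. The following is only a sketch under stated assumptions: `SafeDispatchSketch`, `runOnDevice`, `runOnJvm`, and `lastPath` are hypothetical names, the "device" path is an ordinary loop standing in for the JNI call (`OpenCL_exec()` on the slide), and a parallel stream stands in for `forall` on the JVM.

```java
import java.util.stream.IntStream;

final class SafeDispatchSketch {
    // Records which path ran last; used for illustration only.
    static String lastPath = "none";

    // Hypothetical stand-in for the JNI call into generated OpenCL code.
    // A real kernel would perform no Java exception checks here.
    static void runOnDevice(int[] result, int n) {
        lastPath = "device";
        for (int i = 0; i < n; i++) {
            result[i] = i;
        }
    }

    // Fallback path: runs on the JVM, so normal Java exception
    // semantics (for example bounds checks) still apply.
    static void runOnJvm(int[] result, int n) {
        lastPath = "jvm";
        IntStream.range(0, n).parallel().forEach(i -> result[i] = i);
    }

    // Mirrors the shape "if (no_excp) { device } else { JVM }".
    static void safeForall(int[] result, int n) {
        boolean noExcp = result.length >= n; // programmer-supplied safe condition
        if (noExcp) {
            runOnDevice(result, n);
        } else {
            runOnJvm(result, n);
        }
    }
}
```

When the condition is false, the loop runs on the JVM and any out-of-bounds access surfaces as an ordinary Java exception.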
    • 14. Example of Safe Construct
      Example 1: array bounds checking
        boolean isSafe = result.length >= N;   // exception checking
        safe(isSafe) {
          forall(point [i] : [0:N-1]) {
            result[i] = i;
          }
        }
    • 15. Example of Safe Construct (Cont’d)
      Example 2: indirect array access
        // Exception checking
        boolean isSafe = true;
        for (int i = 0; i < N; i++) {
          if (index[i] >= result.length) isSafe = false;
        }
        safe(isSafe) {
          forall(point [i] : [0:N-1]) {
            for (int j = 0; j < M; j++) {
              result[index[i]] += A[j] * B[i, j];   // indirect access
            }
          }
        }
      Checks whether any element of index is greater than or equal to result.length
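The pre-check for indirect accesses can be written as a small helper. This is a sketch rather than code from the talk: `IndirectAccessCheck` and `allIndicesInBounds` are hypothetical names, and the negative-index test and the `index.length` test are additions not shown on the slide, which tests only the upper bound.

```java
final class IndirectAccessCheck {
    // Returns true when result[index[i]] is in bounds for every i in [0, n).
    // The slide checks only index[i] >= targetLength; the other two checks
    // are assumptions added here for completeness.
    static boolean allIndicesInBounds(int[] index, int n, int targetLength) {
        if (index.length < n) {
            return false; // index[i] itself would be out of bounds
        }
        for (int i = 0; i < n; i++) {
            if (index[i] < 0 || index[i] >= targetLength) {
                return false;
            }
        }
        return true;
    }
}
```

The returned value would play the role of `isSafe` in the `safe(isSafe) { ... }` block.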
    • 16. “next” construct for global barrier synchronization on GPUs
      Semantics
        Wait until all threads reach the synchronization point
      Note that OpenCL does not support an all-to-all barrier as a kernel language feature
        The HJ compiler internally partitions the forall loop body into blocks separated by synchronization points
    • 17. next construct (cont’d)
        forall (point [i]:[0:n-1]) {
          method1(i);
          // synchronization point 1
          next;
          method2(i);
          // synchronization point 2
          next;
        }
      Execution:
        Thread0        Thread1
        method1(0);    method1(1);
        WAIT
        method2(0);    method2(1);
        WAIT
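One way to picture the partitioning is to split the loop body at each `next` and run every block as its own parallel loop, so that the end of one loop acts as the global barrier. The sketch below does this with Java parallel streams. `BarrierSplitSketch` and the two phase bodies are hypothetical, and how the HJ compiler actually launches each block on the device is not stated on the slides.

```java
import java.util.stream.IntStream;

final class BarrierSplitSketch {
    static int[] run(int n) {
        int[] a = new int[n];
        int[] b = new int[n];

        // Block 1: the code before the first "next" (plays the role of method1(i)).
        IntStream.range(0, n).parallel().forEach(i -> a[i] = i + 1);

        // The first forEach returns only after every iteration has finished,
        // which gives the all-to-all barrier between the two blocks.

        // Block 2: the code after the barrier (plays the role of method2(i)).
        // It reads a neighbour's block-1 result, so it relies on the barrier.
        IntStream.range(0, n).parallel().forEach(i -> b[i] = a[i] + a[(i + 1) % n]);

        return b;
    }
}
```

Without the split, iteration `i` could read `a[(i + 1) % n]` before the neighbouring iteration had written it.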
    • 18. “ArrayView” for Supporting Contiguous Multidimensional Arrays
      HJ ArrayView is backed by a one-dimensional Java array
      Enables reduction of data transfer between host and device
      Figure: a Java array accessed as A[i][j] next to an HJ array view accessed as A[i, j]; the element A[0][1] corresponds to A[0,1]
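A two-dimensional view over a flat array can be sketched in plain Java as follows. `ArrayView2D` is a hypothetical class, not the HJ `arrayView` API, and the row-major layout is an assumption. The point of the flat backing array is that it is one contiguous block, whereas a Java `int[N][M]` is N separate row objects.

```java
final class ArrayView2D {
    private final int[] base;
    private final int rows;
    private final int cols;

    ArrayView2D(int[] base, int rows, int cols) {
        if (rows < 0 || cols < 0 || base.length < rows * cols) {
            throw new IllegalArgumentException("backing array too small");
        }
        this.base = base;
        this.rows = rows;
        this.cols = cols;
    }

    // Maps (i, j) to a flat offset, assuming row-major order.
    private int offset(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return i * cols + j;
    }

    int get(int i, int j) {
        return base[offset(i, j)];
    }

    void set(int i, int j, int value) {
        base[offset(i, j)] = value;
    }
}
```

For example, with a 2x3 view over `{0, 1, 2, 3, 4, 5}`, `get(1, 0)` reads flat element 3.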
    • 19. Benchmarks
        Benchmark       Suite       Data Size                  Next?
        Blackscholes    -           16,777,216 options         No
        Crypt           JGF         N = 50,000,000             No
        MatMult         -           1024x1024                  No
        Doitgen         Polybench   128x128x128                No
        MRIQ            Parboil     64x64x64                   No
        Syrk            Polybench   2048x2048                  No
        Jacobi          Polybench   T = 50, N = 134,217,728    No
        SparseMatmult   JGF         N = 500,000                No
        Spectral-norm   CLBG        N = 2,000                  Yes
        SOR             JGF         N = 2,000                  Yes
    • 20. Platforms
      AMD A10-5800K
        CPU: 4 cores
        GPU: Radeon HD 7660D, 384 cores
        Java Runtime: JRE (build 1.6.0_21-b06)
        JVM: HotSpot 64-Bit Server VM (build 17.0-b11, mixed mode)
      Westmere
        CPU: 6 cores x 2, Xeon 5660
        GPU: NVIDIA Tesla M2050, 448 cores
        Java Runtime: JRE (build 1.6.0_25-b06)
        JVM: HotSpot 64-Bit Server VM (build 20.0-b11, mixed mode)
    • 21. Experimental Methodologies
      We tested execution in the following modes:
        Sequential Java
        HJ (on JVM)
          Sequential HJ
          Parallel HJ
        HJ-OpenCL with Safe Construct (on Device)
          OpenCL CPU
          OpenCL GPU
    • 22. Result on AMD A10-5800K
      Speedup relative to Sequential Java (chart plotted on a log scale)
        Benchmark       Sequential HJ   Parallel HJ   HJ OpenCL CPU   HJ OpenCL GPU
        Black-Scholes   0.99            2.06          4.75            8.88
        Crypt           1               1.99          3.01            3.59
        MatMult         0.21            0.4           0.72            12.91
        Doitgen         0.78            1.35          2.89            0.19
        MRIQ            1.01            2.02          6.28            21.19
        Syrk            0.99            1.92          2.07            0.69
        Jacobi          0.96            1.88          36.71           55.01
        SparseMatMult   0.98            1.88          2.43            2.08
        Spectral-norm   1.01            2.34          2.06            0.86
        SOR             1.06            1.2           1.19            0.21
    • 23. Result on Westmere
      Speedup relative to Sequential Java (chart plotted on a log scale)
        Benchmark       Sequential HJ   Parallel HJ   HJ OpenCL CPU   HJ OpenCL GPU
        Black-Scholes   1.02            6.22          18.62           37.2
        Crypt-C         0.98            5.64          4.73            13.91
        MatMult         1.62            6.88          9.98            43.56
        Doitgen         0.99            5.06          5.91            2.82
        MRIQ            1.01            6.1           29.26           324.22
        Syrk            1.04            6.26          3.55            1.17
        Jacobi          1               2.96          35.68           36.62
        SparseMatMult   0.97            4.86          1.68            6.63
        Spectral-norm   0.97            10.16         10.22           28.13
        SOR             0.97            3.18          2.93            1.22
    • 24. Slowdown for exception checking
      On A10-5800K
        Benchmark       CPU     GPU
        BlackScholes    0.99    1.02
        Crypt           0.99    0.99
        MatMult         1.00    1.00
        Doitgen         1.04    1.00
        MRIQ            1.03    1.00
        Syrk            0.99    1.00
        Jacobi          1.00    0.97
        SparseMatmult   0.94    0.91
        Spectral-Norm   0.98    1.00
        SOR             0.98    1.00
      On Westmere
        Benchmark       CPU     GPU
        BlackScholes    0.98    0.95
        Crypt           0.98    0.94
        MatMult         0.98    0.99
        Doitgen         0.99    1.00
        MRIQ            1.00    0.98
        Syrk            1.00    1.00
        Jacobi          1.00    0.99
        SparseMatmult   0.97    0.68
        Spectral-Norm   1.00    0.99
        SOR             1.02    1.00
      Callout on slide: Indirect access
    • 25. Related Work: High-level language to GPU code
      Lime (PLDI’12): JVM compatible language
      RootBeer: Compiles Java bytecode to CUDA
      X10 and Chapel: Provides programming model for CUDA
      Sponge (ASPLOS’11): Compiles StreamIt to CUDA
      → None of these approaches considers Java exception semantics
    • 26. Related Work: Exception Semantics in Java
      Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00)
        Generates exception-safe and -unsafe regions of code.
      Wurthinger et al. (PPPJ’07)
        Proposes an algorithm on Static Single Assignment (SSA) form for the JIT compiler which eliminates unnecessary bounds checking.
      ABCD (PLDI’00)
        Provides an array bounds checking elimination algorithm, which is based on graph traversal on an extended SSA form.
      Jeffery et al. (Concurrency and Computation: Practice and Experience, ’09)
        Proposes a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler.
    • 27. Conclusions: HJ-OpenCL
      Programmers can utilize OpenCL by just putting the “forall” construct
      “safe” construct for precise exception semantics
      “next” construct for barrier synchronization
      Performance improvement
        up to 55x speedup on AMD APU
        up to 324x speedup on NVIDIA GPU
    • 28. Future Work
      Speculative exception checking
        Speculative Execution of Parallel Programs with Precise Exception Semantics. A. Hayashi et al. (LCPC’13)
      Automatic generation of exception checking code
