Accelerating Habanero-Java
Programs with OpenCL Generation
Akihiro Hayashi, Max Grossman,
Jisheng Zhao, Jun Shirako, Vivek Sarkar
Rice University, Houston, Texas, USA
Background:
GPGPU and Java
The initial wave of programming models for
GPGPU has provided low-level APIs:
CUDA (NVIDIA)
OpenCL (Khronos)
→ Often faster than running the application natively
High-level languages such as Java provide
high-productivity features:
Type safety
Garbage Collection
Precise Exception Semantics
Motivation:
GPU Execution From Java
JNI:
JNIEXPORT void JNICALL Java_Test (…) {
    void *aptr = (env)->GetPrimitiveArrayCritical(arrays, 0);
    ...
    /* Create Buffer */
    cl_mem Aobj = clCreateBuffer(context, ...);
    /* Host to Device Communication */
    clEnqueueWriteBuffer(queue, Aobj, ..., aptr, ...);
    /* Kernel Compilation */
    ...
    (env)->ReleasePrimitiveArrayCritical(arrays, aptr, 0);
}
OpenCL Kernel:
__kernel
void run(…) {
    int gid = get_global_id(0);
    ...
}
Using a GPU from Java adds a non-trivial amount of work
Related Work:
RootBeer
Code callouts: RootBeer API (job setup and launch in main), computation body (gpuMethod)
public class ArraySum {
public static void main(String[] args) {
int[][] arrays = new int[N][M];
int[] result = new int[N];
... arrays initialization ...
List<Kernel> jobs =
new ArrayList<Kernel>();
for(int i = 0; i < N; i++) {
jobs.add(new ArraySumKernel(arrays[i],
result, i));
}
Rootbeer rootbeer = new Rootbeer();
rootbeer.runAll(jobs); } }
class ArraySumKernel implements Kernel {
private int[] source;
private int[] ret;
private int index;
public ArraySumKernel(int[] source,
int[] ret, int i) {
this.source = source;
this.ret = ret; this.index = i;
}
public void gpuMethod() {
int sum = 0;
for(int i = 0; i < source.length; i++) {
sum += source[i];
}
ret[index] = sum;
}
}
Requires special API invocations in addition
to the computation body
Our Approach:
HJ-OpenCL Overview
• Automatic generation of OpenCL kernels
and JNI glue code from the parallel-for
construct forall
Built on top of the Habanero-Java
language
(PPPJ’11)
OpenCL acceleration with precise
exception semantics
Our primary contribution
Overview of Habanero-Java (HJ) Language
• New language and implementation developed at Rice since 2007
  • Derived from the Java-based version of the X10 language (v1.5) in 2007
  • HJ is currently an extension of Java 1.4
  • All Java 5 & 6 libraries and classes can be called from HJ programs
• HJ’s parallel extensions are focused on task parallelism
  1. Dynamic task creation & termination: async, finish, force, forall, foreach
  2. Collective and point-to-point synchronization: phaser, next
  3. Mutual exclusion and isolation: isolated
  4. Locality control (task and data distributions): places, here
• Sequential HJ extensions added for convenience
  • extern, point, region, pointwise for, complex data type, array views
• Habanero-C and Habanero-Scala are also available with similar constructs
HJ-OpenCL Example
HJ-OpenCL implementation:
public class ArraySum {
public static void main(String[] args) {
int[] base = new int[N*M];
int[] result = new int[N];
int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]);
... initialization ...
boolean isSafe = ...;
safe(isSafe) {
forall(point [i] : [0:N-1]) {
result[i] = 0;
for(int j=0; j<M; j++) {
result[i] += arrays[i,j];
}
}
}
}
}
→ Programmers can use OpenCL just by writing a forall construct
The safe construct provides precise exception semantics
The compilation flow
The HJ compiler translates an HJ program into three files:
  1. .class files (bytecode), which run on the JVM
  2. OpenCL_hjstub.c (JNI glue code) → C compiler → native library (.so, .dll, .dylib)
  3. OpenCLKernel.class (bytecode) → APARAPI translator → OpenCL kernel (Kernel.c)
Execution: JVM (host) → JNI → native library → OpenCL → device
APARAPI
Open-source project for data-parallel Java
https://code.google.com/p/aparapi/
• APARAPI converts Java bytecode to OpenCL at runtime
Kernel kernel = new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        result[i] = intA[i] + intB[i];
    }
};
Range range = Range.create(result.length);
kernel.execute(range);
→ We prepared a static version of APARAPI to reduce runtime overhead
Code Generation Demo
Acceleration vs. Exception Semantics
              Safe?   High Performance?
Java          Yes     No
OpenCL/CUDA   No      Yes
Picture is borrowed from
http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html
For Precise Exception Semantics
on GPUs
“safe” language construct
Programmers specify the safe condition
Can be useful for testing too
Safe construct for exception semantics
safe (cond) { … }
Asserts that no exception will be thrown inside the body
HJ Implementation:
boolean no_excp = …;
safe (no_excp) {
    // mapped to GPU
    forall () {
        …
    }
}
Generated Code:
boolean no_excp = …;
if (no_excp) {
    OpenCL_exec(); // JNI
} else {
    forall() {} // On JVM
}
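The generated-code pattern above can be mimicked in plain Java. This is only a sketch of the dispatch idea: `SafeDispatch`, `runSafe`, and the two `Runnable` arguments are hypothetical names, and the "accelerated" path here is an ordinary Java lambda standing in for the JNI call to `OpenCL_exec()`.

```java
final class SafeDispatch {
    /**
     * Runs the accelerated path only when the programmer-supplied condition
     * says no exception can occur; otherwise runs the fallback.
     * Returns which path was taken (for illustration only).
     */
    static String runSafe(boolean noException, Runnable accelerated, Runnable fallback) {
        if (noException) {
            accelerated.run();   // stand-in for OpenCL_exec() through JNI
            return "device";
        }
        fallback.run();          // stand-in for the original forall on the JVM
        return "jvm";
    }

    public static void main(String[] args) {
        int n = 4;
        int[] result = new int[n];
        // Same loop body for both paths; only the execution target would differ.
        Runnable body = () -> {
            for (int i = 0; i < n; i++) {
                result[i] = i;
            }
        };
        System.out.println(runSafe(result.length >= n, body, body));
    }
}
```

When the condition is false, the JVM path runs the loop with normal Java checks, so any exception is raised exactly where plain Java would raise it.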
Example of Safe Construct
Example 1: array bounds checking
Exception checking:
boolean isSafe = result.length >= N; // every i in [0:N-1] is a valid index
safe(isSafe) {
    forall(point [i] : [0:N-1]) {
        result[i] = i;
    }
}
Example of Safe Construct (Cont’d)
Example 2: indirect array access
Exception checking (checks that every element of index is a valid index into result):
boolean isSafe = true;
for (int i = 0; i < N; i++) {
    if (index[i] < 0 || index[i] >= result.length) isSafe = false;
}
safe(isSafe) {
    forall(point [i] : [0:N-1]) {
        for (int j = 0; j < M; j++) {
            result[index[i]] += A[j] * B[i, j]; // indirect access
        }
    }
}
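The pre-check for indirect accesses can be written as a small reusable helper. The sketch below is plain Java, not HJ; `IndirectAccessCheck` and `allInBounds` are hypothetical names, and the helper covers only the subscripts of `result`, not the accesses to `A` and `B`.

```java
final class IndirectAccessCheck {
    /** Returns true when every element of index is a valid subscript for an array of the given length. */
    static boolean allInBounds(int[] index, int length) {
        for (int idx : index) {
            if (idx < 0 || idx >= length) {
                return false;   // this access would throw ArrayIndexOutOfBoundsException
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] result = new int[4];
        System.out.println(allInBounds(new int[] {0, 3, 1}, result.length)); // true
        System.out.println(allInBounds(new int[] {0, 4, 1}, result.length)); // false
    }
}
```

The check costs one pass over `index` on the host, which is the overhead the slowdown tables later in the deck measure.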
“next” construct for global barrier
synchronization on GPUs
Semantics
• Wait until all threads reach the synchronization point
Note that OpenCL does not support an all-to-all
barrier as a kernel language feature
(its barrier() synchronizes only the work-items of one work-group)
• The HJ compiler internally partitions the forall loop
body into blocks separated by synchronization points
next construct (cont’d)
forall (point [i]:[0:n-1]) {
    method1(i);
    // synchronization point 1
    next;
    method2(i);
    // synchronization point 2
    next;
}
Execution timeline:
Thread0        Thread1
method1(0);    method1(1);
------ WAIT ------
method2(0);    method2(1);
------ WAIT ------
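The partitioning idea can be illustrated on the JVM with two consecutive parallel loops, one per block. This is an analogy in plain Java, not the HJ compiler's output: `NextAsPhases` and the two phase bodies are hypothetical, and the slides do not say how the partitioned blocks are launched on the device.

```java
import java.util.stream.IntStream;

final class NextAsPhases {
    /** Splits a loop body at a synchronization point into two parallel phases. */
    static int[] run(int n) {
        int[] a = new int[n];
        int[] b = new int[n];

        // Phase 1: the statements before `next` (method1 in the slide).
        IntStream.range(0, n).parallel().forEach(i -> a[i] = i + 1);

        // forEach returns only after all iterations have finished,
        // so the boundary between the two loops acts as the global barrier.

        // Phase 2: the statements after `next` (method2 in the slide).
        // Reading a neighbour's phase-1 value is safe because of the barrier.
        IntStream.range(0, n).parallel().forEach(i -> b[i] = a[i] + a[(i + 1) % n]);

        return b;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4))); // [3, 5, 7, 5]
    }
}
```

Without the split, iteration `i` could read `a[(i + 1) % n]` before its neighbour has written it, which is the hazard `next` rules out.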
“ArrayView” for Supporting
Contiguous Multidimensional Arrays
• An HJ ArrayView is backed by a one-dimensional Java array
• Reduces data transfer between host and device
Figure: a Java array A[i][j] (e.g. A[0][1]) is an array of separately allocated rows; an HJ array view A[i, j] (e.g. A[0,1]) indexes a single contiguous one-dimensional array.
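A two-dimensional view over a flat array can be sketched in plain Java as follows. This is not HJ's ArrayView implementation: `IntArrayView2D` is a hypothetical class, and the row-major mapping `i * cols + j` is an assumption made for illustration.

```java
final class IntArrayView2D {
    private final int[] base;   // single contiguous backing array
    private final int rows;
    private final int cols;

    IntArrayView2D(int[] base, int rows, int cols) {
        if (rows < 0 || cols < 0 || (long) rows * cols > base.length) {
            throw new IllegalArgumentException("view does not fit in the backing array");
        }
        this.base = base;
        this.rows = rows;
        this.cols = cols;
    }

    /** Maps a 2D index to a position in the backing array (row-major, assumed). */
    private int offset(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new ArrayIndexOutOfBoundsException("[" + i + "," + j + "]");
        }
        return i * cols + j;
    }

    int get(int i, int j) { return base[offset(i, j)]; }

    void set(int i, int j, int value) { base[offset(i, j)] = value; }

    /** Sums each row, mirroring the ArraySum example from the earlier slide. */
    static int[] rowSums(IntArrayView2D view) {
        int[] result = new int[view.rows];
        for (int i = 0; i < view.rows; i++) {
            for (int j = 0; j < view.cols; j++) {
                result[i] += view.get(i, j);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] base = {1, 2, 3, 4, 5, 6};
        IntArrayView2D view = new IntArrayView2D(base, 2, 3);
        System.out.println(java.util.Arrays.toString(rowSums(view))); // [6, 15]
    }
}
```

Because all elements live in one `int[]`, a host-to-device copy needs a single contiguous transfer, whereas an `int[N][M]` would need one transfer per row or a packing step first.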
Benchmarks
Benchmark       Suite       Data Size                  Next?
Blackscholes    -           16,777,216 options         No
Crypt           JGF         N = 50,000,000             No
MatMult         -           1024x1024                  No
Doitgen         Polybench   128x128x128                No
MRIQ            Parboil     64x64x64                   No
Syrk            Polybench   2048x2048                  No
Jacobi          Polybench   T = 50, N = 134,217,728    No
SparseMatmult   JGF         N = 500,000                No
Spectral-norm   CLBG        N = 2,000                  Yes
SOR             JGF         N = 2,000                  Yes
Platforms
AMD A10-5800K
  CPU: 4 cores
  GPU: Radeon HD 7660D, 384 cores
  Java Runtime: JRE (build 1.6.0_21-b06)
  JVM: HotSpot 64-Bit Server VM (build 17.0-b11, mixed mode)
Westmere
  CPU: 6 cores x 2, Xeon 5660
  GPU: NVIDIA Tesla M2050, 448 cores
  Java Runtime: JRE (build 1.6.0_25-b06)
  JVM: HotSpot 64-Bit Server VM (build 20.0-b11, mixed mode)
Experimental Methodologies
We tested execution in the following modes:
Sequential Java
HJ (on JVM)
  Sequential HJ
  Parallel HJ
HJ-OpenCL with Safe Construct (on Device)
  OpenCL CPU
  OpenCL GPU
Result on AMD A10-5800K
Speedup relative to Sequential Java (plotted on a log scale in the original chart)
Benchmark       Sequential HJ   Parallel HJ   HJ OpenCL CPU   HJ OpenCL GPU
Black-Scholes   0.99            2.06          4.75            8.88
Crypt           1               1.99          3.01            3.59
MatMult         0.21            0.4           0.72            12.91
Doitgen         0.78            1.35          2.89            0.19
MRIQ            1.01            2.02          6.28            21.19
Syrk            0.99            1.92          2.07            0.69
Jacobi          0.96            1.88          36.71           55.01
SparseMatMult   0.98            1.88          2.43            2.08
Spectral-norm   1.01            2.34          2.06            0.86
SOR             1.06            1.2           1.19            0.21
Result on Westmere
Speedup relative to Sequential Java (plotted on a log scale in the original chart)
Benchmark       Sequential HJ   Parallel HJ   HJ OpenCL CPU   HJ OpenCL GPU
Black-Scholes   1.02            6.22          18.62           37.2
Crypt-C         0.98            5.64          4.73            13.91
MatMult         1.62            6.88          9.98            43.56
Doitgen         0.99            5.06          5.91            2.82
MRIQ            1.01            6.1           29.26           324.22
Syrk            1.04            6.26          3.55            1.17
Jacobi          1               2.96          35.68           36.62
SparseMatMult   0.97            4.86          1.68            6.63
Spectral-norm   0.97            10.16         10.22           28.13
SOR             0.97            3.18          2.93            1.22
Slowdown for exception checking
On A10-5800K
Device   BlackScholes   Crypt   MatMult   Doitgen   MRIQ   Syrk   Jacobi   SparseMatmult   Spectral-Norm   SOR
CPU      0.99           0.99    1.00      1.04      1.03   0.99   1.00     0.94            0.98            0.98
GPU      1.02           0.99    1.00      1.00      1.00   1.00   0.97     0.91            1.00            1.00
On Westmere
Device   BlackScholes   Crypt   MatMult   Doitgen   MRIQ   Syrk   Jacobi   SparseMatmult   Spectral-Norm   SOR
CPU      0.98           0.98    0.98      0.99      1.00   1.00   1.00     0.97            1.00            1.02
GPU      0.95           0.94    0.99      1.00      0.98   1.00   0.99     0.68            0.99            1.00
Callout in the original slide: Indirect access
Related Work:
High-level language to GPU code
Lime (PLDI’12)
  JVM-compatible language
RootBeer
  Compiles Java bytecode to CUDA
X10 and Chapel
  Provide a programming model for CUDA
Sponge (ASPLOS’11)
  Compiles StreamIt to CUDA
→ None of these approaches considers Java
exception semantics
Related Work:
Exception Semantics in Java
• Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00)
  • Generate exception-safe and exception-unsafe regions of code.
• Würthinger et al. (PPPJ’07)
  • Propose an algorithm on Static Single Assignment (SSA) form for the JIT compiler that eliminates unnecessary bounds checking.
• ABCD (PLDI’00)
  • Provides an array bounds check elimination algorithm based on graph traversal of an extended SSA form.
• Jeffery et al. (Concurrency and Computation: Practice and Experience, ’09)
  • Propose a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler.
Conclusions:
HJ-OpenCL
Programmers can use OpenCL just by
writing a “forall” construct
“safe” construct for precise exception
semantics
“next” construct for barrier synchronization
Performance improvement
Up to 55x speedup on AMD APU
Up to 324x speedup on NVIDIA GPU
Future Work
Speculative Exception Checking
Speculative Execution of Parallel Programs
with Precise Exception Semantics. A. Hayashi
et al. (LCPC’13)
Automatic generation of exception-checking
code

power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 

Accelerating Habanero-Java Programs with OpenCL Generation

  • 1. Accelerating Habanero-Java Programs with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. Rice University, Houston, Texas, USA.
  • 2. Background: GPGPU and Java. The initial wave of programming models for GPGPU has provided low-level APIs: CUDA (NVIDIA) and OpenCL (Khronos). → Often faster than running the application natively. High-level languages such as Java provide high-productivity features: type safety, garbage collection, and precise exception semantics.
  • 3. Motivation: GPU Execution From Java. Host code (JNI and OpenCL): JNIEXPORT void JNICALL Java_Test (…) { void *aptr = (env)->GetPrimitiveArrayCritical(arrays, 0); ... /* Create Buffer */ cl_mem Aobj = clCreateBuffer(context, ...); /* Host to Device Communication */ clEnqueueWriteBuffer(queue, Aobj, ..., aptr, ...); /* Kernel Compilation */ ... (env)->ReleasePrimitiveArrayCritical(arrays, aptr, 0); } OpenCL kernel: __kernel void run(…) { int gid = get_global_id(0); ... Utilizing a GPU from Java adds a non-trivial amount of work.
  • 4. Related Work: RootBeer. Computation body and RootBeer API: public class ArraySum { public static void main(String[] args) { int[][] arrays = new int[N][M]; int[] result = new int[N]; ... arrays initialization ... List<Kernel> jobs = new ArrayList<Kernel>(); for(int i = 0; i < N; i++) { jobs.add(new ArraySumKernel(arrays[i], result, i)); } Rootbeer rootbeer = new Rootbeer(); rootbeer.runAll(jobs); } } class ArraySumKernel implements Kernel { private int[] source; private int[] ret; private int index; public ArraySumKernel(int[] source, int[] ret, int i) { this.source = source; this.ret = ret; this.index = i; } public void gpuMethod() { int sum = 0; for(int i = 0; i < source.length; i++) { sum += source[i]; } ret[index] = sum; } } Requires special API invocation in addition to the computation body.
  • 5. Our Approach: HJ-OpenCL Overview. Automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct, forall. Built on top of the Habanero-Java Language (PPPJ’11). OpenCL acceleration with precise exception semantics: our primary contribution.
  • 6. Overview of Habanero-Java (HJ) Language. New language and implementation developed at Rice since 2007. Derived from the Java-based version of the X10 language (v1.5) in 2007. HJ is currently an extension of Java 1.4. All Java 5 & 6 libraries and classes can be called from HJ programs. HJ’s parallel extensions are focused on task parallelism: 1. Dynamic task creation & termination: async, finish, force, forall, foreach; 2. Collective and point-to-point synchronization: phaser, next; 3. Mutual exclusion and isolation: isolated; 4. Locality control (task and data distributions): places, here. Sequential HJ extensions added for convenience: extern, point, region, pointwise for, complex data type, array views. Habanero-C and Habanero-Scala are also available with similar constructs.
  • 7. HJ-OpenCL Example (HJ OpenCL implementation): public class ArraySum { public static void main(String[] args) { int[] base = new int[N*M]; int[] result = new int[N]; int[.] arrays = new arrayView(base, 0, [0:N-1,0:M-1]); ... initialization ... boolean isSafe = ...; safe(isSafe) { forall(point [i] : [0:N-1]) { result[i] = 0; for(int j=0; j<M; j++) { result[i] += arrays[i,j]; } } } } } → Programmers can utilize OpenCL by just putting forall. Safe construct for precise exception semantics.
  • 8. The compilation flow. Diagram labels: HJ Program; .class files on JVM (bytecode); OpenCL_hjstub.c (JNI glue code); OpenCLKernel.class (bytecode); HJ Compiler; C compiler; APARAPI Translator; OpenCL Kernel; Kernel.c; Native library (.so, .dll, .dylib); JVM; Host; JNI; Device; OpenCL. The program is translated into three files.
  • 9. APARAPI Open Source Project for data parallel Java https://code.google.com/p/aparapi/  APARAPI converts Java bytecode to OpenCL at runtime 9 Kernel kernel = new Kernel(){ @Override public void run(){ int i= getGlobalId(); result[i]=intA[i]+inB[i]; } }; Range range = Range.create(result.length); kernel.execute(range); →we prepared static version of APARAPI to reduce runtime overhead
  • 11. Acceleration vs. Exception Semantics. Java: safe, yes; high performance, no. OpenCL/CUDA: safe, no; high performance, yes. Picture is borrowed from http://www.boston.com/bigpicture/2008/09/the_singapore_grand_prix.html
  • 12. For Precise Exception Semantics on GPUs: the “safe” language construct, safe (cond) { … }. Programmers specify the safe condition. Can be useful for testing too.
  • 13. Safe construct for exception semantics: safe (cond) { … } asserts that no exception will be thrown inside the body. HJ implementation: boolean no_excp = …; safe (no_excp) { /* mapped to GPU */ forall () { … } } Generated code: boolean no_excp = …; if (no_excp) { OpenCL_exec(); /* JNI */ } else { forall() {} /* on JVM */ }
  • 14. Example of Safe Construct. Example 1: array bounds checking. Exception checking: boolean isSafe = result.length >= N; safe(isSafe) { forall(point [i] : [0:N-1]) { result[i] = i; } }
  • 15. Example of Safe Construct (Cont’d). Example 2: indirect array access. Exception checking: boolean isSafe = true; for (int i = 0; i < N; i++) { if (index[i] >= result.length) isSafe = false; } safe(isSafe) { forall(point [i] : [0:N-1]) { for (int j = 0; j < M; j++) { result[index[i]] += A[j] * B[i, j]; } } } Indirect accesses: result[index[i]]. The check sets isSafe to false if any element of index is greater than or equal to result.length.
  • 16. “next” construct for global barrier synchronization on GPUs. Semantics: wait until all threads reach the synchronization point. Note that OpenCL does not support an all-to-all barrier as a kernel language feature. The HJ compiler internally partitions the forall loop body into blocks separated by synchronization points.
  • 17. next construct (cont’d): forall (point [i]:[0:n-1]) { method1(i); /* synchronization point 1 */ next; method2(i); /* synchronization point 2 */ next; } Timeline: Thread0 runs method1(0) and Thread1 runs method1(1); both WAIT; then method2(0) and method2(1) run; both WAIT.
  • 18. “ArrayView” for supporting contiguous multidimensional arrays. HJ ArrayView is backed by a one-dimensional Java array. Enables reduction of data transfer between host and device. Figure: Java array A[i][j] (for example A[0][1]) versus HJ array view A[i, j] (for example A[0,1]).
  • 19. Benchmarks (benchmark, suite, data size, uses next?): Blackscholes, 16,777,216 options, no; Crypt (JGF), N = 50,000,000, no; MatMult, 1024x1024, no; Doitgen (Polybench), 128x128x128, no; MRIQ (Parboil), 64x64x64, no; Syrk (Polybench), 2048x2048, no; Jacobi (Polybench), T = 50, N = 134,217,728, no; SparseMatmult (JGF), N = 500,000, no; Spectral-norm (CLBG), N = 2,000, yes; SOR (JGF), N = 2,000, yes.
  • 20. Platforms. AMD A10-5800K: CPU, 4 cores; GPU, Radeon HD 7660D, 384 cores; Java runtime, JRE (build 1.6.0_21-b06); JVM, HotSpot 64-Bit Server VM (build 17.0-b11, mixed mode). Westmere: CPU, Xeon 5660, 6 cores x 2; GPU, NVIDIA Tesla M2050, 448 cores; Java runtime, JRE (build 1.6.0_25-b06); JVM, HotSpot 64-Bit Server VM (build 20.0-b11, mixed mode).
  • 21. Experimental Methodologies. We tested execution in the following modes: Sequential Java; HJ (on JVM): Sequential HJ and Parallel HJ; HJ-OpenCL with Safe Construct (on device): OpenCL CPU and OpenCL GPU.
  • 22. Result on AMD A10-5800K. Chart: speedup relative to Sequential Java (log scale). Benchmarks, in order: Black-Scholes, Crypt, MatMult, Doitgen, MRIQ, Syrk, Jacobi, SparseMatMult, Spectral-norm, SOR. Sequential HJ: 0.99, 1, 0.21, 0.78, 1.01, 0.99, 0.96, 0.98, 1.01, 1.06. Parallel HJ: 2.06, 1.99, 0.4, 1.35, 2.02, 1.92, 1.88, 1.88, 2.34, 1.2. HJ OpenCL CPU: 4.75, 3.01, 0.72, 2.89, 6.28, 2.07, 36.71, 2.43, 2.06, 1.19. HJ OpenCL GPU: 8.88, 3.59, 12.91, 0.19, 21.19, 0.69, 55.01, 2.08, 0.86, 0.21.
  • 23. Result on Westmere. Chart: speedup relative to Sequential Java (log scale). Benchmarks, in order: Black-Scholes, Crypt-C, MatMult, Doitgen, MRIQ, Syrk, Jacobi, SparseMatMult, Spectral-norm, SOR. Sequential HJ: 1.02, 0.98, 1.62, 0.99, 1.01, 1.04, 1, 0.97, 0.97, 0.97. Parallel HJ: 6.22, 5.64, 6.88, 5.06, 6.1, 6.26, 2.96, 4.86, 10.16, 3.18. HJ OpenCL CPU: 18.62, 4.73, 9.98, 5.91, 29.26, 3.55, 35.68, 1.68, 10.22, 2.93. HJ OpenCL GPU: 37.2, 13.91, 43.56, 2.82, 324.22, 1.17, 36.62, 6.63, 28.13, 1.22.
  • 24. Slowdown for exception checking. Columns, in order: Black Scholes, Crypt, MatMult, Doitgen, MRIQ, Syrk, Jacobi, SparseMatmult, Spectral-Norm, SOR. On A10-5800K: CPU 0.99, 0.99, 1.00, 1.04, 1.03, 0.99, 1.00, 0.94, 0.98, 0.98; GPU 1.02, 0.99, 1.00, 1.00, 1.00, 1.00, 0.97, 0.91, 1.00, 1.00. On Westmere: CPU 0.98, 0.98, 0.98, 0.99, 1.00, 1.00, 1.00, 0.97, 1.00, 1.02; GPU 0.95, 0.94, 0.99, 1.00, 0.98, 1.00, 0.99, 0.68, 0.99, 1.00. Callout on the slide: indirect access.
  • 25. Related Work: High-level language to GPU code. Lime (PLDI’12): JVM compatible language. RootBeer: compiles Java bytecode to CUDA. X10 and Chapel: provide a programming model for CUDA. Sponge (ASPLOS’11): compiles StreamIt to CUDA. → None of these approaches considers Java exception semantics.
  • 26. Related Work: Exception Semantics in Java. Artigas et al. (ICS’00) and Moreira et al. (ACM Trans. ’00): generates exception-safe and -unsafe regions of code. Wurthinger et al. (PPPJ’07): proposes an algorithm on Static Single Assignment (SSA) form for the JIT compiler which eliminates unnecessary bounds checking. ABCD (PLDI’00): provides an array bounds checking elimination algorithm, which is based on graph traversal on an extended SSA form. Jeffery et al. (In Concurrency and Computations: Practice and Experience, ’09): proposes a static annotation framework to reduce the overhead of dynamic checking in the JIT compiler.
  • 27. Conclusions: HJ-OpenCL. Programmers can utilize OpenCL by just putting the “forall” construct. “safe” construct for precise exception semantics. “next” construct for barrier synchronization. Performance improvement: up to 55x speedup on AMD APU; up to 324x speedup on NVIDIA GPU.
  • 28. Future Work. Speculative exception checking: “Speculative Execution of Parallel Programs with Precise Exception Semantics”, A. Hayashi et al. (LCPC’13). Automatic generation of exception checking code.
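
Slide 16 says that the HJ compiler partitions the forall body into blocks separated by synchronization points. A rough way to picture that in plain Java, not the compiler's actual output, is to run each block as its own parallel loop: the end of one loop then acts as the all-to-all barrier. The class name `NextBarrierSketch` and the arithmetic in the two phases are made up for illustration.

```java
/** Sketch: a forall body split at a "next" into two parallel phases (hypothetical example). */
final class NextBarrierSketch {

    static int[] run(int n) {
        int[] tmp = new int[n];
        int[] out = new int[n];

        // Phase 1: everything before the first "next".
        java.util.stream.IntStream.range(0, n).parallel()
                .forEach(i -> tmp[i] = 2 * i);

        // forEach returns only after every iteration has finished,
        // so reaching this point plays the role of the global barrier.

        // Phase 2: everything after "next"; it reads a neighbour's phase-1 result,
        // which is only safe because phase 1 has fully completed.
        java.util.stream.IntStream.range(0, n).parallel()
                .forEach(i -> out[i] = tmp[i] + tmp[(i + 1) % n]);

        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4)));
    }
}
```

On a GPU the same idea would correspond to launching one kernel per phase, since returning to the host between launches gives the synchronization that a single kernel cannot express.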

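The ArrayView idea from slide 18 (a multidimensional view backed by one contiguous Java array) can be approximated in plain Java as below. This is only a sketch under assumptions: `IntArrayView2D`, its row-major layout, and the `rowSums` helper are hypothetical and are not HJ's actual `arrayView` implementation.

```java
/** Hypothetical row-major 2D view over a one-dimensional int array. */
final class IntArrayView2D {
    private final int[] base;
    private final int rows;
    private final int cols;

    IntArrayView2D(int[] base, int rows, int cols) {
        if (rows < 0 || cols < 0 || (long) rows * cols > base.length) {
            throw new IllegalArgumentException("view does not fit in the backing array");
        }
        this.base = base;
        this.rows = rows;
        this.cols = cols;
    }

    /** Maps (i, j) to a flat offset, checking each dimension separately. */
    private int offset(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new ArrayIndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return i * cols + j;
    }

    int get(int i, int j) {
        return base[offset(i, j)];
    }

    void set(int i, int j, int value) {
        base[offset(i, j)] = value;
    }

    /** Sums each row, mirroring the ArraySum example from the slides. */
    int[] rowSums() {
        int[] result = new int[rows];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                result[i] += get(i, j);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] base = {1, 2, 3, 4, 5, 6};
        IntArrayView2D view = new IntArrayView2D(base, 2, 3);
        System.out.println(java.util.Arrays.toString(view.rowSums()));
    }
}
```

Because all elements live in one `int[]`, a host-to-device copy needs a single contiguous buffer, whereas a Java `int[][]` is an array of separately allocated rows.
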
Editor's Notes

  1. Hello, everyone. Welcome to my talk. My name is Akihiro Hayashi and I’m a postdoc at Rice University. Today I’ll be talking about accelerating Habanero-Java programs with OpenCL generation.
  2. Let me first talk about the background. Programming models for GPGPU, such as CUDA and OpenCL, can enable significant performance and energy improvements for certain classes of applications. These programming models provide a C-like kernel language and low-level APIs which are usually accessible from C/C++. On the other hand, high-level languages such as Java provide high-productivity features including type safety, garbage collection and precise exception semantics.
  3. But if you want to utilize a GPU from Java, it requires programmers to write a non-trivial amount of application code. Here is an example. There are two kinds of code here. The code shown on your left will be executed on the host, and the code shown on your right will be executed on the device. In the host code, you can see JNI calls which get and release the array pointer of a Java array, and several OpenCL API calls for memory allocation, data transfer, kernel compilation and kernel invocation. In the kernel code, you can see C-like kernel code which is written in SPMD manner. As you can see, utilizing a GPU from Java adds a non-trivial amount of work.
  4. In past work, Rootbeer provides a high-level programming model for GPUs. We don’t need to write JNI and OpenCL code anymore, but it requires the programmer to write a special class and use a special API to invoke the GPU.
  5. In our approach, we propose HJ-OpenCL, which performs automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct named forall. HJ-OpenCL is built on top of the Habanero-Java language. It also provides a way to maintain precise exception semantics.
  6. In HJ-OpenCL We don’t need to prepare
  7. Because it does not check exceptions at runtime.