Keynote at The Fourth International Symposium on Computing and
Networking (CANDAR’16)
Kazuaki Ishizaki
IBM Research – Tokyo
Transparent GPU Exploitation for Java
My Research History
 1992-1995 Static compiler for High Performance Fortran
 1996-now Just-in-time compiler for IBM Developers Kit for
Java
– 1996-2000 Benchmark and GUI applications
– 2000-2010 Web and Enterprise applications
– 2012- Analytics applications
 2014- Java language with GPUs
 2015- Apache Spark (in-memory data processing framework)
with GPUs
2 Transparent GPU Exploitation for Java / Kazuaki Ishizaki
My Research History
 1990-1992 My master thesis with FPGA
– Used XC3000 series with schematic editor
 Verilog and VHDL were just available
 1992-1995 Static compiler for High Performance Fortran
 1996-now Just-in-time compiler for IBM Developers Kit for
Java
What has Happened in HPC from 1995 to 2016
• Programs are becoming simpler
• Hardware is becoming more complicated
 | 1995 | 2016
Hardware | Fast scalar processors | Commodity processors with hardware accelerators
Applications | Weather, wind, fluid, and physics simulations | Machine learning and deep learning with big data
Program | Complicated and hardware-dependent code | Simple and clean code (e.g. MapReduce by Hadoop)
Users | Limited to programmers who are well educated in HPC | Data scientists who are not familiar with hardware
Hardware examples (pictured): PowerPC, GPU
Quiz: Can this program be executed in parallel?
void test(float a[], int idx[], int N) {
for (int i = 0; i < N; i++) {
a[idx[i]] = i;
}
}
Answer: It depends on idx[]
• Can this program be executed in parallel?
void test(float a[], int idx[], int N) {
for (int i = 0; i < N; i++) {
a[idx[i]] = i;
}
}
idx = {0, 1, 2, 3, …}: execute in parallel
idx = {0, 1, 0, 3, …}: execute sequentially
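The dependence on idx[] can be made concrete with a runtime check: the loop is safe to run in parallel only when no two iterations write the same element. The following is a minimal sketch in Java, not code from the talk. The helper name hasDuplicateIndex is hypothetical, and the sketch assumes every index is within the bounds of a[].

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.stream.IntStream;

class IdxCheck {
    // Hypothetical helper: true if any index in idx[0..n) appears twice.
    // Assumes every index is within [0, length).
    static boolean hasDuplicateIndex(int[] idx, int n, int length) {
        BitSet seen = new BitSet(length);
        for (int i = 0; i < n; i++) {
            if (seen.get(idx[i])) return true;
            seen.set(idx[i]);
        }
        return false;
    }

    static void test(float[] a, int[] idx, int n) {
        if (!hasDuplicateIndex(idx, n, a.length)) {
            // Every iteration writes a distinct element: order does not matter.
            IntStream.range(0, n).parallel().forEach(i -> a[idx[i]] = i);
        } else {
            // Two iterations write the same element: keep the sequential
            // order so that the last write wins, as in the original loop.
            for (int i = 0; i < n; i++) a[idx[i]] = i;
        }
    }

    public static void main(String[] args) {
        float[] a = new float[4];
        test(a, new int[] {0, 1, 0, 3}, 4);
        System.out.println(Arrays.toString(a));
    }
}
```

The check itself is a sequential pass over idx[], so it only pays off when the loop body is much more expensive than the check.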
How Can We Know idx[]?
• (Word-based) transactional memory
• Parallelization analysis at
– Compilation time: not easy
– Runtime: requires much time
void test(float a[], int idx[], int N) {
for (int i = 0; i < N; i++) {
a[idx[i]] = i;
}
}
What We Want To Ask Programmer
 Programmer usually knows everything
void test(float a[], int idx[], int N) {
#pragma parallel
for (int i = 0; i < N; i++) {
a[idx[i]] = i;
}
}
idx = {0, 1, 2, 3, …}
What We Do Not Want To Ask Programmer
 What Hardware Will This Program Use?
–CPU?
–GPU?
–FPGA?
–ASIC?
My Recent Interest
• How a system generates hardware accelerator code from a program with high-level abstraction
– Expected (practical) result
  • People execute programs without knowing that a hardware accelerator is used
– Challenge
  • How to optimize code for a certain hardware accelerator without specific information
– On-going research
  • GPU exploitation from Java program
  • GPU exploitation in Apache Spark
Joint work with Akihiro Hayashi *, Alon Shalev Housfater -, Hiroshi Inoue +, Madhusudanan Kandasamy #, Gita Koblents -, Moriyoshi Ohara +, Vivek Sarkar *, and Jan Wroblewski (intern) +
+ IBM Research – Tokyo, - IBM Canada, # IBM India, * Rice University
GPU Exploitation from Java Program
Why Java for GPU Programming?
 High productivity
– Safety and flexibility
– Good program portability among different machines
 “write once, run anywhere”
– One of the most popular programming languages
 Hard to use CUDA and OpenCL for non-expert programmers
 Many computation-intensive applications in non-HPC area
– Data analytics and data science (Hadoop, Spark, etc.)
– Security analysis (events in log files)
– Natural language processing (messages in social network system)
From https://www.flickr.com/photos/dlato/5530553658
CUDA is a parallel computing platform and programming model for GPUs offered by NVIDIA
How We Write GPU Program
• Five steps
1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on cores
4. Copy back data from GPU device memory to CPU main memory
5. Free GPU device memory
[Figure: CPU (dozens of cores/socket; main memory up to 1TB/socket) and GPU (thousands of cores; device memory up to 16GB); data copy over PCIe or NVLink]
How We Optimize GPU Program
• Exploit faster memory
– Read-only cache (Read only)
– Shared memory (SMEM)
• Reduce data copy
[Figure (from GTC presentation by NVIDIA): CPU (dozens of cores/socket; main memory up to 1TB/socket) and GPU (thousands of cores; device memory up to 16GB); data copy over PCIe or NVLink]
Less Code Makes GPU Programming Easy
• Current programming model requires programmers to explicitly write operations for
– managing device memories
– copying data between CPU and GPU
– expressing parallelism
– exploiting faster memory
• Java 8 enables programmers to just focus on
– expressing parallelism
// code for CPU (host)
void fooCUDA(float *A, float *B, int N) {
  float *d_A, *d_B;
  int sizeN = N * sizeof(float);
  cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);
  cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);
  GPU<<<(N + 255) / 256, 256>>>(d_A, d_B, N); // one thread per element
  cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);
  cudaFree(d_B); cudaFree(d_A);
}
// code for GPU
__global__ void GPU(float* d_A, float* d_B, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (N <= i) return;
  d_B[i] = __ldg(&d_A[i]) * 2.0f; // __ldg() for read-only cache
}
// Java 8 version
void fooJava(float A[], float B[], int N) {
  // similar to for (int i = 0; i < N; i++)
  IntStream.range(0, N).parallel().forEach(i -> {
    B[i] = A[i] * 2.0f;
  });
}
Goal
• Build a Java just-in-time (JIT) compiler to generate high performance GPU code from a parallel loop construct
Accomplishments
• Implementing four performance optimizations
• Offering performance evaluations on POWER8 with a GPU
• Supporting Java language features (see [PACT2015])
• Predicting performance on CPU and GPU [PPPJ2015]
• Available in IBM Java 8 ppc64le and x86_64
– https://www.ibm.com/developerworks/java/jdk/java8/
Parallel Programming in Java 8
• Express parallelism among iterations of a lambda expression (index variable: i) by using the parallel stream API
class Par {
  void foo(float[] a, float[] b, float[] c, int n) {
    java.util.stream.IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = a[i] * 2.0f;
      c[i] = a[i] * 3.0f;
    });
  }
}
Reference implementation of Java 8 can execute this on multiple CPU threads
i = 0 on thread 0
i = 3 on thread 1
i = 4 on thread 2
i = 1 on thread 3
i = 2 on thread 0
time
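The thread assignment above can be observed directly on a CPU. The sketch below is an illustration rather than code from the talk: it runs the same loop body as Par.foo and records the names of the threads that executed iterations. The class name ParDemo and the returned set are assumptions added for the demonstration, and the number of threads used depends on the machine.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

class ParDemo {
    // Same loop body as Par.foo; additionally records which threads ran iterations.
    static Set<String> foo(float[] a, float[] b, float[] c, int n) {
        Set<String> threads = ConcurrentHashMap.newKeySet();
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
            threads.add(Thread.currentThread().getName());
        });
        return threads;
    }

    public static void main(String[] args) {
        int n = 1 << 16;
        float[] a = new float[n], b = new float[n], c = new float[n];
        for (int i = 0; i < n; i++) a[i] = i;
        System.out.println("threads used: " + foo(a, b, c, n));
    }
}
```

Because each iteration writes only b[i] and c[i], the result does not depend on which thread runs which iteration or in what order.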
Portability among Different Hardware
• A just-in-time compiler in IBM Java 8 runtime generates native instructions
– for a target machine, including GPUs, from Java bytecode
– for GPUs, exploiting device-specific capabilities more easily than OpenCL
> javac Par.java
> java Par
[Figure: javac compiles the Java program (.java) into Java bytecode (.class, .jar); the IBM Java 8 runtime on the target machine executes it with the interpreter and the just-in-time compiler, which generates code for GPU from IntStream.range(0, n).parallel().forEach(i -> { ... })]
Overview of Our JIT Compiler
• Java bytecode sequence is divided into two intermediate representation (IR) parts
– Lambda expression: generate GPU code using NVIDIA tool chain (right hand side)
– Others: generate CPU code using conventional JIT compiler (left hand side)
// Parallel stream code
IntStream.range(0, n).parallel()
.forEach(i -> { ...c[i] = a[i]...});
[Figure: in the conventional Java JIT compiler, parallel stream API detection splits the Java bytecode into IR for CPUs and IR for GPUs (... c[i] = a[i] ...). Left: the CPU native code generator emits the CPU binary for managing device memory, copying data, and launching the GPU binary. Right (additional modules for GPU): GPU optimizations and the GPU native code generator (by NVIDIA) emit the NVIDIA GPU binary for the lambda expression]
Optimizations for GPU in Our JIT Compiler
 Optimizing alignment of Java arrays on GPUs
– Reduce # of memory transactions to a GPU global memory
 Using read-only cache
– Reduce # of memory transactions to a GPU global memory
 Optimizing data copy between CPU and GPU
– Reduce amount of data copy
 Eliminating redundant exception checks
– Reduce # of instructions in GPU binary
Reducing # of memory transactions to GPU global memory
 Aligning the starting address of an array body in GPU global
memory with memory transaction boundary
IntStream.range(0,n).parallel().
forEach(i->{
...= a[i]...; // a[] : float
...;
});
[Figure: memory addresses 0, 128, 256, 384 mark 128-byte memory transaction boundaries. Naive alignment strategy: the object header starts at a boundary, so a[0]-a[31] straddles a boundary and a[0:31] needs two memory transactions. Our alignment strategy: a[0]-a[31] starts at a boundary, so a[0:31] needs one memory transaction]
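The transaction counts in the figure can be reproduced with a short calculation. The sketch below is an illustration, not the JIT compiler's code: it counts how many 128-byte segments a byte range touches. The 16-byte object header is an assumed value chosen for the example; the actual header size depends on the Java runtime.

```java
class Alignment {
    static final int BOUNDARY = 128; // bytes per memory transaction (from the slide)

    // Number of BOUNDARY-sized segments touched by the byte range [start, start + bytes).
    static long transactions(long start, long bytes) {
        long first = start / BOUNDARY;
        long last = (start + bytes - 1) / BOUNDARY;
        return last - first + 1;
    }

    public static void main(String[] args) {
        long header = 16;         // assumed object header size, for illustration only
        long warpBytes = 32 * 4;  // a[0]..a[31], 4 bytes per float
        // Naive: header at address 0, array body right after it.
        System.out.println("naive:   " + transactions(header, warpBytes));
        // Aligned: array body starts exactly at a 128-byte boundary.
        System.out.println("aligned: " + transactions(BOUNDARY, warpBytes));
    }
}
```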
Using Read-Only Cache
• Prepare two versions of GPU code and execute version 1 if a != b and a != c
1. Use read-only cache for a[i]
2. Use no read-only cache for a[i]
// original
IntStream.range(0,n).parallel().forEach(i->{
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
});
// transformed (pseudocode: a[] != b[] compares array references; ROa[i] reads a[i] through the read-only cache)
void foo(float[] a, float[] b, float[] c, int n) {
  if ((a[] != b[]) && (a[] != c[])) {
    // 1. execute code with a read-only cache
    IntStream.range(0, n).parallel().forEach( i -> {
      b[i] = ROa[i] * 2.0;
      c[i] = ROa[i] * 3.0;
    });
  } else {
    // 2. execute code w/o a read-only cache
  }
}
// Equivalent to CUDA code for version 1
__device__ void foo(float *a, float *b, float *c, int N) {
  int i = ...;
  b[i] = __ldg(&a[i]) * 2.0;
  c[i] = __ldg(&a[i]) * 3.0;
}
Optimizing Data Copy between CPU and GPU
• Eliminate data copy from GPU if an array (e.g. a[]) is not updated in GPU binary [Jablin11][Pai12]
• Copy only a read or write set if an array index form is ‘i + constant’ (the set is contiguous)
sz = (n - 0) * sizeof(float)
cudaMemcpy(d_a, &a[0], sz, H2D); // copy only a read set
cudaMemcpy(d_b, &b[0], sz, H2D);
cudaMemcpy(d_c, &c[0], sz, H2D);
IntStream.range(0, n).parallel().forEach( i -> {
  b[i] = a[i]...;
  c[i] = a[i]...;
});
cudaMemcpy(a, d_a, sz, D2H); // a[] is not updated on GPU, so this copy can be eliminated
cudaMemcpy(&b[0], d_b, sz, D2H); // copy only a write set
cudaMemcpy(&c[0], d_c, sz, D2H); // copy only a write set
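When the index form is ‘i + constant’, the region to copy follows directly from the loop range. The sketch below is a minimal illustration of that arithmetic, not the compiler's implementation. The class and method names are hypothetical, and the sketch assumes the accessed indices are within the array bounds.

```java
class CopyRange {
    // Byte offset and byte size of the contiguous region touched by
    // a[i + c] for i in [lo, hi), where each element takes elemBytes bytes.
    static long[] range(int lo, int hi, int c, int elemBytes) {
        long offset = (long) (lo + c) * elemBytes;
        long size = (long) (hi - lo) * elemBytes;
        return new long[] {offset, size};
    }

    public static void main(String[] args) {
        // a[i] for i in [0, 1000) on a float array
        long[] r = range(0, 1000, 0, 4);
        System.out.println("offset=" + r[0] + " bytes, size=" + r[1] + " bytes");
    }
}
```

Only this region needs to be transferred, instead of the whole array, which matters when n is much smaller than the array length.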
Optimizing Data Copy between CPU and GPU
• Eliminate data copy between CPU and GPU [Pai12]
– if an array (e.g., a[] and b[]), which was accessed on GPU, is not accessed on CPU
// Data copy for a[] from CPU to GPU
for (int t = 0; t < T; t++) {
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    b[idx] = a[...];
  });
  // No data copy for b[] between GPU and CPU
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    a[idx] = b[...];
  });
  // No data copy for a[] between GPU and CPU
}
// Data copy for a[] and b[] from GPU to CPU
Eliminating Redundant Exception Checks
• Generate GPU code without exception checks by using
– loop versioning [Artigas00], which guarantees a safe region by using precondition checks on CPU
if (
  // check cond. for NullPointerException
  a != null && b != null && c != null &&
  // check cond. for ArrayIndexOutOfBoundsException
  n <= a.length && n <= b.length && n <= c.length) {
  ...
  <<<...>>> GPUbinary(...)
  ...
} else {
  // execute this construct on CPU
  // to produce an exception
  // under the original exception semantics
}
IntStream.range(0,n).parallel().
forEach(i->{
b[i] = a[i]...;
c[i] = a[i]...;
});
GPU binary for {
// safe region:
// no exception
// check is required
i = ...;
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
}
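Loop versioning can be sketched in plain Java, with a CPU parallel stream standing in for the GPU binary. This is an illustration under stated assumptions, not the JIT compiler's output: the class name and the returned strings are hypothetical, and the fallback uses a sequential loop so that the first failing iteration raises the exception.

```java
import java.util.stream.IntStream;

class LoopVersioning {
    static String foo(float[] a, float[] b, float[] c, int n) {
        if (a != null && b != null && c != null
                && n <= a.length && n <= b.length && n <= c.length) {
            // Safe region: the precondition guarantees that no iteration can
            // throw, so this version could be compiled without checks.
            IntStream.range(0, n).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f;
                c[i] = a[i] * 3.0f;
            });
            return "fast";
        }
        // Fallback: run the checked loop so that NullPointerException or
        // ArrayIndexOutOfBoundsException is raised as in the original program.
        for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        }
        return "fallback";
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        System.out.println(foo(a, new float[3], new float[3], 3));
    }
}
```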
Automatically Optimized for CPU and GPU
• CPU code
– handles GPU device memory management and data copying
– checks whether optimized CPU and GPU code can be executed
• GPU code is optimized
– Using read-only cache
– Eliminating exception checks
// CPU
if (a != null && b != null && c != null &&
    n <= a.length && n <= b.length && n <= c.length &&
    (a[] != b[]) && (a[] != c[])) {
  cudaMalloc(&d_a, a.length*sizeof(float)+128);
  if (b!=a) cudaMalloc(&d_b, b.length*sizeof(float)+128);
  if (c!=a && c!=b) cudaMalloc(&d_c, c.length*sizeof(float)+128);
  int sz = (n - 0) * sizeof(float), szh = sz + Jhdrsz;
  cudaMemcpy(d_a + align - Jhdrsz, a, szh, H2D);
  <<<...>>> GPU(d_a, d_b, d_c, n) // launch GPU
  cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H);
  cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H);
  cudaFree(d_a);
  if (b!=a) cudaFree(d_b);
  if (c!=a && c!=b) cudaFree(d_c);
} else {
  // execute CPU binary
}
// GPU
__global__ void GPU(float *a, float *b, float *c, int n)
{
  // no exception checks
  i = ...
  b[i] = ROa[i] * 2.0;
  c[i] = ROa[i] * 3.0;
}
Benchmark Programs
• Prepare sequential and parallel stream API versions in Java
Name | Summary | Data size | Type
Blackscholes | Financial application that calculates the price of put and call options | 4,194,304 virtual options | double
MM | A standard dense matrix multiplication: C = A.B | 1,024 x 1,024 | double
Crypt | Cryptographic application [Java Grande Benchmarks] | N = 50,000,000 | byte
Series | The first N Fourier coefficients of the function [Java Grande Benchmarks] | N = 1,000,000 | double
SpMM | Sparse matrix multiplication [Java Grande Benchmarks] | N = 500,000 | double
MRIQ | 3D image benchmark for MRI [Parboil benchmarks] | 64x64x64 | float
Gemm | Matrix multiplication: C = α.A.B + β.C [PolyBench] | 1,024 x 1,024 | int
Gesummv | Scalar, vector, and matrix multiplication [PolyBench] | 4,096 x 4,096 | int
Performance Improvements of GPU Version
Over Sequential and Parallel CPU Versions
 Achieve 127.9x on geomean and 2067.7x for Series over 1 CPU thread
 Achieve 3.3x on geomean and 32.8x for Series over 160 CPU threads
 Degrade performance for SpMM and Gesummv against 160 CPU threads
Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory
with one NVIDIA Kepler K40m GPU at 876 MHz with 12-GB global memory (ECC off)
Ubuntu 14.10, CUDA 5.5
Modified IBM Java 8 runtime for PowerPC
Performance Impact of Each Optimization
• MM: LV/DC/ALIGN/ROC are very effective
• BlackScholes: DC is effective
• MRIQ: LV/ALIGN/ROC are effective
• SpMM and Gesummv: data transfer time for large arrays is dominant
Apply optimizations cumulatively
• BASE: our four optimizations disabled
• LV: loop versioning
• DC: data copy
• ALIGN: alignment optimization
• ROC: read-only cache
[Figure: breakdown of the execution time]
Performance Comparison with Hand-Coded CUDA
• Achieve 0.83x on geomean over CUDA
• Crypt, Gemm, and Gesummv: usage of a read-only cache
• BlackScholes: usage of larger CUDA threads per block (1024 vs. 128)
• SpMM: overhead of exception checks
• MRIQ: lack of the ‘-use_fast_math’ compile option
• MM: lack of usage of shared memory with loop tiling
[Figure: speedup relative to CUDA (higher is better): BlackScholes 0.85, MM 0.45, Crypt 1.51, Series 0.92, SpMM 0.74, MRIQ 0.11, Gemm 1.19, Gesummv 3.47]
GPU Version is Slower than Parallel CPU Version
 Can we choose an appropriate device (CPU or GPU) to avoid
performance degradation?
– Want to make sure to achieve equal or better performance
Machine-learning-based Performance Heuristics
 Construct a binary prediction model offline by supervised
machine learning with support vector machines (SVMs)
– Features
 Loop range
 Dynamic number of instructions (memory access, arithmetic operation, …)
 Dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]])
 Data transfer size (CPU to GPU, GPU to CPU)
[Figure: for each application and data set (App A with data1 and data2, App B with data3), features are extracted from the bytecode; LIBSVM builds the prediction model from these features offline, and the Java runtime uses the model to select CPU or GPU]
Most Predictions are Correct
Use 291 cases to build model
• Succeeded in predicting cases of performance degradations on GPU
• Failed to predict BlackScholes
[Figure: prediction result for each benchmark; annotated changes: 1.8->1.0, 0.8->1.0, 0.4->1.0]
Related Work
 Our research enables memory and communication
optimizations with machine-learning-based device selection
Work | Language | Exception support | JIT compiler | How to write GPU kernel | Data copy optimization | GPU memory optimization | Device selection
JCUDA | Java | × | × | CUDA | Manual | Manual | GPU only
JaBEE | Java | × | √ | Override run method | × | × | GPU only
Aparapi | Java | × | √ | Override run method/Lambda | × | × | Static
Hadoop-CL | Java | × | √ | Override map/reduce method | × | × | Static
Rootbeer | Java | × | √ | Override run method | Not described | × | Not described
[PPPJ09] | Java | √ | √ | Java for-loop | Not described | × | Dynamic with regression
HJ-OpenCL | Habanero-Java | √ | √ | Forall constructs | √ | × | Static
Our work | Java | √ | √ | Standard parallel stream API | √ | ROCache / alignment | Dynamic with machine learning
Future Work
 Exploiting shared memory
– Not easy to predict performance
– Require non-lightweight analysis for identifying reuse
 Supporting additional Java operations
GPU Exploitation in Apache Spark
What is Apache Spark?
• Framework for distributed computing that transforms distributed immutable in-memory structures using a set of parallel operations
  • e.g. map(), filter(), reduce(), …
– Distributed immutable in-memory structures
  • RDD (Resilient Distributed Dataset), DataFrame, Dataset
– Scala is the primary language for programming on Spark
• Provide domain specific libraries
[Figure: the domain specific libraries Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), and MLlib (machine learning) run on the Spark Runtime (written in Java and Scala), which runs on the Java Virtual Machine]
[Figure: the Driver sends tasks to Executors and receives results; each Executor holds Data loaded from a Data Source (HDFS, DB, File, etc.)]
Open source: http://spark.apache.org/
Latest version is 2.0.2 released in 2016/11
How Program Works on Apache Spark
• Parallel operations can be executed among partitions
• In a partition, data can be processed sequentially
case class Pt(x: Int, y: Int)
val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
val ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2))
val cnt: Int = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
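[Figure: ds1 has two partitions, {Pt(1,5), Pt(2,6)} and {Pt(3,7), Pt(4,8)}. map (p.x+1, p.y*2) produces the ds2 partitions {Pt(2,10), Pt(3,12)} and {Pt(4,14), Pt(5,16)}. reduce adds the x values within each partition (2 + 3 = 5 and 4 + 5 = 9) and then across partitions: cnt = 14]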
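The execution model above (parallel among partitions, sequential within a partition) can be imitated with plain Java streams. The sketch below is an analogy only and does not use the Spark API: a list of lists stands in for a partitioned Dataset, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class PartitionDemo {
    static final class Pt {
        final int x, y;
        Pt(int x, int y) { this.x = x; this.y = y; }
    }

    // Partitions are processed in parallel; elements inside a partition
    // are processed sequentially.
    static int run(List<List<Pt>> ds1) {
        return ds1.parallelStream()
                .mapToInt(partition -> {
                    int partial = 0;
                    for (Pt p : partition) {
                        Pt q = new Pt(p.x + 1, p.y * 2); // map
                        partial += q.x;                  // reduce within the partition
                    }
                    return partial;
                })
                .sum(); // combine the partial results of all partitions
    }

    // The four points of the slide, split into two partitions.
    static int demo() {
        List<List<Pt>> ds1 = Arrays.asList(
                Arrays.asList(new Pt(1, 5), new Pt(2, 6)),
                Arrays.asList(new Pt(3, 7), new Pt(4, 8)));
        return run(ds1);
    }

    public static void main(String[] args) {
        System.out.println("cnt = " + demo());
    }
}
```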
How We Can Run Program Faster on GPU
• Assign many parallel computations into cores
• Make memory accesses coalesced
– Column-oriented layout results in better performance
  • [Che2011] reports about 3x performance improvement of GPU kernel execution of kmeans with column-oriented layout over row-oriented layout
[Figure: for Pt(x: Int, y: Int) with Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8). Column-oriented layout stores x1 x2 x3 x4 y1 y2 y3 y4, so the cores load four Pt.x and then four Pt.y with 2 memory accesses to GPU device memory. Row-oriented layout stores x1 y1 x2 y2 x3 y3 x4 y4, so loading Pt.x and Pt.y takes 4 memory accesses. Assumption: 4 consecutive data elements can be coalesced using GPU hardware]
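The "2 vs. 4" count can be checked with a small model. The sketch below is an illustration, not a model of real GPU hardware: it follows the slide's assumption that 4 consecutive elements are served by one access, and counts the distinct 4-element segments touched when each of n threads loads its Pt.x and then its Pt.y. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class Coalescing {
    static final int WIDTH = 4; // consecutive elements served by one access (slide's assumption)

    // Number of accesses needed when the threads load the given element indices together.
    static int accesses(int[] indices) {
        Set<Integer> segments = new HashSet<>();
        for (int idx : indices) segments.add(idx / WIDTH);
        return segments.size();
    }

    // Row-oriented: x1 y1 x2 y2 ... so thread t reads x at 2t and y at 2t + 1.
    static int rowOriented(int n) {
        int[] xs = new int[n], ys = new int[n];
        for (int t = 0; t < n; t++) { xs[t] = 2 * t; ys[t] = 2 * t + 1; }
        return accesses(xs) + accesses(ys);
    }

    // Column-oriented: x1 ... xn y1 ... yn so thread t reads x at t and y at n + t.
    static int columnOriented(int n) {
        int[] xs = new int[n], ys = new int[n];
        for (int t = 0; t < n; t++) { xs[t] = t; ys[t] = n + t; }
        return accesses(xs) + accesses(ys);
    }

    public static void main(String[] args) {
        System.out.println("row-oriented:    " + rowOriented(4));
        System.out.println("column-oriented: " + columnOriented(4));
    }
}
```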
Idea to Transparently Exploit GPUs on Apache Spark
• Generate GPU code from a set of parallel operations
– Already done in our other research
• Physically put distributed immutable in-memory structures (e.g. Dataset) in column-oriented representation
– Dataset is statically typed, but physical layout is not specified in program
Overview of GPU Exploitation on Apache Spark
User’s Spark Program
case class Pt(x: Int, y: Int)
ds1 = sc.parallelize(Seq(
  Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2)
  .toDS
ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
cnt = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
[Figure: the partitions of ds1 are held in column-oriented form on the CPU (x: 1 2 3 4; y: 5 6 7 8) and transferred to the GPU; the GPU kernel (native code) computes x+1 and y*2; the results (x: 2 3 4 5; y: 10 12 14 16) are transferred back as the partitions of ds2]
Overview of GPU Exploitation on Apache Spark
• Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on GPU
• Transparent
– Map parallelism in program into GPU native code
• GPU can exploit parallelism both among partitions in Dataset and within a partition of Dataset
[Figure: the same user's Spark program and data flow between ds1, the GPU kernel, and ds2, with these added labels: Drive GPU native code, GPU manager, Columnar storage (x and y columns laid out along the memory address)]
Exploit Parallelism Between GPU Kernels
• Overlap data transfers and computations among different GPU kernels on a GPU
[Figure: timeline for three Spark workers for GPU; each worker performs data transfer CPU to GPU, a sequence of GPU kernels, and data transfer GPU to CPU, and these overlap over time across workers]
How We Write Program And What is Executed
• Write a program using a relational operation for DataFrame or a lambda expression for Dataset.
• Catalyst performs optimization and code generation for the program.
• The corresponding Java bytecode for the generated Java code is executed.
data = Seq(Pt(1, 5),Pt(2, 6))
// DataFrame (v1.3-)
df1 = data.toDF(…)
df2 = df1.selectExpr("x+1")
df2.agg(sum())
// Dataset (v1.6-)
ds1 = data.toDS()
ds2 = ds1.map(p => p.x+1)
ds2.reduce((a,b) => a+b)
[Figure: frontend API (DataFrame, Dataset) -> Catalyst -> Java code -> Catalyst-generated Java bytecode (backend computation), operating on row-oriented data in the Java heap]
“Catalyst” is a code-name for optimizer and code generator in Apache Spark
How Program is Executed on GPU
• For DataFrame and Dataset, enhanced Catalyst generates Java code optimized for GPU.
• A just-in-time compiler in Java virtual machine can generate GPU code.
data = Seq(Pt(1, 5),Pt(2, 6))
// DataFrame (v1.3-)
df1 = data.toDF(…)
df2 = df1.selectExpr("x+1")
df2.agg(sum())
// Dataset (v1.6-)
ds1 = data.toDS()
ds2 = ds1.map(p => p.x+1)
ds2.reduce((a,b) => a+b)
[Figure: frontend API (DataFrame, Dataset) -> Enhanced Catalyst -> optimized Java code -> automatically generated GPU code (backend computation), operating on column-oriented data in GPU device memory]
Pseudo Java Code by Current Catalyst
• Perform optimization that merges multiple parallel operations (selectExpr() and agg(sum())) into one loop
// DataFrame program for Spark
val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())
// Generated pseudo Java code: corresponds to selectExpr() and local sum()
int sum = 0;
while (rowIterator.hasNext()) { // iterator-based access
  Row row = rowIterator.next(); // for df1
  int x = row.getInteger(0);
  // selectExpr(x + 1)
  int x_new = x + 1; // for df2
  sum += x_new;
}
[Figure: Catalyst reads the row-oriented df1 sequentially: x = -1, 0, 1; x_new = 0, 1, 2; sum = 3]
Pseudo Java Code by Enhanced Catalyst
 Get column0 from column-oriented storage
 For-loop can be executed in a parallel reduction manner
Column column0 = df1.getColumn(0); // df1
int sum = 0;
for (int i = 0; i < column0.numRows; i++) {
int x = column0.getInteger(i);
// selectExpr(x + 1)
int x_new = x + 1; // for df2
sum += x_new;
}
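The index-based loop over a column maps naturally onto a parallel reduction. The sketch below illustrates this on a CPU with the Java 8 stream API; it is not the code generated by the enhanced Catalyst. A plain int[] stands in for the column-oriented storage, and the class and method names are hypothetical.

```java
import java.util.stream.IntStream;

class ColumnSum {
    // Sequential form, as in the generated pseudo Java code.
    static int sequential(int[] column0) {
        int sum = 0;
        for (int i = 0; i < column0.length; i++) {
            int xNew = column0[i] + 1; // selectExpr(x + 1)
            sum += xNew;
        }
        return sum;
    }

    // Same computation as a parallel reduction: each index is independent,
    // and the partial sums are combined by sum().
    static int parallel(int[] column0) {
        return IntStream.range(0, column0.length)
                .parallel()
                .map(i -> column0[i] + 1)
                .sum();
    }

    public static void main(String[] args) {
        int[] column0 = {-1, 0, 1}; // the x column of df1
        System.out.println(sequential(column0) + " " + parallel(column0));
    }
}
```

The iterator-based loop of the current Catalyst has no such index, which is one reason the column-oriented form is easier to parallelize.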
[Figure: Enhanced Catalyst reads the column-oriented df1: x = -1, 0, 1; x_new = 0, 1, 2; sum = 3]
Generate GPU Code Transparently from Spark Program
• Copy column-oriented storage into GPU
• Execute add and reduction in one GPU kernel
val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())
// Generated CPU code
Column column0 = df1.getColumn(0);
int nRows = column0.numRows;
cudaMalloc(&d_c0, nRows*4);
cudaMemcpy(d_c0, column0, nRows*4, H2D);
int sum = 0;
cudaMalloc(&d_sum, 4);
cudaMemcpy(d_sum, &sum, 4, H2D);
<<<...>>> GPU(d_c0, d_sum, nRows) // launch GPU
cudaMemcpy(&sum, d_sum, 4, D2H);
cudaFree(d_sum); cudaFree(d_c0);
// GPU code
__global__ void GPU(
  int *d_c0, int *d_sum, long size) {
  long ix = … // 0, 1, 2
  if (size <= ix) return;
  int x = d_c0[ix];
  int x_new = x + 1;
  reduction(d_sum, x_new);
}
[Figure: the GPU kernel executes in parallel over d_c0: x = -1, 0, 1; x_new = 0, 1, 2; d_sum = 3]
Related Work
• Spark With Accelerated Tasks [Grossman2016]
– Generate GPU code from lambda function in map() in RDD
– Very similar to enhanced Catalyst using columnar storage to transparently exploit GPUs. However, it works for RDD with map()
val inputRDD = cl(sc.objectFile[Int]( hdfsPath ))
val doubledRDD = inputRDD.map(i => 2 * i)
• GPU Columnar (proposed by Kiran Lonikar)
– Generate GPU code from program using select() method in DataFrame
– Very similar to enhanced Catalyst using columnar storage to transparently exploit GPUs
Takeaway
• How a system generates hardware accelerator code from a program with high-level abstraction
– Most programmers are not ninja programmers
– Compiler can transform program for hardware features, but does not want to do trial and error at runtime
– How can compiler and hardware build a good relationship?
• (Not talked about today) What can we do for deep learning?
– Current deep learning frameworks use GPU by calling libraries (e.g. cuDNN/cuRNN by NVIDIA)
– How will system support rapid evolution in deep learning?
  • New neural network structures are still being proposed

More Related Content

What's hot

PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Looking back at Spark 2.x and forward to 3.0
Looking back at Spark 2.x and forward to 3.0Looking back at Spark 2.x and forward to 3.0
Looking back at Spark 2.x and forward to 3.0Kazuaki Ishizaki
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU deviceKohei KaiGai
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~Kohei KaiGai
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Ibis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache SparkIbis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache SparkDatabricks
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_PlaceKohei KaiGai
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)Kohei KaiGai
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsJack (Jaegeun) Han
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStromKohei KaiGai
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningAMD Developer Central
 
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA  by Ben Sanders, AMDBolt C++ Standard Template Libary for HSA  by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMDHSA Foundation
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)Kohei KaiGai
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 

Transparent GPU Exploitation for Java

  • 1. Transparent GPU Exploitation for Java
      Keynote at The Fourth International Symposium on Computing and Networking (CANDAR’16)
      Kazuaki Ishizaki, IBM Research – Tokyo
  • 2. My Research History
      - 1992-1995: Static compiler for High Performance Fortran
      - 1996-now: Just-in-time compiler for IBM Developers Kit for Java
        - 1996-2000: Benchmark and GUI applications
        - 2000-2010: Web and enterprise applications
        - 2012-: Analytics applications
      - 2014-: Java language with GPUs
      - 2015-: Apache Spark (in-memory data processing framework) with GPUs
  • 3. My Research History
      - 1990-1992: My master's thesis with FPGA
        - Used XC3000 series with schematic editor
          - Verilog and VHDL had only just become available
      - 1992-1995: Static compiler for High Performance Fortran
      - 1996-now: Just-in-time compiler for IBM Developers Kit for Java
  • 4. What Has Happened in HPC from 1995 to 2016
      - Programs are becoming simpler
      - Hardware is becoming more complicated

      Hardware
        1995: Fast scalar processors
        2016: Commodity processors with hardware accelerators
      Applications
        1995: Weather, wind, fluid, and physics simulations
        2016: Machine learning and deep learning with big data
      Program
        1995: Complicated and hardware-dependent code
        2016: Simple and clean code (e.g. MapReduce on Hadoop)
      Users
        1995: Limited to programmers who are well educated in HPC
        2016: Data scientists who are unfamiliar with hardware
      Hardware example
        1995: PowerPC
        2016: GPU
  • 5. Quiz: Can this program be executed in parallel?
      void test(float a[], int idx[], int N) {
        for (int i = 0; i < N; i++) {
          a[idx[i]] = i;
        }
      }
  • 6. Answer: It depends on idx[]
      Can this program be executed in parallel?
      void test(float a[], int idx[], int N) {
        for (int i = 0; i < N; i++) {
          a[idx[i]] = i;
        }
      }
      - idx = {0, 1, 2, 3, …}: no two iterations write the same element, so the loop can execute in parallel
      - idx = {0, 1, 0, 3, …}: iterations 0 and 2 both write a[0], so the loop must execute sequentially
  • 7. How Can We Know idx[]?
      - (Word-based) transactional memory
      - Parallelization analysis
        - At compilation time: not easy
        - At runtime: requires much time
      void test(float a[], int idx[], int N) {
        for (int i = 0; i < N; i++) {
          a[idx[i]] = i;
        }
      }
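To make the runtime option concrete, here is a minimal Java sketch of a runtime parallelization check for the quiz loop. It is an illustration only, not the analysis used in the talk: the class name `IdxCheck` and the helper `hasDistinctIndices` are hypothetical, and the check simply scans the first `n` entries of `idx` for duplicates before choosing between a parallel stream and a sequential loop. The extra pass over `idx` is the kind of runtime cost the slide refers to.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.stream.IntStream;

// Hypothetical helper class for illustration; not part of any library.
final class IdxCheck {
    // Returns true when the first n entries of idx are all distinct,
    // i.e. no two loop iterations write the same element of a[].
    static boolean hasDistinctIndices(int[] idx, int n, int targetLength) {
        BitSet seen = new BitSet(targetLength);
        for (int i = 0; i < n; i++) {
            if (seen.get(idx[i])) {
                return false;   // two iterations would write a[idx[i]]
            }
            seen.set(idx[i]);
        }
        return true;
    }

    // Same effect as the quiz loop, but picks the execution mode at runtime.
    static void test(float[] a, int[] idx, int n) {
        if (hasDistinctIndices(idx, n, a.length)) {
            // Independent writes: iteration order does not matter.
            IntStream.range(0, n).parallel().forEach(i -> a[idx[i]] = i);
        } else {
            // Conflicting writes: keep the original sequential order.
            for (int i = 0; i < n; i++) {
                a[idx[i]] = i;
            }
        }
    }

    public static void main(String[] args) {
        float[] a = new float[4];
        test(a, new int[] {0, 1, 0, 3}, 4);
        // Sequential semantics: the later write to a[0] (i = 2) wins.
        System.out.println(Arrays.toString(a)); // prints [2.0, 1.0, 0.0, 3.0]
    }
}
```

With `idx = {0, 1, 2, 3}` the same call takes the parallel path and produces `{0, 1, 2, 3}` regardless of thread scheduling.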
  • 8. What We Want To Ask the Programmer
      - The programmer usually knows everything, for example that idx = {0, 1, 2, 3, …}
      void test(float a[], int idx[], int N) {
        #pragma parallel
        for (int i = 0; i < N; i++) {
          a[idx[i]] = i;
        }
      }
  • 9. What We Do Not Want To Ask the Programmer
      - What hardware will this program use?
        - CPU?
        - GPU?
        - FPGA?
        - ASIC?
  • 10. My Recent Interest
      - How a system generates hardware accelerator code from a program written with high-level abstractions
        - Expected (practical) result
          - People execute a program without knowing that a hardware accelerator is used
        - Challenge
          - How to optimize code for a particular hardware accelerator without accelerator-specific information from the programmer
        - On-going research
          - GPU exploitation from Java programs
          - GPU exploitation in Apache Spark
      Joint work with Akihiro Hayashi (Rice University), Alon Shalev Housfater (IBM Canada), Hiroshi Inoue (IBM Research – Tokyo), Madhusudanan Kandasamy (IBM India), Gita Koblents (IBM Canada), Moriyoshi Ohara (IBM Research – Tokyo), Vivek Sarkar (Rice University), and Jan Wroblewski (intern, IBM Research – Tokyo)
  • 11. GPU Exploitation from Java Program
  • 12. Why Java for GPU Programming?
      - High productivity
        - Safety and flexibility
        - Good program portability among different machines: “write once, run anywhere”
        - One of the most popular programming languages
      - CUDA and OpenCL are hard to use for non-expert programmers
        (CUDA is a parallel computing platform and programming model for GPUs offered by NVIDIA)
      - Many computation-intensive applications in non-HPC areas
        - Data analytics and data science (Hadoop, Spark, etc.)
        - Security analysis (events in log files)
        - Natural language processing (messages in social network systems)
      Image from https://www.flickr.com/photos/dlato/5530553658
  • 13. How We Write GPU Program
      Five steps:
      1. Allocate GPU device memory
      2. Copy data on CPU main memory to GPU device memory
      3. Launch a GPU kernel to be executed in parallel on cores
      4. Copy back data on GPU device memory to CPU main memory
      5. Free GPU device memory
      Diagram: a CPU (a dozen cores per socket) with main memory (up to 1TB/socket) and a GPU (thousands of cores) with device memory (up to 16GB); data is copied between them over PCIe or NVLink.
  • 14. How We Optimize GPU Program
      - Reduce data copy (over PCIe or NVLink)
      - Exploit faster memory (figure from a GTC presentation by NVIDIA)
        - Read-only cache
        - Shared memory (SMEM)
      Five steps:
      1. Allocate GPU device memory
      2. Copy data on CPU main memory to GPU device memory
      3. Launch a GPU kernel to be executed in parallel on cores
      4. Copy back data on GPU device memory to CPU main memory
      5. Free GPU device memory
      Diagram: a CPU (a dozen cores per socket) with main memory (up to 1TB/socket) and a GPU (thousands of cores) with device memory (up to 16GB).
  • 15. Less Code Makes GPU Programming Easier
      - The current programming model requires programmers to explicitly write operations for
        - managing device memory
        - copying data between CPU and GPU
        - expressing parallelism
        - exploiting faster memory
      - Java 8 enables programmers to focus only on
        - expressing parallelism

      CUDA version:
      // code for GPU
      __global__ void GPU(float* d_A, float* d_B, int N) {
        // global index, valid for any grid/block shape (here: N blocks of 1 thread)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (N <= i) return;
        d_B[i] = __ldg(&d_A[i]) * 2.0f; // __ldg() for read-only cache
      }
      void fooCUDA(float *A, float *B, int N) {
        float *d_A, *d_B;
        int sizeN = N * sizeof(float);
        cudaMalloc(&d_A, sizeN);
        cudaMalloc(&d_B, sizeN);
        cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);
        GPU<<<N, 1>>>(d_A, d_B, N);
        cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);
        cudaFree(d_B);
        cudaFree(d_A);
      }

      Java 8 version:
      void fooJava(float A[], float B[], int N) {
        // similar to for (int i = 0; i < N; i++)
        IntStream.range(0, N).parallel().forEach(i -> {
          B[i] = A[i] * 2.0f;
        });
      }
  • 16. Goal: build a Java just-in-time (JIT) compiler that generates high-performance GPU code from a parallel loop construct. Accomplishments: implementing four performance optimizations; offering performance evaluations on POWER8 with a GPU; supporting Java language features (see [PACT2015]); predicting performance on CPU and GPU [PPPJ2015]; available in IBM Java 8 for ppc64le and x86_64 (https://www.ibm.com/developerworks/java/jdk/java8/).
  • 17. Parallel Programming in Java 8. Express parallelism among iterations of a lambda expression (index variable: i) by using the parallel stream API: class Par { void foo(float[] a, float[] b, float[] c, int n) { java.util.stream.IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; c[i] = a[i] * 3.0f; }); } } The reference implementation of Java 8 can execute this on multiple CPU threads, with iterations scheduled in no fixed order over time (for example, i = 0 on thread 0, i = 3 on thread 1, i = 4 on thread 2, i = 1 on thread 3, then i = 2 on thread 0).
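The parallel stream loop on this slide can be tried on any standard Java 8+ runtime. The following is a minimal runnable sketch; the class name `ParSketch` and the sample data are illustrative and not from the talk, and on a stock JVM the loop runs on CPU threads only.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class ParSketch {
    // Each iteration writes only b[i] and c[i], so iterations are independent
    // and may run in any order on any thread.
    static void foo(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = new float[a.length];
        float[] c = new float[a.length];
        foo(a, b, c, a.length);
        System.out.println(Arrays.toString(b));
        System.out.println(Arrays.toString(c));
    }
}
```

Note the `2.0f` literal: multiplying a `float` by the `double` literal `2.0` and storing the result in a `float[]` does not compile in Java without a cast.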
  • 18. Portability among Different Hardware. A just-in-time compiler in the IBM Java 8 runtime generates native instructions from Java bytecode for a target machine, including GPUs; the generated GPU code can exploit device-specific capabilities more easily than OpenCL. [Figure: a Java program (.java) is compiled with "> javac Par.java" into Java bytecode (.class, .jar); "> java Par" runs it on the IBM Java 8 runtime, whose interpreter and just-in-time compiler target the machine; IntStream.range(0, n).parallel().forEach(i -> { ... }); is compiled for GPU.]
  • 19. Overview of Our JIT Compiler. The Java bytecode sequence is divided into two intermediate representation (IR) parts. Lambda expression: generate GPU code using the NVIDIA tool chain (right-hand side). Others: generate CPU code using the conventional JIT compiler (left-hand side). [Figure: Java bytecode for parallel stream code such as IntStream.range(0, n).parallel().forEach(i -> { ...c[i] = a[i]... }); enters the conventional Java JIT compiler, where parallel stream API detection splits it into IR for CPUs and IR for GPUs (... c[i] = a[i] ...). The IR for CPUs goes to the CPU native code generator, which emits a CPU binary for managing device memory, copying data, and launching the GPU binary. The IR for GPUs goes through the additional modules for GPU (GPU optimizations) to the GPU native code generator (by NVIDIA), which emits an NVIDIA GPU binary for the lambda expression.]
  • 20. Optimizations for GPU in Our JIT Compiler. 1. Optimizing alignment of Java arrays on GPUs: reduces the number of memory transactions to GPU global memory. 2. Using the read-only cache: reduces the number of memory transactions to GPU global memory. 3. Optimizing data copy between CPU and GPU: reduces the amount of data copied. 4. Eliminating redundant exception checks: reduces the number of instructions in the GPU binary.
  • 21. Reducing # of Memory Transactions to GPU Global Memory. Align the starting address of an array body in GPU global memory with a 128-byte memory transaction boundary. Example: IntStream.range(0,n).parallel().forEach(i->{ ...= a[i]...; ...; }); /* a[] : float */ [Figure: memory addresses 0, 128, 256, 384. Naive alignment strategy: the object header sits at the boundary, so a[0]-a[31] straddles a boundary and a[0:31] needs two memory transactions. Our alignment strategy: the array body starts at the boundary, so a[0:31] needs one memory transaction; a[32]-a[63] and a[64]-a[95] follow in the next 128-byte segments.]
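The transaction counts on this slide follow from simple address arithmetic. The sketch below counts how many 128-byte segments a run of consecutive elements touches; the class name, the helper names, and the 16-byte object header used in the usage note are assumptions for illustration, not values from the talk.

```java
final class AlignmentSketch {
    static final int TRANSACTION_BYTES = 128;

    // Number of 128-byte segments touched by `count` consecutive elements
    // of `elemBytes` bytes each, starting at byte address `startAddress`.
    static long transactions(long startAddress, int count, int elemBytes) {
        long firstSegment = startAddress / TRANSACTION_BYTES;
        long lastByte = startAddress + (long) count * elemBytes - 1;
        long lastSegment = lastByte / TRANSACTION_BYTES;
        return lastSegment - firstSegment + 1;
    }

    // Smallest multiple of 128 that is greater than or equal to `address`.
    static long alignUp(long address) {
        return (address + TRANSACTION_BYTES - 1) / TRANSACTION_BYTES * TRANSACTION_BYTES;
    }
}
```

With a hypothetical 16-byte header at address 0, `transactions(16, 32, 4)` gives 2 for a[0:31], while placing the body at `alignUp(16)`, which is 128, gives `transactions(128, 32, 4)` equal to 1.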
  • 22. Using Read-Only Cache. Prepare two versions of GPU code and execute version 1 if a != b and a != c, and version 2 otherwise: 1. Use the read-only cache for a[i]. 2. Do not use the read-only cache for a[i]. /* original */ IntStream.range(0,n).parallel().forEach(i->{ b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; }); /* versioned code; ROa[i] denotes a load of a[i] through the read-only cache */ void foo(float[] a, float[] b, float[] c, int n) { if ((a[] != b[]) && (a[] != c[])) { /* 1. */ IntStream.range(0, n).parallel().forEach( i -> { b[i] = ROa[i] * 2.0; c[i] = ROa[i] * 3.0; }); } else { /* 2. execute code without the read-only cache */ } } /* equivalent CUDA code */ __device__ foo(*a, *b, *c, N) { b[i] = __ldg(&a[i]) * 2.0; c[i] = __ldg(&a[i]) * 3.0; }
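The version selection on this slide depends only on whether the read array aliases a written array, which in Java is a reference comparison. A minimal sketch of that check follows; the enum and method names are hypothetical and the actual JIT-generated test is not shown in the talk.

```java
final class ReadOnlyCacheVersioning {
    enum Version { READ_ONLY_CACHE, PLAIN }

    // a[] may be loaded through the read-only cache only if no store in the
    // loop can modify it, i.e. a is not the same array object as b or c.
    static Version select(float[] a, float[] b, float[] c) {
        return (a != b && a != c) ? Version.READ_ONLY_CACHE : Version.PLAIN;
    }
}
```

Two distinct Java arrays can never partially overlap, so comparing references is enough to rule out aliasing here.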
  • 23. Optimizing Data Copy between CPU and GPU. Eliminate the data copy from GPU if an array (e.g., a[]) is not updated in the GPU binary [Jablin11][Pai12]. Copy only the read or write set if the array index has the form 'i + constant' (the set is contiguous). sz = (n - 0) * sizeof(float); cudaMemcpy(d_a, &a[0], sz, H2D); /* copy only the read set */ /* eliminated: cudaMemcpy(d_b, &b[0], sz, H2D); b[] is not read on GPU */ /* eliminated: cudaMemcpy(d_c, &c[0], sz, H2D); c[] is not read on GPU */ IntStream.range(0, n).parallel().forEach( i -> { b[i] = a[i]...; c[i] = a[i]...; }); /* eliminated: cudaMemcpy(a, d_a, sz, D2H); a[] is not updated on GPU */ cudaMemcpy(&b[0], d_b, sz, D2H); /* copy only the write set */ cudaMemcpy(&c[0], d_c, sz, D2H); /* copy only the write set */
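When every access has the form a[i + constant] and i runs over a known range, the touched elements form one contiguous block, so the copy size can be computed before the kernel launch. The sketch below shows that arithmetic; `CopyRangeSketch`, `ByteRange`, and `accessedRange` are hypothetical names, and the real compiler analysis is more involved than this.

```java
final class CopyRangeSketch {
    // Byte offset and byte length of the contiguous region to copy.
    static final class ByteRange {
        final long offset;
        final long length;

        ByteRange(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Region touched by a[i + constant] for i in [lo, hi),
    // with elements of elemBytes bytes each.
    static ByteRange accessedRange(int lo, int hi, int constant, int elemBytes) {
        long firstIndex = (long) lo + constant;
        long count = Math.max(0L, (long) hi - lo);
        return new ByteRange(firstIndex * elemBytes, count * elemBytes);
    }
}
```

For the loop on the slide (i in [0, n), index i, float elements) this gives offset 0 and length n * 4 bytes, which matches sz = (n - 0) * sizeof(float).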
  • 24. Optimizing Data Copy between CPU and GPU. Eliminate data copy between CPU and GPU [Pai12] if an array (e.g., a[] and b[]) that was accessed on GPU is not accessed on CPU. /* Data copy for a[] from CPU to GPU */ for (int t = 0; t < T; t++) { IntStream.range(0, N*N).parallel().forEach(idx -> { b[idx] = a[...]; }); /* No data copy for b[] between GPU and CPU */ IntStream.range(0, N*N).parallel().forEach(idx -> { a[idx] = b[...]; }); /* No data copy for a[] between GPU and CPU */ } /* Data copy for a[] and b[] from GPU to CPU */
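One way to picture this optimization is to track, per array, where the valid copy currently lives and to copy only when the other side touches it. The following is a simplified model under that assumption; the class, the string array names, and the single-valid-location rule are illustrative and do not describe the runtime's actual bookkeeping.

```java
import java.util.HashMap;
import java.util.Map;

final class ResidencySketch {
    enum Location { CPU, GPU }

    private final Map<String, Location> valid = new HashMap<>();
    int copies = 0;

    private Location where(String array) {
        return valid.getOrDefault(array, Location.CPU);
    }

    // A GPU kernel that reads `in` and overwrites `out`.
    void gpuKernel(String in, String out) {
        if (where(in) == Location.CPU) {
            copies++;                      // CPU-to-GPU copy of the input
            valid.put(in, Location.GPU);
        }
        valid.put(out, Location.GPU);      // produced on GPU, no copy needed
    }

    // The CPU reads `array` after the kernels have finished.
    void cpuAccess(String array) {
        if (where(array) == Location.GPU) {
            copies++;                      // GPU-to-CPU copy
            valid.put(array, Location.CPU);
        }
    }
}
```

Replaying the slide's loop (T iterations of kernel(a, b) followed by kernel(b, a), then CPU accesses to a and b) gives three copies in this model regardless of T: one for a[] before the first kernel and one each for a[] and b[] at the end.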
  • 25. Eliminating Redundant Exception Checks. Generate GPU code without exception checks by using loop versioning [Artigas00], which guarantees a safe region through precondition checks on CPU. Original: IntStream.range(0,n).parallel().forEach(i->{ b[i] = a[i]...; c[i] = a[i]...; }); CPU code: if ( /* check conditions for NullPointerException */ a != null && b != null && c != null && /* check conditions for ArrayIndexOutOfBoundsException */ n <= a.length && n <= b.length && n <= c.length) { ... <<<...>>> GPUbinary(...) ... } else { /* execute this construct on CPU to produce an exception under the original exception semantics */ } GPU binary for the lambda: { /* safe region: no exception check is required */ i = ...; b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; }
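Loop versioning can be written out by hand in plain Java to see its shape. In this sketch the parallel stream stands in for the check-free GPU kernel (a stock JVM still performs its own checks there), and the class and method names are illustrative only.

```java
import java.util.stream.IntStream;

final class LoopVersioningSketch {
    // Returns true if the precondition held and the fast version ran.
    static boolean run(float[] a, float[] b, float[] c, int n) {
        if (a != null && b != null && c != null
                && n <= a.length && n <= b.length && n <= c.length) {
            // Safe region: no index in [0, n) can be null-dereferenced or
            // out of bounds, so a generated kernel could drop those checks.
            IntStream.range(0, n).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f;
                c[i] = a[i] * 3.0f;
            });
            return true;
        }
        // Fallback: run sequentially so that the exception is raised at the
        // same iteration as in the original program.
        for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        }
        return false;
    }
}
```

The precondition is `n <= array.length` for each array, because the loop touches indices 0 through n - 1.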
  • 26. Automatically Optimized for CPU and GPU. The CPU code handles GPU device memory management and data copying, and checks whether the optimized CPU and GPU code can be executed. The GPU code is optimized by using the read-only cache and eliminating exception checks. CPU: if (a != null && b != null && c != null && n <= a.length && n <= b.length && n <= c.length && (a[] != b[]) && (a[] != c[])) { cudaMalloc(&d_a, a.length*sizeof(float)+128); if (b!=a) cudaMalloc(&d_b, b.length*sizeof(float)+128); if (c!=a && c!=b) cudaMalloc(&d_c, c.length*sizeof(float)+128); int sz = (n - 0) * sizeof(float), szh = sz + Jhdrsz; cudaMemcpy(d_a + align - Jhdrsz, a, szh, H2D); <<<...>>> GPU(d_a, d_b, d_c, n) /* launch GPU */ cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H); cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H); cudaFree(d_a); if (b!=a) cudaFree(d_b); if (c!=a && c!=b) cudaFree(d_c); } else { /* execute CPU binary */ } GPU: __global__ void GPU(float *a, float *b, float *c, int n) { /* no exception checks */ i = ...; b[i] = ROa[i] * 2.0; c[i] = ROa[i] * 3.0; }
  • 27. Benchmark Programs. Prepare sequential and parallel stream API versions in Java.
      Name | Summary | Data size | Type
      BlackScholes | Financial application that calculates the price of put and call options | 4,194,304 virtual options | double
      MM | A standard dense matrix multiplication: C = A.B | 1,024 x 1,024 | double
      Crypt | Cryptographic application [Java Grande Benchmarks] | N = 50,000,000 | byte
      Series | The first N Fourier coefficients of the function [Java Grande Benchmarks] | N = 1,000,000 | double
      SpMM | Sparse matrix multiplication [Java Grande Benchmarks] | N = 500,000 | double
      MRIQ | 3D image benchmark for MRI [Parboil benchmarks] | 64x64x64 | float
      Gemm | Matrix multiplication: C = α.A.B + β.C [PolyBench] | 1,024 x 1,024 | int
      Gesummv | Scalar, vector, and matrix multiplication [PolyBench] | 4,096 x 4,096 | int
  • 28. Performance Improvements of GPU Version over Sequential and Parallel CPU Versions. Achieves 127.9x on geomean and 2067.7x for Series over 1 CPU thread. Achieves 3.3x on geomean and 32.8x for Series over 160 CPU threads. Degrades performance for SpMM and Gesummv against 160 CPU threads. Environment: two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256 GB memory; one NVIDIA Kepler K40m GPU at 876 MHz with 12 GB global memory (ECC off); Ubuntu 14.10; CUDA 5.5; modified IBM Java 8 runtime for PowerPC.
  • 29. Performance Impact of Each Optimization. Optimizations are applied cumulatively: BASE (our four optimizations disabled), LV (loop versioning), DC (data copy), ALIGN (alignment optimization), ROC (read-only cache). MM: LV/DC/ALIGN/ROC are very effective. BlackScholes: DC is effective. MRIQ: LV/ALIGN/ROC are effective. SpMM and Gesummv: data transfer time for large arrays is dominant. [Figure: breakdown of the execution time.]
  • 30. Performance Comparison with Hand-Coded CUDA. Achieves 0.83x on geomean over CUDA. Crypt, Gemm, and Gesummv: usage of a read-only cache. BlackScholes: usage of larger CUDA threads per block (1024 vs. 128). SpMM: overhead of exception checks. MRIQ: lack of the '-use-fast-math' compile option. MM: lack of usage of shared memory with loop tiling. [Figure: speedup relative to CUDA, higher is better: BlackScholes 0.85, MM 0.45, Crypt 1.51, Series 0.92, SpMM 0.74, MRIQ 0.11, Gemm 1.19, Gesummv 3.47.]
  • 31. GPU Version is Slower than Parallel CPU Version. Can we choose an appropriate device (CPU or GPU) to avoid performance degradation? We want to make sure to achieve equal or better performance.
  • 32. Machine-learning-based Performance Heuristics. Construct a binary prediction model offline by supervised machine learning with support vector machines (SVMs). Features: loop range; dynamic number of instructions (memory access, arithmetic operation, …); dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]]); data transfer size (CPU to GPU, GPU to CPU). [Figure: for each training run (App A with data1, App A with data2, App B with data3), feature extraction on the bytecode yields a feature vector (feature 1, 2, 3); LIBSVM builds the prediction model, which the Java runtime uses to choose between CPU and GPU.]
  • 33. Most Predictions are Correct. Use 291 cases to build the model. Succeeded in predicting cases of performance degradation on GPU. Failed to predict BlackScholes. [Figure: prediction result per benchmark; chart annotations 1.8->1.0, 0.8->1.0, 0.4->1.0.]
  • 34. Related Work. Our research enables memory and communication optimizations with machine-learning-based device selection.
      Work | Language | Exception support | JIT compiler | How to write GPU kernel | Data copy optimization | GPU memory optimization | Device selection
      JCUDA | Java | × | × | CUDA | Manual | Manual | GPU only
      JaBEE | Java | × | √ | Override run method | × | × | GPU only
      Aparapi | Java | × | √ | Override run method/Lambda | × | × | Static
      Hadoop-CL | Java | × | √ | Override map/reduce method | × | × | Static
      Rootbeer | Java | × | √ | Override run method | Not described | × | Not described
      [PPPJ09] | Java | √ | √ | Java for-loop | Not described | × | Dynamic with regression
      HJ-OpenCL | Habanero-Java | √ | √ | Forall constructs | √ | × | Static
      Our work | Java | √ | √ | Standard parallel stream API | √ | ROCache / alignment | Dynamic with machine learning
  • 35. Future Work. Exploiting shared memory: performance is not easy to predict, and identifying reuse requires non-lightweight analysis. Supporting additional Java operations.
  • 36. GPU Exploitation in Apache Spark
  • 37. What is Apache Spark? A framework that performs distributed computing by transforming distributed immutable in-memory structures using a set of parallel operations, e.g., map(), filter(), reduce(), … Distributed immutable in-memory structures: RDD (Resilient Distributed Dataset), DataFrame, Dataset. Scala is the primary language for programming on Spark. Provides domain-specific libraries: Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), MLlib (machine learning). [Figure: the libraries run on the Spark runtime (written in Java and Scala) on the Java Virtual Machine; a driver sends tasks to executors and receives results; each executor reads data from a data source (HDFS, DB, file, etc.).] Open source: http://spark.apache.org/ Latest version is 2.0.2, released in 2016/11.
  • 38. How Program Works on Apache Spark. Parallel operations can be executed among partitions. Within a partition, data can be processed sequentially. case class Pt(x: Int, y: Int) val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS val ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2)) val cnt: Int = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2) [Figure: ds1 has two partitions, {Pt(1,5), Pt(2,6)} and {Pt(3,7), Pt(4,8)}. map (p.x+1, p.y*2) yields ds2 with partitions {Pt(2,10), Pt(3,12)} and {Pt(4,14), Pt(5,16)}. reduce sums x within each partition (2 + 3 = 5 and 4 + 5 = 9) and then across partitions, so cnt = 5 + 9 = 14.]
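The partition-level execution on this slide can be mimicked without Spark. The sketch below is plain Java, not Spark API: it represents each partition as a list, processes each partition sequentially, and combines partial results across partitions in parallel. All class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.List;

final class PartitionSketch {
    static final class Pt {
        final int x;
        final int y;

        Pt(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // The map step from the slide: Pt(p.x + 1, p.y * 2).
    static Pt mapFn(Pt p) {
        return new Pt(p.x + 1, p.y * 2);
    }

    // Sequential processing inside one partition: map, then sum x.
    static int partialSumX(List<Pt> partition) {
        int sum = 0;
        for (Pt p : partition) {
            sum += mapFn(p).x;
        }
        return sum;
    }

    // Partitions are independent, so they can be processed in parallel.
    static int sumX(List<List<Pt>> partitions) {
        return partitions.parallelStream()
                .mapToInt(PartitionSketch::partialSumX)
                .sum();
    }

    // The two partitions shown on the slide.
    static List<List<Pt>> example() {
        return Arrays.asList(
                Arrays.asList(new Pt(1, 5), new Pt(2, 6)),
                Arrays.asList(new Pt(3, 7), new Pt(4, 8)));
    }
}
```

For the slide's data the partial sums are 5 and 9, and `sumX(example())` returns 14.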
  • 39. How We Can Run Program Faster on GPU. Assign many parallel computations to cores. Make memory accesses coalesced: a column-oriented layout results in better performance. [Che2011] reports about 3x performance improvement of GPU kernel execution of kmeans with a column-oriented layout over a row-oriented layout. [Figure: Pt(x: Int, y: Int) with Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8), assuming 4 consecutive data elements can be coalesced by GPU hardware. Column-oriented layout (x1 x2 x3 x4 y1 y2 y3 y4): the cores load four Pt.x and then four Pt.y, 2 memory accesses to GPU device memory. Row-oriented layout (x1 y1 x2 y2 x3 y3 x4 y4): loading Pt.x and Pt.y takes 4 memory accesses.]
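The "2 vs. 4 memory accesses" comparison can be reproduced by counting how many aligned groups of consecutive elements each load touches. The sketch below uses the slide's assumption that 4 consecutive elements coalesce into one access; the class and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

final class CoalescingSketch {
    // Number of aligned groups of `width` consecutive elements that the
    // given element indices fall into; each group costs one access.
    static int accesses(int[] elementIndices, int width) {
        Set<Integer> groups = new HashSet<>();
        for (int index : elementIndices) {
            groups.add(index / width);
        }
        return groups.size();
    }

    // Row-oriented layout: x1 y1 x2 y2 ... (x at 2*i, y at 2*i + 1).
    static int rowOriented(int n, int width) {
        int[] xs = new int[n];
        int[] ys = new int[n];
        for (int i = 0; i < n; i++) {
            xs[i] = 2 * i;
            ys[i] = 2 * i + 1;
        }
        return accesses(xs, width) + accesses(ys, width);
    }

    // Column-oriented layout: x1 ... xn y1 ... yn (x at i, y at n + i).
    static int columnOriented(int n, int width) {
        int[] xs = new int[n];
        int[] ys = new int[n];
        for (int i = 0; i < n; i++) {
            xs[i] = i;
            ys[i] = n + i;
        }
        return accesses(xs, width) + accesses(ys, width);
    }
}
```

With four points and a width of 4, `columnOriented(4, 4)` is 2 and `rowOriented(4, 4)` is 4, matching the figure.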
  • 40. Idea to Transparently Exploit GPUs on Apache Spark. Generate GPU code from a set of parallel operations (already achieved in our earlier research). Physically put distributed immutable in-memory structures (e.g., Dataset) in a column-oriented representation: Dataset is statically typed, but its physical layout is not specified in the program.
  • 41. Overview of GPU Exploitation on Apache Spark. User's Spark program: case class Pt(x: Int, y: Int) ds1 = sc.parallelize(Seq( Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS ds2 = ds1.map(p => Pt(p.x+1, p.y*2)) cnt = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2) [Figure: on the CPU, each partition of ds1 holds x = 1 2 3 4 and y = 5 6 7 8; after data transfer to the GPU, the GPU kernel (native code) computes x + 1 = 2 3 4 5 and y * 2 = 10 12 14 16; the results are transferred back to the CPU as ds2.]
  • 42. Overview of GPU Exploitation on Apache Spark. Efficient: reduce data copy overhead between CPU and GPU, and make memory accesses efficient on GPU. Transparent: map parallelism in the program into GPU native code. User's Spark program: case class Pt(x: Int, y: Int) ds1 = sc.parallelize(Seq( Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS ds2 = ds1.map(p => Pt(p.x+1, p.y*2)) cnt = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2) The GPU can exploit parallelism both among partitions in a Dataset and within a partition of a Dataset. [Figure: the program drives GPU native code; on the CPU, ds1 is kept in columnar storage (x = 1 2 3 4, y = 5 6 7 8, laid out by memory address); the GPU manager handles data transfer; the GPU kernel computes x + 1 = 2 3 4 5 and y * 2 = 10 12 14 16, which become ds2.]
  • 43. Exploit Parallelism Between GPU Kernels. Overlap data transfers and computations among different GPU kernels on a GPU. [Figure: timeline for three Spark workers for GPU; each worker repeats data transfer from CPU to GPU, GPU kernel executions, and data transfer from GPU to CPU, and the phases of different workers overlap in time.]
  • 44. How We Write Program and What is Executed. Write a program using a relational operation for DataFrame or a lambda expression for Dataset. Catalyst performs optimization and code generation for the program. The corresponding Java bytecode for the generated Java code is executed. Frontend API, with data = Seq(Pt(1, 5),Pt(2, 6)): DataFrame (v1.3-): df1 = data.toDF(…) df2 = df1.selectExpr("x+1") df2.agg(sum()) Dataset (v1.6-): ds1 = data.toDS() ds2 = ds1.map(p => p.x+1) ds2.reduce((a,b) => a+b) Backend computation: Catalyst generates Java code, and the Catalyst-generated Java bytecode processes row-oriented data (1 5, 2 6) in the Java heap. "Catalyst" is a code name for the optimizer and code generator in Apache Spark.
  • 45. How Program is Executed on GPU. For DataFrame and Dataset, the enhanced Catalyst generates Java code optimized for GPU. A just-in-time compiler in the Java virtual machine can generate GPU code. Frontend API, with data = Seq(Pt(1, 5),Pt(2, 6)): DataFrame (v1.3-): df1 = data.toDF(…) df2 = df1.selectExpr("x+1") df2.agg(sum()) Dataset (v1.6-): ds1 = data.toDS() ds2 = ds1.map(p => p.x+1) ds2.reduce((a,b) => a+b) Backend computation: the enhanced Catalyst generates optimized Java code, and the automatically generated GPU code processes column-oriented data (1 2, 5 6) in GPU device memory.
  • 46. Pseudo Java Code by Current Catalyst. Perform an optimization that merges multiple parallel operations (selectExpr() and agg(sum())) into one loop. DataFrame program for Spark: val df1 = (-1 to 1).toDF("x") val df2 = df1.selectExpr("x + 1") df2.agg(sum()) Pseudo Java code generated by Catalyst (corresponds to selectExpr() and local sum()): int sum = 0; while (rowIterator.hasNext()) { /* iterator-based access */ Row row = rowIterator.next(); /* for df1 */ int x = row.getInteger(0); /* selectExpr(x + 1) */ int x_new = x + 1; /* for df2 */ sum += x_new; } [Figure: df1 is row-oriented and read sequentially; x = -1, 0, 1; x_new = 0, 1, 2; sum = 0, 1, 3.]
  • 47. Pseudo Java Code by Enhanced Catalyst. Get column0 from column-oriented storage. The for-loop can be executed in a parallel reduction manner. Pseudo Java code generated by the enhanced Catalyst: Column column0 = df1.getColumn(0); /* df1 */ int sum = 0; for (int i = 0; i < column0.numRows; i++) { int x = column0.getInteger(i); /* selectExpr(x + 1) */ int x_new = x + 1; /* for df2 */ sum += x_new; } [Figure: df1 is column-oriented; x = -1, 0, 1; x_new = 0, 1, 2; sum = 3.]
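Once the column is a plain indexed array, the loop on this slide has no dependence between iterations other than the sum, so it can be written as a parallel reduction. The sketch below shows this with the standard Java stream API on CPU threads; the class name is illustrative and an `int[]` stands in for the pseudo `Column` type.

```java
import java.util.stream.IntStream;

final class ColumnReductionSketch {
    // selectExpr("x + 1") followed by sum(), expressed as map + reduction
    // over the indices of a column-oriented array.
    static int sumOfXPlusOne(int[] column0) {
        return IntStream.range(0, column0.length)
                .parallel()
                .map(i -> column0[i] + 1)
                .sum();
    }
}
```

For the slide's column x = -1, 0, 1 the mapped values are 0, 1, 2 and the result is 3. Integer addition is associative, which is what allows the partial sums to be combined in any order.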
  • 48. Generate GPU Code Transparently from Spark Program. Copy column-oriented storage into GPU. Execute add and reduction in one GPU kernel. DataFrame program: val df1 = (-1 to 1).toDF("x") val df2 = df1.selectExpr("x + 1") df2.agg(sum()) Generated CPU code: Column column0 = df1.getColumn(0); int nRows = column0.numRows; cudaMalloc(&d_c0, nRows*4); cudaMemcpy(d_c0, column0, nRows*4, H2D); int sum = 0; cudaMalloc(&d_sum, 4); cudaMemcpy(d_sum, &sum, 4, H2D); <<<...>>> GPU(d_c0, d_sum, nRows) /* launch GPU */ cudaMemcpy(&sum, d_sum, 4, D2H); cudaFree(d_sum); cudaFree(d_c0); GPU code (executed in parallel): __global__ void GPU(int *d_c0, int *d_sum, long size) { long ix = … /* 0, 1, 2 */ if (size <= ix) return; int x = d_c0[ix]; int x_new = x + 1; reduction(d_sum, x_new); } [Figure: d_c0 holds x = -1, 0, 1; x_new = 0, 1, 2; d_sum = 3.]
  • 49. Related Work. Spark With Accelerated Tasks [Grossman2016]: generates GPU code from a lambda function in map() in RDD, e.g., val inputRDD = cl(sc.objectFile[Int]( hdfsPath )) val doubledRDD = inputRDD.map(i => 2 * i). Very similar to the enhanced Catalyst using columnar storage to transparently exploit GPUs; however, it works only for RDD with map(). GPU Columnar (proposed by Kiran Lonikar): generates GPU code from a program using the select() method in DataFrame. Very similar to the enhanced Catalyst using columnar storage to transparently exploit GPUs.
  • 50. Takeaway. How does a system generate hardware accelerator code from a program with high-level abstraction? Most programmers are not Ninja programmers. A compiler can transform a program for hardware features, but does not want to do trial and error at runtime. How can compiler and hardware build a good relationship? (Not covered today) What can we do for deep learning? Current deep learning frameworks use GPUs by calling libraries (e.g., cuDNN/cuRNN by NVIDIA). How will systems support the rapid evolution of deep learning? New neural network structures are still being proposed.