4. Explicit Parallelism with Java
High-level parallel programming with Java offers opportunities for:
- preserving portability: low-level parallel/accelerator programming is not required
- enabling the compiler to perform parallel-aware optimizations and code generation
[Figure: Java 8 programs (SW) call the Java 8 Parallel Stream API, which targets multi-core CPUs, many-core GPUs, and FPGAs (HW).]
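To make the API concrete, the parallel loop pattern the deck discusses can be sketched as a complete program (the class name, array name, and size below are illustrative, not from the deck):

```java
import java.util.stream.IntStream;

// Minimal, self-contained example of the Java 8 Parallel Stream API:
// the same forEach runs unchanged on multi-core CPUs today, and a
// parallel-aware compiler is free to retarget it to other devices.
public class ParallelStreamExample {
    public static void main(String[] args) {
        final int N = 1_000_000;
        int[] a = new int[N];
        // Each index i is processed by some worker thread; iterations
        // must be independent for this to be safe.
        IntStream.range(0, N).parallel().forEach(i -> a[i] = i * 2);
        System.out.println(a[N - 1]); // 1999998
    }
}
```

On a stock JVM, `.parallel()` distributes the iterations over the common ForkJoinPool; the point of the deck is that the same standard API call can also be compiled for accelerators.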
5. Challenges for
GPU Code Generation
Standard Java API call for parallelism:
IntStream.range(0, N).parallel().forEach(i -> <lambda>);
Challenge 1: Supporting Java features on GPUs (exception semantics, virtual method calls)
Challenge 2: Accelerating Java programs on GPUs (kernel optimizations, DT optimizations)
Challenge 3: Performance heuristics (selection of the faster device, CPUs vs. GPUs)
Credits : Checklist by Phil Laver from the Noun Project, Rocket by Luis Prado from the Noun Project
Rocket by Maxwell Painter Karpylev from the Noun Project, Running by Dillon Arloff from the Noun Project
6. JIT Compilation for GPU
IBM Java 8 Compiler
Built on top of the production version of the IBM Java 8 runtime environment
[Figure: a method (method A) is interpreted on the JVM for its 1st through Nth invocations; at the (N+1)th invocation, native code is generated for multi-core CPUs and many-core GPUs.]
7. The JIT Compilation Flow
[Figure: the JIT compilation flow.
Our JIT compiler: Java bytecode -> translation to our IR -> parallel streams identification -> optimization for GPUs; existing optimizations -> target machine code generation -> CPU binary; the GPU path emits NVVM IR.
NVIDIA's tool chain: NVVM IR -> libNVVM -> GPU binary, executed by the GPU runtime.]
8. Java Features on GPUs
Exceptions
- Explicit exception checking on GPUs: ArrayIndexOutOfBoundsException, NullPointerException, and ArithmeticException (division by zero only)
- Further optimization: loop versioning to eliminate redundant exception checking on GPUs, based on [Artigas+, ICS'00]
Virtual method calls
- Direct de-virtualization or guarded de-virtualization [Ishizaki+, OOPSLA'00]
Challenge 1 : Supporting Java Features on GPUs
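As a sketch of what guarded de-virtualization does (all class and method names below are hypothetical, not from the compiler), the JIT replaces a virtual dispatch with a cheap class test plus a direct call that can then be inlined:

```java
// Hypothetical sketch of guarded de-virtualization: the compiler
// expects the receiver to be a Circle, tests the class once, and
// makes a direct (inlinable) call; otherwise it falls back to a
// normal virtual dispatch.
public class Devirtualization {
    static abstract class Shape { abstract double area(); }
    static final class Circle extends Shape {
        final double r;
        Circle(double r) { this.r = r; }
        double area() { return Math.PI * r * r; }
    }
    static final class Square extends Shape {
        final double s;
        Square(double s) { this.s = s; }
        double area() { return s * s; }
    }

    static double guardedArea(Shape sh) {
        if (sh.getClass() == Circle.class) {
            return ((Circle) sh).area(); // guarded, direct call
        }
        return sh.area();                // fallback: virtual call
    }

    public static void main(String[] args) {
        System.out.println(guardedArea(new Square(3))); // 9.0
    }
}
```

Direct de-virtualization omits the guard entirely when class-hierarchy analysis proves only one target exists.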
9. Loop Versioning Example
    IntStream.range(0, N)
             .parallel()
             .forEach(i -> {
                 Array[i] = i / N;   // may cause NullPointerException,
                                     // ArrayIndexOutOfBoundsException,
                                     // or ArithmeticException
             });

    // Speculative checking on CPU
    if (Array != null            // guards NullPointerException
        && 0 <= Array.length
        && N <= Array.length     // guards ArrayIndexOutOfBoundsException
        && N != 0) {             // guards ArithmeticException (division by zero)
        // Safe GPU execution without per-iteration exception checks
    } else {
        // Fall back to execution with exception checks
    }
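The example above can be sketched as a runnable program. The guard mirrors the slide's speculative checks; the method name `fill` and the CPU-only fallback are illustrative, since the real GPU path is generated by the JIT:

```java
import java.util.stream.IntStream;

public class LoopVersioning {
    // Sketch (not the actual JIT output) of the loop-versioning guard:
    // if every check passes once up front, no exception can occur inside
    // the loop, so the body is eligible for check-free GPU execution.
    static void fill(int[] array, int n) {
        if (array != null && n <= array.length && n != 0) {
            // "Safe" version: exception-check-free loop (GPU-eligible)
            IntStream.range(0, n).parallel().forEach(i -> array[i] = i / n);
        } else {
            // Fallback version: normal JVM execution with exception checks
            IntStream.range(0, n).forEach(i -> array[i] = i / n);
        }
    }

    public static void main(String[] args) {
        int[] a = new int[8];
        fill(a, 8);
        System.out.println(a[7]); // 7 / 8 == 0 in integer division
    }
}
```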
10. Optimizing GPU Execution
Kernel optimizations
- Optimizing the alignment of Java arrays on GPUs
- Utilizing the "read-only cache" of recent GPUs
Data transfer (DT) optimizations
- Redundant data transfer elimination
- Partial transfer if an array subscript is affine
Challenge 2 : Accelerating Java programs on GPUs
11. The Alignment Optimization Example
[Figure: memory layout of a Java array a[] at addresses 0, 128, 256, 384. With a naive alignment strategy, the object header shifts the elements, so the block a[0]-a[31] straddles a 128-byte boundary and reading a[0:31] needs two memory transactions. With our alignment strategy, the element region starts at a 128-byte boundary, so reading a[0:31] needs one memory transaction.]
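The arithmetic behind this optimization can be sketched as follows; the 16-byte header size and the helper name `alignedBase` are assumptions for illustration, not values from the deck:

```java
public class Alignment {
    static final int HEADER_BYTES = 16; // assumed object-header size
    static final int ALIGN = 128;       // GPU memory-transaction granularity

    // Choose a base address such that the first array element
    // (which sits HEADER_BYTES past the base) falls exactly on a
    // 128-byte boundary, so a[0:31] is covered by one transaction.
    static long alignedBase(long rawBase) {
        long firstElement = rawBase + HEADER_BYTES;
        long alignedFirst = (firstElement + ALIGN - 1) / ALIGN * ALIGN;
        return alignedFirst - HEADER_BYTES;
    }

    public static void main(String[] args) {
        // Base 112 puts the header at 112-127 and element 0 at 128.
        System.out.println(alignedBase(0)); // 112
    }
}
```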
14. Related Work : Linear Regression
Regression-based cost estimation is specific to an application.
[Figure: kernel execution time (msec) vs. the dynamic number of IR instructions (0 to 8e+07) on an NVIDIA Tesla K40 GPU and an IBM POWER8 CPU, for App 1) BlackScholes and App 2) Vector Addition; the CPU/GPU trends differ between the two applications.]
[1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ’09)
[2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3)
Challenge 3 : Performance Heuristics
15. Is an Accurate Cost Model Required?
Constructing an accurate cost model would cost too much: considerable effort would be needed to update performance models for future generations of hardware.
[Figure: which is faster for a given parallel loop, multi-core CPUs or many-core GPUs?]
Our answer: machine-learning-based performance heuristics.
16. ML-based Performance Heuristics
A binary prediction model is constructed by supervised machine learning with support vector machines (SVMs).
[Figure: in training runs with the JIT compiler, feature extraction produces one feature vector per application/input pair (App A with data 1, App A with data 2, App B with data 3). In offline model construction, the feature vectors are fed to LIBSVM to build a prediction model, which the JIT compiler in the Java runtime consults to choose between CPU and GPU.]
17. Features of program
Loop range (parallel loop size)
The dynamic number of instructions in the IR:
- Memory accesses
- Arithmetic operations
- Math methods
- Branch instructions
- Other instructions
18. Features of program (Cont’d)
The dynamic number of array accesses:
- Coalesced access (a[i])
- Offset access (a[i+c])
- Stride access (a[c*i])
- Other access (a[b[i]])
Data transfer size:
- H2D transfer size
- D2H transfer size
[Figure: measured memory bandwidth (GB/s) as a function of c (the offset or stride size, 0-30) for offset accesses array[i+c] and stride accesses array[c*i].]
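The two feature slides above can be summarized as a data structure. The class and field names below are hypothetical stand-ins for whatever the JIT compiler records internally; only the list of features comes from the slides:

```java
// Hypothetical sketch of the feature vector the heuristic uses.
// The fields mirror the slide's feature list; a training run of the
// JIT compiler would fill them in for each application/input pair.
public class FeatureVector {
    long loopRange;                        // parallel loop size
    long memoryAccesses, arithmeticOps,    // dynamic IR instruction
         mathMethods, branchInsns,         // counts, by kind
         otherInsns;
    long coalesced, offset, stride, other; // dynamic array-access counts
    long h2dBytes, d2hBytes;               // data transfer sizes

    // Flatten into the numeric vector a learner (e.g., LIBSVM) consumes.
    double[] toArray() {
        return new double[] {
            loopRange, memoryAccesses, arithmeticOps, mathMethods,
            branchInsns, otherInsns, coalesced, offset, stride, other,
            h2dBytes, d2hBytes
        };
    }

    public static void main(String[] args) {
        FeatureVector f = new FeatureVector();
        f.loopRange = 4_000_000;
        f.coalesced = 4_000_000;
        System.out.println(f.toArray().length); // 12 features
    }
}
```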
20. Prediction Model Construction
Obtained 291 samples by running 11 applications with different data sets
The choice is either the GPU or 160 worker threads on the CPU
[Figure: the same training-run and offline model-construction flow as on the ML-based performance heuristics slide.]
21. Applications
Application            Source     Field                Max Size         Data Type
BlackScholes           -          Finance              4,194,304        double
Crypt                  JGF        Cryptography         Size C (N=50M)   byte
SpMM                   JGF        Numerical Computing  Size C (N=500K)  double
MRIQ                   Parboil    Medical              Large (64^3)     float
Gemm                   Polybench  Numerical Computing  2K x 2K          int
Gesummv                Polybench  Numerical Computing  2K x 2K          int
Doitgen                Polybench  Numerical Computing  256x256x256      int
Jacobi-1D              Polybench  Numerical Computing  N=4M, T=1        int
Matrix Multiplication  -          -                    2K x 2K          double
Matrix Transpose       -          -                    2K x 2K          double
VecAdd                 -          -                    4M               double
22. Platform
CPU: IBM POWER8 @ 3.69 GHz
- 20 cores, 8 SMT threads per core (up to 160 threads)
- 256 GB of RAM
GPU: NVIDIA Tesla K40m
- 12 GB of global memory
24. How to evaluate prediction models:
5-fold cross-validation
Overfitting problem: the prediction model may be tailored to the eleven applications if the training data equals the testing data.
To avoid overfitting, we calculate the accuracy of the prediction model using 5-fold cross-validation.
[Figure: the 291 training samples are split into five subsets; in each fold, one subset is held out for testing (yielding accuracy X%, Y%, ...) and the prediction model is trained on the other four (e.g., test on Subset 1, train on Subsets 2-5; then test on Subset 2, train on Subsets 1 and 3-5).]
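The fold mechanics above can be sketched in a few lines. A trivial majority-vote predictor stands in for the LIBSVM model, and the `int[]{id, label}` sample encoding is purely illustrative; only the 5-fold hold-out scheme itself comes from the slide:

```java
import java.util.ArrayList;
import java.util.List;

public class CrossValidation {
    // samples: each int[] is {id, label}; label 1 = CPU faster, 0 = GPU faster.
    static double accuracy(List<int[]> samples, int folds) {
        int correct = 0;
        for (int f = 0; f < folds; f++) {
            List<int[]> train = new ArrayList<>();
            List<int[]> test = new ArrayList<>();
            // Hold out every folds-th sample for testing; train on the rest.
            for (int i = 0; i < samples.size(); i++) {
                (i % folds == f ? test : train).add(samples.get(i));
            }
            // Stand-in "model": predict the majority label of the training fold.
            long cpuFaster = train.stream().filter(s -> s[1] == 1).count();
            int predicted = 2 * cpuFaster >= train.size() ? 1 : 0;
            for (int[] s : test) {
                if (s[1] == predicted) correct++;
            }
        }
        return (double) correct / samples.size();
    }

    public static void main(String[] args) {
        List<int[]> samples = new ArrayList<>();
        for (int i = 0; i < 291; i++) samples.add(new int[]{i, i % 4 == 0 ? 0 : 1});
        System.out.println(accuracy(samples, 5));
    }
}
```

Because testing always happens on a fold the model never saw, the reported accuracy reflects generalization rather than memorization.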
25. Accuracies with Cross-Validation
[Figure: prediction accuracy (%) per feature set; total number of samples = 291; higher is better.]
Range                                      79.0%
+=nIRs (does not distinguish insn types)   97.6%
+=dIRs (distinguishes insn types)          99.0%
+=Array                                    99.0%
ALL (+= data transfer size)                97.2%
26. Discussion
Based on the results with 291 samples, the feature set (range, # of detailed insns, # of array accesses) shows the best accuracy.
DT does not contribute to improving the accuracy, since there are only a few samples in which the DT size affects the decision.
Pros and cons:
(+) Future generations of hardware can be supported easily by re-running the applications
(+) Simply add more training data and rebuild the prediction model
(-) Collecting training data takes time
27. Related Work: Java + GPU
Work           Lang   JIT  GPU Kernel            Device Selection
JCUDA          Java   -    CUDA                  GPU only
Lime           Lime   ✔    Override map/reduce   Static
Firepile       Scala  ✔    reduce                Static
JaBEE          Java   ✔    Override run          GPU only
Aparapi        Java   ✔    map                   Static
Hadoop-CL      Java   ✔    Override map/reduce   Static
RootBeer       Java   ✔    Override run          Not described
HJ-OpenCL      HJ     -    forall / lambda       Static
PPPJ09 (auto)  Java   ✔    For-loop              Dynamic with regression
Our work       Java   ✔    Parallel stream       Dynamic with machine learning
None of these approaches considers the Java 8 Parallel Stream API together with dynamic device selection based on machine learning.
28. Summary
Standard Java API call for parallelism:
IntStream.range(0, N).parallel().forEach(i -> <lambda>);
Challenge 1: Supporting Java features on GPUs (exception semantics, virtual method calls): exceptions and virtual calls are supported
Challenge 2: Accelerating Java programs on GPUs (kernel optimizations, DT optimizations): over 1,000x speedup
Challenge 3: Performance heuristics (selection of the faster device, CPUs vs. GPUs): up to 99% accuracy
29. Acknowledgements
Special thanks to
IBM CAS
Marcel Mitran (IBM Canada)
Jimmy Kwa (IBM Canada)
Habanero Group at Rice
Yoichi Matsuyama (CMU)
30. Further Readings
Code Generation and Optimizations:
Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents, Vivek Sarkar. "Compiling and Optimizing Java 8 Programs for GPU Execution." 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), October 2015.
Performance Heuristics:
Akihiro Hayashi, Kazuaki Ishizaki, Gita Koblents, Vivek Sarkar. "Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection." 12th International Conference on the Principles and Practice of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ), September 2015.
33. How to evaluate prediction models:
TRUE/FALSE POSITIVE/NEGATIVE
Four types of binary prediction results (let POSITIVE be CPU and NEGATIVE be GPU):
TRUE POSITIVE: correctly predicted that the CPU is faster
TRUE NEGATIVE: correctly predicted that the GPU is faster
FALSE POSITIVE: predicted that the CPU is faster, but the GPU is actually faster
FALSE NEGATIVE: predicted that the GPU is faster, but the CPU is actually faster
34. How to evaluate prediction models:
Accuracy, Precision, and Recall metrics
Accuracy: the percentage of selections predicted correctly:
(TP + TN) / (TP + TN + FP + FN)
Precision X (X = CPU or GPU): how precise the model is when it predicts that X is faster
- Precision CPU: TP / (TP + FP), i.e., # of samples correctly predicted CPU is faster / total # of samples predicted CPU is faster
- Precision GPU: TN / (TN + FN)
Recall X (X = CPU or GPU): how often the prediction is correct when X is actually faster
- Recall CPU: TP / (TP + FN), i.e., # of samples correctly predicted CPU is faster / total # of samples where CPU is actually faster
- Recall GPU: TN / (TN + FP)
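The formulas above translate directly into code. The confusion-matrix counts in `main` are hypothetical, chosen only so the four values sum to 291; they are not the paper's measured results:

```java
// Sketch of the accuracy/precision/recall formulas from the slide,
// with POSITIVE = CPU and NEGATIVE = GPU.
public class Metrics {
    final int tp, tn, fp, fn; // confusion-matrix counts

    Metrics(int tp, int tn, int fp, int fn) {
        this.tp = tp; this.tn = tn; this.fp = fp; this.fn = fn;
    }

    double accuracy()     { return (double) (tp + tn) / (tp + tn + fp + fn); }
    double precisionCpu() { return (double) tp / (tp + fp); } // of CPU predictions, fraction correct
    double precisionGpu() { return (double) tn / (tn + fn); } // of GPU predictions, fraction correct
    double recallCpu()    { return (double) tp / (tp + fn); } // of CPU-faster samples, fraction found
    double recallGpu()    { return (double) tn / (tn + fp); } // of GPU-faster samples, fraction found

    public static void main(String[] args) {
        // Hypothetical counts for 291 samples (220 + 60 + 5 + 6 = 291).
        Metrics m = new Metrics(220, 60, 5, 6);
        System.out.printf("accuracy=%.3f precisionCPU=%.3f recallGPU=%.3f%n",
                m.accuracy(), m.precisionCpu(), m.recallGpu());
    }
}
```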
35. Precisions and Recalls with cross-validation
          Precision CPU  Recall CPU  Precision GPU  Recall GPU
Range     79.0%          100%        0%             0%
+=nIRs    97.8%          99.1%       96.5%          91.8%
+=dIRs    98.7%          100%        100%           95.0%
+=Arrays  98.7%          100%        100%           95.0%
ALL       96.7%          100%        100%           86.9%
Higher is better
All prediction models except Range rarely make a bad decision