Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs
Gloria Kim (Rice University)
Akihiro Hayashi (Rice University)
Vivek Sarkar (Georgia Tech)
1
Java For HPC?
Dual Socket Intel Xeon E5-2666 v3, 18 cores, 2.9GHz, 60-GiB DRAM
2
4K x 4K Matrix Multiply | GFLOPS | Absolute speedup
Python | 0.005 | 1
Java | 0.058 | 11
C | 0.253 | 47
Parallel loops | 1.969 | 366
Parallel divide and conquer | 36.168 | 6,724
+ vectorization | 124.945 | 23,230
+ AVX intrinsics | 335.217 | 62,323
Strassen | 361.681 | 67,243
Credits : Charles Leiserson, Ken Kennedy Award Lecture @ Rice University, Karthik Murthy @ PACT2015
Any ways to accelerate Java programs?
Java 8 Parallel Streams APIs
Explicit Parallelism with lambda expressions:
IntStream.range(0, N)
         .parallel()
         .forEach(i -> <lambda>);
(A complete, runnable example follows below.)
3
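For readers who want to try this pattern, here is a minimal, self-contained sketch of a Java 8 parallel stream. The array names and problem size are illustrative and not taken from the slides.

    import java.util.stream.IntStream;

    public class ParallelVecAdd {
        public static void main(String[] args) {
            final int N = 1 << 20;                       // illustrative problem size
            double[] a = new double[N], b = new double[N], c = new double[N];
            for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

            // The <lambda> placeholder above becomes the loop body below.
            IntStream.range(0, N)
                     .parallel()
                     .forEach(i -> { a[i] = b[i] + c[i]; });

            System.out.println("a[42] = " + a[42]);
        }
    }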
Explicit Parallelism with Java
- High-level parallel programming with Java offers opportunities for:
  - preserving portability: low-level parallel/accelerator programming is not required
  - enabling the compiler to perform parallel-aware optimizations and code generation
[Diagram: Java 8 programs (SW) are mapped through the Java 8 Parallel Stream API onto multi-core CPUs, many-core GPUs, and FPGAs (HW).]
4
Challenges for GPU Code Generation
5
IntStream.range(0, N).parallel().forEach(i -> <lambda>); (a standard Java API call for parallelism)
- Challenge 1: Supporting Java features on GPUs (exception semantics, virtual method calls)
- Challenge 2: Accelerating Java programs on GPUs (kernel optimizations, DT (data transfer) optimizations)
- Challenge 3: CPU vs. GPU selection (selecting the faster device from CPUs and GPUs)
Credits : Checklist by Phil Laver from the Noun Project, Rocket by Luis Prado from the Noun Project, Rocket by Maxwell Painter Karpylev from the Noun Project, Running by Dillon Arloff from the Noun Project
Related Work: Java + GPU
6
Approach | Lang | JIT | GPU Kernel | Device Selection
JCUDA | Java | - | CUDA | GPU only
Lime | Lime | ✔ | Override map/reduce | Static
Firepile | Scala | ✔ | reduce | Static
JaBEE | Java | ✔ | Override run | GPU only
Aparapi | Java | ✔ | map | Static
Hadoop-CL | Java | ✔ | Override map/reduce | Static
RootBeer | Java | ✔ | Override run | Not described
HJ-OpenCL | HJ | - | forall / lambda | Static
PPPJ09 (auto) | Java | ✔ | For-loop | Dynamic with regression
Our Work | Java | ✔ | Parallel Stream | Dynamic with machine learning
None of these approaches considers Java 8 Parallel Stream APIs or dynamic device selection with machine learning.
JIT Compilation for GPU
IBM Java 8 Compiler
- Built on top of the production version of the IBM Java 8 runtime environment (J9 VM)
7
[Diagram: method A is interpreted on the JVM for its early invocations; once the invocation count exceeds a threshold, native code is generated for multi-core CPUs and many-core GPUs.]
The JIT Compilation Flow
8
[Diagram: Java bytecode is translated to our IR; parallel streams are identified and GPU optimizations are applied, producing NVVM IR that libNVVM (NVIDIA's tool chain) turns into the GPU binary; the existing optimizations and target machine code generation produce the CPU binary; the GPU binary is invoked through the GPU runtime. The stages up to NVVM IR generation belong to our JIT compiler.]
Performance Evaluations
9
[Bar chart: speedup relative to sequential Java (log scale, higher is better) for 160 worker threads (fork/join) vs. GPU on each benchmark. CPU: IBM POWER8 @ 3.69GHz, GPU: NVIDIA Tesla K40m.]
Runtime CPU/GPU Selection
10
[Diagram: at the (N+1)th invocation of method A, native code can be generated for either multi-core CPUs or many-core GPUs.]
(Open question) Which one is faster?
Our approach: ML-based Performance Heuristics
- A binary prediction model is constructed using supervised machine learning techniques
11
[Diagram: during training runs with the JIT compiler, a feature vector is extracted from the bytecode of each application/data-set pair (e.g., App A with data 1 and data 2, App B with data 3); the labeled samples (CPU or GPU) are then used offline by an ML algorithm to construct the prediction model consulted by the Java runtime.]
(A sketch of the runtime decision follows.)
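The sketch below shows, in plain Java, how such a binary model could be consulted at dispatch time. This is illustrative only and not the actual J9 implementation: FeatureVector, PredictionModel, and predictGpuIsFaster are hypothetical names.

    // Hypothetical sketch of the runtime device-selection decision.
    final class FeatureVector {
        long range;                                    // parallel loop size
        long memory, arithmetic, math, branch, other;  // dynamic IR instruction counts
        long coalesced, offset, stride, random;        // dynamic array-access counts
    }

    interface PredictionModel {
        /** Returns true if the model predicts the GPU version will be faster. */
        boolean predictGpuIsFaster(FeatureVector f);
    }

    final class Dispatcher {
        static void run(FeatureVector f, PredictionModel model,
                        Runnable gpuKernel, Runnable cpuParallelLoop) {
            if (model.predictGpuIsFaster(f)) {
                gpuKernel.run();          // launch the GPU binary generated by the JIT
            } else {
                cpuParallelLoop.run();    // run the parallel loop on CPU worker threads
            }
        }
    }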
Features of the program
- Loop range (parallel loop size)
- The dynamic number of instructions in the IR:
  - Memory accesses
  - Arithmetic operations
  - Math methods
  - Branch instructions
  - Other types of instructions
12
Features of the program (Cont'd)
- The dynamic number of array accesses:
  - Coalesced access (a[i])
  - Offset access (a[i+c])
  - Stride access (a[c*i])
  - Other access (a[b[i]])
(An illustrative code example follows the plot below.)
13
[Plot: memory bandwidth (GB/s) as a function of c, the offset or stride size, for offset access (array[i+c]) and stride access (array[c*i]).]
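For illustration (not taken from the slides), the four access classes can appear in a parallel kernel as follows; the arrays, the constant k, and the modulo used to stay in bounds are all illustrative.

    import java.util.stream.IntStream;

    class AccessPatterns {
        // Illustrative kernel containing the four array-access classes counted as features.
        // Assumes a and c have length n and every b[i] is a valid index into a.
        static void touch(double[] a, int[] b, double[] c, int k, int n) {
            IntStream.range(0, n).parallel().forEach(i -> {
                double coalesced = a[i];               // coalesced access: a[i]
                double offset    = a[(i + k) % n];     // offset access:    a[i+c]
                double stride    = a[(k * i) % n];     // stride access:    a[c*i]
                double indirect  = a[b[i]];            // other access:     a[b[i]]
                c[i] = coalesced + offset + stride + indirect;
            });
        }
    }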
An example of a feature vector
14
"features" : {
"range": 256,
"ILs" : {
"Memory": 9, "Arithmetic": 7, "Math": 0,
"Branch": 1, "Other": 1 },
"Array Accesses" : {
"Coalesced": 3, "Offset": 0, "Stride": 0, "Random": 0},
}
IntStream.range(0, N)
.parallel()
.forEach(i -> { a[i] = b[i] + c[i];});
Offline Prediction Model Construction
- Obtained 291 samples by running 11 applications with different data sets
- Obtained an additional 41 samples by running 3 more applications
- The choice is either the GPU or 160 worker threads on the CPU
15
[Diagram: the same training flow as before: feature extraction during training runs with the JIT compiler, followed by offline model construction with ML algorithms.]
Platform
CPU
- IBM POWER8 @ 3.69GHz
  - 20 cores
  - 8 SMT threads per core = up to 160 threads
  - 256 GB of RAM
GPU
- NVIDIA Tesla K40m
  - 12 GB of global memory
16
Applications
17
Application | Source | Field | Max Size | Data Type
BlackScholes | | Finance | 4,194,304 | double
Crypt | JGF | Cryptography | Size C (N=50M) | byte
SpMM | JGF | Numerical Computing | Size C (N=500K) | double
MRIQ | Parboil | Medical | Large (64^3) | float
Gemm | Polybench | Numerical Computing | 2K x 2K | int
Gesummv | Polybench | | 2K x 2K | int
Doitgen | Polybench | | 256x256x256 | int
Jacobi-1D | Polybench | | N=4M, T=1 | int
Matrix Multiplication | | | 2K x 2K | double
Matrix Transpose | | | 2K x 2K | double
VecAdd | | | 4M | double
Explored ML Algorithms
- Support Vector Machines (LIBSVM)
- Decision Trees (Weka 3)
- Logistic Regression (Weka 3)
- Multilayer Perceptron (Weka 3)
- k-Nearest Neighbors (Weka 3)
- Decision Stumps (Weka 3)
- Naïve Bayes (Weka 3)
(A minimal Weka evaluation sketch is shown below.)
18
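As one example of how the Weka-based algorithms can be evaluated, the sketch below runs 5-fold cross-validation with Weka 3's J48 decision tree. The ARFF file name is a placeholder, and this is not the authors' actual experiment harness.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidateJ48 {
        public static void main(String[] args) throws Exception {
            // "training.arff" is a placeholder for the extracted feature-vector samples.
            Instances data = DataSource.read("training.arff");
            data.setClassIndex(data.numAttributes() - 1);      // last attribute: CPU or GPU label

            J48 tree = new J48();                              // Weka's C4.5-style decision tree
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 5, new Random(1));

            System.out.printf("5-fold CV accuracy: %.2f%%%n", eval.pctCorrect());
        }
    }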
Top 3 ML Algorithms
Full Set of Features (10 Features)
19
Accuracy (%), higher is better:
Algorithm | 5-fold CV | Original training data | Unknown testing data
LIBSVM | 98.28 | 99.66 | 98.28
J48 Tree | 98.28 | 98.97 | 98.68
Logistic Regression | 97.60 | 98.97 | 92.68
Further Explorations and Analyses by Feature Subsetting
Research Questions
- Are the 10 features the best combination?
  - Accuracy
  - Runtime prediction overheads
- Which features are more important?
20
Feature Subsetting for further analyses: 10 Algorithms x 1024 Subsets
21
1. Loop Range (Parallel Loop Size)
2. # of Memory Accesses
3. # of Arithmetic Operations
4. # of Math Methods
5. # of Branch Instructions
6. # of Other Types of Instructions
7. # of Coalesced Memory Accesses
8. # of Offset Memory Accesses
9. # of Stride Memory Accesses
10. # of Other Types of Array Accesses
[Diagram: each of the 1024 feature subsets is evaluated with each of the explored ML algorithms (SVM, Decision Trees, Logistic Regression, Multilayer Perceptron, kNN, Decision Stumps, Naïve Bayes), yielding one accuracy per (algorithm, subset) pair.]
(A subset-enumeration sketch follows.)
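A minimal sketch of how the 1024 (2^10) feature subsets can be enumerated with a bitmask; evaluate() is a placeholder for training and cross-validating one model on the selected features.

    import java.util.ArrayList;
    import java.util.List;

    public class FeatureSubsets {
        static final String[] FEATURES = {
            "range", "memory", "arithmetic", "math", "branch",
            "other-instructions", "coalesced", "offset", "stride", "other-accesses"
        };

        public static void main(String[] args) {
            // 2^10 = 1024 subsets, including the empty set and the full set.
            for (int mask = 0; mask < (1 << FEATURES.length); mask++) {
                List<String> subset = new ArrayList<>();
                for (int bit = 0; bit < FEATURES.length; bit++) {
                    if ((mask & (1 << bit)) != 0) subset.add(FEATURES[bit]);
                }
                double accuracy = evaluate(subset);   // placeholder: train + cross-validate one model
                System.out.printf("%s -> %.2f%%%n", subset, accuracy);
            }
        }

        // Placeholder; the study repeated this step for each of the explored ML algorithms.
        static double evaluate(List<String> subset) { return 0.0; }
    }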
The impact of Subsetting
(Fullset vs. the best subset)
22
Accuracy (%), higher is better:
Algorithm | Best subset of features | Full set of features
LIBSVM | 99.656 | 98.282
Logistic Regression | 98.625 | 97.595
J48 Tree | 98.282 | 98.282
Feature subsetting can yield better accuracies
The impact of Subsetting
(Fullset vs. the best subset)
23
Accuracy (%), higher is better:
Algorithm (best subset) | 5-fold CV | Original training data | Unknown testing data
LIBSVM (3 features) | 99.66 | 99.66 | 99.66
Logistic Regression (4 features) | 98.63 | 98.97 | 82.93
J48 Tree (2 features) | 98.28 | 98.28 | 92.68
Subsets with only a few features show great accuracies
An example of Subsetting analyses
24
- With LIBSVM, 99 subsets achieved the highest accuracy
Feature | # of models with this feature | % of models with this feature
range | 99 | 100.0%
stride | 96 | 97.0%
arithmetic | 65 | 65.7%
Other 2 | 56 | 56.6%
memory | 56 | 56.6%
offset | 55 | 55.6%
branch | 54 | 54.5%
math | 46 | 46.5%
Other 1 | 43 | 43.4%
Runtime Prediction Overheads
Prediction overhead | LIBSVM | Logistic Regression | J48 Decision Trees
Full set (10 features) | 2.278 usec | 0.158 usec | 0.020 usec
Best subset | 2.107 usec (3 features) | 0.106 usec (4 features) | 0.020 usec (2 features)
25
- Subsetting can reduce prediction overheads
- However, the prediction overhead is very small compared to the kernel execution time
  - Based on our analysis, kernels usually take a few milliseconds
Lessons Learned
- ML-based CPU/GPU selection is a promising way to choose the faster device
- LIBSVM, Logistic Regression, and the J48 Decision Tree are the machine learning techniques that produce the models with the best accuracies
- Range, coalesced accesses, other types of array accesses, and arithmetic instructions are particularly important features
26
Lessons Learned (Cont’d)
- While LIBSVM (non-linear classification) shows excellent prediction accuracy, its runtime prediction overhead is relatively large compared to the other algorithms
  - However, those overheads are negligible in general
- The J48 Decision Tree shows accuracy comparable to LIBSVM; in addition, its output is more human-readable and easier to fine-tune (see the sketch below)
27
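To illustrate why a decision-tree model is easy to read and fine-tune by hand, here is a hypothetical selector written in the shape a J48 tree might take. The feature names follow the slides, but the thresholds are invented for illustration; this is not the trained model.

    // Hypothetical, hand-editable decision-tree-style selector (NOT the trained J48 model).
    final class TreeSelector {
        enum Device { CPU, GPU }

        static Device select(long range, long strideAccesses, long coalescedAccesses) {
            if (range <= 32_768) {
                return Device.CPU;        // small parallel loops: the CPU thread pool wins
            }
            if (strideAccesses > coalescedAccesses) {
                return Device.CPU;        // stride-heavy access patterns hurt GPU bandwidth
            }
            return Device.GPU;            // large, coalesced-friendly loops
        }
    }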
Conclusions
Conclusions
- ML-based CPU/GPU selection is a promising way to choose the faster device
Future Work
- Evaluate end-to-end performance improvements
- Explore the possibility of applying our technique to OpenMP, OpenACC, and OpenCL
- Perform experiments on recent versions of GPUs
28
Further Reading
- Code Generation and Optimizations
  - "Compiling and Optimizing Java 8 Programs for GPU Execution." Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents, Vivek Sarkar. 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), October 2015.
- Performance Heuristics (SVM-based)
  - "Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection." Akihiro Hayashi, Kazuaki Ishizaki, Gita Koblents, Vivek Sarkar. 12th International Conference on the Principles and Practice of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ), September 2015.
29
Editor's Notes
1. Thanks for the introduction. Hi, everyone. My name is Akihiro Hayashi, and I'm a research scientist at Rice University. Today I'll be talking about GPU code generation from Java programs and machine-learning-based performance heuristics for runtime CPU/GPU selection. This work was done in collaboration with Kazuaki Ishizaki at IBM Research - Tokyo, Gita Koblents at IBM Canada, and Vivek Sarkar at Rice.
2. Okay, let me first talk about the motivation. The Java language has long been regarded as highly productive but not high-performance. What you see here is the performance of 4K x 4K matrix multiplication in different languages. Java is 11x faster than the Python implementation, but still very far from the state-of-the-art library implementations. One open question is whether there is any way to accelerate Java programs while maintaining their high-productivity features.
3. One piece of good news is that the latest version of Java introduced the parallel stream API, which offers more opportunity for parallel programming than previous editions supported. These APIs allow programmers to write explicitly parallel programs with lambda expressions. Here is an example of the Java 8 parallel streams API. You first generate the range of a parallel loop with the range method; in this case you generate a stream that ranges from 0 to N-1. You then specify that the computation can be done in parallel, and you specify the actual computation with a lambda expression.
4. Such high-level parallel programming with Java offers opportunities for preserving portability and for enabling the compiler to perform parallel-aware optimizations and code generation. Since we now have uniform, high-level expressions for parallel programming, compilers can generate parallel code for various kinds of hardware, including multi-core CPUs, many-core GPUs, and FPGAs.
5. Now we have a standard Java API for explicit parallelism, and we believe that generating GPU code is one of the promising ways to accelerate Java programs. However, there are three challenges we have to cope with. Challenge 1 is how to support Java features such as exception semantics and virtual method calls on GPUs. Challenge 2 is how to compile and optimize Java programs for GPU execution. Challenge 3 is performance heuristics: since GPUs are not always faster than CPUs, we have to properly select the faster device from CPUs and GPUs. If you are going to a grocery store, you don't want to take a rocket, right? We definitely need some criteria for choosing the faster device.
6. Okay, we have been working on the IBM Java 8 compiler, which can generate GPU code from Java 8 programs. Like conventional JIT compilers, a method is first interpreted on the JVM and then JIT-compiled once its invocation count exceeds a certain threshold. We have implemented a native code generator for many-core GPUs in addition to multi-core CPUs.
7. Here is the compilation flow in detail. Our JIT compiler first takes Java bytecode and generates our intermediate representation (IR). The compiler applies several existing optimizations, such as PRE. When the compiler identifies parallel streams in the IR, it applies several kinds of GPU-aware optimizations. After that, the compiler generates NVVM IR, which is an extended version of LLVM IR, and the generated NVVM IR is finally translated into a GPU binary.
8. Let's talk about the motivation for device selection. We can now generate parallel code for both CPUs and GPUs. However, one challenging problem is how to select the faster device for a given parallel loop, since the answer depends on the characteristics of the program and its data sets.
9. Here is our approach. We build a prediction model that chooses the faster device, GPU or CPU, using machine learning. To do so, we run various applications with different data sets for training. The JIT compiler extracts features of the program that may affect performance during JIT compilation, and the generated training data is eventually used to construct a binary prediction model.
10. Here we have the list of features we are using so far. As I mentioned earlier, the parallel loop size and the number of instructions have a strong relationship with execution time, which is why we chose these features. For the number of instructions, we distinguish several types of instructions so that we can characterize the behavior of applications, for example, whether an application is memory-bound or compute-bound.
11. Additionally, the number of array accesses is an important factor affecting GPU performance. It is well known that coalesced (aligned) access achieves better memory bandwidth than misaligned access. Data transfer size would also be an important factor affecting GPU performance.
12. Here is an example of a feature vector, expressed in JSON format.