Machine-Learning-based Performance Heuristics for
Runtime CPU/GPU Selection in Java
Akihiro Hayashi (Rice University)
Kazuaki Ishizaki (IBM Research - Tokyo)
Gita Koblents (IBM Canada)
Vivek Sarkar (Rice University)
10th Workshop on Challenges for Parallel Computing @ IBM CASCON2015
1
Java For HPC?
Dual Socket Intel Xeon E5-2666 v3, 18cores, 2.9GHz, 60-GiB DRAM
2
4K x 4K Matrix Multiply        GFLOPS    Absolute speedup
Python                          0.005           1
Java                            0.058          11
C                               0.253          47
Parallel loops                  1.969         366
Parallel divide and conquer    36.168       6,724
+ vectorization               124.945      23,230
+ AVX intrinsics              335.217      62,323
Strassen                      361.681      67,243
Credits : Charles Leiserson, Ken Kennedy Award Lecture @ Rice University, Karthik Murthy @ PACT2015
Any ways to accelerate Java programs?
Java 8 Parallel Streams APIs
Explicit Parallelism with lambda
expressions
IntStream.range(0, N)
.parallel()
.forEach(i ->
<lambda>);
3
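For concreteness, here is a minimal, self-contained sketch of this pattern (the VecAdd class name and the array sizes are illustrative, not taken from the slides):

import java.util.stream.IntStream;

public class VecAdd {
    public static void main(String[] args) {
        final int N = 1 << 20;
        final double[] a = new double[N], b = new double[N], c = new double[N];
        // Explicit parallelism: the lambda body may run on different worker threads,
        // or, with a GPU-enabled JIT compiler like the one described here, as a GPU kernel.
        IntStream.range(0, N)
                 .parallel()
                 .forEach(i -> c[i] = a[i] + b[i]);
    }
}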
Explicit Parallelism with Java
 High-level parallel programming with Java
offers opportunities for
 preserving portability
Low-level parallel/accelerator programming is not required
 enabling compiler to perform parallel-aware
optimizations and code generation
[Diagram: Java 8 programs (SW) written against the Java 8 Parallel Stream API can be mapped onto multi-core CPUs, many-core GPUs, and FPGAs (HW).]
4
Challenges for
GPU Code Generation
5
IntStream.range(0, N).parallel().forEach(i -> <lambda>);
Challenge 1: Supporting Java Features on GPUs (Exception Semantics, Virtual Method Calls)
Challenge 2: Accelerating Java Programs on GPUs (Kernel Optimizations, DT Optimizations)
Challenge 3: Performance Heuristics (Selection of the faster device: CPUs vs. GPUs)
(The IntStream line above is a standard Java API call for parallelism.)
Credits : Checklist by Phil Laver from the Noun Project, Rocket by Luis Prado from the Noun Project, Rocket by Maxwell Painter Karpylev from the Noun Project, Running by Dillon Arloff from the Noun Project
JIT Compilation for GPU
IBM Java 8 Compiler
 Built on top of the production version of the
IBM Java 8 runtime environment
6
[Diagram: a method is interpreted on the JVM for its first N invocations; at the (N+1)th invocation the JIT compiler generates native code for either multi-core CPUs or many-core GPUs.]
The JIT Compilation Flow
7
[Diagram: Java bytecode is translated to our IR and parallel streams are identified, then the code takes two paths. Our JIT compiler applies GPU-specific optimizations and emits NVVM IR, which NVIDIA's tool chain (libNVVM) compiles into a GPU binary used by the GPU runtime; the existing optimizations and target machine code generation produce the CPU binary.]
Java Features on GPUs
 Exceptions
 Explicit exception checking on GPUs
 ArrayIndexOutOfBoundsException
 NullPointerException
 ArithmeticException (Division by zero only)
 Further Optimizations
 Loop Versioning for redundant exception checking elimination
on GPUs based on [Artigas+, ICS’00]
 Virtual Method Calls
 Direct De-virtualization or Guarded De-virtualization [Ishizaki+, OOPSLA’00] (a sketch of the guarded form follows this slide)
8
Challenge 1 : Supporting Java Features on GPUs
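As an illustration only, a hand-written sketch of guarded de-virtualization (the Shape and Circle classes are hypothetical; the actual transformation is performed by the JIT compiler on its IR):

public class Devirt {
    static abstract class Shape { abstract double area(); }
    static final class Circle extends Shape {
        final double r;
        Circle(double r) { this.r = r; }
        double area() { return Math.PI * r * r; }
    }
    // Speculate on the receiver's exact class so the call can be inlined
    // (e.g. into a GPU kernel); fall back to ordinary virtual dispatch otherwise.
    static double guardedArea(Shape s) {
        if (s.getClass() == Circle.class) {
            return ((Circle) s).area();  // direct, inlinable call
        }
        return s.area();                 // virtual dispatch on the fallback path
    }
}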
Loop Versioning Example
9
IntStream.range(0, N)
  .parallel()
  .forEach(i -> {
    Array[i] = i / N;  // may throw ArrayIndexOutOfBoundsException,
                       // NullPointerException, or ArithmeticException
  });

// Speculative checking on CPU
if (Array != null            // guards NullPointerException
    && 0 <= Array.length
    && N <= Array.length     // guards ArrayIndexOutOfBoundsException
    && N != 0) {             // guards ArithmeticException (division by zero)
  // Safe GPU execution (no exception checks needed in the kernel)
} else {
  // Fall back to the version with exception checks
}
Challenge 1 : Supporting Java Features on GPUs
Optimizing GPU Execution
Kernel optimization
 Optimizing alignment of Java arrays on GPUs
 Utilizing “Read-Only Cache” in recent GPUs
Data transfer optimizations
 Redundant data transfer elimination
 Partial transfer if an array subscript is affine
10
Challenge 2 : Accelerating Java programs on
GPUs
The Alignment Optimization Example
11
[Diagram: memory addresses 0, 128, 256, 384 (128-byte boundaries). With the naive alignment strategy the object header occupies the start of the first line, so a[0]-a[31] straddles two lines and reading a[0:31] takes two memory transactions. With our alignment strategy the array body starts on a 128-byte boundary, so a[0]-a[31], a[32]-a[63], and a[64]-a[95] each occupy one line and reading a[0:31] takes one memory transaction.]
Challenge 2 : Accelerating Java programs on
GPUs
Performance Evaluations
12
[Chart: speedup relative to SEQUENTIAL Java (log scale, higher is better) on 11 benchmarks. 160 worker threads (Fork/Join): 40.6, 37.4, 82.0, 64.2, 27.6, 1.4, 1.0, 4.4, 36.7, 7.4, 5.7. GPU: 42.7, 34.6, 58.1, 844.7, 772.3, 1.0, 0.1, 1.9, 1164.8, 9.0, 1.2.]
CPU: IBM POWER8 @ 3.69GHz, GPU: NVIDIA Tesla K40m
Challenge 2 : Accelerating Java programs on
GPUs
Runtime CPU/GPU Selection
13
[Diagram: after the Nth interpreted invocation, native code is generated for the (N+1)th invocation, targeting either multi-core CPUs or many-core GPUs. Open question: which one is faster?]
Challenge 3 : Performance Heuristics
Related Work : Linear Regression
 Regression based cost estimation is specific to an application
14
[Plots: kernel execution time (msec) vs. the dynamic number of IR instructions, with regression lines for the NVIDIA Tesla K40 GPU and the IBM POWER8 CPU. App 1) BlackScholes and App 2) Vector Addition yield very different fits and CPU/GPU crossover points.]
[1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ’09)
[2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3)
Challenge 3 : Performance Heuristics
Is an Accurate Cost Model Required?
Constructing an accurate analytical cost model would take too much effort, and considerable effort would also be needed to update such performance models for future generations of hardware.
15
[Diagram: multi-core CPUs vs. many-core GPUs: which one is faster?]
Machine-learning-based performance heuristics
Challenge 3 : Performance Heuristics
ML-based Performance Heuristics
 A binary prediction model is constructed by supervised
machine learning with support vector machines
(SVMs)
16
[Diagram: during training runs with the JIT compiler, the Java runtime extracts a feature vector from each (application bytecode, data set) pair, e.g. App A with data 1, App A with data 2, App B with data 3, and labels each sample with the faster device (CPU or GPU); offline, LIBSVM builds the prediction model from these labeled feature vectors, and the JIT compiler then consults the model. A LIBSVM-based sketch follows this slide.]
Challenge 3 : Performance Heuristics
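As a sketch of the offline model construction step, the following uses LIBSVM's Java bindings to train a binary CPU/GPU classifier; the parameter values and the layout of the feature vectors are illustrative assumptions, not the settings used in this work:

import libsvm.*;  // LIBSVM Java bindings: svm, svm_problem, svm_node, svm_parameter, svm_model

public class OfflineModelConstruction {
    // featureVectors: one row per training sample (values are assumed to come from the
    // training runs described on this slide); labels: +1 = CPU faster, -1 = GPU faster.
    static svm_model build(double[][] featureVectors, double[] labels) {
        svm_problem prob = new svm_problem();
        prob.l = featureVectors.length;
        prob.y = labels;
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) {
            prob.x[i] = new svm_node[featureVectors[i].length];
            for (int j = 0; j < featureVectors[i].length; j++) {
                prob.x[i][j] = new svm_node();
                prob.x[i][j].index = j + 1;            // LIBSVM feature indices are 1-based
                prob.x[i][j].value = featureVectors[i][j];
            }
        }
        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;          // binary classification
        param.kernel_type = svm_parameter.RBF;
        param.C = 1; param.gamma = 0.5;                // illustrative hyperparameters
        param.cache_size = 100; param.eps = 1e-3;
        return svm.svm_train(prob, param);
    }
}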
Features of program
Loop Range (Parallel Loop Size)
The dynamic number of Instructions in IR
 Memory Access
 Arithmetic Operations
 Math Methods
 Branch Instructions
 Other Instructions
17
Challenge 3 : Performance Heuristics
Features of program (Cont’d)
The dynamic number of Array Accesses
 Coalesced Access (a[i])
 Offset Access (a[i+c])
 Stride Access (a[c*i])
 Other Access (a[b[i]])
Data Transfer Size
 H2D Transfer Size
 D2H Transfer Size
18
[Chart: effective bandwidth (GB/s) as a function of c, the offset or stride size, for Offset (array[i+c]) and Stride (array[c*i]) access patterns. A Java sketch of these access categories follows this slide.]
Challenge 3 : Performance Heuristics
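To make the array-access categories concrete, here is a small, self-contained sketch (array names and sizes are illustrative):

import java.util.stream.IntStream;

public class AccessPatterns {
    public static void main(String[] args) {
        final int N = 1024;
        double[] a = new double[8 * N + 4];   // large enough for every pattern below
        int[]    b = new int[N];              // index array for the "other" pattern
        double[] c0 = new double[N], c1 = new double[N], c2 = new double[N], c3 = new double[N];
        IntStream.range(0, N).parallel().forEach(i -> {
            c0[i] = a[i];         // Coalesced access:  a[i]
            c1[i] = a[i + 4];     // Offset access:     a[i+c]
            c2[i] = a[8 * i];     // Stride access:     a[c*i]
            c3[i] = a[b[i]];      // Other access:      a[b[i]]
        });
    }
}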
An example of feature vector
19
Challenge 3 : Performance Heuristics
"features" : {
"range": 256,
"ILs" : {
"Memory": 9, "Arithmetic": 7, "Math": 0,
"Branch": 1, "Other": 1 },
"Array Accesses" : {
"Coalesced": 3, "Offset": 0, "Stride": 0, "Random": 0},
"H2D Transfer" : [2064, 2064, 2064, …, 0],
"D2H Transfer" : [2048, 2048, 2048, …, 0]}
IntStream.range(0, N)
.parallel()
.forEach(i -> { a[i] = b[i] + c[i];});
Prediction Model Construction
 Obtained 291 samples by running 11 applications with
different data sets
 Choice is either GPU or 160 worker threads on CPU
20
[Diagram: the same workflow as slide 16: feature extraction for each (application, data set) pair during training runs with the JIT compiler, followed by offline model construction with LIBSVM.]
Challenge 3 : Performance Heuristics
Applications
21
Application             Source     Field                 Max Size          Data Type
BlackScholes                       Finance               4,194,304         double
Crypt                   JGF        Cryptography          Size C (N=50M)    byte
SpMM                    JGF        Numerical Computing   Size C (N=500K)   double
MRIQ                    Parboil    Medical               Large (64^3)      float
Gemm                    Polybench  Numerical Computing   2K x 2K           int
Gesummv                 Polybench                        2K x 2K           int
Doitgen                 Polybench                        256x256x256       int
Jacobi-1D               Polybench                        N=4M, T=1         int
Matrix Multiplication                                    2K x 2K           double
Matrix Transpose                                         2K x 2K           double
VecAdd                                                   4M                double
Challenge 3 : Performance Heuristics
Platform
CPU
 IBM POWER8 @ 3.69GHz
20 cores
8 SMT threads per core = up to 160 threads
256 GB of RAM
GPU
 NVIDIA Tesla K40m
12GB of Global Memory
22
Challenge 3 : Performance Heuristics
Performance and Prediction Results
23
Prediction: x ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ (one misprediction out of eleven benchmarks)
[Chart: the same speedup data as slide 12: speedup relative to SEQUENTIAL Java (log scale, higher is better), 160 worker threads (Fork/Join) vs. GPU.]
Challenge 3 : Performance Heuristics
How to evaluate prediction models:
5-fold cross validation
 Overfitting problem:
 Prediction model may be tailored to the eleven applications
if training data = testing data
 To avoid the overfitting problem:
 Calculate the accuracy of the prediction model using 5-fold cross
validation
24
[Diagram: the 291 training samples are split into Subsets 1-5. In one round, Subset 1 is held out for testing and a prediction model is trained on Subsets 2-5 (accuracy X%); in the next round, Subset 2 is held out and the model is trained on Subsets 1 and 3-5 (accuracy Y%); and so on for all five folds. A sketch of the fold partitioning follows this slide.]
Challenge 3 : Performance Heuristics
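A minimal sketch of the 5-fold procedure; the Trainer and Model interfaces are hypothetical stand-ins for the LIBSVM calls, since only the fold partitioning is the point here:

import java.util.*;

public class CrossValidation {
    interface Model   { boolean predictCpuFaster(double[] features); }
    interface Trainer { Model train(List<double[]> features, List<Boolean> labels); }

    static double fiveFoldAccuracy(List<double[]> xs, List<Boolean> ys, Trainer trainer) {
        int n = xs.size(), correct = 0;
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(42));        // fixed seed for reproducibility
        for (int fold = 0; fold < 5; fold++) {
            List<double[]> trainX = new ArrayList<>(); List<Boolean> trainY = new ArrayList<>();
            List<double[]> testX  = new ArrayList<>(); List<Boolean> testY  = new ArrayList<>();
            for (int k = 0; k < n; k++) {
                int i = idx.get(k);
                if (k % 5 == fold) { testX.add(xs.get(i)); testY.add(ys.get(i)); }   // held out
                else               { trainX.add(xs.get(i)); trainY.add(ys.get(i)); } // training
            }
            Model m = trainer.train(trainX, trainY);
            for (int k = 0; k < testX.size(); k++)
                if (m.predictCpuFaster(testX.get(k)) == testY.get(k)) correct++;
        }
        return 100.0 * correct / n;   // accuracy (%) over all held-out predictions
    }
}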
Accuracies with cross-validation:
25
Accuracy (%) with 5-fold cross-validation, total number of samples = 291 (higher is better):
  Range                                        79.0%
  +=nIRs (does not distinguish insn types)     97.6%
  +=dIRs (distinguishes insn types)            99.0%
  +=Array                                      99.0%
  ALL (+=DT, data transfer size)               97.2%
Challenge 3 : Performance Heuristics
Discussion
 Based on results with 291 samples,
(range, # of detailed Insns, # of array accesses)
shows the best accuracy
 DT does not contribute to improving the accuracy, since there are only a few samples where DT affects the decision
 Pros and Cons
 (+) Future generations of hardware can be supported easily by re-running the applications
 (+) Adding new training data and rebuilding the prediction model is all that is required
 (-) Collecting training data takes time
26
Challenge 3 : Performance Heuristics
Related Work: Java + GPU
27
                Lang   JIT  GPU Kernel            Device Selection
JCUDA           Java   -    CUDA                  GPU only
Lime            Lime   ✔    Override map/reduce   Static
Firepile        Scala  ✔    reduce                Static
JaBEE           Java   ✔    Override run          GPU only
Aparapi         Java   ✔    map                   Static
Hadoop-CL       Java   ✔    Override map/reduce   Static
RootBeer        Java   ✔    Override run          Not Described
HJ-OpenCL       HJ     -    forall / lambda       Static
PPPJ09 (auto)   Java   ✔    For-loop              Dynamic with Regression
Our Work        Java   ✔    Parallel Stream       Dynamic with Machine Learning
None of these approaches supports Java 8 Parallel Stream APIs together with dynamic device selection based on machine learning
Summary
28
IntStream.range(0, N).parallel().forEach(i -> <lambda>);
Challenge 1: Supporting Java Features on GPUs (Exception Semantics, Virtual Method Calls) → Exceptions & Virtual Calls supported
Challenge 2: Accelerating Java Programs on GPUs (Kernel Optimizations, DT Optimizations) → Over 1,000x Speedup
Challenge 3: Performance Heuristics (Selection of the faster device: CPUs vs. GPUs) → Up to 99% accuracy
(The IntStream line above is a standard Java API call for parallelism.)
Credits : Checklist by Phil Laver from the Noun Project, Rocket by Luis Prado from the Noun Project
Rocket by Maxwell Painter Karpylev from the Noun Project, Running by Dillon Arloff from the Noun Project
Acknowledgements
Special thanks to
 IBM CAS
 Marcel Mitran (IBM Canada)
 Jimmy Kwa (IBM Canada)
 Habanero Group at Rice
 Yoichi Matsuyama (CMU)
29
Further Readings
 Code Generation and Optimizations
 “Compiling and Optimizing Java 8 Programs for GPU
Execution.” Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents,
Vivek Sarkar. 24th International Conference on Parallel
Architectures and Compilation Techniques (PACT), October
2015.
 Performance Heuristics
 “Machine-Learning-based Performance Heuristics for
Runtime CPU/GPU Selection.” Akihiro Hayashi, Kazuaki
Ishizaki, Gita Koblents, Vivek Sarkar. 12th International
Conference on the Principles and Practice of Programming on
the Java Platform: virtual machines, languages, and tools
(PPPJ), September 2015.
30
Backup Slides
31
Performance Breakdown
32
[Chart: execution time normalized to ALL (lower is better), broken down into H2D, D2H, and Kernel time, for the configurations BASE, +=LV, +=DT, +=ALIGN, and ALL(+=ROC) on BlackScholes, Crypt, SpMM, Series, MRIQ, MM, Gemm, and Gesummv.]
How to evaluate prediction models:
TRUE/FALSE POSITIVE/NEGATIVE
 4 types of binary prediction results
(let POSITIVE be "CPU is faster" and NEGATIVE be "GPU is faster")
 TRUE POSITIVE
 Correctly predicted that CPU is faster
 TRUE NEGATIVE
 Correctly predicted that GPU is faster
 FALSE POSITIVE
 Predicted CPU is faster, but GPU is actually faster
 FALSE NEGATIVE
 Predicted GPU is faster, but CPU is actually faster
33
How to evaluate prediction models:
Accuracy, Precision, and Recall metrics
 Accuracy
 the percentage of selections predicted correctly:
 (TP + TN) / (TP + TN + FP + FN)
 Precision X (X = CPU or GPU)
 Precision CPU : TP / (TP + FP)
 # of samples correctly predicted CPU is faster
/ total # of samples predicted that CPU is faster
 Precision GPU : TN / (TN + FN)
 Recall X (X = CPU or GPU)
 Recall CPU : TP / (TP + FN)
 # of samples correctly predicted CPU is faster
/ total # of samples labeled that CPU is actually faster
 Recall GPU : TN / (TN + FP)
34
Precision X: how precise the model is when it predicts that X is faster.
Recall X: how often the model picks X when X is actually faster.
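The same formulas, written out as a small helper (a direct transcription of the definitions above):

public class PredictionMetrics {
    // POSITIVE = "CPU is faster", NEGATIVE = "GPU is faster", as on the previous slide.
    static double accuracy(int tp, int tn, int fp, int fn) { return 100.0 * (tp + tn) / (tp + tn + fp + fn); }
    static double precisionCpu(int tp, int fp) { return 100.0 * tp / (tp + fp); }
    static double precisionGpu(int tn, int fn) { return 100.0 * tn / (tn + fn); }
    static double recallCpu(int tp, int fn)    { return 100.0 * tp / (tp + fn); }
    static double recallGpu(int tn, int fp)    { return 100.0 * tn / (tn + fp); }
}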
Precisions and Recalls with cross-validation
35
          Precision CPU   Recall CPU   Precision GPU   Recall GPU
Range     79.0%           100%         0%              0%
+=nIRs    97.8%           99.1%        96.5%           91.8%
+=dIRs    98.7%           100%         100%            95.0%
+=Arrays  98.7%           100%         100%            95.0%
ALL       96.7%           100%         100%            86.9%
Higher is better
All prediction models except Range rarely make a bad decision
