GPUIterator:
Bridging the Gap between Chapel
and GPU platforms
Akihiro Hayashi (Rice),
Sri Raj Paul (Georgia Tech),
Vivek Sarkar (Georgia Tech)
GPUs are a common source of
performance improvement in HPC
Accelerator/Co-Processor count in the Top500, 2005–2018:
[bar chart omitted]
Source: https://www.top500.org/statistics/list/
GPU Programming in Chapel
 Chapel’s multi-resolution concept
 Start with writing “forall” loops
(on CPU, proof-of-concept)
 Apply automatic GPU code generators [1][2] when/where
possible
 Consider writing GPU kernels using CUDA/OpenCL or
other accelerator language, and invoke them from
Chapel
(Focus of this paper)
forall i in 1..n {
  …
}
[1] Albert Sidelnik et al. Performance Portability with the Chapel Language (IPDPS ’12).
[2] Michael L. Chu et al. GPGPU support in Chapel with the Radeon Open Compute Platform (CHIUW’17).
(approaches listed from high-level to low-level)
Motivation:
Vector Copy (Original)
var A: [1..n] real(32);
var B: [1..n] real(32);
// Vector Copy
forall i in 1..n {
  A(i) = B(i);
}
Motivation:
Vector Copy (GPU)
 Invoking CUDA/OpenCL code using the C
interoperability feature
extern proc GPUVC(A: [] real(32),
                  B: [] real(32),
                  lo: int, hi: int);
var A: [1..n] real(32);
var B: [1..n] real(32);
// Invoking CUDA/OpenCL program
GPUVC(A, B, 1, n);
// separate C file
void GPUVC(float *A,
           float *B,
           int start,
           int end) {
  // CUDA/OpenCL Code
}
Motivation:
The code is not very portable
// Original
forall i in 1..n {
  A(i) = B(i);
}
// GPU Version
GPUVC(A, B, 1, n);
 Potential "portability" problems
 How to switch back and forth between the original version and the GPU version?
 How to support hybrid execution?
 How to support distributed arrays?
Research Question:
What is an appropriate and portable programming interface that bridges the "forall" and GPU versions?
Our Solution: GPUIterator
 Contributions:
 Design and implementation of the GPUIterator
 Performance evaluation of different CPU+GPU execution
strategies
// Original Version
forall i in 1..n {
  A(i) = B(i);
}
// GPU Version
GPUVC(A, B, 1, n);
// GPUIterator (in-between)
var G = lambda (lo: int, hi: int,
                nElems: int) {
  GPUVC(A, B, lo, hi);
};
var CPUPercent = 50;
forall i in GPU(1..n, G, CPUPercent) {
  A(i) = B(i);
}
Chapel’s iterator
Chapel's iterators let us control the scheduling of loops in a productive manner
https://chapel-lang.org/docs/master/primers/parIters.html
// Iterate over Fibonacci numbers
forall i in fib(10) {
  A(i) = B(i);
}
[Diagram: the yielded values 0 1 1 2 3 5 8 13 21 34 are divided between CPU1 and CPU2]
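For reference, a serial iterator like `fib` can be defined in a few lines of Chapel (a minimal sketch; the parallel leader/follower variants described in the primer linked above are what a `forall` actually invokes):

```chapel
// Minimal serial iterator yielding the first n Fibonacci numbers
iter fib(n: int): int {
  var (cur, nxt) = (0, 1);
  for 1..n {
    yield cur;
    (cur, nxt) = (nxt, cur + nxt);
  }
}
```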
The GPUIterator automates work
distribution across CPUs+GPUs
forall i in GPU(1..n, GPUWrapper,
                CPUPercent) {
  A(i) = B(i);
}
[Diagram: a plain `forall i in 1..n` runs the whole range 1..n on CPUs (CPU1..CPUm); with GPU(), the first CPUPercent% of 1..n runs on CPUs (CPU1..CPUm) and the remaining GPUPercent = 100 - CPUPercent percent runs on GPUs (GPU1..GPUk)]
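Conceptually, the CPU/GPU split can be sketched as follows (hypothetical code for illustration only; the actual implementation is shown later in this talk):

```chapel
// Hypothetical sketch: split 1..n according to CPUPercent
config const CPUPercent = 50;
const nElems = n;
const nCPU = (nElems * CPUPercent) / 100;
const CPURange = 1 .. nCPU;      // iterated by CPU tasks
const GPURange = nCPU+1 .. n;    // handed to the GPU callback as (lo, hi)
```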
How to use the GPUIterator?
var GPUCallBack = lambda (lo: int,
                          hi: int,
                          nElems: int) {
  assert(hi - lo + 1 == nElems);
  GPUVC(A, B, lo, hi);
};
forall i in GPU(1..n, GPUCallBack,
                CPUPercent) {
  A(i) = B(i);
}
This callback function is called after the GPUIterator has computed the subspace (lo/hi: lower/upper bound, nElems: # of elements)
GPU() internally divides the
original iteration space for
CPUs and GPUs
Tip: declaring CPUPercent as a command-line override ("config const") helps us explore different CPU+GPU executions
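For example (the executable name `vc` below is hypothetical):

```chapel
// In the Chapel source:
config const CPUPercent = 50;  // default; overridable without recompiling
```

Running `./vc --CPUPercent=25` then executes 25% of the iteration space on the CPUs and the remaining 75% on the GPU.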
The GPUIterator supports
Distributed Arrays
var D: domain(1) dmapped Block(boundingBox={1..n}) = {1..n};
var A: [D] real(32);
var B: [D] real(32);
var GPUCallBack = lambda (lo: int, hi: int, nElems: int) {
  GPUVC(A.localSlice(lo..hi),
        B.localSlice(lo..hi),
        0, hi-lo, nElems);
};
forall i in GPU(D, GPUCallBack,
                CPUPercent) {
  A(i) = B(i);
}
The GPUIterator supports
Zippered-forall
 Restriction
 The GPUIterator must be the leader iterator
forall (_, a, b) in zip(GPU(1..n, ...), A, B) {
  a = b;
}
Bradford L. Chamberlain et al. “User-Defined Parallel
Zippered Iterators in Chapel.” (PGAS2011)
Implementation of the GPUIterator
Internal modules
 https://github.com/ahayashi/chapel
 Created the GPU Locale model
CHPL_LOCALE_MODEL=gpu
External modules
 https://github.com/ahayashi/chapel-gpu
 Fully implemented in Chapel
[Diagram: each locale (Locale0, Locale1, …) contains two sublocales: sublocale0 with the CPUs (CPU1..CPUm) and sublocale1 with the GPUs (GPU1..GPUk)]
Implementation of the GPUIterator
coforall subloc in 0..1 {
  if (subloc == 0) {
    const numTasks = here.getChild(0).maxTaskPar;
    coforall tid in 0..#numTasks {
      const myIters = computeChunk(…);
      for i in myIters do
        yield i;
    }
  } else if (subloc == 1) {
    GPUCallBack(…);
  }
}
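The `computeChunk(…)` call above is elided on the slide; a typical block-partitioning helper might look like the following (a sketch assuming a simple block distribution, not the actual implementation):

```chapel
// Hypothetical block partitioner: divide range r into nTasks chunks
// and return the chunk for task tid (earlier tasks absorb the remainder)
proc computeChunk(r: range, tid: int, nTasks: int): range {
  const len = r.size;
  const base = len / nTasks;
  const extra = len % nTasks;
  const lo = r.low + tid*base + min(tid, extra);
  const hi = lo + base - 1 + (if tid < extra then 1 else 0);
  return lo..hi;
}
```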
Writing CUDA/OpenCL Code for
the GPUIterator
GPU programs for the GPUIterator should
include typical host and device operations
[Diagram: Chapel arrays reside in the host's DRAM; the GPU program must perform device array allocation, host-to-device (H2D) and device-to-host (D2H) transfers over PCI-Express, and kernel execution]
Performance Evaluations
 Platforms
 Intel Xeon CPU (12 cores) + NVIDIA Tesla M2050 GPU
 IBM POWER8 CPU (24 cores) + NVIDIA Tesla K80 GPU
 Intel Core i7 CPU (6 cores) + Intel UHD Graphics 630/AMD
Radeon Pro 560X
 Intel Core i5 CPU (4 cores) + NVIDIA TITAN Xp
 Chapel Compilers & Options
 Chapel Compiler 1.20.0-pre (as of March 27) with the --fast option
 GPU Compilers
 CUDA: NVCC 7.0.27 (M2050), 8.0.61 (K80) with the -O3 option
 OpenCL: Apple LLVM 10.0.0 with the -O3 option
Performance Evaluations (Cont’d)
 Tasking
 CUDA: CHPL_TASKS=qthreads
 OpenCL: CHPL_TASKS=fifo
 Applications (https://github.com/ahayashi/chapel-gpu)
 Vector Copy
 Stream
 BlackScholes
 Logistic Regression
 Matrix Multiplication
How many lines are added/modified?
Source code changes are minimal
Benchmark              LOC added/modified (Chapel)   CUDA LOC (NVIDIA GPUs)   OpenCL LOC (Intel/AMD GPUs)
Vector Copy            6                             53                       256
Stream                 6                             56                       280
BlackScholes           6                             131                      352
Logistic Regression    11                            97                       472
Matrix Multiplication  6                             213                      290
The LOC for CUDA/OpenCL are not the focus of this work
How fast are GPUs?
(Single-node, POWER8 + K80)
 The iterator enables exploring different CPU+GPU strategies with very low overheads
 The GPU is up to 145x faster than the CPU, but is slower than the CPU in some cases due to data transfer costs
[Chart: speedup over the original forall (log scale, 0.01–1000; higher is better) for vector copy, stream, blackscholes, logistic regression, and matrix multiplication, comparing hybrid configurations (C100%+G0%, C75%+G25%, C50%+G50%, C25%+G75%, C0%+G100%) with GPU-only CUDA]
How fast are GPUs?
(Single-node, Xeon + M2050)
[Chart: speedup over the original forall (log scale, 0.1–1000; higher is better) for vector copy, stream, blackscholes, logistic regression, and matrix multiplication, comparing hybrid configurations (C100%+G0% … C0%+G100%) with GPU-only CUDA]
 The iterator enables exploring different CPU+GPU strategies with very low overheads
 The GPU is up to 126x faster than the CPU, but is slower than the CPU in some cases due to data transfer costs
How fast are GPUs compared to Chapel’s BLAS
module on CPUs?
(Single-node, Core i5 + Titan Xp)
 Motivation: to verify how fast the GPU variants are compared to a highly-tuned Chapel CPU variant
 Result: the GPU variants are mostly faster than OpenBLAS's gemm (on 4 CPU cores)
[Chart: Matrix Multiplication speedup over the original forall (log scale; higher is better) for matrix sizes 512x512 to 8192x8192, comparing CPU (BLAS.gemm), GPUIterator (Naïve CUDA), GPUIterator (Opt. CUDA), and GPUIterator (cuBLAS)]
When is hybrid execution beneficial? (Single
node, Core i7+UHD)
[Chart: Blackscholes speedup over the original forall (higher is better) for hybrid configurations C100%+G0% through C0%+G100%]
 With tightly-coupled GPUs, hybrid execution is more beneficial
Multi-node performance numbers
(Xeon + M2050)
 The original forall shows good scalability
 The GPU variants give further performance improvements
[Chart: BlackScholes speedup over the original forall (higher is better) on 1, 2, and 4 nodes, comparing the CPU-only forall with hybrid configurations C100%+G0% through C0%+G100%]
Conclusions & Future Work
 Summary
 The GPUIterator provides an appropriate interface between
Chapel and accelerator programs
 Source code is available:
– https://github.com/ahayashi/chapel-gpu
 The use of GPUs can significantly improve the performance of Chapel programs
 Future Work
 Support reduction
 Further performance evaluations on multi-node CPU+GPU
systems
 Automatic selection of the best "CPUPercent"
Backup Slides
GPU is not always faster
[Chart: speedup relative to sequential Java (log scale; higher is better), comparing 160 worker threads (fork/join) with the GPU across benchmarks; CPU: IBM POWER8 @ 3.69GHz, GPU: NVIDIA Tesla K40m]
The GPUIterator supports
Distributed Arrays (Cont’d)
[Diagram: the Chapel array A (indices 1..n) is partitioned across Locale 0 and Locale 1, each with a CPU portion and a GPU portion; each locale passes A.localSlice(lo..hi) to the CUDA/OpenCL side, which sees a 0-based array A[0] to A[hi-lo]]
No additional modifications are needed to support multi-locale execution
Note: localSlice is Chapel's array API
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 

Recently uploaded (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

GPUIterator: Bridging the Gap between Chapel and GPU Platforms

  • 1. GPUIterator: Bridging the Gap between Chapel and GPU Platforms. Akihiro Hayashi (Rice), Sri Raj Paul (Georgia Tech), Vivek Sarkar (Georgia Tech)
  • 2. GPUs are a common source of performance improvement in HPC. [Chart: count of accelerators/co-processors in Top500 systems, 2005–2018, showing steady growth.] Source: https://www.top500.org/statistics/list/
  • 3. GPU Programming in Chapel. Chapel’s multi-resolution concept: start with writing “forall” loops (on CPU, as a proof of concept); apply automatic GPU code generators [1][2] when/where possible; consider writing GPU kernels in CUDA/OpenCL or another accelerator language and invoking them from Chapel (the focus of this paper).
        forall i in 1..n {
          …
        }
    [1] Albert Sidelnik et al. Performance Portability with the Chapel Language (IPDPS ’12). [2] Michael L. Chu et al. GPGPU support in Chapel with the Radeon Open Compute Platform (CHIUW ’17).
  • 4. Motivation: Vector Copy (Original).
        var A: [1..n] real(32);
        var B: [1..n] real(32);
        // Vector Copy
        forall i in 1..n {
          A(i) = B(i);
        }
  • 5. Motivation: Vector Copy (GPU). Invoking CUDA/OpenCL code using the C interoperability feature:
        extern proc GPUVC(A: [] real(32), B: [] real(32), lo: int, hi: int);
        var A: [1..n] real(32);
        var B: [1..n] real(32);
        // Invoking the CUDA/OpenCL program
        GPUVC(A, B, 1, n);

        // separate C file
        void GPUVC(float *A, float *B, int start, int end) {
          // CUDA/OpenCL code
        }
  • 6. Motivation: The code is not very portable.
        // Original
        forall i in 1..n {
          A(i) = B(i);
        }
        // GPU version
        GPUVC(A, B, 1, n);
    Potential “portability” problems: How do we switch back and forth between the original version and the GPU version? How do we support hybrid execution? How do we support distributed arrays? Research question: What is an appropriate and portable programming interface that bridges the “forall” and GPU versions?
  • 7. Our Solution: GPUIterator. Contributions: design and implementation of the GPUIterator; performance evaluation of different CPU+GPU execution strategies.
        // Original version
        forall i in 1..n {
          A(i) = B(i);
        }
        // GPU version
        GPUVC(A, B, 1, n);

        // GPUIterator (in between)
        var G = lambda (lo: int, hi: int, nElems: int) {
          GPUVC(A, B, lo, hi);
        };
        var CPUPercent = 50;
        forall i in GPU(1..n, G, CPUPercent) {
          A(i) = B(i);
        }
  • 8. Chapel’s iterator. Chapel’s iterators allow us to control the scheduling of loops in a productive manner. https://chapel-lang.org/docs/master/primers/parIters.html
        // Iterate over Fibonacci numbers
        forall i in fib(10) {
          A(i) = B(i);
        }
    [Diagram: the Fibonacci values 0 1 1 2 3 5 8 13 21 34 distributed across CPU1 and CPU2.]
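    A serial version of the fib iterator used in the example above might be sketched as follows (this is an illustrative sketch, not the paper’s code; the parIters primer linked above shows the leader/follower overloads needed for a parallel forall):

    ```chapel
    // Minimal serial iterator yielding the first n Fibonacci numbers.
    iter fib(n: int): int {
      var cur = 0, next = 1;
      for 1..n {
        yield cur;
        // advance the pair: (F_k, F_k+1) -> (F_k+1, F_k+2)
        (cur, next) = (next, cur + next);
      }
    }
    ```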
  • 9. The GPUIterator automates work distribution across CPUs+GPUs.
        forall i in GPU(1..n, GPUWrapper, CPUPercent) {
          A(i) = B(i);
        }
    [Diagram: the iteration space 1..n is split into a CPU portion spread over CPU1..CPUm and a GPU portion spread over GPU1..GPUk; CPUPercent of the space goes to the CPUs and GPUPercent = 100 - CPUPercent goes to the GPUs. A plain “forall i in 1..n” runs the whole space on CPU1..CPUm.]
  • 10. How to use the GPUIterator?
        var GPUCallBack = lambda (lo: int, hi: int, nElems: int) {
          assert(hi - lo + 1 == nElems);
          GPUVC(A, B, lo, hi);
        };
        forall i in GPU(1..n, GPUCallBack, CPUPercent) {
          A(i) = B(i);
        }
    This callback function is called after the GPUIterator has computed the GPU subspace (lo/hi: lower/upper bound, nElems: number of elements). GPU() internally divides the original iteration space between CPUs and GPUs. Tip: declaring CPUPercent as a command-line override (“config const”) helps us explore different CPU+GPU executions.
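    The tip about making CPUPercent a command-line override might look like this in practice (a sketch; the flag name mirrors the variable):

    ```chapel
    // "config const" makes the value overridable at program launch
    // without recompiling, e.g.:  ./myprog --CPUPercent=25
    config const CPUPercent = 50;
    ```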
  • 11. The GPUIterator supports distributed arrays.
        var D: domain(1) dmapped Block(boundingBox={1..n}) = {1..n};
        var A: [D] real(32);
        var B: [D] real(32);
        var GPUCallBack = lambda (lo: int, hi: int, nElems: int) {
          GPUVC(A.localSlice(lo..hi), B.localSlice(lo..hi), 0, hi - lo, nElems);
        };
        forall i in GPU(D, GPUCallBack, CPUPercent) {
          A(i) = B(i);
        }
  • 12. The GPUIterator supports zippered forall. Restriction: the GPUIterator must be the leader iterator.
        forall (_, a, b) in zip(GPU(1..n, ...), A, B) {
          a = b;
        }
    Bradford L. Chamberlain et al. “User-Defined Parallel Zippered Iterators in Chapel.” (PGAS 2011)
  • 13. Implementation of the GPUIterator Internal modules  https://github.com/ahayashi/chapel  Created the GPU Locale model CHPL_LOCALE_MODEL=gpu External modules  https://github.com/ahayashi/chapel-gpu  Fully implemented in Chapel 13 Locale0 sublocale0 sublocale1 CPU1 CPUm GPU1 GPUk Locale1
  • 14. Implementation of the GPUIterator.
        coforall subloc in 0..1 {
          if (subloc == 0) {
            const numTasks = here.getChild(0).maxTaskPar;
            coforall tid in 0..#numTasks {
              const myIters = computeChunk(…);
              for i in myIters do yield i;
            }
          } else if (subloc == 1) {
            GPUCallBack(…);
          }
        }
  • 15. Writing CUDA/OpenCL code for the GPUIterator. GPU programs for the GPUIterator should include typical host and device operations: Chapel arrays reside on the host’s DRAM, so the callback is responsible for device array allocations, host-to-device (H2D) transfers over PCI-Express, kernel execution, and device-to-host (D2H) transfers.
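    For the vector-copy example, the host-side GPUVC could be sketched as below. This is an assumed implementation, not the paper’s actual code: the kernel name, launch configuration, and omission of error checking are illustrative.

    ```cuda
    #include <cuda_runtime.h>

    // Illustrative device kernel: copies B into A over the assigned subrange.
    __global__ void vcKernel(float *dA, const float *dB, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) dA[i] = dB[i];
    }

    // Host entry point invoked from Chapel via the C interoperability feature.
    // start/end are the bounds computed by the GPUIterator (1-based in Chapel).
    extern "C" void GPUVC(float *A, float *B, int start, int end) {
      int n = end - start + 1;
      size_t bytes = n * sizeof(float);
      float *dA, *dB;

      // Device array allocations
      cudaMalloc(&dA, bytes);
      cudaMalloc(&dB, bytes);

      // H2D transfer of the GPU portion only
      cudaMemcpy(dB, B + (start - 1), bytes, cudaMemcpyHostToDevice);

      // Kernel execution
      int threads = 256;
      int blocks = (n + threads - 1) / threads;
      vcKernel<<<blocks, threads>>>(dA, dB, n);

      // D2H transfer back into the corresponding host subrange
      cudaMemcpy(A + (start - 1), dA, bytes, cudaMemcpyDeviceToHost);

      cudaFree(dA);
      cudaFree(dB);
    }
    ```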
  • 16. Performance Evaluations. Platforms: Intel Xeon CPU (12 cores) + NVIDIA Tesla M2050 GPU; IBM POWER8 CPU (24 cores) + NVIDIA Tesla K80 GPU; Intel Core i7 CPU (6 cores) + Intel UHD Graphics 630/AMD Radeon Pro 560X; Intel Core i5 CPU (4 cores) + NVIDIA TITAN Xp. Chapel compiler & options: Chapel 1.20.0-pre (as of March 27) with the --fast option. GPU compilers: CUDA: NVCC 7.0.27 (M2050) and 8.0.61 (K80) with the -O3 option; OpenCL: Apple LLVM 10.0.0 with the -O3 option.
  • 17. Performance Evaluations (Cont’d). Tasking: CUDA: CHPL_TASKS=qthreads; OpenCL: CHPL_TASKS=fifo. Applications (https://github.com/ahayashi/chapel-gpu): Vector Copy, Stream, BlackScholes, Logistic Regression, Matrix Multiplication.
  • 18. How many lines are added/modified? Source code changes are minimal. (The CUDA/OpenCL LOC are out of focus.)
        Benchmark             | Chapel LOC added/modified | CUDA LOC (NVIDIA GPUs) | OpenCL LOC (Intel/AMD GPUs)
        Vector Copy           | 6  | 53  | 256
        Stream                | 6  | 56  | 280
        BlackScholes          | 6  | 131 | 352
        Logistic Regression   | 11 | 97  | 472
        Matrix Multiplication | 6  | 213 | 290
  • 19. How fast are GPUs? (Single node, POWER8 + K80.) The iterator enables exploring different CPU+GPU strategies with very low overhead. The GPU is up to 145x faster than the CPU, but in some cases is slower than the CPU due to data-transfer costs. [Chart: speedup over the original forall (log scale, higher is better) for vector copy, stream, blackscholes, logistic regression, and matrix multiplication, comparing hybrid ratios C100%+G0% through C0%+G100% and a GPU-only CUDA version.]
  • 20. How fast are GPUs? (Single node, Xeon + M2050.) The iterator enables exploring different CPU+GPU strategies with very low overhead. The GPU is up to 126x faster than the CPU, but in some cases is slower than the CPU due to data-transfer costs. [Chart: speedup over the original forall (log scale, higher is better) for the same five benchmarks and CPU/GPU ratios as on the previous slide.]
  • 21. How fast are GPUs compared to Chapel’s BLAS module on CPUs? (Single node, Core i5 + TITAN Xp.) Motivation: to verify how fast the GPU variants are compared to a highly tuned Chapel-CPU variant. Result: the GPU variants are mostly faster than OpenBLAS’s gemm (4-core CPU). [Chart: matrix-multiplication speedup over the original forall (log scale, higher is better) for matrix sizes 512x512 through 8192x8192, comparing CPU (BLAS.gemm), GPUIterator (naïve CUDA), GPUIterator (optimized CUDA), and GPUIterator (cuBLAS).]
  • 22. When is hybrid execution beneficial? (Single node, Core i7 + UHD.) With tightly coupled GPUs, hybrid execution is more beneficial. [Chart: BlackScholes speedup over the original forall (higher is better) for C100%+G0% through C0%+G100%.]
  • 23. Multi-node performance numbers (Xeon + M2050). The original forall shows good scalability, and the GPU variants give further performance improvements. [Chart: BlackScholes speedup over the original forall (higher is better) on 1, 2, and 4 nodes, comparing the CPU-only forall with hybrid ratios C100%+G0% through C0%+G100%.]
  • 24. Conclusions & Future Work. Summary: the GPUIterator provides an appropriate interface between Chapel and accelerator programs (source code is available at https://github.com/ahayashi/chapel-gpu), and the use of GPUs can significantly improve the performance of Chapel programs. Future work: support reductions; further performance evaluations on multi-node CPU+GPU systems; automatic selection of the best “CPUPercent”.
  • 26. GPU is not always faster. [Chart: speedup relative to sequential Java (log scale, higher is better) for 160 worker threads (fork/join) versus GPU across a set of benchmarks; GPU speedups range from 0.1x to 1164.8x, so the GPU wins on some benchmarks and loses on others.] CPU: IBM POWER8 @ 3.69 GHz, GPU: NVIDIA Tesla K40m.
  • 27. The GPUIterator supports distributed arrays (Cont’d). No additional modifications are needed to support multi-locale execution. Note: localSlice is Chapel’s array API. [Diagram: a block-distributed Chapel array A over 1..n is split per locale into CPU and GPU portions; on each locale, A.localSlice(lo..hi) is passed to CUDA/OpenCL as a local array indexed A[0] through A[hi-lo].]

Editor's Notes

  1. Okay, let me first talk about the motivation. GPUs are getting popular these days. What you see here is the number of accelerators in the Top500 list, and it clearly shows that the number of GPUs is increasing. Actually [CLEARSPEED], now [NVIDIA]