The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW 2019), co-located with PLDI 2019 / ACM FCRC 2019.
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches to mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a large performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial on specific platforms. Our preliminary performance evaluations show that the GPUIterator is a promising approach for Chapel programmers to easily utilize a single CPU+GPU node or multiple such nodes while maintaining portability.
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
1. GPUIterator: Bridging the Gap between Chapel and GPU Platforms
Akihiro Hayashi (Rice),
Sri Raj Paul (Georgia Tech),
Vivek Sarkar (Georgia Tech)
2. GPUs are a common source of performance improvement in HPC
[Chart: count of systems with accelerators/co-processors in the Top500 list, 2005-2018]
Source: https://www.top500.org/statistics/list/
3. GPU Programming in Chapel
Chapel’s multi-resolution concept
Start with writing "forall" loops (on CPU, proof-of-concept)
Apply automatic GPU code generators [1][2] when/where possible
Consider writing GPU kernels using CUDA/OpenCL or another accelerator language, and invoke them from Chapel (focus of this paper)
forall i in 1..n {
…
}
[1] Albert Sidelnik et al. Performance Portability with the Chapel Language (IPDPS ’12).
[2] Michael L. Chu et al. GPGPU support in Chapel with the Radeon Open Compute Platform (CHIUW’17).
(The approaches above are ordered from high-level to low-level.)
5. Motivation: Vector Copy (GPU)
Invoking CUDA/OpenCL code using the C interoperability feature
extern proc GPUVC(A: [] real(32),
                  B: [] real(32),
                  lo: int, hi: int);
var A: [1..n] real(32);
var B: [1..n] real(32);
// Invoking the CUDA/OpenCL program
GPUVC(A, B, 1, n);
// separate C file
void GPUVC(float *A,
           float *B,
           int start,
           int end) {
  // CUDA/OpenCL code
}
6. Motivation: The code is not very portable
// Original
forall i in 1..n {
  A(i) = B(i);
}
// GPU Version
GPUVC(A, B, 1, n);
Potential "portability" problems
How to switch back and forth between the original version and the GPU version?
How to support hybrid execution?
How to support distributed arrays?
Research Question: What is an appropriate and portable programming interface that bridges the "forall" and GPU versions?
7. Our Solution: GPUIterator
Contributions:
Design and implementation of the GPUIterator
Performance evaluation of different CPU+GPU execution strategies
// Original Version
forall i in 1..n {
  A(i) = B(i);
}
// GPU Version
GPUVC(A, B, 1, n);
// GPUIterator (in between)
var G = lambda (lo: int, hi: int,
                nElems: int) {
  GPUVC(A, B, lo, hi);
};
var CPUPercent = 50;
forall i in GPU(1..n, G, CPUPercent) {
  A(i) = B(i);
}
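Putting the pieces together, a minimal vector-copy program might look like the sketch below. This is a hedged illustration: it assumes the external module is pulled in with "use GPUIterator;" (as in the chapel-gpu repository) and reuses GPUVC, A, B, and n from the slides above.

use GPUIterator;                   // external module (github.com/ahayashi/chapel-gpu)

extern proc GPUVC(A: [] real(32), B: [] real(32), lo: int, hi: int);

config const n = 1024;             // problem size, overridable at launch
config const CPUPercent = 50;      // CPU/GPU split, overridable at launch
var A, B: [1..n] real(32);

// Callback that hands the GPU portion to the hand-written CUDA/OpenCL code
var G = lambda (lo: int, hi: int, nElems: int) {
  GPUVC(A, B, lo, hi);
};

forall i in GPU(1..n, G, CPUPercent) {
  A(i) = B(i);                     // CPU portion runs the original loop body
}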
8. Chapel’s iterator
Chapel's iterators allow us to control the scheduling of loops in a productive manner
https://chapel-lang.org/docs/master/primers/parIters.html
// Iterator over Fibonacci numbers
forall i in fib(10) {
A(i) = B(i);
}
[Figure: the values yielded by fib(10), 0 1 1 2 3 5 8 13 21 34, are distributed across CPU1 and CPU2]
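For reference, a serial user-defined iterator looks like the sketch below (a simplified illustration based on the parIters primer linked above; to drive a forall as on this slide, the iterator also needs leader/follower overloads).

iter fib(n: int): int {
  // Yield the first n Fibonacci numbers
  var (cur, next) = (0, 1);
  for i in 1..n {
    yield cur;
    (cur, next) = (next, cur + next);
  }
}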
9. The GPUIterator automates work distribution across CPUs+GPUs
forall i in GPU(1..n, GPUWrapper,
CPUPercent) {
A(i) = B(i);
}
forall i in 1..n {
  A(i) = B(i);
}
[Figure: the plain forall assigns the entire 1..n iteration space (the CPU portion) to CPU1..CPUm, while GPU(1..n, GPUWrapper, CPUPercent) splits it into a CPU portion (CPUPercent, run on CPU1..CPUm) and a GPU portion (GPUPercent = 100 - CPUPercent, run on GPU1..GPUk)]
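Conceptually, the division can be pictured as in the sketch below. This is a hedged illustration only; splitRange is a hypothetical helper, not the GPUIterator module's actual code.

proc splitRange(r: range, CPUPercent: int) {
  const nCPU = r.size * CPUPercent / 100;        // elements given to the CPU tasks
  const cpuPortion = r.low .. r.low + nCPU - 1;  // executed by the forall body on CPUs
  const gpuPortion = r.low + nCPU .. r.high;     // handed to the GPU callback
  return (cpuPortion, gpuPortion);
}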
10. How to use the GPUIterator?
var GPUCallBack = lambda (lo: int,
                          hi: int,
                          nElems: int) {
  assert(hi-lo+1 == nElems);
  GPUVC(A, B, lo, hi);
};
forall i in GPU(1..n, GPUCallBack,
                CPUPercent) {
  A(i) = B(i);
}
This callback function is called after the GPUIterator has computed the subspace (lo/hi: lower/upper bounds, nElems: number of elements).
GPU() internally divides the original iteration space between the CPUs and GPUs.
Tip: declaring CPUPercent as a command-line override ("config const") helps us explore different CPU+GPU executions (see the sketch below).
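For example (a small usage sketch; the --CPUPercent override is the standard mechanism for any Chapel config const):

config const CPUPercent = 50;   // default split; override at launch with --CPUPercent=<value>
forall i in GPU(1..n, GPUCallBack, CPUPercent) {
  A(i) = B(i);
}

Running the same binary with, e.g., --CPUPercent=0 or --CPUPercent=100 then gives GPU-only or CPU-only execution without recompiling.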
11. The GPUIterator supports Distributed Arrays
use BlockDist;  // required for the Block distribution
var D: domain(1) dmapped Block(boundingBox={1..n}) = {1..n};
var A: [D] real(32);
var B: [D] real(32);
var GPUCallBack = lambda (lo: int, hi: int, nElems: int) {
  GPUVC(A.localSlice(lo..hi),
        B.localSlice(lo..hi),
        0, hi-lo, nElems);
};
forall i in GPU(D, GPUCallBack,
                CPUPercent) {
  A(i) = B(i);
}
12. The GPUIterator supports Zippered-forall
Restriction: the GPUIterator must be the leader iterator
forall (_, a, b) in zip(GPU(1..n, ...), A, B) {
a = b;
}
Bradford L. Chamberlain et al. "User-Defined Parallel Zippered Iterators in Chapel." (PGAS 2011)
13. Implementation of the GPUIterator
Internal modules: https://github.com/ahayashi/chapel
Created the GPU locale model (CHPL_LOCALE_MODEL=gpu)
External modules: https://github.com/ahayashi/chapel-gpu
Fully implemented in Chapel
[Figure: under the GPU locale model, each locale (Locale 0, Locale 1, ...) contains sublocale 0 with CPU1..CPUm and sublocale 1 with GPU1..GPUk]
14. Implementation of the GPUIterator
coforall subloc in 0..1 {
  if (subloc == 0) {
    // CPU sublocale: spawn one task per core; each yields its chunk of indices
    const numTasks = here.getChild(0).maxTaskPar;
    coforall tid in 0..#numTasks {
      const myIters = computeChunk(…);
      for i in myIters do
        yield i;
    }
  } else if (subloc == 1) {
    // GPU sublocale: invoke the user-provided callback on the GPU portion
    GPUCallBack(…);
  }
}
15. Writing CUDA/OpenCL Code for the GPUIterator
GPU programs for the GPUIterator should include typical host and device operations
[Figure: Chapel arrays reside in the host's DRAM; the GPU program performs device array allocations, host-to-device (H2D) transfers over PCI-Express, kernel execution, and device-to-host (D2H) transfers back to the host]
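On the Chapel side, only the external declaration is needed; the comments below summarize what the CUDA/OpenCL side is expected to do (a sketch of the contract, reusing GPUVC from the earlier slides).

// Declared in Chapel exactly as before:
extern proc GPUVC(A: [] real(32), B: [] real(32), lo: int, hi: int);
// The C/CUDA/OpenCL implementation of GPUVC typically:
//   1. allocates device arrays for (the GPU portion of) A and B
//   2. copies the input data host-to-device (H2D) over PCI-Express
//   3. launches the kernel over the [lo, hi] subrange
//   4. copies the results device-to-host (D2H) back into the Chapel arrays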
16. Performance Evaluations
Platforms
Intel Xeon CPU (12 cores) + NVIDIA Tesla M2050 GPU
IBM POWER8 CPU (24 cores) + NVIDIA Tesla K80 GPU
Intel Core i7 CPU (6 cores) + Intel UHD Graphics 630 / AMD Radeon Pro 560X
Intel Core i5 CPU (4 cores) + NVIDIA TITAN Xp
Chapel Compilers & Options
Chapel Compiler 1.20.0-pre (as of March 27) with the --fast option
GPU Compilers
CUDA: NVCC 7.0.27 (M2050), 8.0.61 (K80) with the -O3 option
OpenCL: Apple LLVM 10.0.0 with the -O3 option
18. How many lines are added/modified?
Source code changes are minimal
Benchmark               Chapel LOC added/modified   CUDA LOC (NVIDIA GPUs)   OpenCL LOC (Intel/AMD GPUs)
Vector Copy             6                           53                       256
Stream                  6                           56                       280
BlackScholes            6                           131                      352
Logistic Regression     11                          97                       472
Matrix Multiplication   6                           213                      290
Note: the CUDA/OpenCL LOC are outside the focus of this comparison.
19. How fast are GPUs? (Single-node, POWER8 + K80)
The iterator enables exploring different CPU+GPU strategies with very low overhead.
The GPU is up to 145x faster than the CPU, but is slower than the CPU in some cases due to data transfer costs.
[Chart: speedup over the original forall (log scale, higher is better) for vector copy, stream, blackscholes, logistic regression, and matrix multiplication; each benchmark shows hybrid executions from C100%+G0% to C0%+G100% via the GPUIterator plus a GPU-only native CUDA version]
20. How fast are GPUs? (Single-node, Xeon + M2050)
[Chart: speedup over the original forall (log scale, higher is better) for vector copy, stream, blackscholes, logistic regression, and matrix multiplication; each benchmark shows hybrid executions from C100%+G0% to C0%+G100% via the GPUIterator plus a GPU-only native CUDA version]
The iterator enables exploring different CPU+GPU strategies with very low overhead.
The GPU is up to 126x faster than the CPU, but is slower than the CPU in some cases due to data transfer costs.
21. How fast are GPUs compared to Chapel's BLAS module on CPUs? (Single-node, Core i5 + Titan Xp)
Motivation: to verify how fast the GPU variants are compared to a highly tuned Chapel CPU variant
Result: the GPU variants are mostly faster than OpenBLAS's gemm (on the 4-core CPU)
[Chart: matrix multiplication speedup over the original forall (log scale, higher is better) for matrix sizes 512x512 to 8192x8192, comparing CPU (BLAS.gemm), GPUIterator (naive CUDA), GPUIterator (optimized CUDA), and GPUIterator (cuBLAS)]
22. When is hybrid execution beneficial? (Single-node, Core i7 + UHD)
[Chart: blackscholes speedup over the original forall (higher is better) for hybrid executions from C100%+G0% to C0%+G100%]
With tightly-coupled GPUs, hybrid execution is more beneficial
23. Multi-node performance numbers (Xeon + M2050)
The original forall shows good scalability.
The GPU variants give further performance improvements.
[Chart: BlackScholes speedup over the original forall (higher is better) on 1, 2, and 4 nodes, comparing the CPU-only forall with hybrid executions from C100%+G0% to C0%+G100%]
24. Conclusions & Future Work
Summary
The GPUIterator provides an appropriate interface between Chapel and accelerator programs
Source code is available: https://github.com/ahayashi/chapel-gpu
The use of GPUs can significantly improve the performance of Chapel programs
Future Work
Support reductions
Further performance evaluations on multi-node CPU+GPU systems
Automatic selection of the best CPUPercent
27. The GPUIterator supports Distributed Arrays (Cont'd)
[Figure: the Chapel array A with indices 1..n is block-distributed across Locale 0 and Locale 1, each holding a CPU portion and a GPU portion; A.localSlice(lo..hi) passes each locale's GPU portion to CUDA/OpenCL, which sees a 0-based array A[0] to A[hi-lo]]
No additional modifications are needed to support multi-locale execution
Note: localSlice is Chapel's array API
Editor's Notes
Okay, let me first talk about the motivation. GPUs are getting popular these days. What you see here is the number of accelerators in the Top500 list, and it clearly shows that the number of GPUs is increasing. At first [CLEARSPEED], now [NVIDIA]