Tim Costa, NVIDIA HPC SW Product Manager
Jeff Larkin, Sr. DevTech Software Engineer
UK OpenMP Users Group, December 2020
BEST PRACTICES FOR OPENMP ON GPUS
THE FUTURE OF PARALLEL PROGRAMMING
Standard Languages | Directives | Specialized Languages
Maximize Performance with Specialized Languages & Intrinsics
Drive Base Languages to Better Support Parallelism
Augment Base Languages with Directives
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
!$omp target data map(x,y)
...
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
...
!$omp end target data
attributes(global) subroutine saxpy(n, a, x, y)
  integer, value :: n
  real, value :: a
  real :: x(*), y(*)
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy
program main
  real, allocatable :: x(:), y(:)
  real, allocatable, device :: d_x(:), d_y(:)
  ! ... allocate and initialize x, y, d_x, d_y ...
  d_x = x
  d_y = y
  call saxpy<<<(N+255)/256,256>>>(...)
  y = d_y
end program
std::for_each_n(POL, idx(0), n,
[&](Index_t i){
y[i] += a*x[i];
});
THE ROLE OF DIRECTIVES FOR PARALLEL PROGRAMMING
[Diagram: Serial Programming Languages | Parallel Programming Languages]
Directives convey additional information to the compiler.
AVAILABLE NOW: THE NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, and in the Cloud
Develop for the NVIDIA HPC Platform: GPU, CPU and Interconnect
HPC Libraries | GPU Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
Compilers: nvcc, nvc, nvc++, nvfortran
Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
Core Libraries: libcu++, Thrust, CUB
Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
Communication Libraries: Open MPI, NVSHMEM, NCCL
Development and Analysis Tools: Nsight Systems and Nsight Compute profilers, cuda-gdb debugger (host and device)
HPC COMPILERS
NVC | NVC++ | NVFORTRAN
Programmable: Standard Languages, Directives, CUDA
Multicore: Directives, Vectorization
Multi-Platform: x86_64, Arm, OpenPOWER
Accelerated: Latest GPUs, Automatic Acceleration
BEST PRACTICES FOR OPENMP ON GPUS
Always use the teams and distribute directives to expose all available parallelism
Aggressively collapse loops to increase available parallelism
Use the target data directive and map clauses to reduce data movement between CPU and GPU
Use accelerated libraries whenever possible
Use OpenMP tasks to go asynchronous and better utilize the whole system
Use host fallback (if clause) to generate host and device code
Expose More Parallelism
Many codes already have parallel worksharing loops (parallel for / parallel do); isn't that enough?
Parallel for creates a single contention group of threads with a shared view of memory and the ability to coordinate and synchronize.
This structure limits the degree of parallelism a GPU can exploit and fails to take advantage of many GPU strengths.
Parallel For Isn't Enough
#pragma omp parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
[Diagram: OMP PARALLEL, OMP FOR, Thread Team]
Expose More Parallelism
The teams distribute directives expose coarse-grained, scalable parallelism across many teams.
Parallel for is still used to engage the threads within each team.
Coarse-grained and fine-grained parallelism may be combined (A) or split (B).
How do I know which will be better for my code and machine? There is no universal answer; profile both forms (one way to experiment further is sketched after example B below).
Use Teams Distribute
#pragma omp target teams distribute parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
A
#pragma omp target teams distribute reduction(max:error)
for( int j = 1; j < n-1; j++) {
  #pragma omp parallel for reduction(max:error)
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
B
[Diagram: OMP TEAMS, OMP DISTRIBUTE, OMP FOR]
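One way to experiment, sketched below, is to set explicit team and thread counts with the standard num_teams and thread_limit clauses and compare timings against the compiler's defaults; the constants shown are placeholders, not recommendations, and the best values are machine-specific.
/* Hedged experiment: bound teams and threads-per-team explicitly, then time it
   against the default mapping. 1024 and 128 are placeholder values to vary. */
#pragma omp target teams distribute parallel for num_teams(1024) thread_limit(128) reduction(max:error)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}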
Expose Even More Parallelism!
Collapsing loops increases the parallelism available for the compiler to exploit.
This code can likely perform as well as a plain parallel for on the CPU while still exposing more parallelism on a GPU.
Collapsing can erase possible gains from locality (the tile construct may help; see the sketch after the example below).
Not all loops can be collapsed.
Use the Collapse Clause Aggressively
#pragma omp target teams distribute parallel for reduction(max:error) collapse(2)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
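If collapsing hurts locality, one option to explore is the OpenMP 5.1 tile construct, which blocks the loop nest before the generated loops are distributed. The sketch below is an assumption about how it composes, not something from the slides, and requires a compiler with tile support; the tile sizes are placeholders.
/* Hedged OpenMP 5.1 sketch: tile the nest, then distribute/collapse the
   generated tile loops. 32x32 is a placeholder; tune per machine. */
#pragma omp target teams distribute parallel for collapse(2) reduction(max:error)
#pragma omp tile sizes(32,32)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}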
Optimize Data Motion
The more loops you move to the device, the more data movement you may introduce.
Compilers will always choose correctness over performance, so the programmer can often do better.
The target data directive and map clauses let the programmer take control of data movement. An unstructured alternative is sketched after the example below.
On unified-memory machines, relying on locality may be good enough, but optimizing the data movement explicitly is more portable.
Note: I've seen cases where registering (pinning) host arrays with CUDA further improved the remaining data movement.
Use Target Data Mapping
#pragma omp target data map(to:Anew) map(tofrom:A)
while ( error > tol && iter < iter_max )
{
… // Use A and Anew in target directives
}
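A related option (a minimal sketch below, not from the slides) is unstructured data mapping with target enter data and target exit data, which helps when the data's lifetime cannot be expressed as one structured block, for example allocation and deallocation in different routines.
/* Minimal sketch of unstructured mapping; assumes A and Anew are pointers to
   n*m contiguous elements allocated elsewhere. */
#pragma omp target enter data map(to:A[0:n*m]) map(alloc:Anew[0:n*m])

/* ... target regions that use A and Anew, possibly in other functions ... */

#pragma omp target exit data map(from:A[0:n*m]) map(delete:Anew[0:n*m])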
Accelerated Libraries & OpenMP
Accelerated libraries are free performance.
The use_device_ptr clause enables sharing data from OpenMP to CUDA.
OpenMP 5.1 adds the interop construct to enable sharing CUDA streams (a sketch follows the example below).
Eat your free lunch
#pragma omp target data map(alloc:x[0:n],y[0:n])
{
  #pragma omp target teams distribute parallel for
  for( int i = 0; i < n; i++)
  {
    x[i] = 1.0f;
    y[i] = 0.0f;
  }
  #pragma omp target data use_device_ptr(x,y)
  {
    cublasSaxpy(n, 2.0f, x, 1, y, 1);
    // Synchronize before using the results
  }
  #pragma omp target update from(y[0:n])
}
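The interop construct is still new and compiler support varies. The sketch below is an assumption about how it might be used, not slide content: it assumes a CUDA foreign runtime, the cuBLAS v2 API, and that x and y are already device pointers (for example obtained inside a use_device_ptr region).
/* Hedged OpenMP 5.1 sketch (support varies by compiler): get the CUDA stream
   OpenMP is synchronizing with and hand it to cuBLAS, so the library call is
   ordered with OpenMP's device work. Assumes <cublas_v2.h>, <cuda_runtime.h>,
   and <omp.h>. */
void saxpy_on_omp_stream(int n, float alpha, float *x, float *y)
{
    omp_interop_t obj = omp_interop_none;
    #pragma omp interop init(targetsync: obj)

    int rc;
    cudaStream_t stream = (cudaStream_t)omp_get_interop_ptr(obj, omp_ipr_targetsync, &rc);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, stream);             /* cuBLAS enqueues on OpenMP's stream */
    cublasSaxpy(handle, n, &alpha, x, 1, y, 1);  /* v2 API passes alpha by address */

    cublasDestroy(handle);
    #pragma omp interop destroy(obj)
}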
Tasking & GPUs
Data copies, GPU compute, and CPU compute can all overlap (a small sketch of overlapping CPU work follows the example below).
OpenMP uses its existing tasking framework to manage asynchronous execution and dependencies.
Your mileage may vary regarding how well the runtime maps OpenMP tasks to CUDA streams.
Get it working synchronously first!
Keep the whole system busy
#pragma omp target data map(alloc:a[0:N],b[0:N])
{
  #pragma omp target update to(a[0:N]) nowait depend(inout:a)
  #pragma omp target update to(b[0:N]) nowait depend(inout:b)
  #pragma omp target teams distribute parallel for nowait depend(inout:a)
  for(int i=0; i<N; i++ )
  {
    a[i] = 2.0f * a[i];
  }
  #pragma omp target teams distribute parallel for nowait depend(inout:a,b)
  for(int i=0; i<N; i++ )
  {
    b[i] = 2.0f * a[i];
  }
  #pragma omp target update from(b[0:N]) nowait depend(inout:b)
  #pragma omp taskwait depend(inout:b)
}
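Because of the nowait clauses, the host thread returns immediately from each construct above, so independent CPU work can overlap the device tasks. The sketch below is an assumption, not from the slides; do_independent_cpu_work() is a hypothetical placeholder for host work that touches neither a nor b.
/* Sketch only: host work overlapping an asynchronous device task. */
#pragma omp target teams distribute parallel for nowait depend(inout:a)
for(int i=0; i<N; i++ ) { a[i] = 2.0f * a[i]; }

do_independent_cpu_work();   /* runs on the CPU while the device task executes */

#pragma omp taskwait         /* wait for all outstanding tasks before using a */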
Host Fallback
The if clause can be used to selectively turn off OpenMP offloading.
Unless the compiler knows the value of the if clause at compile time, both code paths are generated.
For simple loops written to follow the previous guidelines, it may be possible to have unified CPU and GPU code (with no guarantee it will be optimal for both). A sketch of setting USE_GPU at run time follows the example below.
The "if" clause
#pragma omp target teams distribute parallel for reduction(max:error) collapse(2) if(target:USE_GPU)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
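USE_GPU is just an integer the program controls; the sketch below (an assumption, not from the slides) reads a hypothetical USE_GPU environment variable so a single binary can choose host or device execution at run time.
/* Hedged sketch: returns nonzero to offload, zero for host fallback.
   Requires <stdlib.h>; "USE_GPU" is a hypothetical variable name. */
int use_gpu_from_env(void)
{
    const char *env = getenv("USE_GPU");
    return (env != NULL) && (atoi(env) != 0);
}
The result can then be passed as USE_GPU in the if clause shown above.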
The Loop Directive
The loop directive was added in OpenMP 5.0 as a descriptive option for programming.
loop asserts that the iterations of the loop may run in any order, including concurrently.
As more compilers and applications adopt it, we hope it will enable more performance portability.
Give descriptiveness a try
#pragma omp target teams loop reduction(max:error) collapse(2)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
BEST PRACTICES FOR OPENMP ON GPUS
Always use the teams and distribute directives to expose all available parallelism
Aggressively collapse loops to increase available parallelism
Use the target data directive and map clauses to reduce data movement between CPU and GPU
Use accelerated libraries whenever possible
Use OpenMP tasks to go asynchronous and better utilize the whole system
Use host fallback (if clause) to generate host and device code
Bonus: Give loop a try