SlideShare a Scribd company logo
1 of 21
Download to read offline
Lecture: Manycore GPU Architectures and
Programming, Part 4
-- Introducing OpenMP and HOMP for Accelerators
1
CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
https://passlab.github.io/CSCE569/
Manycore GPU Architectures and
Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsic and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
– OpenMP and OpenACC
2
HPC Systems with Accelerators
• Accelerator architectures
become popular
– GPUs and Xeon Phi
• Multiple accelerators are
common
– 2, 4, or 8
3
https://www.anandtech.com/show/12587/nvidias-dgx2-sixteen-v100-gpus-30-tb-of-nvme-only-400k
Programming on NVIDIA GPUs
1. CUDA and OpenCL
– Low-level
2. Library, e.g. cublas, cufft, cuDNN
3. OpenMP, OpenACC, and others
– Rely on compiler support
4. Application framework
– TensorFlow, etc
OpenMP 4.0 for Accelerators
4
AXPY Example with OpenMP: Multicore
• y = α·x + y
– x and y are vectors of size n
– α is scalar
• Data (x, y and a) are shared
– Parallelization is relatively easy
• Other examples
– sum: reduction
– Stencil: halo region exchange and synchronization 5
AXPY Offloading To a GPU using CUDA
6
Memory allocation on device
Memcpy from host to device
Launch parallel execution
Memcpy from device to host
Deallocation of dev memory
AXPY Example with OpenMP: single device
• y = α·x + y
– x and y are vectors of size n
– α is scalar
• target directive: annotate an offloading code region
• map clause: map data between host and device à moving data
– to|tofrom|from: mapping directions
– Use array region
7
OpenMP Computation and Data Offloading
• #pragma omp target device(id) map() if()
– target: create a data environment and offload
computation on the device
– device (int_exp): specify a target device
– map(to|from|tofrom|alloc:var_list) : data
mapping between the current data environment
and a device data environment
• #pragma target data device (id) map() if()
– Create a device data environment: to be
reused/inherited
omp target
CPU thread
omp parallel
Accelerator threads
CPU thread
8
Main
Memory
Application
data
target
Application
data
acc. cores
Copy in
remote
data
Copy out
remote data
Tasks
offloaded to
accelerator
target and map Examples
9
Accelerator: Explicit Data Mapping
• Relatively small number of
truly shared memory
accelerators so far
• Require the user to
explicitly map data to and
from the device memory
• Use array region
10
long a = 0x858;
long b = 0;
int anArray[100]
#pragma omp target data map(to:a) 
map(tofrom:b,anArray[0:64])
{
/* a, b and anArray are mapped
* to the device */
/* work on the device */
#pragma omp target …
{
…
}|
}
/* b and anArray are mapped
* back to the host */
target date Example
11
Accelerator: Hierarchical Parallelism
• Organize massive number of threads
– teams of threads, e.g. map to CUDA grid/block
• Distribute loops over teams
12
#pragma omp target
#pragma omp teams num_teams(2)
num_threads(8)
{
//-- creates a “league” of teams
//-- only local barriers permitted
#pragma omp distribute
for (int i=0; i<N; i++) {
}
}
teams and distribute Loop Example
Double-nested loops are mapped to the two levels of thread hierarchy (league
and team)
13
while ((k<=mits)&&(error>tol))
{
// a loop copying u[][] to uold[][] is omitted here
…
#pragma omp parallel for private(resid,j,i) reduction(+:error)
for (i=1;i<(n-1);i++)
for (j=1;j<(m-1);j++)
{
resid = (ax*(uold[i-1][j] + uold[i+1][j])
+ ay*(uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid*resid ;
} // the rest code omitted ...
}
#pragma omp target data device (gpu0) map(to:n, m, omega, ax, ay, b, 
f[0:n][0:m]) map(tofrom:u[0:n][0:m]) map(alloc:uold[0:n][0:m])
#pragma omp target device(gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m], 
uold[0:n][0:m]) map(tofrom:u[0:n][0:m])
Jacobi Example: The Impact of Compiler
Transformation to Performance
14
Early Experiences With The OpenMP Accelerator Model; Chunhua Liao, Yonghong Yan, Bronis R. de Supinski, Daniel J. Quinlan
and Barbara Chapman; International Workshop on OpenMP (IWOMP) 2013, September 2013
0
10
20
30
40
50
60
70
80
90
100
128x128 256x256 512x512 1024x1024 2048x2048
Matrix size (float)
Jacobi Execution Time (s)
first version
target-data
Loop collapse using linearization with static-even scheduling
Loop collapse using 2-D mapping (16x16 block)
Loop collapse using 2-D mapping (8x32 block)
Loop collapse using linearization with round-robin scheduling
15
Loop Mapping Algorithms
Map-gv-gv in GPU Topology
#pragma acc loop gang (2) vector (2)
for ( i = x1; i < X1; i++ ) {
#pragma acc loop gang (3) vector (4)
for ( j = y1; j < Y1; j++ ) {...... }
}
X.Tian et al. LCPC Workshop 2013 17 / 26
• Need to achieve coalesced memory access on GPUs
Compiler Transformation of Nested Loops for GPGPUs, Xiaonan Tian, Rengan Xu, Yonghong Yan, Sunita Chandrasekaran, and Barbara
Chapman Journal of Concurrency and Computation: Practice and Experience, August 2015
Compiling a High-level Directive-Based Programming Model for GPGPUs 11
Fig. 9: Double nested loop mapping. Fig. 10: Triple nested loop mapping.
Table 2: Threads used in each loop with double loop mappings
Loop Mapping Algorithms
Map-gv-gv in GPU Topology
#pragma acc loop gang (2) vector (2)
for ( i = x1; i < X1; i++ ) {
#pragma acc loop gang (3) vector (4)
for ( j = y1; j < Y1; j++ ) {...... }
}
X.Tian et al. LCPC Workshop 2013
Mapping Nested Loops to GPUs
16
The certification suite has been evaluated on an NVIDIA Kepler GPU and an Intel Xeon CPU with 8 cores.
Table 2 shows a comparison of the speedup of OpenACC for a variety of applications to sequential version,
OpenMP Version 3.1 (8 cores) and CUDA (4.2 &5.0).
Table 2: Speedup of OpenACC for various applications.
Applications Domains
OpenACC Directive
Combinations
Lines of Code
Added vs Serial
Speedup Over
OpenMP OpenACC Seq OpenMP CUDA
Needleman-
Wunsch
Bioinformatics
data copy, copyin
kernels present
loop gang, vector, private
6 5 2.98 1.28 0.24
Stencil
Cellular
Automation
data copyin, copy, deviceptr
kernels present
loop collapse, independent
1 3 40.55 15.87 0.92
Computational
Fluid Dyanmics
(CFD)
Fluid Mechanics
data copyin, copy, deviceptr
data present, deviceptr
kernels deviceptr
kernels loop, gang, vector, private
loop gang, vector
acc_malloc(), acc_free()
8 46 35.86 4.59 0.38
2D Heat (grid
size 4096*4096)
Heat Conduction
data copyin, copy, deviceptr
kernels present
loop collapse, independent
1 3 99.52 28.63 0.90
Clever (10Ovals) Data Mining
data copyin
kernels present, create, copyin,
copy
loop independent
10 3 4.25 1.22 0.60
FeldKemp
(FDK)
Image
Processing
kernels copyin, copyout
loop, collapse, independent
1 2 48.30 6.51 0.75
We observed that the certification suite exhibits a set of behaviors. For example, OpenACC speedup ranges
from 2.98 to 99.52 over the sequential version and from 1.22 to 28.63 over 8-core OpenMP version. The
applications have been chosen based on their computations and communication patterns.
Compiler vs Hand-Written
NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model, Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan and Barbara Chapman 27th
International Workshop on Languages and Compilers for Parallel Computing (LCPC2014)
AXPY Example with OpenMP: Multiple device
17
• Parallel region
– One CPU thread per
device
• Manually partition array
x and y
• Each thread offload
subregion of x and y
• Chunk the loop in
alignment with the data
partition
Hybrid OpenMP (HOMP) for Multiple
Accelerators
18
HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems, Yonghong Yan, Jiawen Liu,
and Kirk W. Cameron, The IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2017
Three Challenges to Implement HOMP
1. Load balance when distributing loop iterations across
computational different devices (CPU, GPU, and MIC)
– We developed 7 algorithms of loop distribution and the runtime
select algorithms based on computation/data intensity
2. Only copy the associated data to the device that are needed
for the loop chunks assigned to that device
– Runtime support for ALIGN interface to move or share data between
memory spaces
1. Select devices for computations for the optimal performance
because more devices ≠ better performance
– CUTOFF ratio to select device
19
Offloading Execution Time (ms) on 2 CPUs + 4
GPUs + 2 MICs and using CUTOFF_RATIO
20
1423.10'
412.29'
653.72'
1017.77'
767.02'
514.63'
306.26'
3508.77'
885.80'
277.19'
274.32'
1695.45'
1723.84'
272.80'
20989.67'
3664.08'
21587.16'
3809.59'
3544.63'
10393.55'
6532.80'
1422.96'
1459.78'
400.75'
1060.35'
953.50'
793.04'
555.21'
709.63'
5054.33'
1482.90'
1504.45'
432.17'
752.22'
211.11'
779.83'
261.99'
545.66'
407.61'
291.86'
100.93'
0.00' 200.00' 400.00' 600.00' 800.00' 1000.00' 1200.00' 1400.00' 1600.00' 1800.00' 2000.00'
BLOCK'
SCHED_DYNAMIC'
MODEL_1_AUTO'
MODEL_2_AUTO'
SCHED_PROFILE_AUTO'
MODEL_PROFILE_AUTO'
MODEL_PROFILE_AUTO'(15%'CUTOFF)'
SCHED_DYNAMIC'
SCHED_GUIDED'
MODEL_1_AUTO'
MODEL_2_AUTO'
SCHED_PROFILE_AUTO'
MODEL_PROFILE_AUTO'
MODEL_1_AUTO(15%'CUTOFF)'
BLOCK'
SCHED_DYNAMIC'
SCHED_GUIDED'
MODEL_1_AUTO'
MODEL_2_AUTO'
SCHED_PROFILE_AUTO'
MODEL_PROFILE_AUTO'
MODEL_2_AUTO(15%'CUTOFF)'
BLOCK'
SCHED_DYNAMIC'
MODEL_1_AUTO'
MODEL_2_AUTO'
SCHED_PROFILE_AUTO'
MODEL_PROFILE_AUTO'
SCHED_PROFILE_AUTO(15%'CUTOFF)'
BLOCK'
MODEL_1_AUTO'
MODEL_2_AUTO'
MODEL_1_AUTO(15%'CUTOFF)'
BLOCK'
SCHED_DYNAMIC'
SCHED_GUIDED'
MODEL_1_AUTO'
MODEL_2_AUTO'
SCHED_PROFILE_AUTO'
MODEL_PROFILE_AUTO'
MODEL_PROFILE_AUTO(15%'CUTOFF)'
axpyI10B'
bm2dI256'
matulI6144'
matvecI48k'
stencil2dI25
6'
sumI300M'
TOTAL%OFF%TIME(ms)%
Execu2on%Time%(ms)%on%2%CPUs%+%4%GPUs%+%2%MICs%
The algorithm that delivers the best performance without CUTOFF
1.35
1.01
2.68
0.56
3.43
2.09
3.43
The algorithm with 15% CUTOFF_RATIO that delivers the best performance
and its speedup against the best algorithm that does not use CUTOFF
Speedup From CUTOFF
• Apply 15% CUTOFF ratio to modeling and profiling
– Only those devices who may compute more than 15% of total
iterations will be used
• Thinking of 8 devices (1/8 = 12.5%)
21
Benchmarks Devices used CUTOFF Speedup
axpy-10B 2 CPU + 4 GPUs 1.35
bm2d-256 2 CPU + 4 GPUs 1.01
matul-6144 4 GPUs 2.68
matvec-48k 4 GPUs 0.56
stencil2d-256 4 GPUs 3.43
sum-300M 2 CPUs + 4 GPUs 2.09

More Related Content

Similar to lecture_GPUArchCUDA04-OpenMPHOMP.pdf

Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudRyousei Takano
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsUnai Lopez-Novoa
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilersAnastasiaStulova
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rFerdinand Jamitzky
 
BUD17-300: Journey of a packet
BUD17-300: Journey of a packetBUD17-300: Journey of a packet
BUD17-300: Journey of a packetLinaro
 

Similar to lecture_GPUArchCUDA04-OpenMPHOMP.pdf (20)

Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
NVIDIA CUDA
NVIDIA CUDANVIDIA CUDA
NVIDIA CUDA
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
 
cug2011-praveen
cug2011-praveencug2011-praveen
cug2011-praveen
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern Coprocessors
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Balancing Power & Performance Webinar
Balancing Power & Performance WebinarBalancing Power & Performance Webinar
Balancing Power & Performance Webinar
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
 
Programming Models for Heterogeneous Chips
Programming Models for  Heterogeneous ChipsProgramming Models for  Heterogeneous Chips
Programming Models for Heterogeneous Chips
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
BUD17-300: Journey of a packet
BUD17-300: Journey of a packetBUD17-300: Journey of a packet
BUD17-300: Journey of a packet
 
SNAP MACHINE LEARNING
SNAP MACHINE LEARNINGSNAP MACHINE LEARNING
SNAP MACHINE LEARNING
 

More from Tigabu Yaya

ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfTigabu Yaya
 
03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdfTigabu Yaya
 
MOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfMOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfTigabu Yaya
 
MOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfMOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfTigabu Yaya
 
GER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfGER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfTigabu Yaya
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdfTigabu Yaya
 
200402_RoseRealTime.ppt
200402_RoseRealTime.ppt200402_RoseRealTime.ppt
200402_RoseRealTime.pptTigabu Yaya
 
matrixfactorization.ppt
matrixfactorization.pptmatrixfactorization.ppt
matrixfactorization.pptTigabu Yaya
 
The Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfThe Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfTigabu Yaya
 
C_and_C++_notes.pdf
C_and_C++_notes.pdfC_and_C++_notes.pdf
C_and_C++_notes.pdfTigabu Yaya
 

More from Tigabu Yaya (20)

ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdf
 
03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf03. Data Exploration in Data Science.pdf
03. Data Exploration in Data Science.pdf
 
MOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdfMOD_Architectural_Design_Chap6_Summary.pdf
MOD_Architectural_Design_Chap6_Summary.pdf
 
MOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdfMOD_Design_Implementation_Ch7_summary.pdf
MOD_Design_Implementation_Ch7_summary.pdf
 
GER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdfGER_Project_Management_Ch22_summary.pdf
GER_Project_Management_Ch22_summary.pdf
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf
 
Regression.pptx
Regression.pptxRegression.pptx
Regression.pptx
 
lecture6.pdf
lecture6.pdflecture6.pdf
lecture6.pdf
 
lecture5.pdf
lecture5.pdflecture5.pdf
lecture5.pdf
 
lecture4.pdf
lecture4.pdflecture4.pdf
lecture4.pdf
 
lecture3.pdf
lecture3.pdflecture3.pdf
lecture3.pdf
 
lecture2.pdf
lecture2.pdflecture2.pdf
lecture2.pdf
 
Chap 4.ppt
Chap 4.pptChap 4.ppt
Chap 4.ppt
 
200402_RoseRealTime.ppt
200402_RoseRealTime.ppt200402_RoseRealTime.ppt
200402_RoseRealTime.ppt
 
matrixfactorization.ppt
matrixfactorization.pptmatrixfactorization.ppt
matrixfactorization.ppt
 
nnfl.0620.pptx
nnfl.0620.pptxnnfl.0620.pptx
nnfl.0620.pptx
 
L20.ppt
L20.pptL20.ppt
L20.ppt
 
The Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdfThe Jacobi and Gauss-Seidel Iterative Methods.pdf
The Jacobi and Gauss-Seidel Iterative Methods.pdf
 
C_and_C++_notes.pdf
C_and_C++_notes.pdfC_and_C++_notes.pdf
C_and_C++_notes.pdf
 

Recently uploaded

NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsSandeep D Chaudhary
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxannathomasp01
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxPooja Bhuva
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Simple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfSimple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfstareducators107
 
Basic Intentional Injuries Health Education
Basic Intentional Injuries Health EducationBasic Intentional Injuries Health Education
Basic Intentional Injuries Health EducationNeilDeclaro1
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningMarc Dusseiller Dusjagr
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptxJoelynRubio1
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 

Recently uploaded (20)

NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Simple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfSimple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdf
 
Basic Intentional Injuries Health Education
Basic Intentional Injuries Health EducationBasic Intentional Injuries Health Education
Basic Intentional Injuries Health Education
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learning
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 

lecture_GPUArchCUDA04-OpenMPHOMP.pdf

  • 1. Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators 1 CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu https://passlab.github.io/CSCE569/
  • 2. Manycore GPU Architectures and Programming: Outline • Introduction – GPU architectures, GPGPUs, and CUDA • GPU Execution model • CUDA Programming model • Working with Memory in CUDA – Global memory, shared and constant memory • Streams and concurrency • CUDA instruction intrinsic and library • Performance, profiling, debugging, and error handling • Directive-based high-level programming model – OpenMP and OpenACC 2
  • 3. HPC Systems with Accelerators • Accelerator architectures become popular – GPUs and Xeon Phi • Multiple accelerators are common – 2, 4, or 8 3 https://www.anandtech.com/show/12587/nvidias-dgx2-sixteen-v100-gpus-30-tb-of-nvme-only-400k Programming on NVIDIA GPUs 1. CUDA and OpenCL – Low-level 2. Library, e.g. cublas, cufft, cuDNN 3. OpenMP, OpenACC, and others – Rely on compiler support 4. Application framework – TensorFlow, etc
  • 4. OpenMP 4.0 for Accelerators 4
  • 5. AXPY Example with OpenMP: Multicore • y = α·x + y – x and y are vectors of size n – α is scalar • Data (x, y and a) are shared – Parallelization is relatively easy • Other examples – sum: reduction – Stencil: halo region exchange and synchronization 5
  • 6. AXPY Offloading To a GPU using CUDA 6 Memory allocation on device Memcpy from host to device Launch parallel execution Memcpy from device to host Deallocation of dev memory
  • 7. AXPY Example with OpenMP: single device • y = α·x + y – x and y are vectors of size n – α is scalar • target directive: annotate an offloading code region • map clause: map data between host and device à moving data – to|tofrom|from: mapping directions – Use array region 7
  • 8. OpenMP Computation and Data Offloading • #pragma omp target device(id) map() if() – target: create a data environment and offload computation on the device – device (int_exp): specify a target device – map(to|from|tofrom|alloc:var_list) : data mapping between the current data environment and a device data environment • #pragma target data device (id) map() if() – Create a device data environment: to be reused/inherited omp target CPU thread omp parallel Accelerator threads CPU thread 8 Main Memory Application data target Application data acc. cores Copy in remote data Copy out remote data Tasks offloaded to accelerator
  • 9. target and map Examples 9
  • 10. Accelerator: Explicit Data Mapping • Relatively small number of truly shared memory accelerators so far • Require the user to explicitly map data to and from the device memory • Use array region 10 long a = 0x858; long b = 0; int anArray[100] #pragma omp target data map(to:a) map(tofrom:b,anArray[0:64]) { /* a, b and anArray are mapped * to the device */ /* work on the device */ #pragma omp target … { … }| } /* b and anArray are mapped * back to the host */
  • 12. Accelerator: Hierarchical Parallelism • Organize massive number of threads – teams of threads, e.g. map to CUDA grid/block • Distribute loops over teams 12 #pragma omp target #pragma omp teams num_teams(2) num_threads(8) { //-- creates a “league” of teams //-- only local barriers permitted #pragma omp distribute for (int i=0; i<N; i++) { } }
  • 13. teams and distribute Loop Example Double-nested loops are mapped to the two levels of thread hierarchy (league and team) 13
  • 14. while ((k<=mits)&&(error>tol)) { // a loop copying u[][] to uold[][] is omitted here … #pragma omp parallel for private(resid,j,i) reduction(+:error) for (i=1;i<(n-1);i++) for (j=1;j<(m-1);j++) { resid = (ax*(uold[i-1][j] + uold[i+1][j]) + ay*(uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b; u[i][j] = uold[i][j] - omega * resid; error = error + resid*resid ; } // the rest code omitted ... } #pragma omp target data device (gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m]) map(tofrom:u[0:n][0:m]) map(alloc:uold[0:n][0:m]) #pragma omp target device(gpu0) map(to:n, m, omega, ax, ay, b, f[0:n][0:m], uold[0:n][0:m]) map(tofrom:u[0:n][0:m]) Jacobi Example: The Impact of Compiler Transformation to Performance 14 Early Experiences With The OpenMP Accelerator Model; Chunhua Liao, Yonghong Yan, Bronis R. de Supinski, Daniel J. Quinlan and Barbara Chapman; International Workshop on OpenMP (IWOMP) 2013, September 2013 0 10 20 30 40 50 60 70 80 90 100 128x128 256x256 512x512 1024x1024 2048x2048 Matrix size (float) Jacobi Execution Time (s) first version target-data Loop collapse using linearization with static-even scheduling Loop collapse using 2-D mapping (16x16 block) Loop collapse using 2-D mapping (8x32 block) Loop collapse using linearization with round-robin scheduling
  • 15. 15 Loop Mapping Algorithms Map-gv-gv in GPU Topology #pragma acc loop gang (2) vector (2) for ( i = x1; i < X1; i++ ) { #pragma acc loop gang (3) vector (4) for ( j = y1; j < Y1; j++ ) {...... } } X.Tian et al. LCPC Workshop 2013 17 / 26 • Need to achieve coalesced memory access on GPUs Compiler Transformation of Nested Loops for GPGPUs, Xiaonan Tian, Rengan Xu, Yonghong Yan, Sunita Chandrasekaran, and Barbara Chapman Journal of Concurrency and Computation: Practice and Experience, August 2015 Compiling a High-level Directive-Based Programming Model for GPGPUs 11 Fig. 9: Double nested loop mapping. Fig. 10: Triple nested loop mapping. Table 2: Threads used in each loop with double loop mappings Loop Mapping Algorithms Map-gv-gv in GPU Topology #pragma acc loop gang (2) vector (2) for ( i = x1; i < X1; i++ ) { #pragma acc loop gang (3) vector (4) for ( j = y1; j < Y1; j++ ) {...... } } X.Tian et al. LCPC Workshop 2013 Mapping Nested Loops to GPUs
  • 16. 16 The certification suite has been evaluated on an NVIDIA Kepler GPU and an Intel Xeon CPU with 8 cores. Table 2 shows a comparison of the speedup of OpenACC for a variety of applications to sequential version, OpenMP Version 3.1 (8 cores) and CUDA (4.2 &5.0). Table 2: Speedup of OpenACC for various applications. Applications Domains OpenACC Directive Combinations Lines of Code Added vs Serial Speedup Over OpenMP OpenACC Seq OpenMP CUDA Needleman- Wunsch Bioinformatics data copy, copyin kernels present loop gang, vector, private 6 5 2.98 1.28 0.24 Stencil Cellular Automation data copyin, copy, deviceptr kernels present loop collapse, independent 1 3 40.55 15.87 0.92 Computational Fluid Dyanmics (CFD) Fluid Mechanics data copyin, copy, deviceptr data present, deviceptr kernels deviceptr kernels loop, gang, vector, private loop gang, vector acc_malloc(), acc_free() 8 46 35.86 4.59 0.38 2D Heat (grid size 4096*4096) Heat Conduction data copyin, copy, deviceptr kernels present loop collapse, independent 1 3 99.52 28.63 0.90 Clever (10Ovals) Data Mining data copyin kernels present, create, copyin, copy loop independent 10 3 4.25 1.22 0.60 FeldKemp (FDK) Image Processing kernels copyin, copyout loop, collapse, independent 1 2 48.30 6.51 0.75 We observed that the certification suite exhibits a set of behaviors. For example, OpenACC speedup ranges from 2.98 to 99.52 over the sequential version and from 1.22 to 28.63 over 8-core OpenMP version. The applications have been chosen based on their computations and communication patterns. Compiler vs Hand-Written NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model, Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan and Barbara Chapman 27th International Workshop on Languages and Compilers for Parallel Computing (LCPC2014)
  • 17. AXPY Example with OpenMP: Multiple device 17 • Parallel region – One CPU thread per device • Manually partition array x and y • Each thread offload subregion of x and y • Chunk the loop in alignment with the data partition
  • 18. Hybrid OpenMP (HOMP) for Multiple Accelerators 18 HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems, Yonghong Yan, Jiawen Liu, and Kirk W. Cameron, The IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2017
  • 19. Three Challenges to Implement HOMP 1. Load balance when distributing loop iterations across computational different devices (CPU, GPU, and MIC) – We developed 7 algorithms of loop distribution and the runtime select algorithms based on computation/data intensity 2. Only copy the associated data to the device that are needed for the loop chunks assigned to that device – Runtime support for ALIGN interface to move or share data between memory spaces 1. Select devices for computations for the optimal performance because more devices ≠ better performance – CUTOFF ratio to select device 19
  • 20. Offloading Execution Time (ms) on 2 CPUs + 4 GPUs + 2 MICs and using CUTOFF_RATIO 20 1423.10' 412.29' 653.72' 1017.77' 767.02' 514.63' 306.26' 3508.77' 885.80' 277.19' 274.32' 1695.45' 1723.84' 272.80' 20989.67' 3664.08' 21587.16' 3809.59' 3544.63' 10393.55' 6532.80' 1422.96' 1459.78' 400.75' 1060.35' 953.50' 793.04' 555.21' 709.63' 5054.33' 1482.90' 1504.45' 432.17' 752.22' 211.11' 779.83' 261.99' 545.66' 407.61' 291.86' 100.93' 0.00' 200.00' 400.00' 600.00' 800.00' 1000.00' 1200.00' 1400.00' 1600.00' 1800.00' 2000.00' BLOCK' SCHED_DYNAMIC' MODEL_1_AUTO' MODEL_2_AUTO' SCHED_PROFILE_AUTO' MODEL_PROFILE_AUTO' MODEL_PROFILE_AUTO'(15%'CUTOFF)' SCHED_DYNAMIC' SCHED_GUIDED' MODEL_1_AUTO' MODEL_2_AUTO' SCHED_PROFILE_AUTO' MODEL_PROFILE_AUTO' MODEL_1_AUTO(15%'CUTOFF)' BLOCK' SCHED_DYNAMIC' SCHED_GUIDED' MODEL_1_AUTO' MODEL_2_AUTO' SCHED_PROFILE_AUTO' MODEL_PROFILE_AUTO' MODEL_2_AUTO(15%'CUTOFF)' BLOCK' SCHED_DYNAMIC' MODEL_1_AUTO' MODEL_2_AUTO' SCHED_PROFILE_AUTO' MODEL_PROFILE_AUTO' SCHED_PROFILE_AUTO(15%'CUTOFF)' BLOCK' MODEL_1_AUTO' MODEL_2_AUTO' MODEL_1_AUTO(15%'CUTOFF)' BLOCK' SCHED_DYNAMIC' SCHED_GUIDED' MODEL_1_AUTO' MODEL_2_AUTO' SCHED_PROFILE_AUTO' MODEL_PROFILE_AUTO' MODEL_PROFILE_AUTO(15%'CUTOFF)' axpyI10B' bm2dI256' matulI6144' matvecI48k' stencil2dI25 6' sumI300M' TOTAL%OFF%TIME(ms)% Execu2on%Time%(ms)%on%2%CPUs%+%4%GPUs%+%2%MICs% The algorithm that delivers the best performance without CUTOFF 1.35 1.01 2.68 0.56 3.43 2.09 3.43 The algorithm with 15% CUTOFF_RATIO that delivers the best performance and its speedup against the best algorithm that does not use CUTOFF
  • 21. Speedup From CUTOFF • Apply 15% CUTOFF ratio to modeling and profiling – Only those devices who may compute more than 15% of total iterations will be used • Thinking of 8 devices (1/8 = 12.5%) 21 Benchmarks Devices used CUTOFF Speedup axpy-10B 2 CPU + 4 GPUs 1.35 bm2d-256 2 CPU + 4 GPUs 1.01 matul-6144 4 GPUs 2.68 matvec-48k 4 GPUs 0.56 stencil2d-256 4 GPUs 3.43 sum-300M 2 CPUs + 4 GPUs 2.09