OpenMP-OpenACC-Offload-Cauldron2022-1.pdf

OpenACC, OpenMP,
Offloading and GCC
GNU Tools Cauldron 2022
Tobias Burnus, Thomas Schwinge, Andrew Stubbs
© Siemens 2022 | 2022-09-18 | Tobias Burnus, Thomas Schwinge, Andrew Stubbs | OpenACC, OpenMP, Offloading and GCC | Siemens Digital Industries Software
Directly after this talk: BoF on this topic, second room “S5”, 2nd floor

Agenda
Intro & History
GCC’s Offloading Implementation
OpenMP in GCC 13 Updates
OpenACC in GCC Updates
OpenMP Memory Management and Unified Shared Memory
AMD GCN Port Updates
nvptx Port Updates
Conclusion

OpenMP and OpenACC – Introductory Examples
OpenACC – Fortran example
!$acc parallel loop independent collapse(2) &
!$acc copyin(A,B) copyout(C)
do i = 1, N
do j = 1, N
block
real :: sum
sum = 0
!$acc loop reduction(+:sum)
do k = 1, N
sum = sum + A(k,i)*B(j,k)
end do
C(j,i) = sum
end block
end do
end do
!$acc end parallel loop
OpenMP – C/C++ example
#pragma omp target map(tofrom:C[:N*N]) \
                   map(to:A[:N*N],B[:N*N]) \
                   private(i,j,k)
#pragma omp parallel for collapse(3) \
                         private(i,j,k) \
                         reduction(+:C[:N*N])
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      C[i*N+j] += A[i*N+k] * B[k*N+j];

History of OpenMP, OpenACC and Offloading in GCC
OpenMP History
1.0: 1997 for Fortran/1998 for C/C++
2.0: 2000 for Fortran/2002 for C/C++
2.5: 2005 – since GCC 4.2
3.0: 2008 – GCC 4.4
3.1: 2011 – GCC 4.7
4.0: 2013 (‘target’ support) – GCC 4.9.{0,1}
4.5: 2015 – GCC 6; Fortran: partially since GCC 7, fully since GCC 11
5.0: 2018 – partially since GCC 9
5.1: 2020 – partially since GCC 12
5.2: 2021 – partially since GCC 13 [669 pages]
TR11: 2022 (6.0 preview) → SC22 (?)
https://openmp.org/specifications/
https://gcc.gnu.org/projects/gomp/
33 ARB members – for GCC:
Red Hat (now via IBM), SUSE, SIEMENS
OpenACC History
1.0: 2011
2.0: 2013 – partially since GCC 5 / 6
2.5: 2015 – mostly since GCC 9
2.6: 2017 – since GCC 10 (full)
2.7: 2018
3.0: 2019
3.1: 2020
3.2: 2021 [156 pages]
3.3: (?) 2022 → SC22 (?)
https://www.openacc.org/specification
https://gcc.gnu.org/wiki/OpenACC
33 members – for GCC:
SUSE, SIEMENS
GCC Offloading
2014: Added the nvptx port
– 2016/GCC 6: OpenACC offload
– 2017/GCC 7: OpenMP
2019: Added the GCN port
– 2020/GCC 10: offloading
– HSAIL: GCC 6–11
2014: Intel MIC (KNL)
– 2016/GCC 6: simulator/offload
– 2021/GCC 12: deprecated

CPU Time-Share on ORNL’s Summit (2021)
200 PFlop/s (peak), TOP500.org #2 (Jun 2021), #4 (Jun 2022)
D. Berthold, W. R. Elwasif & T. Burnus,
https://openmpcon.org/conf2021/program-archive/

GCC’s Offloading Implementation

GCC OpenMP/OpenACC Compilation for Offloading
Compilation
• C/C++/Fortran front ends generate trees, mostly shared between OpenMP and OpenACC but with case separation
• Lowering: language hooks, especially for implicit data-sharing/mapping clauses
→ gimplify.cc, omp-low.cc
• Parallel and offload regions are split into separate functions with argument passing (→ omp-low.cc, omp-expand.cc)
Offloading
• Attribute on offload functions/global variables
→ Saved in LTO format (own section)
• Vector of global variables + entry functions
→ Saved in LTO format (own section)
• Normal processing (LTO or not) for the rest
[Figure: Fortran source with its original, gimple, omp-lower and optimized (-O0) dumps]

Device Compilation
Host Side
• Write entry functions + global variables into offload sections, once per TU or one time (LTO)
→ libgcc/offloadstuff.c + omp-offload.cc
Device Side
• For every offload target, the driver calls mkoffload, which:
• calls the device lto1 and linker
• generates a host-side constructor to register the target and its global variables/entry functions
• Offload code is in the data section of the resulting ELF
Optimization Issues
• Split of offload-func table and offload-device code requires force_node → missed optimization
• Optimizations: Const prop into functions, inlining target-side LTO
libgomp
• Loads libgomp-plugin* to check for available devices
• Plugin libraries hide details of target-specific code
https://gcc.gnu.org/wiki/Offloading
(‘info gcc’/ GCC manual)

Parsing Once for All Devices
Single Parsing
• Single parsing and tree handling for host and all devices → consistent state, late decision for which devices to offload
• But: C++ with exceptions vs. without exceptions in the front end vs. target dependence
• Hard to insert special math functions for devices. Compare LLVM’s __clang_cuda_math.h:
int abs(int __a) { return __nv_abs(__a); }
• More complex to implement metadirectives or code generation for functions targeting only a specific device
• Handling feature differences is hard: exception support, vectorization lengths, SIMD vs. SIMT, ...

Levels of Parallelism for Offloading
• All three levels used: teams, threads and simd
• SIMD / vectorized loops map to threads/work items
• teams + parallel map to warps/wavefronts
• OpenMP teams uses a thread pool of size #teams
https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html
Other compilers
• GCC: teams, parallel, simd
• LLVM/Clang: teams, parallel
under development: teams, parallel, simd
• AMD: teams, parallel
• HPE/Cray: teams, parallel or simd
• Nvidia: teams, parallel
• Intel: teams, parallel, simd
https://www.openmp.org/events/2022-ecp-community-bof-days/

OpenMP in GCC 13 Updates

OpenMP Progress
Implementation progress*
[Bar chart: implemented items (axis 0 to 60) for OMP 5, OMP 5.1 and OMP 5.2, shown for GCC 9, GCC 10, GCC 11, GCC 12, GCC 13 and ALL]
*by counting the implementation-status lines in https://gcc.gnu.org/projects/gomp/ → Impl. Status
Pending work (incomplete)
OpenMP 5 (10 “no” items)
• Metadirectives (WIP/pending patch)
• Declare mapper (WIP/pending patch)
• Array shaping/noncont arrays (todo)
OpenMP 5.1 (20 “no” items)
• Interop/dispatch (todo)
• Assume (todo)
Offload Related
• Unified-shared memory (WIP/pending patches)
All
• OMPD (debugging) – WIP by Egyptian master students
• OMPT (tracing) – (todo)
Lots of smaller & not so small items + 5.2 + TR11 looming

OpenMP in GCC 13
• OpenMP 5.0: 'requires' + reverse offload (WIP for actual device support)
• Several patches in review/revise cycle: mapper, metadirectives, memory-handling (→ later), …
• OpenMP 5.1: more omp_target_… routines, device-number-specific environment variables, nowait on taskwait
• OpenMP 5.2: clause renaming (+ ext. for doacross), firstprivate/allocate on scope, omp_{initial,invalid}_device constants
• Many smaller/minor items, bug fixes, …
(127 GCC-13 commits related to OpenMP/OpenACC/nvptx/gcn)
http://gcc.gnu.org/gcc-13/changes.html → OpenMP
https://gcc.gnu.org/projects/gomp/

OpenACC in GCC Update

GCC/OpenACC
• OpenACC 2.6 support
• Code offloading to AMD (GCN) and Nvidia (nvptx) GPUs
<https://www.openacc.org/>
OpenACC – Fortran example
!$acc parallel loop &
!$acc independent collapse(2) &
!$acc copyin(A,B) copyout(C)
do i = 1, N
do j = 1, N
block
real :: sum
sum = 0
!$acc loop reduction(+:sum)
do k = 1, N
sum = sum + A(k,i)*B(j,k)
end do
C(j,i) = sum
end block
end do
end do
!$acc end parallel loop

GCC/OpenACC
GCC 12+ changes
• OpenACC worker parallelism for AMD GPUs
• 'gcc/omp-oacc-neuter-broadcast.cc'
• Execution state changes (neutering/broadcasting) as a GCC middle end transformation
• Different approach from nvptx where it all happens in the back end
• Bug fixing (such as OpenACC specification adherence), for example:
• Data privatization/sharing at the OpenACC gang level: use GCN LDS, nvptx '.shared' memory
• OpenACC/Fortran: strided array sections and components of derived-type arrays
• OpenACC 'async' correctness
• The usual miscellanea
• Code generation optimizations: middle end as well as GCN, nvptx back ends
• Diagnostics: '-Wopenacc-parallelism' to diagnose potentially suboptimal choices of OpenACC parallelism

GCC/OpenACC
GCC 12+ changes
• OpenACC 'kernels' work, part I
• Decompose OpenACC 'kernels' constructs into parts, a sequence of compute constructs
• 'gcc/omp-oacc-kernels-decompose.cc'
• Bug fixing in master branch
• OpenACC 'kernels' work, part II
• Array access delinearization
• Scalar data privatization
• Analyze 'loop' constructs with 'auto' clause, decide 'seq' vs. 'independent'
• Graphite
• See talk at LPC¹, GNU Tools Track: OpenACC "kernels" improvements (Frederik Harwath)
• <https://linuxplumbersconf.org/event/11/contributions/998/>
• <https://youtu.be/zUw0ZVXCwoM?t=12304s>
• Developed on private branch; then integrated into public og11/og12 branches, TODO: master branch
• Revision and upstreaming of existing development branch work into GCC mainline
¹ Linux Plumbers Conference 2021, <https://linuxplumbersconf.org/>, virtual, week of 2021-09-20

GCC/OpenACC
Next steps
• More revision/upstreaming of existing development branch work into master branch
• Complete features of OpenACC 2.6 and earlier
• A few items listed here: <https://gcc.gnu.org/wiki/SummerOfCode#Selected_Project_Ideas>
• Also listed: a few OpenACC ideas for GCC '-fanalyzer'
• Implement features of OpenACC 2.7 and later
• (… waiting to be scheduled...)
OpenACC 2.7, 3.0, 3.1, 3.2 includes, for example:
• Lots of clarifications, specification bug fixes
• Shared-memory devices, multicore CPU as a device
• Arrays, subarrays and composite variables now allowed in 'reduction' clauses
• C++ lambdas
• Fortran 'do concurrent'
• Device to device memory copying
• Runtime error callback routines (based on OpenACC Profiling Interface callback routines)

OpenMP Memory Management and Unified Memory

OpenMP 5 Memory Allocators
Mainline support
• Basic support (API routines etc.) since GCC 12
• Libmemkind support in GCC 13
New features added to OG12* branch:
• Low-latency memory (nvptx only, for now)
• Up to 32K local on-chip memory per team.
• AMD GCN support is planned.
• Pinned memory
• Unified Shared Memory
• Both amdgcn and nvptx
• Allocator clauses and directives.
• The patches are posted for review, but most are not yet accepted.
*git branch: devel/omp/gcc-12
L. Li (BNL), Manage OpenMP GPU Data Environment Under Unified Address Space (2018)
https://doi.org/10.1007/978-3-319-98521-3_5

Pinned Memory
The proposed implementation uses mlock on Linux.
• Works on all Linux systems.
• Avoids page miss penalties.
• But shows no performance boost on an unloaded system.
Planned: CUDA managed memory
• Use cudaMallocHost when a CUDA device is present.
• Same benefits for normal code.
• Uses a faster code path within CUDA to benefit all systems.

Unified Shared Memory
USM uses the same memory address on both host and device
• No need to “map” data from host to device.
• All calls to malloc/calloc/new/free etc. are intercepted.
• So, all heap memory is shared.
• Libgfortran allocations are also captured.
• Stack and static data cannot be shared.
• “Shared” memory is actually automatically migrated by the device driver on a page miss.
• NVPTX uses cudaMallocManaged.
• AMD GCN uses “coarse-grained” memory.
• AMD can also use “fine-grained” memory in which the GPU accesses the main memory via the bus.

AMD GCN Port Updates

AMD GCN Port Updates
GCC 12
• Improved debug information (for use with ROCGDB)
• 128-bit integer support (TImode)
• Improved GPU parallelism
GCC 13 development & OG12 branch
• MI200 (gfx90a) support
• Unified Shared Memory
• SIMD routines (OpenMP "declare simd")
• In-branch SIMD routine patch in review – target independent!
• SIMD math routines
• Aim to be able to vectorize calls to as much of libm as possible
• Some commits already, more to follow soon.
• Auto-SIMD for OpenMP parallel loops (soon)
• Try to match performance of other toolchains that do not require explicit "simd" directives
• Multiple vector sizes
• Currently 64 lanes only (fully maskable)
• Soon add 32, 16, 8, 4, and 2-lane vectors for those optimizers that can't (yet) use masks (e.g. SLP).
• Implemented by adding masking in the back-end.

nvptx Port Updates

• CUDA 11+ support
• Bug fixes/PTX conformance, especially for newer GPU hardware
• Also work around Nvidia PTX JIT bugs...
• Always use own 'cuda.h' and 'dlopen("libcuda.so")'
• Initial/experimental support for features of higher SM levels, PTX versions
• For example: symbol aliasing, 'HFmode', ...
• General PTX code generation improvements
… by Tom de Vries, Roger Sayle, and us

Conclusion

• Still lots of work to catch up with OpenMP 5.x/OpenACC 2.7+, plus performance, diagnostic and documentation improvements
• But steady & large progress in the last year(s) for OpenMP, GCN, nvptx and OpenACC
Q&A on the talk now
BoF on concurrency topics, especially OpenACC/OpenMP/Offloading, directly afterwards upstairs

Acknowledgement
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
Disclaimer
© Siemens 2022
Subject to changes and errors. The information given in this document only
contains general descriptions and/or performance features which may not
always specifically reflect those described, or which may undergo modification
in the course of further development of the products. The requested
performance features are binding only when they are expressly agreed upon
in the concluded contract.
All product designations may be trademarks or other rights of
Siemens AG, its affiliated companies or other companies whose use by third
parties for their own purposes could violate the rights of the respective owner.
