OpenACC, OpenMP, Offloading and GCC
GNU Tools Cauldron 2022
Tobias Burnus, Thomas Schwinge, Andrew Stubbs
© Siemens 2022 | 2022-09-18 | Tobias Burnus, Thomas Schwinge, Andrew Stubbs | OpenACC, OpenMP, Offloading and GCC | Siemens Digital Industries Software
Directly after this talk: BoF on this topic, second room “S5”, 2nd floor
Agenda
Intro & History
GCC’s Offloading Implementation
OpenMP in GCC 13 Updates
OpenACC in GCC Updates
OpenMP Memory Management and Unified Shared Memory
AMD GCN Port Updates
nvptx Port Updates
Conclusion
OpenMP and OpenACC – Introductory Examples
OpenACC – Fortran example
!$acc parallel loop independent collapse(2) &
!$acc copyin(A,B) copyout(C)
do i = 1, N
  do j = 1, N
    block
      real :: sum
      sum = 0
      !$acc loop reduction(+:sum)
      do k = 1, N
        sum = sum + A(k,i)*B(j,k)
      end do
      C(j,i) = sum
    end block
  end do
end do
!$acc end parallel loop
OpenMP – C/C++ example
#pragma omp target map(tofrom:C[:N*N]) \
                   map(to:A[:N*N],B[:N*N]) \
                   private(i,j,k)
#pragma omp parallel for collapse(3) \
                         private(i,j,k) \
                         reduction(+:C[:N*N])
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      C[i*N+j] += A[i*N+k] * B[k*N+j];
History of OpenMP, OpenACC and Offloading in GCC
OpenMP History
1.0: 1997 for Fortran/1998 for C/C++
2.0: 2000 for Fortran/2002 for C/C++
2.5: 2005 – since GCC 4.2
3.0: 2008 – GCC 4.4
3.1: 2011 – GCC 4.7
4.0: 2013 (‘target’ support) – GCC 4.9.{0,1}
4.5: 2015 – GCC 6; Fortran: partially GCC 7, fully GCC 11
5.0: 2018 – partially since GCC 9
5.1: 2020 – partially since GCC 12
5.2: 2021 – partially since GCC 13 [669 pages]
TR11: 2022 (6.0 preview) → SC22 (?)
https://openmp.org/specifications/
https://gcc.gnu.org/projects/gomp/
33 ARB members – for GCC:
Red Hat (now via IBM), SUSE, SIEMENS
OpenACC History
1.0: 2011
2.0: 2013 – partially since GCC 5 / 6
2.5: 2015 – mostly since GCC 9
2.6: 2017 – since GCC 10 (full)
2.7: 2018
3.0: 2019
3.1: 2020
3.2: 2021 [156 pages]
3.3: (?) 2022 → SC22 (?)
https://www.openacc.org/specification
https://gcc.gnu.org/wiki/OpenACC
33 members – for GCC:
SUSE, SIEMENS
GCC Offloading
2014: Added the nvptx port
– 2016/GCC 6: OpenACC offloading
– 2017/GCC 7: OpenMP offloading
2019: Added the GCN port
– 2020/GCC 10: offloading
– HSAIL: GCC 6–11
2014: Intel MIC (KNL)
– 2016/GCC 6: simulator/offloading
– 2021/GCC 12: deprecated
CPU Time-Share on ORNL’s Summit (2021)
200 PFlop/s (peak), TOP500.org #2 (Jun 2021), #4 (Jun 2022)
D. Berthold, W. R. Elwasif & T. Burnus, https://openmpcon.org/conf2021/program-archive/
GCC’s Offloading Implementation
GCC OpenMP/OpenACC Compilation for Offloading
Compilation
• The C/C++/Fortran front ends generate trees, mostly shared between OpenMP and OpenACC but with case distinctions
• Lowering: language hooks, especially for implicit data-sharing/mapping clauses
→ gimplify.cc, omp-low.cc
• parallel and offload regions are split out into separate functions, with argument passing (→ omp-low.cc, omp-expand.cc)
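The outlining step can be pictured with a hand-written C sketch of what `#pragma omp parallel for` over `out[i] = 2.0 * a[i]` conceptually turns into: the loop body moves into its own function, and the variables it uses are passed through a struct. The names `region_data`, `region_fn` and `run_region` are hypothetical; GCC's actual outlined functions (named like `main._omp_fn.0`) and the libgomp entry points they are handed to differ in detail, and `run_region` here simply calls the body serially once per simulated thread.

```c
#include <stddef.h>

/* Hypothetical data block built for the region: the variables the loop
   body uses, shared ones by address and the trip count by value. */
struct region_data {
  const double *a;   /* shared input array */
  double *out;       /* shared output array */
  size_t n;          /* loop trip count, copied in */
};

/* Hypothetical outlined body.  Each thread handles one contiguous chunk
   of the iteration space, similar to a static schedule. */
static void region_fn(void *arg, unsigned tid, unsigned nthreads)
{
  struct region_data *d = arg;
  size_t chunk = (d->n + nthreads - 1) / nthreads;
  size_t lo = tid * chunk;
  size_t hi = lo + chunk < d->n ? lo + chunk : d->n;
  for (size_t i = lo; i < hi; i++)
    d->out[i] = 2.0 * d->a[i];
}

/* Hypothetical stand-in for the runtime call: runs the outlined function
   once per simulated thread, serially (a real runtime uses a thread pool). */
static void run_region(void (*fn)(void *, unsigned, unsigned), void *data,
                       unsigned nthreads)
{
  for (unsigned t = 0; t < nthreads; t++)
    fn(data, t, nthreads);
}

/* Source:  #pragma omp parallel for
            for (i = 0; i < n; i++) out[i] = 2.0 * a[i];
   After outlining, conceptually: */
void scale_by_two(const double *a, double *out, size_t n)
{
  struct region_data d = { a, out, n };
  run_region(region_fn, &d, 4);
}
```

Passing a single `void *` keeps the runtime interface identical for every region; only the struct layout differs from region to region.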
Offloading
• Attribute on offload functions/global variables
→ Saved in LTO format (own section)
• Table of global variables + entry functions
→ Saved in LTO format (own section)
• Normal processing (LTO or not) for the rest
Figure: Fortran example with its original, gimple, omp-lower and optimized (-O0) dumps
Device Compilation
Host Side
• Writes the entry functions and global variables into the offload sections, once per TU or once in total (LTO)
→ libgcc/offloadstuff.c + omp-offload.cc
Device Side
• For every offload target, the driver calls mkoffload, which:
• calls the device lto1 and linker
• generates a host-side constructor that registers the target image and its global variables/entry functions
• The offload code is placed in the data section of the resulting ELF file
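The host/device registration relies on tables whose entries correspond by position: the host table lists the host addresses of the offload entry functions, the device image lists its kernels in the same order, and the registration constructor lets the runtime translate one into the other. The C sketch below models only that index correspondence, as an assumption about the general scheme; `offload_image`, `register_image`, `lookup_device_entry`, `kernel_a` and `kernel_b` are hypothetical, and the device "addresses" are plain integers rather than real GPU handles.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*host_fn)(void *);

/* Hypothetical image descriptor: both tables list the same entry
   functions in the same order, so index i on the host matches index i
   on the device. */
struct offload_image {
  const host_fn *host_table;       /* host addresses of entry functions */
  const uintptr_t *device_table;   /* matching device-side handles */
  size_t n_entries;
};

#define MAX_IMAGES 8
static const struct offload_image *registered[MAX_IMAGES];
static size_t n_registered;

/* Stand-in for the constructor generated by mkoffload: records one image.
   Returns 0 on success, -1 if the registry is full. */
int register_image(const struct offload_image *img)
{
  if (n_registered == MAX_IMAGES)
    return -1;
  registered[n_registered++] = img;
  return 0;
}

/* Translates a host entry-function address into the device handle stored
   at the same index; returns 0 if the function was never registered. */
uintptr_t lookup_device_entry(host_fn fn)
{
  for (size_t i = 0; i < n_registered; i++)
    for (size_t j = 0; j < registered[i]->n_entries; j++)
      if (registered[i]->host_table[j] == fn)
        return registered[i]->device_table[j];
  return 0;
}

/* Two hypothetical host-side entry functions with distinct bodies. */
void kernel_a(void *arg) { if (arg) *(int *)arg = 1; }
void kernel_b(void *arg) { if (arg) *(int *)arg = 2; }
```

Because the tables are matched by position, the host never needs to know device symbol names; it only needs the same ordering on both sides.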
Optimization Issues
• Splitting the offload-function table from the offload device code requires force_node → missed optimizations
• Optimizations: constant propagation into functions, inlining, target-side LTO
libgomp
• Loads the libgomp-plugin* libraries to check for available devices
• Plugin libraries hide the details of target-specific code
https://gcc.gnu.org/wiki/Offloading
(‘info gcc’/ GCC manual)
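Plugin selection can be sketched as a table of function pointers behind a common interface. In libgomp the plugins are shared libraries (such as `libgomp-plugin-nvptx.so` and `libgomp-plugin-gcn.so`) whose entry points are prefixed `GOMP_OFFLOAD_`; the sketch below replaces loading them with a static table, and `device_plugin`, `select_plugin` and the two fake plugins are hypothetical.

```c
#include <stddef.h>

/* Hypothetical plugin interface: every plugin exposes the same small set
   of function pointers, hiding its target-specific code behind them. */
struct device_plugin {
  const char *name;
  int (*get_num_devices)(void);
};

/* Hypothetical plugins standing in for dynamically loaded libraries:
   one finds no device, the other reports two. */
static int no_devices(void)  { return 0; }
static int two_devices(void) { return 2; }

static const struct device_plugin plugins[] = {
  { "gpu-vendor-a", no_devices },
  { "gpu-vendor-b", two_devices },
};

/* Returns the first plugin that reports at least one device, or NULL,
   meaning "no device available: run on the host (host fallback)". */
const struct device_plugin *select_plugin(void)
{
  for (size_t i = 0; i < sizeof plugins / sizeof plugins[0]; i++)
    if (plugins[i].get_num_devices() > 0)
      return &plugins[i];
  return NULL;
}
```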
Parsing Once for All Devices
Single Parsing
• Single parsing and tree handling for the host and all devices → consistent state, late decision about which devices to offload to
• But: C++ with vs. without exceptions is handled in the FE, although it is target dependent
• Hard to insert special math functions for devices, cf. LLVM’s __clang_cuda_math.h:
int abs(int __a) { return __nv_abs(__a); }
• More complex to implement metadirectives or code generation for functions targeting only a specific device
• Handling feature differences is hard: exception support, vectorization lengths, SIMD vs. SIMT, ...
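In user code, OpenMP 5.0's `declare variant` is the standard way to request a device-specific replacement function, the kind of per-device code generation that the single-parsing design makes harder. A minimal sketch, assuming the `arch(nvptx)` context selector; `my_abs`, `my_abs_nvptx` and `sum_abs` are hypothetical names, and whether the variant is actually selected in the offload compilation depends on the GCC version. Without `-fopenmp` the pragmas are ignored and everything runs on the host.

```c
#pragma omp declare target
/* Hypothetical device-specific version; a real one might call a vendor
   intrinsic, but plain C keeps the sketch portable. */
int my_abs_nvptx(int a)
{
  return a < 0 ? -a : a;
}

/* Base version, used on the host and on devices without a matching
   variant; when compiling for an nvptx device, calls are meant to
   resolve to my_abs_nvptx instead. */
#pragma omp declare variant(my_abs_nvptx) match(device={arch(nvptx)})
int my_abs(int a)
{
  return a < 0 ? -a : a;
}
#pragma omp end declare target

/* Sums |v[i]| inside a target region. */
int sum_abs(const int *v, int n)
{
  int s = 0;
  #pragma omp target map(to: v[0:n]) map(tofrom: s)
  for (int i = 0; i < n; i++)
    s += my_abs(v[i]);
  return s;
}
```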
Levels of Parallelism for Offloading
• All three levels are used: teams, threads and simd
• SIMD/vectorized loops map to threads/work items
• teams + parallel map to warps/wavefronts
• OpenMP teams uses a thread pool of size #teams
https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html
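A combined construct that uses all three levels is shown below; per the mapping above, GCC places the `simd` lanes on GPU threads/work items and the `teams`/`parallel` threads on warps or wavefronts. This is a minimal sketch (the function name `saxpy` is just an example); without an offload device, or without `-fopenmp`, the loop runs on the host.

```c
/* y = a*x + y using teams, parallel threads and simd lanes. */
void saxpy(int n, float a, const float *x, float *y)
{
  #pragma omp target teams distribute parallel for simd \
      map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
```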
Other compilers
• GCC: teams, parallel, simd
• LLVM/Clang: teams, parallel (under development: teams, parallel, simd)
• AMD: teams, parallel
• HPE/Cray: teams, parallel or simd
• Nvidia: teams, parallel
• Intel: teams, parallel, simd
https://www.openmp.org/events/2022-ecp-community-bof-days/
OpenMP in GCC 13 Updates
OpenMP Progress
Implementation progress* (chart: implemented OpenMP 5.0, 5.1 and 5.2 items for GCC 9, 10, 11, 12, 13 and all items in total; scale 0–60)
*by counting the implementation-status lines at https://gcc.gnu.org/projects/gomp/ → Impl. Status
Pending work (incomplete)
OpenMP 5 (10 “no” items)
• Metadirectives (WIP/pending patch)
• Declare mapper (WIP/pending patch)
• Array shaping/non-contiguous arrays (todo)
OpenMP 5.1 (20 “no” items)
• Interop/dispatch (todo)
• Assume (todo)
Offload Related
• Unified-shared memory (WIP/pending patches)
All
• OMPD (debugging) – WIP by Egyptian master students
• OMPT (tracing) – (todo)
Lots of smaller & not so small items + 5.2 + TR11 looming
OpenMP in GCC 13
• OpenMP 5.0: 'requires' + reverse offload (WIP for actual device support)
• Several patches in review/revise cycle: mapper, metadirectives, memory-handling (→ later), …
• OpenMP 5.1: more omp_target_… routines, per-device-number environment variables, 'nowait' on 'taskwait'
• OpenMP 5.2: clause renaming (+ extension for doacross), firstprivate/allocate on scope, omp_{initial,invalid}_device constants
• Many smaller/minor items, bug fixes, …
(127 GCC-13 commits related to OpenMP/OpenACC/nvptx/gcn)
https://gcc.gnu.org/gcc-13/changes.html → OpenMP
https://gcc.gnu.org/projects/gomp/
OpenACC in GCC Updates
GCC/OpenACC
• OpenACC 2.6 support
• Code offloading to AMD (GCN) and Nvidia (nvptx) GPUs
<https://www.openacc.org/>
OpenACC – Fortran example
!$acc parallel loop &
!$acc independent collapse(2) &
!$acc copyin(A,B) copyout(C)
do i = 1, N
  do j = 1, N
    block
      real :: sum
      sum = 0
      !$acc loop reduction(+:sum)
      do k = 1, N
        sum = sum + A(k,i)*B(j,k)
      end do
      C(j,i) = sum
    end block
  end do
end do
!$acc end parallel loop
GCC/OpenACC
GCC 12+ changes
• OpenACC worker parallelism for AMD GPUs
• 'gcc/omp-oacc-neuter-broadcast.cc'
• Execution state changes (neutering/broadcasting) as a GCC middle end transformation
• Different approach from nvptx where it all happens in the back end
• Bug fixing (such as OpenACC specification adherence), for example:
• Data privatization/sharing at the OpenACC gang level: use GCN LDS, nvptx '.shared' memory
• OpenACC/Fortran: strided array sections and components of derived-type arrays
• OpenACC 'async' correctness
• The usual miscellanea
• Code generation optimizations: middle end as well as GCN, nvptx back ends
• Diagnostics: '-Wopenacc-parallelism' to diagnose potentially suboptimal choices of OpenACC parallelism
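OpenACC's gang/worker/vector levels, and where neutering comes in, can be seen in a small row-sum kernel. In the sketch below (names are illustrative), code inside the gang loop but outside the worker/vector loop, such as initializing `sum` and storing `out[i]`, runs in worker-single, vector-single mode: the other workers and vector lanes are neutered (idle) there, and values they need later are broadcast to them. How GCC implements this differs between GCN (middle end) and nvptx (back end), as noted above; without `-fopenacc` the code runs serially.

```c
/* Row sums of an n x m row-major matrix: gangs take rows, workers and
   vector lanes share the reduction within a row. */
void row_sums(int n, int m, const float *a, float *out)
{
  #pragma acc parallel loop gang copyin(a[0:n*m]) copyout(out[0:n])
  for (int i = 0; i < n; i++) {
    float sum = 0.0f;                 /* worker-single, vector-single */
    #pragma acc loop worker vector reduction(+:sum)
    for (int j = 0; j < m; j++)
      sum += a[i * m + j];            /* all workers and lanes active */
    out[i] = sum;                     /* single again: one store per row */
  }
}
```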
GCC/OpenACC
GCC 12+ changes
• OpenACC 'kernels' work, part I
• Decompose OpenACC 'kernels' constructs into parts, a sequence of compute constructs
• 'gcc/omp-oacc-kernels-decompose.cc'
• Bug fixing in master branch
• OpenACC 'kernels' work, part II
• Array access delinearization
• Scalar data privatization
• Analyze 'loop' constructs with 'auto' clause, decide 'seq' vs. 'independent'
• Graphite
• See talk at LPC¹, GNU Tools Track: OpenACC "kernels" improvements (Frederik Harwath)
• <https://linuxplumbersconf.org/event/11/contributions/998/>
• <https://youtu.be/zUw0ZVXCwoM?t=12304s>
• Developed on private branch; then integrated into public og11/og12 branches, TODO: master branch
• Revision and upstreaming of existing development branch work into GCC mainline
¹ Linux Plumbers Conference 2021, <https://linuxplumbersconf.org/>, virtual, week of 2021-09-20
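A `kernels` region leaves the choice of parallelism to the compiler. Conceptually, the decomposition turns the region below into a sequence of compute constructs, one per loop nest, and the analysis then has to decide `seq` vs. `independent` for each `loop auto`. The function name is illustrative, and whether a given GCC release actually parallelizes the first loop depends on the analysis available in that release (part II is on the og11/og12 branches); the results are the same either way.

```c
/* Two loop nests in one 'kernels' region: the first has independent
   iterations, the second carries a dependence (prefix sum). */
void kernels_example(int n, const float *x, float *y, float *z)
{
  #pragma acc kernels copyin(x[0:n]) copy(y[0:n]) copyout(z[0:n])
  {
    /* Independent iterations: may become a parallel loop. */
    #pragma acc loop auto
    for (int i = 0; i < n; i++)
      y[i] += x[i];

    /* Each iteration reads the previous result: must stay sequential. */
    #pragma acc loop auto
    for (int i = 0; i < n; i++)
      z[i] = (i > 0 ? z[i - 1] : 0.0f) + y[i];
  }
}
```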
GCC/OpenACC
Next steps
• More revision/upstreaming of existing development branch work into master branch
• Complete features of OpenACC 2.6 and earlier
• A few items listed here: <https://gcc.gnu.org/wiki/SummerOfCode#Selected_Project_Ideas>
• Also listed: a few OpenACC ideas for GCC '-fanalyzer'
• Implement features of OpenACC 2.7 and later
• (… waiting to be scheduled...)
OpenACC 2.7, 3.0, 3.1 and 3.2 include, for example:
• Lots of clarifications, specification bug fixes
• Shared-memory devices, multicore CPU as a device
• Arrays, subarrays and composite variables now allowed in 'reduction' clauses
• C++ lambdas
• Fortran 'do concurrent'
• Device to device memory copying
• Runtime error callback routines (based on OpenACC Profiling Interface callback routines)
OpenMP Memory Management and Unified Memory
OpenMP 5 Memory Allocators
Mainline support
• Basic support (API routines etc.) since GCC 12
• libmemkind support in GCC 13
New features added to OG12* branch:
• Low-latency memory (nvptx only, for now)
• Up to 32K local on-chip memory per team.
• AMD GCN support is planned.
• Pinned memory
• Unified Shared Memory
• Both amdgcn and nvptx
• Allocator clauses and directives.
• The patches have been posted for review, but most are not yet accepted.
*git branch: devel/omp/gcc-12
L. Li (BNL), Manage OpenMP GPU Data Environment Under Unified Address Space (2018)
https://doi.org/10.1007/978-3-319-98521-3_5
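The OpenMP allocator API lets a program request pinned memory through an allocator trait. A minimal sketch, assuming only the standard OpenMP 5 routines (`omp_init_allocator`, `omp_alloc`, `omp_free`, `omp_destroy_allocator`): if the implementation cannot create a pinned allocator, the code falls back to `omp_default_mem_alloc`, and without `-fopenmp` it uses `malloc`. The function name `sum_in_pinned_buffer` is hypothetical; requesting low-latency memory would use `omp_low_lat_mem_space` in the same way.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Fills a buffer of n doubles with 1..n in (ideally) pinned memory and
   returns the sum; frees everything it allocated before returning. */
double sum_in_pinned_buffer(size_t n)
{
  double *buf;
#ifdef _OPENMP
  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t al =
      omp_init_allocator(omp_default_mem_space, 1, traits);
  int own = al != omp_null_allocator;   /* did we create an allocator? */
  if (!own)
    al = omp_default_mem_alloc;         /* pinning unsupported: fall back */
  buf = omp_alloc(n * sizeof *buf, al);
#else
  buf = malloc(n * sizeof *buf);
#endif
  double s = 0.0;
  if (buf) {
    for (size_t i = 0; i < n; i++)
      buf[i] = (double)(i + 1);
    for (size_t i = 0; i < n; i++)
      s += buf[i];
  }
#ifdef _OPENMP
  omp_free(buf, al);                    /* no-op for a NULL pointer */
  if (own)
    omp_destroy_allocator(al);
#else
  free(buf);
#endif
  return s;
}
```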
Pinned Memory
The proposed implementation uses mlock on Linux.
• Works on all Linux systems.
• Avoids page-miss penalties.
• But shows no performance boost on an unloaded system.
Planned: pinned memory allocated via CUDA
• Use cudaMallocHost when a CUDA device is present.
• Same benefits for normal code.
• Uses a faster code path within CUDA to benefit all systems.
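The mlock approach itself is a few lines of POSIX C. A minimal sketch, assuming Linux (or another POSIX system providing `mlock`); `pinned_buf`, `pinned_alloc` and `pinned_free` are hypothetical names, not the libgomp implementation. Because the `RLIMIT_MEMLOCK` limit can make `mlock` fail for unprivileged processes, the buffer stays usable in that case but is reported as unlocked.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

struct pinned_buf {
  void *ptr;     /* start of the allocation (NULL if malloc failed) */
  size_t len;    /* size in bytes */
  int locked;    /* 1 if mlock succeeded */
};

/* Allocates len bytes and tries to lock them into RAM so they cannot be
   swapped out.  mlock works on whole pages, so a real allocator would
   page-align the buffer to avoid locking neighbouring allocations. */
struct pinned_buf pinned_alloc(size_t len)
{
  struct pinned_buf b = { malloc(len), len, 0 };
  if (b.ptr) {
    memset(b.ptr, 0, len);               /* zero-fill; also faults pages in */
    b.locked = mlock(b.ptr, len) == 0;   /* may fail under RLIMIT_MEMLOCK */
  }
  return b;
}

void pinned_free(struct pinned_buf *b)
{
  if (b->locked)
    munlock(b->ptr, b->len);
  free(b->ptr);
  b->ptr = NULL;
  b->locked = 0;
}
```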
Unified Shared Memory
USM uses the same memory address on both host and device.
• No need to “map” data from host to device.
• All calls to malloc/calloc/new/free etc. are intercepted.
• So, all heap memory is shared.
• Libgfortran allocations are also captured.
• Stack and static data cannot be shared.
• “Shared” memory is actually migrated automatically by the device driver on a page miss.
• nvptx uses cudaMallocManaged.
• AMD GCN uses “coarse-grained” memory.
• AMD can also use “fine-grained” memory, in which the GPU accesses the main memory via the bus.
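With unified shared memory, heap pointers can be dereferenced inside a target region without any map clause. A minimal sketch, assuming an implementation that honors `requires unified_shared_memory` as described above; `make_doubled` is a hypothetical name. Since only heap memory is shared in this implementation, the array comes from `malloc`; a stack or static array would still need to be mapped. Without USM support or an offload device, the region runs on the host.

```c
#include <stdlib.h>

/* Must precede any target construct in this translation unit. */
#pragma omp requires unified_shared_memory

/* Returns a malloc'd array {0, 2, 4, ...} of length n, doubled inside a
   target region that uses the host pointer directly (no map clause). */
double *make_doubled(int n)
{
  double *v = malloc(n * sizeof *v);
  if (!v)
    return NULL;
  for (int i = 0; i < n; i++)
    v[i] = i;
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; i++)
    v[i] *= 2.0;
  return v;
}
```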
AMD GCN Port Updates
AMD GCN Port Updates
GCC 12
• Improved debug information (for use with ROCGDB)
• 128-bit integer support (TImode)
• Improved GPU parallelism
GCC 13 development & OG12 branch
• MI200 (gfx90a) support
• Unified Shared Memory
• SIMD routines (OpenMP "declare SIMD")
• In-branch SIMD routine patch in review – target independent!
• SIMD math routines
• Aim to be able to vectorize calls to as much of libm as possible
• Some commits already, more to follow soon.
• Auto-SIMD for OpenMP parallel loops (soon)
• Try to match performance of other toolchains that do not require explicit "simd" directives
• Multiple vector sizes
• Currently 64-lanes only (fully maskable)
• Soon add 32, 16, 8, 4, and 2-lane vectors for those optimizers that can't (yet) use masks (e.g. SLP).
• Implemented by adding masking in the back-end.
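`declare simd` asks the compiler to also emit vector variants of a function, so a vectorized loop can call it once per vector instead of once per lane. A minimal sketch (function names are illustrative): because the call sits under an `if`, the loop needs the in-branch (masked) variant, the case addressed by the in-branch patch mentioned above. `-fopenmp-simd` enables these pragmas without the rest of OpenMP; without it the code is ordinary scalar C.

```c
/* Vector variants are requested for scale_add; 'uniform(a)' means a is
   the same for all lanes.  With neither 'inbranch' nor 'notinbranch',
   both masked and unmasked variants may be generated. */
#pragma omp declare simd uniform(a)
float scale_add(float a, float x, float y)
{
  return a * x + y;
}

/* Updates y[i] only where x[i] is positive: a conditional call, which
   needs the masked (in-branch) vector variant. */
void apply_positive(int n, float a, const float *x, float *y)
{
  #pragma omp simd
  for (int i = 0; i < n; i++)
    if (x[i] > 0.0f)
      y[i] = scale_add(a, x[i], y[i]);
}
```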
nvptx Port Updates
nvptx Port Updates
• CUDA 11+ support
• Bug fixes/PTX conformance, especially for newer GPU hardware
• Also work around Nvidia PTX JIT bugs...
• Always use own 'cuda.h' and 'dlopen("libcuda.so")'
• Initial/experimental support for features of higher SM levels, PTX versions
• For example: symbol aliasing, 'HFmode', ...
• General PTX code generation improvements
… by Tom de Vries, Roger Sayle, and us
Conclusion
Conclusion
• Still lots of work to catch up with OpenMP 5.x/OpenACC 2.7+, plus performance, diagnostic and documentation improvements
• But steady and large progress in the last year(s) for OpenMP, GCN, nvptx and OpenACC
Q & A on the talk now
BoF on concurrency topics, especially OpenACC/OpenMP/offloading, directly afterwards upstairs
Acknowledgement
This research used resources of the Oak Ridge Leadership Computing
Facility, which is a DOE Office of Science User Facility supported under
Contract DE-AC05-00OR22725
Disclaimer
© Siemens 2022
Subject to changes and errors. The information given in this document only
contains general descriptions and/or performance features which may not
always specifically reflect those described, or which may undergo modification
in the course of further development of the products. The requested
performance features are binding only when they are expressly agreed upon
in the concluded contract.
All product designations may be trademarks or other rights of
Siemens AG, its affiliated companies or other companies whose use by third
parties for their own purposes could violate the rights of the respective owner.
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 

Recently uploaded (20)

Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 

OpenMP-OpenACC-Offload-Cauldron2022-1.pdf

  • 1. OpenACC, OpenMP, Offloading and GCC
    GNU Tools Cauldron 2022
    Tobias Burnus, Thomas Schwinge, Andrew Stubbs
    © Siemens 2022 | 2022-09-18 | Siemens Digital Industries Software
    Directly after this talk: BoF on this topic, second room “S5”, 2nd floor
  • 2. Agenda
    • Intro & History
    • GCC’s Offloading Implementation
    • OpenMP in GCC 13 Updates
    • OpenACC in GCC Updates
    • OpenMP Memory Management and Unified Shared Memory
    • AMD GCN Port Updates
    • nvptx Port Updates
    • Conclusion
  • 3. OpenMP and OpenACC – Introductory Examples
    OpenACC – Fortran example
        !$acc parallel loop independent collapse(2) &
        !$acc copyin(A,B) copyout(C)
        do i = 1, N
          do j = 1, N
            block
              real :: sum
              sum = 0
              !$acc loop reduction(+:sum)
              do k = 1, N
                sum = sum + A(k,i)*B(j,k)
              end do
              C(j,i) = sum
            end block
          end do
        end do
        !$acc end parallel loop
    OpenMP – C/C++ example
        #pragma omp target map(tofrom:C[:N*N]) \
                map(to:A[:N*N],B[:N*N]) \
                private(i,j,k)
        #pragma omp parallel for collapse(3) \
                private(i,j,k) \
                reduction(+:C[:N*N])
        for (i = 0; i < N; ++i)
          for (j = 0; j < N; ++j)
            for (k = 0; k < N; ++k)
              C[i*N+j] += A[i*N+k] * B[k*N+j];
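The OpenMP version above reduces over the whole array C, so every thread gets a private copy of C that is combined at the end. A sketch of an alternative, assuming row-major storage and a hypothetical helper name `matmul_target`: collapse only the i/j loops and accumulate each element in a scalar, which needs no array reduction. Compiled without `-fopenmp` the pragma is ignored; with `-fopenmp` but no offload device, GCC by default runs the target region on the host.

```c
#include <assert.h>

/* C = A * B for n x n row-major matrices (hypothetical helper).
   Each (i, j) pair is independent, so no reduction clause is needed:
   'sum' is declared inside the loop body and is therefore private. */
void matmul_target(int n, const double *A, const double *B, double *C)
{
    assert(n > 0);
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: A[0:n*n], B[0:n*n]) map(from: C[0:n*n])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```

For example, with the 2×2 matrices A = {1, 2, 3, 4} and B = {5, 6, 7, 8}, C becomes {19, 22, 43, 50}.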
  • 4. History of OpenMP, OpenACC and Offloading in GCC
    OpenMP History
    • 1.0: 1997 for Fortran/1998 for C/C++
    • 2.0: 2000 for Fortran/2002 for C/C++
    • 2.5: 2005 – since GCC 4.2
    • 3.0: 2008 – GCC 4.4
    • 3.1: 2011 – GCC 4.7
    • 4.0: 2013 (‘target’ support) – GCC 4.9.{0,1}
    • 4.5: 2015 – GCC 6, Fortran: partially 7, fully 11
    • 5.0: 2018 – partially since GCC 9
    • 5.1: 2020 – partially since GCC 12
    • 5.2: 2021 – partially since GCC 13 [669 pages]
    • TR11: 2022 (6.0 preview) → SC22 (?)
    https://openmp.org/specifications/
    https://gcc.gnu.org/projects/gomp/
    33 ARB members – for GCC: Red Hat (now via IBM), SUSE, SIEMENS
    OpenACC History
    • 1.0: 2011
    • 2.0: 2013 – partially since GCC 5 / 6
    • 2.5: 2015 – mostly since GCC 9
    • 2.6: 2017 – since GCC 10 (full)
    • 2.7: 2018
    • 3.0: 2019
    • 3.1: 2020
    • 3.2: 2021 [156 pages]
    • 3.3: (?) 2022 → SC22 (?)
    https://www.openacc.org/specification
    https://gcc.gnu.org/wiki/OpenACC
    33 members – for GCC: SUSE, SIEMENS
    GCC Offloading
    • 2014: nvptx port added – 2016/GCC 6: OpenACC offloading – 2017/GCC 7: OpenMP offloading
    • 2019: GCN port added – 2020/GCC 10: offloading
    • HSAIL: GCC 6–11
    • 2014: Intel MIC (KNL) – 2016/GCC 6: simulator/offloading – 2021/GCC 12: deprecated
  • 5. CPU Time-Share on ORNL’s Summit (2021)
    Summit: 200 PFlop/s (peak), TOP500.org #2 (Jun 2021), #4 (Jun 2022)
    Source: D. Berthold, W. R. Elwasif & T. Burnus, https://openmpcon.org/conf2021/program-archive/
  • 6. GCC’s Offloading Implementation
  • 7. GCC OpenMP/OpenACC Compilation for Offloading
    Compilation
    • The C/C++/Fortran front ends generate trees, mostly shared between OpenMP and OpenACC but with case separation
    • Lowering: language hooks, especially for implicit data-sharing/mapping clauses → gimplify.cc, omp-low.cc
    • Parallel and offload regions are split into separate functions with argument passing (→ omp-low.cc, omp-expand.cc)
    Offloading
    • An attribute on offload functions/global variables → saved in LTO format (own section)
    • Vector of global variables + entry functions → saved in LTO (own section)
    • Normal processing (LTO or not) for the rest
    [Figure: Fortran original dump, gimple dump, omp-lower dump, optimized dump (-O0)]
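To make the outlining step concrete, a conceptual sketch in plain C; it is not GCC's generated code, and all names are hypothetical. The variables a parallel region uses are packed into a record, the region body becomes a separate function that receives that record, and a stand-in for the runtime invokes it once per thread. In GCC the outlined function roughly receives a pointer to such a record and the launch goes through libgomp; here the "threads" simply run one after another.

```c
#include <assert.h>

/* Record holding everything the region body reads or writes
   (hypothetical layout; the compiler builds a comparable record). */
struct region_data {
    const int *a;
    int n;
    long sum;
};

/* Outlined region body: each thread handles one contiguous chunk. */
static void region_body(struct region_data *d, int tid, int nthreads)
{
    int chunk = (d->n + nthreads - 1) / nthreads;
    int lo = tid * chunk;
    int hi = (lo + chunk < d->n) ? lo + chunk : d->n;
    long partial = 0;
    for (int i = lo; i < hi; ++i)
        partial += d->a[i];
    d->sum += partial;   /* real threads would combine partials atomically */
}

/* Stand-in for the runtime launch: runs the "threads" sequentially. */
static void launch_region(struct region_data *d, int nthreads)
{
    for (int tid = 0; tid < nthreads; ++tid)
        region_body(d, tid, nthreads);
}

/* Conceptually what '#pragma omp parallel for reduction(+:sum)' over
   a[0..n) becomes: pack the data, launch the outlined body, read back. */
long sum_array(const int *a, int n)
{
    assert(n >= 0);
    struct region_data d = { a, n, 0 };
    launch_region(&d, 4);
    return d.sum;
}
```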
  • 8. Device Compilation
    Host Side
    • Write entry functions + global variables into offload sections – once per TU, or once in total (LTO) → libgcc/offloadstuff.c + omp-offload.cc
    Device Side
    • The driver calls mkoffload for every target, which:
      • calls the device lto1 and linker
      • generates a host-side constructor to register the target and its global variables/entry functions
    • Offload code is in the data section of the resulting ELF
    Optimization Issues
    • The split between the offload-function table and the offload device code requires force_node → missed optimizations
    • Optimizations: constant propagation into functions, inlining, target-side LTO
    libgomp
    • Loads libgomp-plugin* to check for available devices
    • Plugin libraries hide the details of target-specific code
    https://gcc.gnu.org/wiki/Offloading (‘info gcc’ / GCC manual)
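The host-side bookkeeping can be pictured with a small sketch. Everything in it is hypothetical (the struct, `register_image`, the table names); it only illustrates the idea that the compiler emits tables of entry-function and global-variable addresses in a fixed order, and that a generated constructor hands them to the runtime before `main` runs. It relies on GCC's `constructor` attribute. Because the tables take the address of every entry function, those functions must be kept even if nothing else uses them, which is the kind of constraint behind the force_node remark above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the entry functions and globals of one translation unit. */
static void entry_fn0(void) {}
static void entry_fn1(void) {}
static int global_var;

/* Host-side tables; index i must denote the same entity on the device. */
static void (*const func_table[])(void) = { entry_fn0, entry_fn1 };
static void *const var_table[] = { &global_var };

/* Hypothetical descriptor passed to the runtime. */
struct offload_image {
    void (*const *funcs)(void);
    size_t n_funcs;
    void *const *vars;
    size_t n_vars;
};

static const struct offload_image image = {
    func_table, sizeof func_table / sizeof func_table[0],
    var_table,  sizeof var_table / sizeof var_table[0],
};

/* Hypothetical runtime registry (the real bookkeeping lives in libgomp). */
#define MAX_IMAGES 8
static const struct offload_image *registered[MAX_IMAGES];
static size_t n_registered;

static void register_image(const struct offload_image *img)
{
    assert(n_registered < MAX_IMAGES);
    registered[n_registered++] = img;
}

/* Plays the role of the constructor mkoffload generates: runs before main(). */
__attribute__((constructor))
static void register_offload_image(void)
{
    register_image(&image);
}
```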
  • 9. Parsing Once for All Devices
    Single Parsing
    • Single parsing and tree handling for the host and all devices → consistent state, late decision about which devices to offload to
    • But: C++ with vs. without exceptions is decided in the front end, while it is target-dependent
    • Hard to insert special math functions for devices; compare LLVM’s __clang_cuda_math.h:
        int abs(int __a) { return __nv_abs(__a); }
    • More complex to implement metadirectives or code generation for functions targeting only a specific device
    • Handling feature differences is hard: exception support, vectorization lengths, SIMD vs. SIMT, ...
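Within OpenMP itself, one user-level tool for per-device specializations is `declare variant` with a `device` context selector. A minimal sketch with hypothetical function names, assuming GCC's `arch(nvptx)` selector value for Nvidia offloading: when the code is compiled for an nvptx offload target, calls to `dev_abs` resolve to `dev_abs_nvptx`; on the host (and on non-matching devices) the base function is used. Without `-fopenmp` the directives are ignored and `dev_abs` is an ordinary function.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#pragma omp declare target
/* Hypothetical device-specific version; a real one might call a vendor
   intrinsic instead of this portable expression. */
int dev_abs_nvptx(int x)
{
    return x < 0 ? -x : x;
}

/* Base version: used on the host and on any device that does not match. */
#pragma omp declare variant (dev_abs_nvptx) match (device = {arch (nvptx)})
int dev_abs(int x)
{
    assert(x != INT_MIN);   /* abs(INT_MIN) is undefined */
    return abs(x);
}
#pragma omp end declare target
```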
  • 10. Levels of Parallelism for Offloading
    • All three levels are used: teams, threads and simd
    • SIMD / vectorized loops map to threads/work items
    • teams + parallel map to warps/wavefronts
    • OpenMP teams uses a thread pool of size #teams
    https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html
    Other compilers
    • GCC: teams, parallel, simd
    • LLVM/Clang: teams, parallel (under development: teams, parallel, simd)
    • AMD: teams, parallel
    • HPE/Cray: teams, parallel or simd
    • Nvidia: teams, parallel
    • Intel: teams, parallel, simd
    https://www.openmp.org/events/2022-ecp-community-bof-days/
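A loop that requests all three levels might look like the sketch below (hypothetical `saxpy` helper). Per the slide, in GCC the simd lanes become individual GPU threads/work-items, while teams and OpenMP threads are built from warps/wavefronts; the numbers of teams and threads are left to the implementation here.

```c
#include <assert.h>

/* y = a*x + y, distributed over teams, then threads, then SIMD lanes. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma omp target teams distribute parallel for simd \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```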
  • 11. OpenMP in GCC 13 Updates
  • 12. OpenMP Progress
    [Chart: implementation progress* of OpenMP 5.0, 5.1 and 5.2 features in GCC 9, 10, 11, 12 and 13]
    * Counted from the implementation-status lines at https://gcc.gnu.org/projects/gomp/ → Implementation Status
    Pending work (incomplete)
    OpenMP 5.0 (10 “no” items)
    • Metadirectives (WIP/pending patch)
    • Declare mapper (WIP/pending patch)
    • Array shaping/noncontiguous arrays (todo)
    OpenMP 5.1 (20 “no” items)
    • Interop/dispatch (todo)
    • Assume (todo)
    Offload related
    • Unified shared memory (WIP/pending patches)
    All
    • OMPD (debugging) – WIP by Egyptian master’s students
    • OMPT (tracing) – (todo)
    Lots of smaller & not-so-small items, plus 5.2 and TR11 looming
  • 13. OpenMP in GCC 13
    • OpenMP 5.0: 'requires' + reverse offload (WIP for actual device support)
    • Several patches in the review/revise cycle: mapper, metadirectives, memory handling (→ later), …
    • OpenMP 5.1: more omp_target_… routines, per-device-number environment variables, nowait on taskwait
    • OpenMP 5.2: clause renaming (+ extension for doacross), firstprivate/allocate on scope, omp_{initial,invalid}_device constants
    • Many smaller/minor items, bug fixes, … (127 GCC 13 commits related to OpenMP/OpenACC/nvptx/gcn)
    http://gcc.gnu.org/gcc-13/changes.html → OpenMP
    https://gcc.gnu.org/projects/gomp/
  • 14. OpenACC in GCC Update
  • 15. GCC/OpenACC
    • OpenACC 2.6 support
    • Code offloading to AMD (GCN) and Nvidia (nvptx) GPUs
    <https://www.openacc.org/>
    OpenACC – Fortran example
        !$acc parallel loop &
        !$acc independent collapse(2) &
        !$acc copyin(A,B) copyout(C)
        do i = 1, N
          do j = 1, N
            block
              real :: sum
              sum = 0
              !$acc loop reduction(+:sum)
              do k = 1, N
                sum = sum + A(k,i)*B(j,k)
              end do
              C(j,i) = sum
            end block
          end do
        end do
        !$acc end parallel loop
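For comparison with the OpenMP version, the same OpenACC kernel transcribed to C (a sketch; the function name `matmul_acc` is hypothetical). Fortran's column-major A(k,i), B(j,k) and C(j,i) correspond to row-major A[i][k], B[k][j] and C[i][j], so the body reads as an ordinary row-major product. Without `-fopenacc` the directives are ignored and the loops run serially.

```c
#include <assert.h>

/* C = A * B for n x n row-major matrices, following the Fortran example. */
void matmul_acc(int n, const float *restrict A, const float *restrict B,
                float *restrict C)
{
    assert(n > 0);
    #pragma acc parallel loop independent collapse(2) \
        copyin(A[0:n*n], B[0:n*n]) copyout(C[0:n*n])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            #pragma acc loop reduction(+:sum)
            for (int k = 0; k < n; ++k)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```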
  • 16. GCC/OpenACC: GCC 12+ changes
    • OpenACC worker parallelism for AMD GPUs
      • 'gcc/omp-oacc-neuter-broadcast.cc'
      • Execution state changes (neutering/broadcasting) as a GCC middle-end transformation
      • Different approach from nvptx, where it all happens in the back end
    • Bug fixing (such as OpenACC specification adherence), for example:
      • Data privatization/sharing at the OpenACC gang level: use GCN LDS, nvptx '.shared' memory
      • OpenACC/Fortran: strided array sections and components of derived-type arrays
      • OpenACC 'async' correctness
      • The usual miscellanea
    • Code generation optimizations: middle end as well as GCN, nvptx back ends
    • Diagnostics: '-Wopenacc-parallelism' to diagnose potentially suboptimal choices of OpenACC parallelism
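A sketch of a loop nest that uses gang and worker parallelism explicitly (hypothetical `row_sums` helper); per the slide, the worker level is now also available on AMD GPUs with GCC 12+. The accumulator `s` is declared at gang level, i.e. it is private to each gang; gang-level privatization is what the fixes listed above concern, although where such a variable ends up (registers, GCN LDS or nvptx `.shared` memory) is the compiler's decision. Compiling with `-fopenacc -Wopenacc-parallelism` would let GCC report parallelism choices it considers suboptimal.

```c
#include <assert.h>

/* out[r] = sum of row r of a rows x cols row-major matrix. */
void row_sums(int rows, int cols, const float *m, float *out)
{
    assert(rows >= 0 && cols >= 0);
    #pragma acc parallel loop gang copyin(m[0:rows*cols]) copyout(out[0:rows])
    for (int r = 0; r < rows; ++r) {
        float s = 0.0f;                 /* gang-private accumulator */
        #pragma acc loop worker vector reduction(+:s)
        for (int c = 0; c < cols; ++c)
            s += m[r*cols + c];
        out[r] = s;
    }
}
```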
  • 17. GCC/OpenACC: GCC 12+ changes
    • OpenACC 'kernels' work, part I
      • Decompose OpenACC 'kernels' constructs into parts, a sequence of compute constructs
      • 'gcc/omp-oacc-kernels-decompose.cc'
      • Bug fixing in master branch
    • OpenACC 'kernels' work, part II
      • Array access delinearization
      • Scalar data privatization
      • Analyze 'loop' constructs with 'auto' clause, decide 'seq' vs. 'independent'
      • Graphite
      • See talk at LPC¹, GNU Tools Track: OpenACC "kernels" improvements (Frederik Harwath)
        • <https://linuxplumbersconf.org/event/11/contributions/998/>
        • <https://youtu.be/zUw0ZVXCwoM?t=12304s>
      • Developed on a private branch, then integrated into the public og11/og12 branches; TODO: master branch
    • Revision and upstreaming of existing development branch work into GCC mainline
    ¹ Linux Plumbers Conference 2021, <https://linuxplumbersconf.org/>, virtual, week of 2021-09-20
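With a `kernels` construct the programmer only marks the region and the compiler decides what to parallelize. A minimal example (hypothetical function name): conceptually, the decomposition described above turns each of the two loop nests into its own compute construct, executed in sequence, and the `loop auto` annotation leaves the seq/independent decision to the compiler's analysis. Without `-fopenacc` the code runs serially.

```c
#include <assert.h>

/* x[i] = 2*x[i] + 1, written as two loops inside one kernels region. */
void scale_then_shift(int n, float *x)
{
    assert(n >= 0);
    #pragma acc kernels copy(x[0:n])
    {
        #pragma acc loop auto
        for (int i = 0; i < n; ++i)
            x[i] *= 2.0f;
        for (int i = 0; i < n; ++i)   /* no directive: analyzed by the compiler */
            x[i] += 1.0f;
    }
}
```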
  • 18. GCC/OpenACC: Next steps
    • More revision/upstreaming of existing development branch work into master branch
    • Complete features of OpenACC 2.6 and earlier
      • A few items listed here: <https://gcc.gnu.org/wiki/SummerOfCode#Selected_Project_Ideas>
      • Also listed: a few OpenACC ideas for GCC '-fanalyzer'
    • Implement features of OpenACC 2.7 and later
      • (… waiting to be scheduled ...)
    OpenACC 2.7, 3.0, 3.1 and 3.2 include, for example:
    • Lots of clarifications, specification bug fixes
    • Shared-memory devices, multicore CPU as a device
    • Arrays, subarrays and composite variables now allowed in 'reduction' clauses
    • C++ lambdas
    • Fortran 'do concurrent'
    • Device-to-device memory copying
    • Runtime error callback routines (based on OpenACC Profiling Interface callback routines)
  • 19. OpenMP Memory Management & Unified Memory
  • 20. OpenMP 5 Memory Allocators
    Mainline support
    • Basic support (API routines etc.) since GCC 12
    • libmemkind support in GCC 13
    New features added to the OG12* branch:
    • Low-latency memory (nvptx only, for now)
      • Up to 32K of local on-chip memory per team
      • AMD GCN support is planned
    • Pinned memory
    • Unified Shared Memory
      • Both amdgcn and nvptx
    • Allocator clauses and directives
    The patches are posted for review, but most are not yet accepted.
    * git branch: devel/omp/gcc-12
    L. Li (BNL), Manage OpenMP GPU Data Environment Under Unified Address Space (2018), https://doi.org/10.1007/978-3-319-98521-3_5
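From the user side, the allocator API looks like this. A sketch with hypothetical helper names, assuming GCC's `omp.h` (the OpenMP 5.0 memory-management routines) and guarded by `_OPENMP` so it also builds without `-fopenmp`. `omp_low_lat_mem_alloc` is the predefined low-latency allocator; where such memory is unavailable, for example in host code, the default fallback behavior returns ordinary memory instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Scratch buffer from the low-latency memory space when built with
   OpenMP; plain malloc otherwise. */
double *scratch_alloc(size_t n)
{
    assert(n > 0);
#ifdef _OPENMP
    return omp_alloc(n * sizeof(double), omp_low_lat_mem_alloc);
#else
    return malloc(n * sizeof(double));
#endif
}

/* Must be paired with scratch_alloc (same allocator). */
void scratch_free(double *p)
{
#ifdef _OPENMP
    omp_free(p, omp_low_lat_mem_alloc);
#else
    free(p);
#endif
}
```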
  • 21. Pinned Memory
    The proposed implementation uses mlock on Linux.
    • Works on all Linux systems.
    • Avoids page miss penalties.
    • But shows no performance boost on an unloaded system.
    Planned: pinned memory allocated by CUDA
    • Use cudaMallocHost when a CUDA device is present.
    • Same benefits for normal code.
    • Uses a faster code path within CUDA to benefit all systems.
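Requesting pinned memory goes through an allocator with the `pinned` trait. A sketch with hypothetical helper names, again guarded by `_OPENMP`: if the OpenMP library does not support the trait, `omp_init_allocator` may return `omp_null_allocator`, so the code falls back to the default allocator; without OpenMP it falls back to `malloc`. Whether the memory is actually page-locked depends on the libgomp in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>

/* Allocator requesting pinned (page-locked) memory, created on first use.
   Not thread-safe: call pinned_alloc once before entering parallel code. */
static omp_allocator_handle_t pinned_allocator(void)
{
    static int initialized;
    static omp_allocator_handle_t al;
    if (!initialized) {
        omp_alloctrait_t traits[] = {
            { omp_atk_pinned, omp_atv_true },
            { omp_atk_fallback, omp_atv_default_mem_fb },
        };
        al = omp_init_allocator(omp_default_mem_space, 2, traits);
        if (al == omp_null_allocator)      /* trait not supported */
            al = omp_default_mem_alloc;
        initialized = 1;
    }
    return al;
}
#endif

void *pinned_alloc(size_t bytes)
{
    assert(bytes > 0);
#ifdef _OPENMP
    return omp_alloc(bytes, pinned_allocator());
#else
    return malloc(bytes);
#endif
}

void pinned_free(void *p)
{
#ifdef _OPENMP
    omp_free(p, pinned_allocator());
#else
    free(p);
#endif
}
```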
  • 22. Unified Shared Memory
    USM uses the same memory address on both host and device.
    • No need to “map” data from host to device.
    • All calls to malloc/calloc/new/free etc. are intercepted.
      • So, all heap memory is shared.
      • Libgfortran allocations are also captured.
    • Stack and static data cannot be shared.
    • “Shared” memory is actually migrated automatically by the device driver on a page miss.
      • nvptx uses cudaMallocManaged.
      • AMD GCN uses “coarse-grained” memory.
      • AMD can also use “fine-grained” memory, in which the GPU accesses main memory via the bus.
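With unified shared memory, a pointer returned by `malloc` can be used directly in a target region without any `map` clause. A minimal sketch (hypothetical `fill_squares` helper), assuming a compiler that accepts the `unified_shared_memory` clause; older GCC releases may reject it with `-fopenmp`. Without a USM-capable device the region runs on the host, and without `-fopenmp` the pragmas are ignored.

```c
#include <assert.h>
#include <stdlib.h>

/* Must appear before any target construct in this translation unit. */
#pragma omp requires unified_shared_memory

/* buf comes straight from malloc; under USM the device can dereference it,
   so no map clause is needed. */
void fill_squares(int n, int *buf)
{
    assert(n >= 0);
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
        buf[i] = i * i;
}
```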
  • 23. AMD GCN Port Updates
  • 24. AMD GCN Port Updates
    GCC 12
    • Improved debug information (for use with ROCGDB)
    • 128-bit integer support (TImode)
    • Improved GPU parallelism
    GCC 13 development & OG12 branch
    • MI200 (gfx90a) support
    • Unified Shared Memory
    • SIMD routines (OpenMP "declare SIMD")
      • In-branch SIMD routine patch in review – target independent!
    • SIMD math routines
      • Aim to be able to vectorize calls to as much of libm as possible
      • Some commits already, more to follow soon.
    • Auto-SIMD for OpenMP parallel loops (soon)
      • Try to match the performance of other toolchains that do not require explicit "simd" directives
    • Multiple vector sizes
      • Currently 64 lanes only (fully maskable)
      • Soon add 32-, 16-, 8-, 4- and 2-lane vectors for those optimizers that can't (yet) use masks (e.g. SLP).
      • Implemented by adding masking in the back end.
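A sketch of the OpenMP construct behind the "SIMD routines" item (hypothetical `poly`/`apply_poly` names): `declare simd` asks the compiler to also emit vector variants of a function, so a `simd` loop can call it lane-wise instead of falling back to scalar calls. The `notinbranch` clause promises the function is never called under a condition inside such a loop; calls under a condition would need the masked in-branch variant that the patch mentioned above deals with.

```c
#include <assert.h>

#pragma omp declare target
/* Vector variants of poly() may be generated for use in simd loops. */
#pragma omp declare simd notinbranch
static float poly(float x)
{
    return x * x + 1.0f;
}
#pragma omp end declare target

/* out[i] = poly(in[i]); the call sits in the simd loop body unconditionally. */
void apply_poly(int n, const float *in, float *out)
{
    assert(n >= 0);
    #pragma omp target teams distribute parallel for simd \
        map(to: in[0:n]) map(from: out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = poly(in[i]);
}
```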
  • 25. nvptx Port Updates
  • 26. nvptx Port Updates
    • CUDA 11+ support
    • Bug fixes/PTX conformance, especially for newer GPU hardware
      • Also work around Nvidia PTX JIT bugs...
    • Always use own 'cuda.h' and 'dlopen("libcuda.so")'
    • Initial/experimental support for features of higher SM levels, PTX versions
      • For example: symbol aliasing, 'HFmode', ...
    • General PTX code generation improvements
    … by Tom de Vries, Roger Sayle, and us
  • 27. Conclusion
  • 28. Conclusion
    • Still lots of work to catch up with OpenMP 5.x/OpenACC 2.7+, plus performance, diagnostic and documentation improvements
    • But steady & large progress in the last year(s) for OpenMP, GCN, nvptx and OpenACC
    Q & A on the talk now
    BoF on concurrency topics, especially OpenACC/OpenMP/Offloading, directly afterwards upstairs
  • 29. Acknowledgement
    This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
    Disclaimer
    © Siemens 2022
    Subject to changes and errors. The information given in this document only contains general descriptions and/or performance features which may not always specifically reflect those described, or which may undergo modification in the course of further development of the products. The requested performance features are binding only when they are expressly agreed upon in the concluded contract.
    All product designations may be trademarks or other rights of Siemens AG, its affiliated companies or other companies whose use by third parties for their own purposes could violate the rights of the respective owner.