SlideShare a Scribd company logo
Evaluating GPU Programming Models for the LUMI Supercomputer
Supercomputing Frontiers Asia
March 1st, 2022
George S. Markomanolis, Aksel Alpay, Jeffrey Young, Michael Klemm, Nicholas Malaya, Aniello
Esposito, Jussi Heikonen, Sergei Bastrakov, Alexander Debus, Thomas Kluge, Klaus Steiniger, Jan
Stephan, Rene Widera and Michael Bussmann
Lead HPC Scientist, CSC – IT Center For Science Ltd.
Outline
• LUMI
• Programming Models
• PortingWorkflow
• Benchmarks and applications
• Hardware
• Results
2
3
AMD GPUs (MI100 example)
4
AMD MI100
Programming models - HIP
• HIP: Heterogeneous Interface for Portability is developed by AMD to program on AMD GPUs
• It is a C++ runtime API and it supports both AMD and NVIDIA platforms
• HIP is similar to CUDA and there is no performance overhead on NVIDIA GPUs
• Many well-known libraries have been ported on HIP
• New projects or porting from CUDA, could be developed directly in HIP
• The supported CUDAAPI is called with HIP prefix (cudamalloc -> hipmalloc)
• Hipify tools convert CUDA codes to HIP
5
Programming models - OpenMP Offloading
• Well known programming model, long history on CPUs
• GPU support since v4.0
• For AMD GPUs we use AMD OpenMP compiler (AOMP)
• With 120 compute units for MI100, the num_teams should be multiple of 120 and the thread_limits multiple
of 64 (wavefront in AMD)
6
Programming models - SYCL
• SYCL is an open standard for heterogeneous programming. It is developed and maintained by the Khronos
Group.
• It expresses heterogeneous data parallelism with pure C++. The latest SYCL version is SYCL 2020, which relies
on C++17.
• Starting with SYCL 2020, a generalized backend architecture was introduced that allows for other backends apart
from OpenCL. Backends used by current SYCL implementations include OpenCL, Level Zero, CUDA, HIP and
others.
• SYCL is using data parallel kernels, the execution is organized by a task graph and there are two memory
management models, the buffer-accessor and unified shared memory (USM) models.
• For this work we use the hipSYCL implementation, it supports NVIDIA/AMD/Intel GPUs among CPUs.
7
Programming models - Kokkos
• Kokkos Core implements a programming model in C++ for writing performance portable applications
targeting all major HPC platforms. It provides abstractions for both parallel execution of code and data
management. (ECP/NNSA)
• The Kokkos abstraction layer maps C++ source code to the specific instructions required for the backend
during build time. When compiling the source code, the binary will be built for the declared backends:
oSerial backend, for serial code on a host device.
o Host-parallel backend, which executes in parallel on the host device (OpenMP API, etc.).
o Device-parallel backend, which offloads on a device, such as a GPU.
8
Programming models - OpenACC
• GCC will provide OpenACC (Mentor Graphics contract, now called Siemens EDA). Checking functionality
• HPE is supporting OpenACC v2.7 for Fortran. This is quite old OpenACC version. HPE announced that they will
not support OpenACC for C/C++
• Clacc from ORNL: https://github.com/llvm-doe-org/llvm-project/tree/clacc/master OpenACC from LLVM only for
C (Fortran and C++ in the future)
oTranslate OpenACC to OpenMP Offloading
9
Programming models - Alpaka
• Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is a header-only C++14 abstraction library for
accelerator development.
• Alpaka enables an accelerator-agnostic implementation of hierarchical redundant parallelism, a user can specify data
and task parallelism at multiple levels of compute and memory for a particular platform.
• In addition to grids, blocks, and threads, alpaka also provides an element construct that represents an n-dimensional
set of inputs that is amenable to vectorization by the compiler.
• Platform decided at the compile time, single source interface
• C++ User interface for the Platform independent Library Alpaka (CUPLA) makes it easy to port CUDA codes
• Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU), SYCL etc.
10
Porting non-GPU codes to LUMI
11
Porting GPU codes to LUMI
12
Benchmarks and Applications (I)
• BabelStream
A memory bound benchmark with many programming models implemented and five kernels
oadd (a[i]=b[i]+c[i])
omult (a[i]=b * c[i])
ocopy (a[i]=b[i])
otriad (a[i]=b[i]+d*c[i])
odot (sum=sum+a[i]*b[i])
oThe default problem size is $2^{25}$ FP64 operations and 100 iterations. We are evaluating
BabelStream v3.4 (6fe81e1). We present also the first results of BabelStream with the Alpaka
backend
13
Benchmarks and Applications (II)
• miniBUDE
We use also the mini-app called miniBUDE for the Bristol University Docking Engine (BUDE), a kernel of a
drug discovery application that is compute bound.
BUDE predicts the binding energy of the ligand with the target, however, there are many ways this bonding could
happen, and a variety of positions and rotations of the ligand relative to protein, known as poses, are explored.
For each pose, a number of properties are evaluated.
oIt supports various programming models but we do evaluate mainly CUDA and HIP.
oWe use the version with commit 1af5b39
• Both BabelStream and miniBUDE support many programming models such as HIP, CUDA, OpenMP
Offloading, SYCL, OpenACC, Kokkos.
14
Methodology and Hardware
• Compiling based on the available instructions from benchmark/application
• Execute 10 times and plot boxplots
• Tune where necessary based on the number of compute units on the GPU, wavefront/warp.
15
Compilers/Software
16
Results - BabelStream
17
Copy kernel
• AOMP lower performance compared to the
other programming models
• The default Kokkos implementation in
BabelStream does not provide tuning
options.
• Finally, hipSYCL has a variation on A100
which is less than 2%, and Alpaka achieves
a performance similar to HIP for the
MI100
• MI100 is 26-34% slower than A100 and
15-37% faster than V100
• While MI100 achieves 97.8-99.99% of the
peak performance except the OpenMP
results, the NVIDIAA100 achieves 96.3-
98.5% and V100 89-98.2%
Results - BabelStream
18
Multiply kernel
• For MI100, most of the programming
models achieve 96.63% and above, except
OpenMP, while for A100 all the
programming models perform close to the
peak, however, for V100, the not tuned
Kokkos underperforms.
• For MI100, most programming models
perform 17.8-37.8% faster than V100, and
similarly, MI100 is 25-31% slower than
A100.
Results - BabelStream
19
Add kernel
• The performance of Alpaka programming
model is quite close to HIP/CUDA for all
the devices with hipSYCL following
• The OpenMP is less efficient on the MI100
compared to the rest GPUs. Finally,
Kokkos’ performance, seems to be between
Alpaka and hipSYCL.
• MI100 is 14.74-21.60% faster than V100
and 28.55-32.86 slower than A100
• MI100 achieves 94.52-99.99% of its peak
while the rest GPUs 97.6-99.99%.
Results - BabelStream
20
Triad kernel
• For this kernel Alpaka performs equally to
HIP/CUDA, with following Kokkos, and
then hipSYCL and OpenMP.
• MI100 is 28.93-32.81% slower than A100
and 18.63-21.61% faster than V100
• MI100 achieves 94.62-99.94% of the peak
while the NVIDIA GPUs achieve at least
98%.
Results - BabelStream
21
Dot kernel
• In the first segment of V100, we have also a plot of
MI100 for hipSYCL as they were too close these
values.
• MI100 GPU is around 2.69-14.62% faster than NVIDIA
V100 except for OpenMP, and 28.00-69.69% slower
than A100.
• For HIP/CUDA; We define 216 blocks of threads for
the A100, as it has 108 streaming multiprocessors, and
improved the dot kernel performance by 8-10%.
• For hipSYCL; we utilize 960 blocks of threads and 256
threads per block for MI100 to achieve around 8%
faster than the default values.
• For Alpaka; we are able to tune also the dot kernel with
720 blocks of threads for MI100 and improve its
performance by 28% comparing to the default settings.
• OpenMP underperforms for AOMP, almost 50% slower
than V100
BabelStream Summary
• The peak bandwidth percentage for the OpenMP programming model is 42.16% - 94.68% for MI100
while it is at least 96% for the NVIDIA GPUs which demonstrates that AOMP needs further
development.
• For Kokkos, the range is 74-99% for MI100, 72.87-99% for the NVIDIA GPUs, where the non-
optimized version has lower performance mainly on MI100 and V100.
• hipSYCL achieves at least 96% of the HIP/CUDA performance except for MI100 and dot kernel that
achieves 89.53%.
• Finally, Alpaka achieves at least 97.2% for all the cases and demonstrates its performance.
• The OpenMP compiler performs better for NVIDIA GPUs regarding dot kernel, however, the version
that we used for AMD GPUs, is not the final product yet.
• Overall, the Alpaka performance is quite close to HIP, followed by hipSYCL and Kokkos, while
OpenMP can perform slower depending on the kernel.
22
Results - MiniBUDE
23
• Large problem sizes for miniBUDE are required to be
able to saturate the GPUs.
• For every experiment, we execute 8 iterations with
983040 poses, we calculate this number by tuning for
AMD MI100
• The miniBUDE provides in the output the single
precision GFLOP/s, and we observe that the AMD
MI100 GPU achieves a performance close to A100 by
2% and around 26% over V100.
• Regarding single precision, we tested also the mixbench
benchmark and the MI100 was 1.16 times faster than
A100, achieving both close to their peak performance.
Conclusion
• We presented a workflow for porting applications to LUMI supercomputer, an AMD GPU-based system
• We utilize a benchmark and a mini-app, which are memory and compute-bound respectively. We illustrate how
various programming models perform on these GPUs and what techniques can improve the performance for specific
cases.
• We presented the first results of BabelStream with Alpaka backend
• We discuss the lack of performance on some aspects of OpenMP
• HIP/CUDA perform quite good and most of the programming models are quite close, depending on the kernel
pattern. However, the users should be aware about other programming models either ones that underperform in
some cases like OpenMP for some patterns or the learning curve of Kokkos tuning
• Finally, programming models such as Alpaka and SYCL could be utilized as they support many backends, are
portable and for many kernels they provide similar performance to HIP.
• Support reproducibility, find how to reproduce experiments on NVIDIA V100/100, and AMD MI100:
https://zenodo.org/record/6307447
24
Future work
• We investigate to identify what the issue with some programming models and miniBUDE is.
• We want to analyze the OpenACC performance from the GCC and Cray Fortran compiler
• Test the new functionalities from the ROCm platform such as Heterogeneous Memory Management (HMM)
• We envision evaluating multi-GPU benchmarks and their scalability across multiple nodes
• By using LUMI we will be able to use the MI250X GPU and compare it with the current GPU generation, among
testing the packed FP32
• Finally, we are interested in benchmarking I/O from the GPUs memory as we have already hipified Elbencho
benchmark
25
facebook.com/CSCfi
twitter.com/CSCfi
youtube.com/CSCfi
linkedin.com/company/csc---it-center-for-science
Kuvat CSC:n arkisto, Adobe Stock ja Thinkstock
github.com/CSCfi
George S. Markomanolis
CSC – IT Center for Science Ltd.
Lead HPC Scientist
georgios.markomanolis@csc.fi
26

More Related Content

What's hot

If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC Architecture
Allan Cantle
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor Architecture
AMD
 
Assembly Language Programming
Assembly Language ProgrammingAssembly Language Programming
Assembly Language Programming
Niropam Das
 
Introduction to ARM big.LITTLE technology
Introduction to ARM big.LITTLE technologyIntroduction to ARM big.LITTLE technology
Introduction to ARM big.LITTLE technology
義洋 顏
 
DDR4 SDRAM : Notes
DDR4 SDRAM : NotesDDR4 SDRAM : Notes
DDR4 SDRAM : Notes
Subhajit Sahu
 
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD
 
Pci express technology 3.0
Pci express technology 3.0Pci express technology 3.0
Pci express technology 3.0
Biddika Manjusree
 
Intel core i5
Intel core i5Intel core i5
Intel core i5
Abdul-Fattah Mahran
 
01 nand flash_reliability_notes
01 nand flash_reliability_notes01 nand flash_reliability_notes
01 nand flash_reliability_notes
swethamg18
 
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APUDelivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
AMD
 
Ch13.pdf
Ch13.pdfCh13.pdf
Ch13.pdf
Sourav Roy
 
It's Time to ROCm!
It's Time to ROCm!It's Time to ROCm!
It's Time to ROCm!
inside-BigData.com
 
Hardware-assisted Isolated Execution Environment to run trusted OS and applic...
Hardware-assisted Isolated Execution Environment to run trusted OS and applic...Hardware-assisted Isolated Execution Environment to run trusted OS and applic...
Hardware-assisted Isolated Execution Environment to run trusted OS and applic...
Kuniyasu Suzaki
 
System On Chip
System On ChipSystem On Chip
System On Chip
A B Shinde
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
jtsagata
 
DFX Architecture for High-performance Multi-core Microprocessors
DFX Architecture for High-performance Multi-core MicroprocessorsDFX Architecture for High-performance Multi-core Microprocessors
DFX Architecture for High-performance Multi-core Microprocessors
Ishwar Parulkar
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
George Markomanolis
 
Qemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System EmulationQemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System Emulation
National Cheng Kung University
 
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Altera Corporation
 
System on chip architectures
System on chip architecturesSystem on chip architectures
System on chip architectures
A B Shinde
 

What's hot (20)

If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC Architecture
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor Architecture
 
Assembly Language Programming
Assembly Language ProgrammingAssembly Language Programming
Assembly Language Programming
 
Introduction to ARM big.LITTLE technology
Introduction to ARM big.LITTLE technologyIntroduction to ARM big.LITTLE technology
Introduction to ARM big.LITTLE technology
 
DDR4 SDRAM : Notes
DDR4 SDRAM : NotesDDR4 SDRAM : Notes
DDR4 SDRAM : Notes
 
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
AMD and the new “Zen” High Performance x86 Core at Hot Chips 28
 
Pci express technology 3.0
Pci express technology 3.0Pci express technology 3.0
Pci express technology 3.0
 
Intel core i5
Intel core i5Intel core i5
Intel core i5
 
01 nand flash_reliability_notes
01 nand flash_reliability_notes01 nand flash_reliability_notes
01 nand flash_reliability_notes
 
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APUDelivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
 
Ch13.pdf
Ch13.pdfCh13.pdf
Ch13.pdf
 
It's Time to ROCm!
It's Time to ROCm!It's Time to ROCm!
It's Time to ROCm!
 
Hardware-assisted Isolated Execution Environment to run trusted OS and applic...
Hardware-assisted Isolated Execution Environment to run trusted OS and applic...Hardware-assisted Isolated Execution Environment to run trusted OS and applic...
Hardware-assisted Isolated Execution Environment to run trusted OS and applic...
 
System On Chip
System On ChipSystem On Chip
System On Chip
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
DFX Architecture for High-performance Multi-core Microprocessors
DFX Architecture for High-performance Multi-core MicroprocessorsDFX Architecture for High-performance Multi-core Microprocessors
DFX Architecture for High-performance Multi-core Microprocessors
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Qemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System EmulationQemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System Emulation
 
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
 
System on chip architectures
System on chip architecturesSystem on chip architectures
System on chip architectures
 

Similar to Evaluating GPU programming Models for the LUMI Supercomputer

Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
George Markomanolis
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
George Markomanolis
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
Edge AI and Vision Alliance
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
Linaro
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
inside-BigData.com
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu Das
AMD Developer Central
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
Unai Lopez-Novoa
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Facultad de Informática UCM
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
Ganesan Narayanasamy
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Embarcados
 
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfOpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
ssuser866937
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
AdamRobertsIBM
 
Efabless Marketplace webinar slides 2024
Efabless Marketplace webinar slides 2024Efabless Marketplace webinar slides 2024
Efabless Marketplace webinar slides 2024
Nobin Mathew
 
Hands on OpenCL
Hands on OpenCLHands on OpenCL
Hands on OpenCL
Vladimir Starostenkov
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
AMD Developer Central
 
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Cesar Maciel
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
Adam Roberts
 
OpenCAPI Technology Ecosystem
OpenCAPI Technology EcosystemOpenCAPI Technology Ecosystem
OpenCAPI Technology Ecosystem
Ganesan Narayanasamy
 
Codasip application class RISC-V processor solutions
Codasip application class RISC-V processor solutionsCodasip application class RISC-V processor solutions
Codasip application class RISC-V processor solutions
RISC-V International
 

Similar to Evaluating GPU programming Models for the LUMI Supercomputer (20)

Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu Das
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
 
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfOpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
Efabless Marketplace webinar slides 2024
Efabless Marketplace webinar slides 2024Efabless Marketplace webinar slides 2024
Efabless Marketplace webinar slides 2024
 
Hands on OpenCL
Hands on OpenCLHands on OpenCL
Hands on OpenCL
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
 
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
 
OpenCAPI Technology Ecosystem
OpenCAPI Technology EcosystemOpenCAPI Technology Ecosystem
OpenCAPI Technology Ecosystem
 
Codasip application class RISC-V processor solutions
Codasip application class RISC-V processor solutionsCodasip application class RISC-V processor solutions
Codasip application class RISC-V processor solutions
 
CUDA
CUDACUDA
CUDA
 

More from George Markomanolis

Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
George Markomanolis
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
George Markomanolis
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
George Markomanolis
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
George Markomanolis
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
George Markomanolis
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
George Markomanolis
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
George Markomanolis
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
George Markomanolis
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
George Markomanolis
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
George Markomanolis
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
George Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
George Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
George Markomanolis
 

More from George Markomanolis (15)

Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Recently uploaded

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 

Recently uploaded (20)

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

Evaluating GPU programming Models for the LUMI Supercomputer

  • 1. Evaluating GPU Programming Models for the LUMI Supercomputer Supercomputing Frontiers Asia March 1st, 2022 George S. Markomanolis, Aksel Alpay, Jeffrey Young, Michael Klemm, Nicholas Malaya, Aniello Esposito, Jussi Heikonen, Sergei Bastrakov, Alexander Debus, Thomas Kluge, Klaus Steiniger, Jan Stephan, Rene Widera and Michael Bussmann Lead HPC Scientist, CSC – IT Center For Science Ltd.
  • 2. Outline • LUMI • Programming Models • PortingWorkflow • Benchmarks and applications • Hardware • Results 2
  • 3. 3
  • 4. AMD GPUs (MI100 example) 4 AMD MI100
  • 5. Programming models - HIP • HIP: Heterogeneous Interface for Portability is developed by AMD to program on AMD GPUs • It is a C++ runtime API and it supports both AMD and NVIDIA platforms • HIP is similar to CUDA and there is no performance overhead on NVIDIA GPUs • Many well-known libraries have been ported on HIP • New projects or porting from CUDA, could be developed directly in HIP • The supported CUDAAPI is called with HIP prefix (cudamalloc -> hipmalloc) • Hipify tools convert CUDA codes to HIP 5
  • 6. Programming models - OpenMP Offloading • Well known programming model, long history on CPUs • GPU support since v4.0 • For AMD GPUs we use AMD OpenMP compiler (AOMP) • With 120 compute units for MI100, the num_teams should be multiple of 120 and the thread_limits multiple of 64 (wavefront in AMD) 6
  • 7. Programming models - SYCL • SYCL is an open standard for heterogeneous programming. It is developed and maintained by the Khronos Group. • It expresses heterogeneous data parallelism with pure C++. The latest SYCL version is SYCL 2020, which relies on C++17. • Starting with SYCL 2020, a generalized backend architecture was introduced that allows for other backends apart from OpenCL. Backends used by current SYCL implementations include OpenCL, Level Zero, CUDA, HIP and others. • SYCL is using data parallel kernels, the execution is organized by a task graph and there are two memory management models, the buffer-accessor and unified shared memory (USM) models. • For this work we use the hipSYCL implementation, it supports NVIDIA/AMD/Intel GPUs among CPUs. 7
  • 8. Programming models - Kokkos • Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platforms. It provides abstractions for both parallel execution of code and data management. (ECP/NNSA) • The Kokkos abstraction layer maps C++ source code to the specific instructions required for the backend during build time. When compiling the source code, the binary will be built for the declared backends: oSerial backend, for serial code on a host device. o Host-parallel backend, which executes in parallel on the host device (OpenMP API, etc.). o Device-parallel backend, which offloads on a device, such as a GPU. 8
  • 9. Programming models - OpenACC • GCC will provide OpenACC (Mentor Graphics contract, now called Siemens EDA). Checking functionality • HPE is supporting OpenACC v2.7 for Fortran. This is quite old OpenACC version. HPE announced that they will not support OpenACC for C/C++ • Clacc from ORNL: https://github.com/llvm-doe-org/llvm-project/tree/clacc/master OpenACC from LLVM only for C (Fortran and C++ in the future) oTranslate OpenACC to OpenMP Offloading 9
  • 10. Programming models - Alpaka • Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is a header-only C++14 abstraction library for accelerator development. • Alpaka enables an accelerator-agnostic implementation of hierarchical redundant parallelism, a user can specify data and task parallelism at multiple levels of compute and memory for a particular platform. • In addition to grids, blocks, and threads, alpaka also provides an element construct that represents an n-dimensional set of inputs that is amenable to vectorization by the compiler. • Platform decided at the compile time, single source interface • C++ User interface for the Platform independent Library Alpaka (CUPLA) makes it easy to port CUDA codes • Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU), SYCL etc. 10
  • 11. Porting non-GPU codes to LUMI 11
  • 12. Porting GPU codes to LUMI 12
  • 13. Benchmarks and Applications (I) • BabelStream A memory bound benchmark with many programming models implemented and five kernels oadd (a[i]=b[i]+c[i]) omult (a[i]=b * c[i]) ocopy (a[i]=b[i]) otriad (a[i]=b[i]+d*c[i]) odot (sum=sum+a[i]*b[i]) oThe default problem size is $2^{25}$ FP64 operations and 100 iterations. We are evaluating BabelStream v3.4 (6fe81e1). We present also the first results of BabelStream with the Alpaka backend 13
  • 14. Benchmarks and Applications (II) • miniBUDE We use also the mini-app called miniBUDE for the Bristol University Docking Engine (BUDE), a kernel of a drug discovery application that is compute bound. BUDE predicts the binding energy of the ligand with the target, however, there are many ways this bonding could happen, and a variety of positions and rotations of the ligand relative to protein, known as poses, are explored. For each pose, a number of properties are evaluated. oIt supports various programming models but we do evaluate mainly CUDA and HIP. oWe use the version with commit 1af5b39 • Both BabelStream and miniBUDE support many programming models such as HIP, CUDA, OpenMP Offloading, SYCL, OpenACC, Kokkos. 14
  • 15. Methodology and Hardware • Compiling based on the available instructions from benchmark/application • Execute 10 times and plot boxplots • Tune where necessary based on the number of compute units on the GPU, wavefront/warp. 15
  • 17. Results - BabelStream 17 Copy kernel • AOMP lower performance compared to the other programming models • The default Kokkos implementation in BabelStream does not provide tuning options. • Finally, hipSYCL has a variation on A100 which is less than 2%, and Alpaka achieves a performance similar to HIP for the MI100 • MI100 is 26-34% slower than A100 and 15-37% faster than V100 • While MI100 achieves 97.8-99.99% of the peak performance except the OpenMP results, the NVIDIAA100 achieves 96.3- 98.5% and V100 89-98.2%
  • 18. Results - BabelStream 18 Multiply kernel • For MI100, most of the programming models achieve 96.63% and above, except OpenMP, while for A100 all the programming models perform close to the peak, however, for V100, the not tuned Kokkos underperforms. • For MI100, most programming models perform 17.8-37.8% faster than V100, and similarly, MI100 is 25-31% slower than A100.
  • 19. Results - BabelStream 19 Add kernel • The performance of Alpaka programming model is quite close to HIP/CUDA for all the devices with hipSYCL following • The OpenMP is less efficient on the MI100 compared to the rest GPUs. Finally, Kokkos’ performance, seems to be between Alpaka and hipSYCL. • MI100 is 14.74-21.60% faster than V100 and 28.55-32.86 slower than A100 • MI100 achieves 94.52-99.99% of its peak while the rest GPUs 97.6-99.99%.
  • 20. Results - BabelStream 20 Triad kernel • For this kernel Alpaka performs equally to HIP/CUDA, with following Kokkos, and then hipSYCL and OpenMP. • MI100 is 28.93-32.81% slower than A100 and 18.63-21.61% faster than V100 • MI100 achieves 94.62-99.94% of the peak while the NVIDIA GPUs achieve at least 98%.
  • 21. Results - BabelStream 21 Dot kernel • In the first segment of V100, we have also a plot of MI100 for hipSYCL as they were too close these values. • MI100 GPU is around 2.69-14.62% faster than NVIDIA V100 except for OpenMP, and 28.00-69.69% slower than A100. • For HIP/CUDA; We define 216 blocks of threads for the A100, as it has 108 streaming multiprocessors, and improved the dot kernel performance by 8-10%. • For hipSYCL; we utilize 960 blocks of threads and 256 threads per block for MI100 to achieve around 8% faster than the default values. • For Alpaka; we are able to tune also the dot kernel with 720 blocks of threads for MI100 and improve its performance by 28% comparing to the default settings. • OpenMP underperforms for AOMP, almost 50% slower than V100
  • 22. BabelStream Summary • The peak bandwidth percentage for the OpenMP programming model is 42.16% - 94.68% for MI100 while it is at least 96% for the NVIDIA GPUs which demonstrates that AOMP needs further development. • For Kokkos, the range is 74-99% for MI100, 72.87-99% for the NVIDIA GPUs, where the non- optimized version has lower performance mainly on MI100 and V100. • hipSYCL achieves at least 96% of the HIP/CUDA performance except for MI100 and dot kernel that achieves 89.53%. • Finally, Alpaka achieves at least 97.2% for all the cases and demonstrates its performance. • The OpenMP compiler performs better for NVIDIA GPUs regarding dot kernel, however, the version that we used for AMD GPUs, is not the final product yet. • Overall, the Alpaka performance is quite close to HIP, followed by hipSYCL and Kokkos, while OpenMP can perform slower depending on the kernel. 22
  • 23. Results - MiniBUDE 23 • Large problem sizes for miniBUDE are required to be able to saturate the GPUs. • For every experiment, we execute 8 iterations with 983040 poses, we calculate this number by tuning for AMD MI100 • The miniBUDE provides in the output the single precision GFLOP/s, and we observe that the AMD MI100 GPU achieves a performance close to A100 by 2% and around 26% over V100. • Regarding single precision, we tested also the mixbench benchmark and the MI100 was 1.16 times faster than A100, achieving both close to their peak performance.
  • 24. Conclusion • We presented a workflow for porting applications to LUMI supercomputer, an AMD GPU-based system • We utilize a benchmark and a mini-app, which are memory and compute-bound respectively. We illustrate how various programming models perform on these GPUs and what techniques can improve the performance for specific cases. • We presented the first results of BabelStream with Alpaka backend • We discuss the lack of performance on some aspects of OpenMP • HIP/CUDA perform quite good and most of the programming models are quite close, depending on the kernel pattern. However, the users should be aware about other programming models either ones that underperform in some cases like OpenMP for some patterns or the learning curve of Kokkos tuning • Finally, programming models such as Alpaka and SYCL could be utilized as they support many backends, are portable and for many kernels they provide similar performance to HIP. • Support reproducibility, find how to reproduce experiments on NVIDIA V100/100, and AMD MI100: https://zenodo.org/record/6307447 24
  • 25. Future work • We investigate to identify what the issue with some programming models and miniBUDE is. • We want to analyze the OpenACC performance from the GCC and Cray Fortran compiler • Test the new functionalities from the ROCm platform such as Heterogeneous Memory Management (HMM) • We envision evaluating multi-GPU benchmarks and their scalability across multiple nodes • By using LUMI we will be able to use the MI250X GPU and compare it with the current GPU generation, among testing the packed FP32 • Finally, we are interested in benchmarking I/O from the GPUs memory as we have already hipified Elbencho benchmark 25
  • 26. facebook.com/CSCfi twitter.com/CSCfi youtube.com/CSCfi linkedin.com/company/csc---it-center-for-science Kuvat CSC:n arkisto, Adobe Stock ja Thinkstock github.com/CSCfi George S. Markomanolis CSC – IT Center for Science Ltd. Lead HPC Scientist georgios.markomanolis@csc.fi 26