SlideShare a Scribd company logo
1 of 24
Download to read offline
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Exploring Programming Models for LUMI Supercomputer
ECMWF HPC Workshop on Meteorology, September 2021
George S. Markomanolis
Lead HPC Scientist, CSC – IT Center for Science Ltd. 1
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
AMD GPUs (MI100 example)
2
LUMI will have the next
generation of AMD
Instinct GPU
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Introduction to HIP
• Radeon Open Compute Platform (ROCm)
• HIP: Heterogeneous Interface for Portability is developed by AMD to
program on AMD GPUs
• It is a C++ runtime API and it supports both AMD and NVIDIA platforms
• HIP is similar to CUDA and there is no performance overhead on NVIDIA
GPUs
• Many well-known libraries have been ported on HIP
• New projects or porting from CUDA, could be developed directly in HIP
• The supported CUDAAPI is called with HIP prefix (cudamalloc ->
hipmalloc)
https://github.com/ROCm-Developer-Tools/HIP
3
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
ECMWF Atlas
• Atlas is an ECMWF library for parallel data-structures supporting unstructured grids and function spaces
cd src
hipconvertinplace-perl.sh .
…
info: TOTAL-converted 92 CUDA->HIP refs ( error:14 init:0 version:0 device:12 context:0 module:0
memory:17 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0
execution:0 graph:0
…
kernels (5 total) : unpack_kernel(1) kernel_multiblock(1) kernel_ex(1) loop_kernel_ex(1) kernel_exe(1)
hipSuccess 15
hipDeviceSynchronize 12
hipLaunchKernelGGL 10
hipGetLastError 8
…
hipMemcpyDeviceToHost 1
• The code is hipified and now it is only C++ with HIP and Fortran with OpenACC.
4
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Benchmark MatMul cuBLAS, hipBLAS
• Use the benchmark https://github.com/pc2/OMP-Offloading
• Matrix multiplication of 2048 x 2048, single precision
• All the CUDA calls were converted and it was linked with hipBlas
5
0
5000
10000
15000
20000
25000
V100 MI100
GFLOP/s
GPU
Matrix Multiplication (SP)
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
BabelStream
6
• A memory bound benchmark from the university of Bristol
• Five kernels
o add (a[i]=b[i]+c[i])
o multiply (a[i]=b*c[i])
o copy (a[i]=b[i])
o triad (a[i]=b[i]+d*c[i])
o dot (sum = sum+d*c[i])
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Improving OpenMP performance on BabelStream for MI100
• Original call:
#pragma omp target teams distribute parallel for simd
• Optimized call
#pragma omp target teams distribute parallel for simd thread_limit(256) num_teams(240)
• For the dot kernel we used 720 teams
7
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Fortran
• First Scenario: Fortran + CUDA C/C++
oAssuming there is no CUDA code in the Fortran files.
oHipify CUDA
oCompile and link with hipcc
• Second Scenario: CUDA Fortran
oThere is no hipify equivalent but there is another approach…
oHIP functions are callable from C, using `extern C`
oSee hipfort
8
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Hipfort
• We need to use hipfort, a Fortran interface library for GPU kernel *
• Steps:
1) We write the kernels in a new C++ file
2) Wrap the kernel launch in a C function
3) Use Fortran 2003 C binding to call the function
4) Things could change in the future (see GPUFORT)
• Example of Fortran with HIP:
https://github.com/cschpc/lumi/tree/main/hipfort
• Use OpenMP offload to GPUs
* https://github.com/ROCmSoftwarePlatform/hipfort
9
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc 10
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
GPUFort – Fortran with OpenACC (1/2)
11
Ifdef original file
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
GPUFort – Fortran with OpenACC (2/2)
12
Extern C routine Kernel
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
OpenACC
• GCC will provide OpenACC (Mentor Graphics contract, now called
Siemens EDA). Checking functionality
• HPE is supporting for OpenACC v2.0 for Fortran. This is quite old
OpenACC version. HPE announced support for OpenACC, newer
versions for all the main programming languages (Fortran/C/C++)
• Clacc from ORNL: https://github.com/llvm-doe-org/llvm-
project/tree/clacc/master OpenACC from LLVM only for C (Fortran and
C++ in the future)
oTranslate OpenACC to OpenMP Offloading
• If the code is in Fortran, we could use GPUFort
13
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Programming Models
• We have utilized with success at least the following programming
models/interfaces on AMD MI100 GPU:
oHIP
oOpenMP Offloading
ohipSYCL
oKokkos
oAlpaka
14
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
SYCL (hipSYCL)
• C++ Single-source Heterogeneous Programming for Acceleration Offload
• Generic programming with templates and lambda functions
• Big momentum currently, NERSC, ALCF, Codeplay partnership
• SYCL 2020 specification was announced early 2021
• Terminology: Unified Shared Memory (USM), buffer, accessor, data
movement, queue
• hipSYCL supports CPU, AMD/NVIDIA GPUs, Intel GPU (experimental)
15
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Kokkos
• Kokkos Core implements a programming model in C++ for writing
performance portable applications targeting all major HPC platforms.
It provides abstractions for both parallel execution of code and data
management. (ECP/NNSA)
• Terminology: view, execution space (serial, threads, OpenMP,
GPU,…), memory space (DRAM, NVRAM, …), pattern, policy
• Supports: CPU, AMD/NVIDIA GPUs, Intel KNL etc.
16
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Alpaka
• Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is
a header-only C++14 abstraction library for accelerator development.
Developed by HZDR.
• Similar to CUDA terminology, grid/block/thread plus element
• Platform decided at the compile time, single source interface
• Easy to port CUDA codes through CUPLA
• Terminology: queue (non/blocking), buffers, work division
• Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU) etc.
17
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
BabelStream Results
18
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Tuning
• Multiple wavefronts per compute unit (CU) is important to hide
latency and instruction throughput
• Tune number of threads per block, number of teams for OpenMP
offloading etc.
• Memory coalescing increases bandwidth
• Unrolling loops allow compiler to prefetch data
• Small kernels can cause latency overhead, adjust the workload
• Use of Local Data Share (LDS) memory
19
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Conclusion/Future work
• A code written in C/C++ and MPI+OpenMP is a bit easier to be ported to OpenMP
offloading compared to other approaches.
• The hipSYCL, Kokos, and Alpaka could be a good option considering that the code
is in C++.
• There can be challenges, depending on the code and what GPU functionalities are
integrated to an application
• It will be required to tune the code for high occupancy
• Track historical performance among new compilers
• GCC for OpenACC and OpenMP Offloading for AMD GPUs (issues will be solved
with GCC 12.x and LLVM 13.x)
• Tracking how profiling tools work on AMD GPUs (rocprof, TAU, HPCToolkit)
• We have trained more than 80 people on HIP porting: http://github.com/csc-
training/hip
20
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
www.lumi-supercomputer.eu
contact@lumi-supercomputer.eu
Follow us
Twitter: @LUMIhpc
LinkedIn: LUMI supercomputer
YouTube: LUMI supercomputer
georgios.markomanolis@csc.fi
CSC – IT Center for Science Ltd.
Lead HPC Scientist
George Markomanolis
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Porting Codes to LUMI (I)
22
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Porting Codes to LUMI (II)
23
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
N-BODY SIMULATION
• N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA) AllPairs_N2
• 171 CUDA calls converted to HIP without issues, close to 1000 lines of code
• 32768 number of small particles, 2000 time steps
• Tune the number of threads equal to 256 than 1024 default at ROCm 4.1
24
0
20
40
60
80
100
120
V100 MI100 MI100*
Seconds
GPU
N-Body Simulation

More Related Content

What's hot

P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
Open-NFP
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
inside-BigData.com
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
oholiab
 

What's hot (20)

Introduction to GPUs in HPC
Introduction to GPUs in HPCIntroduction to GPUs in HPC
Introduction to GPUs in HPC
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVE
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO Visor
 
Hands on OpenCL
Hands on OpenCLHands on OpenCL
Hands on OpenCL
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
 
Involvement in OpenHPC
Involvement in OpenHPC	Involvement in OpenHPC
Involvement in OpenHPC
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
 
Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_maps
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVE
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use Cases
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem	Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
 
ARM and Machine Learning
ARM and Machine LearningARM and Machine Learning
ARM and Machine Learning
 

Similar to Exploring the Programming Models for the LUMI Supercomputer

OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfOpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
ssuser866937
 
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
CEE-SEC(R)
 
Efabless Marketplace webinar slides 2024
Efabless Marketplace webinar slides 2024Efabless Marketplace webinar slides 2024
Efabless Marketplace webinar slides 2024
Nobin Mathew
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18
Yaoqing Gao
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 

Similar to Exploring the Programming Models for the LUMI Supercomputer (20)

AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows Azure
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfOpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
 
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CAD
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
 
Efabless Marketplace webinar slides 2024
Efabless Marketplace webinar slides 2024Efabless Marketplace webinar slides 2024
Efabless Marketplace webinar slides 2024
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
Porting To Symbian
Porting To SymbianPorting To Symbian
Porting To Symbian
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC session
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 

More from George Markomanolis

More from George Markomanolis (13)

Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Recently uploaded

Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
FIDO Alliance
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
Muhammad Subhan
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
panagenda
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
FIDO Alliance
 

Recently uploaded (20)

Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development Companies
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 

Exploring the Programming Models for the LUMI Supercomputer

  • 1. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Exploring Programming Models for LUMI Supercomputer ECMWF HPC Workshop on Meteorology, September 2021 George S. Markomanolis Lead HPC Scientist, CSC – IT Center for Science Ltd. 1
  • 2. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc AMD GPUs (MI100 example) 2 LUMI will have the next generation of AMD Instinct GPU
  • 3. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Introduction to HIP • Radeon Open Compute Platform (ROCm) • HIP: Heterogeneous Interface for Portability is developed by AMD to program on AMD GPUs • It is a C++ runtime API and it supports both AMD and NVIDIA platforms • HIP is similar to CUDA and there is no performance overhead on NVIDIA GPUs • Many well-known libraries have been ported on HIP • New projects or porting from CUDA, could be developed directly in HIP • The supported CUDAAPI is called with HIP prefix (cudamalloc -> hipmalloc) https://github.com/ROCm-Developer-Tools/HIP 3
  • 4. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc ECMWF Atlas • Atlas is an ECMWF library for parallel data-structures supporting unstructured grids and function spaces cd src hipconvertinplace-perl.sh . … info: TOTAL-converted 92 CUDA->HIP refs ( error:14 init:0 version:0 device:12 context:0 module:0 memory:17 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 … kernels (5 total) : unpack_kernel(1) kernel_multiblock(1) kernel_ex(1) loop_kernel_ex(1) kernel_exe(1) hipSuccess 15 hipDeviceSynchronize 12 hipLaunchKernelGGL 10 hipGetLastError 8 … hipMemcpyDeviceToHost 1 • The code is hipified and now it is only C++ with HIP and Fortran with OpenACC. 4
  • 5. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Benchmark MatMul cuBLAS, hipBLAS • Use the benchmark https://github.com/pc2/OMP-Offloading • Matrix multiplication of 2048 x 2048, single precision • All the CUDA calls were converted and it was linked with hipBlas 5 0 5000 10000 15000 20000 25000 V100 MI100 GFLOP/s GPU Matrix Multiplication (SP)
  • 6. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc BabelStream 6 • A memory bound benchmark from the university of Bristol • Five kernels o add (a[i]=b[i]+c[i]) o multiply (a[i]=b*c[i]) o copy (a[i]=b[i]) o triad (a[i]=b[i]+d*c[i]) o dot (sum = sum+d*c[i])
  • 7. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Improving OpenMP performance on BabelStream for MI100 • Original call: #pragma omp target teams distribute parallel for simd • Optimized call #pragma omp target teams distribute parallel for simd thread_limit(256) num_teams(240) • For the dot kernel we used 720 teams 7
  • 8. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Fortran • First Scenario: Fortran + CUDA C/C++ oAssuming there is no CUDA code in the Fortran files. oHipify CUDA oCompile and link with hipcc • Second Scenario: CUDA Fortran oThere is no hipify equivalent but there is another approach… oHIP functions are callable from C, using `extern C` oSee hipfort 8
  • 9. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Hipfort • We need to use hipfort, a Fortran interface library for GPU kernel * • Steps: 1) We write the kernels in a new C++ file 2) Wrap the kernel launch in a C function 3) Use Fortran 2003 C binding to call the function 4) Things could change in the future (see GPUFORT) • Example of Fortran with HIP: https://github.com/cschpc/lumi/tree/main/hipfort • Use OpenMP offload to GPUs * https://github.com/ROCmSoftwarePlatform/hipfort 9
  • 11. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc GPUFort – Fortran with OpenACC (1/2) 11 Ifdef original file
  • 12. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc GPUFort – Fortran with OpenACC (2/2) 12 Extern C routine Kernel
  • 13. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc OpenACC • GCC will provide OpenACC (Mentor Graphics contract, now called Siemens EDA). Checking functionality • HPE is supporting for OpenACC v2.0 for Fortran. This is quite old OpenACC version. HPE announced support for OpenACC, newer versions for all the main programming languages (Fortran/C/C++) • Clacc from ORNL: https://github.com/llvm-doe-org/llvm- project/tree/clacc/master OpenACC from LLVM only for C (Fortran and C++ in the future) oTranslate OpenACC to OpenMP Offloading • If the code is in Fortran, we could use GPUFort 13
  • 14. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Programming Models • We have utilized with success at least the following programming models/interfaces on AMD MI100 GPU: oHIP oOpenMP Offloading ohipSYCL oKokkos oAlpaka 14
  • 15. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc SYCL (hipSYCL) • C++ Single-source Heterogeneous Programming for Acceleration Offload • Generic programming with templates and lambda functions • Big momentum currently, NERSC, ALCF, Codeplay partnership • SYCL 2020 specification was announced early 2021 • Terminology: Unified Shared Memory (USM), buffer, accessor, data movement, queue • hipSYCL supports CPU, AMD/NVIDIA GPUs, Intel GPU (experimental) 15
  • 16. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Kokkos • Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platforms. It provides abstractions for both parallel execution of code and data management. (ECP/NNSA) • Terminology: view, execution space (serial, threads, OpenMP, GPU,…), memory space (DRAM, NVRAM, …), pattern, policy • Supports: CPU, AMD/NVIDIA GPUs, Intel KNL etc. 16
  • 17. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Alpaka • Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is a header-only C++14 abstraction library for accelerator development. Developed by HZDR. • Similar to CUDA terminology, grid/block/thread plus element • Platform decided at the compile time, single source interface • Easy to port CUDA codes through CUPLA • Terminology: queue (non/blocking), buffers, work division • Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU) etc. 17
  • 19. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Tuning • Multiple wavefronts per compute unit (CU) is important to hide latency and instruction throughput • Tune number of threads per block, number of teams for OpenMP offloading etc. • Memory coalescing increases bandwidth • Unrolling loops allow compiler to prefetch data • Small kernels can cause latency overhead, adjust the workload • Use of Local Data Share (LDS) memory 19
  • 20. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Conclusion/Future work • A code written in C/C++ and MPI+OpenMP is a bit easier to be ported to OpenMP offloading compared to other approaches. • The hipSYCL, Kokos, and Alpaka could be a good option considering that the code is in C++. • There can be challenges, depending on the code and what GPU functionalities are integrated to an application • It will be required to tune the code for high occupancy • Track historical performance among new compilers • GCC for OpenACC and OpenMP Offloading for AMD GPUs (issues will be solved with GCC 12.x and LLVM 13.x) • Tracking how profiling tools work on AMD GPUs (rocprof, TAU, HPCToolkit) • We have trained more than 80 people on HIP porting: http://github.com/csc- training/hip 20
  • 21. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc www.lumi-supercomputer.eu contact@lumi-supercomputer.eu Follow us Twitter: @LUMIhpc LinkedIn: LUMI supercomputer YouTube: LUMI supercomputer georgios.markomanolis@csc.fi CSC – IT Center for Science Ltd. Lead HPC Scientist George Markomanolis
  • 24. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc N-BODY SIMULATION • N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA) AllPairs_N2 • 171 CUDA calls converted to HIP without issues, close to 1000 lines of code • 32768 number of small particles, 2000 time steps • Tune the number of threads equal to 256 than 1024 default at ROCm 4.1 24 0 20 40 60 80 100 120 V100 MI100 MI100* Seconds GPU N-Body Simulation