Getting started with AMD GPUs
FOSDEM’21 HPC, Big Data and Data Science devroom
February 7th, 2021
George S. Markomanolis
Lead HPC Scientist, CSC – IT For Science Ltd.
Outline
• Motivation
• LUMI
• ROCm
• Introduction and porting codes to HIP
• Benchmarking
• Fortran and HIP
• Tuning
2
Disclaimer
• The AMD ecosystem is under heavy development; many things can change without notice
• All the experiments took place on an NVIDIA V100 GPU (Puhti cluster at CSC)
• We try to use the latest versions of ROCm
• Some results are very fresh and we are still investigating the outcome
3
LUMI
4
Motivation/Challenges
• LUMI will have AMD GPUs
• Need to learn how to program for and port codes to the AMD ecosystem
• Provide training to LUMI users
• Investigate possible problems ahead of time
• No access to AMD GPUs yet
5
AMD GPUs (MI100 example)
6
AMD MI100 (note: LUMI will have a different GPU)
Differences between HIP and CUDA
• AMD GCN hardware wavefront size is 64 (the counterpart of the 32-wide CUDA warp);
a sketch that queries it at runtime follows below
• Some CUDA library functions do not have AMD equivalents
• Shared memory and registers per thread can differ between AMD and
NVIDIA hardware
7
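Because the wavefront/warp width differs between vendors, portable code should query it at runtime rather than hardcoding 64 or 32. A minimal sketch using the standard hipGetDeviceProperties() call (device 0 is assumed to be the GPU of interest):

#include <hip/hip_runtime.h>
#include <stdio.h>

int main(void) {
    hipDeviceProp_t props;
    // warpSize reports 64 on AMD GCN hardware and 32 on NVIDIA GPUs
    if (hipGetDeviceProperties(&props, 0) != hipSuccess) {
        printf("no HIP device found\n");
        return 1;
    }
    printf("device: %s, wavefront/warp size: %d\n", props.name, props.warpSize);
    return 0;
}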
ROCm
8
• Open software platform for GPU-accelerated computing by AMD
ROCm installation
• Many components need to be installed
• Rocm-cmake
• ROCT Thunk Interface
• HSA Runtime API and runtime for ROCm
• ROCM LLVM / Clang
• Rocminfo (only for AMD HW)
• ROCm-Device-Libs
• ROCm-CompilerSupport
• ROCclr - Radeon Open Compute Common Language Runtime
• HIP
Instructions: https://github.com/cschpc/lumi/blob/main/rocm/rocm_install.md
Script: https://github.com/cschpc/lumi/blob/main/rocm/rocm_installation.sh
Repo: https://github.com/cschpc/lumi
9
Introduction to HIP
• HIP (Heterogeneous-computing Interface for Portability) is developed by
AMD for programming on AMD GPUs
• It is a C++ runtime API and it supports both AMD and NVIDIA
platforms
• HIP is similar to CUDA and adds little to no performance overhead
on NVIDIA GPUs
• Many well-known libraries have been ported to HIP
• New projects, or ports from CUDA, can be developed directly in HIP
https://github.com/ROCm-Developer-Tools/HIP
10
Differences between CUDA and HIP API
#include "cuda.h"
cudaMalloc(&d_x, N*sizeof(double));
cudaMemcpy(d_x,x,N*sizeof(double),
cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
11
#include "hip/hip_runtime.h"
hipMalloc(&d_x, N*sizeof(double));
hipMemcpy(d_x,x,N*sizeof(double),
hipMemcpyHostToDevice);
hipDeviceSynchronize();
CUDA HIP
Launching kernel with CUDA and HIP
kernel_name <<<gridsize, blocksize,
shared_mem_size,
stream>>>
(arg0, arg1, ...);
12
hipLaunchKernelGGL(kernel_name,
gridsize,
blocksize,
shared_mem_size,
stream,
arg0, arg1, ... );
CUDA HIP
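Putting the two comparisons above together, a minimal self-contained sketch of a complete HIP program (the vec_add kernel and the sizes are illustrative):

#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vec_add(int n, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += x[i];
}

int main(void) {
    const int N = 1 << 20;
    double *x = (double *)malloc(N * sizeof(double));
    double *y = (double *)malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double *d_x, *d_y;
    hipMalloc(&d_x, N * sizeof(double));
    hipMalloc(&d_y, N * sizeof(double));
    hipMemcpy(d_x, x, N * sizeof(double), hipMemcpyHostToDevice);
    hipMemcpy(d_y, y, N * sizeof(double), hipMemcpyHostToDevice);

    // grid, block, dynamic shared memory bytes, stream, then kernel arguments
    hipLaunchKernelGGL(vec_add, dim3((N + 255) / 256), dim3(256), 0, 0, N, d_x, d_y);

    hipMemcpy(y, d_y, N * sizeof(double), hipMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);  // expected: 3.0
    hipFree(d_x); hipFree(d_y); free(x); free(y);
    return 0;
}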
HIP API
• Device Management:
o hipSetDevice(), hipGetDevice(), hipGetDeviceProperties(), hipDeviceSynchronize()
• Memory Management
o hipMalloc(), hipMemcpy(), hipMemcpyAsync(), hipFree(), hipHostMalloc()
• Streams
o hipStreamCreate(), hipStreamSynchronize(), hipStreamDestroy()
• Events
o hipEventCreate(), hipEventRecord(), hipStreamWaitEvent(), hipEventElapsedTime()
• Device Kernels
o __global__, __device__, hipLaunchKernelGGL()
• Device code
o threadIdx, blockIdx, blockDim, __shared__
o Hundreds of math functions covering the entire CUDA math library
• Error handling (a usage sketch follows below)
o hipGetLastError(), hipGetErrorString()
https://rocmdocs.amd.com/en/latest/ROCm_API_References/HIP-API.html
13
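In practice the error-handling calls above are usually wrapped in a small helper so that every HIP call is checked. A minimal sketch; the HIP_CHECK macro is our own illustration, not part of the HIP API:

#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Report the failing call with file/line and a readable message, then abort
#define HIP_CHECK(call)                                                  \
    do {                                                                 \
        hipError_t err_ = (call);                                        \
        if (err_ != hipSuccess) {                                        \
            fprintf(stderr, "HIP error '%s' at %s:%d\n",                 \
                    hipGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

int main(void) {
    double *d_x;
    HIP_CHECK(hipMalloc(&d_x, 1024 * sizeof(double)));
    HIP_CHECK(hipDeviceSynchronize());
    HIP_CHECK(hipFree(d_x));
    return 0;
}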
Hipify
• Hipify tools automatically convert CUDA code
• It is possible that not all of the code is converted; the rest needs to be
implemented by the developer
• Hipify-perl: text-based search and replace
• Hipify-clang: source-to-source translator that uses the clang compiler
• Porting guide: https://github.com/ROCm-Developer-Tools/HIP/blob/main/docs/markdown/hip_porting_guide.md
14
Hipify-perl
• It can scan directories and convert CUDA code by replacing cuda with hip
(similar to sed -e 's/cuda/hip/g')
$ hipify-perl --inplace filename
It modifies filename in place, replacing the input with hipified output and saving a
backup in a .prehip file.
$ hipconvertinplace-perl.sh directory
It converts all the related files located inside the directory
15
Hipify-perl (cont).
1) $ ls src/
Makefile.am matMulAB.c matMulAB.h matMul.c
2) $ hipconvertinplace-perl.sh src
3) $ ls src/
Makefile.am matMulAB.c matMulAB.c.prehip matMulAB.h matMulAB.h.prehip
matMul.c matMul.c.prehip
No compilation took place, just conversion.
16
Hipify-perl (cont).
• hipify-perl returns a report for each file; it looks like this:
info: TOTAL-converted 53 CUDA->HIP refs ( error:0 init:0 version:0 device:1 ... library:16
... numeric_literal:12 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:888
kernels (0 total) :
hipFree 18
HIPBLAS_STATUS_SUCCESS 6
hipSuccess 4
hipMalloc 3
HIPBLAS_OP_N 2
hipDeviceSynchronize 1
hip_runtime 1
17
Hipify-perl (cont).
#include <cuda_runtime.h>
#include "cublas_v2.h"
if (cudaSuccess != cudaMalloc((void **) &a_dev,
sizeof(*a) * n * n) ||
cudaSuccess != cudaMalloc((void **) &b_dev,
sizeof(*b) * n * n) ||
cudaSuccess != cudaMalloc((void **) &c_dev,
sizeof(*c) * n * n)) {
printf("error: memory allocation (CUDA)n");
cudaFree(a_dev); cudaFree(b_dev);
cudaFree(c_dev);
cudaDestroy(handle);
exit(EXIT_FAILURE);
}
18
#include <hip/hip_runtime.h>
#include "hipblas.h”
if (hipSuccess != hipMalloc((void **) &a_dev,
sizeof(*a) * n * n) ||
hipSuccess != hipMalloc((void **) &b_dev,
sizeof(*b) * n * n) ||
hipSuccess != hipMalloc((void **) &c_dev,
sizeof(*c) * n * n)) {
printf("error: memory allocation (CUDA)n");
hipFree(a_dev); hipFree(b_dev); hipFree(c_dev);
hipblasDestroy(handle);
exit(EXIT_FAILURE);
}
CUDA HIP
Compilation
1) Compilation with CC=hipcc
matMulAB.c:21:10: fatal error: hipblas.h: No such file or directory 21 | #include
"hipblas.h"
2) Install HipBLAS library *
3) Compile again and the binary is ready. When HIP is used on NVIDIA hardware, the .cpp
file should be compiled with the option "hipcc -x cu ...".
• hipcc uses nvcc on NVIDIA GPUs and hcc on AMD GPUs
* https://github.com/cschpc/lumi/blob/main/hip/hipblas.md
19
Hipify-clang
• Build from source
• Sometimes the headers need to be included manually with -I/...
$ hipify-clang --print-stats -o matMul.o matMul.c
[HIPIFY] info: file 'matMul.c' statistics:
CONVERTED refs count: 0
UNCONVERTED refs count: 0
CONVERSION %: 0
REPLACED bytes: 0
TOTAL bytes: 4662
CHANGED lines of code: 1
TOTAL lines of code: 155
CODE CHANGED (in bytes) %: 0
CODE CHANGED (in lines) %: 1
TIME ELAPSED s: 22.94
20
Benchmark: MatMul with OpenMP offload
• We use the benchmark https://github.com/pc2/OMP-Offloading for testing purposes:
matrix multiplication of 2048 x 2048
• All the CUDA calls were converted and the code was linked against hipBLAS,
alongside the OpenMP offload version
• CUDA
matMulAB (11) : 1001.2 GFLOPS 11990.1 GFLOPS maxabserr = 0.0
• HIP
matMulAB (11) : 978.8 GFLOPS 12302.4 GFLOPS maxabserr = 0.0
• For most executions the HIP version was equal to or slightly better than the CUDA
version; over the total execution there is ~2.23% overhead for HIP on NVIDIA GPUs
(see the timing sketch below for how kernel times can be measured)
21
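As a rough illustration of how such kernel timings can be collected, a sketch using HIP events; the touch kernel and the buffer size are made up, and the benchmark above uses its own timers:

#include <hip/hip_runtime.h>
#include <stdio.h>

__global__ void touch(double *x) { x[threadIdx.x] += 1.0; }

int main(void) {
    double *d_x;
    hipMalloc(&d_x, 64 * sizeof(double));
    hipMemset(d_x, 0, 64 * sizeof(double));

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, 0);   // record on the default stream
    hipLaunchKernelGGL(touch, dim3(1), dim3(64), 0, 0, d_x);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);  // wait until the stop event has completed

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);  // elapsed milliseconds between events
    printf("kernel time: %.3f ms\n", ms);   // GFLOPS = flops / (ms * 1e6)

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(d_x);
    return 0;
}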
N-BODY SIMULATION
• N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA)
AllPairs_N2
• 171 CUDA calls converted to HIP without issues, close to 1000 lines of code
• HIP calls: hipMemcpy, hipMalloc, hipMemcpyHostToDevice,
hipMemcpyDeviceToHost, hipLaunchKernelGGL, hipDeviceSynchronize, hip_runtime,
hipSuccess, hipGetErrorString, hipGetLastError, hipError_t, HIP_DYNAMIC_SHARED
• 32768 small particles, 2000 time steps
• CUDA execution time: 68.5 seconds
• HIP execution time: 70.1 seconds, ~2.33% overhead
22
Fortran
• First Scenario: Fortran + CUDA C/C++
o Assuming there is no CUDA code in the Fortran files
o Hipify CUDA
o Compile and link with hipcc
• Second Scenario: CUDA Fortran
o There is no HIP equivalent
o HIP functions are callable from C, using `extern C`
o See hipfort
23
Hipfort
• The approach to porting Fortran codes to AMD GPUs is different; the hipify tools do
not support Fortran.
• We need to use hipfort, a Fortran interface library for GPU kernels *
• Steps:
1) We write the kernels in a new C++ file
2) Wrap the kernel launch in a C function
3) Use Fortran 2003 C binding to call the function
4) Things could change in the future
• Use OpenMP offload to GPUs
* https://github.com/ROCmSoftwarePlatform/hipfort
24
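To make steps 1-3 concrete, a minimal sketch of the C++ side for a saxpy kernel; the launch_saxpy name and its argument list are illustrative, not part of hipfort:

#include <hip/hip_runtime.h>

// Step 1: the kernel lives in a new C++ file
__global__ void saxpy_kernel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Step 2: wrap the kernel launch in a function with C linkage
extern "C" void launch_saxpy(int n, float a, const float *x, float *y) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    hipLaunchKernelGGL(saxpy_kernel, dim3(blocks), dim3(threads), 0, 0, n, a, x, y);
    hipDeviceSynchronize();
}

// Step 3 (sketched as a comment): on the Fortran side, declare roughly
//   interface
//     subroutine launch_saxpy(n, a, x, y) bind(c, name="launch_saxpy")
//       use iso_c_binding
//       integer(c_int), value :: n
//       real(c_float), value :: a
//       type(c_ptr), value :: x, y
//     end subroutine
//   end interface
// and call it with device pointers obtained through hipfort.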
Fortran CUDA example
• Saxpy example
• Fortran CUDA, 29 lines of code
• Ported to HIP manually: two files of 52 lines, with more than 20 new lines
• Quite a lot of changes for such a small code.
• Should we try to use OpenMP offload before we try to HIP the code?
• Need to adjust Makefile to compile the multiple files
• The HIP version is up to 30% faster; this seems to be a comparison between nvcc
and pgf90, and we are still verifying some results
• Example of Fortran with HIP: https://github.com/cschpc/lumi/tree/main/hipfort
25
Fortran CUDA example (cont.)
26
[Side-by-side code: original Fortran CUDA | new Fortran 2003 with HIP | C++ with HIP and extern C]
AMD OpenMP (AOMP)
• We have tested the LLVM-provided OpenMP offload and it improves over time
• AOMP is under heavy development and we have started testing it
• According to some public results AOMP still has some performance issues, but we
expect it to improve significantly by the time LUMI is delivered
• https://github.com/ROCm-Developer-Tools/aomp
27
OpenMP or HIP?
• Some users will be wondering which approach to take
• OpenMP can provide a quick port, but HIP is expected to deliver better
performance, as it avoids some intermediate layers
• For complicated codes and programming languages such as Fortran, OpenMP
could provide a benefit. Always profile your code to investigate the
performance.
28
Porting code to LUMI (not official)
29
Profiling/Debugging
• AMD will provide APIs for profiling and debugging
• Cray will support the profiling API through CrayPat
• Some well-known tool developers are collaborating with AMD to prepare their tools
for profiling and debugging
• Some simple environment variables such as
AMD_LOG_LEVEL=4 will provide some information.
• More information about a hipMemcpy error:
hipError_t err = hipMemcpy(c,c_d,nBytes,hipMemcpyDeviceToHost);
printf("%s\n", hipGetErrorString(err));
30
Tuning
• Multiple wavefronts per compute unit (CU) are important to hide latency and
sustain instruction throughput
• Memory coalescing increases bandwidth
• Unrolling loops allows the compiler to prefetch data
• Small kernels can cause latency overhead; adjust the workload
• Use the Local Data Share (LDS); see the sketch after this list
31
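A minimal sketch of the last point: a block-level sum reduction staged through __shared__ memory, which maps to the LDS on AMD hardware (the 256-thread block size is an assumption):

#include <hip/hip_runtime.h>

__global__ void block_sum(const double *in, double *out, int n) {
    __shared__ double lds[256];          // one LDS slot per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    lds[tid] = (i < n) ? in[i] : 0.0;    // stage data in fast on-chip memory
    __syncthreads();

    // Tree reduction inside the block; each step halves the active threads
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) lds[tid] += lds[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = lds[0];  // one partial sum per block
}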
Programming models
• OpenACC will be available through GCC, as Mentor Graphics (now called
Siemens EDA) is developing the OpenACC integration
• Kokkos, Raja, Alpaka, and SYCL should be usable on LUMI, but they do not
support all programming languages
32
Conclusion
• Depending on the code, porting to HIP can be fairly straightforward
• There can be challenges, depending on the code and what GPU
functionality is integrated into an application
• There are many approaches to porting a code; select the one that you are most
familiar with and that provides the best possible performance
• Tuning the code for high occupancy will be required
• Profiling can help to investigate data transfer issues
• It is probably more productive to try OpenMP offload to GPUs first for
Fortran codes
33
facebook.com/CSCfi
twitter.com/CSCfi
youtube.com/CSCfi
linkedin.com/company/csc---it-center-for-science
Images: CSC archive, Adobe Stock and Thinkstock
github.com/CSCfi
Thank you!
georgios.markomanolis@csc.fi