Introduction to GPUs in HPC
Ben Cumming, CSCS
February 28, 2017
2D and 3D Launch Configurations
Launch Configuration
So far we have used one-dimensional launch configurations
threads in blocks indexed using threadIdx.x
blocks in a grid indexed using blockIdx.x
Many kernels map naturally onto 2D and 3D indexing
e.g. matrix-matrix operations
e.g. stencils
Full Launch Configuration
Kernel launch dimensions can be specified with dim3 structs
kernel<<<dim3 grid_dim, dim3 block_dim>>>(...);
the .x, .y and .z members of a dim3 specify the launch dimensions
can be constructed with 1, 2 or 3 dimensions
unspecified dim3 dimensions are set to 1
launch configuration examples
// 1D: 128x1x1 for 128 threads
dim3 a(128);
// 2D: 16x8x1 for 128 threads
dim3 b(16, 8);
// 3D: 16x8x4 for 512 threads
dim3 c(16, 8, 4);
The built-in threadIdx, blockDim, blockIdx and gridDim variables can be
treated like 3D vectors via their .x, .y and .z members.
matrix addition example
__global__
void MatAdd(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if(i<n && j<n) {
        auto pos = i + j*n;
        C[pos] = A[pos] + B[pos];
    }
}

int main() {
    // ...
    dim3 threadsPerBlock(16, 16);
    // note: integer division rounds down, so this assumes n is a
    // multiple of 16; in general round up and rely on the bounds
    // check inside the kernel
    dim3 numBlocks(n / threadsPerBlock.x, n / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    // ...
}
Exercise: Launch Configuration
Write the 2D diffusion stencil in diffusion/diffusion2d.cu
Set up a 2D launch configuration in the main loop
Draw a picture of the solution to validate it
a plotting script is provided for visualizing the results
use a small domain for visualization
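A minimal sketch of what the kernel and launch might look like (the buffer
names x_old/x_new, the Dirichlet-boundary handling, and the fused coefficient
dt are illustrative assumptions, not the exercise's actual code):

// illustrative sketch: one explicit 5-point-stencil diffusion step
__global__
void diffusion(double *x0, double *x1, int nx, int ny, double dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // update interior points only; boundary values stay fixed
    if(i > 0 && i < nx-1 && j > 0 && j < ny-1) {
        auto pos = i + j*nx;
        x1[pos] = x0[pos] + dt * (x0[pos-1] + x0[pos+1]
                                + x0[pos-nx] + x0[pos+nx]
                                - 4.0*x0[pos]);
    }
}

// in the main loop: round the grid up so the whole domain is covered
dim3 block(16, 16);
dim3 grid((nx + block.x - 1)/block.x, (ny + block.y - 1)/block.y);
diffusion<<<grid, block>>>(x_old, x_new, nx, ny, dt);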
# build and run after writing code
cd diffusion
srun diffusion2d 8 1000000
# do the plotting
module load daint/gpu
module load Python/2.7.12-CrayGNU-2016.11
python plotting.py
Using MPI with GPUs
What is MPI
MPI (Message Passing Interface) is a standardised library
interface for message passing
Highly portable: it is implemented on every HPC system
available today.
Has C, C++ and Fortran bindings.
Supports point-to-point communication
MPI_Send, MPI_Recv, MPI_Sendrecv, etc.
Supports global collectives
MPI_Barrier, MPI_Gather, MPI_Reduce, etc.
When you start an MPI application
N copies of the application are launched.
Each copy is given a rank ∈ {0, 1, …, N − 1}.
A basic MPI application
Example MPI application myapp.cpp
#include <mpi.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv) {
    // initialize MPI on this rank
    MPI_Init(&argc, &argv);
    // get information about our place in the world
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // print a message
    char name[128];
    gethostname(name, sizeof(name));
    printf("hello world from %d of %d on %s\n", rank, size, name);
    // close down MPI
    MPI_Finalize();
    return 0;
}
MPI applications are compiled with a compiler wrapper:
> CC myapp.cpp -o myapp # the Cray C++ wrapper is CC
Running our basic MPI application
# run myapp with 4 ranks (-n) on 4 nodes (-N)
> srun -n4 -N4 ./myapp
hello world from 0 of 4 on nid02117
hello world from 1 of 4 on nid02118
hello world from 2 of 4 on nid02119
hello world from 3 of 4 on nid02120
MPI with data in device memory
We use GPUs to parallelize on-node computation,
and typically MPI for communication between nodes
To communicate data that is in buffers in GPU memory:
1. allocate buffers in host memory
2. manually copy from device→host memory
3. perform MPI communication with host buffers
4. copy received data from host→device memory
This approach can be very fast:
a CPU thread can be dedicated to asynchronous host↔device
copies and MPI communication
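A hedged sketch of steps 1-4 above (the buffer size n, the neighbor rank, and
the function name are illustrative):

#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

void exchange(double *d_send, double *d_recv, int n, int neighbor) {
    // 1. allocate buffers in host memory
    std::vector<double> h_send(n), h_recv(n);
    // 2. manually copy device -> host
    cudaMemcpy(h_send.data(), d_send, n*sizeof(double), cudaMemcpyDeviceToHost);
    // 3. perform MPI communication with the host buffers
    MPI_Sendrecv(h_send.data(), n, MPI_DOUBLE, neighbor, 0,
                 h_recv.data(), n, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // 4. copy received data host -> device
    cudaMemcpy(d_recv, h_recv.data(), n*sizeof(double), cudaMemcpyHostToDevice);
}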
GPU-aware MPI
GPU-aware MPI implementations can automatically handle
MPI transactions with pointers to GPU memory
MVAPICH 2.0
OpenMPI since version 1.7.0
Cray MPI
How it works
Each pointer passed to MPI is checked to see whether it
points to host or device memory
Small messages between GPUs (up to ≈8 kB) are copied
directly with RDMA
Larger messages are pipelined via host memory
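The pointer check is implementation-specific, but it can be done with the CUDA
runtime roughly as follows (a sketch against the CUDA 8-era API, where
cudaPointerAttributes has a memoryType field; later releases renamed it to type):

#include <cuda_runtime.h>

bool is_device_pointer(const void *p) {
    cudaPointerAttributes attr;
    // fails for ordinary host pointers that CUDA does not know about
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
        cudaGetLastError();  // clear the sticky error state
        return false;
    }
    return attr.memoryType == cudaMemoryTypeDevice;
}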
How to use G2G communication
Set the environment variable: export MPICH_RDMA_ENABLED_CUDA=1
If not set, MPI assumes that all pointers are to host
memory, and your application will probably crash with
segmentation faults
Experiment with the environment variable
MPICH_G2G_PIPELINE
Sets the maximum number of 512 kB message chunks that
can be in flight (default 16)
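For example (the pipeline depth of 32 is purely illustrative):

export MPICH_RDMA_ENABLED_CUDA=1   # required: enable G2G transfers
export MPICH_G2G_PIPELINE=32       # optional: allow 32 chunks in flight instead of 16
srun ...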
MPI with G2G example
MPI_Request srequest, rrequest;
auto send_data = malloc_device<double>(100);
auto recv_data = malloc_device<double>(100);

// call MPI with GPU pointers
MPI_Irecv(recv_data, 100, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &rrequest);
MPI_Isend(send_data, 100, MPI_DOUBLE, target, tag, MPI_COMM_WORLD, &srequest);
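The non-blocking calls must complete before the buffers are reused; a minimal
completion step (not shown on the original slide) would be:

MPI_Request requests[2] = {rrequest, srequest};
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);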
Capabilities and Limitations
Support for most MPI API calls (point-to-point, collectives, etc.)
Robust support for the most common MPI API calls,
i.e. point-to-point operations
No support for user-defined MPI data types
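For example, a strided type like the following (a hypothetical halo-column
type; nx, ny and d_ptr are illustrative) cannot be used with device pointers;
the strided data has to be packed into a contiguous device buffer first:

MPI_Datatype column;
MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);  // one value per row, stride nx
MPI_Type_commit(&column);
// MPI_Send(d_ptr, 1, column, ...) with d_ptr in device memory is not
// supported; pack the column into a contiguous device buffer and send that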
Exercise: MPI with G2G
2D stencil with MPI in diffusion/diffusion2d_mpi.cu
Implement the G2G version
1. can you observe any performance differences between the
two?
2. why are we restricted to just 1 MPI rank per node?
Implement a version that uses managed memory
what happens if you don’t set MPICH_RDMA_ENABLED_CUDA?
# load modules required to compile MPI
module swap PrgEnv-cray PrgEnv-gnu
module load cudatoolkit
# launch with 2 MPI ranks
MPICH_RDMA_ENABLED_CUDA=1 srun --reservation=Pascal2day -p jalet \
    -C gpu -n2 -N2 diffusion2d_mpi.cuda 8
# plot the solution
module load python/2.7.6
python plotting.py
# once it produces the correct results:
sbatch job.batch
Exercises: 2D Diffusion with MPI Results
Time in seconds for 1000 time steps on a 128×16,382 domain, on K20X GPUs:

nodes   G2G off   G2G on
 1      0.479     0.479
 2      0.277     0.274
 4      0.183     0.180
 8      0.152     0.151
16      0.167     0.117
Using Unified Memory with MPI
To pass a managed pointer to MPI you must use a
GPU-aware MPI distribution,
even if the managed memory is on the host at the time of
calling.
The MPI implementation uses page-locked (pinned) memory
for RDMA. If it is not aware of unified memory you get:
if lucky: crashes.
if unlucky: infuriating bugs.
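A minimal sketch of passing a managed pointer straight to MPI (assumes a
GPU-aware, unified-memory-aware MPI; target and tag are illustrative):

#include <mpi.h>
#include <cuda_runtime.h>

double *buf;
cudaMallocManaged(&buf, 100*sizeof(double));
// ... a kernel writes into buf ...
// the same pointer is valid on host and device; a unified-memory-aware
// MPI pins and transfers it directly, wherever the pages currently live
MPI_Send(buf, 100, MPI_DOUBLE, target, tag, MPI_COMM_WORLD);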