Optimizing MPI Communication on GPU
• Introduction to GPU
• GPU vs CPU
• More about GPU
• Introduction to CUDA
• Introduction to MPI
• Integration of MPI and CUDA
• Example (Matrix Multiplication)
• Performance Results
• Conclusion and Future Enhancements
• Bibliography
• What is a GPU?
• Uses a SIMD architecture
• Earlier used exclusively for graphics processing
• Extremely efficient for many types of parallel tasks
• A single chip is now capable of peak single-precision floating-point performance of more than 4 TFLOPS
• The CPU is the brain of the computer, while the GPU is its soul.
• The CPU has a small number of powerful cores; the GPU has many lightweight cores.
• The CPU can run any task, but the reverse is not true: the GPU only accelerates suitable parallel workloads.
• A GPU consists of streaming multiprocessors (SMs), streaming processors (SPs, i.e. the ALUs), global memory (DRAM), local memory, and shared memory; these resources can be queried at run time, as sketched below.
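To make these components concrete, here is a small sketch (not part of the original slides) that queries them through the CUDA runtime API; the field names follow the cudaDeviceProp structure:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int dev_count = 0;
        cudaGetDeviceCount(&dev_count);
        for (int d = 0; d < dev_count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);   /* query the device's capabilities */
            printf("GPU %d: %s\n", d, prop.name);
            printf("  SMs:                   %d\n",  prop.multiProcessorCount);
            printf("  Global memory (MB):    %zu\n", prop.totalGlobalMem >> 20);
            printf("  Shared mem/block (KB): %zu\n", prop.sharedMemPerBlock >> 10);
        }
        return 0;
    }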
• CUDA (Compute Unified Device Architecture): introduced by NVIDIA in June 2007.
• OpenCL (Open Computing Language): initiated by Apple Inc., released in August 2009.
• OpenACC (Open Accelerators): developed by Cray, CAPS, and NVIDIA.
CUDA programming model (illustrated below):
• CUDA kernels (__global__ functions)
• Memory-management methods (cudaMalloc, cudaMemcpy, ...)
• CUDA IPC (inter-process communication)
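A minimal sketch of this programming model (an illustration, not code from the slides): a __global__ kernel, device memory allocated with cudaMalloc, explicit host/device copies, and a kernel launch.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n)      /* kernel: runs on the GPU */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one array element per thread */
        if (i < n)
            x[i] = a * x[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d;
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaMalloc((void **)&d, bytes);                      /* allocate global (device) memory */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     /* host -> device copy */
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);         /* launch a grid of 256-thread blocks */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);     /* device -> host copy */

        cudaFree(d);
        free(h);
        return 0;
    }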
• CUDA IPC works efficiently for multi-GPU communication within a single system (node); a rough sketch follows below.
• What about GPU communication across multiple nodes? The answer is MPI.
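For reference, a sketch of the CUDA IPC mechanism mentioned above (illustrative only, not the slides' code; the handle exchange between the two processes, e.g. via MPI, is indicated by comments):

    /* Process A on the node: owns the allocation and exports a handle */
    size_t bytes = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, bytes);
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);
    /* ... send `handle` to process B on the same node (e.g. with MPI_Send) ... */

    /* Process B on the same node: maps A's allocation into its own address space */
    float *d_remote;
    cudaIpcOpenMemHandle((void **)&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    /* ... read/write d_remote directly, e.g. with cudaMemcpy or a kernel ... */
    cudaIpcCloseMemHandle(d_remote);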
• MPI: a standard interface/specification for parallel programming
  - Language independent
  - Platform independent
• Basic MPI usage, shown in a minimal two-process send/receive program:
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, tag = 1;
    char buf[1024];
    int size = sizeof(buf);

    MPI_Init(&argc, &argv);                   /* Initialize the MPI library */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* Determine the calling process rank */

    if (myrank == 0)
        MPI_Send(buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    else if (myrank == 1)
        MPI_Recv(buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
Running: mpirun -n 2 ./app <args>
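Note (not on the slide): before launching, the program is built with the MPI compiler wrapper; the source and executable names here are placeholders.

    mpicc mpi_app.c -o app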
Integration of MPI and CUDA without CUDA-aware support: data computed on the GPU is staged through a host buffer before the MPI transfer, and staged again on the receiving side before being copied back to the device.

double *dev_buf, *host_buf;
cudaMalloc(&dev_buf, size);          /* device buffer */
cudaMallocHost(&host_buf, size);     /* pinned host staging buffer */

if (my_rank == sender) {             /* sender */
    computation_on_GPU(dev_buf);
    cudaMemcpy(host_buf, dev_buf, size, ...);   /* device -> host */
    MPI_Send(host_buf, size, ...);
} else {                             /* receiver */
    MPI_Recv(host_buf, size, ...);
    cudaMemcpy(dev_buf, host_buf, size, ...);   /* host -> device */
    computation_on_GPU(dev_buf);
}
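A note on the staging buffer: cudaMallocHost allocates pinned (page-locked) host memory, which the GPU can access via DMA, so the cudaMemcpy staging steps are typically faster than they would be with ordinary pageable memory from malloc.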
• How can this communication be optimized?
• Unified Virtual Addressing (UVA): one address space for all CPU and GPU memory
  - The physical location of the memory can be determined from the pointer value
  - This enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
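For example, a CUDA-aware library can inspect a pointer it receives to decide which path to take. A small sketch (the helper name is ours; the attribute field is attr.type in recent CUDA releases, attr.memoryType in older ones):

    #include <cuda_runtime.h>

    /* Returns 1 if `buf` refers to device memory under UVA, 0 otherwise. */
    static int is_device_pointer(const void *buf)
    {
        cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess)
            return 0;                                 /* e.g. an unregistered host pointer */
        return attr.type == cudaMemoryTypeDevice;
    }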
With UVA and CUDA-aware MPI:

    // MPI rank 0
    MPI_Send(s_buf_d, size, ...);
    // MPI rank n-1
    MPI_Recv(r_buf_d, size, ...);

No UVA, regular MPI:

    // MPI rank 0
    cudaMemcpy(s_buf_h, s_buf_d, size, ...);
    MPI_Send(s_buf_h, size, ...);
    // MPI rank n-1
    MPI_Recv(r_buf_h, size, ...);
    cudaMemcpy(r_buf_d, r_buf_h, size, ...);
• With UVA and a CUDA-aware MPI, there is no need to copy the data from the device to the host (CPU); a buffer can be passed directly to another device (GPU).
• On its own, however, this optimizes communication only between multiple GPUs on the same node.
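Whether MPI_Send/MPI_Recv accept device pointers directly depends on the MPI library being built as CUDA-aware; checking or enabling this is implementation-specific. Two commonly documented examples (verify against your own installation):

    # Open MPI: check whether CUDA support was compiled in
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
    # MVAPICH2: enable support for device buffers at run time
    export MV2_USE_CUDA=1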
Example (Matrix Multiplication): the CUDA kernel, where each thread computes one element of C.

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    /* Each thread computes one element of C by accumulating results into Cvalue */
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;   /* guard threads outside the matrix */
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
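A hedged sketch of the host-side launch for this kernel. The Matrix struct and BLOCK_SIZE are not defined on the slide, so the definitions below are assumptions (row-major storage, device pointers in elements); error checking is omitted.

    #define BLOCK_SIZE 16

    typedef struct { int width; int height; float *elements; } Matrix;   /* assumed layout */

    void MatMul(Matrix A, Matrix B, Matrix C)        /* A, B, C already reside on the device */
    {
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);
        dim3 grid((B.width  + block.x - 1) / block.x,
                  (A.height + block.y - 1) / block.y);
        MatMulKernel<<<grid, block>>>(A, B, C);      /* one thread per element of C */
        cudaDeviceSynchronize();
    }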
Assigning one GPU to each MPI process on a node (the slide shows an excerpt; variable declarations and the construction of the per-node communicator nodeComm are elided):

void assignDeviceToProcess()
{
    MPI_Get_processor_name(host_name, &namelen);             /* which node does this rank run on? */

    bytes = nprocs * sizeof(char[MPI_MAX_PROCESSOR_NAME]);   /* size of the host_names table */
    for (n = 0; n < nprocs; n++)                              /* share every rank's host name */
        MPI_Bcast(&(host_names[n]), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, n, MPI_COMM_WORLD);

    /* Ranks with the same host name are grouped into the per-node communicator nodeComm;
       the rank within that communicator then selects the GPU */
    MPI_Comm_rank(nodeComm, &myrank);
    cudaSetDevice(myrank);
}
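The per-node communicator nodeComm is not constructed in the excerpt above. With MPI-3 it can be obtained directly; the following sketch (ours, not the slides') groups the ranks of one node and maps each to a GPU.

    void assign_device_to_process_mpi3(void)
    {
        MPI_Comm nodeComm;
        int local_rank, num_devices;

        /* Group the ranks that share a node (shared-memory domain) into one communicator */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodeComm);
        MPI_Comm_rank(nodeComm, &local_rank);

        cudaGetDeviceCount(&num_devices);
        cudaSetDevice(local_rank % num_devices);   /* map the local rank to a GPU on this node */
    }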
PERFORMANCE OF MPI FOR HIGHER- AND SMALLER-ORDER MATRICES
Fig. Performance of MPI for higher-order matrices.
Fig. Performance of MPI for smaller-order matrices.
• GPUs have become a core component of HPC.
• NVIDIA GPUs are the most popular, thanks to their CUDA programming model.
• MPI can be used to obtain further acceleration and parallelization across nodes.
• MPICH2 is the MPI implementation used in this work.
• A CUDA-aware MPI such as MVAPICH2 can be adopted as a future enhancement.
• GPUDirect, including GPUDirect RDMA, can be used to further optimize GPU-to-GPU transfers.
Bibliography
[1] S. Potluri and H. Wang, "Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication", 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[2] K. Hamidouche and S. Potluri, "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs", 2013 42nd International Conference on Parallel Processing.
[3] S. Sampath, "Performance Evaluation and Comparison of MPI and PVM using a Cluster Based Parallel Computing Architecture", 2013 International Conference on Circuits, Power and Computing Technologies (ICCPCT-2013).
[4] Website: https://developer.nvidia.com/mpi-solutions-gpus