1  Optimizing MPI Communication on GPU
• Introduction to GPU
• GPU vs CPU
• More About GPU
• Introduction to CUDA
• Introduction to MPI
• Integration of MPI & CUDA
• Examples (Matrix Multiplication)
• Performance Results
• Conclusion and Future Enhancements
• Bibliography
2  Optimizing MPI Communication on GPU
• What is a GPU?
• Uses a SIMD architecture
• Earlier used exclusively for graphics processing
• Extremely efficient for many types of parallel tasks
• A single chip is now capable of a peak single-precision floating-point performance of more than 4 TFLOPS
3  Optimizing MPI Communication on GPU
• The CPU is the BRAIN of the computer, while the GPU is its SOUL.
• The CPU has a small number of large, general-purpose cores; the GPU has many small cores.
• The CPU can run any task, but not vice versa.
4  Optimizing MPI Communication on GPU
• A GPU consists of streaming multiprocessors (SMs), streaming processors (SPs/ALUs), global memory (DRAM), local memory, and shared memory.
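A minimal CUDA sketch of how these memory spaces appear in code (illustrative only; the kernel name, d_scale, and the fixed block size of 256 are assumptions, not from the slides): a __device__ variable lives in global memory (DRAM), a __shared__ array lives in the per-SM shared memory, and ordinary local variables sit in registers or local memory.

// Hypothetical example: global, shared and local memory in one kernel (assumes blockDim.x <= 256)
__device__ float d_scale = 2.0f;              // global memory (DRAM), visible to all threads

__global__ void scaleKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];               // shared memory: one copy per thread block, on-chip per SM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;                           // local variable: register / local memory
    if (i < n) v = in[i] * d_scale;           // read from global memory
    tile[threadIdx.x] = v;                    // stage the value in shared memory
    __syncthreads();                          // make the block's shared data visible to all its threads
    if (i < n) out[i] = tile[threadIdx.x];    // write the result back to global memory
}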
5  Optimizing MPI Communication on GPU
• CUDA (Compute Unified Device Architecture): introduced by NVIDIA in June 2007.
• OpenCL (Open Computing Language): initiated by Apple Inc., introduced in August 2009.
• OpenACC (for Open Accelerators): by CRAY, CAPS and NVIDIA.
• CUDA PROGRAMMING MODEL:
• CUDA kernels (__global__ functions)
• Methods to handle memory (cudaMalloc, cudaMemcpy, cudaFree)
• CUDA IPC (inter-process communication)
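The programming-model bullets above can be summarized in one short sketch (illustrative; the kernel addOne and the helper runOnGpu are hypothetical names): allocate device memory, copy data in, launch a __global__ kernel with an execution configuration, copy the result back, and free the memory.

__global__ void addOne(int *data, int n)          // CUDA kernel: runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

void runOnGpu(int *host_data, int n)              // host-side memory handling
{
    int *dev_data;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&dev_data, bytes);                             // allocate global (device) memory
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);
    addOne<<<(n + 255) / 256, 256>>>(dev_data, n);            // launch configuration: grid x block
    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}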
6  Optimizing MPI Communication on GPU
• CUDA IPC works efficiently for multi-GPU communication within the same system (see the sketch below).
• What about GPU communication across multiple nodes? The answer is MPI.
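A minimal sketch of the CUDA IPC path, assuming two MPI processes on the same node (buffer names such as dev_buf, local_buf and the ranks owner/peer are illustrative; error handling omitted): the owning process exports a handle to its device allocation, the handle is shipped to the peer (here over MPI), and the peer maps the same GPU memory into its own address space.

// Process that owns the device buffer (exporter)
cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, dev_buf);                        // export a handle for dev_buf
MPI_Send(&handle, sizeof(handle), MPI_BYTE, peer, tag, MPI_COMM_WORLD);

// Peer process on the same node (importer)
cudaIpcMemHandle_t handle;
void *peer_buf;
MPI_Recv(&handle, sizeof(handle), MPI_BYTE, owner, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaIpcOpenMemHandle(&peer_buf, handle, cudaIpcMemLazyEnablePeerAccess);  // map the remote GPU buffer
cudaMemcpy(local_buf, peer_buf, size, cudaMemcpyDeviceToDevice);          // direct GPU-to-GPU copy
cudaIpcCloseMemHandle(peer_buf);                                          // unmap when done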
7  Optimizing MPI Communication on GPU
• Standard interface/specification for parallel programming
– Language independent
– Platform independent
• MPI FUNCTIONS:
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, tag = 0;
    char buf[64] = "hello";
    MPI_Init(&argc, &argv);                     /* Initialize the MPI library         */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);     /* Determine the calling process rank */
    if (myrank == 0)                            /* rank 0 sends, rank 1 receives      */
        MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    else if (myrank == 1)
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
Running:  mpirun -n 2 ./app <args>
8  Optimizing MPI Communication on GPU
9  Optimizing MPI Communication on GPU
double *dev_buf, *host_buf;
cudaMalloc(&dev_buf, size);
cudaMallocHost(&host_buf, size);                    /* pinned host staging buffer */
if (my_rank == sender) {                            /* sender */
    computation_on_GPU(dev_buf);
    cudaMemcpy(host_buf, dev_buf, size, ...);       /* stage GPU -> host */
    MPI_Send(host_buf, size, ...);
} else {                                            /* receiver */
    MPI_Recv(host_buf, size, ...);
    cudaMemcpy(dev_buf, host_buf, size, ...);       /* stage host -> GPU */
    computation_on_GPU(dev_buf);
}
10  Optimizing MPI Communication on GPU
11  Optimizing MPI Communication on GPU
• How can we optimize this?
Optimizing MPI Communication on GPU  12
• One address space for all CPU and GPU memory (Unified Virtual Addressing, UVA)
  - The physical memory location can be determined from a pointer value
  - Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
With UVA and a CUDA-aware MPI:
    // MPI rank 0
    MPI_Send(s_buf_d, size, ...);
    // MPI rank n-1
    MPI_Recv(r_buf_d, size, ...);

Without UVA, with a regular MPI:
    // MPI rank 0
    cudaMemcpy(s_buf_h, s_buf_d, size, ...);
    MPI_Send(s_buf_h, size, ...);
    // MPI rank n-1
    MPI_Recv(r_buf_h, size, ...);
    cudaMemcpy(r_buf_d, r_buf_h, size, ...);
• With UVA and a CUDA-aware MPI we do not have to copy data from the device (GPU) to the host (CPU); a device buffer can be passed directly to MPI and forwarded to the other GPU (see the sketch below).
• However, by itself this mainly optimizes communication between multiple GPUs on the same node.
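Putting this together, a hedged sketch of the CUDA-aware path, assuming an MPI library built with CUDA support (e.g. MVAPICH2 or a CUDA-enabled Open MPI); the buffer names, count and tag are illustrative, and computation_on_GPU is the same placeholder used on the earlier slide. Thanks to UVA the library recognises that s_buf_d and r_buf_d are device pointers and handles the staging or pipelining internally.

int my_rank, tag = 0, count = 1 << 20;                 /* number of doubles (illustrative) */
size_t size = count * sizeof(double);
double *s_buf_d, *r_buf_d;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
cudaMalloc(&s_buf_d, size);
cudaMalloc(&r_buf_d, size);
if (my_rank == 0) {
    computation_on_GPU(s_buf_d);                       /* produce data on the GPU */
    MPI_Send(s_buf_d, count, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);   /* device pointer passed directly */
} else if (my_rank == 1) {
    MPI_Recv(r_buf_d, count, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    computation_on_GPU(r_buf_d);                       /* consume on the GPU, no host staging buffer */
}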
Optimizing MPI Communication on GPU  13
// Matrix is assumed to be a struct with fields: int width, height; float *elements;
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;   // guard threads that fall outside the matrix
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
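The kernel above needs a matching launch configuration on the host; a small sketch (BLOCK_SIZE and the assumption that A, B, C already reside in device memory are not shown on the slide): one thread per output element, with enough blocks to cover C even when its dimensions are not multiples of the block size.

#define BLOCK_SIZE 16
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);                     // 16x16 threads, one per element of C
dim3 dimGrid((B.width  + dimBlock.x - 1) / dimBlock.x,     // round up so the whole matrix is covered
             (A.height + dimBlock.y - 1) / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(A, B, C);              // A, B, C already copied to device memory
cudaDeviceSynchronize();                                   // wait for the kernel to finish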
Optimizing MPI Communication on GPU  14
/* Assign one GPU per MPI process, indexed by the node-local rank.
   Requires <mpi.h>, <string.h>, <stdlib.h> and the CUDA runtime. */
void assignDeviceToProcess()
{
    char host_name[MPI_MAX_PROCESSOR_NAME];
    int rank, nprocs, namelen, n, color = 0, myrank;
    MPI_Comm nodeComm;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(host_name, &namelen);
    /* Gather every process' host name so each rank knows who shares its node */
    char (*host_names)[MPI_MAX_PROCESSOR_NAME] =
        malloc(nprocs * sizeof(char[MPI_MAX_PROCESSOR_NAME]));
    strcpy(host_names[rank], host_name);
    for (n = 0; n < nprocs; n++)
        MPI_Bcast(&(host_names[n]), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, n, MPI_COMM_WORLD);
    /* One simple way to build the per-node communicator: processes sharing a
       host name get the same color (the lowest global rank on that node) */
    for (n = 0; n < nprocs; n++)
        if (strcmp(host_names[n], host_name) == 0) { color = n; break; }
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &nodeComm);
    MPI_Comm_rank(nodeComm, &myrank);   /* node-local rank */
    cudaSetDevice(myrank);              /* bind this process to GPU <myrank> */
}
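A short usage sketch (hypothetical main; the per-rank matrix-multiplication work is elided): the device must be selected right after MPI_Init and before any CUDA allocation, so every rank on a node lands on a different GPU.

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    assignDeviceToProcess();          /* pick a GPU based on the node-local rank */
    /* ... cudaMalloc, kernel launches and MPI exchanges for this rank ... */
    MPI_Finalize();
    return 0;
}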
Optimizing MPI Communication on GPU  15
PERFORMANCE OF MPI FOR HIGHER- AND SMALLER-ORDER MATRICES
Fig. Performance of MPI for higher-order matrices
Fig. Performance of MPI for smaller-order matrices
Optimizing MPI Communication on GPU  16
• GPUs have become the core of HPC
• NVIDIA GPUs are the most popular, thanks to the CUDA programming model
• MPI can be used on top of CUDA for further acceleration and parallelization across nodes
• MPICH2 is the recent MPI implementation used in this work
• MVAPICH2 provides a CUDA-aware MPI implementation optimized for NVIDIA GPUs
• GPUDirect, including GPUDirect RDMA, can be used for direct GPU-to-GPU transfers
Optimizing MPI Communication on GPU  17
[1] S. Potluri, H. Wang, et al., "Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication", 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[2] Khaled Hamidouche, Sreeram Potluri, et al., "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs", 2013 42nd International Conference on Parallel Processing.
[3] Sampath S., et al., "Performance Evaluation and Comparison of MPI and PVM using a Cluster Based Parallel Computing Architecture", 2013 International Conference on Circuits, Power and Computing Technologies (ICCPCT-2013).
[4] Website referred: https://developer.nvidia.com/mpi-solutions-gpus
Optimizing MPI Communication on GPU  18