1  Optimizing MPI Communication on GPU
• Introduction to GPU
• GPU vs CPU
• More About GPU
• Introduction to CUDA
• Introduction to MPI
• Integration of MPI & CUDA
• Examples (Matrix Multiplication)
• Performance Results
• Conclusion and Future Enhancements
• Bibliography
2  Optimizing MPI Communication on GPU
• What is a GPU?
• Uses a SIMD architecture
• Earlier used exclusively for graphics processing
• Extremely efficient for many types of parallel tasks
• A single chip is now capable of a peak single-precision floating-point performance of more than 4 TFLOPS
3  Optimizing MPI Communication on GPU
• The CPU is the BRAIN of the computer, while the GPU is its SOUL.
• The CPU has a small number of large, general-purpose cores; the GPU has many small cores.
• The CPU can run any task, but not vice versa.
4  Optimizing MPI Communication on GPU
• A GPU consists of streaming multiprocessors (SMs), streaming processors (SPs/ALUs), global memory (DRAM), local memory, and shared memory.
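A minimal CUDA sketch of how these memory spaces appear in code (illustrative only; the kernel name, d_scale, and the fixed block size of 256 are assumptions, not from the slides): a __device__ variable lives in global memory (DRAM), a __shared__ array lives in the per-SM shared memory, and ordinary local variables sit in registers or local memory.

// Hypothetical example: global, shared and local memory in one kernel (assumes blockDim.x <= 256)
__device__ float d_scale = 2.0f;              // global memory (DRAM), visible to all threads

__global__ void scaleKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];               // shared memory: one copy per thread block, on-chip per SM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;                           // local variable: register / local memory
    if (i < n) v = in[i] * d_scale;           // read from global memory
    tile[threadIdx.x] = v;                    // stage the value in shared memory
    __syncthreads();                          // make the block's shared data visible to all its threads
    if (i < n) out[i] = tile[threadIdx.x];    // write the result back to global memory
}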
5  Optimizing MPI Communication on GPU
• CUDA (Compute Unified Device Architecture): introduced by NVIDIA in June 2007.
• OpenCL (Open Computing Language): initiated by Apple Inc., introduced in August 2009.
• OpenACC (for Open Accelerators): by CRAY, CAPS and NVIDIA.
• CUDA PROGRAMMING MODEL:
• CUDA kernels (__global__ functions)
• Methods to handle memory (cudaMalloc, cudaMemcpy, cudaFree)
• CUDA IPC (inter-process communication)
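The programming-model bullets above can be summarized in one short sketch (illustrative; the kernel addOne and the helper runOnGpu are hypothetical names): allocate device memory, copy data in, launch a __global__ kernel with an execution configuration, copy the result back, and free the memory.

__global__ void addOne(int *data, int n)          // CUDA kernel: runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

void runOnGpu(int *host_data, int n)              // host-side memory handling
{
    int *dev_data;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&dev_data, bytes);                             // allocate global (device) memory
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);
    addOne<<<(n + 255) / 256, 256>>>(dev_data, n);            // launch configuration: grid x block
    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}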
6  Optimizing MPI Communication on GPU
• CUDA IPC works efficiently for multi-GPU communication within the same system (see the sketch below).
• What about GPU communication across multiple nodes? The answer is MPI.
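A minimal sketch of the CUDA IPC path, assuming two MPI processes on the same node (buffer names such as dev_buf, local_buf and the ranks owner/peer are illustrative; error handling omitted): the owning process exports a handle to its device allocation, the handle is shipped to the peer (here over MPI), and the peer maps the same GPU memory into its own address space.

// Process that owns the device buffer (exporter)
cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, dev_buf);                        // export a handle for dev_buf
MPI_Send(&handle, sizeof(handle), MPI_BYTE, peer, tag, MPI_COMM_WORLD);

// Peer process on the same node (importer)
cudaIpcMemHandle_t handle;
void *peer_buf;
MPI_Recv(&handle, sizeof(handle), MPI_BYTE, owner, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaIpcOpenMemHandle(&peer_buf, handle, cudaIpcMemLazyEnablePeerAccess);  // map the remote GPU buffer
cudaMemcpy(local_buf, peer_buf, size, cudaMemcpyDeviceToDevice);          // direct GPU-to-GPU copy
cudaIpcCloseMemHandle(peer_buf);                                          // unmap when done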
7  Optimizing MPI Communication on GPU
• Standard interface/specification for parallel programming
– Language independent
– Platform independent
• MPI FUNCTIONS:
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, tag = 0;
    char buf[64] = "hello";
    MPI_Init(&argc, &argv);                     /* Initialize the MPI library         */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);     /* Determine the calling process rank */
    if (myrank == 0)                            /* rank 0 sends, rank 1 receives      */
        MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    else if (myrank == 1)
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
Running:  mpirun -n 2 ./app <args>
8  Optimizing MPI Communication on GPU
9  Optimizing MPI Communication on GPU
double *dev_buf, *host_buf;
cudaMalloc(&dev_buf, size);
cudaMallocHost(&host_buf, size);                    /* pinned host staging buffer */
if (my_rank == sender) {                            /* sender */
    computation_on_GPU(dev_buf);
    cudaMemcpy(host_buf, dev_buf, size, ...);       /* stage GPU -> host */
    MPI_Send(host_buf, size, ...);
} else {                                            /* receiver */
    MPI_Recv(host_buf, size, ...);
    cudaMemcpy(dev_buf, host_buf, size, ...);       /* stage host -> GPU */
    computation_on_GPU(dev_buf);
}
10  Optimizing MPI Communication on GPU
11  Optimizing MPI Communication on GPU
• How can we optimize this?
Optimizing MPI Communication on GPU  12
• One address space for all CPU and GPU memory (Unified Virtual Addressing, UVA)
  - The physical memory location can be determined from a pointer value
  - Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
With UVA and a CUDA-aware MPI:
    // MPI rank 0
    MPI_Send(s_buf_d, size, ...);
    // MPI rank n-1
    MPI_Recv(r_buf_d, size, ...);

Without UVA, with a regular MPI:
    // MPI rank 0
    cudaMemcpy(s_buf_h, s_buf_d, size, ...);
    MPI_Send(s_buf_h, size, ...);
    // MPI rank n-1
    MPI_Recv(r_buf_h, size, ...);
    cudaMemcpy(r_buf_d, r_buf_h, size, ...);
• With UVA and a CUDA-aware MPI we do not have to copy data from the device (GPU) to the host (CPU); a device buffer can be passed directly to MPI and forwarded to the other GPU (see the sketch below).
• However, by itself this mainly optimizes communication between multiple GPUs on the same node.
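Putting this together, a hedged sketch of the CUDA-aware path, assuming an MPI library built with CUDA support (e.g. MVAPICH2 or a CUDA-enabled Open MPI); the buffer names, count and tag are illustrative, and computation_on_GPU is the same placeholder used on the earlier slide. Thanks to UVA the library recognises that s_buf_d and r_buf_d are device pointers and handles the staging or pipelining internally.

int my_rank, tag = 0, count = 1 << 20;                 /* number of doubles (illustrative) */
size_t size = count * sizeof(double);
double *s_buf_d, *r_buf_d;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
cudaMalloc(&s_buf_d, size);
cudaMalloc(&r_buf_d, size);
if (my_rank == 0) {
    computation_on_GPU(s_buf_d);                       /* produce data on the GPU */
    MPI_Send(s_buf_d, count, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);   /* device pointer passed directly */
} else if (my_rank == 1) {
    MPI_Recv(r_buf_d, count, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    computation_on_GPU(r_buf_d);                       /* consume on the GPU, no host staging buffer */
}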
Optimizing MPI Communication on GPU  13
// Matrix is assumed to be a struct with fields: int width, height; float *elements;
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;   // guard threads that fall outside the matrix
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
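The kernel above needs a matching launch configuration on the host; a small sketch (BLOCK_SIZE and the assumption that A, B, C already reside in device memory are not shown on the slide): one thread per output element, with enough blocks to cover C even when its dimensions are not multiples of the block size.

#define BLOCK_SIZE 16
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);                     // 16x16 threads, one per element of C
dim3 dimGrid((B.width  + dimBlock.x - 1) / dimBlock.x,     // round up so the whole matrix is covered
             (A.height + dimBlock.y - 1) / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(A, B, C);              // A, B, C already copied to device memory
cudaDeviceSynchronize();                                   // wait for the kernel to finish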
Optimizing MPI Communication on GPU  14
/* Assign one GPU per MPI process, indexed by the node-local rank.
   Requires <mpi.h>, <string.h>, <stdlib.h> and the CUDA runtime. */
void assignDeviceToProcess()
{
    char host_name[MPI_MAX_PROCESSOR_NAME];
    int rank, nprocs, namelen, n, color = 0, myrank;
    MPI_Comm nodeComm;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(host_name, &namelen);
    /* Gather every process' host name so each rank knows who shares its node */
    char (*host_names)[MPI_MAX_PROCESSOR_NAME] =
        malloc(nprocs * sizeof(char[MPI_MAX_PROCESSOR_NAME]));
    strcpy(host_names[rank], host_name);
    for (n = 0; n < nprocs; n++)
        MPI_Bcast(&(host_names[n]), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, n, MPI_COMM_WORLD);
    /* One simple way to build the per-node communicator: processes sharing a
       host name get the same color (the lowest global rank on that node) */
    for (n = 0; n < nprocs; n++)
        if (strcmp(host_names[n], host_name) == 0) { color = n; break; }
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &nodeComm);
    MPI_Comm_rank(nodeComm, &myrank);   /* node-local rank */
    cudaSetDevice(myrank);              /* bind this process to GPU <myrank> */
}
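A short usage sketch (hypothetical main; the per-rank matrix-multiplication work is elided): the device must be selected right after MPI_Init and before any CUDA allocation, so every rank on a node lands on a different GPU.

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    assignDeviceToProcess();          /* pick a GPU based on the node-local rank */
    /* ... cudaMalloc, kernel launches and MPI exchanges for this rank ... */
    MPI_Finalize();
    return 0;
}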
Optimizing MPI Communication on GPU  15
PERFORMANCE OF MPI FOR HIGHER- AND SMALLER-ORDER MATRICES
Fig. Performance of MPI for higher-order matrices
Fig. Performance of MPI for smaller-order matrices
Optimizing MPI Communication on GPU  16
• GPUs have become the core of HPC
• NVIDIA GPUs are the most popular, thanks to the CUDA programming model
• MPI can be used on top of CUDA for further acceleration and parallelization across nodes
• MPICH2 is the recent MPI implementation used in this work
• MVAPICH2 provides a CUDA-aware MPI implementation optimized for NVIDIA GPUs
• GPUDirect, including GPUDirect RDMA, can be used for direct GPU-to-GPU transfers
Optimizing MPI Communication on GPU  17
[1] S. Potluri, H. Wang, et al., "Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication", 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[2] Khaled Hamidouche, Sreeram Potluri, et al., "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs", 2013 42nd International Conference on Parallel Processing.
[3] Sampath S., et al., "Performance Evaluation and Comparison of MPI and PVM using a Cluster Based Parallel Computing Architecture", 2013 International Conference on Circuits, Power and Computing Technologies (ICCPCT-2013).
[4] Website referred: https://developer.nvidia.com/mpi-solutions-gpus
Optimizing MPI Communication on GPU  18