2. Introduction To GPU
GPU vs CPU
More About GPU
Introduction to CUDA
Introduction to MPI
Integration of MPI & CUDA
Examples (Matrix Multiplication)
Performance Results
Conclusion and Future Enhancements
Bibliography
3. • What is a GPU?
• Uses a SIMD architecture
• Earlier used exclusively for graphics processing
• Extremely efficient for many types of parallel tasks
• A single chip is now capable of peak single-precision floating point performance of more than 4 TFLOPS
4. • The CPU is the BRAIN of the computer, while the GPU is its SOUL.
• The CPU has a small number of cores; the GPU has many more.
• The CPU can do any task, but not vice versa.
5. • A GPU consists of: streaming multiprocessors (SMs), streaming processors (SPs, the ALUs), global memory (DRAM), local memory, and shared memory.
6. • CUDA (Compute Unified Device Architecture): introduced by NVIDIA in June 2007.
• OpenCL (Open Computing Language): introduced by Apple Inc. in August 2009.
• OpenACC (for Open Accelerators): by Cray, CAPS and NVIDIA.
CUDA PROGRAMMING MODEL (see the sketch below):
• CUDA kernel (__global__)
• Methods to handle memory
• CUDA IPC
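A minimal sketch of the first two items of this programming model (a __global__ kernel plus the usual cudaMalloc/cudaMemcpy memory handling). The kernel name addOne and the buffer names are illustrative assumptions, not taken from the slides:

#include <cuda_runtime.h>
#include <stdlib.h>

/* Illustrative __global__ kernel: each thread increments one element */
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *h_buf = (float *)calloc(n, sizeof(float));    /* host memory */
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));      /* device (global) memory */
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    addOne<<<(n + 255) / 256, 256>>>(d_buf, n);          /* kernel launch */
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}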
7. • CUDA IPC works efficiently for multi-GPU communication within a single system (see the sketch below).
• What about GPU communication across multiple nodes?
The answer is MPI.
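A minimal sketch of the CUDA IPC flow for two processes on the same node; how the handle is exchanged between processes (for example as raw bytes over MPI) is an assumption, and the function names exportBuffer/importBuffer are illustrative:

#include <cuda_runtime.h>

/* Process A: owns the device allocation and exports a handle for it */
void exportBuffer(float *d_buf, cudaIpcMemHandle_t *handle)
{
    cudaIpcGetMemHandle(handle, d_buf);
    /* ... send *handle (sizeof(*handle) bytes) to process B ... */
}

/* Process B: maps A's allocation into its own address space */
void importBuffer(cudaIpcMemHandle_t handle)
{
    void *d_peer;
    cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
    /* d_peer can now be used directly, e.g. as a cudaMemcpy source */
    cudaIpcCloseMemHandle(d_peer);
}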
8. • MPI: a standard interface/specification for parallel programming
– Language independent
– Platform independent
• MPI FUNCTIONS:
#include <mpi.h>
int main(int argc, char *argv[])
{
    int myrank;
    char buf[64];
    int size = 64, tag = 0;
    MPI_Init(&argc, &argv);                    /* Initialize the MPI library */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);    /* Determine the calling process rank */
    if (myrank == 0)                           /* rank 0 sends, rank 1 receives */
        MPI_Send(buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    else if (myrank == 1)
        MPI_Recv(buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
Running: mpirun -n 2 ./app <args>
12. • One address space for all CPU and GPU memory (Unified Virtual Addressing, UVA)
-Determine the physical memory location from a pointer value, as sketched below
-Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
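A hedged sketch of how a library can tell host from device memory under UVA, using cudaPointerGetAttributes; note that the attribute field is named memoryType in older CUDA releases and type in newer ones, and this sketch assumes a newer toolkit. A CUDA-aware MPI can use a check like this internally to decide between a direct GPU path and a host staging path:

#include <cuda_runtime.h>
#include <stdio.h>

/* Report whether a pointer refers to device or host memory under UVA */
void reportLocation(const void *ptr)
{
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err == cudaSuccess && attr.type == cudaMemoryTypeDevice)
        printf("%p is device (GPU) memory\n", ptr);
    else
        printf("%p is host (CPU) memory\n", ptr);
    cudaGetLastError();   /* clear any error raised for plain host pointers */
}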
13. With UVA and CUDA-aware MPI:
//MPI rank 0
MPI_Send(s_buf_d, size, ...);
//MPI rank n-1
MPI_Recv(r_buf_d, size, ...);

No UVA and regular MPI:
//MPI rank 0
cudaMemcpy(s_buf_h, s_buf_d, size, ...);
MPI_Send(s_buf_h, size, ...);
//MPI rank n-1
MPI_Recv(r_buf_h, size, ...);
cudaMemcpy(r_buf_d, r_buf_h, size, ...);

With UVA and a CUDA-aware MPI we do not have to copy the data from the device (GPU) to the host (CPU); a device buffer can be passed directly to MPI and delivered to another device. However, this mainly optimizes communication between multiple GPUs on the same node.
14. __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
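This kernel follows the simple (non-shared-memory) matrix multiplication example from the CUDA C Programming Guide and assumes a row-major Matrix type roughly like the sketch below. The host-side MatMul wrapper and the BLOCK_SIZE value are illustrative assumptions, not taken from the slides:

/* Assumed row-major matrix layout used by MatMulKernel */
typedef struct {
    int width;
    int height;
    float *elements;   /* device pointer when passed to the kernel */
} Matrix;

#define BLOCK_SIZE 16   /* assumed thread-block edge length */

/* Host-side launch sketch: d_A, d_B, d_C already reference device memory */
void MatMul(Matrix d_A, Matrix d_B, Matrix d_C)
{
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((d_C.width  + dimBlock.x - 1) / dimBlock.x,
                 (d_C.height + dimBlock.y - 1) / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    cudaDeviceSynchronize();
}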
void assignDeviceToProcess()
{
    /* Find out which node this rank is running on */
    MPI_Get_processor_name(host_name, &namelen);
    /* Share every rank's node name with all other ranks */
    MPI_Bcast(&(host_names[n]), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, n,
              MPI_COMM_WORLD);
    bytes = nprocs * sizeof(char[MPI_MAX_PROCESSOR_NAME]);
    /* Group the gathered names so ranks sharing a node can be counted and a
       node-local communicator (nodeComm) built; stringCmp is an assumed
       string-comparison helper */
    qsort(host_names, nprocs, sizeof(char[MPI_MAX_PROCESSOR_NAME]), stringCmp);
    /* The rank within the node-local communicator selects this rank's GPU */
    MPI_Comm_rank(nodeComm, &myrank);
    cudaSetDevice(myrank);
}
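A hedged sketch of how assignDeviceToProcess fits into an MPI + CUDA run. The buffer size and the send/receive pattern are illustrative assumptions, and a CUDA-aware MPI build is assumed so that device pointers can be passed to MPI directly:

#include <mpi.h>
#include <cuda_runtime.h>

void assignDeviceToProcess(void);            /* defined above */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    assignDeviceToProcess();                 /* one GPU per rank on each node */

    const int count = 1 << 20;               /* assumed block size per rank */
    float *d_block;
    cudaMalloc((void **)&d_block, count * sizeof(float));

    /* With a CUDA-aware MPI, device pointers go straight into MPI calls */
    if (rank == 0)
        MPI_Send(d_block, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_block, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... launch MatMulKernel on the local block and gather the results ... */

    cudaFree(d_block);
    MPI_Finalize();
    return 0;
}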
15. PERFORMANCE OF MPI FOR HIGHER- AND SMALLER-ORDER MATRICES
Fig. Performance of MPI for higher-order matrices
Fig. Performance of MPI for smaller-order matrices
16. • GPUs have become the core of HPC
• NVIDIA GPUs are the most popular, with their CUDA programming model
• MPI can be used to gain further acceleration and parallelization across nodes
• MPICH2 is the MPI implementation we used
• MVAPICH2, a CUDA-aware MPI implementation, is also being developed
• GPUDirect (RDMA) can be used
17. [1] S. Potluri, H. Wang, et al., "Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication", 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[2] Khaled Hamidouche, Sreeram Potluri, et al., "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs", 2013 42nd International Conference on Parallel Processing.
[3] Sampath S., et al., "Performance Evaluation and Comparison of MPI and PVM using a Cluster Based Parallel Computing Architecture", 2013 International Conference on Circuits, Power and Computing Technologies (ICCPCT-2013).
[4] Website referred: https://developer.nvidia.com/mpi-solutions-gpus