This document discusses optimizing MPI communication for GPUs. It introduces GPUs and their advantages over CPUs for parallel workloads, and notes that MPI is commonly used for communication between GPUs spread across multiple nodes. It examines how MPI and CUDA can be integrated for GPU-to-GPU communication, illustrated with a matrix multiplication example, and evaluates MPI performance on both higher- and lower-order matrices. It concludes that GPUs are becoming central to high performance computing and that MPI can provide further acceleration when combined with technologies such as GPUDirect RDMA.
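As a rough illustration of what "integrating MPI and CUDA" means in practice, the sketch below shows a CUDA-aware MPI exchange, where device pointers are passed directly to MPI calls instead of being staged through host memory. This is a minimal sketch under assumptions not taken from the document: a CUDA-aware MPI build (e.g. Open MPI with UCX), at least two ranks each owning a GPU, and an arbitrary message size. It requires GPU hardware to run.

```c
/* Sketch: CUDA-aware MPI pairwise exchange (illustrative assumptions,
 * not the document's exact code). Build with: mpicc -lcudart ... and
 * run with at least 2 ranks, one GPU per rank. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                 /* 1M floats per message (assumed) */
    float *d_send, *d_recv;                /* device buffers */
    cudaMalloc((void **)&d_send, n * sizeof(float));
    cudaMalloc((void **)&d_recv, n * sizeof(float));

    /* With a CUDA-aware MPI, device pointers go straight into
     * MPI_Sendrecv; with GPUDirect RDMA the network adapter can read
     * and write GPU memory directly, avoiding a host-memory bounce. */
    int peer = rank ^ 1;                   /* pair ranks 0<->1, 2<->3, ... */
    if (peer < size) {
        MPI_Sendrecv(d_send, n, MPI_FLOAT, peer, 0,
                     d_recv, n, MPI_FLOAT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

Without a CUDA-aware MPI, the same exchange would require explicit `cudaMemcpy` calls to and from host staging buffers around each MPI call, which is exactly the overhead GPUDirect RDMA is designed to remove.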