This document discusses optimizing MPI communication on GPUs. It introduces GPUs and their advantages over CPUs for parallel tasks, and notes that MPI is commonly used for communication between GPUs across multiple nodes. The document examines integrating MPI and CUDA for GPU communication, provides a matrix multiplication example, and evaluates MPI performance on higher- and lower-order matrices. It concludes that GPUs are becoming core to high-performance computing, and that MPI can provide further acceleration when combined with technologies such as GPUDirect RDMA.
2. Introduction To GPU
GPU vs CPU
More About GPU
Introduction to CUDA
Introduction to MPI
Integration of MPI & CUDA
Examples (Matrix Multiplication)
Performance Results
Conclusion and Future Enhancements
Bibliography
Optimizing MPI-On GPU Communication
3. • What is a GPU?
• Uses a SIMD architecture
• Earlier used exclusively for graphics processing
• Extremely efficient for many types of parallel tasks
• A single chip is now capable of peak single-precision floating-point performance of more than 4 TFLOPS
4. • The CPU is the brain of the computer, while the GPU is its soul
• The CPU has a small number of cores; the GPU has many
• The CPU can perform any task, but the GPU cannot
5. • A GPU consists of streaming multiprocessors (SMs), streaming processors (SPs, the ALUs), global memory (DRAM), local memory, and shared memory
6. • CUDA (Compute Unified Device Architecture), introduced by NVIDIA in June 2007
• OpenCL (Open Computing Language), introduced by Apple Inc. in August 2009
• OpenACC (for Open Accelerators), by Cray, CAPS, and NVIDIA
CUDA PROGRAMMING MODEL
• CUDA kernels (__global__)
• Methods to handle memory
• CUDA IPC
7. • CUDA IPC works efficiently for multi-GPU communication within the same system
• What about GPU communication across multiple nodes? The answer is MPI
8. • Standard interface/specification for parallel programming
– Language independent
– Platform independent
• MPI FUNCTIONS
#include <mpi.h>
int main(int argc, char *argv[])
{
    int myrank, tag = 0, size = 1024;
    char s_buf[1024], r_buf[1024];
    MPI_Init(&argc, &argv);                  /* Initialize the MPI library */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* Determine the calling process rank */
    if (myrank == 0)
        MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    else if (myrank == 1)
        MPI_Recv(r_buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
Running: mpirun -n 2 ./app <args>
12. • One address space for all CPU and GPU memory
– Determine the physical memory location from a pointer value
– Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
13. With UVA and CUDA-aware MPI:
// MPI rank 0
MPI_Send(s_buf_d, size, …);
// MPI rank n-1
MPI_Recv(r_buf_d, size, …);
Without UVA, with regular MPI:
// MPI rank 0
cudaMemcpy(s_buf_h, s_buf_d, size, …);
MPI_Send(s_buf_h, size, …);
// MPI rank n-1
MPI_Recv(r_buf_h, size, …);
cudaMemcpy(r_buf_d, r_buf_h, size, …);
By using UVA we do not have to copy data from the device (GPU) to the host (CPU); we can pass a device buffer directly to another GPU.
It can also optimize communication between multiple GPUs on the same node.
14. __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
void assignDeviceToProcess()
{
    MPI_Get_processor_name(host_name, &namelen);
    MPI_Bcast(&(host_names[n]), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, n,
              MPI_COMM_WORLD);
    bytes = nprocs * sizeof(char[MPI_MAX_PROCESSOR_NAME]);
    mat_mul(host_names, nprocs, sizeof(char[MPI_MAX_PROCESSOR_NAME]));
    MPI_Comm_rank(nodeComm, &myrank);
    cudaSetDevice(myrank);  /* one GPU per MPI rank on the node */
}
15. PERFORMANCE OF MPI FOR HIGHER AND SMALLER ORDER MATRICES
Fig. Performance of MPI for higher order matrices
Fig. Performance of MPI for smaller order matrices
16. • GPUs are becoming the core of HPC
• NVIDIA GPUs are the most popular, with their CUDA programming model
• MPI can be used for further acceleration and parallelization
• MPICH2 is the recent MPI implementation that we used
• MVAPICH2 (developed at The Ohio State University) provides CUDA-aware MPI
• GPUDirect RDMA can be used
17. BIBLIOGRAPHY
[1] S. Potluri, H. Wang, et al., "Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication", 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[2] Khaled Hamidouche, Sreeram Potluri, et al., "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs", 2013 42nd International Conference on Parallel Processing.
[3] Sampath S., "Performance Evaluation and Comparison of MPI and PVM using a Cluster Based Parallel Computing Architecture", 2013 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2013].
[4] Website referred: https://developer.nvidia.com/mpi-solutions-gpus