Optimizing GPU to GPU Communication on Cray XK7
When developing an application for Cray XK7 systems, optimization of compute kernels is only a small part of maximizing scaling and performance. Programmers must consider the effect of the GPU's distinct address space and the PCIe bus on application scalability. Without such considerations, applications rapidly become limited by transfers to and from the GPU and fail to scale to large numbers of nodes. This paper will demonstrate methods for optimizing GPU to GPU communication and present XK7 results for these methods.

This presentation was originally given at CUG 2013.

  1. Optimizing GPU to GPU Communication on Cray XK7. Jeff Larkin.
  2. What Amdahl says about GPU communication. If you make your GPU computation infinitely fast, performance will be bound by your communication. GPU-to-GPU communication has higher latency (an additional hop over PCIe) and lower bandwidth (limited by the lowest-bandwidth link). G2G communication cannot be an afterthought when running at scale.
  3. How do GPUs communicate? [Diagram: data on GPU0 and GPU1, with CPU0 calling MPI_Send() and CPU1 calling MPI_Recv() over MPI.] But what really happens here?
  4. Until recently… [Diagram: each rank first stages data between GPU and host memory with cudaMemcpy(), then CPU0 calls MPI_Send() and CPU1 calls MPI_Recv() on the host buffers.]
  5. Unified Virtual Addressing. [Diagram: without UVA, system memory (CPU) and GPU memory are separate address spaces, each spanning 0x0000-0xFFFF, connected by PCIe; with UVA, CPU and GPU memory share a single address space.]
  6. Unified Virtual Addressing. One address space for all CPU and GPU memory: the physical memory location can be determined from a pointer value, which enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy). Supported on Tesla GPUs starting with Fermi, for 64-bit applications on Linux and Windows TCC. [Diagram: a host buffer reaches the pinned fabric buffer via memcpy, while a GPU buffer reaches it via cudaMemcpy.]
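To make the point concrete, here is a small hypothetical sketch (not from the talk) of what UVA lets a library do: ask the CUDA runtime whether an arbitrary pointer refers to host or device memory, which is exactly the check a CUDA-aware MPI needs before deciding whether a buffer must be staged. The buffer names are illustrative, and the memoryType field follows the CUDA 5-era API; newer toolkits rename it.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Report where a pointer lives, using the UVA pointer-attribute query. */
    static const char *where_is(const void *ptr)
    {
        struct cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess)
            return "unknown to the CUDA runtime";
        return (attr.memoryType == cudaMemoryTypeDevice) ? "device memory"
                                                         : "host memory";
    }

    int main(void)
    {
        float *d_buf, *h_buf;
        cudaMalloc((void **)&d_buf, 16 * sizeof(float));      /* device allocation */
        cudaMallocHost((void **)&h_buf, 16 * sizeof(float));  /* pinned host allocation */
        printf("d_buf points to %s\n", where_is(d_buf));      /* device memory */
        printf("h_buf points to %s\n", where_is(h_buf));      /* host memory */
        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }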
  7. MPI+CUDA. With UVA and CUDA-aware MPI:
       // MPI rank 0
       MPI_Send(s_buf_d, size, …);
       // MPI rank n-1
       MPI_Recv(r_buf_d, size, …);
     Without UVA, with regular MPI:
       // MPI rank 0
       cudaMemcpy(s_buf_h, s_buf_d, size, …);
       MPI_Send(s_buf_h, size, …);
       // MPI rank n-1
       MPI_Recv(r_buf_h, size, …);
       cudaMemcpy(r_buf_d, r_buf_h, size, …);
     CUDA-aware MPI makes MPI+CUDA easier.
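Spelled out a little further, the two paths look roughly like the sketch below. The buffer names follow the slide; the element count N, the MPI_FLOAT datatype, the tag, and the function names are illustrative assumptions.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define N (1 << 20)   /* assumed element count */

    /* Path 1: with UVA and a CUDA-aware MPI, device pointers go straight to MPI. */
    void exchange_cuda_aware(float *s_buf_d, float *r_buf_d, int rank, int nranks)
    {
        if (rank == 0)
            MPI_Send(s_buf_d, N, MPI_FLOAT, nranks - 1, 0, MPI_COMM_WORLD);
        else if (rank == nranks - 1)
            MPI_Recv(r_buf_d, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Path 2: without UVA, the programmer stages through host buffers by hand. */
    void exchange_plain_mpi(float *s_buf_d, float *r_buf_d,
                            float *s_buf_h, float *r_buf_h, int rank, int nranks)
    {
        if (rank == 0) {
            cudaMemcpy(s_buf_h, s_buf_d, N * sizeof(float), cudaMemcpyDeviceToHost);
            MPI_Send(s_buf_h, N, MPI_FLOAT, nranks - 1, 0, MPI_COMM_WORLD);
        } else if (rank == nranks - 1) {
            MPI_Recv(r_buf_h, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(r_buf_d, r_buf_h, N * sizeof(float), cudaMemcpyHostToDevice);
        }
    }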
  8. CUDA-aware MPI libraries may: use RDMA to completely remove the CPU from the picture; stage GPU buffers through CPU memory automatically; pipeline messages. All the programmer needs to know is that they have passed a GPU pointer to MPI; the library developer can optimize the rest. [Diagram: a GPU0-to-GPU1 transfer shown three ways: direct RDMA, staged through CPU0 and CPU1, and pipelined.]
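The sketch below is not Cray's implementation; it only illustrates the idea behind "stage through CPU memory and pipeline messages": split a large device buffer into chunks and overlap the device-to-host copy of the next chunk with the network send of the current one. The chunk size, buffer names, and double-buffering scheme are assumptions.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK ((size_t)1 << 20)   /* assumed pipeline chunk size in bytes */

    /* Double-buffered pipeline: while chunk i sits on the wire in MPI_Send,
     * chunk i+1 is already being copied device-to-host on the stream. */
    void pipelined_send_d2d(const char *d_buf, size_t bytes, int dest,
                            cudaStream_t stream)
    {
        char *h_buf[2];
        cudaMallocHost((void **)&h_buf[0], CHUNK);   /* pinned staging buffers */
        cudaMallocHost((void **)&h_buf[1], CHUNK);

        size_t sent = 0;
        int cur = 0;
        size_t first = (bytes < CHUNK) ? bytes : CHUNK;
        cudaMemcpyAsync(h_buf[cur], d_buf, first, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);               /* first chunk is resident */

        while (sent < bytes) {
            size_t len = (bytes - sent < CHUNK) ? bytes - sent : CHUNK;
            size_t next = sent + len;
            if (next < bytes) {                      /* overlap: copy the next chunk down */
                size_t next_len = (bytes - next < CHUNK) ? bytes - next : CHUNK;
                cudaMemcpyAsync(h_buf[1 - cur], d_buf + next, next_len,
                                cudaMemcpyDeviceToHost, stream);
            }
            MPI_Send(h_buf[cur], (int)len, MPI_BYTE, dest, 0, MPI_COMM_WORLD);
            cudaStreamSynchronize(stream);           /* next chunk is now resident */
            sent += len;
            cur = 1 - cur;
        }
        cudaFreeHost(h_buf[0]);
        cudaFreeHost(h_buf[1]);
    }

A matching pipelined receive would do the reverse: MPI_Recv one chunk into a pinned buffer while the previously received chunk is copied host-to-device.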
  9. GPU-awareness in Cray MPI. Cray began supporting GPU-awareness in version 5.6.3. It functions on the XK7, but not at optimal performance; it is expected to work very well on the XC30. It must be explicitly enabled via the run-time environment variable MPICH_RDMA_ENABLED_CUDA, and it works with both CUDA and OpenACC. Version 5.6.4 adds a pipelining feature that should help large messages, enabled with MPICH_G2G_PIPELINE.
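Because the feature is off by default, a common failure mode is passing a device pointer to an MPI library that was never told to expect one. A small guard like the hypothetical helper below (an assumption of this write-up, not part of Cray MPI or the talk) can fail fast; call it after MPI_Init in applications that pass GPU pointers to MPI.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    /* Abort early if GPU-awareness was not switched on in the environment. */
    static void require_gpu_aware_mpi(void)
    {
        const char *v = getenv("MPICH_RDMA_ENABLED_CUDA");
        if (v == NULL || atoi(v) == 0) {
            fprintf(stderr, "MPICH_RDMA_ENABLED_CUDA is not set; export it "
                            "(and optionally MPICH_G2G_PIPELINE) before launching "
                            "if GPU pointers will be passed to MPI.\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }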
  10. OMB (OSU Micro-Benchmarks) latency. Host-to-host will always have the lowest latency (fewest hops). Staging through host memory explicitly adds significant latency. The GPU-aware library is able to fall in the middle. Note: the two nodes are on separate blades.
  11. OMB bandwidth. Once again, H2H wins out (probably by a difference of latency). Direct RDMA suffers badly with this benchmark. Setting MPICH_G2G_PIPELINE=64 pipelines messages and opens up more concurrency. (A plot annotation marks roughly where hand-pipelining came in.)
  12. OMB bandwidth with varying pipeline depths. OMB sends messages in a window of 64, so that depth is naturally optimal. It is counter-intuitive that no intermediate values seemed to help. Varying the chunk sizes was also tried, with no benefit.
  13. Optimizing the performance of a message. MPI vendors know how to optimize performance for an interconnect: different approaches for different message sizes, multiple algorithms. Unfortunately, on the XK7, this may not be the optimal approach.
  14. MPI lacks the ability to express dependencies. One way to negate the cost of G2G communication is to overlap it with some computation; restructuring the order of computation may allow such overlapping. Some communication patterns have a natural concurrency that is not easily exploited. [Diagram: a grid of cells exchanging data with its N, S, E, and W neighbors.]
  15. Exploiting communication concurrency. This cannot be expressed in GPU-aware MPI today. [Diagram: for each direction (North, East, South, West), an independent pipeline of Pack, D2H copy, Transfer, H2D copy, and Unpack stages.]
  16. Concurrent messaging pseudocode:
        do i=N,W
          MPI_Irecv(i)
        enddo
        do i=N,W
          packBufferAsync(i)
          copyBufferD2HAsync(i)
        enddo
        while (more to send/recv)
          if Irecv completed
            copyBufferH2DAsync()
            unpackBufferAsync()
          endif
          if D2H completed
            MPI_Isend()
          endif
        done
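A minimal C sketch of this pattern follows, assuming one CUDA stream per direction, hypothetical pack_halo()/unpack_halo() kernel launchers, and pre-allocated device pack buffers plus pinned host staging buffers for each direction. It illustrates the pseudocode above; it is not code from the talk, and error checking is omitted.

    #include <mpi.h>
    #include <cuda_runtime.h>

    enum { NORTH, SOUTH, EAST, WEST, NDIR };
    /* Tag each message with the direction the receiver posted, so a NORTH
     * send matches the neighbor's SOUTH receive. */
    static const int opposite[NDIR] = { SOUTH, NORTH, WEST, EAST };

    /* Assumed kernel launchers that pack/unpack one face of the local grid. */
    extern void pack_halo(int dir, double *d_pack, cudaStream_t s);
    extern void unpack_halo(int dir, double *d_unpack, cudaStream_t s);

    void halo_exchange(double *d_pack[NDIR], double *d_unpack[NDIR],
                       double *h_send[NDIR], double *h_recv[NDIR],
                       const int count[NDIR], const int neighbor[NDIR],
                       cudaStream_t stream[NDIR], MPI_Comm comm)
    {
        MPI_Request recv_req[NDIR], send_req[NDIR];
        cudaEvent_t d2h_done[NDIR];
        int sent[NDIR] = {0}, unpacked[NDIR] = {0};
        int remaining = 2 * NDIR;

        for (int d = 0; d < NDIR; d++) {
            /* Post every receive up front ... */
            MPI_Irecv(h_recv[d], count[d], MPI_DOUBLE, neighbor[d], d, comm,
                      &recv_req[d]);
            /* ... and start each direction's pack + D2H copy on its own stream. */
            cudaEventCreateWithFlags(&d2h_done[d], cudaEventDisableTiming);
            pack_halo(d, d_pack[d], stream[d]);
            cudaMemcpyAsync(h_send[d], d_pack[d], count[d] * sizeof(double),
                            cudaMemcpyDeviceToHost, stream[d]);
            cudaEventRecord(d2h_done[d], stream[d]);
        }

        /* Progress loop: start a send as soon as its D2H copy finishes, and
         * start an H2D copy + unpack as soon as its receive completes. */
        while (remaining > 0) {
            for (int d = 0; d < NDIR; d++) {
                if (!sent[d] && cudaEventQuery(d2h_done[d]) == cudaSuccess) {
                    MPI_Isend(h_send[d], count[d], MPI_DOUBLE, neighbor[d],
                              opposite[d], comm, &send_req[d]);
                    sent[d] = 1;
                    remaining--;
                }
                if (!unpacked[d]) {
                    int done = 0;
                    MPI_Test(&recv_req[d], &done, MPI_STATUS_IGNORE);
                    if (done) {
                        cudaMemcpyAsync(d_unpack[d], h_recv[d],
                                        count[d] * sizeof(double),
                                        cudaMemcpyHostToDevice, stream[d]);
                        unpack_halo(d, d_unpack[d], stream[d]);
                        unpacked[d] = 1;
                        remaining--;
                    }
                }
            }
        }
        MPI_Waitall(NDIR, send_req, MPI_STATUSES_IGNORE);
        for (int d = 0; d < NDIR; d++) {
            cudaStreamSynchronize(stream[d]);   /* wait for the unpack kernels */
            cudaEventDestroy(d2h_done[d]);
        }
    }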
  17. Talks related to this optimization: HOMME, Matt Norman, CUG 2012; S3D, John Levesque, CUG 2013.
  18. Summary. Optimizing kernels will only take you so far as you scale; communication cannot be an afterthought. GPU-aware MPI libraries are becoming available: they are easier to program and can optimize the performance of individual message transfers. Some communication patterns have a natural concurrency that can be exploited to make communication "free", but this takes additional effort.
