Optimizing GPU to GPU Communication on Cray XK7

When developing an application for Cray XK7 systems, optimization of compute kernels is only a small part of maximizing scaling and performance. Programmers must also consider the effect of the GPU’s distinct address space and the PCIe bus on application scalability. Without such considerations, applications rapidly become limited by transfers to and from the GPU and fail to scale to large numbers of nodes. This paper demonstrates methods for optimizing GPU to GPU communication and presents XK7 results for these methods.

This presentation was originally given at CUG 2013.

Published in Technology, Education

  • 1. Optimizing GPU to GPU Communication on Cray XK7
    Jeff Larkin
  • 2. What Amdahl says about GPU communication
    If you make your GPU computation infinitely fast, performance will be bound by your communication.
    GPU-2-GPU communication has
      - Higher latency (additional hop over PCIe)
      - Lower bandwidth (limited by lowest bandwidth link)
    G2G communication cannot be an afterthought when running at scale.
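The Amdahl argument on this slide can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not taken from the talk: the function names `amdahl_speedup` and `amdahl_limit` are hypothetical, and it assumes communication time is unaffected by how fast the GPU kernels run.

```c
#include <assert.h>

/* Amdahl's law for a run whose communication share is not accelerated.
   comm_fraction:   share of the original runtime spent communicating, in (0, 1].
   compute_speedup: factor by which the compute part gets faster (> 0).
   Both names are illustrative; they do not come from the slides. */
double amdahl_speedup(double comm_fraction, double compute_speedup)
{
    assert(comm_fraction > 0.0 && comm_fraction <= 1.0);
    assert(compute_speedup > 0.0);
    double compute_fraction = 1.0 - comm_fraction;
    return 1.0 / (comm_fraction + compute_fraction / compute_speedup);
}

/* Upper bound on overall speedup as compute_speedup grows without limit. */
double amdahl_limit(double comm_fraction)
{
    assert(comm_fraction > 0.0 && comm_fraction <= 1.0);
    return 1.0 / comm_fraction;
}
```

For example, if communication is 10% of the original runtime, a 10x faster kernel yields only about a 5.3x overall speedup, and no kernel speedup can push the total past 10x.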
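  • 3. How do GPUs communicate?
    [Diagram: GPU0 attached to CPU0 and GPU1 attached to CPU1; CPU0 calls MPI_Send() and CPU1 calls MPI_Recv(), with the two CPUs linked by MPI.]
    But what really happens here?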
  • 4. Until recently…
    [Diagram: GPU0 -> cudaMemcpy() -> CPU0 MPI_Send() -> CPU1 MPI_Recv() -> cudaMemcpy() -> GPU1.]
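  • 5. Unified Virtual Addressing
    No UVA: Multiple Memory Spaces
      [Diagram: system memory (CPU) and GPU memory (GPU), connected by PCI-e, each with its own 0x0000 to 0xFFFF address range.]
    UVA: Single Address Space
      [Diagram: system memory (CPU) and GPU memory (GPU), connected by PCI-e, sharing one 0x0000 to 0xFFFF address range.]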
  • 6. Unified Virtual Addressing
      - One address space for all CPU and GPU memory
      - Determine physical memory location from a pointer value
      - Enable libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
      - Supported on Tesla starting with Fermi, for 64-bit applications on Linux and Windows TCC
    [Diagram labels: GPU buffer, cudaMemcpy, host buffer, memcpy, pinned fabric buffer (three instances).]
  • 7. MPI+CUDA
    With UVA and CUDA-aware MPI:
        //MPI rank 0
        MPI_Send(s_buf_d,size,…);
        //MPI rank n-1
        MPI_Recv(r_buf_d,size,…);
    No UVA and regular MPI:
        //MPI rank 0
        cudaMemcpy(s_buf_h,s_buf_d,size,…);
        MPI_Send(s_buf_h,size,…);
        //MPI rank n-1
        MPI_Recv(r_buf_h,size,…);
        cudaMemcpy(r_buf_d,r_buf_h,size,…);
    CUDA-aware MPI makes MPI+CUDA easier.
  • 8. CUDA-Aware MPI Libraries May
      - Use RDMA to completely remove the CPU from the picture
      - Stage GPU buffers through CPU memory automatically
      - Pipeline messages
    All the programmer needs to know is that they have passed a GPU pointer to MPI; the library developer can optimize the rest.
    [Diagrams: three transfer paths between GPU0 and GPU1, one of them passing through CPU0 and CPU1.]
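To see why a library might pipeline a staged message, it helps to compare the two cases with a toy cost model. The sketch below is an illustration only, not how any particular MPI library works: the function names are hypothetical, it models just two stages (a device-to-host copy followed by a network send), and it ignores per-chunk latency and overhead, which in practice limit how finely a message can usefully be split.

```c
#include <assert.h>

/* Time to push `bytes` through two stages back to back, with no overlap:
   first the whole copy, then the whole send. Bandwidths are in bytes per
   unit time; all names here are illustrative. */
double staged_time(double bytes, double bw_copy, double bw_net)
{
    assert(bytes >= 0.0 && bw_copy > 0.0 && bw_net > 0.0);
    return bytes / bw_copy + bytes / bw_net;
}

/* Same transfer split into `chunks` equal pieces, where the copy of chunk
   k+1 overlaps the send of chunk k. */
double pipelined_time(double bytes, int chunks, double bw_copy, double bw_net)
{
    assert(bytes >= 0.0 && bw_copy > 0.0 && bw_net > 0.0);
    assert(chunks >= 1);
    double chunk  = bytes / chunks;
    double t_copy = chunk / bw_copy;
    double t_net  = chunk / bw_net;
    double slow   = t_copy > t_net ? t_copy : t_net;
    /* Fill the pipeline once, then the slower stage sets the pace. */
    return t_copy + t_net + (chunks - 1) * slow;
}
```

With made-up numbers (64 bytes, copy bandwidth 4, network bandwidth 8), the unpipelined transfer takes 24 time units and a 4-chunk pipeline takes 18. These values only illustrate the model and are not XK7 measurements.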
  • 9. GPU-Awareness in Cray MPI
      - Cray began supporting GPU-awareness in 5.6.3
          - Functions on XK7, but not optimally performing
          - Expected to work very well on XC30
          - Must be explicitly enabled via the run-time environment variable MPICH_RDMA_ENABLED_CUDA
          - Works with both CUDA and OpenACC
      - Version 5.6.4 adds a pipelining feature that should help large messages
          - Enabled with MPICH_G2G_PIPELINE
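Because GPU-awareness has to be switched on at run time, an application that passes device pointers to MPI may want to check its environment at startup. The following is a sketch under stated assumptions: the slides name the two variables but do not say how their values are interpreted, so the rule used here (unset, empty, or "0" means off) and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Interpret an on/off environment value.
   ASSUMPTION: NULL (unset), "" and "0" mean disabled; anything else enabled. */
int env_flag_enabled(const char *value)
{
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Parse a positive integer setting; returns 0 if unset or not a clean number. */
long pipeline_setting(const char *value)
{
    if (value == NULL)
        return 0;
    char *end;
    long n = strtol(value, &end, 10);
    return (end != value && *end == '\0' && n > 0) ? n : 0;
}

/* Returns 1 if GPU-aware transfers appear to be requested for this process. */
int g2g_requested(void)
{
    return env_flag_enabled(getenv("MPICH_RDMA_ENABLED_CUDA"));
}

/* Returns the requested MPICH_G2G_PIPELINE value, or 0 if none was given. */
long g2g_pipeline_requested(void)
{
    return pipeline_setting(getenv("MPICH_G2G_PIPELINE"));
}
```

A program could call `g2g_requested()` before its first MPI call and fall back to explicit host staging, or print a warning, when it returns 0.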
  • 10. OMB Latency
      - Host-to-Host will always have the lowest latency (fewest hops)
      - Staging through host memory explicitly adds significant latency
      - The GPU-aware library is able to fall in the middle
    Note: 2 nodes on separate blades.
  • 11. OMB Bandwidth
      - Once again, H2H wins out (probably by a difference of latency)
      - Direct RDMA suffers badly with this benchmark
      - Setting MPICH_G2G_PIPELINE=64 pipelines messages and opens up more concurrency
    [Chart annotation: "Hand-pipelining came ~here".]
  • 12. OMB Bandwidth, Varying Pipelines
      - OMB sends messages in a window of 64, so that is naturally optimal
      - Counter-intuitive that no intermediate values seemed to help
      - Additionally tried varying chunk sizes with no benefit
  • 13. Optimizing Performance of a Message
      - MPI vendors know how to optimize performance for an interconnect
          - Different approaches for different message sizes
          - Multiple algorithms
      - Unfortunately, on the XK7, this may not be the optimal approach
  • 14. MPI Lacks the Ability to Express Dependencies
      - One way to negate the cost of G2G communication is to overlap it with computation
      - Restructuring the order of computation may allow such overlapping
      - Some communication patterns have a natural concurrency that isn’t easily exploited
    [Diagram: a cell with N, S, E and W neighbors. Caption: Exchange cells with N, S, E, W neighbors.]
  • 15. Exploiting Communication Concurrency
    This cannot be expressed in GPU-aware MPI today.
      North: Pack -> D2H -> Transfer -> H2D -> Unpack
      East:  Pack -> D2H -> Transfer -> H2D -> Unpack
      South: Pack -> D2H -> Transfer -> H2D -> Unpack
      West:  Pack -> D2H -> Transfer -> H2D -> Unpack
  • 16. Concurrent Messaging Pseudocode
        do i=N,W
          MPI_Irecv(i)
        enddo
        do i=N,W
          packBufferAsync(i)
          copyBufferD2HAsync(i)
        enddo
        while(more to send/recv)
          if Irecv completed
            copyBufferH2DAsync
            unpackBufferAsync()
          endif
          if D2H completed
            MPI_Isend()
          endif
        done
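The control flow of this pseudocode can be sketched as a small polling loop. The version below is a simulation only, written under loud assumptions: it contains no MPI or CUDA calls, the `Halo` struct and `poll_exchange` function are hypothetical, and completion is faked with tick counters. In a real code the two completion checks would presumably be non-blocking queries such as `MPI_Test` on the receive request and `cudaEventQuery` on an event recorded after the D2H copy.

```c
#include <assert.h>
#include <stdbool.h>

enum { NORTH, SOUTH, EAST, WEST, NUM_DIRS };

/* Per-direction state. The tick counters stand in for real completion
   queries; every name here is illustrative. */
typedef struct {
    int  d2h_ticks_left;   /* passes until pack + D2H copy is finished */
    int  recv_ticks_left;  /* passes until the posted receive completes */
    bool send_posted;      /* set where MPI_Isend() would be called */
    bool unpack_started;   /* set where H2D copy + unpack would be queued */
} Halo;

/* Poll all directions until each has posted its send and started its
   unpack. Returns the number of polling passes that were needed. */
int poll_exchange(Halo halo[NUM_DIRS])
{
    int passes = 0;
    int remaining = 0;

    /* Count the actions still outstanding: one send, one unpack each. */
    for (int i = 0; i < NUM_DIRS; i++)
        remaining += !halo[i].send_posted + !halo[i].unpack_started;

    while (remaining > 0) {                 /* while (more to send/recv) */
        passes++;
        for (int i = 0; i < NUM_DIRS; i++) {
            Halo *h = &halo[i];
            /* Advance simulated time for this direction. */
            if (h->recv_ticks_left > 0) h->recv_ticks_left--;
            if (h->d2h_ticks_left > 0)  h->d2h_ticks_left--;

            if (h->recv_ticks_left == 0 && !h->unpack_started) {
                h->unpack_started = true;   /* if Irecv completed */
                remaining--;
            }
            if (h->d2h_ticks_left == 0 && !h->send_posted) {
                h->send_posted = true;      /* if D2H completed */
                remaining--;
            }
        }
    }
    return passes;
}
```

Because each direction is advanced whenever its own dependency is satisfied, the loop ends as soon as the slowest single operation completes; no direction waits for another direction's copy or message.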
  • 17. Talks Related to this Optimization
      - HOMME - Matt Norman - CUG2012
      - S3D - John Levesque - CUG2013
  • 18. Summary
      - Optimizing kernels will only take you so far as you scale; communication cannot be an afterthought
      - GPU-aware MPI libraries are becoming available
          - Easier to program
          - Can optimize performance of individual message transfers
      - Some communication patterns have a natural concurrency that can be exploited to make communication “free”, but this takes additional effort