4. What
• A GPU-side library for performing RDMA directly from GPU kernels
Why
• To improve communication performance between distributed GPUs
Results
• 5 µsec GPU-to-GPU communication latency and up to 50 Gbps transfer bandwidth
5. Graphics Processing Unit - GPU
• Accelerators are added alongside a CPU to accelerate specific compute functions
• Graphics → general-purpose computing
• Thousands of cores
• High memory throughput
• Fully programmable
6. GPU Hardware
[Figure: a multicore CPU (a few MIMD cores plus host memory) alongside a GPU (an execution manager, many SIMD multiprocessors each containing dozens of cores, a shared level-2 cache, device memory, and a DMA/control path to the host)]
8. GPU execution time
• Execution time:
1. Computation time
2. Communication time
• Improve execution time by:
• Overlapping communication with computation (see the sketch below)
• Adding more GPUs
[Figure: adding GPUs decreases the computation time while the communication time stays constant]
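The overlap point is the motivation for everything that follows: if transfers can proceed while the GPU computes, their cost is hidden. As a generic illustration (not from the talk), a minimal CUDA sketch of overlapping copies with computation using two streams:

/* Minimal sketch: overlap host<->device transfers with kernel work
 * by splitting the data into two chunks, one per CUDA stream. */
#include <cuda_runtime.h>

__global__ void compute(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;               /* stand-in for real work */
}

int main(void)
{
    const int N = 1 << 20, CHUNK = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   /* pinned memory enables async copies */
    cudaMalloc(&d, N * sizeof(float));
    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

    /* Chunk k's copies and kernel run in stream k; the two streams
     * overlap, hiding one chunk's transfer behind the other's compute. */
    for (int k = 0; k < 2; ++k) {
        float *hp = h + k * CHUNK, *dp = d + k * CHUNK;
        cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        compute<<<(CHUNK + 255) / 256, 256, 0, s[k]>>>(dp, CHUNK);
        cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}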
9. Evolution of GPU-HCA interaction: Naive Version
[Figure: GPU and GPU RAM on one side; CPU, CPU RAM, and NIC on the other; both the data path and the control path are staged through the CPU and its RAM]
10. Evolution of GPU-HCA interaction: GPUDirect RDMA
[Figure: Naive Version next to GPUDirect RDMA; GPUDirect RDMA replaces the staged data path with a direct data path between GPU memory and the NIC, while the control path still runs through the CPU]
11. Evolution of GPU-HCA interaction: GPUrdma
[Figure: Naive Version, GPUDirect RDMA, and GPUrdma side by side; GPUrdma adds a direct control path from the GPU to the NIC on top of GPUDirect RDMA's direct data path]
19. InfiniBand Background
1. Queue Pair (QP) buffer
• Send queue
• Receive queue
2. Work Queue Element (WQE)
• Contains communication instructions
3. Completion Queue (CQ) buffer
• Contains completion elements
4. Completion Queue Element (CQE)
• Contains information about completed jobs
[Figure: a task enters through the Verbs API as WQEs in the QP buffer; the NIC consumes them and reports completions as CQEs in the CQ buffer]
20. InfiniBand Background
1. Write a work queue element to the QP buffer
2. Ring the doorbell
3. Check the completion queue element status
Ring the doorbell to execute jobs:
• The doorbell is an MMIO address
• Writing it informs the NIC about new jobs
[Figure: the control path runs from the CPU through the QP/CQ buffers in CPU memory to the NIC; the NIC itself moves the data on the data path]
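On the host, these three steps hide behind two standard libibverbs calls: ibv_post_send() builds the WQE in the QP buffer and rings the doorbell, and ibv_poll_cq() checks for CQEs. A minimal sketch (context, PD/MR, QP/CQ creation, and connection setup are omitted, and the qp, cq, and key arguments are assumed to be initialized already):

/* Post one signaled RDMA write and wait for its completion. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq,
                    void *buf, uint32_t len, uint32_t lkey,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write       */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask the NIC for a CQE */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Steps 1 + 2: write the WQE to the QP buffer and ring the doorbell. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Step 3: poll the CQ until the CQE for this job shows up. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                        /* busy-wait for completion */
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}

GPUrdma moves exactly this sequence into GPU kernels, which is why the QP/CQ buffers and the doorbell must become reachable from GPU code (slides 26 and 27).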
22. GPUrdma Node
• Direct path for data exchange
• Direct HCA control from GPU kernels
• No CPU intervention
[Figure: a native GPU node; the QP, CQ, and data buffers live in GPU memory, and both the control path and the data path run directly between the GPU and the NIC]
26. GPUrdma Implementation
1. Move the QP and CQ to GPU memory
Modify the InfiniBand Verbs library:
• ibv_create_qp()
• ibv_create_cq()
[Figure: QP, CQ, and data buffers placed in GPU memory; the data path runs between GPU memory and the NIC]
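The change lives inside the verbs library: the queue buffers that ibv_create_qp()/ibv_create_cq() would normally allocate in host memory are placed in GPU memory instead and mapped for the HCA via GPUDirect RDMA, so a kernel can treat them as ordinary device buffers. A purely hypothetical device-side view of the resulting structures (the paper's actual layout is not shown in the talk; the 64-byte WQE size is from the editor's notes):

/* Hypothetical GPU-resident queue descriptors (illustrative only). */
#include <stdint.h>

enum { WQE_SIZE = 64 };      /* each WQE is 64 bytes, per the notes      */

struct gpu_qp {
    uint8_t *wqe_ring;       /* send-queue ring, allocated in GPU memory */
    uint32_t wqe_cnt;        /* number of WQE slots                      */
    uint32_t post_idx;       /* next slot a GPU thread fills             */
};

struct gpu_cq {
    uint8_t *cqe_ring;       /* completion ring, also in GPU memory      */
    uint32_t cqe_cnt;        /* number of CQE slots                      */
    uint32_t poll_idx;       /* next slot a GPU thread polls             */
};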
27. GPUrdma Implementation
2. Map the HCA doorbell address into GPU address space
[Figure: same placement as before, with the doorbell MMIO address now reachable from GPU code]
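Per the editor's notes, the NIC exposes the doorbell address register on PCIe, the GPU driver is patched to map the doorbell BAR address, and the CUDA-side mapping uses cudaHostRegister. A minimal sketch of that last step, assuming db_mmio already points at the mmap'ed doorbell page (obtaining it is what the driver patch provides):

#include <cuda_runtime.h>
#include <stddef.h>

/* Make the HCA doorbell MMIO page writable from GPU kernels. */
void *map_doorbell_for_gpu(void *db_mmio, size_t page_size)
{
    void *db_dev = NULL;
    /* Register the I/O page with CUDA and ask for a device pointer. */
    cudaHostRegister(db_mmio, page_size,
                     cudaHostRegisterIoMemory | cudaHostRegisterMapped);
    cudaHostGetDevicePointer(&db_dev, db_mmio, 0);
    return db_dev;   /* GPU threads ring the doorbell by writing here */
}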
41. GPUrdma – 30 QPs
• 1 QP per block vs. 30 QPs per block
[Chart: throughput with 30 QPs per block, 1 QP per block, and the CPU with 30 QPs]
42. Scalability – Optimal QP/CQ location:
1. QP and CQ in GPU memory
2. QP in GPU and CQ in system memory
3. CQ in GPU and QP in system memory
4. QP and CQ in system memory
[Figure: QP, CQ, and data all placed in GPU memory]
43. Scalability – Optimal QP/CQ location:
1. QP and CQ in GPU memory
2. QP in GPU and CQ in system memory
3. CQ in GPU and QP in system memory
4. QP and CQ in system memory
[Figure: QP, CQ, and data in GPU memory]
44. Scalability – Optimal QP/CQ location:
1. QP and CQ in GPU memory
2. QP in GPU and CQ in system memory
3. CQ in GPU and QP in system memory
4. QP and CQ in system memory
[Figure: QP and data in GPU memory, CQ in CPU memory]
45. Scalability – Optimal QP/CQ location:
1. QP and CQ in GPU memory
2. QP in GPU and CQ in system memory
3. CQ in GPU and QP in system memory
4. QP and CQ in system memory
[Figure: data in GPU memory, QP and CQ in CPU memory]
46. Optimal QP/CQ location:
o Throughput: no difference
o Latency:

Transfer latency [µsec]
             QP in CPU   QP in GPU
CQ in CPU       8.6         6.2
CQ in GPU       6.8         4.8
47. Limitations
Symptom:
GPUrdma may observe stale data or data that arrives out of order
Scenario:
intensive RDMA writes to GPU memory
Good news:
NVIDIA announced a CUDA 8 feature that enables consistent updates
Suggested fix:
a CRC32 integrity-check API for error detection (see the sketch below)
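The talk does not show the CRC32 API itself; as a hypothetical illustration, the sender could append a CRC32 of the payload to each message, and the GPU consumer would recompute it before trusting a buffer that the NIC may still be writing:

/* Hypothetical sketch of the suggested integrity check (CUDA device code). */
#include <stdint.h>
#include <stddef.h>

__device__ uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)   /* bitwise CRC-32, polynomial 0xEDB88320 */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* True once the payload matches its trailing checksum, i.e. the
 * RDMA write has fully and correctly landed. */
__device__ bool message_ready(const uint8_t *msg, size_t payload_len)
{
    uint32_t expected = (uint32_t)msg[payload_len]
                      | ((uint32_t)msg[payload_len + 1] << 8)
                      | ((uint32_t)msg[payload_len + 2] << 16)
                      | ((uint32_t)msg[payload_len + 3] << 24);
    return crc32(msg, payload_len) == expected;
}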
48. Interim results summary
• 5 µsec GPU-to-GPU communication latency
• GPUrdma outperforms CPU RDMA for small packets by a factor of 4.5×
• 50 Gbps transfer bandwidth for messages of 16 KB and larger
50. GPI2 for GPUs:
GPI – a framework that implements the Partitioned Global Address Space (PGAS) model
GPI2 – extends this global address space to GPU memory
[Figure: host and GPU nodes, each with their own threads and local memory, all contributing slices of one global memory that spans both host and GPU memory]
51. GPI2 code example
CPU node:
gaspi_segment_create(CPU_MEM)
initialize data
gaspi_write_notify
gaspi_notify_waitsome
gaspi_proc_term

GPU node:
gaspi_segment_create(GPU_MEM)
gaspi_notify_waitsome
GPU_compute_data<<<...>>>()
gaspi_write_notify
gaspi_proc_term
52. GPI2 using GPUrdma
CPU node:
gaspi_segment_create(CPU_MEM)
initialize data
gaspi_write_notify
gaspi_notify_waitsome
gaspi_proc_term

GPU node:
gaspi_segment_create(GPU_MEM)
GPU_start_kernel<<<...>>> {
  gpu_gaspi_notify_waitsome
  compute_data()
  gpu_gaspi_write_notify
}
gaspi_proc_term
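The difference from slide 51 is that the GPU node launches one persistent kernel that waits, computes, and sends entirely on the GPU, with no CPU round-trip per message. A sketch of such a kernel loop (the gpu_gaspi_* device calls are named on the slide; the prototypes and arguments used here are assumptions for illustration):

/* Assumed prototypes for the device-side GASPI calls named on the slide. */
__device__ void gpu_gaspi_notify_waitsome(int segment, int notify_id);
__device__ void gpu_gaspi_write_notify(int segment, int remote_rank,
                                       int notify_id, int queue);
__device__ void compute_data(void);

/* Persistent GPUrdma kernel: launched once, loops over incoming batches. */
__global__ void GPU_start_kernel(int iterations)
{
    for (int it = 0; it < iterations; ++it) {
        /* Block until the CPU node's gaspi_write_notify arrives. */
        gpu_gaspi_notify_waitsome(/*segment=*/0, /*notify_id=*/it);

        compute_data();                    /* process the batch that landed */

        /* Push the result back and notify, all from inside the kernel. */
        gpu_gaspi_write_notify(/*segment=*/0, /*remote_rank=*/0,
                               /*notify_id=*/it, /*queue=*/0);
    }
}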
54. GPUrdma Multi-Matrix vector product
CPU node:
start timer
gaspi_write_notify
gaspi_notify_waitsome
stop timer

GPU node:
gpu_notify_waitsome
Matrix_compute()
gpu_write_notify

• System throughput in millions of 32x1 vector multiplications per second as a function of the batch size:

Batch size [vectors]   GPI2   GPUrdma
480                     2.6      11.7
960                     4.8      18.8
1920                    8.4      25.2
3840                   13.9      29.1
7680                   19.9      30.3
15360                  24.3      31.5
55. Related works
Lena Oden, Fraunhofer Institute for Industrial Mathematics:
• Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU
• Analyzing Put/Get APIs for Thread-collaborative Processors
Mark Silberstein, Technion – Israel Institute of Technology:
• GPUnet: networking abstractions for GPU programs
• GPUfs: Integrating a file system with GPUs
56. Future works
• GPUrdma for new accelerators (Tesla K80)
• Integrating GPUrdma with GPUnet
• Integrating GPUrdma with NCCL (a library of standard collective communication routines)
New research using GPUrdma:
• Exploiting advances in GPU I/O abilities to build scalable deep-learning training systems
Thanks
Editor's Notes
• New streaming multiprocessor architecture called SMX
• No persistent kernels
• Query of random numbers: 1 – even, 0 – odd
• Ability to send sparse data without the need to stop the kernel
• All the threads invoke the RDMA command at the same time
• Queue-based communication standard
• The importance of CQ polling
• Brief description of how InfiniBand networking works
• RDMA is the ability to transfer data directly between applications over the network with no operating-system involvement
• GPUrdma model
• Move the data path – GPUDirect RDMA
• Modify the HCA libraries to support GPU buffers; now we are able to write jobs from GPU kernels
• Patch the GPU driver to map the doorbell BAR address
• The NIC exposes the doorbell address register on PCIe
• Same as before, but with cudaHostRegister
• The kernel represents every physical page in the system with a struct page
• Base version – bad performance, low bandwidth
• CPU peak performance at 2K, GPU at 4K
• A single QP is not enough
• QPs are point-to-point; for big clusters we will need a large number of QPs, which will consume most of the GPU memory (16 GB); each WQE is 64 bytes
• In-sequence RDMA writes to GPU memory: data can arrive out of order
• PGAS is a parallel programming model: it assumes a global memory address space that is logically partitioned, with a portion of it local to each process; all nodes have transparent access to the whole global address space
• GPUnet and GPUfs show that thread-block-wise operation works well
• GPUfs – direct access to files from a GPU kernel; GPUnet – a socket API for programs running on the GPU