High-performance, low-latency network over GPU
Feras Daoud
Technion – Israel Institute of Technology
Advisor: Prof. Mark Silberstein
Agenda
1. Introduction
2. InfiniBand Background
3. GPUrdma
4. GPUrdma Evaluation
5. GPI2
What
• A GPU-side library for performing RDMA directly from
GPU kernels
Why
• To improve communication performance between distributed
GPUs
Results
• 5 µsec GPU-to-GPU communication latency and up to 50 Gbps
transfer bandwidth
Graphics Processing Unit (GPU)
• Accelerators are added alongside a CPU to accelerate
specific compute functions
• Graphics → General-purpose computing
• Thousands of cores
• High memory throughput
• Fully programmable
GPU Hardware
[Figure: block diagram of a multicore CPU with host memory and a GPU with an execution manager, DMA control, multiprocessors made of cores (C), a level 2 cache, and device memory; labels: MIMD, SIMD]
CUDA compute platform
[Figure: a CUDA program spans the CPU and the GPU; label: Invoke]
GPU execution time
• Execution Time:
1. Computation time
2. Communication time
• Improve execution time by:
• Overlap communication with computation
• Add more GPUs
[Figure: computation and communication bars; adding GPUs decreases computation time while communication time stays constant]
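The trade-off on this slide can be illustrated with a toy timing model. The sketch below is an illustration only, not a measurement from the talk: it assumes computation divides evenly across GPUs while communication time stays constant, and that perfect overlap hides the shorter of the two phases. The function names and the example numbers are hypothetical.

```python
def serial_time(compute, comm, num_gpus=1):
    """Execution time when communication follows computation."""
    return compute / num_gpus + comm

def overlapped_time(compute, comm, num_gpus=1):
    """Execution time when communication fully overlaps computation."""
    return max(compute / num_gpus, comm)

# Hypothetical workload: 80 time units of computation, 20 of communication.
for n in (1, 2, 4, 8):
    print(n, serial_time(80, 20, n), overlapped_time(80, 20, n))
```

In this model the computation term shrinks as GPUs are added but the communication term does not, so communication eventually dominates the execution time.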
Evolution of GPU-HCA interaction:
[Figure: three node diagrams, each with GPU, GPU RAM, CPU, CPU RAM, and NIC]
• Naive version: data path and control path, neither of them direct
• GPUDirect RDMA: direct data path
• GPUrdma: direct data path and direct control path
Motivations
GPUDirect RDMA Node
CPU_rdma_read()
GPU_kernel<<<>>> {
    GPU_Compute()
}
CPU_rdma_write()
• CPU overhead
• Bulk-synchronous design and explicit pipelining
[Figure: timeline of communication and computation phases]
• Multiple GPU kernel invocations
  1. Kernel call overhead
  2. Inefficient shared memory usage
Sparse data
CPU_rdma_read()
GPU_kernel<<<>>> {
    Compute_Even()
}
CPU_rdma_write()
[Figure: data 0 3 5 2 5 1 8 with selection mask 1 0 0 1 0 0 1; offsets of the selected elements: 0x0, 0x3, 0x6]
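The sparse-data example can be reproduced in a few lines. This is a sketch under the assumption that Compute_Even() selects the even-valued elements, which matches the mask and offsets shown on the slide. The helper name `select_even` is hypothetical, and the code runs on the host only to illustrate the data layout.

```python
def select_even(data):
    """Return (mask, offsets) for the even-valued elements of data."""
    mask = [1 if value % 2 == 0 else 0 for value in data]
    offsets = [index for index, bit in enumerate(mask) if bit]
    return mask, offsets

data = [0, 3, 5, 2, 5, 1, 8]          # values from the slide
mask, offsets = select_even(data)
print(mask)                            # [1, 0, 0, 1, 0, 0, 1]
print([hex(o) for o in offsets])       # ['0x0', '0x3', '0x6']
```

The offsets are known only after the kernel has run, which is presumably why sending such data is awkward when the CPU issues the RDMA write.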
GPUrdma library
GPUrdma Node
GPU_kernel<<<>>> {
    GPU_rdma_read()
    GPU_Compute()
    GPU_rdma_write()
}
• No CPU intervention
• Overlapping communication and computation
• One kernel call
• Efficient shared memory usage
• Send sparse data directly from the kernel
InfiniBand Background
1. Queue pair buffer (QP)
• Send queue
• Receive queue
2. Work queue element (WQE)
• Contains communication instructions
3. Completion queue buffer (CQ)
• Contains completion elements
4. Completion queue element (CQE)
• Contains information about completed jobs
[Figure: a task is posted through verbs as WQEs in the QP buffer; the NIC reports completion as CQEs in the CQ buffer]
InfiniBand Background
1. Write a work queue element to the QP buffer
2. Ring the doorbell
3. Check the completion queue element status
Ring the doorbell to execute jobs
• MMIO address
• Informs the NIC about new jobs
[Figure: the CPU drives the control path to the QP and CQ in CPU memory; the data path carries the data held in CPU memory]
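The three-step flow above (write a WQE, ring the doorbell, check the CQE) can be sketched with a toy model. This is not the InfiniBand verbs API: `ToyNic` and its methods are hypothetical names, the doorbell is a plain method call rather than an MMIO write, and the work is completed synchronously, whereas a real HCA processes it asynchronously.

```python
from collections import deque

class ToyNic:
    """Toy stand-in for an HCA: not real InfiniBand verbs."""

    def __init__(self):
        self.send_queue = deque()        # QP send queue holding WQEs
        self.completion_queue = deque()  # CQ holding CQEs

    def post_wqe(self, wqe):
        """Step 1: write a work queue element to the QP buffer."""
        self.send_queue.append(wqe)

    def ring_doorbell(self):
        """Step 2: tell the NIC that new WQEs are ready; it executes them."""
        while self.send_queue:
            wqe = self.send_queue.popleft()
            self.completion_queue.append({"wqe_id": wqe["id"], "status": "ok"})

    def poll_cq(self):
        """Step 3: return one CQE, or None if nothing has completed yet."""
        return self.completion_queue.popleft() if self.completion_queue else None

nic = ToyNic()
nic.post_wqe({"id": 1, "opcode": "RDMA_WRITE", "length": 4096})
print(nic.poll_cq())   # None: the doorbell has not been rung yet
nic.ring_doorbell()
print(nic.poll_cq())   # {'wqe_id': 1, 'status': 'ok'}
```

The point of the model is the ordering: a posted WQE has no effect until the doorbell is rung, and the caller learns about completion only by polling the CQ.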
GPUrdma Node
• Direct path for data exchange
• Direct HCA control from GPU kernels
• No CPU intervention
[Figure: native GPU node; the GPU drives the control path to the QP and CQ, and the data path carries the data, all in GPU memory]
GPUrdma Implementation
[Figure: node diagram with CPU, CPU memory, GPU, and GPU memory; initially the QP, CQ, and data are in CPU memory]
Data Path - GPUDirect RDMA
[Figure: the data is in GPU memory; the QP and CQ remain in CPU memory]
1. Move QP, CQ to GPU memory
   Modify InfiniBand Verbs
   • ibv_create_qp()
   • ibv_create_cq()
[Figure: the QP, CQ, and data are all in GPU memory]
2. Map the HCA doorbell address into GPU address space
   Modify NVIDIA driver
CPU ring doorbell
CPU_rdma_write {
    doorbell_Vadd = CPU_get_doorbell_addr()
    CPU_ring_doorbell(doorbell_Vadd)
}
[Figure: the CPU writes over PCIe to the doorbell physical address (doorbell_Padd)]
GPU ring doorbell
doorbell_Vadd = CPU_get_doorbell_addr()
GPU_doorbell_Vadd = cudaHostRegister(doorbell_Vadd)
GPU_kernel<<<>>> {
    GPU_ring_doorbell(GPU_doorbell_Vadd)
}
[Figure: the GPU writes over PCIe to the doorbell physical address (doorbell_Padd)]
cudaHostRegister
• cudaHostRegister(CPU_Vaddr)
• get_user_pages(CPU_Vaddr)
• Output: struct page
[Figure: physical page and its struct page]
Driver Hack
• Allocate fake_buffer in host memory
• cudaHostRegister( fake_buffer )
• Replace the physical address with the doorbell address
GPUrdma Evaluation
• Single QP
• Multiple QP
• Scalability - Optimal QP/CQ location
Setup: NVIDIA Tesla K40c GPU, Mellanox Connect-IB HCA
GPUrdma – 1 thread, 1 QP
• CPU controller vs. GPU controller
[Plot: series CPU, GPU]
GPUrdma – 1 thread, 1 QP
• GPU controller – Optimize doorbell rings
[Plot: series Doorbell Optimized, CPU, GPU]
GPUrdma – 1 thread, 1 QP
• GPU controller – Optimize CQ poll
[Plot: series CQ Optimized, Doorbell Optimized, CPU]
GPUrdma – 32 threads, 1 QP
• GPU controller – Write parallel jobs
[Plot: series Parallel writes, CQ Optimized, CPU]
GPUDirect RDMA
• CPU controller
[Plot: series CPU 30 QPs, CPU 1 QP]
GPUrdma – 30 QPs
• 1 QP per Block vs. 30 QPs per Block
[Plot: series 30 QPs per Block, CPU 30 QPs, 1 QP per Block]
Scalability – Optimal QP/CQ location:
1. QP and CQ in GPU memory
2. QP in GPU and CQ in system memory
3. CQ in GPU and QP in system memory
4. QP and CQ in system memory
[Figure: four variants of the node diagram (CPU, CPU memory, GPU, GPU memory, data path), one per QP/CQ placement; the data stays in GPU memory]
Optimal QP/CQ location:
o Throughput: No difference
o Latency:
Transfer latency [µsec]
             QP in CPU   QP in GPU
CQ in CPU    8.6         6.2
CQ in GPU    6.8         4.8
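The latency table can be read programmatically to pick the best placement. The sketch below only restates the four measured values from the slide; the dictionary layout and variable names are illustrative choices.

```python
# Transfer latency in µsec, keyed by (QP location, CQ location), from the table.
latency_us = {
    ("CPU", "CPU"): 8.6,
    ("GPU", "CPU"): 6.2,
    ("CPU", "GPU"): 6.8,
    ("GPU", "GPU"): 4.8,
}

best = min(latency_us, key=latency_us.get)
worst = max(latency_us, key=latency_us.get)
print("best placement (QP, CQ):", best)
print("worst/best latency ratio:", round(latency_us[worst] / latency_us[best], 2))
```

Placing both the QP and the CQ in GPU memory gives the lowest latency, roughly 1.8 times lower than keeping both in system memory.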
Limitations
Symptom:
GPUrdma may observe stale data or data that arrives out of order
Scenario:
Intensive RDMA writes to GPU memory
Good news:
NVIDIA announced a CUDA 8 feature that enables consistent updates
Suggested fix:
CRC32 integrity check API for error detection
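A CRC32 check of the kind suggested here could look like the following sketch. It only illustrates the idea: the message layout (payload followed by a 4-byte little-endian checksum) and the function names are assumptions, not the GPUrdma API, and it uses Python's standard `zlib.crc32` on the host rather than GPU code.

```python
import struct
import zlib

def seal(payload: bytes) -> bytes:
    """Append a CRC32 of the payload so the receiver can validate it."""
    return payload + struct.pack("<I", zlib.crc32(payload))

def is_intact(message: bytes) -> bool:
    """Return True if the trailing CRC32 matches the payload."""
    if len(message) < 4:
        return False
    payload, (crc,) = message[:-4], struct.unpack("<I", message[-4:])
    return zlib.crc32(payload) == crc

message = seal(b"matrix block 17")
stale = b"matrix block 16" + message[-4:]   # old payload, new checksum
print(is_intact(message), is_intact(stale))
```

A receiver that observes a mismatch could treat the buffer as stale or partially written and read it again.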
Interim results summary
• 5 µsec GPU-to-GPU communication latency
• GPUrdma outperforms CPU RDMA for small packets by a factor of 4.5x
• 50 Gbps transfer bandwidth for messages of 16 KB and larger
GPI2 for GPUs:
GPI - A framework to implement Partitioned Global Address Space (PGAS)
GPI2 - Extends this global address space to GPU memory
[Figure: nodes, each with threads, host global memory, host local memory, GPU global memory, and GPU local memory]
GPI2 code example
CPU Node
    gaspi_segment_create(CPU_MEM)
    Initialize data
    gaspi_write_notify
    gaspi_notify_waitsome
    gaspi_proc_term
GPU Node
    gaspi_segment_create(GPU_MEM)
    gaspi_notify_waitsome
    GPU_Compute_data<<<>>>
    gaspi_write_notify
    gaspi_proc_term
GPI2 using GPUrdma
CPU Node
    gaspi_segment_create(CPU_MEM)
    Initialize data
    gaspi_write_notify
    gaspi_notify_waitsome
    gaspi_proc_term
GPU Node
    gaspi_segment_create(GPU_MEM)
    GPU_start_kernel<<<>>> {
        gpu_gaspi_notify_waitsome
        Compute_data()
        gpu_gaspi_write_notify
    }
    gaspi_proc_term
GPUrdma Ping-Pong
• 10 µsec round-trip latency
• 40.5 Gb/sec throughput
• 52 Gb/sec throughput with overlapping
• 38 Gb/sec for GPI2
CPU Node
    Start timer
    gaspi_write_notify
    gaspi_notify_waitsome
    Stop timer
GPU Node
    gpu_gaspi_notify_waitsome
    gpu_gaspi_write_notify
GPUrdma Multi-Matrix vector product
CPU Node
    Start timer
    gaspi_write_notify
    gaspi_notify_waitsome
    Stop timer
GPU Node
    gpu_notify_waitsome
    Matrix_compute()
    gpu_write_notify
• System throughput in millions of 32x1 vector multiplications per second as a function of the batch size
Batch size [vectors]   GPI2   GPUrdma
480                    2.6    11.7
960                    4.8    18.8
1920                   8.4    25.2
3840                   13.9   29.1
7680                   19.9   30.3
15360                  24.3   31.5
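The relative gain at each batch size follows directly from the throughput table. The sketch below simply divides the GPUrdma column by the GPI2 column; the variable names are illustrative.

```python
# Throughput in millions of 32x1 vector multiplications per second (from the table).
throughput = {
    480:   (2.6, 11.7),
    960:   (4.8, 18.8),
    1920:  (8.4, 25.2),
    3840:  (13.9, 29.1),
    7680:  (19.9, 30.3),
    15360: (24.3, 31.5),
}

speedup = {batch: gpurdma / gpi2 for batch, (gpi2, gpurdma) in throughput.items()}
for batch, ratio in speedup.items():
    print(f"{batch:>6} vectors: GPUrdma is {ratio:.1f}x GPI2")
```

The ratio is largest for the smallest batch (about 4.5x at 480 vectors) and narrows to about 1.3x at 15360 vectors.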
Related works
Lena Oden, Fraunhofer Institute for Industrial Mathematics:
• Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU
• Analyzing Put/Get APIs for Thread-collaborative Processors
Mark Silberstein, Technion – Israel Institute of Technology:
• GPUnet: networking abstractions for GPU programs
• GPUfs: Integrating a file system with GPUs
Future works
• GPUrdma for new accelerators (Tesla K80)
• Integrating GPUrdma with GPUnet
• Integrating GPUrdma with NCCL (library of standard collective communication routines)
New research works using GPUrdma:
• Exploiting advances in GPU I/O abilities for building scalable deep learning training systems
Thanks

More Related Content

What's hot

GTS, Global Trigger and Synchronization system
GTS, Global Trigger and Synchronization systemGTS, Global Trigger and Synchronization system
GTS, Global Trigger and Synchronization system
JoelChavas
 
Dpdk performance
Dpdk performanceDpdk performance
Dpdk performance
Stephen Hemminger
 
Shak larry-jeder-perf-and-tuning-summit14-part1-final
Shak larry-jeder-perf-and-tuning-summit14-part1-finalShak larry-jeder-perf-and-tuning-summit14-part1-final
Shak larry-jeder-perf-and-tuning-summit14-part1-final
Tommy Lee
 
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
Jim St. Leger
 
Debugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDBDebugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDB
bmbouter
 
YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
Brendan Gregg
 
한컴MDS_Virtual Target Debugging with TRACE32
한컴MDS_Virtual Target Debugging with TRACE32한컴MDS_Virtual Target Debugging with TRACE32
한컴MDS_Virtual Target Debugging with TRACE32
HANCOM MDS
 
2020 icldla-updated
2020 icldla-updated2020 icldla-updated
2020 icldla-updated
Shien-Chun Luo
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus SDN/OpenFlow switch
 
customization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLAcustomization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLA
Shien-Chun Luo
 
The n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkThe n00bs guide to ovs dpdk
The n00bs guide to ovs dpdk
markdgray
 
Le guide de dépannage de la jvm
Le guide de dépannage de la jvmLe guide de dépannage de la jvm
Le guide de dépannage de la jvm
Jean-Philippe BEMPEL
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
Jim St. Leger
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructions
Hisaki Ohara
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
Denys Haryachyy
 
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalShak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Tommy Lee
 
Understanding DPDK algorithmics
Understanding DPDK algorithmicsUnderstanding DPDK algorithmics
Understanding DPDK algorithmics
Denys Haryachyy
 
High Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing CommunityHigh Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing Community
6WIND
 
Task migration using CRIU
Task migration using CRIUTask migration using CRIU
Task migration using CRIU
Rohit Jnagal
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
Renaldas Zioma
 

What's hot (20)

GTS, Global Trigger and Synchronization system
GTS, Global Trigger and Synchronization systemGTS, Global Trigger and Synchronization system
GTS, Global Trigger and Synchronization system
 
Dpdk performance
Dpdk performanceDpdk performance
Dpdk performance
 
Shak larry-jeder-perf-and-tuning-summit14-part1-final
Shak larry-jeder-perf-and-tuning-summit14-part1-finalShak larry-jeder-perf-and-tuning-summit14-part1-final
Shak larry-jeder-perf-and-tuning-summit14-part1-final
 
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
DPDK Summit - 08 Sept 2014 - Futurewei - Jun Xu - Revisit the IP Stack in Lin...
 
Debugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDBDebugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDB
 
YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
한컴MDS_Virtual Target Debugging with TRACE32
한컴MDS_Virtual Target Debugging with TRACE32한컴MDS_Virtual Target Debugging with TRACE32
한컴MDS_Virtual Target Debugging with TRACE32
 
2020 icldla-updated
2020 icldla-updated2020 icldla-updated
2020 icldla-updated
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
 
customization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLAcustomization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLA
 
The n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkThe n00bs guide to ovs dpdk
The n00bs guide to ovs dpdk
 
Le guide de dépannage de la jvm
Le guide de dépannage de la jvmLe guide de dépannage de la jvm
Le guide de dépannage de la jvm
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructions
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalShak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-final
 
Understanding DPDK algorithmics
Understanding DPDK algorithmicsUnderstanding DPDK algorithmics
Understanding DPDK algorithmics
 
High Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing CommunityHigh Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing Community
 
Task migration using CRIU
Task migration using CRIUTask migration using CRIU
Task migration using CRIU
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 

Viewers also liked

HERD-Hanjun
HERD-HanjunHERD-Hanjun
HERD-Hanjun
Hanjun Xiao
 
Paper on RDMA enabled Cluster FileSystem at Intel Developer Forum
Paper on RDMA enabled Cluster FileSystem at Intel Developer ForumPaper on RDMA enabled Cluster FileSystem at Intel Developer Forum
Paper on RDMA enabled Cluster FileSystem at Intel Developer Forum
somenathb
 
slides
slidesslides
SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4
UniFabric
 
Persistent memory
Persistent memoryPersistent memory
Persistent memory
Benoit Hudzia
 
Approaching hyperconvergedopenstack
Approaching hyperconvergedopenstackApproaching hyperconvergedopenstack
Approaching hyperconvergedopenstack
Ikuo Kumagai
 
DMA, Infiniband
DMA, InfinibandDMA, Infiniband
DMA, Infiniband
Park Chunduck
 
Ceph on rdma
Ceph on rdmaCeph on rdma
Ceph on rdma
Somnath Roy
 
San disk axel rosenberg
San disk axel rosenbergSan disk axel rosenberg
San disk axel rosenberg
BigDataExpo
 
OVNC 2015-Open Ethernet과 SDN을 통한 Mellanox의 차세대 네트워크 혁신 방안
OVNC 2015-Open Ethernet과 SDN을 통한 Mellanox의 차세대 네트워크 혁신 방안OVNC 2015-Open Ethernet과 SDN을 통한 Mellanox의 차세대 네트워크 혁신 방안
OVNC 2015-Open Ethernet과 SDN을 통한 Mellanox의 차세대 네트워크 혁신 방안
NAIM Networks, Inc.
 
Virtualization Acceleration
Virtualization Acceleration Virtualization Acceleration
Virtualization Acceleration
Mellanox Technologies
 
Function Level Analysis of Linux NVMe Driver
Function Level Analysis of Linux NVMe DriverFunction Level Analysis of Linux NVMe Driver
Function Level Analysis of Linux NVMe Driver
인구 강
 
Mellanox Storage Solutions
Mellanox Storage SolutionsMellanox Storage Solutions
Mellanox Storage Solutions
Mellanox Technologies
 
Intel and DataStax: 3D XPoint and NVME Technology Cassandra Storage Comparison
Intel and DataStax: 3D XPoint and NVME Technology Cassandra Storage ComparisonIntel and DataStax: 3D XPoint and NVME Technology Cassandra Storage Comparison
Intel and DataStax: 3D XPoint and NVME Technology Cassandra Storage Comparison
DataStax Academy
 
NVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in LinuxNVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in Linux
LF Events
 
Herd
HerdHerd
Introduction to NVMe Over Fabrics-V3R
Introduction to NVMe Over Fabrics-V3RIntroduction to NVMe Over Fabrics-V3R
Introduction to NVMe Over Fabrics-V3R
Simon Huang
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
inside-BigData.com
 
Moving to PCI Express based SSD with NVM Express
Moving to PCI Express based SSD with NVM ExpressMoving to PCI Express based SSD with NVM Express
Moving to PCI Express based SSD with NVM Express
Odinot Stanislas
 
NVMf: 5 млн IOPS по сети своими руками / Андрей Николаенко (IBS)
NVMf: 5 млн IOPS по сети своими руками / Андрей Николаенко (IBS)NVMf: 5 млн IOPS по сети своими руками / Андрей Николаенко (IBS)
NVMf: 5 млн IOPS по сети своими руками / Андрей Николаенко (IBS)
Ontico
 

Viewers also liked (20)

HERD-Hanjun
HERD-HanjunHERD-Hanjun
HERD-Hanjun
 
Paper on RDMA enabled Cluster FileSystem at Intel Developer Forum
Paper on RDMA enabled Cluster FileSystem at Intel Developer ForumPaper on RDMA enabled Cluster FileSystem at Intel Developer Forum
Paper on RDMA enabled Cluster FileSystem at Intel Developer Forum
 
slides
slidesslides
slides
 
SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4SOUG_GV_Flashgrid_V4
SOUG_GV_Flashgrid_V4
 
Persistent memory
Persistent memoryPersistent memory
Persistent memory
 
Approaching hyperconvergedopenstack
Approaching hyperconvergedopenstackApproaching hyperconvergedopenstack
Approaching hyperconvergedopenstack
 
DMA, Infiniband
DMA, InfinibandDMA, Infiniband
DMA, Infiniband
 
Ceph on rdma
Ceph on rdmaCeph on rdma
Ceph on rdma
 
San disk axel rosenberg
San disk axel rosenbergSan disk axel rosenberg
San disk axel rosenberg
 
OVNC 2015-Open Ethernet과 SDN을 통한 Mellanox의 차세대 네트워크 혁신 방안
OVNC 2015-Open Ethernet과 SDN을 통한 Mellanox의 차세대 네트워크 혁신 방안OVNC 2015-Open Ethernet과 SDN을 통한 Mellanox의 차세대 네트워크 혁신 방안
OVNC 2015-Open Ethernet과 SDN을 통한 Mellanox의 차세대 네트워크 혁신 방안
 
Virtualization Acceleration
Virtualization Acceleration Virtualization Acceleration
Virtualization Acceleration
 
Function Level Analysis of Linux NVMe Driver
Function Level Analysis of Linux NVMe DriverFunction Level Analysis of Linux NVMe Driver
Function Level Analysis of Linux NVMe Driver
 
Mellanox Storage Solutions
Mellanox Storage SolutionsMellanox Storage Solutions
Mellanox Storage Solutions
 
Intel and DataStax: 3D XPoint and NVME Technology Cassandra Storage Comparison
Intel and DataStax: 3D XPoint and NVME Technology Cassandra Storage ComparisonIntel and DataStax: 3D XPoint and NVME Technology Cassandra Storage Comparison
Intel and DataStax: 3D XPoint and NVME Technology Cassandra Storage Comparison
 
NVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in LinuxNVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in Linux
 
Herd
HerdHerd
Herd
 
Introduction to NVMe Over Fabrics-V3R
Introduction to NVMe Over Fabrics-V3RIntroduction to NVMe Over Fabrics-V3R
Introduction to NVMe Over Fabrics-V3R
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
 
Moving to PCI Express based SSD with NVM Express
Moving to PCI Express based SSD with NVM ExpressMoving to PCI Express based SSD with NVM Express
Moving to PCI Express based SSD with NVM Express
 
NVMf: 5 млн IOPS по сети своими руками / Андрей Николаенко (IBS)
NVMf: 5 млн IOPS по сети своими руками / Андрей Николаенко (IBS)NVMf: 5 млн IOPS по сети своими руками / Андрей Николаенко (IBS)
NVMf: 5 млн IOPS по сети своими руками / Андрей Николаенко (IBS)
 

Similar to GPUrdma - Presentation

Snug 2014 China
Snug 2014 ChinaSnug 2014 China
Snug 2014 China
jaseel_abdulla
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
Kohei KaiGai
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
Ganesan Narayanasamy
 
GPU Compute in Medical and Print Imaging
GPU Compute in Medical and Print ImagingGPU Compute in Medical and Print Imaging
GPU Compute in Medical and Print Imaging
AMD
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
Tigabu Yaya
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
Dilum Bandara
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
John Holden
 
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based NodesBenefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes
inside-BigData.com
 
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
OpenStack Korea Community
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
inside-BigData.com
 
ODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & FeedsODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Workgroup
 
Increasing Cluster Performance by Combining rCUDA with Slurm
Increasing Cluster Performance by Combining rCUDA with SlurmIncreasing Cluster Performance by Combining rCUDA with Slurm
Increasing Cluster Performance by Combining rCUDA with Slurm
inside-BigData.com
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application Performance
Odinot Stanislas
 
IP Address Lookup By Using GPU
IP Address Lookup By Using GPUIP Address Lookup By Using GPU
IP Address Lookup By Using GPU
Jino Antony
 
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdfNVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
MuhammadAbdullah311866
 
Eugene Khvedchenia - Image processing using FPGAs
Eugene Khvedchenia - Image processing using FPGAsEugene Khvedchenia - Image processing using FPGAs
Eugene Khvedchenia - Image processing using FPGAs
Eastern European Computer Vision Conference
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Ganesan Narayanasamy
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
fcassier
 
Measuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneMeasuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data Plane
Open-NFP
 
UWE Linux Boot Camp 2007: Hacking embedded Linux on the cheap
UWE Linux Boot Camp 2007: Hacking embedded Linux on the cheapUWE Linux Boot Camp 2007: Hacking embedded Linux on the cheap
UWE Linux Boot Camp 2007: Hacking embedded Linux on the cheap
edlangley
 

Similar to GPUrdma - Presentation (20)

Snug 2014 China
Snug 2014 ChinaSnug 2014 China
Snug 2014 China
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
 
GPU Compute in Medical and Print Imaging
GPU Compute in Medical and Print ImagingGPU Compute in Medical and Print Imaging
GPU Compute in Medical and Print Imaging
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based NodesBenefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes
 
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
ODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & FeedsODSA Proof of Concept SmartNIC Speeds & Feeds
ODSA Proof of Concept SmartNIC Speeds & Feeds
 
Increasing Cluster Performance by Combining rCUDA with Slurm
Increasing Cluster Performance by Combining rCUDA with SlurmIncreasing Cluster Performance by Combining rCUDA with Slurm
Increasing Cluster Performance by Combining rCUDA with Slurm
 
Using a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application PerformanceUsing a Field Programmable Gate Array to Accelerate Application Performance
Using a Field Programmable Gate Array to Accelerate Application Performance
 
IP Address Lookup By Using GPU
IP Address Lookup By Using GPUIP Address Lookup By Using GPU
IP Address Lookup By Using GPU
 
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdfNVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
 
Eugene Khvedchenia - Image processing using FPGAs
Eugene Khvedchenia - Image processing using FPGAsEugene Khvedchenia - Image processing using FPGAs
Eugene Khvedchenia - Image processing using FPGAs
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Measuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneMeasuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data Plane
 
UWE Linux Boot Camp 2007: Hacking embedded Linux on the cheap
UWE Linux Boot Camp 2007: Hacking embedded Linux on the cheapUWE Linux Boot Camp 2007: Hacking embedded Linux on the cheap
UWE Linux Boot Camp 2007: Hacking embedded Linux on the cheap
 

GPUrdma - Presentation

Editor's Notes

  1. A new streaming multiprocessor architecture called SMX.
  2. No persistent kernels.
  3. Query of random numbers: 1 for even, 0 for odd.
  4. The ability to send sparse data without stopping the kernel. All the threads invoke the RDMA command at the same time.
  5. A queue-based communication standard. The importance of polling the CQ.
  6. A brief description of how InfiniBand networking works. RDMA is the ability to transfer data directly between applications over the network with no operating system involvement.
  7. GPUrdma model
  8. Move data path – GPUDirect RDMA
  9. Modify the HCA libraries to support GPU buffers. Now we are able to write jobs from GPU kernels.
  10. Modify the HCA libraries to support GPU buffers. Now we are able to write jobs from GPU kernels.
  11. Patch the GPU driver to map the doorbell BAR address.
  12. Patch the GPU driver to map the doorbell BAR address.
  13. The NIC exposes its doorbell address register on PCIe.
  14. Same as before but with cudaHostRegister
  15. The kernel represents every physical page in the system with an associated struct page.
  16. Base version: poor performance, low bandwidth.
  17. CPU peak performance at 2K, GPU at 4K. A single QP is not enough.
  18. A single QP is not enough.
  19. For big clusters, we will need a large number of QPs, which will consume most of the GPU memory (16 GB). Each WQE is 64 bytes. QPs are point-to-point.
  20. For big clusters, we will need a large number of QPs, which will consume most of the GPU memory (16 GB). Each WQE is 64 bytes.
  21. For big clusters, we will need a large number of QPs, which will consume most of the GPU memory (16 GB). Each WQE is 64 bytes.
  22. For big clusters, we will need a large number of QPs, which will consume most of the GPU memory (16 GB). Each WQE is 64 bytes.
  23. With in-sequence RDMA writes to GPU memory, data can arrive out of order.
  24. PGAS is a parallel programming model. It assumes a global memory address space that is logically partitioned, with a portion of it local to each process. All nodes have transparent access to the whole global address space.
  25. GPUnet and GPUfs show that working at threadblock granularity is a good choice.
  26. GPUfs: direct access to files from GPU kernels. GPUnet: a socket API for programs running on the GPU.
  27. GPUfs: direct access to files from GPU kernels. GPUnet: a socket API for programs running on the GPU.
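
The PGAS model in note 24 can be illustrated with a small sketch of how a global index might be resolved to the process that owns it. This is not taken from the slides: the block partitioning and the names `Location`, `locate`, and `is_local` are assumptions made for illustration only. A real PGAS runtime may partition the address space differently.

```python
from typing import NamedTuple


class Location(NamedTuple):
    rank: int    # process that owns the element (hypothetical naming)
    offset: int  # index inside that process's local partition


def locate(global_index: int, total_elems: int, num_ranks: int) -> Location:
    """Map a global index to its owner, assuming a simple block partitioning."""
    if not 0 <= global_index < total_elems:
        raise IndexError("global index out of range")
    # Ceiling division: leading ranks hold block_size elements each,
    # trailing ranks may hold fewer (or none).
    block_size = -(-total_elems // num_ranks)
    return Location(global_index // block_size, global_index % block_size)


def is_local(global_index: int, total_elems: int, num_ranks: int, my_rank: int) -> bool:
    """True when the element lives in the calling process's own partition."""
    return locate(global_index, total_elems, num_ranks).rank == my_rank
```

Under this assumed layout, an access for which `is_local` is true is an ordinary local memory access, while any other access would be served by a remote operation (for example an RDMA read or write) to the owning rank at the returned offset.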