SlideShare a Scribd company logo
1 of 76
Improving Networks Worldwide. Copyright 2012 IOL
Introduction to RDMA Programming
Robert D. Russell <rdr@unh.edu>
InterOperability Laboratory &
Computer Science Department
University of New Hampshire
Durham, New Hampshire 03824-3591, USA
2 Copyright 2012 IOL
RDMA – what is it?
A (relatively) new method for interconnecting
platforms in high-speed networks that overcomes
many of the difficulties encountered with
traditional networks such as TCP/IP over
Ethernet.
– new standards
– new protocols
– new hardware interface cards and switches
– new software
3 Copyright 2012 IOL
Remote Direct Memory Access
Remote
–data transfers between nodes in a network
Direct
–no Operating System Kernel involvement in transfers
–everything about a transfer offloaded onto Interface Card
Memory
–transfers between user space application virtual memory
–no extra copying or buffering
Access
–send, receive, read, write, atomic operations
4 Copyright 2012 IOL
RDMA Benefits
High throughput
Low latency
High messaging rate
Low CPU utilization
Low memory bus contention
Message boundaries preserved
Asynchronous operation
5 Copyright 2012 IOL
RDMA Technologies
InfiniBand – (41.8% of top 500 supercomputers)
–SDR 4x – 8 Gbps
–DDR 4x – 16 Gbps
–QDR 4x – 32 Gbps
–FDR 4x – 54 Gbps
iWarp – internet Wide Area RDMA Protocol
–10 Gbps
RoCE – RDMA over Converged Ethernet
–10 Gbps
–40 Gbps
6 Copyright 2012 IOL
RDMA Architecture Layering
User Application
OFA Verbs API
Physical
Data Link
Network
Transport
IWARP “RNIC” RoCE “NIC” InfiniBand “HCA”
RDMAP
DDP
MPA
TCP
IP
IB Transport API
IB Transport
IB Network
Ethernet MAC & LLC IB Link
Ethernet PHY IB PHY
OSI
Layers
CA
7 Copyright 2012 IOL
Software RDMA Drivers
Softiwarp
– www.zurich.ibm.com/sys/rdma
– open source kernel module that implements iWARP
protocols on top of ordinary kernel TCP sockets
– interoperates with hardware iWARP at other end of wire
Soft RoCE
– www.systemfabricworks.com/downloads/roce
– open source IB transport and network layers in software
over ordinary Ethernet
– interoperates with hardware RoCE at other end of wire
8 Copyright 2012 IOL
Verbs
InfiniBand specification written in terms of verbs
– semantic description of required behavior
– no syntactic or operating system specific details
– implementations free to define their own API
• syntax for functions, structures, types, etc.
OpenFabrics Alliance (OFA) Verbs API
– one possible syntactic definition of an API
– in syntax, each “verb” becomes an equivalent “function”
– done to prevent proliferation of incompatible definitions
– was an OFA strategy to unify InfiniBand market
9 Copyright 2012 IOL
OFA Verbs API
Implementations of OFA Verbs for Linux,
FreeBSD, Windows
Software interface for applications
– data structures, function prototypes, etc. that enable
C/C++ programs to access RDMA
User-space and kernel-space variants
– most applications and libraries are in user-space
Client-Server programming model
– some obvious analogies to TCP/IP sockets
– many differences because RDMA differs from TCP/IP
10 Copyright 2012 IOL
Users of OFA Verbs API
Applications
Libraries
File Systems
Storage Systems
Other protocols
11 Copyright 2012 IOL
Libraries that access RDMA
MPI – Message Passing Interface
–Main tool for High Performance Computing (HPC)
–Physics, fluid dynamics, modeling and simulations
–Many versions available
•OpenMPI
•MVAPICH
•Intel MPI
12 Copyright 2012 IOL
Layering with user level libraries
User Application
OFA Verbs API
Physical
Data Link
Network
Transport
IWARP “RNIC” RoCE “NIC” InfiniBand “HCA”
RDMAP
DDP
MPA
TCP
IP
IB Transport API
IB Transport
IB Network
Ethernet MAC & LLC IB Link
Ethernet PHY IB PHY
OSI
Layers
CA
User level libraries, such as MPI
13 Copyright 2012 IOL
Additional ways to access RDMA
File systems
Lustre – parallel distributed file system for Linux
NFS_RDMA – Network File System over RDMA
Storage appliances by DDN and NetApp
SRP – SCSI RDMA (Remote) Protocol – Linux kernel
iSER – iSCSI Extensions for RDMA – Linux kernel
14 Copyright 2012 IOL
Additional ways to access RDMA
Pseudo sockets libraries
SDP – Sockets Direct Protocol – supported by Oracle
rsockets – RDMA Sockets – supported by Intel
mva – Mellanox Messaging Accelerator
SMC-R – proposed by IBM
All these access methods written on top of OFA verbs
15 Copyright 2012 IOL
RDMA Architecture Layering
User Application
OFA Verbs API
Physical
Data Link
Network
Transport
IWARP “RNIC” RoCE “NIC” InfiniBand “HCA”
RDMAP
DDP
MPA
TCP
IP
IB Transport API
IB Transport
IB Network
Ethernet MAC & LLC IB Link
Ethernet PHY IB PHY
OSI
Layers
CA
16 Copyright 2012 IOL
Similarities between TCP and RDMA
Both utilize the client-server model
Both require a connection for reliable transport
Both provide a reliable transport mode
– TCP provides a reliable in-order sequence of bytes
– RDMA provides a reliable in-order sequence of messages
17 Copyright 2012 IOL
How RDMA differs from TCP/IP
“zero copy” – data transferred directly from
virtual memory on one node to virtual memory
on another node
“kernel bypass” – no operating system
involvement during data transfers
asynchronous operation – threads not
blocked during I/O transfers
18 Copyright 2012 IOL
User
App
Kernel
Stack
CA
Wire
TCP/IP setup
client server
setup setup
connect
listen
accept
bind
User
App
Kernel
Stack
CA
Wire
blue lines: control information
red lines: user data
green lines: control and data
19 Copyright 2012 IOL
User
App
Kernel
Stack
CA
Wire
RDMA setup
client server
setup setup
rdma_
connect
rdma_listen
rdma_accept
rdma_bind
User
App
Kernel
Stack
CA
Wire
blue lines: control information
red lines: user data
green lines: control and data
20 Copyright 2012 IOL
User
App
Kernel
Stack
CA
Wire
TCP/IP setup
client server
setup setup
connect
listen
accept
bind
User
App
Kernel
Stack
CA
Wire
blue lines: control information
red lines: user data
green lines: control and data
21 Copyright 2012 IOL
User
App
Kernel
Stack
CA
Wire
TCP/IP transfer
client server
setup setup
connect
listen
accept
bind
User
App
Kernel
Stack
CA
Wire
blue lines: control information
data
send
data
recv
red lines: user data
green lines: control and data
transfer transfer
data
copy
data
copy
22 Copyright 2012 IOL
User
App
Kernel
Stack
CA
Wire
RDMA transfer
client server
setup setup
rdma_
connect
rdma_listen
rdma_accept
rdma_bind
User
App
Kernel
Stack
CA
Wire
blue lines: control information
data
rdma_
post_
send
data rdma_
post_
recv
red lines: user data
green lines: control and data
transfer transfer
23 Copyright 2012 IOL
“Normal” TCP/IP socket access model
Byte streams – requires application to delimit /
recover message boundaries
Synchronous – blocks until data is sent/received
–O_NONBLOCK, MSG_DONTWAIT are not asynchronous,
are “try” and “try again”
 send() and recv() are paired
–both sides must participate in the transfer
 Requires data copy into system buffers
–order and timing of send() and recv() are irrelevant
–user memory accessible immediately before and
immediately after each send() and recv() call
24 Copyright 2012 IOL
virtual
memory
allocate
add to tables
sleep
wakeup
access
TCP
buffers
metadata
control
copy
data packets
ACKs
TCP RECV()
blocked
status
recv()
USER OPERATING SYSTEM NIC WIRE
25 Copyright 2012 IOL
allocate
access
metadata
control
data packets
ACK
RDMA RECV()
status
recv()
USER CHANNEL ADAPTER WIRE
register
poll_cq()
recv queue
completion queue
. . .
. . .
virtual
memory
parallel
activity
26 Copyright 2012 IOL
RDMA access model
Messages – preserves user's message boundaries
Asynchronous – no blocking during a transfer, which
–starts when metadata added to work queue
–finishes when status available in completion queue
 1-sided (unpaired) and 2-sided (paired) transfers
 No data copying into system buffers
–order and timing of send() and recv() are relevant
•recv() must be waiting before issuing send()
–memory involved in transfer is untouchable between start and
completion of transfer
27 Copyright 2012 IOL
Asynchronous Data Transfer
Posting
– term used to mark the initiation of a data transfer
– done by adding a work request to a work queue
Completion
– term used to mark the end of a data transfer
– done by removing a work completion from completion
queue
Important note:
– between posting and completion the state of user
memory involved in the transfer is undefined and
should NOT be changed by the user program
28 Copyright 2012 IOL
Posting – Completion
29 Copyright 2012 IOL
Kernel Bypass
User interacts directly with CA queues
Queue Pair from program to CA
– work request – data structure describing data transfer
– send queue – post work requests to CA that send data
– secv queue – post work requests to CA that receive data
Completion queues from CA to program
– work completion – data structure describing transfer status
– Can have separate send and receive completion queues
– Can have one queue for both send and receive completions
30 Copyright 2012 IOL
allocate
access
metadata
control
data packets
ACK
RDMA recv and completion queues
status
rdma_post_recv()
USER CHANNEL ADAPTER WIRE
register
ibv_poll_cq()
recv queue
completion queue
. . .
. . .
virtual
memory
parallel
activity
31 Copyright 2012 IOL
RDMA memory must be registered
To “pin” it into physical memory
–so it can not be paged in/out during transfer
–so CA can obtain physical to virtual mapping
•CA, not OS, does mapping during a transfer
•CA, not OS, checks validity of the transfer
 To create “keys” linking memory, process, and CA
–supplied by user as part of every transfer
–allows user to control access rights of a transfer
–allows CA to find correct mapping in a transfer
–allows CA to verify access rights in a transfer
32 Copyright 2012 IOL
RDMA transfer types
SEND/RECV – similar to “normal” TCP sockets
– each send on one side must match a recv on other side
WRITE – only in RDMA
– “pushes” data into remote virtual memory
READ – only in RDMA
– “pulls” data out of remote virtual memory
Atomics – only in InfiniBand and RoCE
– updates cell in remote virtual memory
Same verbs and data structures used by all
33 Copyright 2012 IOL
user
registered
virtual
memory
user
registered
virtual
memory
rdma_post_send() rdma_post_recv()
sender receiver
RDMA SEND/RECV data transfer
34 Copyright 2012 IOL
SEND/RECV similarities with sockets
Sender must issue listen() before client issues
connect()
Both sender and receiver must actively
participate in all data transfers
– sender must issue send() operations
– receiver must issue recv() operations
Sender does not know remote receiver's virtual
memory location
Receiver does not know remote sender's virtual
memory location
35 Copyright 2012 IOL
SEND/RECV differences with sockets
“normal” TCP/IP sockets are buffered
– time order of send() and recv() on each side is irrelevant
RDMA sockets are not buffered
– recv() must be posted by receiver before send() can be
posted by sender
– not doing this results in a fatal error
“normal” TCP/IP sockets have no notion of
“memory registration”
RDMA sockets require that memory
participating be “registered”
36 Copyright 2012 IOL
ping-pong using RDMA SEND/RECV
client server agent
loop
registered
memory
registered
memory
registered
memory
loop
ping data
pong data
37 Copyright 2012 IOL
3 phases in using reliable connections
Setup Phase
– obtain and convert addressing information
– create and configure local endpoints for communication
– setup local memory to be used in transfer
– establish the connection with the remote side
Use Phase
– actually transfer data to/from the remote side
Break-down Phase
– basically “undo” the setup phase
– close connection, free memory and communication
resources
38 Copyright 2012 IOL
Client setup phase
TCP RDMA
1. process command-line options process command-line options
2. convert DNS name and port no.
getaddrinfo()
convert DNS name and port no.
rdma_getaddrinfo()
3. define properties of new queue pair
struct ibv_qp_init_attr
4. create local end point
socket()
create local end point
rdma_create_ep()
5. allocate user virtual memory
malloc()
allocate user virtual memory
malloc()
6.
7. define properties of new connection
struct rdma_conn_param
8. create connection with server
connect()
create connection with server
rdma_connect()
register user virtual memory with CA
rdma_reg_msgs()
39 Copyright 2012 IOL
Client use phase
TCP RDMA
9. mark start time for statistics mark start time for statistics
10. start of transfer loop start of transfer loop
11. post receive to catch agent's pong data
rdma_post_recv()
12. transfer ping data to agent
send()
post send to start transfer of ping data to agent
rdma_post_send()
14. receive pong data from agent
recv()
wait for receive to complete
ibv_poll_cq()
15. optionally verify pong data is ok
memcmp()
optionally verify pong data is ok
memcmp()
16. end of transfer loop end of transfer loop
17. mark stop time and print statistics mark stop time and print statistics
wait for send to complete
ibv_poll_cq()
13.
40 Copyright 2012 IOL
Client breakdown phase
TCP RDMA
18. break connection with server
close()
break connection with server
rdma_disconnect()
19. deregister user virtual memory
rdma_dereg_mr()
20. free user virtual memory
free()
free user virtual memory
free()
21. destroy local end point
rdma_destroy_ep()
22. free getaddrinfo resources
freeaddrinfo()
free rdma_getaddrinfo resources
rdma_freeaddrinfo()
23. “unprocess” command-line options “unprocess” command-line options
41 Copyright 2012 IOL
Server participants
Listener
– waits for connection requests from client
– gets new system-provided connection to client
– hands-off new connection to agent
– never transfers any data to/from client
Agent
– creates control structures to deal with one client
– allocates memory to deal with one client
– performs all data transfers with one client
– disconnects from client when transfers all finished
42 Copyright 2012 IOL
Listener setup and use phases
TCP RDMA
1. process command-line options process command-line options
2. convert DNS name and port no.
getaddrinfo()
convert DNS name and port no.
rdma_getaddrinfo()
3. create local end point
socket()
define properties of new queue pair
struct ibv_qp_init_attr
4. bind to address and port
bind()
5. establish socket as listener
listen()
establish socket as listener
rdma_listen()
6. start loop start loop
7. get connection request from client
accept()
get connection request from client
rdma_get_request()
8. hand connection over to agent hand connection over to agent
9. end loop end loop
create and bind local end point
rdma_create_ep()
43 Copyright 2012 IOL
Listen breakdown phase
TCP RDMA
10. destroy local endpoint
close()
destroy local endpoint
rdma_destroy_ep()
11. free getaddrinfo resources
freegetaddrinfo()
free getaddrinfo resources
rdma_freegetaddrinfo()
12. “unprocess” command-line options “unprocess” command-line options
44 Copyright 2012 IOL
Agent setup phase
TCP RDMA
make copy of listener's options
and new cm_id for client
2. allocate user virtual memory
malloc()
allocate user virtual memory
malloc()
3. register user virtual memory with CA
rdma_reg_msgs()
finalize connection with client
rdma_accept()
1. make copy of listener's options
4. post first receive of ping data from client
rdma_post_recv()
6.
5. define properties of new connection
struct rdma_conn_param
45 Copyright 2012 IOL
Agent use phase
TCP RDMA
7. start of transfer loop start of transfer loop
wait to receive ping data from client
ibv_poll_cq()
9. if first time through loop
mark start time for statistics
11. transfer pong data to client
send()
post send to start transfer of pong data to client
rdma_post_send()
12. wait for send to complete
ibv_poll_cq()
13. end of transfer loop end of transfer loop
14. mark stop time and print statistics mark stop time and print statistics
10. post next receive for ping data from client
rdma_post_recv()
If first time through loop
mark start time for statistics
8. wait to receive ping data from client
recv()
46 Copyright 2012 IOL
Agent breakdown phase
TCP RDMA
15. break connection with client
close()
break connection with client
rdma_disconnect()
16. deregister user virtual memory
rdma_dereg_mr()
17. free user virtual memory
free()
free user virtual memory
free()
18. free copy of listener's options free copy of listener's options
19. destroy local end point
rdma_destroy_ep()
47 Copyright 2012 IOL
ping-pong measurements
Client
– round-trip-time 15.7 microseconds
– user CPU time 100% of elapsed time
– kernel CPU time 0% of elapsed time
Server
– round-trip time 15.7 microseconds
– user CPU time 100% of elapsed time
– kernel CPU time 0% of elapsed time
InfiniBand QDR 4x through a switch
48 Copyright 2012 IOL
How to reduce 100% CPU usage
Cause is “busy polling” to wait for completions
– in tight loop on ibv_poll_cq()
– burns CPU since most calls find nothing
Why is “busy polling” used at all?
– simple to write such a loop
– gives very fast response to a completion
– (i.e., gives low latency)
49 Copyright 2012 IOL
”busy polling” to get completions
1. start loop
2. ibv_poll_cq() to get any completion in queue
3. exit loop if a completion is found
4. end loop
50 Copyright 2012 IOL
How to eliminate “busy polling”
Cannot make ibv_poll_cq() block
– no flag parameter
– no timeout parameter
Must replace busy loop with “wait – wakeup”
Solution is a “wait-for-event” mechanism
–ibv_req_notify_cq() - tell CA to send an
“event” when next WC enters CQ
–ibv_get_cq_event() - blocks until gets “event”
–ibv_ack_cq_event() - acknowledges “event”
51 Copyright 2012 IOL
allocate
access
metadata
control
data packets
ACK
OFA verbs API for recv wait-wakeup
status
ibv_post_recv()
USER CHANNEL ADAPTER WIRE
register
ibv_poll_cq()
recv queue
completion queue
. . .
. . .
virtual
memory
ibv_ack_cq_events() control
ibv_req_notify_cq()
ibv_get_cq_event()
wakeup
control
wait
blocked
parallel
activity
52 Copyright 2012 IOL
”wait-for-event” to get completions
1. start loop
2. ibv_poll_cq() to get any completion in CQ
3. exit loop if a completion is found
4. ibv_req_notify_cq() to arm CA to send event on
next completion added to CQ
5. ibv_poll_cq() to get new completion between 2&4
6. exit loop if a completion is found
7. ibv_get_cq_event() to wait until CA sends event
8. ibv_ack_cq_events() to acknowledge event
9. end loop
53 Copyright 2012 IOL
ping-pong measurements with wait
Client
– round-trip-time 21.1 microseconds – up 34%
– user CPU time 9.0% of elapsed time
– kernel CPU time 9.1% of elapsed time
– total CPU time 18% of elapsed time – down 82%
Server
– round-trip time 21.1 microseconds – up 34%
– user CPU time 14.5% of elapsed time
– kernel CPU time 6.5% of elapsed time
– total CPU time 21% of elapsed time – down 79%
54 Copyright 2012 IOL
rdma_xxxx “wrappers” around ibv_xxxx
rdma_get_recv_comp() - wrapper for wait-wakeup
loop on receive completion queue
rdma_get_send_comp() - wrapper for wait-wakeup
loop on send completion queue
rdma_post_recv() - wrapper for ibv_post_recv()
rdma_post_send() - wrapper for ibv_post_send()
rdma_reg_msgs() - wrapper for ibv_reg_mr for
SEND/RECV
rdma_dereg_mr() - wrapper for ibv_dereg_mr()
55 Copyright 2012 IOL
where to find “wrappers”, prototypes,
data structures, etc.
/usr/include/rdma/rdma_verbs.h
–contains rdma_xxxx “wrappers”
/usr/include/infiniband/verbs.h
–contains ibv_xxxx verbs and all ibv data
structures, etc.
/usr/include/rdma/rdma_cm.h
–contains rdma_yyyy verbs and all rdma data
structures, etc. for connection management
56 Copyright 2012 IOL
Transfer choices
TCP/UDP transfer operations
–send()/recv() (and related forms)
RDMA transfer operations
–SEND/RECV similar to TCP/UDP
–RDMA WRITE push to remote virtual memory
–RDMA READ pull from remote virtual memory
–RDMA WRITE_WITH_IMM push with passive side notification
57 Copyright 2012 IOL
RDMA WRITE operation
Very different concept from normal TCP/IP
Very different concept from RDMA SEND/RECV
Only one side is active, other is passive
Active side (requester) issues RDMA WRITE
Passive side (responder) does NOTHING!
A better name would be “RDMA PUSH”
– data is “pushed” from active side's virtual memory into
passive sides' virtual memory
– passive side issues no operation, uses no CPU cycles,
gets no indication “push” started or completed
58 Copyright 2012 IOL
user
registered
virtual
memory
user
registered
virtual
memory
rdma_post_write()
active passive
RDMA WRITE data transfer
59 Copyright 2012 IOL
Differences with RDMA SEND
Active side calls rdma_post_write()
– opcode is RDMA_WRITE, not SEND
– work request MUST include passive side's virtual memory
address and memory registration key
Prior to issuing this operation, active side MUST
obtain passive side's address and key
– use send/recv to transfer this “metadata”
– (could actually use any means to transfer “metadata”)
Passive side provides “metadata” that enables
the data “push”, but does not participate in it
60 Copyright 2012 IOL
Similarities with RDMA SEND
Both transfer types move messages, not streams
Both transfer types are unbuffered
Both transfer types require registered virtual
memory on both sides of the transfer
Both transfer types operate asynchronously
– active side posts work request to send queue
– active side gets work completion from completion queue
Both transfer types use same verbs and data
structures (although values and fields differ)
61 Copyright 2012 IOL
RDMA READ operation
Very different from normal TCP/IP
Very different from RDMA SEND/RECV
Only one side is active, other is passive
Active side (requester) issues RDMA READ
Passive side (responder) does NOTHING!
A better name would be “RDMA PULL”
– data is “pulled” into active side's virtual memory from
passive sides' virtual memory
– passive side issues no operation, uses no CPU cycles,
gets no indication “pull” started or completed
62 Copyright 2012 IOL
user
registered
virtual
memory
user
registered
virtual
memory
rdma_post_read()
active passive
RDMA READ data transfer
63 Copyright 2012 IOL
Ping-pong with RDMA WRITE/READ
Client is active side in ping-pong loop
– client posts RDMA WRITE from ping buffer
– client posts RDMA READ into pong buffer
Server agent is passive side in ping-pong loop
– does nothing
Server agent must send its buffer's address and
registration key to client before loop
Client must send total number of transfers to
agent after the loop
– otherwise agent has no way of knowing this number
– agent needs to receive something to tell it “loop finished”
64 Copyright 2012 IOL
ping-pong using RDMA WRITE/READ
client server agent
loop
registered
memory
registered
memory
registered
memory
loop
ping data
pong data
65 Copyright 2012 IOL
Client ping-pong transfer loop
start of transfer loop
rdma_post_write() of RDMA WRITE ping data
ibv_poll_cq() to wait for RDMA WRITE completion
rdma_post_read() of RDMA READ pong data
ibv_poll_cq() to wait for RDMA READ completion
optionally verify pong data equals ping data
end of transfer loop
66 Copyright 2012 IOL
Agent ping-pong transfer loop
ibv_post_recv() to catch client's “finished”
message
ibv_poll_cq() to wait for “finished” from client
67 Copyright 2012 IOL
ping-pong RDMA WRITE/READ
measurements with wait
Client
– round-trip-time 14.3 microseconds
– user CPU time 26.4% of elapsed time
– kernel CPU time 3.0% of elapsed time
– total CPU time 29.4% of elapsed time
Server
– round-trip time 14.3 microseconds
– user CPU time 0% of elapsed time
– kernel CPU time 0% of elapsed time
– total CPU time 0% of elapsed time
68 Copyright 2012 IOL
Improving performance further
All postings discussed so far generate completions
– required for all rdma_post_recv() postings
– optional for all other prdma_post_xxxx() postings
User controls completion generation with
IBV_SEND_SIGNALED flag in rdma_post_write()
– supplying this flag always generates a completion for that
posting
– Not setting this flag generates a completion for that posting
only in case of an error
69 Copyright 2012 IOL
How client can benefit from this feature
RDMA READ posting follows RDMA WRITE
RDMA READ must finish after RDMA WRITE
– due to strict ordering rules in standards
Therefore we don't need to do anything with
RDMA WRITE completion
– completion of RDMA READ guarantees RDMA WRITE
transfer succeeded
– error on RDMA WRITE transfer will generate a completion
Therefore we can send RDMA WRITE
unsignaled and NOT wait for its completion
70 Copyright 2012 IOL
ping-pong using unsignaled WRITE
client server agent
loop loop
ping data
pong data
registered
memory
registered
memory
registered
memory
71 Copyright 2012 IOL
Client unsignaled transfer loop
start of transfer loop
rdma_post_write() of unsignaled RDMA WRITE
–generates no completion (except on error)
do not wait for RDMA WRITE completion
rdma_post_read() of signaled RDMA READ
ibv_poll_cq() to wait for RDMA READ completion
optionally verify pong data equals ping data
end of transfer loop
72 Copyright 2012 IOL
ping-pong RDMA WRITE/READ
measurements with unsignaled wait
Client
– round-trip-time 8.3 microseconds – down 42%
– user CPU time 28.0% of elapsed time – up 6.1%
– kernel CPU time 2.8% of elapsed time – down 6.7%
– total CPU time 30.8% of elapsed time – up 4.8%
Server
– round-trip time 8.3 microseconds – down 42%
– user CPU time 0% of elapsed time
– kernel CPU time 0% of elapsed time
– total CPU time 0% of elapsed time
73 Copyright 2012 IOL
Ping-pong performance summary
Rankings for Round-Trip Time (RTT)
8.3 usec unsignaled RDMA_WRITE/READ with wait
14.3 usec signaled RDMA_WRITE/READ with wait
15.7 usec signaled SEND/RECV with busy polling
21.1 used signaled SEND/RECV with wait
Rankings for client CPU usage
18.0% signaled SEND/RECV with wait
29.4% signaled RDMA_WRITE/READ with wait
30.8% unsignaled RDMA_WRITE/READ with wait
100% signaled SEND/RECV with busy polling
74 Copyright 2012 IOL
QUESTIONS?
75 Copyright 2012 IOL
Acknowledgments
This material is based upon work supported by
the National Science Foundation under Grant
No. OCI-1127228.
Any opinions, findings, and conclusions or
recommendations expressed in this material are
those of the author and do not necessarily reflect
the views of the National Science Foundation.
76 Copyright 2012 IOL
THANK YOU!

More Related Content

Similar to rdma-intro-module.ppt

The TCP/IP and OSI models
The TCP/IP and OSI modelsThe TCP/IP and OSI models
The TCP/IP and OSI models
Jake Weaver
 

Similar to rdma-intro-module.ppt (20)

pppppppppppppppppjjjjjjjjjjjpppppppp.pptx
pppppppppppppppppjjjjjjjjjjjpppppppp.pptxpppppppppppppppppjjjjjjjjjjjpppppppp.pptx
pppppppppppppppppjjjjjjjjjjjpppppppp.pptx
 
Protocol and Integration Challenges for SDN
Protocol and Integration Challenges for SDNProtocol and Integration Challenges for SDN
Protocol and Integration Challenges for SDN
 
OSI_TCPIP_layers.pptx
OSI_TCPIP_layers.pptxOSI_TCPIP_layers.pptx
OSI_TCPIP_layers.pptx
 
OSI Layering
OSI Layering OSI Layering
OSI Layering
 
Communication Mechanisms, Past, Present & Future
Communication Mechanisms, Past, Present & FutureCommunication Mechanisms, Past, Present & Future
Communication Mechanisms, Past, Present & Future
 
Introduction to OSI and QUIC
Introduction to OSI and QUICIntroduction to OSI and QUIC
Introduction to OSI and QUIC
 
Bhargava Presentation.ppt
Bhargava Presentation.pptBhargava Presentation.ppt
Bhargava Presentation.ppt
 
Bhargava Presentation.ppt
Bhargava Presentation.pptBhargava Presentation.ppt
Bhargava Presentation.ppt
 
Scada protocols-and-communications-trends
Scada protocols-and-communications-trendsScada protocols-and-communications-trends
Scada protocols-and-communications-trends
 
Networking Models
Networking ModelsNetworking Models
Networking Models
 
Ead pertemuan-7
Ead pertemuan-7Ead pertemuan-7
Ead pertemuan-7
 
chapter 4.pptx
chapter 4.pptxchapter 4.pptx
chapter 4.pptx
 
SDN, OpenFlow, NFV, and Virtual Network
SDN, OpenFlow, NFV, and Virtual NetworkSDN, OpenFlow, NFV, and Virtual Network
SDN, OpenFlow, NFV, and Virtual Network
 
1. RINA motivation - TF Workshop
1. RINA motivation - TF Workshop1. RINA motivation - TF Workshop
1. RINA motivation - TF Workshop
 
Open Networking through Programmability
Open Networking through ProgrammabilityOpen Networking through Programmability
Open Networking through Programmability
 
OSI model.pptx
OSI model.pptxOSI model.pptx
OSI model.pptx
 
Tcp/Ip Model
Tcp/Ip ModelTcp/Ip Model
Tcp/Ip Model
 
The TCP/IP and OSI models
The TCP/IP and OSI modelsThe TCP/IP and OSI models
The TCP/IP and OSI models
 
computer network and chapter 7 OSI layers.pptx
computer network and chapter 7 OSI layers.pptxcomputer network and chapter 7 OSI layers.pptx
computer network and chapter 7 OSI layers.pptx
 
Cloud Presentation.pdf
Cloud Presentation.pdfCloud Presentation.pdf
Cloud Presentation.pdf
 

More from KalimuthuVelappan (8)

lesson24.ppt
lesson24.pptlesson24.ppt
lesson24.ppt
 
kerch04.ppt
kerch04.pptkerch04.ppt
kerch04.ppt
 
Netlink-Optimization.pptx
Netlink-Optimization.pptxNetlink-Optimization.pptx
Netlink-Optimization.pptx
 
memory_mapping.ppt
memory_mapping.pptmemory_mapping.ppt
memory_mapping.ppt
 
DPKG caching framework-latest .pptx
DPKG caching framework-latest .pptxDPKG caching framework-latest .pptx
DPKG caching framework-latest .pptx
 
memory.ppt
memory.pptmemory.ppt
memory.ppt
 
stack.pptx
stack.pptxstack.pptx
stack.pptx
 
lesson05.ppt
lesson05.pptlesson05.ppt
lesson05.ppt
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

rdma-intro-module.ppt

  • 1. Improving Networks Worldwide. Copyright 2012 IOL Introduction to RDMA Programming Robert D. Russell <rdr@unh.edu> InterOperability Laboratory & Computer Science Department University of New Hampshire Durham, New Hampshire 03824-3591, USA
  • 2. 2 Copyright 2012 IOL RDMA – what is it? A (relatively) new method for interconnecting platforms in high-speed networks that overcomes many of the difficulties encountered with traditional networks such as TCP/IP over Ethernet. – new standards – new protocols – new hardware interface cards and switches – new software
  • 3. 3 Copyright 2012 IOL Remote Direct Memory Access Remote –data transfers between nodes in a network Direct –no Operating System Kernel involvement in transfers –everything about a transfer offloaded onto Interface Card Memory –transfers between user space application virtual memory –no extra copying or buffering Access –send, receive, read, write, atomic operations
  • 4. 4 Copyright 2012 IOL RDMA Benefits High throughput Low latency High messaging rate Low CPU utilization Low memory bus contention Message boundaries preserved Asynchronous operation
  • 5. 5 Copyright 2012 IOL RDMA Technologies InfiniBand – (41.8% of top 500 supercomputers) –SDR 4x – 8 Gbps –DDR 4x – 16 Gbps –QDR 4x – 32 Gbps –FDR 4x – 54 Gbps iWarp – internet Wide Area RDMA Protocol –10 Gbps RoCE – RDMA over Converged Ethernet –10 Gbps –40 Gbps
  • 6. 6 Copyright 2012 IOL RDMA Architecture Layering User Application OFA Verbs API Physical Data Link Network Transport IWARP “RNIC” RoCE “NIC” InfiniBand “HCA” RDMAP DDP MPA TCP IP IB Transport API IB Transport IB Network Ethernet MAC & LLC IB Link Ethernet PHY IB PHY OSI Layers CA
  • 7. 7 Copyright 2012 IOL Software RDMA Drivers Softiwarp – www.zurich.ibm.com/sys/rdma – open source kernel module that implements iWARP protocols on top of ordinary kernel TCP sockets – interoperates with hardware iWARP at other end of wire Soft RoCE – www.systemfabricworks.com/downloads/roce – open source IB transport and network layers in software over ordinary Ethernet – interoperates with hardware RoCE at other end of wire
  • 8. 8 Copyright 2012 IOL Verbs InfiniBand specification written in terms of verbs – semantic description of required behavior – no syntactic or operating system specific details – implementations free to define their own API • syntax for functions, structures, types, etc. OpenFabrics Alliance (OFA) Verbs API – one possible syntactic definition of an API – in syntax, each “verb” becomes an equivalent “function” – done to prevent proliferation of incompatible definitions – was an OFA strategy to unify InfiniBand market
  • 9. 9 Copyright 2012 IOL OFA Verbs API Implementations of OFA Verbs for Linux, FreeBSD, Windows Software interface for applications – data structures, function prototypes, etc. that enable C/C++ programs to access RDMA User-space and kernel-space variants – most applications and libraries are in user-space Client-Server programming model – some obvious analogies to TCP/IP sockets – many differences because RDMA differs from TCP/IP
  • 10. 10 Copyright 2012 IOL Users of OFA Verbs API Applications Libraries File Systems Storage Systems Other protocols
  • 11. 11 Copyright 2012 IOL Libraries that access RDMA MPI – Message Passing Interface –Main tool for High Performance Computing (HPC) –Physics, fluid dynamics, modeling and simulations –Many versions available •OpenMPI •MVAPICH •Intel MPI
  • 12. 12 Copyright 2012 IOL Layering with user level libraries User Application OFA Verbs API Physical Data Link Network Transport IWARP “RNIC” RoCE “NIC” InfiniBand “HCA” RDMAP DDP MPA TCP IP IB Transport API IB Transport IB Network Ethernet MAC & LLC IB Link Ethernet PHY IB PHY OSI Layers CA User level libraries, such as MPI
  • 13. 13 Copyright 2012 IOL Additional ways to access RDMA File systems Lustre – parallel distributed file system for Linux NFS_RDMA – Network File System over RDMA Storage appliances by DDN and NetApp SRP – SCSI RDMA (Remote) Protocol – Linux kernel iSER – iSCSI Extensions for RDMA – Linux kernel
  • 14. 14 Copyright 2012 IOL Additional ways to access RDMA Pseudo sockets libraries SDP – Sockets Direct Protocol – supported by Oracle rsockets – RDMA Sockets – supported by Intel mva – Mellanox Messaging Accelerator SMC-R – proposed by IBM All these access methods written on top of OFA verbs
  • 15. 15 Copyright 2012 IOL RDMA Architecture Layering User Application OFA Verbs API Physical Data Link Network Transport IWARP “RNIC” RoCE “NIC” InfiniBand “HCA” RDMAP DDP MPA TCP IP IB Transport API IB Transport IB Network Ethernet MAC & LLC IB Link Ethernet PHY IB PHY OSI Layers CA
  • 16. 16 Copyright 2012 IOL Similarities between TCP and RDMA Both utilize the client-server model Both require a connection for reliable transport Both provide a reliable transport mode – TCP provides a reliable in-order sequence of bytes – RDMA provides a reliable in-order sequence of messages
  • 17. 17 Copyright 2012 IOL How RDMA differs from TCP/IP “zero copy” – data transferred directly from virtual memory on one node to virtual memory on another node “kernel bypass” – no operating system involvement during data transfers asynchronous operation – threads not blocked during I/O transfers
  • 18. 18 Copyright 2012 IOL User App Kernel Stack CA Wire TCP/IP setup client server setup setup connect listen accept bind User App Kernel Stack CA Wire blue lines: control information red lines: user data green lines: control and data
  • 19. 19 Copyright 2012 IOL User App Kernel Stack CA Wire RDMA setup client server setup setup rdma_ connect rdma_listen rdma_accept rdma_bind User App Kernel Stack CA Wire blue lines: control information red lines: user data green lines: control and data
  • 20. 20 Copyright 2012 IOL User App Kernel Stack CA Wire TCP/IP setup client server setup setup connect listen accept bind User App Kernel Stack CA Wire blue lines: control information red lines: user data green lines: control and data
  • 21. 21 Copyright 2012 IOL User App Kernel Stack CA Wire TCP/IP transfer client server setup setup connect listen accept bind User App Kernel Stack CA Wire blue lines: control information data send data recv red lines: user data green lines: control and data transfer transfer data copy data copy
  • 22. 22 Copyright 2012 IOL User App Kernel Stack CA Wire RDMA transfer client server setup setup rdma_ connect rdma_listen rdma_accept rdma_bind User App Kernel Stack CA Wire blue lines: control information data rdma_ post_ send data rdma_ post_ recv red lines: user data green lines: control and data transfer transfer
  • 23. 23 Copyright 2012 IOL “Normal” TCP/IP socket access model Byte streams – requires application to delimit / recover message boundaries Synchronous – blocks until data is sent/received –O_NONBLOCK, MSG_DONTWAIT are not asynchronous, are “try” and “try again”  send() and recv() are paired –both sides must participate in the transfer  Requires data copy into system buffers –order and timing of send() and recv() are irrelevant –user memory accessible immediately before and immediately after each send() and recv() call
  • 24. 24 Copyright 2012 IOL virtual memory allocate add to tables sleep wakeup access TCP buffers metadata control copy data packets ACKs TCP RECV() blocked status recv() USER OPERATING SYSTEM NIC WIRE
  • 25. 25 Copyright 2012 IOL allocate access metadata control data packets ACK RDMA RECV() status recv() USER CHANNEL ADAPTER WIRE register poll_cq() recv queue completion queue . . . . . . virtual memory parallel activity
  • 26. 26 Copyright 2012 IOL RDMA access model Messages – preserves user's message boundaries Asynchronous – no blocking during a transfer, which –starts when metadata added to work queue –finishes when status available in completion queue  1-sided (unpaired) and 2-sided (paired) transfers  No data copying into system buffers –order and timing of send() and recv() are relevant •recv() must be waiting before issuing send() –memory involved in transfer is untouchable between start and completion of transfer
  • 27. 27 Copyright 2012 IOL Asynchronous Data Transfer Posting – term used to mark the initiation of a data transfer – done by adding a work request to a work queue Completion – term used to mark the end of a data transfer – done by removing a work completion from completion queue Important note: – between posting and completion the state of user memory involved in the transfer is undefined and should NOT be changed by the user program
  • 28. 28 Copyright 2012 IOL Posting – Completion
  • 29. 29 Copyright 2012 IOL Kernel Bypass User interacts directly with CA queues Queue Pair from program to CA – work request – data structure describing data transfer – send queue – post work requests to CA that send data – secv queue – post work requests to CA that receive data Completion queues from CA to program – work completion – data structure describing transfer status – Can have separate send and receive completion queues – Can have one queue for both send and receive completions
  • 30. 30 Copyright 2012 IOL allocate access metadata control data packets ACK RDMA recv and completion queues status rdma_post_recv() USER CHANNEL ADAPTER WIRE register ibv_poll_cq() recv queue completion queue . . . . . . virtual memory parallel activity
  • 31. 31 Copyright 2012 IOL RDMA memory must be registered To “pin” it into physical memory –so it can not be paged in/out during transfer –so CA can obtain physical to virtual mapping •CA, not OS, does mapping during a transfer •CA, not OS, checks validity of the transfer  To create “keys” linking memory, process, and CA –supplied by user as part of every transfer –allows user to control access rights of a transfer –allows CA to find correct mapping in a transfer –allows CA to verify access rights in a transfer
  • 32. 32 Copyright 2012 IOL RDMA transfer types SEND/RECV – similar to “normal” TCP sockets – each send on one side must match a recv on other side WRITE – only in RDMA – “pushes” data into remote virtual memory READ – only in RDMA – “pulls” data out of remote virtual memory Atomics – only in InfiniBand and RoCE – updates cell in remote virtual memory Same verbs and data structures used by all
  • 33. 33 Copyright 2012 IOL user registered virtual memory user registered virtual memory rdma_post_send() rdma_post_recv() sender receiver RDMA SEND/RECV data transfer
  • 34. 34 Copyright 2012 IOL SEND/RECV similarities with sockets Sender must issue listen() before client issues connect() Both sender and receiver must actively participate in all data transfers – sender must issue send() operations – receiver must issue recv() operations Sender does not know remote receiver's virtual memory location Receiver does not know remote sender's virtual memory location
  • 35. 35 Copyright 2012 IOL SEND/RECV differences with sockets “normal” TCP/IP sockets are buffered – time order of send() and recv() on each side is irrelevant RDMA sockets are not buffered – recv() must be posted by receiver before send() can be posted by sender – not doing this results in a fatal error “normal” TCP/IP sockets have no notion of “memory registration” RDMA sockets require that memory participating be “registered”
  • 36. 36 Copyright 2012 IOL ping-pong using RDMA SEND/RECV client server agent loop registered memory registered memory registered memory loop ping data pong data
  • 37. 37 Copyright 2012 IOL 3 phases in using reliable connections Setup Phase – obtain and convert addressing information – create and configure local endpoints for communication – setup local memory to be used in transfer – establish the connection with the remote side Use Phase – actually transfer data to/from the remote side Break-down Phase – basically “undo” the setup phase – close connection, free memory and communication resources
  • 38. 38 Copyright 2012 IOL Client setup phase TCP RDMA 1. process command-line options process command-line options 2. convert DNS name and port no. getaddrinfo() convert DNS name and port no. rdma_getaddrinfo() 3. define properties of new queue pair struct ibv_qp_init_attr 4. create local end point socket() create local end point rdma_create_ep() 5. allocate user virtual memory malloc() allocate user virtual memory malloc() 6. 7. define properties of new connection struct rdma_conn_param 8. create connection with server connect() create connection with server rdma_connect() register user virtual memory with CA rdma_reg_msgs()
  • 39. 39 Copyright 2012 IOL Client use phase TCP RDMA 9. mark start time for statistics mark start time for statistics 10. start of transfer loop start of transfer loop 11. post receive to catch agent's pong data rdma_post_recv() 12. transfer ping data to agent send() post send to start transfer of ping data to agent rdma_post_send() 14. receive pong data from agent recv() wait for receive to complete ibv_poll_cq() 15. optionally verify pong data is ok memcmp() optionally verify pong data is ok memcmp() 16. end of transfer loop end of transfer loop 17. mark stop time and print statistics mark stop time and print statistics wait for send to complete ibv_poll_cq() 13.
  • 40. 40 Copyright 2012 IOL Client breakdown phase TCP RDMA 18. break connection with server close() break connection with server rdma_disconnect() 19. deregister user virtual memory rdma_dereg_mr() 20. free user virtual memory free() free user virtual memory free() 21. destroy local end point rdma_destroy_ep() 22. free getaddrinfo resources freeaddrinfo() free rdma_getaddrinfo resources rdma_freeaddrinfo() 23. “unprocess” command-line options “unprocess” command-line options
  • 41. 41 Copyright 2012 IOL Server participants Listener – waits for connection requests from client – gets new system-provided connection to client – hands-off new connection to agent – never transfers any data to/from client Agent – creates control structures to deal with one client – allocates memory to deal with one client – performs all data transfers with one client – disconnects from client when transfers all finished
  • 42. 42 Copyright 2012 IOL Listener setup and use phases TCP RDMA 1. process command-line options process command-line options 2. convert DNS name and port no. getaddrinfo() convert DNS name and port no. rdma_getaddrinfo() 3. create local end point socket() define properties of new queue pair struct ibv_qp_init_attr 4. bind to address and port bind() 5. establish socket as listener listen() establish socket as listener rdma_listen() 6. start loop start loop 7. get connection request from client accept() get connection request from client rdma_get_request() 8. hand connection over to agent hand connection over to agent 9. end loop end loop create and bind local end point rdma_create_ep()
  • 43. 43 Copyright 2012 IOL Listen breakdown phase TCP RDMA 10. destroy local endpoint close() destroy local endpoint rdma_destroy_ep() 11. free getaddrinfo resources freegetaddrinfo() free getaddrinfo resources rdma_freegetaddrinfo() 12. “unprocess” command-line options “unprocess” command-line options
  • 44. 44 Copyright 2012 IOL Agent setup phase TCP RDMA make copy of listener's options and new cm_id for client 2. allocate user virtual memory malloc() allocate user virtual memory malloc() 3. register user virtual memory with CA rdma_reg_msgs() finalize connection with client rdma_accept() 1. make copy of listener's options 4. post first receive of ping data from client rdma_post_recv() 6. 5. define properties of new connection struct rdma_conn_param
  • 45. 45 Copyright 2012 IOL Agent use phase TCP RDMA 7. start of transfer loop start of transfer loop wait to receive ping data from client ibv_poll_cq() 9. if first time through loop mark start time for statistics 11. transfer pong data to client send() post send to start transfer of pong data to client rdma_post_send() 12. wait for send to complete ibv_poll_cq() 13. end of transfer loop end of transfer loop 14. mark stop time and print statistics mark stop time and print statistics 10. post next receive for ping data from client rdma_post_recv() If first time through loop mark start time for statistics 8. wait to receive ping data from client recv()
  • 46. 46 Copyright 2012 IOL Agent breakdown phase TCP RDMA 15. break connection with client close() break connection with client rdma_disconnect() 16. deregister user virtual memory rdma_dereg_mr() 17. free user virtual memory free() free user virtual memory free() 18. free copy of listener's options free copy of listener's options 19. destroy local end point rdma_destroy_ep()
  • 47. 47 Copyright 2012 IOL ping-pong measurements Client – round-trip-time 15.7 microseconds – user CPU time 100% of elapsed time – kernel CPU time 0% of elapsed time Server – round-trip time 15.7 microseconds – user CPU time 100% of elapsed time – kernel CPU time 0% of elapsed time InfiniBand QDR 4x through a switch
  • 48. 48 Copyright 2012 IOL How to reduce 100% CPU usage Cause is “busy polling” to wait for completions – in tight loop on ibv_poll_cq() – burns CPU since most calls find nothing Why is “busy polling” used at all? – simple to write such a loop – gives very fast response to a completion – (i.e., gives low latency)
  • 49. 49 Copyright 2012 IOL ”busy polling” to get completions 1. start loop 2. ibv_poll_cq() to get any completion in queue 3. exit loop if a completion is found 4. end loop
  • 50. 50 Copyright 2012 IOL How to eliminate “busy polling” Cannot make ibv_poll_cq() block – no flag parameter – no timeout parameter Must replace busy loop with “wait – wakeup” Solution is a “wait-for-event” mechanism –ibv_req_notify_cq() - tell CA to send an “event” when next WC enters CQ –ibv_get_cq_event() - blocks until gets “event” –ibv_ack_cq_event() - acknowledges “event”
  • 51. 51 Copyright 2012 IOL allocate access metadata control data packets ACK OFA verbs API for recv wait-wakeup status ibv_post_recv() USER CHANNEL ADAPTER WIRE register ibv_poll_cq() recv queue completion queue . . . . . . virtual memory ibv_ack_cq_events() control ibv_req_notify_cq() ibv_get_cq_event() wakeup control wait blocked parallel activity
  • 52. 52 Copyright 2012 IOL ”wait-for-event” to get completions 1. start loop 2. ibv_poll_cq() to get any completion in CQ 3. exit loop if a completion is found 4. ibv_req_notify_cq() to arm CA to send event on next completion added to CQ 5. ibv_poll_cq() to get new completion between 2&4 6. exit loop if a completion is found 7. ibv_get_cq_event() to wait until CA sends event 8. ibv_ack_cq_events() to acknowledge event 9. end loop
  • 53. 53 Copyright 2012 IOL ping-pong measurements with wait Client – round-trip-time 21.1 microseconds – up 34% – user CPU time 9.0% of elapsed time – kernel CPU time 9.1% of elapsed time – total CPU time 18% of elapsed time – down 82% Server – round-trip time 21.1 microseconds – up 34% – user CPU time 14.5% of elapsed time – kernel CPU time 6.5% of elapsed time – total CPU time 21% of elapsed time – down 79%
  • 54. 54 Copyright 2012 IOL rdma_xxxx “wrappers” around ibv_xxxx rdma_get_recv_comp() - wrapper for wait-wakeup loop on receive completion queue rdma_get_send_comp() - wrapper for wait-wakeup loop on send completion queue rdma_post_recv() - wrapper for ibv_post_recv() rdma_post_send() - wrapper for ibv_post_send() rdma_reg_msgs() - wrapper for ibv_reg_mr for SEND/RECV rdma_dereg_mr() - wrapper for ibv_dereg_mr()
  • 55. 55 Copyright 2012 IOL where to find “wrappers”, prototypes, data structures, etc. /usr/include/rdma/rdma_verbs.h –contains rdma_xxxx “wrappers” /usr/include/infiniband/verbs.h –contains ibv_xxxx verbs and all ibv data structures, etc. /usr/include/rdma/rdma_cm.h –contains rdma_yyyy verbs and all rdma data structures, etc. for connection management
  • 56. 56 Copyright 2012 IOL Transfer choices TCP/UDP transfer operations –send()/recv() (and related forms) RDMA transfer operations –SEND/RECV similar to TCP/UDP –RDMA WRITE push to remote virtual memory –RDMA READ pull from remote virtual memory –RDMA WRITE_WITH_IMM push with passive side notification
  • 57. 57 Copyright 2012 IOL RDMA WRITE operation Very different concept from normal TCP/IP Very different concept from RDMA SEND/RECV Only one side is active, other is passive Active side (requester) issues RDMA WRITE Passive side (responder) does NOTHING! A better name would be “RDMA PUSH” – data is “pushed” from active side's virtual memory into passive sides' virtual memory – passive side issues no operation, uses no CPU cycles, gets no indication “push” started or completed
  • 58. 58 Copyright 2012 IOL user registered virtual memory user registered virtual memory rdma_post_write() active passive RDMA WRITE data transfer
  • 59. 59 Copyright 2012 IOL Differences with RDMA SEND Active side calls rdma_post_write() – opcode is RDMA_WRITE, not SEND – work request MUST include passive side's virtual memory address and memory registration key Prior to issuing this operation, active side MUST obtain passive side's address and key – use send/recv to transfer this “metadata” – (could actually use any means to transfer “metadata”) Passive side provides “metadata” that enables the data “push”, but does not participate in it
  • 60. 60 Copyright 2012 IOL Similarities with RDMA SEND Both transfer types move messages, not streams Both transfer types are unbuffered Both transfer types require registered virtual memory on both sides of the transfer Both transfer types operate asynchronously – active side posts work request to send queue – active side gets work completion from completion queue Both transfer types use same verbs and data structures (although values and fields differ)
  • 61. 61 Copyright 2012 IOL RDMA READ operation Very different from normal TCP/IP Very different from RDMA SEND/RECV Only one side is active, other is passive Active side (requester) issues RDMA READ Passive side (responder) does NOTHING! A better name would be “RDMA PULL” – data is “pulled” into active side's virtual memory from passive sides' virtual memory – passive side issues no operation, uses no CPU cycles, gets no indication “pull” started or completed
  • 62. 62 Copyright 2012 IOL user registered virtual memory user registered virtual memory rdma_post_read() active passive RDMA READ data transfer
  • 63. 63 Copyright 2012 IOL Ping-pong with RDMA WRITE/READ Client is active side in ping-pong loop – client posts RDMA WRITE from ping buffer – client posts RDMA READ into pong buffer Server agent is passive side in ping-pong loop – does nothing Server agent must send its buffer's address and registration key to client before loop Client must send total number of transfers to agent after the loop – otherwise agent has no way of knowing this number – agent needs to receive something to tell it “loop finished”
  • 64. 64 Copyright 2012 IOL ping-pong using RDMA WRITE/READ client server agent loop registered memory registered memory registered memory loop ping data pong data
  • 65. 65 Copyright 2012 IOL Client ping-pong transfer loop start of transfer loop rdma_post_write() of RDMA WRITE ping data ibv_poll_cq() to wait for RDMA WRITE completion rdma_post_read() of RDMA READ pong data ibv_poll_cq() to wait for RDMA READ completion optionally verify pong data equals ping data end of transfer loop
  • 66. 66 Copyright 2012 IOL Agent ping-pong transfer loop ibv_post_recv() to catch client's “finished” message ibv_poll_cq() to wait for “finished” from client
  • 67. 67 Copyright 2012 IOL ping-pong RDMA WRITE/READ measurements with wait Client – round-trip-time 14.3 microseconds – user CPU time 26.4% of elapsed time – kernel CPU time 3.0% of elapsed time – total CPU time 29.4% of elapsed time Server – round-trip time 14.3 microseconds – user CPU time 0% of elapsed time – kernel CPU time 0% of elapsed time – total CPU time 0% of elapsed time
  • 68. 68 Copyright 2012 IOL Improving performance further All postings discussed so far generate completions – required for all rdma_post_recv() postings – optional for all other prdma_post_xxxx() postings User controls completion generation with IBV_SEND_SIGNALED flag in rdma_post_write() – supplying this flag always generates a completion for that posting – Not setting this flag generates a completion for that posting only in case of an error
  • 69. 69 Copyright 2012 IOL How client can benefit from this feature RDMA READ posting follows RDMA WRITE RDMA READ must finish after RDMA WRITE – due to strict ordering rules in standards Therefore we don't need to do anything with RDMA WRITE completion – completion of RDMA READ guarantees RDMA WRITE transfer succeeded – error on RDMA WRITE transfer will generate a completion Therefore we can send RDMA WRITE unsignaled and NOT wait for its completion
  • 70. 70 Copyright 2012 IOL ping-pong using unsignaled WRITE client server agent loop loop ping data pong data registered memory registered memory registered memory
  • 71. 71 Copyright 2012 IOL Client unsignaled transfer loop start of transfer loop rdma_post_write() of unsignaled RDMA WRITE –generates no completion (except on error) do not wait for RDMA WRITE completion rdma_post_read() of signaled RDMA READ ibv_poll_cq() to wait for RDMA READ completion optionally verify pong data equals ping data end of transfer loop
  • 72. 72 Copyright 2012 IOL ping-pong RDMA WRITE/READ measurements with unsignaled wait Client – round-trip-time 8.3 microseconds – down 42% – user CPU time 28.0% of elapsed time – up 6.1% – kernel CPU time 2.8% of elapsed time – down 6.7% – total CPU time 30.8% of elapsed time – up 4.8% Server – round-trip time 8.3 microseconds – down 42% – user CPU time 0% of elapsed time – kernel CPU time 0% of elapsed time – total CPU time 0% of elapsed time
  • 73. 73 Copyright 2012 IOL Ping-pong performance summary Rankings for Round-Trip Time (RTT) 8.3 usec unsignaled RDMA_WRITE/READ with wait 14.3 usec signaled RDMA_WRITE/READ with wait 15.7 usec signaled SEND/RECV with busy polling 21.1 used signaled SEND/RECV with wait Rankings for client CPU usage 18.0% signaled SEND/RECV with wait 29.4% signaled RDMA_WRITE/READ with wait 30.8% unsignaled RDMA_WRITE/READ with wait 100% signaled SEND/RECV with busy polling
  • 74. 74 Copyright 2012 IOL QUESTIONS?
  • 75. 75 Copyright 2012 IOL Acknowledgments This material is based upon work supported by the National Science Foundation under Grant No. OCI-1127228. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
  • 76. 76 Copyright 2012 IOL THANK YOU!