RDMAoE collaboration with KISTI




          Tuesday 6/7/2011
     10:00am-11:00am (50B-2222)

          mbalman@lbl.gov
RDMA for High Performance Data Movement

     Network I/O operations are costly:
      −   CPU load
      −   Context switching
      −   Memory latency

     Zero-copy networking
      −   NIC copies data directly to/from application
          memory

     IB transport (HPC applications)

     iWARP (TCP stack / TOE)
RDMA model


    One sided operations

    Get/Put semantics
              
                  Send/receive

    Direct data placement
              
                  RDMA Write
              
                  RDMA Read

    Asynchronous
        −   Work Queue (send queue – receive queue)
        −   Completion Queue
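
The asynchronous model above can be illustrated with a toy simulation. This is a minimal sketch, not verbs code: `ToyQueuePair`, `WorkRequest`, `Completion`, and `hardware_progress` are hypothetical names invented for illustration, and no NIC or RDMA library is involved. It only shows that posting a work request returns immediately and that completions appear later on a separate queue that the application polls.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int    # caller-chosen identifier, echoed back in the completion
    opcode: str   # e.g. "RDMA_WRITE", "RDMA_READ", "SEND"
    length: int   # bytes to move


@dataclass
class Completion:
    wr_id: int
    status: str


class ToyQueuePair:
    """Toy stand-in for a send queue plus its completion queue (no real NIC)."""

    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr):
        # Posting returns immediately; the transfer has not happened yet.
        self.send_queue.append(wr)

    def hardware_progress(self, max_ops=1):
        # Stands in for the NIC draining the send queue in the background.
        for _ in range(min(max_ops, len(self.send_queue))):
            wr = self.send_queue.popleft()
            self.completion_queue.append(Completion(wr.wr_id, "SUCCESS"))

    def poll_cq(self, max_entries):
        # Non-blocking: returns whatever completions are ready, possibly none.
        done = []
        while self.completion_queue and len(done) < max_entries:
            done.append(self.completion_queue.popleft())
        return done


qp = ToyQueuePair()
for i in range(3):
    qp.post_send(WorkRequest(wr_id=i, opcode="RDMA_WRITE", length=4096))

early = qp.poll_cq(8)             # nothing has completed yet
qp.hardware_progress(max_ops=2)   # pretend the NIC finished two requests
later = qp.poll_cq(8)             # two completions are now visible
print(len(early), [c.wr_id for c in later], len(qp.send_queue))
# prints: 0 [0, 1] 1
```

The application keeps ownership of the buffers behind every posted request until the matching completion is polled, which is why buffer management becomes an application concern.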
RDMA Programming Model

    Objects
                
                    Queue Pairs (protection domain)
                
                    Send queue (RDMA write, RDMA read)
                
                    Receive queue
                
                    Modify state
                
                    Completion queue (poll)
                
                    Memory region (MR)

    Functions (verbs)
        −     IB (libmlx4), iWARP (libcxgb3)

    Librdmacm (connection setup)
RDMA/iWARP


    Implicit RDMA support

    Explicit RDMA support


    iWARP
       −    Encapsulates RDMA traffic at a high level
       −    Uses the TCP stack
       −    Is it beneficial without TOE?
Alternative Approaches


    RDMA over Converged Ethernet (RoCE)
       −    Lightweight RDMA transport over Ethernet
              
                  Widely deployed technology
              
                  Support kernel bypass
              
                  OFED 1.5.1 supports RoCE

    SoftRDMAs...
       −    SoftRoCE (OFED 1.5.1 supports softRoCE)
       −    SoftiWARP (new TCP kernel stack)
Hidden Cost


    Memory Registration
       −   RDMA Read/Write

    Connection Setup
       −   Librdmacm


→ Bulk data movement?

    Asynchronous Model
       −   Buffer Management
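
One common way to hide the registration cost in bulk transfers is to register a small set of buffers once and recycle them. The sketch below only counts registrations. `ToyRegistrar` and `BufferPool` are hypothetical helpers that stand in for real memory registration, and the buffer count and size are arbitrary assumptions.

```python
class ToyRegistrar:
    """Counts registrations; stands in for pinning and registering memory."""

    def __init__(self):
        self.registrations = 0

    def register(self, size):
        self.registrations += 1
        return bytearray(size)


class BufferPool:
    """Registers a fixed set of buffers once and recycles them."""

    def __init__(self, registrar, count, size):
        self.free = [registrar.register(size) for _ in range(count)]

    def acquire(self):
        return self.free.pop()   # raises IndexError if the pool is exhausted

    def release(self, buf):
        self.free.append(buf)


def transfer_naive(registrar, n_chunks, size):
    # Registers a fresh buffer for every chunk that is sent.
    for _ in range(n_chunks):
        registrar.register(size)


def transfer_pooled(pool, n_chunks):
    # Reuses pre-registered buffers; release happens once a completion is seen.
    for _ in range(n_chunks):
        buf = pool.acquire()
        pool.release(buf)


naive = ToyRegistrar()
transfer_naive(naive, n_chunks=1000, size=64 * 1024)

pooled = ToyRegistrar()
pool = BufferPool(pooled, count=4, size=64 * 1024)
transfer_pooled(pool, n_chunks=1000)

print(naive.registrations, pooled.registrations)
# prints: 1000 4
```

The pooled variant pays the registration cost four times regardless of how many chunks are moved, at the price of having to track which buffers are still in flight.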
Challenges in Bulk Transfer


    Application Level Adjustments

    Request Aggregation
        −   Small data files
        −   Is an FTP-like transfer mechanism appropriate
            for RDMA?

    File System Overhead
        −   Asynchronous Operations

    Connection Caching / Multiple Connections?
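
Request aggregation for small files could look like the following sketch: files are grouped, in order, into batches that each fit one transfer buffer, so several small files travel in a single operation. The function name `aggregate`, the file names, and the sizes are all made up for illustration. Files larger than one buffer would need to be split instead, which this sketch does not handle.

```python
def aggregate(file_sizes, buffer_size):
    """Greedily group (name, size) items, in order, into batches that fit one buffer."""
    batches, current, used = [], [], 0
    for name, size in file_sizes:
        if size > buffer_size:
            raise ValueError(f"{name} does not fit in one buffer")
        if used + size > buffer_size:
            # Current batch is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical file names and sizes (arbitrary units).
files = [("a.nc", 300), ("b.nc", 500), ("c.nc", 400), ("d.nc", 100), ("e.nc", 1000)]
batches = aggregate(files, buffer_size=1000)
print(batches)
# prints: [['a.nc', 'b.nc'], ['c.nc', 'd.nc'], ['e.nc']]
```

Five files become three transfers here. The receiver would also need per-file offsets and lengths to unpack each batch, which an FTP-like one-file-per-request protocol does not provide by default.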
Local Area / Wide Area


    IB RDMA designed for local area
        −   How does RDMA perform in Wide Area?

    iWARP
        −   No promising results - Over TCP (with TOE?)
        −   SoftiWARP ???

    RoCE
        −   Isolated traffic ? / much less CPU usage
        −   softRoCE?
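
One reason wide-area behavior differs from local-area behavior is the bandwidth-delay product: to keep a link busy, roughly bandwidth × round-trip time worth of data has to be in flight. The sketch below works this out for a 10 Gb/s link. The 50 ms round-trip time and the 1 MiB work-request size are illustrative assumptions, not measurements from these experiments.

```python
def bdp_bytes(link_gbps, rtt_ms):
    """Bandwidth-delay product in bytes, using integer arithmetic."""
    return link_gbps * 10**9 * rtt_ms // (1000 * 8)


def requests_in_flight(link_gbps, rtt_ms, request_bytes):
    """Outstanding work requests needed to cover the BDP (rounded up)."""
    bdp = bdp_bytes(link_gbps, rtt_ms)
    return -(-bdp // request_bytes)   # ceiling division


bdp = bdp_bytes(link_gbps=10, rtt_ms=50)                 # assumed WAN round trip
needed = requests_in_flight(10, 50, request_bytes=1 << 20)
print(bdp, needed)
# prints: 62500000 60
```

Under these assumptions about 62.5 MB must be posted and registered at any time, compared with a few hundred kilobytes at sub-millisecond local-area latencies.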
GridFTP over RDMA


    XIO driver for GridFTP
        −   Experimented using Chelsio cards (cxgb3)
        −   10GE
        −   WAN testing in progress!

        −   Local area: 910 MBps – 1175 MBps

        −   Much better than GridFTP over TCP
              
                   Much less CPU load (1/2)
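
To put the local-area figures above in context, the following arithmetic compares them with the raw line rate of a 10GE link. It assumes the reported numbers are megabytes per second (decimal) and ignores all protocol and framing overhead, so the percentages are only a rough upper-level comparison.

```python
LINK_GBPS = 10
# 10 Gb/s = 10,000 Mb/s = 1250 MB/s (decimal megabytes, no overhead).
line_rate_mbytes = LINK_GBPS * 1000 / 8


def utilization(mbytes_per_s):
    """Fraction of the raw line rate that a given throughput represents."""
    return mbytes_per_s / line_rate_mbytes


low, high = utilization(910), utilization(1175)
print(f"{line_rate_mbytes:.0f} MB/s line rate: {low:.1%} to {high:.1%}")
# prints: 1250 MB/s line rate: 72.8% to 94.0%
```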
FTP100 – FTP over RDMA


    Experimented with Mellanox cards
       −    Local area – 10GE

       −    iWARP
              
                  Did not perform well compared to TCP
                     −   No significant gain
       −    RoCE tests
              
                  In progress (have some initial results)
              
                  Limited by the disk performance
              
                  Mem2mem:
                     −   Can already saturate the 10GE link
What is Next?

Experiments with the RDMA model over WAN


    SoftiWARP from IBM Zurich
        −   TCP kernel stack implementing/defining RDMA
            verbs


    SoftRoCE – OFED 1.5.2-rxe distribution
        −   Multiple connections?
Transfer Applications over RDMA


    Simple Client/Server:
        −   Developing a prototype for transferring climate
            datasets using RDMA protocols
        −   Asynchronous memory management module


    Application level tuning?
        −   Memory regions (max/min?)
        −   Multiple QPs
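
One possible form of the "multiple QPs" tuning is to cut a file into fixed-size chunks and hand them to queue pairs round-robin. The sketch below only builds that assignment. `stripe` is a hypothetical helper, the sizes are toy values, and whether striping actually helps is exactly the open question on this slide.

```python
def stripe(total_bytes, chunk_size, num_qps):
    """Assign consecutive (offset, length) chunks to queue pairs round-robin."""
    plan = [[] for _ in range(num_qps)]
    offset, index = 0, 0
    while offset < total_bytes:
        length = min(chunk_size, total_bytes - offset)   # last chunk may be short
        plan[index % num_qps].append((offset, length))
        offset += length
        index += 1
    return plan


plan = stripe(total_bytes=10, chunk_size=4, num_qps=2)
print(plan)
# prints: [[(0, 4), (8, 2)], [(4, 4)]]
```

The chunk size here plays the role of the memory-region size being tuned: larger chunks mean fewer work requests but more registered memory per outstanding request.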
Climate Analysis

Climate Applications are Data-Intensive


    Shared data repository:
        −   Data files need to be downloaded for further
            processing and analysis
        −   Data retrieval is the main bottleneck
        −   Multiple clients (working as VM instances)
               
                   Cannot depend on HW support
               
                   SoftRoCE ? softiWARP
What can we do for WAN testing?



    Q&A?



→ https://sdm.lbl.gov/climate100/
