InfiniBand/RDMA for Storage - SRP vs. iSER


This is the talk given by Sebastian Parschauer (Riemer) at LinuxTag 2013. He is a Linux kernel developer on the storage team at ProfitBricks and develops storage solutions for the IaaS 2.0 cloud.
The last slide, on replication, prompted a particularly lively discussion.

Published in: Technology, Business

  1. InfiniBand/RDMA for Storage – SRP vs. iSER
     Sebastian Riemer, Linux Kernel Developer – Storage
     23.05.2013
  2. Structure
     ● RDMA Basics
     ● RDMA Hardware
     ● InfiniBand, iWARP, RoCE
     ● RDMA Software + Network Protocols
     ● SRP vs. iSER
     RDMA for Storage 2/28 23.05.2013
  3. RDMA Basics
  4. Remote Direct Memory Access (RDMA)
  5. Latency
     e.g. 4k sync. reads, status/information requests, ...
  6. RDMA MTU
     ● RDMA MTU: 256, 512, 1024, 2048, 4096 bytes
     ● The MTU affects both throughput and transfer latency
     ● Max. MTU is settable
     ● Active MTU is determined
     ● InfiniBand: RDMA MTU is native
     ● iWARP/RoCE: RDMA MTU must fit into the Ethernet MTU: 1500 → 1024 bytes
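The "1500 → 1024 bytes" rule on the slide above can be sketched as a small shell function that picks the largest allowed RDMA MTU fitting into an Ethernet MTU. The function name and the default header overhead of 64 bytes are illustrative assumptions, not values from the talk; real stacks reserve room for their own transport headers.

```sh
# Pick the largest RDMA MTU (256..4096) that fits into an Ethernet MTU.
# $1 = Ethernet MTU in bytes
# $2 = assumed header overhead in bytes (hypothetical default: 64)
rdma_mtu_for_ethernet() {
    eth_mtu=$1
    overhead=${2:-64}
    best=0
    for mtu in 256 512 1024 2048 4096; do
        # Keep the candidate only if payload plus headers still fits.
        if [ "$mtu" -le $((eth_mtu - overhead)) ]; then
            best=$mtu
        fi
    done
    echo "$best"
}

rdma_mtu_for_ethernet 1500   # standard Ethernet frames
rdma_mtu_for_ethernet 9000   # jumbo frames
```

With the assumed overhead, a standard 1500-byte Ethernet MTU yields 1024, matching the slide, and jumbo frames allow the full 4096.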
  7. RDMA Hardware
  8. InfiniBand (IB)
     ● Switched fabric interconnect
     ● Arbitrary topologies: Fat Tree, Mesh, Lash, ...
     ● Point-to-point bidirectional serial links
     ● Used in HPC and Enterprise Data Centers
     ● QDR 10 Gbit/s, FDR 14 Gbit/s per lane
     ● Lanes: 4
     ● Low end-to-end latency < 2 µs (1 GbE: 35 µs)
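The per-lane rates and lane count on the slide above multiply out to the familiar link speeds. A minimal sketch in shell arithmetic follows; the function name is hypothetical, and the line encodings (8b/10b for QDR, 64b/66b for FDR) are added here as background rather than taken from the slide.

```sh
# link_rate_mbit <per-lane Gbit/s> <lanes> <payload bits> <line bits>
# Prints the resulting rate in Mbit/s (integer arithmetic, rounded down).
link_rate_mbit() {
    echo $(( $1 * 1000 * $2 * $3 / $4 ))
}

link_rate_mbit 10 4 1 1     # QDR 4x signalling rate
link_rate_mbit 10 4 8 10    # QDR 4x after 8b/10b encoding
link_rate_mbit 14 4 1 1     # FDR 4x signalling rate
link_rate_mbit 14 4 64 66   # FDR 4x after 64b/66b encoding
```

With the slide's numbers this gives 40 and 56 Gbit/s signalling for 4x QDR and 4x FDR, and roughly 32 and 54.3 Gbit/s of usable data rate once the encoding overhead is removed.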
  9. InfiniBand (IB)
     ● Subnet Manager (SM)
     ● LID (16 bit) and GID (128 bit) addressing
     ● GID = 64 bit subnet prefix + 64 bit GUID
     ● Max. 128 partitions (like VLANs)
     ● QoS, reliability and scalability
     ● Credit-based flow control → no packet loss
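The GID layout on the slide above (64 bit subnet prefix followed by a 64 bit GUID) amounts to concatenating two 16-digit hex strings. A minimal sketch, assuming both parts are given as plain hex without separators; `make_gid` is a hypothetical helper name, and the example values are the default link-local prefix and the port GUID implied by the `TGID_P1` value used later in the talk.

```sh
# Compose a 128 bit GID from a 64 bit subnet prefix and a 64 bit GUID.
# Both arguments must be exactly 16 hex digits.
make_gid() {
    prefix=$1
    guid=$2
    case "$prefix$guid" in
        *[!0-9a-fA-F]*)
            echo "error: arguments must be hexadecimal" >&2
            return 1 ;;
    esac
    if [ "${#prefix}" -ne 16 ] || [ "${#guid}" -ne 16 ]; then
        echo "error: prefix and GUID must be 16 hex digits each" >&2
        return 1
    fi
    echo "${prefix}${guid}"
}

make_gid fe80000000000000 0002c903004ed0b3
```

The call prints `fe800000000000000002c903004ed0b3`, the same 32-digit form that the SRP connection slide passes as `dgid`.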
  10. InfiniBand Congestion
      ● Congestion Control (CC) not ready yet
      ● CC = tell SM to tell others to reduce their speed
      ● Reduce MTU, set QoS, set IO limits, multipath
      [Diagram labels: BLOCKED, NO CREDITS, (tell SM); master SM, slave SM]
  11. Host Channel Adapters (HCA)
      ● IB counterpart of NICs
      ● Communicate via a Queue Pair (QP) consisting of Send Queue (SQ) and Receive Queue (RQ)
      ● Reliable/Unreliable, Connected/Disconnected
      ● Support for atomic operations
      ● Error counters in HW
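The hardware error counters mentioned on the slide above are, on Linux, typically exposed as one file per counter under `/sys/class/infiniband/<device>/ports/<port>/counters/`. The following is a rough sketch that lists the non-zero error-like counters of one port; the function name, the name filter, and the overridable sysfs root (third argument) are assumptions made for illustration.

```sh
# Print "name=value" for every non-zero error-like counter of an HCA port.
# $1 = device (e.g. mlx4_0), $2 = port number
# $3 = sysfs root (assumed default: /sys/class/infiniband)
ib_error_counters() {
    dir="${3:-/sys/class/infiniband}/$1/ports/$2/counters"
    if [ ! -d "$dir" ]; then
        echo "no counters directory: $dir" >&2
        return 1
    fi
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # Skip pure traffic counters; keep names that look like errors.
        case "$name" in
            *error*|*discard*|*dropped*|link_downed) ;;
            *) continue ;;
        esac
        value=$(cat "$f" 2>/dev/null) || continue
        # Ignore anything that is not a plain decimal number.
        case "$value" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$value" -ne 0 ]; then
            echo "$name=$value"
        fi
    done
    return 0
}

# Example (needs RDMA hardware): ib_error_counters mlx4_0 1
```

Passing a different root directory as the third argument makes it possible to try the function against a copied or mocked sysfs tree.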
  12. Host Channel Adapters (HCA)
      ● Mellanox QDR: ConnectX-2 VPI, driver: mlx4_ib
      ● QLogic/Intel QDR: 7300 Series, driver: qib
      ● Slide annotation: "better for the DC/cloud"
  13. Internet Wide Area RDMA Protocol (iWARP)
      ● RDMA Network Interface Card (RNIC)
      ● Connection-oriented (TCP), only RDMA technology routable through the Internet
      ● Reliable Connected (RC) only
      ● Latency, bandwidth: >= 3 µs, usually 10 Gbit/s
      ● Vendors: Chelsio (driver cxgb3/4), Intel NetEffect (driver nes)
  14. RDMA over Converged Ethernet (RoCE)
      ● Limited to a single Ethernet broadcast domain
      ● InfiniBand frame encapsulation (IBoE)
      ● GID is composed of MAC address + reserved
      ● Better suited upon congestion
      ● Scaling issues in big data center setups
      ● Latency, bandwidth: < 2 µs, 10/40 Gbit/s
      ● Vendors: Mellanox (driver mlx4_en), Emulex (driver ocrdma)
  15. RDMA Software + Network Protocols
  16. OpenFabrics Enterprise Distribution (OFED)
      ● Approx. 30 SW packages
      ● Upstream version: 3.5
      ● IB Verbs: Hardware/OS abstraction layer
      ● One IB verbs user-space driver per RDMA HW
      ● IB Subnet Management (e.g. opensm)
      ● Communication Management (CM)
      ● Performance and diagnosis tools + utilities
  17. RDMA Network Protocols
      ● IP over InfiniBand (IPoIB)
      ● iSCSI Extensions for RDMA (iSER)
      ● SCSI RDMA Protocol (SRP)
      ● Network File Systems (NFS-RDMA)
      ● Distributed File Systems (GlusterFS, Lustre)
  18. SRP vs. iSER
  19. iSCSI Extensions for RDMA (iSER): Target
      Implementations [diagram labels: user, kernel]:
      ● Solaris COMSTAR
      ● (LIO isert, kernel 3.10)
      ● STGT
      Assessment:
      ● Mellanox pushes iSER and STGT
      ● No advanced features with STGT like live resizing
      ● ProfitBricks chose Solaris for ZFS and iSER
      ● LIO isert is too new
  20. iSCSI Extensions for RDMA (iSER): open-iscsi Initiator
      Components [diagram labels: user, kernel]:
      ● ib_iser
      ● libiscsi
      ● scsi_transport_iscsi
      ● (ib_ipoib)
      ● iscsid
      Assessment:
      ● Complexity
      ● Multiple maintainers
      ● Major IPoIB bugs
      ● IP-based DDoS reconnect
      ● Mellanox is mainly improving performance
      ● Too unstable for IB
  21. SCSI RDMA Protocol (SRP): Target
      Implementations [diagram labels: user, kernel]:
      ● SCST ib_srpt
      ● Solaris COMSTAR
      ● (LIO ib_srpt)
      Assessment:
      ● Very committed SCST maintainers Bart and Vlad (Bart Van Assche, Vladislav Bolkhovitin)
      ● ProfitBricks chose SCST due to ZFS and iSER issues
      ● LIO SRP unstable/unusable
  22. SCSI RDMA Protocol (SRP): Initiator
      Components [diagram labels: user, kernel]:
      ● ib_srp
      ● scsi_transport_srp
      ● (srp-tools)
      Assessment:
      ● Simplicity: RDMA-only, kernel-only possible
      ● Inactive maintainer
      ● No fast IO failing, no continuous reconnect
      ● Losing SCSI disks
      ● Bart + Mellanox are active
      ● Bart's work doesn't fit us
  23. ProfitBricks Choices
      ● Simplicity = Stability → SRP without srp-tools
      ● Help improve SCST
      ● Improved SRP initiator ourselves
      ● Just fast IO failing + automatic reconnect
      ● Never lose SCSI devices automatically
      ● Published SRP initiator fixes
      ● Implement RDMA into QEMU for performance
  24. SRP Fixes
      ● From Bart:
      ● From ProfitBricks:
      ● Bart also has performance patches + backport
      ● Bart uses the srp-tools + losing SCSI devices
      ● Gradually finding compromises
  25. Establish an SRP connection
      ● THCA_GUID="0002c903004ed0b2"
      ● TGID_P1="fe800000000000000002c903004ed0b3"
      ● PKEY="ffff"
      ● IHCA="mlx4_0"
      ● IHCA_P1="1"
      ● SRP="id_ext=${THCA_GUID},ioc_guid=${THCA_GUID},dgid=${TGID_P1},pkey=${PKEY},service_id=${THCA_GUID}"
      ● echo "${SRP}" > /sys/class/infiniband_srp/srp-${IHCA}-${IHCA_P1}/add_target
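The steps on the slide above can be wrapped into two small functions, one that builds the target string and one that writes it to `add_target`. This is only a sketch: the function names are hypothetical, and the dry-run fallback for hosts without the `ib_srp` sysfs entry is an addition for illustration, not part of the talk.

```sh
# Build the SRP login string from target HCA GUID, target port GID and P_Key.
# As on the slide, the same GUID is used for id_ext, ioc_guid and service_id.
srp_target_string() {
    thca_guid=$1
    tgid=$2
    pkey=${3:-ffff}
    echo "id_ext=${thca_guid},ioc_guid=${thca_guid},dgid=${tgid},pkey=${pkey},service_id=${thca_guid}"
}

# Write the login string to the initiator port's add_target file.
# $1 = initiator HCA (e.g. mlx4_0), $2 = port number, $3 = login string
srp_add_target() {
    add_target="/sys/class/infiniband_srp/srp-$1-$2/add_target"
    if [ -w "$add_target" ]; then
        echo "$3" > "$add_target"
    else
        # No writable sysfs entry (module not loaded, not root, no HCA).
        echo "dry run: would write '$3' to $add_target" >&2
    fi
}

SRP=$(srp_target_string 0002c903004ed0b2 fe800000000000000002c903004ed0b3)
echo "$SRP"
# Then, as root on the initiator: srp_add_target mlx4_0 1 "$SRP"
```

Keeping the string construction separate from the sysfs write makes it easy to inspect the login string before any connection attempt is made.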
  26. InfiniBand/RDMA Links/Information
      ● InfiniBand Trade Association (IB specification, doc,
      ● OpenFabrics Alliance (OFA, OFED providers,
      ● Mellanox Technologies (
      ● mailing list
      ● LinkedIn group "InfiniBand Technologists"
  27. Questions?
      ● www.profitbricks.com
  28. Bonus: How to do replication right?
      [Diagram comparing three replication setups, all attached via SRP/iSER/iSCSI; labels recovered from the slide:]
      ● Primary + Secondary: WRONG! Store & forward writes! Slow!
      ● Primary + Primary (cluster managers, IP): WRONG! Complex, error-prone!
      ● Two LUNs, e.g. SW RAID-1: RIGHT! Simple and fast!
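The "RIGHT" variant on the last slide mirrors two remote LUNs on the initiator side, e.g. with software RAID-1. One way this might look on Linux is an `mdadm` mirror across the two imported SCSI disks. The talk does not name a tool, so `mdadm`, the helper name, and all device paths below are assumptions; the function only prints the command and does not run it.

```sh
# Print (not execute) an mdadm command that mirrors two remote LUNs.
# $1 = md device to create, $2 and $3 = the two imported SCSI disks
raid1_create_cmd() {
    md=$1
    lun_a=$2
    lun_b=$3
    echo "mdadm --create $md --level=1 --raid-devices=2 $lun_a $lun_b"
}

# Hypothetical device names; check the real ones before running anything.
raid1_create_cmd /dev/md0 /dev/sdb /dev/sdc
```

Each write then goes from the initiator to both storage servers in parallel, which is the property the slide labels simple and fast, in contrast to store-and-forward replication between the targets.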