InfiniBand/RDMA for Storage - SRP vs. iSER


This is the talk given by Sebastian Parschauer (Riemer) at LinuxTag 2013. He is a Linux kernel developer in the storage team at ProfitBricks and develops storage solutions for the IaaS 2.0 cloud.
Especially the last slide about replication caused a lot of discussion.

Published in: Technology, Business
  • @tkiblin We were using iSER with Solaris 11 COMSTAR target and open-iscsi initiator but it was too unstable at that time and is too complex for IB. So we decided to go for a Linux storage with SRP and SCST.
  • Sebastian, thanks for this deck. I'm a bit confused though: you say you use iSER and SRP on each of the respective slides. What are you guys using, SRP? And if so, what target/storage, COMSTAR with ZFS?
  • About the ZFS issue: Solaris 11 ZFS became completely fragmented in our IaaS cloud after a short time, and we didn't even take a single snapshot. With only 2-3 MB/s at the end, it became unusably slow. We complained to Oracle, of course, but the problem was really substantial. I've talked to another company here in Berlin, and they had the same issue and the same performance numbers.
  • The SRP initiator code from Bart which has been merged into mainline in Linux 3.13 became better than ours. We will pick that code up. Great work!
  • Some corrections/updates:
    Besides iWARP there is also RRoCE (routable RoCE) which is routable through the Internet (information by Sagi Grimberg, Mellanox).

    'Losing SCSI devices' with SRP might be misleading. It is the intentional behavior of the current ib_srp implementation: the idea was to let the srp-tools do the reconnect afterwards (hint by Sagi Grimberg, Mellanox).
    But the IaaS cloud with up to 512 LUNs per SCSI host showed that this isn't the right approach for ProfitBricks.

    By now, Bart has also implemented an automatic reconnect in kernel code that does not require removing the SCSI devices.

InfiniBand/RDMA for Storage - SRP vs. iSER

  1. InfiniBand/RDMA for Storage – SRP vs. iSER
     Sebastian Riemer, Linux Kernel Developer – Storage, 23.05.2013
  2. Structure
     ● RDMA Basics
     ● RDMA Hardware
     ● InfiniBand, iWARP, RoCE
     ● RDMA Software + Network Protocols
     ● SRP vs. iSER
     RDMA for Storage 2/28 23.05.2013
  3. RDMA Basics
  4. Remote Direct Memory Access (RDMA)
  5. Latency
     e.g. 4k sync. reads, status/information requests, ...
  6. RDMA MTU
     ● RDMA MTU: 256, 512, 1024, 2048, 4096 Bytes
     ● MTU: affects throughput and transfer latency
     ● Max. MTU is settable
     ● Active MTU is determined
     ● InfiniBand: RDMA MTU is native
     ● iWARP/RoCE: RDMA MTU must fit into the Ethernet MTU: 1500 → 1024 Bytes
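The "1500 → 1024 Bytes" rule on the MTU slide can be sketched as a small shell helper: pick the largest allowed RDMA MTU that still fits into the Ethernet MTU. The function name `rdma_mtu_for_eth` and the default header allowance of 72 bytes are assumptions made for illustration; they are not taken from the slides, and the real overhead depends on the transport.

```shell
# Pick the largest RDMA MTU that fits into a given Ethernet MTU.
# $1: Ethernet MTU in bytes
# $2: allowance for encapsulation headers in bytes
#     (default 72 is an illustrative assumption, not from the slides)
rdma_mtu_for_eth() {
    eth_mtu=$1
    overhead=${2:-72}
    best=""
    # Allowed RDMA MTU values as listed on the slide
    for mtu in 256 512 1024 2048 4096; do
        if [ $((mtu + overhead)) -le "$eth_mtu" ]; then
            best=$mtu
        fi
    done
    if [ -z "$best" ]; then
        echo "no RDMA MTU fits into Ethernet MTU $eth_mtu" >&2
        return 1
    fi
    echo "$best"
}

rdma_mtu_for_eth 1500   # prints 1024
rdma_mtu_for_eth 9000   # prints 4096
```

With a standard 1500-byte Ethernet MTU the helper returns 1024, matching the slide; with jumbo frames the full 4096 bytes would fit.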
  7. RDMA Hardware
  8. InfiniBand (IB)
     ● Switched fabric interconnect
     ● Arbitrary topologies: Fat Tree, Mesh, Lash, ...
     ● Point-to-point bidirectional serial links
     ● Used in HPC and Enterprise Data Centers
     ● QDR 10 Gbit/s, FDR 14 Gbit/s per lane
     ● Lanes: 4
     ● Low end-to-end latency < 2 µs (1 GbE: 35 µs)
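The per-lane rates and the lane count on the InfiniBand slide combine into a link rate. A minimal sketch of that arithmetic, assuming the usual line encodings (8b/10b for QDR, 64b/66b for FDR, which the slide does not mention) and the slide's rounded 14 Gbit/s for FDR; the function name `ib_data_rate` is hypothetical:

```shell
# Usable data rate of an InfiniBand link in Mbit/s (integer arithmetic).
# $1: per-lane signalling rate in Mbit/s
# $2: number of lanes
# $3, $4: payload bits and total bits of the line encoding
ib_data_rate() {
    lane_mbit=$1
    lanes=$2
    enc_payload=$3
    enc_total=$4
    echo $((lane_mbit * lanes * enc_payload / enc_total))
}

ib_data_rate 10000 4 8 10    # QDR 4x with 8b/10b: prints 32000
ib_data_rate 14000 4 64 66   # FDR 4x with 64b/66b: prints 54303
```

So a 4-lane QDR link signals at 40 Gbit/s but carries roughly 32 Gbit/s of data, while FDR loses far less to encoding.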
  9. InfiniBand (IB)
     ● Subnet Manager (SM)
     ● LID (16 bit) and GID (128 bit) addressing
     ● GID = 64 bit subnet prefix + 64 bit GUID
     ● Max. 128 partitions (like VLANs)
     ● QoS, reliability and scalability
     ● Credit-based flow control → no packet loss
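The rule "GID = 64 bit subnet prefix + 64 bit GUID" can be illustrated with plain string concatenation of two 16-digit hex values. This is only a sketch: the helper names `is_hex16` and `make_gid` are hypothetical, the prefix shown is the link-local default, and the port GUID is the one that appears in the dgid on the connection slide of this deck.

```shell
# Return success if $1 is exactly 16 hexadecimal digits (64 bit).
is_hex16() {
    [ "${#1}" -eq 16 ] || return 1
    case "$1" in
        *[!0-9a-fA-F]*) return 1 ;;
    esac
    return 0
}

# Build a 128 bit GID from a 64 bit subnet prefix ($1) and a 64 bit GUID ($2).
make_gid() {
    if ! is_hex16 "$1" || ! is_hex16 "$2"; then
        echo "expected two 16-digit hex values" >&2
        return 1
    fi
    echo "${1}${2}"
}

SUBNET_PREFIX="fe80000000000000"   # default link-local subnet prefix
PORT_GUID="0002c903004ed0b3"       # port GUID taken from the deck's dgid
make_gid "$SUBNET_PREFIX" "$PORT_GUID"
```

The result is a 32-digit hex string in the same form that the SRP `dgid` parameter expects.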
  10. InfiniBand Congestion
      ● Congestion Control (CC) not ready yet
      ● CC = tell SM to tell others to reduce their speed
      ● Reduce MTU, set QoS, set IO limits, multipath
      [Diagram labels: BLOCKED, NO CREDITS, (tell SM); master SM, slave SM]
  11. Host Channel Adapters (HCA)
      ● IB counterpart of NICs
      ● Communicate via a Queue Pair (QP) consisting of Send Queue (SQ) and Receive Queue (RQ)
      ● Reliable/Unreliable, Connected/Disconnected
      ● Support for atomic operations
      ● Error counters in HW
  12. Host Channel Adapters (HCA)
      ● Mellanox QDR, driver: mlx4_ib, ConnectX-2 VPI
      ● QLogic/Intel QDR, driver: qib, 7300 Series, better for the DC/cloud
  13. Internet Wide Area RDMA Protocol (iWARP)
      ● RDMA Network Interface Card (RNIC)
      ● Connection-oriented (TCP), only RDMA technology routable through the Internet
      ● Reliable Connected (RC) only
      ● Latency, bandwidth: >= 3 µs, usually 10 Gbit/s
      ● Vendors: Chelsio (driver cxgb3/4), Intel NetEffect (driver nes)
  14. RDMA over Converged Ethernet (RoCE)
      ● Limited to a single Ethernet broadcast domain
      ● InfiniBand frame encapsulation (IBoE)
      ● GID is composed of MAC address + reserved
      ● Better suited upon congestion
      ● Scaling issues in big data center setups
      ● Latency, bandwidth: < 2 µs, 10/40 Gbit/s
      ● Vendors: Mellanox (driver mlx4_en), Emulex (driver ocrdma)
  15. RDMA Software + Network Protocols
  16. OpenFabrics Enterprise Distribution (OFED)
      ● Approx. 30 SW packages
      ● Upstream version: 3.5
      ● IB Verbs: Hardware/OS abstraction layer
      ● One IB verbs user-space driver per RDMA HW
      ● IB Subnet Management (e.g. opensm)
      ● Communication Management (CM)
      ● Performance and diagnosis tools + utilities
  17. RDMA Network Protocols
      ● IP over InfiniBand (IPoIB)
      ● iSCSI Extensions for RDMA (iSER)
      ● SCSI RDMA Protocol (SRP)
      ● Network File Systems (NFS-RDMA)
      ● Distributed File Systems (GlusterFS, Lustre)
  18. SRP vs. iSER
  19. iSCSI Extensions for RDMA (iSER): Target
      Components (diagram labels: user, kernel):
      ● Solaris COMSTAR
      ● (LIO isert, kernel 3.10)
      ● STGT
      Notes:
      ● Mellanox pushes iSER and STGT
      ● No advanced features with STGT like live resizing
      ● ProfitBricks chose Solaris for ZFS and iSER
      ● LIO isert is too new
  20. iSCSI Extensions for RDMA (iSER): open-iscsi Initiator
      Components (diagram labels: user, kernel):
      ● ib_iser
      ● libiscsi
      ● scsi_transport_iscsi
      ● (ib_ipoib)
      ● iscsid
      Notes:
      ● Complexity
      ● Multiple maintainers
      ● Major IPoIB bugs
      ● IP-based DDoS reconnect
      ● Mellanox is mainly improving performance
      ● Too unstable for IB
  21. SCSI RDMA Protocol (SRP): Target
      Components (diagram labels: user, kernel):
      ● SCST ib_srpt
      ● Solaris COMSTAR
      ● (LIO ib_srpt)
      Notes:
      ● Very committed SCST maintainers Bart and Vlad (Bart Van Assche, Vladislav Bolkhovitin)
      ● ProfitBricks chose SCST due to ZFS and iSER issues
      ● LIO SRP unstable/unusable
  22. SCSI RDMA Protocol (SRP): Initiator
      Components (diagram labels: user, kernel):
      ● ib_srp
      ● scsi_transport_srp
      ● (srp-tools)
      Notes:
      ● Simplicity: RDMA-only, kernel-only possible
      ● Inactive maintainer
      ● No fast IO failing, no continuous reconnect
      ● Losing SCSI disks
      ● Bart + Mellanox are active
      ● Bart's work doesn't fit us
  23. ProfitBricks Choices
      ● Simplicity = Stability → SRP without srp-tools
      ● Help improve SCST
      ● Improved SRP initiator ourselves
      ● Just fast IO failing + automatic reconnect
      ● Never lose SCSI devices automatically
      ● Published SRP initiator fixes
      ● Implement RDMA into QEMU for performance
  24. SRP Fixes
      ● From Bart:
      ● From ProfitBricks:
      ● Bart also has performance patches + backport
      ● Bart uses the srp-tools + losing SCSI devices
      ● Gradually finding compromises
  25. Establish an SRP connection
          THCA_GUID="0002c903004ed0b2"
          TGID_P1="fe800000000000000002c903004ed0b3"
          PKEY="ffff"
          IHCA="mlx4_0"
          IHCA_P1="1"
          SRP="id_ext=${THCA_GUID},ioc_guid=${THCA_GUID},dgid=${TGID_P1},pkey=${PKEY},service_id=${THCA_GUID}"
          echo "${SRP}" > /sys/class/infiniband_srp/srp-${IHCA}-${IHCA_P1}/add_target
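When many targets have to be connected, the steps on the connection slide could be wrapped into functions that validate the identifiers first. The following is only a sketch: the function names and the `SRP_APPLY` switch are hypothetical, and it keeps the slide's convention of using the target HCA GUID for `id_ext`, `ioc_guid` and `service_id`. By default it performs a dry run and writes nothing to sysfs.

```shell
# Build the add_target string from target HCA GUID ($1), destination GID ($2)
# and partition key ($3, default ffff), following the layout on the slide.
srp_target_string() {
    hca_guid=$1
    dgid=$2
    pkey=${3:-ffff}
    if [ "${#hca_guid}" -ne 16 ] || [ "${#dgid}" -ne 32 ]; then
        echo "need a 16-digit GUID and a 32-digit GID" >&2
        return 1
    fi
    echo "id_ext=${hca_guid},ioc_guid=${hca_guid},dgid=${dgid},pkey=${pkey},service_id=${hca_guid}"
}

# Connect via initiator HCA ($1) and port ($2); remaining arguments are
# passed to srp_target_string. Writes to sysfs only if SRP_APPLY=1
# (hypothetical switch), otherwise prints what would be written.
srp_add_target() {
    ihca=$1
    port=$2
    shift 2
    target=$(srp_target_string "$@") || return 1
    path="/sys/class/infiniband_srp/srp-${ihca}-${port}/add_target"
    if [ "${SRP_APPLY:-0}" = "1" ]; then
        echo "$target" > "$path"
    else
        echo "would write '$target' to $path"
    fi
}

PREFIX="fe80000000000000"
PORT_GUID="0002c903004ed0b3"
srp_add_target mlx4_0 1 0002c903004ed0b2 "${PREFIX}${PORT_GUID}" ffff
```

Running the real write requires root privileges and a loaded ib_srp module, which is why the sketch defaults to a dry run.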
  26. InfiniBand/RDMA Links/Information
      ● InfiniBand Trade Association (IB specification, doc,
      ● OpenFabrics Alliance (OFA, OFED providers,
      ● Mellanox Technologies (
      ● mailing list
      ● LinkedIn group "InfiniBand Technologists"
  27. Questions?
      ● www.profitbricks.com
  28. Bonus: How to do replication right?
      [Diagram comparing three replication setups, each accessed via SRP/iSER/iSCSI]
      ● Primary + Secondary (IP, Cluster Manager): WRONG! Store & forward writes! Slow!
      ● Primary + Primary (IP, Cluster Manager): WRONG! Complex, error-prone!
      ● LUN + LUN, e.g. SW RAID-1: RIGHT! Simple and fast!
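The "e.g. SW RAID-1" variant on the bonus slide could look roughly like this on a Linux initiator that sees one LUN from each storage server. This is a sketch under assumptions: the slide does not name a tool, so `mdadm` is one possible choice, and the device names are placeholders. The hypothetical helper `raid1_command` only prints the command instead of running it.

```shell
# Print (not run) an mdadm command that mirrors two LUNs on the initiator.
# $1: md device to create, $2 and $3: block devices of the two LUNs
# (all device names here are placeholders, not taken from the slides)
raid1_command() {
    md_dev=$1
    lun_a=$2
    lun_b=$3
    echo "mdadm --create ${md_dev} --level=1 --raid-devices=2 ${lun_a} ${lun_b}"
}

raid1_command /dev/md0 /dev/sdb /dev/sdc
```

The initiator then writes to both LUNs directly, so no storage server has to store and forward writes to a secondary.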