Introduction to NVMe Over Fabrics-V3R


  1. 1. Introduction to NVMe over Fabrics 10/2016 v3 Simon Huang
  2. 2. Agenda • What is NVM Express™? • What’s NVMe over Fabrics? • Why NVMe over Fabrics? • Expanding NVMe to Fabrics • NVMe over Fabrics in the Data Center • End-to-End NVMe over Fabrics • NVMe Multi-Fabric Transport Mapping • NVMe over Fabrics at Storage Tiers • End-to-End NVMe Model • Shared Server Flash • NVMe Over Fabrics Products (Examples) • Recap • Backup 1 and 2
  3. 3. What is NVM Express™? • Industry standard for PCIe SSDs • High-performance, low-latency, PCIe SSD interface • Command set + PCIe register interface • In-box NVMe host drivers for Linux, Windows, VMware, … • Standard h/w drive form factors, mobile to enterprise • NVMe community is 100+ companies strong and growing • Learn more at
  4. 4. What’s NVMe over Fabrics? • Non-Volatile Memory Express (NVMe) over Fabrics is a technology specification designed to let NVMe message-based commands transfer data between a host computer and a target solid-state storage device or system over a network such as Ethernet, Fibre Channel, or InfiniBand.
  5. 5. Why NVMe over Fabrics? • End-to-End NVMe semantics across a range of topologies – Retains NVMe efficiency and performance over network fabrics – Eliminates unnecessary protocol translations – Enables low-latency and high IOPS remote NVMe storage solutions
  6. 6. Expanding NVMe to Fabrics • Built on common NVMe architecture with additional definitions to support message-based NVMe operations • Standardization of NVMe over a range of fabric types • Initial fabrics: RDMA (RoCE, iWARP, InfiniBand™) and Fibre Channel • First specification was released in June 2016 • Fabrics Linux Driver WG developing host and target drivers
  7. 7. NVMe Over Fabrics Evolution of Non-Volatile Storage in the Data Center
  8. 8. End-to-End NVMe over Fabrics • Extends efficiency of NVMe over front-end and back-end fabrics • Enables efficient NVMe end-to-end model (Host<->NVMe PCIe SSD)
  9. 9. NVMe over Fabrics Advantages • Industry standard interface (multiple sources) • Unlimited storage per server • Scale storage independent of servers • Highly efficient shared storage • HA is straightforward • Greater IO performance
  10. 10. NVMe Multi-Fabric Transport Mapping Fabric Message Based Transports
  11. 11. NVMe over Fabrics at Storage Tiers
  12. 12. End-to-End NVMe Model • NVMe efficiencies scaled across entire fabric
  13. 13. Shared Server Flash - NVMe Storage • RDMA support required for lowest latency • Ethernet, IB, or OmniPath fabrics possible – IB and OmniPath support RDMA – Ethernet has RoCE v1/v2, iWARP, and iSCSI RDMA options – iSCSI offload has built-in RDMA WRITE • Disaster Recovery (DR) requires MAN or WAN – iWARP and iSCSI are the only options that support MAN and WAN
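
The fabric trade-offs on this slide can be restated as a small lookup table. The following is a minimal Python sketch, not anything from the deck itself: the `Fabric` record and the `candidates` helper are hypothetical names, and the RDMA and MAN/WAN flags simply repeat the slide's claims rather than any specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fabric:
    name: str
    rdma: bool     # slide claims an RDMA option exists on this fabric
    man_wan: bool  # slide claims the option works over MAN/WAN (needed for DR)


# Flags restate the slide; they are assumptions, not spec-derived facts.
FABRICS = [
    Fabric("InfiniBand", rdma=True, man_wan=False),
    Fabric("OmniPath", rdma=True, man_wan=False),
    Fabric("RoCE", rdma=True, man_wan=False),
    Fabric("iWARP", rdma=True, man_wan=True),
    Fabric("iSCSI", rdma=True, man_wan=True),
]


def candidates(need_dr: bool) -> List[str]:
    """Return fabrics with RDMA, optionally restricted to MAN/WAN-capable ones."""
    return [f.name for f in FABRICS if f.rdma and (f.man_wan or not need_dr)]


print(candidates(need_dr=False))
print(candidates(need_dr=True))
```

With `need_dr=True` the list narrows to the two options the slide names for disaster recovery.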
  14. 14. NVMe Over Fabrics Products (Examples) Arrays: • Gangster (NX6320/NX6325/NX6340) All-Flash Arrays • EMC DSSD D5 Adapters: • Chelsio’s Terminator 5 • QLogic FastLinQ QL45611HLCU 100Gb Intelligent Ethernet Adapter
  15. 15. Recap • NVMe was built from the ground up to support a consistent model for NVM interfaces, even across network fabrics • Simplicity of protocol enables hardware automated I/O Queues – NVMe transport bridge • No translation to or from another protocol like SCSI (in firmware/software) • Inherent parallelism of NVMe multiple I/O Queues is exposed to the host • NVMe commands and structures are transferred end-to-end • Maintains the NVMe architecture across a range of fabric types
  16. 16. Backup-1
  17. 17. Seagate SSD • 1200.2 Series, SAS 12Gb/s: up to 210K RR IOPS and 25 DWPD • XM1400 Series, M.2 22110, PCIe G3 x4: up to 3 DWPD • XF1400 Series, U.2, PCIe G3 x4: up to 200K RR IOPS and 3 DWPD • XP6500 Series, AIC, PCIe G3 x8: up to 300K RR IOPS • XP7200 Series, AIC, PCIe G3 x16: up to 940K RR IOPS • XP6300 Series, AIC, PCIe G3 x8: up to 296K RR IOPS
  18. 18. Traditional Scale Out Storage • Support for high-BW/IOPS NVMe preserves software investment, because it keeps existing software price/performance competitive • Support for high-BW/IOPS NVMe realizes most of the NVMe speedup benefits • Disaster Recovery (DR) requires MAN or WAN
  19. 19. RDMA • RDMA stands for Remote Direct Memory Access and enables one computer to access another’s internal memory directly without involving the destination computer’s operating system. The destination computer’s network adapter moves data directly from the network into an area of application memory without the OS involving its own data buffers and network I/O stack. Consequently the transfer is very fast. It has the downside of not having an acknowledgement (ack) sent back to the source computer telling it that the transfer has been successful. • There is no general RDMA standard, meaning that implementations are specific to particular servers and network adapters, operating systems and applications. There are RDMA implementations for Linux and Windows Server 2012, which may use iWARP, RoCE, and InfiniBand as the carrying layer for the transfers.
  20. 20. iWARP - Internet Wide Area RDMA Protocol • iWARP (internet Wide Area RDMA Protocol) implements RDMA over Internet Protocol networks. It is layered on IETF-standard congestion-aware protocols such as TCP and SCTP, and uses a mix of layers, including DDP (Direct Data Placement), MPA (Marker PDU Aligned framing), and a separate RDMA protocol (RDMAP) to deliver RDMA services over TCP/IP. Because of this it is said to have lower throughput and higher latency, and to require more CPU and memory, than RoCE. • For example: "Latency will be higher than RoCE (at least with both Chelsio and Intel/NetEffect implementations), but still well under 10 μs." • Mellanox says no iWARP support is available at 25, 50, and 100Gbit/s Ethernet speeds. Chelsio says the IETF standard for RDMA is iWARP. It provides the same host interface as InfiniBand and is available in the same OpenFabrics Enterprise Distribution (OFED). • Chelsio, which positions iWARP as an alternative to InfiniBand, says iWARP is the industry standard for RDMA over Ethernet. High performance iWARP implementations are available and compete directly with InfiniBand in application benchmarks.
  21. 21. RoCE - RDMA over Converged Ethernet • RoCE (RDMA over Converged Ethernet) allows remote direct memory access (RDMA) over an Ethernet network. It operates over layer 2 and layer 3 DCB-capable (DCB - Data Centre Bridging) switches. Such switches comply with the IEEE 802.1 Data Center Bridging standard, which is a set of extensions to traditional Ethernet, geared to providing a lossless data centre transport layer that, Cisco says, helps enable the convergence of LANs and SANs onto a single unified fabric. DCB switches support the Fibre Channel over Ethernet (FCoE) networking protocol. There are two versions: • RoCE v1 uses Ethernet as its link-layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. • RoCE v2 runs RDMA on top of UDP/IP and can be routed.
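
The difference between the two RoCE versions shows up directly in the frame headers. The sketch below is a simplified Python illustration, assuming untagged Ethernet frames and IPv4 only: it treats EtherType 0x8915 as RoCE v1 and UDP destination port 4791 as RoCE v2. The `classify` function is a hypothetical helper, not part of any RDMA library, and it ignores VLAN tags and IPv6.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # RoCE v1 is carried directly in Ethernet frames
IPV4_ETHERTYPE = 0x0800
UDP_PROTO = 17
ROCE_V2_UDP_PORT = 4791     # UDP destination port used by RoCE v2


def classify(frame: bytes) -> str:
    """Classify an untagged Ethernet frame as 'RoCE v1', 'RoCE v2', or 'other'."""
    if len(frame) < 14:
        return "other"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1"
    if ethertype == IPV4_ETHERTYPE and len(frame) >= 14 + 20:
        ihl = (frame[14] & 0x0F) * 4  # IPv4 header length in bytes
        proto = frame[14 + 9]         # IPv4 protocol field
        udp = 14 + ihl                # offset of the UDP header
        if proto == UDP_PROTO and len(frame) >= udp + 4:
            (dport,) = struct.unpack("!H", frame[udp + 2:udp + 4])
            if dport == ROCE_V2_UDP_PORT:
                return "RoCE v2"
    return "other"


def make_udp_frame(dport: int) -> bytes:
    """Build a minimal synthetic Ethernet/IPv4/UDP frame for demonstration."""
    eth = bytes(12) + struct.pack("!H", IPV4_ETHERTYPE)
    ip = bytes([0x45]) + bytes(8) + bytes([UDP_PROTO]) + bytes(10)
    udp_hdr = struct.pack("!HHHH", 50000, dport, 8, 0)
    return eth + ip + udp_hdr


v1_frame = bytes(12) + struct.pack("!H", ROCE_V1_ETHERTYPE) + bytes(20)
print(classify(v1_frame))
print(classify(make_udp_frame(ROCE_V2_UDP_PORT)))
print(classify(make_udp_frame(53)))
```

Because RoCE v2 is ordinary UDP/IP on the wire, an IP router can forward it, whereas a v1 frame never leaves its Ethernet broadcast domain.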
  22. 22. Backup-2
  23. 23. Facebook – Lightning NVMe JBOF Architecture
  24. 24. Facebook Lightning Target • Hot-plug. We want the NVMe JBOF to behave like a SAS JBOD when drives are replaced. We don't want to follow the complicated procedure that traditional PCIe hot-plug requires. As a result, we need to be able to robustly support surprise hot-removal and surprise hot-add without causing operating system hangs or crashes. • Management. PCIe does not yet have an in-band enclosure and chassis management scheme like the SAS ecosystem does. While this is coming, we chose to address this using a more traditional BMC approach, which can be modified in the future as the ecosystem evolves. • Signal integrity. The decision to maintain the separation of a PEB from the PDPB as well as supporting multiple SSDs per “slot” results in some long PCIe connections through multiple connectors. Extensive simulations, layout optimizations, and the use of low-loss but still low-cost PCB materials should allow us to achieve the bit error rate requirements of PCIe without the use of redrivers/retimers or exotic PCB materials. • External PCIe cables. We chose to keep the compute head node separate from the storage chassis, as this gives us the flexibility to scale the compute-to-storage ratio as needed. It also allows us to use more powerful CPUs, larger memory footprints, and faster network connections, all of which will be needed to take full advantage of high-performance SSDs. As the existing PCIe cables are clunky and difficult to use, we chose to use mini-SAS HD cables (SFF-8644). This also aligns with upcoming external cabling standards. We designed the cables such that they include a full complement of PCIe side-band signals and a USB connection for an out-of-band management connection. • Power. Current 2.5" NVMe SSDs may consume up to 25W of power! This creates an unnecessary system constraint, and we have chosen to limit the power consumption per slot to 14W. This aligns much better with the switch oversubscription and our performance targets.
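
The effect of the per-slot power cap is simple arithmetic. The following Python sketch uses the 14W and 25W figures from the slide; the slot count of 30 is an assumption made only for illustration, and `chassis_ssd_power` is a hypothetical helper.

```python
SLOT_CAP_W = 14.0  # per-slot limit chosen on the slide
SSD_MAX_W = 25.0   # worst-case 2.5" NVMe SSD draw cited on the slide


def chassis_ssd_power(slots: int, per_slot_w: float) -> float:
    """Total SSD power budget for a chassis with the given number of slots."""
    return slots * per_slot_w


slots = 30  # assumed slot count, for illustration only
capped = chassis_ssd_power(slots, SLOT_CAP_W)
uncapped = chassis_ssd_power(slots, SSD_MAX_W)
saving = uncapped - capped

print(f"capped:   {capped:.0f} W")
print(f"uncapped: {uncapped:.0f} W")
print(f"saving:   {saving:.0f} W")
```

Under that assumed slot count, capping each slot at 14W rather than provisioning for 25W drives trims the SSD power budget by 330 W.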
  25. 25. NVMe JBOF Benefits • Manageability • Flexibility • Modularity • Performance
  26. 26. Manageability of BMC USB, I2C, Ethernet Out-of-band (OOB)
  27. 27. Flexibility in NVMe SSDs
  28. 28. Modularity in PCIe Switch A common switch board for both trays • Easily design new or different versions without modifying the rest of the infrastructure
  29. 29. Low IO-Watt Performance
  30. 30. Ultra-High I/O Performance 5X Throughput + 1200X IOPS
  31. 31. OCP All-Flash NVMe Storage • 2U 60/30 NVMe SSDs • Ultra-high IOPS and <10 µs latency • PCIe 3.0 + U.2 or M.2 NVMe SSD support • High density storage system with 60 SSDs (M.2)