WHY NVME OVER FABRICS IS IMPORTANT
Rob Davis, July 2020
Why NVMe-oF?
§It takes 24 HDDs to fill a 10GbE link, but only 2 SSDs; a single SAS SSD fills 10GbE on its own
§SSDs move the bottleneck from the disk to the network
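A rough sanity check of these ratios, as a sketch with assumed per-device throughputs (the slide's exact counts depend on the drive models measured):

```python
# Back-of-the-envelope: how many devices does it take to fill 10GbE?
# Per-device throughputs below are illustrative assumptions, not
# figures from the slide; exact counts depend on the drives measured.
import math

link_bytes_per_sec = 10e9 / 8          # 10GbE = 1.25 GB/s

assumed_throughput = {
    "HDD":     53e6,    # ~53 MB/s assumed (mixed workload)
    "SSD":     700e6,   # ~700 MB/s assumed
    "SAS SSD": 1.3e9,   # ~1.3 GB/s assumed (12Gb/s SAS)
}

for name, bytes_per_sec in assumed_throughput.items():
    n = math.ceil(link_bytes_per_sec / bytes_per_sec)
    print(f"{name}: {n} device(s) to fill 10GbE")
# With these assumptions: 24 HDDs, 2 SSDs, and 1 SAS SSD,
# matching the slide's picture.
```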
NVME TECHNOLOGY
§Optimized for flash and persistent memory (PM)
§ Traditional SCSI interfaces were designed for spinning disk
§ NVMe bypasses unneeded protocol layers
§NVMe flash outperforms SAS/SATA flash
§ 2.5x more bandwidth, 50% lower latency, 3x more IOPS
“NVME OVER FABRICS” WAS THE LOGICAL AND HISTORICAL NEXT STEP
§Sharing NVMe-based storage across multiple servers/CPUs was the next step
§ Better utilization: capacity, rack space, power
§ Scalability, management, fault isolation
§NVMe over Fabrics standard
§ 50+ contributors
§ Version 1.0 released in June 2016
§Pre-standard demos in 2014
Faster Network Products from Nvidia Solve Half the Network Bottleneck Problem…
§End-to-end 200Gb/s
Faster Protocols from Nvidia Solve the Other Half
[Figure: a faster network (SSD vs. HDD) paired with faster protocols: RDMA and NVMe-oF]
WHAT IS RDMA?
§Remote Direct Memory Access: an adapter-based transport in which the network adapter moves data directly between application memory on two servers, bypassing CPU copies and the OS network stack
RDMA NOW COMMON ACROSS ALL STORAGE TYPES
§ Block: FC, iSCSI, and NVMe-oF/TCP have RDMA counterparts in iSER and NVMe-oF/RDMA
§ File: SMB (CIFS) has SMB Direct; NFS has NFSoRDMA
§ Object: Ceph, Swift, and S3-style stores have Ceph over RDMA
§ Persistent Memory (3D XPoint): accessed remotely over RDMA
§ RDMA for optimal performance
§ InfiniBand & RoCE, for NVMe-oF and Nvidia GPUDirect Storage
NVME OVER FABRICS (NVME-OF) TRANSPORTS
§ The NVMe-oF standard is not fabric specific
§ Instead, there is a separate transport binding specification for each transport layer
§ RDMA (InfiniBand and RoCE) was 1st
§ Later, Fibre Channel
§ Most recent, TCP
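The binding split can be pictured with a minimal sketch (the class and method names here are invented for illustration; real bindings live in kernel drivers, not Python):

```python
# Hypothetical sketch of NVMe-oF's transport-binding split.
# The NVMe-oF layer speaks NVMe command capsules; each fabric gets
# its own binding. All names are invented for illustration only.
from abc import ABC, abstractmethod

class TransportBinding(ABC):
    """Per-fabric binding: how capsules move on the wire."""
    @abstractmethod
    def send_capsule(self, capsule: bytes) -> None: ...

class RdmaBinding(TransportBinding):      # 1st binding (InfiniBand/RoCE)
    def send_capsule(self, capsule):
        print("RDMA SEND,", len(capsule), "bytes")

class TcpBinding(TransportBinding):       # most recent binding
    def send_capsule(self, capsule):
        print("TCP PDU,", len(capsule), "bytes")

def submit_nvme_command(binding: TransportBinding, cmd: bytes) -> None:
    # The NVMe command itself is fabric-agnostic; only the
    # binding differs per transport.
    binding.send_capsule(cmd)

submit_nvme_command(RdmaBinding(), b"\x01" * 64)
submit_nvme_command(TcpBinding(), b"\x01" * 64)
```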
How Does NVMe-oF Maintain NVMe-Like Performance?
§ By extending NVMe efficiency over a fabric
§ NVMe commands and data structures are transferred end to end
§ Bypassing legacy stacks for performance
§ First products all used RDMA
§ Performance is impressive
[Figure: protocol stacks, a local SAS/SATA device vs. NVMe over Fabrics using an NVMe/RDMA transport (RoCE or IB) or an NVMe/TCP transport]
NVMe-oF Applications - Composable Infrastructure
§Also called Compute/Storage Disaggregation and Rack Scale
§NVMe over Fabrics enables Composable Infrastructure
§ Low latency
§ High bandwidth
§ Nearly local-disk performance
IMPORTANCE OF NETWORK LATENCY WHEN NETWORKING PERSISTENT MEMORY
§Network hops multiply latency for request/response access to remote PM
[Chart, logarithmic scale: latency vs. number of hops at 100Gb and 200Gb; Nvidia Ethernet switch & adapter and InfiniBand at ~900ns per hop, a common Ethernet switch & adapter at ~1.3µs]
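A worked example of how those per-hop figures compound (the hop count and PM media latency are assumptions made up for illustration):

```python
# How per-hop network latency compounds for remote PM access.
# Per-hop figures come from the chart above; the hop count and
# PM media latency are assumptions made up for illustration.

PM_ACCESS_NS = 300   # assumed persistent-memory media latency
HOPS = 3             # assumed switch hops between client and remote PM

for name, per_hop_ns in [("Nvidia/InfiniBand-class switch", 900),
                         ("Common Ethernet switch", 1300)]:
    network_ns = 2 * HOPS * per_hop_ns   # request + response cross each hop
    total_ns = PM_ACCESS_NS + network_ns
    print(f"{name}: network {network_ns} ns, total {total_ns} ns "
          f"({network_ns / total_ns:.0%} network)")
```

With fast media like PM, the network quickly dominates total latency, which is why per-hop switch and adapter latency matters so much here.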
HYPERCONVERGED AND SCALE-OUT STORAGE USE CASE
§Scale-out
§ Cluster of commodity servers
§ Software provides storage functions
§Hyperconverged collapses compute & storage
§ Integrated compute-storage nodes & software
§NVMe-oF performs like local/direct-attached SSD
[Figure: scale-out storage and compute nodes, and HCI nodes running VMs alongside a storage app, connected by a Mellanox switch; each x86 node holds NVMe SSDs]
BACKEND SCALE-OUT USE CASE
[Figure: frontend servers connected over a backend network to a scale-out pool of JBOFs]
Nvidia GPUDirect Storage
§Without GPUDirect Storage: on receive, data is first DMA'd from the network into system memory via the CPU/chipset (1) and then copied into GPU memory (2); transmit reverses the same two steps
§With GPUDirect Storage: RDMA moves data between the network and GPU memory directly in one step (1), for both receive and transmit, bypassing system memory and the CPU
§https://info.nvidia.com/gpudirect-storage-webinar-reg-page.html?ondemandrgt=yes
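A toy model of the two data paths (the bandwidth figures are illustrative assumptions, not GPUDirect Storage measurements):

```python
# Toy model: time to land a payload in GPU memory with and
# without a bounce buffer in system memory. Bandwidths are
# illustrative assumptions, not GPUDirect Storage measurements.

PAYLOAD_GB = 10
NIC_GBPS = 25.0    # assumed network-to-memory DMA bandwidth (GB/s)
COPY_GBPS = 25.0   # assumed bandwidth for the extra copy to the GPU (GB/s)

# Without GDS: network -> system memory, then system memory -> GPU.
t_without = PAYLOAD_GB / NIC_GBPS + PAYLOAD_GB / COPY_GBPS

# With GDS: network -> GPU memory directly (one DMA, no bounce copy).
t_with = PAYLOAD_GB / NIC_GBPS

print(f"without GDS: {t_without:.2f}s, with GDS: {t_with:.2f}s "
      f"({t_without / t_with:.1f}x)")
```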
NVME/TCP
§NVMe-oF commands are sent over standard TCP/IP sockets
§Each NVMe queue pair is mapped to a TCP connection
§Easy to support NVMe over TCP with no special NICs or network changes
§Good for distance, stranded servers, and out-of-band management connectivity
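A minimal sketch of the queue-pair-to-connection mapping (the framing below is invented for illustration; the real NVMe/TCP PDU format is defined by the TCP transport binding spec):

```python
# Minimal sketch: one TCP connection per NVMe queue pair.
# The length-prefixed framing is invented for illustration; real
# NVMe/TCP uses PDUs defined by the TCP transport binding spec.
import socket
import struct

class QueuePair:
    """Maps one NVMe submission/completion queue pair to one TCP socket."""
    def __init__(self, target_addr: str, port: int = 4420):
        # 4420 is the IANA-assigned NVMe-oF port.
        self.sock = socket.create_connection((target_addr, port))

    def submit(self, command: bytes) -> None:
        self.sock.sendall(struct.pack("!I", len(command)) + command)

# An initiator would typically open one queue pair (connection)
# per CPU core:
# qps = [QueuePair("192.0.2.10") for _ in range(num_cores)]
```

On Linux, connecting to a real NVMe/TCP target is typically a one-liner with nvme-cli, along the lines of `nvme connect -t tcp -a <target-ip> -s 4420 -n <subsystem-nqn>`.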
LATENCY: NVME-RDMA VS NVME-TCP
[Chart: latency CDFs ("fraction of IOs with this or less latency") for local SSD, RDMA, and TCP reads and writes; the curves separate most at the tail (tail latency)]
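Reading such a CDF amounts to computing latency percentiles; a sketch with made-up samples:

```python
# Sketch: tail-latency percentiles from IO latency samples.
# The sample data is made up for illustration.
import random

random.seed(0)
# Simulated latencies (µs): mostly fast, with occasional slow outliers.
samples = [random.gauss(20, 3) for _ in range(10_000)]
samples += [random.gauss(200, 50) for _ in range(100)]   # the "tail"

def percentile(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

for p in (50, 99, 99.9):
    print(f"p{p}: {percentile(samples, p):.1f} µs")
# The median looks fine; p99.9 is roughly 10x worse. That gap is
# where the RDMA and TCP curves in the chart above differ most.
```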
IMPORTANCE OF TAIL LATENCY IN TODAY'S DATACENTERS
§ Most datacenters today need to support interactive, real-time requests
§ An online search generates little traffic between the requestor and the datacenter, but producing the response generates a massive amount of traffic within the datacenter
§ Much of this traffic is tied to the core advertising business model of the datacenter's owner
§ The distributed software architecture that drives these businesses is very susceptible to tail latency
§ https://www.nextplatform.com/2018/03/27/in-modern-datacenters-the-latency-tail-wags-the-network-dog/
CONCLUSIONS
§NVMe-oF brings the value of networked storage to NVMe-based solutions
§NVMe-oF is supported across many network technologies
§The performance advantages of NVMe are not lost with NVMe-oF
§ Especially with RDMA
§There are many suppliers of NVMe-oF solutions across a variety of important data center use cases
THANK YOU
