Zoltan Arnold Nagy
IBM Research - Zurich
Disaggregating Ceph using NVMeoF
About me
2
 Technical Lead – Zurich Compute Cloud @ IBM Research – Zurich
– Involved in all aspects (compute, storage, networking…)
 OpenStack since 2011 – “cactus”
 Serving the local Zurich Research Lab’s research community – some data must remain
in Switzerland/EU and/or is too large to move off-site
 ~4.5k cores / ~90TB memory and growing
 10/25/100GbE
 Ceph + GPFS
 Ceph since 2014 – “firefly”
– Current cluster is 2.2PiB RAW
 Mostly HDD
 100TB NVMe that sparked this whole investigation
– Upgraded and growing since firefly!
About IBM Research - Zurich
3
 Established in 1956
 45+ different nationalities
 Open Collaboration:
– Horizon2020: 50+ funded projects and 500+ partners
 Two Nobel Prizes:
– 1986: Nobel Prize in Physics for the invention of the scanning
tunneling microscope by Heinrich Rohrer and Gerd K. Binnig
– 1987: Nobel Prize in Physics for the discovery of
high-temperature superconductivity by
K. Alex Müller and J. Georg Bednorz
 2017: European Physical Society Historic Site
 Binnig and Rohrer Nanotechnology Centre opened in
2011 (Public Private Partnership with ETH Zürich and EMPA)
 7 European Research Council Grants
Motivation #1
4
 Our existing storage nodes were great when we got them – years ago
 2xE5-2630v3 – 2x8 cores @ 2.4GHz
 2x10Gbit LACP, flat L2 network
 Wanted to add NVMe to our current nodes
– E5-2630v3 / 64GB RAM
5
6
7
1x Intel Optane 900P
8x Samsung 970 PRO
1TB
1x Mellanox ConnectX-4 (2x100GbE – PCIe v3 limits to ~128 Gbit/s)
Motivation
8
56 cores / 140 GHz total compute for 7x NVMe drives
Motivation
9
48 cores / 129.6 GHz total compute for 10 NVMe drives
Motivation
10
Conclusion on those configurations?
small block size IO: you run out of CPU
large block size IO: you run out of network
Quick math
11
 Resources per device (lots of assumptions: idle OS, RAM, NUMA, …)
– 32 threads / 8 NVMe = 4 threads / NVMe
– 100Gbit / 8 NVMe = 12.5Gbit/s
– 3x replication: n Gbit/s of frontend writes causes 2n Gbit/s of outgoing replication traffic
-> we can support at most 6.25 Gbit/s of writes per OSD as the theoretical maximum throughput!
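The per-OSD budget above can be written out as a quick back-of-the-envelope calculation (illustrative only; it assumes the NIC and cores are shared evenly across OSDs and ignores all other overhead):

    # Per-OSD resource budget on the 8x NVMe storage node (illustrative).
    threads = 32                    # 2x 8 cores with SMT
    nvme_drives = 8
    nic_gbit = 100

    threads_per_osd = threads // nvme_drives    # 4 threads per OSD
    share_gbit = nic_gbit / nvme_drives         # 12.5 Gbit/s per OSD

    # With 3x replication, n Gbit/s of frontend writes on the primary OSD
    # generates 2n Gbit/s of outgoing replication traffic.
    max_write_gbit = share_gbit / 2             # 6.25 Gbit/s per OSD
    print(threads_per_osd, share_gbit, max_write_gbit)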
12
Can we do better?
Don’t we have a bunch of compute nodes?
13
14
84 compute nodes per rack
(yes, you need cooling…)
Each node:
2xE5-2683v4
(2x16 cores @ 2.1GHz)
256GB RAM
Plan
15
[Diagram: a storage node hosting 8 OSDs behind a single 100Gbps link]
Plan
16
[Diagram: the same storage node (100Gbps) with its 8 OSDs moved out onto compute nodes – one OSD per compute node, each behind its own 50Gbps link]
How does the math improve for writes?
17
[Diagram: the client sends a write at n Gbit/s (TX n); the primary OSD receives it (RX n) and transmits 2n (TX 2n) to the two replica OSDs (RX n each)]
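The same back-of-the-envelope math, comparing the consolidated node with the disaggregated plan (a rough sketch under the same assumptions as the quick-math slide: one OSD per compute node, full-duplex links, and no other traffic on them):

    # Illustrative per-OSD write ceiling, before vs. after disaggregation.
    # Consolidated: 8 OSDs share one 100Gbit NIC, so TX 2n <= 100/8 per OSD.
    consolidated_max_write = (100 / 8) / 2       # 6.25 Gbit/s
    # Disaggregated: one OSD per compute node with its own 50Gbit link,
    # so the replication TX 2n only has to fit into 50 Gbit/s.
    disaggregated_max_write = 50 / 2             # 25 Gbit/s
    print(consolidated_max_write, disaggregated_max_write)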
18
We know the protocol (NVMe) – let’s talk fabric!
Fabric topology
19
[Diagram: leaf-spine fabric]
1…n spines (32x100GbE each)
Compute leafs: 32x25GbE + 8x100GbE, 32x compute nodes per leaf
Storage leafs: 32x100GbE, 6-10 storage nodes per leaf
20
6x Mellanox SN2100 switches per rack (16x100GbE each),
split into 8x4x25GbE + 8x100GbE
Each node has full bi-directional bandwidth to the spines!
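A quick sanity check of the “full bi-directional bandwidth” claim for a compute leaf, based on the port counts above:

    # Compute leaf ports (per the SN2100 split above): 8x100GbE broken out
    # into 32x25GbE towards compute nodes, plus 8x100GbE uplinks to the spines.
    downlink_gbit = 32 * 25    # 800 Gbit/s towards hosts
    uplink_gbit = 8 * 100      # 800 Gbit/s towards spines
    print(downlink_gbit / uplink_gbit)   # 1.0 -> non-blocking, no oversubscription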
Fabric latency (ib_write_lat)
21
Fabric bandwidth (ib_write_bw)
22
Ingredient 1: RoCEv2
23
 The “R” stands for RDMA – Remote Direct Memory Access
 “oCE” is over Converged Ethernet
– Tries to be “lossless”
– PFC (L2, e.g. NIC<>switch)
– ECN (L3)
 Applications can directly copy to each
other’s memory, skipping the kernel
 Some cards can do full NVMeoF offload
meaning 0% CPU use on the target
Ingredient 2: NVMeoF
24
 NVMe = storage protocol = how do I talk to my storage?
 “oF” = “over Fabrics” where ”a fabric” can be
– Fibre Channel
– RDMA over Converged Ethernet (RoCE)
– TCP
 Basically attaches a remote disk to your local system over some fabric, where it shows up as a
local NVMe device
– If target is native NVMe, pretty ideal
– NVMeoF vs iSCSI: the same comparison applies as to NVMe vs SATA/SAS/SCSI
 Linux kernel 5.0 introduced native NVMe-oF/TCP support
 SPDK supports both being a target and an initiator in userspace
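Since the list above mentions the kernel’s native NVMe-oF/TCP support, attaching over plain TCP looks roughly like this (a sketch using nvme-cli; the address, service id and NQN are placeholders):

    import subprocess

    # NVMe-oF over plain TCP (kernel 5.0+): load the initiator module, then connect.
    subprocess.run(["modprobe", "nvme-tcp"], check=True)
    subprocess.run(["nvme", "connect", "-t", "tcp",
                    "-a", "192.0.2.10", "-s", "4420",
                    "-n", "nqn.2019-06.example:nvme-target-1"], check=True)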
25
SQ = Submission Queue
CQ = Completion Queue
Netdev 0x13
26
Netdev 0x13
27
Netdev 0x13
28
Netdev 0x13
29
NVMeoF export
30
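For reference, a minimal sketch of what the export step looks like with the in-kernel nvmet target via configfs (the NQN, device path and IP below are placeholders; the nvmet and nvmet-rdma modules are assumed to be loaded and configfs mounted):

    import os

    # Export a local NVMe namespace over RDMA using the in-kernel target (nvmet)
    # by writing to its configfs tree. All names below are placeholders.
    def write(path, value):
        with open(path, "w") as f:
            f.write(value)

    nqn = "nqn.2019-06.example:nvme-target-1"
    subsys = f"/sys/kernel/config/nvmet/subsystems/{nqn}"
    port = "/sys/kernel/config/nvmet/ports/1"

    os.makedirs(f"{subsys}/namespaces/1")        # create subsystem + namespace
    write(f"{subsys}/attr_allow_any_host", "1")
    write(f"{subsys}/namespaces/1/device_path", "/dev/nvme0n1")
    write(f"{subsys}/namespaces/1/enable", "1")

    os.makedirs(port, exist_ok=True)             # create an RDMA-capable port
    write(f"{port}/addr_trtype", "rdma")
    write(f"{port}/addr_adrfam", "ipv4")
    write(f"{port}/addr_traddr", "192.0.2.10")
    write(f"{port}/addr_trsvcid", "4420")

    os.symlink(subsys, f"{port}/subsystems/{nqn}")   # expose the subsystem on the port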
NVMeoF/RDMA discovery
31
NVMeoF/RDMA connect
32
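On the initiator side, discovery and connect boil down to two nvme-cli calls (a sketch assuming nvme-cli and the nvme-rdma module are available; address and NQN are the same placeholders as in the export sketch above):

    import subprocess

    # Discover the NVMeoF subsystems offered at the target address, then connect
    # over RDMA; the namespace then appears as a local /dev/nvmeXnY device.
    subprocess.run(["modprobe", "nvme-rdma"], check=True)
    subprocess.run(["nvme", "discover", "-t", "rdma",
                    "-a", "192.0.2.10", "-s", "4420"], check=True)
    subprocess.run(["nvme", "connect", "-t", "rdma",
                    "-a", "192.0.2.10", "-s", "4420",
                    "-n", "nqn.2019-06.example:nvme-target-1"], check=True)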
Drawbacks – network complexity blows up
33
• Each interface needs an IP, can’t be full L3
• I’d prefer a /32 loopback address + unnumbered BGP
• currently the kernel cannot specify the source address for NVMeoF connections
• it is going to “stick” to one of the interfaces
• TCP connections between OSD nodes are going to be imbalanced
• the source address is going to be one of the NICs (hashed by destination in …)
Ceph measurements (WIP)
34
 Single client against 8x NVMe cluster
– 8 volumes:
randread: 210.29k IOPS (~26.29k IOPS/volume), stdev: 616.37
@ 5.62ms 99p / 8.4ms 99.95p
randwrite: ~48.44k IOPS (~6055 IOPS/volume), stdev: 94.47
@ 12.9ms 99p / 19.6ms 99.95p
 Single client against 8x NVMe cluster distributed according to the plan
– 8 volumes:
randread: 321.975k IOPS (~40.25k IOPS/volume), stdev: 2483
@ 1.254ms 99p / 2.38ms 99.95p
randwrite: 43.56k IOPS (~5445 IOPS/volume), stdev: 5752
@ 14.1ms 99p / 21.3ms 99.95p
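Put side by side, the relative change is straightforward arithmetic on the numbers above:

    # Relative change between the two configurations above (first -> second).
    randread_iops = 321.975 / 210.29 - 1      # ~ +53% IOPS
    randread_p99 = 1.254 / 5.62 - 1           # ~ -78% 99th-percentile latency
    randwrite_iops = 43.56 / 48.44 - 1        # ~ -10% IOPS
    print(f"{randread_iops:+.0%} {randread_p99:+.0%} {randwrite_iops:+.0%}")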
Can we still improve these numbers?
35
 Linux 5.1+ has a new IO interface, “io_uring”, as an alternative to the async IO calls
– short for userspace ring
– shared ring buffer between kernel and userspace
– The goal is to replace the async IO interface in the long run
– For more: https://lwn.net/Articles/776703/
 Bluestore has NVMEDevice support w/ SPDK
– Couldn’t get it to work with NVMeoF despite SPDK having full native support
Source: Netdev 0x13
36
Netdev 0x13
37
Future: targets may be replaced by ASICs?
38
External references:
39
 RHCS lab environment: https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/
 Micron’s reference architecture: https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9300_and_red_hat_ceph_reference_architecture.pdf
 Marvell ASIC: https://www.servethehome.com/marvell-25gbe-nvmeof-adapter-prefaces-a-super-cool-future/
 Netdev 0x13 SPDK RDMA vs TCP: https://www.youtube.com/watch?v=HLXxE5WWRf8&feature=youtu.be&t=643
 Zcopy: https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/
