Zoltan Arnold Nagy
IBM Research - Zurich
Disaggregating Ceph using NVMeoF
About me
2
 Technical Lead – Zurich Compute Cloud @ IBM Research – Zurich
– Involved in all aspects (compute, storage, networking…)
 OpenStack since 2011 – “cactus”
 Serving the local Zurich Research Lab’s research community – some data must remain
in Switzerland/EU and/or is too large to move off-site
 ~4.5k cores / ~90TB memory and growing
 10/25/100GbE
 Ceph + GPFS
 Ceph since 2014 – “firefly”
– Current cluster is 2.2PiB RAW
 Mostly HDD
 100TB NVMe that sparked this whole investigation
– Upgraded and growing since firefly!
About IBM Research - Zurich
3
 Established in 1956
 45+ different nationalities
 Open Collaboration:
– Horizon2020: 50+ funded projects and 500+ partners
 Two Nobel Prizes:
– 1986: Nobel Prize in Physics for the invention of the scanning
tunneling microscope by Heinrich Rohrer and Gerd K. Binnig
– 1987: Nobel Prize in Physics for the discovery of
high-temperature superconductivity by
K. Alex Müller and J. Georg Bednorz
 2017: European Physical Society Historic Site
 Binnig and Rohrer Nanotechnology Centre opened in
2011 (Public Private Partnership with ETH Zürich and EMPA)
 7 European Research Council Grants
Motivation #1
4
 Our existing storage nodes were great when we got them – years ago
 2xE5-2630v3 – 2x8 cores @ 2.4GHz
 2x10Gbit LACP, flat L2 network
 Wanted to add NVMe to our current nodes
– E5-2630v3 / 64GB RAM
5
6
7
1x Intel Optane 900P
8x Samsung 970 PRO
1TB
1x Mellanox ConnectX-4 (2x100GbE – PCIe v3 limits to ~128 Gbit/s)
Motivation
8
56 cores / 140 GHz total compute for 7x NVMe drives
Motivation
9
48 cores / 129.6 GHz total compute for 10 NVMe drives
Motivation
10
Conclusion on those configurations?
small block size IO: you run out of CPU
large block size IO: you run out of network
Quick math
11
 Resources per device (lots of assumptions: idle OS, RAM, NUMA, …)
– 32 threads / 8 NVMe = 4 threads / NVMe
– 100Gbit / 8 NVMe = 12.5Gbit/s
– 3x replication: n Gbit/s of frontend writes causes 2n Gbit/s of outgoing replication traffic
-> we can support at most 6.25 Gbit/s of writes per OSD as the theoretical maximum throughput!
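The per-OSD budget above can be written out as a quick back-of-the-envelope calculation (illustrative only; it assumes the NIC and cores are shared evenly across OSDs and ignores all other overhead):

    # Per-OSD resource budget on the 8x NVMe storage node (illustrative).
    threads = 32                    # 2x 8 cores with SMT
    nvme_drives = 8
    nic_gbit = 100

    threads_per_osd = threads // nvme_drives    # 4 threads per OSD
    share_gbit = nic_gbit / nvme_drives         # 12.5 Gbit/s per OSD

    # With 3x replication, n Gbit/s of frontend writes on the primary OSD
    # generates 2n Gbit/s of outgoing replication traffic.
    max_write_gbit = share_gbit / 2             # 6.25 Gbit/s per OSD
    print(threads_per_osd, share_gbit, max_write_gbit)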
12
Can we do better?
Don’t we have a bunch of compute nodes?
13
14
84 compute nodes per rack
(yes, you need cooling…)
Each node:
2xE5-2683v4
(2x16 cores @ 2.1GHz)
256GB RAM
Plan
15
[Diagram: a storage node hosting 8 OSDs behind a single 100Gbps link]
Plan
16
[Diagram: the same storage node (100Gbps) with its 8 OSDs moved out onto compute nodes – one OSD per compute node, each behind its own 50Gbps link]
How does the math improve for writes?
17
[Diagram: the client sends a write at n Gbit/s (TX n); the primary OSD receives it (RX n) and transmits 2n (TX 2n) to the two replica OSDs (RX n each)]
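The same back-of-the-envelope math, comparing the consolidated node with the disaggregated plan (a rough sketch under the same assumptions as the quick-math slide: one OSD per compute node, full-duplex links, and no other traffic on them):

    # Illustrative per-OSD write ceiling, before vs. after disaggregation.
    # Consolidated: 8 OSDs share one 100Gbit NIC, so TX 2n <= 100/8 per OSD.
    consolidated_max_write = (100 / 8) / 2       # 6.25 Gbit/s
    # Disaggregated: one OSD per compute node with its own 50Gbit link,
    # so the replication TX 2n only has to fit into 50 Gbit/s.
    disaggregated_max_write = 50 / 2             # 25 Gbit/s
    print(consolidated_max_write, disaggregated_max_write)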
18
We know the protocol (NVMe) – let’s talk fabric!
Fabric topology
19
[Diagram: leaf-spine fabric]
1…n spines (32x100GbE each)
Compute leafs: 32x25GbE + 8x100GbE, 32x compute nodes per leaf
Storage leafs: 32x100GbE, 6-10 storage nodes per leaf
20
6x Mellanox SN2100 switches per rack (16x100GbE each),
split into 8x4x25GbE + 8x100GbE
Each node has full bi-directional bandwidth to the spines!
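A quick sanity check of the “full bi-directional bandwidth” claim for a compute leaf, based on the port counts above:

    # Compute leaf ports (per the SN2100 split above): 8x100GbE broken out
    # into 32x25GbE towards compute nodes, plus 8x100GbE uplinks to the spines.
    downlink_gbit = 32 * 25    # 800 Gbit/s towards hosts
    uplink_gbit = 8 * 100      # 800 Gbit/s towards spines
    print(downlink_gbit / uplink_gbit)   # 1.0 -> non-blocking, no oversubscription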
Fabric latency (ib_write_lat)
21
Fabric bandwidth (ib_write_bw)
22
Ingredient 1: RoCEv2
23
 The “R” stands for RDMA – Remote Direct Memory Access
 “oCE” is over Converged Ethernet
– Tries to be “lossless”
– PFC (L2, e.g. NIC<>switch)
– ECN (L3)
 Applications can directly copy to each
other’s memory, skipping the kernel
 Some cards can do full NVMeoF offload
meaning 0% CPU use on the target
Ingredient 2: NVMeoF
24
 NVMe = storage protocol = how do I talk to my storage?
 “oF” = “over Fabrics” where ”a fabric” can be
– Fibre Channel
– RDMA over Converged Ethernet (RoCE)
– TCP
 Basically attaches a remote disk to your local system over some fabric, where it shows up as a
local NVMe device
– If target is native NVMe, pretty ideal
– NVMeoF vs iSCSI: the same comparison applies as to NVMe vs SATA/SAS/SCSI
 Linux kernel 5.0 introduced native NVMe-oF/TCP support
 SPDK supports both being a target and an initiator in userspace
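Since the list above mentions the kernel’s native NVMe-oF/TCP support, attaching over plain TCP looks roughly like this (a sketch using nvme-cli; the address, service id and NQN are placeholders):

    import subprocess

    # NVMe-oF over plain TCP (kernel 5.0+): load the initiator module, then connect.
    subprocess.run(["modprobe", "nvme-tcp"], check=True)
    subprocess.run(["nvme", "connect", "-t", "tcp",
                    "-a", "192.0.2.10", "-s", "4420",
                    "-n", "nqn.2019-06.example:nvme-target-1"], check=True)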
25
SQ = Submission Queue
CQ = Completion Queue
Netdev 0x13
26
Netdev 0x13
27
Netdev 0x13
28
Netdev 0x13
29
NVMeoF export
30
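For reference, a minimal sketch of what the export step looks like with the in-kernel nvmet target via configfs (the NQN, device path and IP below are placeholders; the nvmet and nvmet-rdma modules are assumed to be loaded and configfs mounted):

    import os

    # Export a local NVMe namespace over RDMA using the in-kernel target (nvmet)
    # by writing to its configfs tree. All names below are placeholders.
    def write(path, value):
        with open(path, "w") as f:
            f.write(value)

    nqn = "nqn.2019-06.example:nvme-target-1"
    subsys = f"/sys/kernel/config/nvmet/subsystems/{nqn}"
    port = "/sys/kernel/config/nvmet/ports/1"

    os.makedirs(f"{subsys}/namespaces/1")        # create subsystem + namespace
    write(f"{subsys}/attr_allow_any_host", "1")
    write(f"{subsys}/namespaces/1/device_path", "/dev/nvme0n1")
    write(f"{subsys}/namespaces/1/enable", "1")

    os.makedirs(port, exist_ok=True)             # create an RDMA-capable port
    write(f"{port}/addr_trtype", "rdma")
    write(f"{port}/addr_adrfam", "ipv4")
    write(f"{port}/addr_traddr", "192.0.2.10")
    write(f"{port}/addr_trsvcid", "4420")

    os.symlink(subsys, f"{port}/subsystems/{nqn}")   # expose the subsystem on the port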
NVMeoF/RDMA discovery
31
NVMeoF/RDMA connect
32
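On the initiator side, discovery and connect boil down to two nvme-cli calls (a sketch assuming nvme-cli and the nvme-rdma module are available; address and NQN are the same placeholders as in the export sketch above):

    import subprocess

    # Discover the NVMeoF subsystems offered at the target address, then connect
    # over RDMA; the namespace then appears as a local /dev/nvmeXnY device.
    subprocess.run(["modprobe", "nvme-rdma"], check=True)
    subprocess.run(["nvme", "discover", "-t", "rdma",
                    "-a", "192.0.2.10", "-s", "4420"], check=True)
    subprocess.run(["nvme", "connect", "-t", "rdma",
                    "-a", "192.0.2.10", "-s", "4420",
                    "-n", "nqn.2019-06.example:nvme-target-1"], check=True)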
Drawbacks – network complexity blows up
33
• Each interface needs an IP, can’t be full L3
• I’d prefer a /32 loopback address + unnumbered BGP
• currently the kernel cannot specify the source address for NVMeoF connections
• it is going to “stick” to one of the interfaces
• TCP connections between OSD nodes are going to be imbalanced
• the source address is going to be one of the NICs (hashed by destination in …)
Ceph measurements (WIP)
34
 Single client against 8x NVMe cluster
– 8 volumes:
randread: 210.29k IOPS (~26.29k IOPS/volume), stdev: 616.37
@ 5.62ms 99p / 8.4ms 99.95p
randwrite: ~48.44k IOPS (~6055 IOPS/volume), stdev: 94.47
@ 12.9ms 99p / 19.6ms 99.95p
 Single client against 8x NVMe cluster distributed according to the plan
– 8 volumes:
randread: 321.975k IOPS (~40.25k IOPS/volume), stdev: 2483
@ 1.254ms 99p / 2.38ms 99.95p
randwrite: 43.56k IOPS (~5445 IOPS/volume), stdev: 5752
@ 14.1ms 99p / 21.3ms 99.95p
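Put side by side, the relative change is straightforward arithmetic on the numbers above:

    # Relative change between the two configurations above (first -> second).
    randread_iops = 321.975 / 210.29 - 1      # ~ +53% IOPS
    randread_p99 = 1.254 / 5.62 - 1           # ~ -78% 99th-percentile latency
    randwrite_iops = 43.56 / 48.44 - 1        # ~ -10% IOPS
    print(f"{randread_iops:+.0%} {randread_p99:+.0%} {randwrite_iops:+.0%}")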
Can we still improve these numbers?
35
 Linux 5.1+ has a new IO interface, “io_uring”, as an alternative to the async IO calls
– short for userspace ring
– shared ring buffer between kernel and userspace
– The goal is to replace the async IO interface in the long run
– For more: https://lwn.net/Articles/776703/
 Bluestore has NVMEDevice support w/ SPDK
– Couldn’t get it to work with NVMeoF despite SPDK having full native support
Source: Netdev 0x13
36
Netdev 0x13
37
Future: targets may be replaced by ASICs?
38
External references:
39
 RHCS lab environment: https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/
 Micron’s reference architecture: https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9300_and_red_hat_ceph_reference_architecture.pdf
 Marvell ASIC: https://www.servethehome.com/marvell-25gbe-nvmeof-adapter-prefaces-a-super-cool-future/
 Netdev 0x13 SPDK RDMA vs TCP: https://www.youtube.com/watch?v=HLXxE5WWRf8&feature=youtu.be&t=643
 Zcopy: https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/
