
Disaggregating Ceph using NVMeoF


  1. Zoltan Arnold Nagy, IBM Research - Zurich: Disaggregating Ceph using NVMeoF
  2. About me
     § Technical Lead – Zurich Compute Cloud @ IBM Research – Zurich
       – Involved in all aspects (compute, storage, networking…)
     § OpenStack since 2011 – “cactus”
     § Serving the local Zurich Research Lab’s research community
       – some data must remain in Switzerland/EU and/or is too large to move off-site
     § ~4.5k cores / ~90TB memory and growing
     § 10/25/100GbE
     § Ceph + GPFS
     § Ceph since 2014 – “firefly”
       – Current cluster is 2.2PiB RAW
         § Mostly HDD
         § 100TB NVMe that sparked this whole investigation
       – Upgraded and growing since firefly!
  3. About IBM Research - Zurich
     § Established in 1956
     § 45+ different nationalities
     § Open Collaboration:
       – Horizon2020: 50+ funded projects and 500+ partners
     § Two Nobel Prizes:
       – 1986: Nobel Prize in Physics for the invention of the scanning tunneling microscope by Heinrich Rohrer and Gerd K. Binnig
       – 1987: Nobel Prize in Physics for the discovery of high-temperature superconductivity by K. Alex Müller and J. Georg Bednorz
     § 2017: European Physical Society Historic Site
     § Binnig and Rohrer Nanotechnology Centre opened in 2011 (Public Private Partnership with ETH Zürich and EMPA)
     § 7 European Research Council Grants
  4. Motivation #1
     § Our storage nodes were great when we got them – years ago
     § 2x E5-2630v3 – 2x8 cores @ 2.4GHz
     § 2x10Gbit LACP, flat L2 network
     § Wanted to add NVMe to our current nodes – E5-2630v3 / 64GB RAM
  7. 1x Intel Optane 900P, 8x Samsung 970 PRO 1TB, 1x Mellanox ConnectX-4 (2x100GbE – PCIe v3 limits it to ~128Gbit/s)
  8. Motivation: 56 cores / 140 GHz of total compute for 7x NVMe drives
  9. Motivation: 48 cores / 129.6 GHz of total compute for 10x NVMe drives
  10. Motivation – conclusion on those configurations: with small block size IO you run out of CPU; with large block size IO you run out of network.
  11. Quick math (see the sketch below)
      § Resources per device (lots of assumptions: idle OS, RAM, NUMA, …)
        – 32 threads / 8 NVMe = 4 threads per NVMe
        – 100Gbit / 8 NVMe = 12.5Gbit/s per NVMe
        – 3x replication: n Gbit/s of writes on the frontend causes 2n Gbit/s of outgoing bandwidth -> we can support 6.25Gbit/s of writes per OSD as the theoretical maximum throughput!
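A minimal sketch of that arithmetic, using only the figures quoted above (32 threads, 8 NVMe drives, a 100Gbit NIC, 3x replication); everything else is simplified away:

```python
# Back-of-the-envelope resource split for one all-in-one storage node,
# using only the numbers from the slide above (idle OS, RAM and NUMA ignored).

threads_total = 32      # 2x E5-2630v3: 2x8 cores, 2 threads per core
nvme_drives = 8
nic_gbit = 100          # usable NIC bandwidth in Gbit/s
replication = 3         # Ceph pool size = 3

threads_per_nvme = threads_total / nvme_drives   # 4 threads per NVMe
gbit_per_nvme = nic_gbit / nvme_drives           # 12.5 Gbit/s per NVMe

# n Gbit/s of frontend writes cause 2n Gbit/s of outgoing replication
# traffic, so the per-OSD write ceiling is the NIC share divided by 2.
max_write_per_osd = gbit_per_nvme / (replication - 1)   # 6.25 Gbit/s

print(threads_per_nvme, gbit_per_nvme, max_write_per_osd)
```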
  12. Can we do better? Don’t we have a bunch of compute nodes?
  14. 84 compute nodes per rack (yes, you need cooling…) – each node: 2x E5-2683v4 (2x16 cores @ 2.1GHz), 256GB RAM
  15. Plan (diagram): today, 8 OSDs run on the storage node itself behind a single 100Gbps link
  16. Plan (diagram): the storage node (100Gbps) exports its NVMe devices; the 8 OSDs each run on a separate compute node with a 50Gbps link
  17. How does the math improve for writes? (diagram): the client transmits n; the primary OSD receives n and transmits 2n for replication; each replica OSD receives n and transmits n toward its backing device – with the OSDs spread across separate nodes, each node only carries its own share of this traffic
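To put rough numbers on that fan-out, here is a sketch of the per-OSD write ceiling in both layouts, under the assumption (mine, not stated on the slide) that in the disaggregated case an OSD's own copy also travels over the fabric to the remote NVMe namespace, so a primary transmits 2n of replication plus n of NVMeoF write traffic:

```python
# Rough per-OSD write ceilings for the two layouts, using the link speeds
# from the plan slides. Assumption (not from the slides): in the
# disaggregated layout an OSD's own copy also crosses the fabric to the
# remote NVMe namespace, so a primary OSD transmits 2n + n = 3n in total.

def max_client_write_gbit(tx_budget_gbit, tx_per_unit_write):
    """Largest client write rate n such that the TX traffic fits the link."""
    return tx_budget_gbit / tx_per_unit_write

# Before: 8 OSDs share one 100 Gbit link; an OSD's share is 12.5 Gbit/s and
# replication alone costs 2n of TX (the data lands on local NVMe for free).
print(max_client_write_gbit(100 / 8, 2))   # 6.25 Gbit/s per OSD, as on slide 11

# After: every OSD has a whole 50 Gbit compute-node link to itself.
print(max_client_write_gbit(50, 3))        # ~16.7 Gbit/s per OSD under this assumption
```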
  18. We know the protocol (NVMe) – let’s talk fabric!
  19. Fabric topology (diagram): 1…n spines (32x100GbE each); compute leafs with 32x25GbE down to 32 compute nodes and 8x100GbE up; storage leafs with 32x100GbE serving 6-10 storage nodes
  20. 6x Mellanox SN2100 switches per rack (16x100GbE each), split into 8x 4x25GbE + 8x 100GbE – each node has full bi-directional bandwidth to the spines! (see the check below)
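A quick check on that claim, using the port counts from this slide (assuming each 100GbE port is either split toward nodes or used as a spine uplink):

```python
# Per-leaf bandwidth check for the SN2100 split above: of the 16x100GbE
# ports, 8 are split into 4x25GbE node-facing links and 8 stay as 100GbE
# uplinks to the spines.

downlink_gbit = 8 * 4 * 25   # 800 Gbit/s toward the compute nodes
uplink_gbit = 8 * 100        # 800 Gbit/s toward the spines

print(downlink_gbit / uplink_gbit)   # 1.0 -> non-blocking: every node keeps full bandwidth
```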
  21. Fabric latency (ib_write_lat)
  22. Fabric bandwidth (ib_write_bw)
  23. Ingredient 1: RoCEv2
      § The “R” stands for RDMA, which stands for “remote DMA” (remote direct memory access)
      § “oCE” is “over Converged Ethernet”
        – Tries to be “lossless”
        – PFC (L2, for example NIC<>switch)
        – ECN (L3)
      § Applications can copy directly into each other’s memory, skipping the kernel
      § Some cards can do full NVMeoF offload, meaning 0% CPU use on the target
  24. Ingredient 2: NVMeoF
      § NVMe = storage protocol = how do I talk to my storage?
      § “oF” = “over Fabrics”, where “a fabric” can be
        – Fibre Channel
        – RDMA over Converged Ethernet (RoCE)
        – TCP
      § Basically: attach a remote disk over some fabric to your local system, where it pretends to be a local NVMe device (see the sketch below)
        – If the target is native NVMe, pretty ideal
        – NVMeoF vs iSCSI: the same comparison applies as NVMe vs SATA/SAS/SCSI
      § Linux kernel 5.0 introduced native NVMe-oF/TCP support
      § SPDK supports being both a target and an initiator in userspace
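Once connected, a fabric-attached namespace appears as a regular /dev/nvmeXnY block device. A minimal sketch, assuming the standard Linux sysfs layout where each NVMe controller exposes a transport attribute, showing that local and fabric-attached controllers look the same to the host:

```python
#!/usr/bin/env python3
# List NVMe controllers and whether they are local PCIe devices or
# fabric-attached (rdma/tcp/fc). Assumes the standard sysfs layout;
# attribute names can vary slightly between kernel versions.
from pathlib import Path

for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
    transport = (ctrl / "transport").read_text().strip()  # "pcie", "rdma", "tcp", ...
    model = (ctrl / "model").read_text().strip()
    kind = "local" if transport == "pcie" else "fabric-attached"
    print(f"{ctrl.name}: {model} ({transport}, {kind})")
```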
  25. SQ = Submission Queue, CQ = Completion Queue
  26. Netdev 0x13
  27. Netdev 0x13
  28. Netdev 0x13
  29. Netdev 0x13
  30. NVMeF export
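This slide presumably showed the target-side setup; below is a rough sketch of what exporting a single namespace over RDMA with the in-kernel nvmet target looks like through configfs. It assumes the nvmet and nvmet-rdma modules are loaded, and the NQN, IP address and device path are placeholders, not values from the talk:

```python
#!/usr/bin/env python3
# Sketch of an in-kernel NVMe-oF target export via configfs (nvmet).
# Assumes nvmet + nvmet-rdma are loaded; NQN, address and device path
# are illustrative placeholders.
from pathlib import Path

CFG = Path("/sys/kernel/config/nvmet")
nqn = "nqn.2019-09.example:nvme0"   # hypothetical NQN

# 1. Create the subsystem and allow any host to connect (lab setting).
subsys = CFG / "subsystems" / nqn
subsys.mkdir(parents=True)
(subsys / "attr_allow_any_host").write_text("1")

# 2. Add namespace 1, backed by a local NVMe drive.
ns = subsys / "namespaces" / "1"
ns.mkdir(parents=True)
(ns / "device_path").write_text("/dev/nvme0n1")
(ns / "enable").write_text("1")

# 3. Create an RDMA port on the 100GbE interface and link the subsystem to it.
port = CFG / "ports" / "1"
port.mkdir(parents=True)
(port / "addr_trtype").write_text("rdma")
(port / "addr_adrfam").write_text("ipv4")
(port / "addr_traddr").write_text("192.0.2.10")   # storage node IP (placeholder)
(port / "addr_trsvcid").write_text("4420")
(port / "subsystems" / nqn).symlink_to(subsys)
```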
  31. NVMeF/RDMA discovery
  32. NVMeF/RDMA connect
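On the initiator side (slides 31-32 presumably showed the actual command output), discovery and connect are one nvme-cli call each. A sketch, assuming nvme-cli is installed and the nvme-rdma module is loaded; the address and NQN are the same placeholders as in the export sketch above:

```python
#!/usr/bin/env python3
# Sketch of the initiator side with nvme-cli. The address and NQN are
# placeholders matching the export sketch above.
import subprocess

target_addr = "192.0.2.10"
nqn = "nqn.2019-09.example:nvme0"

# Discover which subsystems the target advertises on its RDMA port.
subprocess.run(["nvme", "discover", "-t", "rdma", "-a", target_addr, "-s", "4420"],
               check=True)

# Connect; the namespace then appears as a local /dev/nvmeXnY block device
# that an OSD can consume like any other NVMe drive.
subprocess.run(["nvme", "connect", "-t", "rdma", "-n", nqn, "-a", target_addr,
                "-s", "4420"], check=True)
```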
  33. Drawbacks – network complexity blows up
      • Each interface needs an IP; the setup can’t be fully L3
        • I’d prefer a /32 loopback address + unnumbered BGP
        • currently the kernel cannot specify the source address for NVMeoF connections, so it is going to “stick” to one of the interfaces
      • TCP connections between OSD nodes are going to be imbalanced
        • the source address is going to be one of the NICs (hashed by destination info)
  34. Ceph measurements (WIP) – random IO against 8 volumes (a hypothetical benchmark sketch follows below)
      § Single client against an 8x NVMe cluster, 8 volumes
        – randread: 210.29k IOPS (~26.29k IOPS/volume), stdev 616.37, @ 5.62ms 99p / 8.4ms 99.95p
        – randwrite: ~48440 IOPS (~6055 IOPS/volume), stdev 94.46625, @ 12.9ms 99p / 19.6ms 99.95p
      § Single client against an 8x NVMe cluster distributed according to the plan, 8 volumes
        – randread: 321.975k IOPS (~40.25k IOPS/volume), stdev 2483, @ 1.254ms 99p / 2.38ms 99.95p
        – randwrite: 43.56k IOPS (~5445 IOPS/volume), stdev 5752, @ 14.1ms 99p / 21.3ms 99.95p
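The slides do not say which tool produced these numbers; a common way to collect comparable IOPS and tail-latency figures is fio's rbd engine. A purely hypothetical sketch (pool name, image name, block size, queue depth and runtime are all placeholders, not the settings behind the numbers above):

```python
#!/usr/bin/env python3
# Hypothetical benchmark driver: runs fio's rbd engine for randread and
# randwrite against one RBD image. All parameters are placeholders.
import subprocess

for workload in ("randread", "randwrite"):
    subprocess.run([
        "fio",
        "--name=" + workload,
        "--ioengine=rbd", "--clientname=admin",
        "--pool=rbd-bench", "--rbdname=vol0",       # placeholder pool / image
        "--rw=" + workload, "--bs=4k",
        "--iodepth=32", "--runtime=300", "--time_based",
        "--percentile_list=99:99.95",               # report 99p / 99.95p latency
        "--group_reporting",
    ], check=True)
```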
  35. Can we still improve these numbers?
      § Linux 5.1+ has a new IO interface, io_uring, instead of the old async IO calls
        – “uring” is short for userspace ring
        – shared ring buffer between kernel and userspace
        – the goal is to replace the async IO interface in the long run
        – for more: https://lwn.net/Articles/776703/
      § BlueStore has NVMEDevice support w/ SPDK
        – Couldn’t get it to work with NVMeoF despite SPDK having full native support
  36. Source: Netdev 0x13
  37. Netdev 0x13
  38. Future: targets may be replaced by ASICs?
  39. External references
      § RHCS lab environment: https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/
      § Micron’s reference architecture: https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9300_and_red_hat_ceph_reference_architecture.pdf
      § Marvell ASIC: https://www.servethehome.com/marvell-25gbe-nvmeof-adapter-prefaces-a-super-cool-future/
      § Netdev 0x13 SPDK RDMA vs TCP: https://www.youtube.com/watch?v=HLXxE5WWRf8&feature=youtu.be&t=643
      § Zcopy: https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/
