Zoltan Arnold Nagy
IBM Research - Zurich
Disaggregating Ceph using
NVMeoF
About me
2
 Technical Lead – Zurich Compute Cloud @ IBM Research – Zurich
– Involved in all aspects (compute, storage, networking…)
 OpenStack since 2011 – “cactus”
 Serving the local Zurich Research Lab’s research community – some data must remain
in Switzerland/EU and/or is too large to move off-site
 ~4.5k cores / ~90TB memory and growing
 10/25/100GbE
 Ceph + GPFS
 Ceph since 2014 – “firefly”
– Current cluster is 2.2PiB RAW
 Mostly HDD
 100TB NVMe that sparked this whole investigation
– Upgraded and growing since firefly!
About IBM Research - Zurich
3
 Established in 1956
 45+ different nationalities
 Open Collaboration:
– Horizon2020: 50+ funded projects and 500+ partners
 Two Nobel Prizes:
– 1986: Nobel Prize in Physics for the invention of the scanning
tunneling microscope by Heinrich Rohrer and Gerd K. Binnig
– 1987: Nobel Prize in Physics for the discovery of
high-temperature superconductivity by
K. Alex Müller and J. Georg Bednorz
 2017: European Physical Society Historic Site
 Binnig and Rohrer Nanotechnology Centre opened in
2011 (Public Private Partnership with ETH Zürich and EMPA)
 7 European Research Council Grants
Motivation #1
4
 Our existing storage nodes were great when we got them – years ago
 2xE5-2630v3 – 2x8 cores @ 2.4GHz
 2x10Gbit LACP, flat L2 network
 Wanted to add NVMe to our current nodes
– E5-2630v3 / 64GB RAM
5
6
7
1x Intel Optane 900P
8x Samsung 970 PRO 1TB
1x Mellanox ConnectX-4 (2x100GbE – PCIe v3 limits to ~128Gbit/s)
Motivation
8
56 cores / 140 GHz total compute for 7x NVMe drives
Motivation
9
48 cores / 129.6 GHz total compute for 10 NVMe drives
Motivation
10
Conclusion on those configurations?
small block size IO: you run out of CPU
large block size IO: you run out of network
Quick math
11
 Resources per device (lots of assumptions: idle OS, RAM, NUMA, …)
– 32 threads / 8 NVMe = 4 threads / NVMe
– 100Gbit / 8 NVMe = 12.5Gbit/s
– 3x replication: n Gbit/s of client writes on the frontend
causes 2n Gbit/s of outgoing replication bandwidth
-> 2n ≤ 12.5 Gbit/s, so we can support 6.25Gbit/s of writes per OSD as the
theoretical maximum throughput! (see the sketch below)
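To make the arithmetic explicit, here is a minimal sketch reproducing the per-device budget above; the numbers come from the slides and the same caveats apply (idle OS, RAM and NUMA ignored).

```python
# Per-device budget for a 100GbE storage node with 8 NVMe drives (slide numbers).

NIC_GBIT = 100        # usable network bandwidth of the storage node
THREADS = 32          # 2x E5-2630v3: 16 cores / 32 hyper-threads
NVME_COUNT = 8
REPLICATION = 3

threads_per_nvme = THREADS / NVME_COUNT        # 4 threads per NVMe
nic_share_per_nvme = NIC_GBIT / NVME_COUNT     # 12.5 Gbit/s per NVMe

# n Gbit/s of client writes on the frontend cause (REPLICATION - 1) * n Gbit/s
# of outgoing replication traffic, so the outgoing NIC share is the bound:
# 2n <= 12.5 Gbit/s  ->  n <= 6.25 Gbit/s per OSD.
max_write_per_osd = nic_share_per_nvme / (REPLICATION - 1)

print(f"{threads_per_nvme:.0f} threads/NVMe, "
      f"{nic_share_per_nvme:.1f} Gbit/s NIC share/NVMe, "
      f"max ~{max_write_per_osd:.2f} Gbit/s client writes per OSD")
```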
12
Can we do better?
Don’t we have a bunch of compute nodes?
13
14
84 compute nodes per rack
(yes, you need cooling…)
Each node:
2xE5-2683v4
(2x16 cores @ 2.1GHz)
256GB RAM
Plan
15
[Diagram: baseline – all 8 OSDs colocated on the storage node, behind its single 100Gbps link]
Plan
16
[Diagram: disaggregated – the storage node keeps the NVMe drives behind its 100Gbps link, while the 8 OSDs each run on a separate compute node with a 50Gbps link]
How does the math improve for writes?
17
[Diagram: write path with 3x replication – the client transmits n; the primary OSD node shows RX n / TX 2n (forwarding to the replicas), while the replica OSD nodes each show RX n / TX n]
18
We know the protocol (NVMe) – let’s talk fabric!
Fabric topology
19
32x compute nodes
leaf leaf
1…n spines (32x100GbE)
leaf
32x25GbE, 8x100GbE
on compute leafs
6-10 storage nodes
32x100GbE
on storage leafs
20
6x Mellanox SN2100 switches per rack
(16x100GbE each),
split into 8x4x25GbE + 8x100GbE
Each node has full bi-directional
bandwidth to the spines!
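As a quick cross-check of the non-blocking claim above, a small sketch of the compute-leaf oversubscription implied by the port split (numbers from the slides):

```python
# Host-facing vs spine-facing bandwidth on a compute leaf.

host_facing_gbit = 32 * 25    # 32x25GbE towards compute nodes = 800 Gbit/s
spine_facing_gbit = 8 * 100   # 8x100GbE towards the spines    = 800 Gbit/s

ratio = host_facing_gbit / spine_facing_gbit
print(f"compute leaf oversubscription: {ratio:.1f}:1")   # 1.0:1 -> non-blocking
```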
Fabric latency (ib_write_lat)
21
Fabric bandwidth (ib_write_bw)
22
Ingredient 1: RoCEv2
23
 R stands for RDMA, which stands for “Remote Direct Memory Access”
 “oCE” is “over Converged Ethernet”
– Tries to be “lossless”
– PFC (L2, e.g. between NIC and switch)
– ECN (L3)
 Applications can directly copy to each
other’s memory, skipping the kernel
 Some cards can do full NVMeoF offload
meaning 0% CPU use on the target
Ingredient 2: NVMeoF
24
 NVMe = storage protocol = how do I talk to my storage?
 “oF” = “over Fabrics”, where “a fabric” can be
– Fibre Channel
– RDMA over Converged Ethernet (RoCE)
– TCP
 Basically attaches a remote disk over some fabric to your local system, where it appears as a
local NVMe device
– If target is native NVMe, pretty ideal
– NVMeoF vs iSCSI: the same comparison applies as to NVMe vs SATA/SAS/SCSI
 Linux kernel 5.0 introduced native NVMe-oF/TCP support
 SPDK supports both being a target and an initiator in userspace
25
SQ = Submission Queue
CQ = Completion Queue
Netdev 0x13
26
Netdev 0x13
27
Netdev 0x13
28
Netdev 0x13
29
NVMeF export
30
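The export commands from this slide are not reproduced here; as an illustration, a minimal sketch of what exporting one local NVMe namespace with the in-kernel nvmet target over RDMA could look like via configfs (the same interface nvmetcli drives). The NQN, device path and address are illustrative placeholders, not values from the talk.

```python
# Export a local NVMe namespace over NVMeoF/RDMA using the kernel nvmet target.
# Requires the nvmet and nvmet-rdma modules to be loaded.
import os
from pathlib import Path

CFG = Path("/sys/kernel/config/nvmet")
NQN = "nqn.2019-06.io.example:nvme1"   # placeholder subsystem NQN
DEV = "/dev/nvme1n1"                   # local NVMe namespace to export
ADDR, PORT = "192.0.2.10", "4420"      # RDMA-capable interface, default NVMeoF port

subsys = CFG / "subsystems" / NQN
subsys.mkdir()                                    # kernel instantiates the subsystem
(subsys / "attr_allow_any_host").write_text("1\n")

ns = subsys / "namespaces" / "1"                  # namespace ID 1
ns.mkdir()
(ns / "device_path").write_text(DEV + "\n")
(ns / "enable").write_text("1\n")

port = CFG / "ports" / "1"
port.mkdir()
(port / "addr_trtype").write_text("rdma\n")
(port / "addr_adrfam").write_text("ipv4\n")
(port / "addr_traddr").write_text(ADDR + "\n")
(port / "addr_trsvcid").write_text(PORT + "\n")

# Publish the subsystem on the port by linking it in.
os.symlink(subsys, port / "subsystems" / NQN)
```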
NVMeF/RDMA discovery
31
NVMeF/RDMA connect
32
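A matching initiator-side sketch for the discovery and connect steps, wrapping nvme-cli (with the nvme-rdma module loaded); the address and NQN are the same illustrative placeholders as in the export sketch.

```python
# Discover and connect to the NVMeoF/RDMA target from a compute node.
import subprocess

ADDR, PORT = "192.0.2.10", "4420"
NQN = "nqn.2019-06.io.example:nvme1"

# Ask the target's discovery controller which subsystems it exposes.
subprocess.run(["nvme", "discover", "-t", "rdma", "-a", ADDR, "-s", PORT], check=True)

# Connect: the remote namespace appears as a local /dev/nvmeXnY block device,
# which the OSD on this compute node can then use like a local disk.
subprocess.run(["nvme", "connect", "-t", "rdma", "-n", NQN, "-a", ADDR, "-s", PORT], check=True)

subprocess.run(["nvme", "list"], check=True)      # verify the new device showed up
```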
Drawbacks – network complexity blows up
33
• Each interface needs an IP, can’t be full L3
• I’d prefer a /32 loopback address + unnumbered BGP
• currently the kernel cannot specify the source address for NVMeoF connections
• it is going to “stick” to one of the interfaces
• TCP connections between OSD nodes are going to be imbalanced
• source address is going to be one of the NICs (hashed by destination in
Ceph measurements (WIP)
34
 Single client against 8xNVMe cluster
– 8 volumes:
randread: 210.29k IOPS (~26.29k IOPS/volume), stddev: 616.37
@ 5.62ms 99p / 8.4ms 99.95p
randwrite: ~48440 IOPS (~6055 IOPS/volume), stddev: 94.46625
@ 12.9ms 99p / 19.6ms 99.95p
 Single client against 8xNVMe cluster distributed according to plans
– 8 volumes:
randread: 321.975k IOPS (~40.25k IOPS/volume), stddev: 2483
@ 1.254ms 99p / 2.38ms 99.95p
randwrite: 43.56k IOPS (~5445 IOPS/volume), stddev: 5752
@ 14.1ms 99p / 21.3ms 99.95p
Can we still improve these numbers?
35
 Linux 5.1+ has a new asynchronous IO interface called io_uring
– short for “userspace ring”
– shared ring buffer between kernel and userspace
– the goal is to replace the existing async IO (AIO) interface in the long run
– For more: https://lwn.net/Articles/776703/
 BlueStore has NVMEDevice support w/ SPDK
– Couldn’t get it to work with NVMeoF despite SPDK having full native support
Source: Netdev 0x13
36
Netdev 0x13
37
Future: targets may be replaced by ASICs?
38
External references:
39
 RHCS lab environment: https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/
 Micron’s reference architecture: https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9300_and_red_hat_ceph_reference_architecture.pdf
 Marvell ASIC: https://www.servethehome.com/marvell-25gbe-nvmeof-adapter-prefaces-a-super-cool-future/
 Netdev 0x13 SPDK RDMA vs TCP: https://www.youtube.com/watch?v=HLXxE5WWRf8&feature=youtu.be&t=643
 Zcopy: https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/