http://intrbiz.com | chris@intrbiz.com
Routed Fabrics For Ceph
Chris Ellis - @intrbiz
Fast & Effective Networking For Ceph
Ceph Day London 2019
Hello!
● I’m Chris
○ IT jack of all trades
● Mostly a PostgreSQL Consultant
○ Full stack:
■ from electronic design to web dev
● Very much into Open Source
○ Started a monitoring system project a few years ago
○ Big openSUSE and PostgreSQL fan
● Been using and playing with Ceph for a couple of years
○ Built a small VM farm with Ceph for shared storage
Routed Fabrics, Huh?
● Essentially we make servers participate in routing
○ Every network link on the server is utilised active/active
○ Every server takes part in the routing protocol
○ Routing protocol deals with device and link failures
■ Data just takes another path in the event of a fault
● Equal Cost Multi Path (ECMP) is used to efficiently move traffic
○ IP packets are routed over all available links
○ TCP streams don’t get split across more than one path
■ A single stream is still limited to the bandwidth of one link
○ I.e. with 4x 10GbE NICs we can push 40Gb/s of traffic in aggregate
■ An individual TCP stream maxes at 10Gb/s
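OSPF will install these multipath routes for us later on, but to see the shape of one, here's a hand-rolled equivalent in iproute2 syntax (a sketch; addresses are illustrative):
$> ip route add 172.28.0.0/24 \
     nexthop via 172.31.1.1 dev eth4 weight 1 \
     nexthop via 172.31.2.1 dev eth5 weight 1
The kernel hashes each flow onto one nexthop, which is why a single TCP stream never exceeds one link's bandwidth.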
The Build
● My setup is about as small as you can go
● It's my R&D setup
● It's only two switches
● But it's about showing that these approaches work even at small scale
○ All traffic is still routed
○ We still get all benefits of a Routed Fabric
○ We can use cheap commodity switching
○ You don't need super high end kit to get efficiency and speed
● Granted, it's not a real Clos topology; you need a bigger problem domain for that
● This is about thinking about different ways of doing things
What You’ll Need
Connecting Things
A Cunning Plan - Network Assignments
● Switch 1: 172.31.1.0/24
○ Port 1: 172.31.1.0/30
○ Port 2: 172.31.1.4/30
○ …
○ Port 24: 172.31.1.92/30
● Inter-switch: 172.31.3.0/24
○ Link 1: 172.31.3.0/30
○ Link 2: 172.31.3.4/30
○ …
○ Link 8: 172.31.3.28/30
● Switch 2: 172.31.2.0/24
○ Port 1: 172.31.2.0/30
○ Port 2: 172.31.2.4/30
○ …
○ Port 24: 172.31.2.92/30
● Ceph: 172.28.0.0/24
○ Node 1: 172.28.0.1/32
○ Node 2: 172.28.0.2/32
○ …
○ Node 12: 172.28.0.12/32
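The pattern is plain arithmetic: port N's /30 starts at (N - 1) × 4. A throwaway shell check (illustrative only):
$> for n in 1 2 24; do echo "Port $n: 172.31.1.$(( (n - 1) * 4 ))/30"; done
Port 1: 172.31.1.0/30
Port 2: 172.31.1.4/30
Port 24: 172.31.1.92/30
Within each /30 the switch takes the first host address (.1, .5, ...) and the server takes the second (.2, .6, ...).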
Configuring Your Switches - Turn On Routing
ip routing
router ospf
router-id 172.26.1.210
network 172.31.1.0 255.255.255.0 area 0.0.0.0
network 172.31.3.0 255.255.255.0 area 0.0.0.0
redistribute connected
redistribute static
exit
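With routing enabled on both switches, OSPF adjacency over the inter-switch links can be checked from the switch CLI (exact commands vary by vendor; this assumes a CLI in the same family as the config above):
show ip ospf neighbor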
Configuring Your Switches - Server Interface (Port 1)
interface 0/1
mtu 9018
routing
ip address 172.31.1.1 255.255.255.252
ip ospf area 0.0.0.0
exit
Configuring Your Switches - Server Interface (Port 2)
interface 0/2
mtu 9018
routing
ip address 172.31.1.5 255.255.255.252
ip ospf area 0.0.0.0
exit
Configuring Your Switches - Server Interface (Port 24)
interface 0/24
mtu 9018
routing
ip address 172.31.1.93 255.255.255.252
ip ospf area 0.0.0.0
exit
Configuring Your Switches - Inter-Switch Interface
interface 0/28
mtu 9018
routing
ip address 172.31.3.1 255.255.255.252
ip ospf area 0.0.0.0
exit
Configuring Your Ceph Server - Interfaces
$> cat ifcfg-eth4
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='172.31.1.2/30'
MTU='9000'
$> cat ifcfg-eth6
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='172.31.1.6/30'
MTU='9000'
$> cat ifcfg-eth5
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='172.31.2.2/30'
MTU='9000'
$> cat ifcfg-eth7
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='172.31.2.6/30'
MTU='9000'
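With the files in place, the links can be brought up and checked (assuming openSUSE's wicked, per the ifcfg files above):
$> wicked ifup all
$> ip -br addr show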
Configuring Your Ceph Server - Dummy Interface
$> cat ifcfg-dummy0
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='172.28.0.1/32'
MTU='9000'
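How the dummy device gets created varies by distro; a hand-rolled iproute2 equivalent (for illustration) is:
$> modprobe dummy
$> ip link add dummy0 type dummy
$> ip addr add 172.28.0.1/32 dev dummy0
$> ip link set dummy0 up
A dummy interface gives Ceph a stable, always-up address that OSPF can advertise as a /32, independent of any physical link.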
Configuring Your Ceph Server - Quagga & OSPFd
$> zypper in quagga ospfd
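Both daemons then need enabling and starting; the unit names below assume the Quagga packaging splits them into zebra plus ospfd:
$> systemctl enable --now zebra ospfd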
Configuring Your Ceph Server - Quagga
$> cat zebra.conf
hostname ceph1
!
interface eth4
ip address 172.31.1.2/30
interface eth5
ip address 172.31.2.2/30
interface eth6
ip address 172.31.1.6/30
interface eth7
ip address 172.31.2.6/30
!
Configuring Your Ceph Server - OSPFd
$> cat ospfd.conf
hostname ceph1
!
interface eth4
interface eth5
interface eth6
interface eth7
router ospf
ospf router-id 172.26.1.1
network 172.28.0.1/32 area 0
network 172.31.1.2/30 area 0
network 172.31.1.6/30 area 0
network 172.31.2.2/30 area 0
network 172.31.2.6/30 area 0
!
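Once ospfd is running, adjacencies and the routes it has learned can be inspected through Quagga's vtysh:
$> vtysh -c 'show ip ospf neighbor'
$> vtysh -c 'show ip route ospf'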
Configuring Your Ceph Server - Kernel
$> cat sysctl.conf
# Enable IP routing
net.ipv4.ip_forward = 1
# Tweak ECMP policy: hash flows on the L4 5-tuple so traffic spreads across paths
net.ipv4.fib_multipath_hash_policy = 1
# Only use next hops with a usable neighbour entry
net.ipv4.fib_multipath_use_neigh = 1
# Disable reverse path filtering globally (paths are asymmetric by design)
net.ipv4.conf.all.rp_filter = 0
# Re-enable reverse path filtering on the management NIC (bond1 here)
net.ipv4.conf.bond1.rp_filter = 1
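Apply without a reboot:
$> sysctl -p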
Configuring Your Ceph Server - Ceph
$> cat ceph.conf
[global]
public_network = 172.28.0.0/24
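With public_network set to the dummy-interface range, the mons and OSDs should bind to the node's 172.28.0.x address; a quick sanity check:
$> ss -tlnp | grep ceph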
Et Voilà
$> ip route
172.28.0.2 proto zebra metric 20
	nexthop via 172.31.1.1 dev eth4 weight 1
	nexthop via 172.31.1.5 dev eth6 weight 1
	nexthop via 172.31.2.1 dev eth5 weight 1
	nexthop via 172.31.2.5 dev eth7 weight 1
172.28.0.3 proto zebra metric 20
	nexthop via 172.31.1.1 dev eth4 weight 1
	nexthop via 172.31.1.5 dev eth6 weight 1
	nexthop via 172.31.2.1 dev eth5 weight 1
	nexthop via 172.31.2.5 dev eth7 weight 1
...
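To check which path a particular flow will hash onto, ip route get accepts the flow tuple on newer iproute2 releases (port numbers illustrative):
$> ip route get 172.28.0.2
$> ip route get 172.28.0.2 ipproto tcp sport 40000 dport 6800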
Caveats
● Make sure that MTUs are configured correctly and match
○ OSPF runs directly over IP (protocol 89); with mismatched MTUs adjacencies get stuck and larger packets are dropped
● Label your cables
○ Swapping cables around will break things
● Quagga will only set a default route if no default route is already defined
○ OSPFd needs: `default-information originate metric-type 1`
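In context, that lives in the OSPF router block of whichever box originates the default route; a sketch (assuming the box already has a static or learned default to advertise):
router ospf
 default-information originate metric-type 1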
Further Reading
● Intro to Clos networks
○ https://en.wikipedia.org/wiki/Clos_network
● Google white paper on their Clos topologies
○ https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43837.pdf
● Cumulus on Clos and ECMP:
○ https://cumulusnetworks.com/blog/celebrating-ecmp-part-one/
● Benefits of ditching layer 2
○ https://thenewstack.io/ditch-pitfalls-layer-2-networks-modern-data-center-design/

Routed Fabrics For Ceph