
Excitingly simple multi-path OpenStack networking: LAG-less, L2-less, yet fully redundant

Yuki Nishiwaki / Samir Ibradzic (LINE Corporation)
OpenStack Summit Vancouver, May 2018
https://www.openstack.org/summit/vancouver-2018/summit-schedule/global-search?t=Yuki%20Nishiwaki

  1. Excitingly simple multi-path OpenStack networking: LAG-less, L2-less, yet fully redundant
     Yuki Nishiwaki (LINE Corp), Samir Ibradžić (LINE Corp)
  2. Agenda
     1. Motivation for Clos Network in the Datacenter
     2. Achieving Redundancy between ToR & Hypervisors
     3. L3 Routing vs VM Network Overlay
     4. Solution & Implementation
     5. Summary
     6. Q&A
  3. 1. Motivation for Clos Network in the Datacenter
  4. Overview: Traditional Datacenter Network Link Redundancy Method
     [diagram: three-tier fabric of Core, Aggregation and ToR switches with servers below; link redundancy via OSPF ECMP, MC-LAG and active-backup bonding]
     ➢ Three switching layers: Core, Aggregation, ToR
     ➢ VLANs terminated at the Aggregation switches
     ➢ L2 redundancy between Aggregation and ToR
     ➢ Only traffic between servers under the same ToR is guaranteed to reach wire rate
     ➢ Traffic passing over a ToR can drop to 42% of the total wire capacity, due to the narrow uplinks
     ➢ Worst case, between Aggregation switches: 8% of the total wire capacity
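     The 42% and 8% figures are oversubscription arithmetic: a layer can only carry as much east-west traffic as its uplinks allow. A minimal sketch in Python, with hypothetical port counts (the deck does not give the exact ones):

        # Hypothetical oversubscription arithmetic; the port counts are
        # invented for illustration, not LINE's actual topology.
        def cross_layer_capacity(uplink_gbps: float, downlink_gbps: float) -> float:
            """Fraction of server-facing bandwidth that can cross this layer."""
            return uplink_gbps / downlink_gbps

        # e.g. a ToR with 40 x 10G server-facing ports but only 4 x 40G uplinks:
        ratio = cross_layer_capacity(uplink_gbps=4 * 40, downlink_gbps=40 * 10)
        print(f"traffic over the ToR is capped at {ratio:.0%} of wire capacity")  # 40%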
  5. Overview: VM scheduling considerations
     [diagram: VM placement across ToR and Aggregation zones; the ToR uplink is the bottleneck]
     ➢ Force scheduling of VMs under the same ToR if high East-West throughput is needed
     ➢ Segregate noisy-neighbour VMs by careful scheduling across ToR and Aggregation zones
     ➢ Nova network replacement hacks
  6. Pain Points of our initial cloud network architecture
     ● Hard to scale:
       ○ Ever-increasing network traffic between VMs
       ○ Difficult to deploy zones and aggregate switch groups
     ● Scheduling, a workaround for the lacking network architecture:
       ○ Only a workaround, no other benefits, just a PITA
       ○ Scheduling to the same rack = same failure domain
     ● Implementation difficulties: ugly hacks, how to upgrade?
  7. Network Node Scalability Issues (on big L2 overlays)
     ● Normally, the Network Node provides a single metadata/DHCP service for each L2 network
     ● What if 10,000 VMs reboot at the same time?
     ● What if a noisy VM keeps 1,000 TCP sessions to the metadata proxy?
     ● What if a rogue VM sends 1,000s of invalid DHCP requests?
     ● What if the metadata proxy or dnsmasq suddenly dies?
     Huge failure domain: L2 separation is inevitable
  8. New Approach: Horizontally Scalable DC Network Architecture
     [diagram: Clos fabric with a non-blocking Core cluster, Aggregation and ToR layers, EBGP ECMP at both levels]
     ➢ Non-blocking fabric bandwidth: Σ ToR uplinks > Σ ToR downlinks
     ➢ Use BGP ECMP to provide redundancy, bandwidth and balancing
     ➢ Move from OSPF to BGP on the level between Aggregation and Core switches
     ➢ Enough bandwidth capacity to ensure no traffic between different ToRs or Aggregation switches is affected
     Similar to the RFC 7938 proposal
  9. Clos Network Architecture solved our fabric issues
     ● Bandwidth bottleneck between ToRs: gone
       ○ Much more uplink bandwidth
       ○ Use BGP for traffic load balancing (even more bandwidth)
     ● No more scheduling workarounds needed
     ● Very suitable for all our VM workloads
       ○ We don't really need L2. No, really.
  10. 2. Achieving Redundancy between ToR & Hypervisors
  11. ToR - Aggregation aka Clos BGP network benefits
      [diagram: Core - Aggregation - ToR - Hypervisor/VM, with EBGP ECMP (L3) at each level]
      Scalability: it scales at Internet level, so it scales in the DC
      Reduced complexity: no need to operate expensive non-blocking redundancy protocols (L3 ECMP, OSPF…); get rid of L2 redundancy solutions that depend heavily on vendor implementation quirks (MC-LAG, TRILL…)
      We chose L3-only BGP & ECMP because it is a proven choice among other large-scale DC network operators: very scalable, not vendor-locked, relying on fewer features while still allowing the necessary customizations... It's a common choice!
  12. But what about (an OpenStack) Hypervisor? MC-LAG (L2) or BGP (L3)?
      ● MC-LAG: L2-based redundancy
        ○ Depends on vendor implementation
        ○ Difficult traffic engineering
        ○ Cost of complexity in managing multiple protocols
      ● BGP ECMP: L3-based redundancy
        ○ Standardizes the redundancy method on BGP across the datacenter (simplest design)
  13. Conclusion: Use BGP everywhere...
      ● No vendor lock-in
      ● BGP redundancy all over the place
        ○ Simplifies datacenter network design
      ● Easy traffic engineering
        ○ No-downtime ToR maintenance
  14. 3. L3 Routing vs VM Network Overlay
  15. Routing vs VM Network Overlay: Overlay or Routing?
      ● No more L2 connectivity between hypervisors; we can't just use V(x)LANs
      ● How do we achieve the (minimal but critical) L2 capabilities over L3:
        ○ VM provisioning (DHCP)?
        ○ VM-to-VM connectivity (ARP, L2)?
  16. Routing vs VM Network Overlay
      [diagram: overlay tunnels between hypervisors vs a plain routed underlay]
      Overlay:
      ● Design complexities
        ○ A mediation point between underlay and overlay is needed
      ● Encapsulation performance overhead
      ● Needs a Network Node (mostly)
      Underlay (routing):
      ● No performance overhead
      ● No Network Node
        ○ Requires distributed ARP / DHCP / metadata
  17. Conclusion: Use Routing for all VM networks
      ● No overhead
      ● No Network Node(s)
      ● No VM L2, even at the single-hypervisor level (full L2 isolation)
      ● Minimal failure domain for network services
        ○ Single-hypervisor scope
        ○ DHCP, metadata, routing
      ● Portable IP addresses without any overlays
  18. L2-less & fully-redundant VM networks in OpenStack
      [diagram: VMs reach their hypervisor via static routing (L3); hypervisors reach ToR, Aggregation and Core via EBGP ECMP (L3)]
      A custom Neutron plugin takes care that the VM joins the actual BGP-routed network.
  19. 4. Solution & Implementation
  20. “L2 Isolate” Neutron plugin (custom plugin/agent)
      ● VMs don't share an L2 network
      ● Ensures that VMs have L3 reachability
  21. Solution Overview
      [diagram: two hypervisors, each attached to the ToR over a /31 point-to-point link (172.16.24.0/31 and 172.16.24.2/31), with VMs from 10.0.0.0/24 spread across both]
      Hypervisor 1 routing table (with proxy ARP on the taps):
        default via 172.16.24.0
        10.0.0.2/32 via tapXXX
        10.0.0.3/32 via tapXXX
      ToR routing table:
        10.0.0.2/32 via 172.16.24.1
        10.0.0.3/32 via 172.16.24.1
        10.0.0.4/32 via 172.16.24.2
        10.0.0.5/32 via 172.16.24.2
        10.0.0.254/32 via 172.16.24.1
      As VMs come and go, the hypervisor advertises their /32 routes (e.g. 10.0.0.2 via 172.16.24.1) and withdraws stale ones (e.g. 10.0.0.254).
  22. Solution Overview (components)
      [same diagram, annotated with the components involved]
      New pieces: an L2 Isolate Agent on the hypervisor (in place of the Linuxbridge Agent), plus a new mechanism driver and type driver in the Neutron Server.
  23. Deployment scope changes
      Before (ML2 linuxbridge mechanism + VLAN type):
      ● Compute Nodes: nova-compute, linuxbridge-agent
      ● Controller Node: neutron-server, nova-XXXX
      ● Network Node: dhcp-agent and metadata-agent, with dynamically provisioned dnsmasq and ns-metadata-proxy instances
      After (ML2 l2isolate mechanism + routed type, both new):
      ● Compute Nodes: nova-compute, the new l2isolate-agent, metadata-agent, and a pre-provisioned dnsmasq and ns-metadata-proxy - spawned much like in the linuxbridge setup, but just one process per node
      ● Controller Node: neutron-server, nova-XXXX
      ● Network Node: no longer needed
  24. “L2 Isolate” plugin for an L3-only datacenter
      [diagram: Neutron server side - service_plugins (L3, LBaaS, FWaaS) and the ML2 core_plugin with type drivers (Flat, VLAN) and mechanism drivers (Linuxbridge, OVS, Haproxy, Octavia); agent side - the ML2-related L2 agents (Linuxbridge Agent, OVS Agent) plus L3 Agent, DHCP Agent, Metadata Agent and LBaaS Agent]
  25. “L2 Isolate” plugin for an L3-only datacenter
      ● Type Driver: define a new type of network that represents end devices connected via L3
      ● Mechanism Driver: implement the logic to realize that type of network
      ● L2 Agent: implement an agent working with the new mechanism driver
        ○ The existing agents cannot be reused, as they expect the network entity to be an L2 network
  26. Type Driver - Routed
      [diagram: a Private Network and a Public Network, each modelled as a set of end devices that are mutually L3-reachable rather than sharing an L2 segment]
  27. Mechanism Driver - L2 Isolate
      ● Supports the Routed network type
      ● Implementation similar to linuxbridge
      ● Uses the TAP VIF_TYPE, not Bridge
      ● Checks for IP overlaps across all subnets of all routed-type networks when a subnet is created
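      A minimal sketch of the port-binding side of such a mechanism driver, assuming the stock neutron_lib ML2 driver API; the class name, the 'routed' type string and the empty vif_details are illustrative, not LINE's actual code:

        # Illustrative sketch only: bind ports on 'routed' networks with a
        # bare-tap VIF type instead of a bridge.
        from neutron_lib.plugins.ml2 import api

        ROUTED_TYPE = 'routed'  # network type provided by the custom type driver
        VIF_TYPE_TAP = 'tap'    # ask Nova to plug a plain tap device, no bridge

        class L2IsolateMechanismDriver(api.MechanismDriver):
            def initialize(self):
                pass

            def bind_port(self, context):
                for segment in context.segments_to_bind:
                    # Only handle 'routed' segments; anything else is left
                    # to the other mechanism drivers.
                    if segment[api.NETWORK_TYPE] == ROUTED_TYPE:
                        context.set_binding(segment[api.ID], VIF_TYPE_TAP,
                                            vif_details={})
                        return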
  28. Why the TAP VIF_TYPE?
      [diagram: VM taps attached to a shared bridge vs bare taps, one per VM]
      ● L2 isolation inside the hypervisor?
        ○ Some say a shared/external network cannot be trusted
      ● A bridge?
        ○ No L2 isolation on the hypervisor
      ● A dedicated bridge for each port?
        ○ Don't create unnecessary resources
  29. Agent - L2 Isolate Agent
      ● Monitors tap devices on the compute node (like linuxbridge)
      ● Configures:
        ○ Linux proxy ARP and static routes
        ○ the routing software, to advertise and withdraw VM IPs
        ○ DHCP for the VMs on the same hypervisor
        ○ the metadata proxy for the VMs on the same hypervisor
      ● Supports the iptables-based security group implementation for taps
  30. A tour of what the “L2 Isolate” agent really does
  31. 1. Connectivity to other VMs inside the hypervisor
      ● Configure proxy ARP
      ● Inject the VM's IP/32 route via its tap
      ● VIF_TYPE: TAP, so no prefix route option is required
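      Roughly, the agent needs the equivalent of the following for every port; a sketch under stated assumptions (pyroute2 for netlink, sysfs for proxy ARP; the device and address names are made up):

        # Step 1 sketch: proxy ARP plus a /32 host route out of the VM's tap.
        from pyroute2 import IPRoute

        def plug_vm_port(tap='tapXXX', vm_ip='10.0.0.2'):
            # Answer the VM's ARP queries (e.g. for its gateway) on this tap.
            with open(f'/proc/sys/net/ipv4/conf/{tap}/proxy_arp', 'w') as f:
                f.write('1')
            # Host route: traffic for the VM's address leaves via its tap.
            with IPRoute() as ipr:
                idx = ipr.link_lookup(ifname=tap)[0]
                ipr.route('add', dst=f'{vm_ip}/32', oif=idx)

      The shell equivalent is sysctl -w net.ipv4.conf.tapXXX.proxy_arp=1 plus ip route add 10.0.0.2/32 dev tapXXX; since the TAP VIF type means no bridge, this is all the intra-hypervisor plumbing a port needs.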
  32. 2. Connectivity to the VM from outside the hypervisor
      ● Routing software: FRR, configured to watch the routing table
      ● It advertises and withdraws the VM's IP/32 according to the changes
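      The deck does not show the FRR configuration itself; one way to get the described watch-the-routing-table behaviour is kernel-route redistribution, so that the /32 added or removed in step 1 is advertised or withdrawn automatically. A hedged sketch (the AS number and route-map name are invented):

        # Illustrative only: have FRR redistribute kernel routes into BGP,
        # so the agent's route add/del becomes a BGP advertise/withdraw.
        import subprocess

        def enable_kernel_redistribution(local_as=65001):
            commands = [
                'configure terminal',
                f'router bgp {local_as}',
                'address-family ipv4 unicast',
                'redistribute kernel route-map VM-HOST-ROUTES',  # match only the VM /32s
                'exit-address-family',
            ]
            args = ['vtysh']
            for command in commands:
                args += ['-c', command]
            subprocess.run(args, check=True)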
  33. 3. DHCP
      ● Use one dnsmasq for all VMs on the same hypervisor
      ● Per-tap host entries (MAC:IP, tagged set:<subnet-id>) are added/updated/removed in dhcp-hosts.d; per-subnet options (tag:<subnet-id>,option:router, tag:<subnet-id>,option:netmask, ...) live in dhcp-opts.d
      ● dnsmasq watches and reloads both directories (the --dhcp-XXXdir options)
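      A sketch of the file plumbing this implies; the directory paths, helper names and the comma-separated dhcp-host syntax are assumptions, not the actual agent code:

        # One file per tap in dhcp-hosts.d, one per subnet in dhcp-opts.d;
        # dnsmasq picks up new and changed files in the watched directories.
        import os

        HOSTS_DIR = '/var/lib/l2isolate/dhcp-hosts.d'  # --dhcp-hostsdir
        OPTS_DIR = '/var/lib/l2isolate/dhcp-opts.d'    # --dhcp-optsdir

        def write_host_entry(tap, mac, ip, subnet_id):
            # dhcp-host line: MAC,IP, tagged with the port's subnet.
            with open(os.path.join(HOSTS_DIR, tap), 'w') as f:
                f.write(f'{mac},{ip},set:{subnet_id}\n')

        def write_subnet_opts(subnet_id, router, netmask):
            # Options applied only to clients tagged with set:<subnet-id>.
            with open(os.path.join(OPTS_DIR, subnet_id), 'w') as f:
                f.write(f'tag:{subnet_id},option:router,{router}\n')
                f.write(f'tag:{subnet_id},option:netmask,{netmask}\n')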
  34. 4. Metadata Proxy
      ● One proxy per hypervisor, spawned with a fake router id and listening on TCP 169.254.169.254:80
      ● The VM's curl http://169.254.169.254 hits the proxy, which forwards the request over a unix socket with the X-Neutron-Router-Id: fake HTTP header
      ● Because the fake router id matches nothing, neutron-server looks up the port by just the VM's IP; the request then reaches nova-api with the X-Instance-Id and X-Tenant-Id headers set
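      What spawning that single host-wide proxy could look like; the flag names follow the classic neutron-ns-metadata-proxy CLI and should be treated as assumptions:

        # Illustrative sketch: one proxy per hypervisor, with a dummy router
        # id so neutron-server falls back to matching the port by VM IP.
        import subprocess

        def spawn_metadata_proxy(socket_path='/var/lib/neutron/metadata_proxy'):
            return subprocess.Popen([
                'neutron-ns-metadata-proxy',
                '--router_id=fake',
                f'--metadata_proxy_socket={socket_path}',  # unix socket to metadata-agent
                '--metadata_port=80',                      # serves 169.254.169.254:80
            ])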
  35. 5. Security Group Support
      The default iptables firewall driver expects tap devices to be connected via a bridge:
        Chain neutron-<agent>-FORWARD
          neutron-<agent>-sg-chain  all -- any any anywhere anywhere  PHYSDEV match --physdev-out tap97125c-21 --physdev-is-bridged
        Chain neutron-<agent>-INPUT
          neutron-<agent>-o97125c-2  all -- any any anywhere anywhere  PHYSDEV match --physdev-in tap97125c-21 --physdev-is-bridged
        Chain neutron-<agent>-sg-chain
          neutron-<agent>-i9715c-2  all -- any any anywhere anywhere  PHYSDEV match --physdev-out tap97125c-21 --physdev-is-bridged
  36. 5. Security Group Support (cont.)
      The tapbase_iptables_firewall driver just specifies the tap device as the in/out interface instead of the physdev match:
        Chain neutron-<agent>-FORWARD
          neutron-<agent>-sg-chain  all -- any tap2d89f44a-6b anywhere anywhere
        Chain neutron-<agent>-INPUT
          neutron-<agent>-o2d89f44a-6  all -- tap2d89f44a-6b any anywhere anywhere
        Chain neutron-<agent>-sg-chain
          neutron-<agent>-o2d89f44a-6  all -- tap2d89f44a-6b any anywhere anywhere
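      The rewrite boils down to the jump rules naming the tap directly; a sketch (the chain naming mirrors the slide, the helper itself is made up):

        # Instead of the physdev match, give iptables the tap as the
        # in/out interface.
        import subprocess

        def tap_jump_rules(tap, agent='l2i'):
            port_prefix = tap[len('tap'):][:10]  # e.g. '2d89f44a-6'
            return [
                # traffic towards the VM: run it through the security-group chain
                ['iptables', '-A', f'neutron-{agent}-FORWARD',
                 '-o', tap, '-j', f'neutron-{agent}-sg-chain'],
                # traffic from the VM: apply its per-port chain
                ['iptables', '-A', f'neutron-{agent}-INPUT',
                 '-i', tap, '-j', f'neutron-{agent}-o{port_prefix}'],
            ]

        for rule in tap_jump_rules('tap2d89f44a-6b'):
            subprocess.run(rule, check=True)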
  37. End of Tour
  38. 5. Summary
  39. What we got (Pros)
      * Throughput of VM-to-VM traffic over the ToR improved significantly
        => improved ToR uplink bandwidth (Clos network)
        => no encapsulation overhead on the VM network (L2 Isolate plugin)
      * Completely L2-free datacenter :)
      * No Network Node bottlenecks (routing, DHCP, metadata proxy)
      * Portable instance IPs without introducing overlays
        => supporting live migration everywhere
  40. What we got (Cons)
      * BGP network operating cost - but it's operated by the network team :)
      * Maintenance cost for the custom plugin
      * Broadcast/multicast doesn't work in our environment
  41. Future Work
      * Upstream the Neutron l2isolate plugin?
      * Offload L3 forwarding
      * Support overlay networks as well as routed networks
        => How do these network models co-exist?
        => A solution not using the l3 agent?
  42. 6. Q&A
  43. Appendix
  44. 1. Running dnsmasq for all subnets trick (1/3)
      * Use dynamic directories for the dhcp-hosts and dhcp-opts configuration (--dhcp-hostsdir, --dhcp-optsdir)
        => enough to keep config only for the VMs on the same hypervisor
        => easy to add or remove subnet and host info
        => avoids maintaining one big configuration file
      * Workaround for the reload bug [1] in versions older than 2.79:
        => also pass --dhcp-hostsfile and --dhcp-optsfile pointing at dummy (empty) files
      [1] http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=4f7bb57e9747577600b3d385e0e3418ec17e73e0
  45. 1. Running dnsmasq for all subnets trick (2/3)
      * Use multiple /1 ranges to represent all subnets in --dhcp-range. Catch-all attempts such as --dhcp-range=0.0.0.0,static, --dhcp-range=0.0.0.0,static,0.0.0.0 or --dhcp-range=0.0.0.0,255.255.255.255 don't work (see around https://github.com/imp/dnsmasq/blob/master/src/dhcp.c#L501-L560); instead:
        --dhcp-range=0.0.0.0,static,128.0.0.0
        --dhcp-range=128.0.0.0,static,128.0.0.0
      * Specify the netmask in dhcp-opts.d (tag:<subnet-id>,option:router / option:netmask / option:classless-static-route); if it's missing, dnsmasq automatically uses the dhcp-range's netmask (/1)
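      Putting the appendix options together, the invocation could look roughly like this (the paths and --port=0 are assumptions; the two /1 ranges are from the slide):

        # Two static /1 ranges cover every possible VM subnet; the real
        # per-subnet router/netmask come from the watched options directory.
        dnsmasq_cmd = [
            'dnsmasq',
            '--port=0',                                 # DHCP only, no DNS
            '--dhcp-range=0.0.0.0,static,128.0.0.0',    # 0.0.0.0/1
            '--dhcp-range=128.0.0.0,static,128.0.0.0',  # 128.0.0.0/1
            '--dhcp-hostsdir=/var/lib/l2isolate/dhcp-hosts.d',
            '--dhcp-optsdir=/var/lib/l2isolate/dhcp-opts.d',
        ]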
  46. 1. Running dnsmasq for all subnets trick (3/3)
      * Use the tag to load the appropriate subnet options for each DHCP client: the host entry in dhcp-hosts.d (MAC:IP, set:<subnet-id>) determines which tag:<subnet-id> option lines from dhcp-opts.d will be applied (subnetA vs subnetB)
  47. 2. Re-cap: usual neutron metadata access
      [diagram: VM1 reaches, via bridge and veth into a network namespace, a neutron-ns-metadata-proxy spawned by neutron-l3-agent with --router-id=A, listening on TCP 169.254.169.254:80]
      ● curl http://169.254.169.254 hits the proxy, which forwards over a unix socket to neutron-metadata-agent with the X-Neutron-Router-Id: A HTTP header
      ● neutron-metadata-agent then asks neutron-server:
        1: search all networks belonging to router A (1 RPC)
        2: prepare a filter matching those networks and the VM IP
        3: search the port matching the filter (1 RPC)
        4: retrieve the instance-id and tenant-id from that port
      ● The request continues to nova-api with the X-Instance-Id and X-Tenant-Id headers set
  48. 2. neutron-ns-metadata-proxy for all networks
      With the l2isolate agent, a single proxy is spawned with --router-id=fake (no namespaces, no bridge):
      ● curl http://169.254.169.254 hits the proxy, which forwards over the unix socket with X-Neutron-Router-Id: fake
      ● neutron-metadata-agent asks neutron-server:
        1: search all networks belonging to router "fake" (1 RPC)
        2: prepare a filter matching only the VM IP (since no network is found)
        3: search the port matching the filter (1 RPC)
        4: retrieve the instance-id and tenant-id from that port
  49. 2. neutron-ns-metadata-proxy for all networks (cont.)
      Same flow as the previous slide, with one limitation: overlapping IPs are not allowed. If we wanted to allow IP overlap, we would have to rewrite the metadata-agent as well.
