Excitingly simple multi-path OpenStack networking
LAG-less, L2-less, yet fully redundant
Yuki Nishiwaki (LINE Corp)
Samir Ibradžić (LINE Corp)
Agenda
1. Motivation for Clos Network in the Datacenter
2. Achieving Redundancy between ToR & Hypervisors
3. L3 Routing vs VM Network Overlay
4. Solution & Implementation
5. Summary
6. Q&A
1. Motivation for Clos Network in the Datacenter
Overview: Traditional Datacenter Network
[Diagram: three switching layers (Core, Aggregation, ToR) with servers below each ToR; link redundancy methods: OSPF ECMP, MC-LAG, Active-Backup bonding; non-blocking only under a single ToR, 42% of full capacity across ToRs, 8% of full capacity across Aggregation switches]
➢ Three switching layers; Core, Aggregation, ToR
➢ VLAN terminated in Aggregation switches
➢ L2 redundancy between Aggregation and ToR
➢ Only traffic between servers under the same ToR is guaranteed
to reach wire rate
➢ Traffic passing over ToR could drop to 42% of the
total wire capacity, due to the narrow uplinks
➢ Worst case between Aggregation switches: 8% of
total wire capacity
Overview: VM scheduling considerations
[Diagram: VMs of two tenants spread across servers under different ToRs; the ToR uplinks become the bottleneck for East-West traffic between them]
➢ Force scheduling of VMs under the same ToR when high East-West
throughput is needed
➢ Segregate noisy neighbour VMs by careful scheduling across
ToR and Aggregation zones
➢ Nova network replacement hacks
Pain Points of our initial cloud network architecture
● Hard to scale:
○ Ever increasing network traffic between VMs
○ Difficult to deploy zones and Aggregation switch groups
● Scheduling as a workaround for a lacking network architecture:
○ Only a workaround, no other benefits, just a PITA
○ Scheduling to same rack = same failure domain
● Implementation difficulties; ugly hacks, how to upgrade?
Network Node Scalability Issues (on big L2 overlays)
● Normally, the Network Node provides a single metadata/DHCP service per L2 network
● What if 10,000 VMs reboot at the same time?
● What if a noisy VM keeps 1,000 TCP sessions open to the metadata proxy?
● What if a rogue VM sends thousands of invalid DHCP requests?
● What if the metadata proxy or dnsmasq suddenly dies?
Huge Failure Domain
L2 separation is inevitable
New Approach: Horizontally Scalable DC Network Arch.
[Diagram: Clos fabric, servers under ToRs, ToRs to Aggregation, Aggregation to a non-blocking Core cluster; EBGP ECMP on the ToR-Aggregation and Aggregation-Core levels]
➢ Non-blocking fabric bandwidth:
Σ ToR uplinks > Σ ToR downlinks (worked example below)
➢ Use BGP ECMP to provide redundancy,
bandwidth and balancing
➢ Move from OSPF to BGP on a level
between Aggregation and Core switches
➢ Enough bandwidth capacity to ensure that traffic
between different ToRs or Aggregation switches is
never affected
Similar to RFC7938 proposal
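As a hedged worked example of the non-blocking condition above (port counts are hypothetical, not LINE's actual hardware): a ToR with 40 x 10G server-facing downlinks carries at most 400G of server traffic; provisioning 6 x 100G uplinks gives 600G upstream, so Σ ToR uplinks > Σ ToR downlinks and inter-rack traffic is never the bottleneck, with BGP ECMP spreading flows across all six uplinks. With only 2 x 100G uplinks the same ToR would be 2:1 oversubscribed.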
Clos Network Architecture solved our fabric issues
● Bandwidth bottleneck between ToRs: gone
○ Much more uplink bandwidth
○ Use BGP for traffic load balancing (even more bandwidth)
● No more scheduling workarounds needed
● Very suitable for all our VM workloads
○ We don’t really need L2. No, really.
2. Achieving Redundancy between ToR & Hypervisors
ToR - Aggregation aka Clos BGP Network benefits
Scalability: it scales at Internet level, so it scales in the DC
Reduced complexity: no need to operate expensive
non-blocking redundant protocols (L3 ECMP, OSPF…)
Get rid of L2 redundancy solutions that heavily depend on
vendor implementation quirks (MC-LAG, TRILL…)
We chose L3-only BGP & ECMP because it is a proven choice
with other large-scale DC network operators, it is very scalable,
not vendor-locked, relies on fewer features, and still allows the
necessary customizations...
It’s a common choice!
[Diagram: Core, Aggregation and ToR fabric with EBGP ECMP (L3) between layers; Hypervisor with VMs attached below the ToR]
But what about (an OpenStack) Hypervisor? MC-LAG(L2) or BGP(L3)?
[Diagram: Core, Aggregation and ToR fabric with EBGP ECMP (L3); the Hypervisor-ToR link is the choice point]
● MC-LAG (L2-based redundancy)
○ Depends on vendor implementation
○ Traffic engineering is difficult
○ Complexity cost of managing multiple protocols
● BGP ECMP (L3-based redundancy)
○ Standardizes the redundancy method on BGP across the datacenter (simplest design)
Conclusion: Use BGP everywhere...
[Diagram: BGP (L3) from Hypervisor to ToR, EBGP ECMP (L3) from ToR to Aggregation and from Aggregation to Core]
● No vendor lock-in
● BGP redundancy all over the place
○ Simplifies the datacenter network design
● Easy traffic engineering
○ ToR maintenance without downtime
3. L3 Routing vs VM Network Overlay
Routing vs VM Network Overlay
Overlay or Routing?
● No more L2 connectivity between
Hypervisors, so we can't just use V(x)LANs
● How do we achieve (minimal but critical)
L2 capabilities over L3:
○ VM provisioning (DHCP)?
○ VM-2-VM connectivity (ARP - L2)?
Routing vs VM Network Overlay
[Diagram: underlay is the BGP (L3) EBGP ECMP fabric down to the hypervisors; overlay is an L2 VM network carried on top of it]
● Overlay
○ Design complexity: a mediation point between underlay and overlay is needed
○ Encapsulation performance overhead
○ Needs a Network Node (mostly)
● Routing
○ No performance overhead
○ No Network Node
○ Requires distributed ARP / DHCP / Metadata
Conclusion: Use Routing for all VM networks
● No overhead
● No Network Node(s)
● No VM L2, even on a single hypervisor
level (full L2 isolation)
● Minimal failure domain for network services
○ Single hypervisor scope
○ DHCP, metadata, routing
● Portable IP addresses without any overlays
[Diagram: EBGP ECMP (L3) in the fabric, BGP (L3) between Hypervisor and ToR, static routing (L3) between VM and Hypervisor]
L2-less & fully-redundant VM networks in OpenStack
[Diagram: Core, Aggregation, ToR with EBGP ECMP (L3); Hypervisor to ToR with BGP (L3); VM to Hypervisor with static routing (L3)]
A custom Neutron plugin takes care that each VM joins the actual BGP-routed network.
4. Solution & Implementation
“L2 Isolate” Neutron plugin (Custom Plugin/Agent)
● VMs don’t share L2 Network
● Ensure that VMs have L3 reachability
Solution Overview
[Diagram: two hypervisors under one ToR, connected over the point-to-point subnets 172.16.24.0/31 and 172.16.24.2/31; VMs live in 10.0.0.0/24]
Hypervisor routing table (left hypervisor, hosting 10.0.0.2 and 10.0.0.3):
default via 172.16.24.0
10.0.0.2/32 via tapXXX
10.0.0.3/32 via tapXXX
ToR routing table:
10.0.0.2/32 via 172.16.24.1
10.0.0.3/32 via 172.16.24.1
10.0.0.4/32 via 172.16.24.2
10.0.0.5/32 via 172.16.24.2
10.0.0.254/32 via 172.16.24.1
The hypervisor answers its VMs' ARP requests with proxy ARP, and its routing software advertises/withdraws the VM /32 routes toward the ToR (e.g. "Advertise 10.0.0.2 to 172.16.24.1", "Withdraw 10.0.0.254").
Solution Overview: the Neutron pieces
[Same diagram as the previous slide, with the new components highlighted: an L2 Isolate Agent on each hypervisor in place of the Linuxbridge Agent, plus a new mechanism driver and type driver in the Neutron Server]
Deployment scope changes
[Diagram: deployment layout before and after]
Before (ML2 linuxbridge mechanism, VLAN type): compute nodes run nova-compute and the linuxbridge-agent; the Network Node runs the dhcp-agent and metadata-agent, which dynamically spawn dnsmasq and ns-metadata-proxy; the Controller Node runs neutron-server and the nova services.
After (L2 isolate plugin: new ML2 l2isolate mechanism and routed type): compute nodes run nova-compute, the new l2isolate-agent and the metadata-agent; dnsmasq and ns-metadata-proxy are dynamically provisioned on each compute node, just one process of each per node; no Network Node is needed; the Controller Node is unchanged.
“L2 Isolate” plugin for L3-only datacenter
[Diagram: stock Neutron architecture. Server side: core_plugin ML2 with type drivers (Flat, VLAN) and mechanism drivers (Linuxbridge, OVS); service_plugins L3, LBaaS, FWaaS. Agent side: L2 agents (Linuxbridge Agent, OVS Agent), L3 Agent, and the ML2-related DHCP Agent and Metadata Agent; LBaaS agents (Haproxy, Octavia)]
“L2 Isolate” plugin for L3-only datacenter: what we add
● Type Driver (new): define a new type of network that represents end devices connected to each other via L3
● Mechanism Driver (new): implement the logic to realize that type of network
● L2 Agent (new): implement an agent working with the new mechanism driver
● ML2-related agents: the existing agents cannot be reused, because they expect the network entity to be an L2 network
Type Driver - Routed
[Diagram: a routed-type network, private or public, represents a set of end devices that are L3-reachable from each other]
Mechanism Driver - L2 Isolate
● Supports the routed type network (see the configuration sketch below)
● Similar implementation to the linuxbridge mechanism driver
● Uses the TAP VIF_TYPE, not Bridge
● Checks for IP overlaps across all subnets of all routed-type networks when a subnet is created
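A minimal configuration sketch of how such drivers would be enabled in ML2; the entry-point names "routed" and "l2isolate" are assumptions for illustration, since the deck does not show the actual configuration:
# /etc/neutron/plugins/ml2/ml2_conf.ini (hypothetical driver aliases)
[ml2]
type_drivers = routed
tenant_network_types = routed
mechanism_drivers = l2isolate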
Why TAP VIF_TYPE?
[Diagram: on the left, VMs whose taps are plugged into a shared Linux bridge with a bridge uplink; on the right, VMs whose taps attach directly to the hypervisor with no bridge]
● L2 isolation inside the Hypervisor?
○ Some say a shared/external network cannot be trusted
● A shared bridge?
○ No L2 isolation inside the Hypervisor
● A dedicated bridge for each port?
○ We don't want to create unnecessary resources
Agent - L2 Isolate Agent
● Monitors tap devices on the compute node (like the linuxbridge agent)
● Configures
○ Linux proxy ARP and static routes
○ the routing software, to advertise/withdraw VM IPs
○ DHCP for the VMs on the same Hypervisor
○ the metadata proxy for the VMs on the same Hypervisor
● Supports the iptables-based security group implementation for tap devices
A tour of what “L2 isolate” agent really does
1. Connectivity to other VMs inside the Hypervisor
● VIF_TYPE: TAP (no prefix-route option required)
● Configure proxy ARP on the tap
● Inject a /32 route for the VM's IP via the tap (shell sketch below)
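A minimal shell sketch of what this amounts to on the hypervisor (the tap name and VM IP are illustrative, taken from the earlier diagram; the agent does this programmatically rather than with these exact commands):
# enable proxy ARP on the VM's tap and IP forwarding on the host
sysctl -w net.ipv4.conf.tapXXX.proxy_arp=1
sysctl -w net.ipv4.ip_forward=1
# host route: reach the VM's address through its tap device
ip route add 10.0.0.2/32 dev tapXXX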
2. Connectivity to the VM from outside the Hypervisor
● Routing software: FRR is configured to watch the routing table
● It advertises/withdraws the VM's /32 routes according to the changes (FRR sketch below)
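A hedged FRR configuration sketch of this idea (the ASNs, neighbor address and prefix-list are illustrative, not taken from the deck): redistribute the kernel /32 routes for VM addresses into eBGP toward the ToR.
router bgp 65001
 neighbor 172.16.24.0 remote-as 65000
 address-family ipv4 unicast
  redistribute kernel route-map VM-ROUTES
 exit-address-family
!
ip prefix-list VM-PREFIXES seq 10 permit 10.0.0.0/24 ge 32
!
route-map VM-ROUTES permit 10
 match ip address prefix-list VM-PREFIXES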
3. DHCP
● One dnsmasq serves all VMs on the same hypervisor
● The agent adds/updates/removes per-port and per-subnet config files:
○ dhcp-hosts.d/<tap name> (tapA, tapB, ...): MAC:IP:set<subnet-id> entries
○ dhcp-opts.d/<subnet id> (subnetA, subnetB, ...): tag:<subnet-id>,option:router, tag:<subnet-id>,option:netmask, ... entries
● dnsmasq watches those directories and reloads them (the --dhcp-hostsdir / --dhcp-optsdir options; invocation sketch below)
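A hedged sketch of the single per-hypervisor dnsmasq invocation, combining the directory options above with the dhcp-range trick from the appendix (the directory paths are illustrative):
dnsmasq --port=0 --no-resolv --no-hosts \
  --dhcp-range=0.0.0.0,static,128.0.0.0 \
  --dhcp-range=128.0.0.0,static,128.0.0.0 \
  --dhcp-hostsdir=/var/lib/neutron/l2isolate/dhcp-hosts.d \
  --dhcp-optsdir=/var/lib/neutron/l2isolate/dhcp-opts.d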
4. Metadata Proxy
● One ns-metadata-proxy is spawned per hypervisor, with a fake router id, listening on TCP 169.254.169.254:80
● A VM's "curl http://169.254.169.254" hits the proxy, which adds the HTTP header X-Neutron-Router-Id: fake and forwards the request over a unix socket to the metadata agent (example requests below)
● neutron-server looks up the port by just the VM's IP, because a fake router id was passed
● The request is then forwarded to nova-api with the X-Instance-Id and X-Tenant-Id headers
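For reference, a hedged example of the request from the VM side (these are the standard OpenStack metadata paths, nothing specific to this plugin):
# run inside a guest VM
curl http://169.254.169.254/openstack/latest/meta_data.json
curl http://169.254.169.254/latest/meta-data/instance-id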
5. Security Group Support
Default iptables firewall driver: expects the tap device to be connected via a bridge
Chain neutron-<agent>-FORWARD
 neutron-<agent>-sg-chain all -- any any anywhere anywhere PHYSDEV match --physdev-out tap97125c-21 --physdev-is-bridged
Chain neutron-<agent>-INPUT
 neutron-<agent>-o97125c-2 all -- any any anywhere anywhere PHYSDEV match --physdev-in tap97125c-21 --physdev-is-bridged
Chain neutron-<agent>-sg-chain
 neutron-<agent>-i9715c-2 all -- any any anywhere anywhere PHYSDEV match --physdev-out tap97125c-21 --physdev-is-bridged
5. Security Group Support (cont.)
tapbase_iptables_firewall driver: just specify the tap device as the in/out interface, instead of the default driver's bridged physdev rules shown on the previous slide (an illustrative iptables sketch follows below)
chain neutron-<agent>-FORWARD
 neutron-<agent>-sg-chain all -- any tap2d89f44a-6b anywhere anywhere
chain neutron-<agent>-INPUT
 neutron-<agent>-o2d89f44a-6 all -- tap2d89f44a-6b any anywhere anywhere
chain neutron-<agent>-sg-chain
 neutron-<agent>-o2d89f44a-6 all -- tap2d89f44a-6b any anywhere anywhere
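A hedged sketch of how such tap-interface rules could be installed with plain iptables (the chain and device names are illustrative; the real driver builds them through Neutron's iptables machinery):
# jump per-port traffic into its security-group chains, matching the tap directly
iptables -A neutron-l2iso-FORWARD -o tap2d89f44a-6b -j neutron-l2iso-sg-chain
iptables -A neutron-l2iso-INPUT -i tap2d89f44a-6b -j neutron-l2iso-o2d89f44a-6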
End of Tour
5. Summary
What we got (Pros)
* Throughput of VM-to-VM traffic across ToRs improved significantly
=> Improved ToR uplink bandwidth (Clos Network)
=> No encapsulation overhead for the VM network (L2isolate plugin)
* Completely L2-free datacenter :)
* No Network Node bottlenecks (routing, DHCP, metadata proxy)
* Portable instance IPs without introducing overlays
=> Live migration supported everywhere
What we got (Cons)
* BGP network operating cost (but it is operated by the network team :))
* Maintenance cost for the custom plugin
* Broadcast/multicast doesn't work in our environment
Future Work
* Upstream the Neutron l2isolate plugin?
* Offload L3 forwarding
* Support overlay networks as well as routed networks
=> How can these network models co-exist?
=> A solution that doesn't use the l3 agent?
6. Q&A
Appendix
1. Running dnsmasq for all subnets trick (1/3)
* Use the dynamic directories for the dhcp-hosts / dhcp-opts configuration (--dhcp-hostsdir, --dhcp-optsdir)
=> Enough to keep only the config for the VMs on the same hypervisor
=> Easy to add/remove subnet and host entries
=> Avoids maintaining one big configuration file
* Workaround for a reloading bug [1] when running a dnsmasq version older than 2.79
=> Also pass --dhcp-hostsfile and --dhcp-optsfile pointing at a dummy (empty) file
[1] http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=4f7bb57e9747577600b3d385e0e3418ec17e73e0
1. Running dnsmasq for all subnets trick (2/3)
* Use two /1 ranges to represent all subnets in --dhcp-range:
--dhcp-range=0.0.0.0,static,128.0.0.0
--dhcp-range=128.0.0.0,static,128.0.0.0
(other candidate forms shown on the slide: --dhcp-range=0.0.0.0,static / --dhcp-range=0.0.0.0,static,0.0.0.0 / --dhcp-range=0.0.0.0,255.255.255.255; see around https://github.com/imp/dnsmasq/blob/master/src/dhcp.c#L501-L560 for how dnsmasq interprets dhcp-range)
* Specify the netmask in a dhcp-option: dhcp-opts.d/<subnet id> carries tag:<subnet-id>,option:router, option:netmask and option:classless-static-route entries; if option:netmask is missing, the dhcp-range's netmask (/1) is used automatically
1. Running dnsmasq for all subnets trick (3/3)
* Use tags to load the appropriate subnet options for each DHCP client (example files below)
=> dhcp-hosts.d/<tap name> (tapA, tapB): MAC:IP:set<subnet-id> entries
=> dhcp-opts.d/<subnet id> (subnetA, subnetB): tag:<subnet-id>,option:router, tag:<subnet-id>,option:netmask, ... entries
=> The set tag on a host entry selects which subnet's options will be applied to that client
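A hedged example of what these files might contain (the MAC, IP and subnet-id values are made up; the field separators follow dnsmasq's dhcp-hostsfile / dhcp-optsfile syntax rather than the slide's shorthand):
# dhcp-hosts.d/tapA : one VM port, tagged with its subnet
fa:16:3e:aa:bb:cc,set:subnet-A,10.0.0.2
# dhcp-opts.d/subnet-A : options served to clients tagged with that subnet
tag:subnet-A,option:router,10.0.0.254
tag:subnet-A,option:netmask,255.255.255.0
tag:subnet-A,option:classless-static-route,0.0.0.0/0,10.0.0.254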
2. Re-Cap: usual neutron metadata access
[Diagram: VM, bridge and veth set up by neutron-linuxbridge-agent, router network namespace from neutron-l3-agent, neutron-ns-metadata-proxy listening on TCP 169.254.169.254:80 inside it, unix socket to neutron-metadata-agent, then nova-api and neutron-server]
The VM runs "curl http://169.254.169.254". The proxy, spawned with --router-id=A, adds the HTTP header X-Neutron-Router-Id: A. The metadata agent then:
1: searches all networks belonging to router A (1 RPC to neutron-server)
2: prepares a filter matching those networks and the VM's IP
3: searches for the port matching the filter (1 RPC)
4: retrieves the instance-id and tenant-id from that port, and forwards the request to nova-api with the X-Instance-Id and X-Tenant-Id headers
2. neutron-ns-metadata-proxy for all networks
[Diagram: VM, neutron-ns-metadata-proxy listening on TCP 169.254.169.254:80 and spawned by the neutron-l2isolate-agent with --router-id=fake, unix socket to neutron-metadata-agent, then nova-api and neutron-server]
The VM runs "curl http://169.254.169.254". The proxy adds the HTTP header X-Neutron-Router-Id: fake. The metadata agent then:
1: searches all networks belonging to router "fake" (1 RPC)
2: prepares a filter matching only the VM's IP (since no network is found)
3: searches for the port matching the filter (1 RPC)
4: retrieves the instance-id and tenant-id from that port, and forwards the request to nova-api with the X-Instance-Id and X-Tenant-Id headers
Limitation: overlapping IPs are not allowed. If we wanted to allow them, the metadata-agent would have to be rewritten as well.
