Large Scale Overlay Networks with OVN:
Problems and Solutions
Han Zhou (hzhou8@ebay.com)
Open Infrastructure Summit - Denver, 2019
Agenda
● Background
● Control-plane components scaling
○ OVN-Controller
○ South-bound DB
○ OVN-Northd
● Scaling ACL
● Scaling nested workloads (containers on VMs)
Background of OVN
● SDN solution developed by the OVS (Open vSwitch) community
● OpenStack support - neutron ML2 plugin: networking-ovn
● Kubernetes support - CNI plugin: ovn-kubernetes
● Main Features
○ Full L2/L3 virtualization with overlay networks (Geneve, STT, VxLAN)
○ L2 gateway, L3 gateway (centralized/distributed) & NAT with HA
○ L4 ACLs (stateful FW) with address-set, port-group and packet logging
○ Distributed Load-Balancer
○ L2/L3 Port-security
○ ARP responder, static/dynamic ARP
○ Flat/Vlan physical networks
○ Native DHCP, Metadata
○ Parent-child ports for nested workloads
○ QoS
○ IPSec
○ Policy-based routing
○ ...
Distributed Control Plane
● Logical/physical separation
● Distributed local controllers
● Database Approach (ovsdb)
[Diagram: the CMS (OpenStack/K8S) writes Virtual Network Abstractions into the North-bound ovsdb; Northd (central) translates them into Logical Flows in the South-bound ovsdb; ovn-controller on each HV and GW consumes the Logical Flows over the OVSDB protocol (RFC7047) and programs OpenFlows into the local OVS.]
OVN-Controller Scaling Challenges
[Diagram: the same control-plane architecture, highlighting ovn-controller on each HV.]
● Factors
○ Size of data
○ Rate of changes
● Challenges
○ Large volume of data to be processed
■ E.g. 10k logical ports generate >40k logical flows and 10k port-bindings
○ Logical flow parsing is CPU intensive
○ Cloud workloads change frequently
○ Many inputs feed flow computation
Dependency Graph of OVN-Controller
[Diagram: the dependency graph feeding flow computation.
SB OVSDB inputs: logical_flow, chassis, encap, mc_group, dp_binding, port_binding, mac_binding, dhcp, dhcpv6, dns, gw_chassis, addr_set, port_group - plus Address Sets and Port Groups (converted).
Local OVSDB inputs: open_vswitch, bridge, port, qos; other input: MFF OVN Geneve.
Runtime Data (intermediate): Local_datapath, Local_lports, Local_lport_ids, Active_tunnels, Ct_zone_bitmap, Pending_ct_zones, Ct_zones.
Flow Output: Desired_flow_table, Group_table, Meter_table, Conj_id_ofs.]
Original Approach - Recomputing
● Compute OVS flows by reprocessing all inputs when
○ Any input changes
○ Or even when there is no change at all (but just unrelated events)
● Benefit
○ Relatively easy to implement and maintain
● Problems
○ 100% CPU usage by the ovn-controller process on all compute nodes
○ High control plane latency
Solution - Incremental Processing Engine
● DAG representing dependencies
● Each node contains
○ Data
○ Links to input nodes
○ Change-handler for each input
○ Full recompute handler
● Engine
○ DFS post-order traversal of the DAG from the final output node
○ Invoke change-handlers for inputs that changed
○ Fall back to recompute if, for ANY of its inputs:
■ A change-handler is not implemented for that input, or
■ The change-handler cannot handle the particular change (returns false)
[Diagram: example DAG of input, intermediate, and output nodes.]
Change Handler Implemented
[Diagram: the same dependency graph, annotated to mark each input that has a change handler implemented (legend: "Input with change handler implemented").]
CPU Efficiency Improvement
● Create and bind 10k ports on 1k HVs
○ Simulated 1k HVs on 20 BMs x 40 cores (2.50GHz)
○ 10k ports, all under the same logical router
○ Batch size: 100 lports
○ Bind ports one by one within each batch
○ Wait for all ports to be up before the next batch
Latency Improvement
● End-to-end latency on top of 10k existing logical ports
○ Create one more logical port and bind it on an HV
○ Wait until northd generates lflows and creates the port-binding in SB
○ Wait until ovn-controller claims the port on the HV
○ Wait until northd generates all lflows
○ Wait until OVS flows are programmed on all HVs
Tests at Larger Scale
● Next bottlenecks:
○ OVS flow installation
○ Port-binding handling when the binding happens locally
What’s next for Incremental-Processing (WIP)
● Incremental flow installation
○ Low-hanging fruit - with the help of incremental flow computation
● Implement more change handlers as needed
○ E.g. support incremental processing when port-binding happens locally - further improves end-to-end latency
● New implementation: Differential Datalog (DDlog)
○ Data-flow approach
○ Reuse the effort taken for Northd improvement (will be discussed in Northd scaling)
● Upstream?
○ Not in upstream, because DDlog is the preferred long-term solution
○ For those who need this:
■ Rebased on Master: https://github.com/hzhou8/ovs/tree/ovn-controller-inc-proc
■ Rebased on 2.11: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.11
■ Rebased on 2.10: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.10
OVN-Controller Other Improvements (WIP)
● Reduce data size per-HV
○ Problem: External Provider Network connects everything
○ Solution: Don’t cross external network boundary when calculating connected datapaths
● On-demand tunnel port creation
○ Problem: Too many OVS ports when there are a lot of HVs
○ Solution: Create a tunnel to a remote host only if ports on that host are logically connected to local ports.
SB DB Scaling Challenges
[Diagram: the same control-plane architecture, highlighting the South-bound ovsdb serving all HVs and GWs.]
● Factors
○ Number of clients (HVs & GWs)
○ Size of data
○ Rate of changes
● Problems
○ Probe handling
○ Data resync during restart/failover
○ Clustered-mode problems
SB DB Probe
● The default 5 sec probe interval causes connection flapping
○ ovsdb-server response time can occasionally exceed 5 sec, e.g. during
■ DB log compression
■ Large transaction handling
○ Reconnecting clients add more load to the server - cascading failure
■ Clients resync data from the server (solved - see next slide)
● Solution
○ Increase the probe interval
■ Client side (on HVs)
● ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000
■ Server side (DON’T FORGET!!)
● ovn-sbctl -- --id=@conn_uuid create Connection \
      target="ptcp:6642:0.0.0.0" inactivity_probe=0 \
      -- set SB_Global . connections=@conn_uuid
○ Rely on external monitoring for HV connectivity
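If the SB DB already has a Connection row, a shorter form can adjust the probe. A minimal sketch, assuming the Connection table holds exactly one row (which the "." shorthand requires):

  # Server-side alternative: disable the inactivity probe on the existing
  # Connection row ("." assumes exactly one row in the Connection table)
  ovn-sbctl set connection . inactivity_probe=0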
Data re-sync during DB reconnect
● Problem
○ OVSDB client caching => NOT a problem
○ Server restart/failover: re-syncing data for all clients => THIS is the problem!
● Solution - OVSDB fast re-sync (in master -> v2.12)
○ Track and maintain recent transaction history on disk and in memory.
○ New OVSDB protocol method monitor_cond_since requests only the changes since the last point before the connection was lost.
○ Note: currently works in clustered mode only.
● Test Result - 1k HVs, 10k ports
○ Before: SB DB at 100% CPU, >30 min to recover.
○ After: no CPU spike, all connections restored in <1 min (probe interval).
OVSDB Clustered Mode
● Raft-based clustering (experimental support since v2.9)
● Problems at scale
○ High CPU load (solved in master)
○ Follower update latency (solved in master)
○ Leader flapping (WIP, workaround ready)
○ Client reconnect (solved in master)
OVSDB Clustered Mode - High CPU
● OVSDB Raft Implementation
○ Preprocessing on followers before sending to the leader - shares some of the leader's load
○ The preprocessed transaction is sent to the leader together with a prerequisite version ID
● Problem
○ Lots of prerequisite check failures and retries at large scale
■ Different HVs update chassis/port_binding at the same time through different follower nodes
○ Continuous retries cause 100% CPU
● Solution (in master -> v2.12)
○ Retry only when the follower has applied the largest local Raft log index
■ Otherwise the prerequisite is already out of date, so don't waste CPU
OVSDB Clustered Mode - Follower Latency
● Original behavior: leader sends Raft log update to follower nodes when:
○ A new change is proposed, or
○ A heartbeat is sent
● Problem
○ Updates submitted through a follower node suffer high latency
● Solution (in master -> v2.12)
○ Send the log to followers as soon as a new entry is committed
● Test result: 100 updates through the same follower from the same client
○ Before: >30 sec
○ After: 500 ms
OVSDB Clustered Mode - Leader Flapping
● Problem: heartbeat timeout, triggering re-election
○ Large transaction execution
○ Raft log compression (snapshot)
● Solution
○ Quick and dirty: increase the election timeout (hardcoded)
○ Short term: make the election timeout configurable at the cluster level (WIP)
○ Longer term: separate thread for Raft RPC (WIP)
■ A timeout still needs to be configured for snapshot scenarios
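Leader flapping shows up as frequently increasing terms and leader changes in the Raft status, which can be polled with the standard cluster/status command; the ctl socket path below is deployment-specific and shown as an assumption:

  # Poll the Raft status of the SB cluster; frequent term increases and
  # leader changes indicate flapping (socket path is deployment-specific)
  ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound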
OVSDB Clustered Mode - Client Reconnect
● Problem: during leader failover, all clients of the new leader reconnect
○ The DB state changes to "disconnected" while there is (temporarily) no leader
○ Clients then try to reconnect to a different node
● Solution (in master -> v2.12)
○ Don't change the state to "disconnected" if
■ the current node is a candidate, and
■ the election hasn't timed out yet
Scale Test for Clustered Mode
● Setup
○ 3-node cluster, 1k HVs
○ Election timeout: 10s (hardcoded in the test)
● Test
○ Keep creating and binding ports up to 10k
○ Periodically kill->wait(10s)->start each ovsdb-server randomly
● Test passed at scale!
○ All port creation and binding completed correctly.
○ Fast-resync helped!
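For reference, a 3-node SB cluster like this test setup can be bootstrapped with ovsdb-tool; the paths and addresses below are hypothetical:

  # Node 1: bootstrap a clustered SB database (paths/addresses hypothetical)
  ovsdb-tool create-cluster /etc/openvswitch/ovnsb_db.db \
      /usr/share/openvswitch/ovn-sb.ovsschema tcp:10.0.0.1:6644

  # Nodes 2 and 3: create a joining database, pointing at node 1
  ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db OVN_Southbound \
      tcp:10.0.0.2:6644 tcp:10.0.0.1:6644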
Further Improvement: SB-DB Scale-out Replicas (TODO)
● How to support more HVs - 2k? 5k? 10k?
○ More nodes in the cluster? Doesn't scale.
○ Multi-threaded OVSDB? Would help, but...
● Precondition: no writes to SB from HVs
○ Chassis/Encap/Port-binding updated by CMS/northd only
○ No dynamic ARP (mac_binding) in use
● How
○ Use OVSDB replication to create N read-only replicas
○ Shard HV connections across the read-only replicas
○ HVs can fail over to other replicas
[Diagram: CMS → NB ovsdb → Northd → SB ovsdb (read-write); the SB DB feeds SB Replica 1, Replica 2, …, Replica n, and groups of HVs connect to each read-only replica.]
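A read-only replica could be run with ovsdb-server's replication support; a sketch, assuming hypothetical paths and a primary SB DB at 10.0.0.10:

  # Start a read-only SB replica that continuously syncs from the primary
  # (paths and addresses are hypothetical)
  ovsdb-server /etc/openvswitch/ovnsb_replica.db \
      --remote=ptcp:6642 \
      --sync-from=tcp:10.0.0.10:6642 \
      --pidfile --detach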
OVN-Northd Scaling Challenges
[Diagram: the same control-plane architecture, highlighting Northd between the North-bound and South-bound ovsdbs.]
● Factors
○ Size of data
○ Rate of changes
● Problems
○ Recompute: northd reprocesses everything on any change
OVN-Northd Incremental Processing (WIP from community)
● OVN-Northd is a perfect target user of Differential Datalog (DDlog)
○ Inputs - NB DB tables (logical routers, switches, ports, etc.)
○ Outputs - SB DB tables (logical flows, port-bindings, etc.)
○ Rules to convert inputs to outputs
● Differential Datalog
○ An open-source Datalog-based language for incremental data-flow processing
○ Inputs and outputs are defined as relations
○ Rules define how outputs are generated from inputs
● Efforts can be reused by OVN-Controller
○ OVSDB - DDlog wrappers
○ Process framework changes
Recap Scaling Bottlenecks
● OVN-Northd
● OVN-SB DB
● OVN-Controller
[Diagram: the control-plane architecture with all three components marked as bottleneck points.]
Some More Scaling Problems
● Security Group / Network policy using ACLs
● Nested workloads (K8S containers)
ACLs
● Used by Security Group (OpenStack) / Network Policy (K8S)
● Typical use case: members of the same group are allowed to access each other
● Naked (inline IP lists) => O(N^2) logical flows
● Using Address Set => O(N) logical flows
● #Flows in OVS is always O(M*N) (M = number of ports on the HV)
Naked:
outport == <port1_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
outport == <port2_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
...
outport == <portN_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
With Address Set:
outport == <port1_uuid> && ip4 && ip4.src == $as_ip4_sg1
outport == <port2_uuid> && ip4 && ip4.src == $as_ip4_sg1
...
outport == <portN_uuid> && ip4 && ip4.src == $as_ip4_sg1
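For concreteness, a minimal ovn-nbctl sketch of the address-set pattern; the switch, port, and set names and the IPs are hypothetical:

  # Hypothetical names: create an address set holding the group members' IPs
  ovn-nbctl create Address_Set name=sg1 addresses='"10.0.0.11","10.0.0.12"'

  # One ACL per member port references the set instead of listing N IPs
  ovn-nbctl acl-add ls1 to-lport 1001 \
      'outport == "lsp1" && ip4 && ip4.src == $sg1' allow-related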
Solution - Port Group (Released in v2.10)
● All-in-one
● Greatly simplified CMS Implementation
○ networking-ovn
○ ovn-kubernetes
● Enables more efficient OVS flow generation with conjunction when multiple ports on the same HV belong to the same port-group
○ E.g.
■ N members in a port-group, all M ports on HV1 belong to this group
■ Number of OVS flows on HV1 will be M + N, instead of M * N
outport == @port_group1 && ip4 && ip4.src == $port_group1_ip4
(@port_group1: the CMS creates a port-group instead of an address-set; $port_group1_ip4: OVN-Northd generates the address-set for you)
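The same policy with a port group (v2.10+); the names are hypothetical, and northd auto-generates the $pg1_ip4 address set from the group members' addresses:

  # Hypothetical names: create a port group containing the member ports
  ovn-nbctl pg-add pg1 lsp1 lsp2

  # A single ACL attached to the port group covers all member ports;
  # $pg1_ip4 is the address set northd derives from the group automatically
  ovn-nbctl acl-add pg1 to-lport 1001 \
      'outport == @pg1 && ip4 && ip4.src == $pg1_ip4' allow-related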
Further Improvement - Group-ID in Packet (TODO)
● Problem - still too many OVS flows
○ Best case: M + N, if all M ports on the HV belong to the same group.
○ Worst case: M * N, if ports are distributed randomly.
■ M ports on the HV, each belonging to a different group, each group with N members
● Solution (just an idea)
○ Encode the port-group in tunnel metadata
■ Only M flows in all cases
■ Best part: no local flow changes needed for remote member changes
○ Challenge: what if a port belongs to multiple groups?
■ Limit the number of groups for a single port
■ Fall back to the old way if the limit is exceeded
○ Limitation: works for ingress (to-lport) rules only
outport == @port_group1 && src_group_id == <group1 id>
(src_group_id would come from tunnel metadata)
Scaling Nested Workloads
● Use Case
○ VM overlay networking with OVN (e.g. using OpenStack networking-ovn)
○ Run Kubernetes on top of the VMs
● Problem
○ How to connect the pods at scale?
ARP Proxy
● OVN doesn't support MAC learning (MAC-port binding learning), but IP-MAC bindings can be learned through ARP
● How
○ The LR sends ARP requests for Pod IPs
○ The ARP proxy in the VM replies with the VM's MAC for all Pod IPs on that VM
● Works, but
○ Requires the VM and Pods to be on the same subnet
○ Unreliable when the SB DB connection fails
○ Scale: O(N), N = number of pods, usually much bigger than the number of VMs
■ Note: an incremental-processing change handler for IP-MAC bindings is implemented - no recompute.
[Diagram: Pods 10.0.0.102-10.0.0.105 behind a VM (10.0.0.2, MAC aa:bb:cc:dd:ee:ff) on an HV; the ARP Proxy in the VM answers the LR's ARP requests, and ovn-controller records the results in the SB IP-MAC binding table.]
LR ARP Cache (dynamic):
10.0.0.102 => aa:bb:cc:dd:ee:ff
10.0.0.103 => aa:bb:cc:dd:ee:ff
10.0.0.104 => aa:bb:cc:dd:ee:ff
...
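The bindings learned this way are stored in the SB MAC_Binding table, which can be inspected directly:

  # List the dynamically learned IP-MAC bindings kept in the SB DB
  ovn-sbctl list MAC_Binding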
LR Static Route
● Assign Pod subnet(s) per VM (minion)
● How
○ Configure static routes in the OVN LR for pod subnets: next hop = VM IP
● Considerations
○ Decouples VM and Pod subnets
○ Declarative, more reliable than ARP
○ May waste more IPs, but subnet size is flexible
○ Scale: O(S), S = number of pod subnets
■ Worst case O(N), N = number of pods, if the subnet size is /32.
[Diagram: Pods 10.0.0.2/25 - 10.0.0.5/25 behind a VM with interface 172.0.0.2/24 on an HV.]
LR Routing Table (static):
10.0.0.0/25 => 172.0.0.2
10.0.0.128/25 => 172.0.1.100
10.0.0.1/25 => 172.0.1.3
...
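A minimal sketch of the route configuration, reusing the prefix and next hop from the example above (the router name lr0 is hypothetical):

  # Route the pod subnet hosted on VM 172.0.0.2 via that VM's IP
  ovn-nbctl lr-route-add lr0 10.0.0.0/25 172.0.0.2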
References
● OVS/OVN
○ http://www.openvswitch.org/
● Networking-OVN
○ https://docs.openstack.org/networking-ovn/latest/
● OVN-Kubernetes
○ https://github.com/openvswitch/ovn-kubernetes/
● OVN-Scale-Test
○ https://github.com/openvswitch/ovn-scale-test
● GO-OVN library
○ https://github.com/eBay/go-ovn