Large Scale Overlay Networks with OVN:
Problems and Solutions
Han Zhou (hzhou8@ebay.com)
Open Infrastructure Summit - Denver, 2019
Agenda
● Background
● Control-plane components scaling
○ OVN-Controller
○ South-bound DB
○ OVN-Northd
● Scaling ACL
● Scaling nested workloads (containers on VMs)
Background of OVN
● SDN solution developed by OVS (Open vSwitch) community
● OpenStack support - neutron ML2 plugin: networking-ovn
● Kubernetes support - CNI plugin: ovn-kubernetes
● Main Features
○ Full L2/L3 virtualization with overlay networks (Geneve, STT, VXLAN)
○ L2 gateway, L3 gateway (centralized/distributed) & NAT with HA
○ L4 ACLs (stateful FW) with address-set, port-group and packet logging
○ Distributed Load-Balancer
○ L2/L3 Port-security
○ ARP responder, static/dynamic ARP
○ Flat/VLAN physical networks
○ Native DHCP, Metadata
○ Parent-child ports for nested workloads
○ QoS
○ IPSec
○ Policy-based routing
○ ...
Distributed Control Plane
● Logical/physical separation
● Distributed local controllers
● Database Approach (ovsdb)
[Figure: OVN architecture. The CMS (OpenStack/K8S) writes virtual network abstractions to the North-bound ovsdb. Northd converts them into logical flows in the South-bound ovsdb. Both databases and Northd run centrally. The OVN-Controller on each HV and GW connects to the South-bound ovsdb over the OVSDB protocol (RFC7047) and programs OpenFlows into the local OVS.]
OVN-Controller Scaling Challenges
● Factors
○ Size of data
○ Rate of changes
● Challenges
○ Large volume of data to process
■ E.g. 10k logical ports generate >40k logical flows and 10k port-bindings
○ Logical flow parsing is CPU intensive
○ Cloud workload changes frequently
○ Lots of inputs for flow computation
Dependency Graph of OVN-Controller
● SB OVSDB input: logical_flow, chassis, encap, mc_group, dp_binding, port_binding, mac_binding, dhcp, dhcpv6, dns, gw_chassis, addr_set, port_group
● Local OVSDB input: open_vswitch, bridge, port, qos
● Other nodes: Address Sets (converted), Port Groups (converted), MFF OVN Geneve
● Runtime Data: Local_datapath, Local_lports, Local_lport_ids, Active_tunnels, Ct_zone_bitmap, Pending_ct_zones, Ct_zones
● Flow Output: Desired_flow_table, Group_table, Meter_table, Conj_id_ofs
Original Approach - Recomputing
● Compute OVS flows by reprocessing all inputs when
○ Any input changes
○ Or even when there is no change at all (but just unrelated events)
● Benefit
○ Relatively easy to implement and maintain
● Problems
○ 100% CPU of ovn-controller process on all compute nodes
○ High control plane latency
Solution - Incremental Processing Engine
● DAG representing dependencies
● Each node contains
○ Data
○ Links to input nodes
○ Change-handler for each input
○ Full recompute handler
● Engine
○ DFS post-order traverse the DAG from the final output node
○ Invoke change-handlers for inputs that changed
○ Fall back to recompute if for ANY of its inputs:
■ Change-handler is not implemented for that input, or
■ Change-handler cannot handle the particular change (returns false)
[Figure: example DAG with three input nodes, two intermediate nodes, and one output node.]
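The engine described above can be sketched in a few lines of Python. This is only an illustrative model, not the ovn-controller implementation: the `Node` class, `run_engine`, the handler signatures, and the toy `rows`/`new` data layout are all assumptions made for the example. It shows a post-order traversal that calls a change handler when one exists and falls back to a full recompute otherwise.

```python
class Node:
    """One DAG node: data, links to inputs, change handlers, recompute."""

    def __init__(self, name, recompute):
        self.name = name
        self.data = {}
        self.inputs = []
        # input name -> handler(node, input) returning False when the
        # change cannot be handled incrementally (hypothetical signature).
        self.change_handlers = {}
        self.recompute = recompute
        self.changed = False
        self.recompute_count = 0

    def add_input(self, node, handler=None):
        self.inputs.append(node)
        if handler is not None:
            self.change_handlers[node.name] = handler


def run_engine(node, visited=None):
    """DFS post-order: bring every input up to date, then this node."""
    if visited is None:
        visited = set()
    if node.name in visited:
        return
    visited.add(node.name)
    for inp in node.inputs:
        run_engine(inp, visited)
    if not node.inputs:
        return  # Leaf inputs get their 'changed' flag set externally.
    changed_inputs = [i for i in node.inputs if i.changed]
    if not changed_inputs:
        node.changed = False
        return
    for inp in changed_inputs:
        handler = node.change_handlers.get(inp.name)
        if handler is None or not handler(node, inp):
            # No handler, or the handler gave up: full recompute.
            node.recompute(node)
            node.recompute_count += 1
            break
    node.changed = True


def recompute_flows(node):
    """Rebuild the whole output from all inputs."""
    node.data["flows"] = {
        f"{inp.name}:{row}" for inp in node.inputs for row in inp.data["rows"]
    }


def handle_port_binding(node, inp):
    """Incremental path: only add flows for the newly added rows."""
    node.data["flows"] |= {f"{inp.name}:{row}" for row in inp.data["new"]}
    return True


port_binding = Node("port_binding", recompute=lambda n: None)
logical_flow = Node("logical_flow", recompute=lambda n: None)
flow_output = Node("flow_output", recompute=recompute_flows)
flow_output.add_input(port_binding, handle_port_binding)
flow_output.add_input(logical_flow)  # deliberately no change handler

port_binding.data = {"rows": {"p1"}, "new": set()}
logical_flow.data = {"rows": {"lf1"}, "new": set()}
recompute_flows(flow_output)

# A port-binding change is handled incrementally.
port_binding.data["rows"].add("p2")
port_binding.data["new"] = {"p2"}
port_binding.changed = True
run_engine(flow_output)
incremental_recomputes = flow_output.recompute_count
port_binding.changed = False

# A logical-flow change has no handler, so the engine recomputes.
logical_flow.data["rows"].add("lf2")
logical_flow.changed = True
run_engine(flow_output)
fallback_recomputes = flow_output.recompute_count
```

In this toy run the first change costs no recompute and the second costs one, which is the trade-off the slide describes: handlers are optional, and correctness is preserved by the recompute fallback.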
Change Handler Implemented
[Figure: the OVN-Controller dependency graph, with the inputs that have a change handler implemented highlighted.]
CPU Efficiency Improvement
● Create and bind 10k ports on 1k HVs
○ Simulated 1k HVs on 20 BMs x 40 cores (2.50GHz)
○ 10k ports all under the same logical router
○ Batch size 100 lports
○ Bind ports one by one within each batch
○ Wait for all ports to be up before the next batch
Latency Improvement
● End-to-end latency on top of 10k existing logical ports
○ Create one more logical port and bind the port on HV
○ Wait until northd generates lflows and creates port-binding in SB
○ Wait until ovn-controller claims the port on HV
○ Wait until northd generates all lflows
○ Wait until OVS flows are programmed on all HVs
Tests at Larger Scale
● Next bottlenecks:
○ OVS flow installation
○ Port-binding handling when the binding happens locally
What’s next for Incremental-Processing (WIP)
● Incremental flow installation
○ Low hanging fruit - with the help of incremental flow computing
● Implement more change handlers as needed
○ E.g. support incremental processing when port-binding happens locally, to further improve end-to-end latency
● New implementation: Differential Datalog (DDlog)
○ Data-flow approach
○ Reuse the effort taken for Northd improvement (will be discussed in Northd scaling)
● Upstream?
○ Not in upstream, because DDlog is the preferred long term solution
○ For those who need this:
■ Rebased on Master: https://github.com/hzhou8/ovs/tree/ovn-controller-inc-proc
■ Rebased on 2.11: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.11
■ Rebased on 2.10: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.10
OVN-Controller Other Improvements (WIP)
● Reduce data size per-HV
○ Problem: External Provider Network connects everything
○ Solution: Don’t cross external network boundary when calculating connected datapaths
● On-demand tunnel port creation
○ Problem: Too many OVS ports when there are a lot of HVs
○ Solution: Create a tunnel to a remote host only if that host has ports that are logically connected to local ports.
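The "reduce data size per-HV" idea above can be illustrated with a small graph walk. This is a sketch under assumptions, not the ovn-controller code: the function name, the adjacency-map representation, and the choice to keep the external network itself in the result while not expanding past it are all hypothetical.

```python
from collections import deque


def local_datapaths(start, links, external):
    """Datapaths reachable from `start` without crossing external networks.

    links: adjacency map {datapath: [connected datapaths]}.
    external: set of datapaths that are external provider networks.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        dp = queue.popleft()
        if dp in external:
            continue  # Keep the external network, but do not expand past it.
        for peer in links.get(dp, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


# Two tenants (ls1/lr1 and ls2/lr2) joined only by the provider network.
links = {
    "ls1": ["lr1"],
    "lr1": ["ls1", "ext"],
    "ext": ["lr1", "lr2"],
    "lr2": ["ext", "ls2"],
    "ls2": ["lr2"],
}

everything = local_datapaths("ls1", links, external=set())
bounded = local_datapaths("ls1", links, external={"ext"})
```

Without the boundary, a hypervisor hosting a port on `ls1` would consider all five datapaths local; with it, only `ls1`, `lr1`, and `ext` remain.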
SB DB Scaling Challenges
● Factors
○ Number of clients (HVs & GWs)
○ Size of data
○ Rate of changes
● Problems
○ Probe handling
○ Data resync during restart/failover
○ Clustered-mode problems
SB DB Probe
● Default 5 sec probe interval causing connection flapping
○ Ovsdb-server response can occasionally exceed 5 sec
■ DB log compression
■ Large transaction handling
○ Clients reconnecting adds more load to the server - cascade failure
■ Clients resync data from server (solved - see next slide)
● Solution
○ Increase probe interval
■ Client side (on HVs)
● ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000
■ Server side (DON’T FORGET!!)
● ovn-sbctl -- --id=@conn_uuid create Connection \
target="ptcp:6642:0.0.0.0" \
inactivity_probe=0 -- set SB_Global . connections=@conn_uuid
○ Rely on external monitoring for HV connectivity
Data re-sync during DB reconnect
● Problem
○ OVSDB client caching => NOT a problem
○ Server restart/failover: re-sync data for all clients. => This is the problem!
● Solution - OVSDB fast re-sync (in master -> v2.12)
○ Track and maintain recent transaction history on disk and in memory.
○ New method monitor_cond_since in OVSDB protocol, to request only the changes since the last point before the connection was lost.
○ Note: currently it works in clustered mode only.
● Test Result - 1k HVs, 10k ports
○ Before: SB DB 100% CPU, >30 min to recover.
○ After: No CPU spike, all connections restored in <1 min (probe interval).
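The fast re-sync idea can be modelled roughly as follows. This is a simplified sketch, not the ovsdb-server implementation: the `Server` and `Client` classes, the integer transaction IDs, the `history_limit` parameter, and the flat key/value "database" are assumptions for illustration. The reply shape (found flag, last transaction ID, updates) follows the general idea of `monitor_cond_since`.

```python
class Server:
    """Toy database server that keeps a bounded transaction history."""

    def __init__(self, history_limit=100):
        self.rows = {}
        self.history = []  # list of (txn_id, {key: value or None})
        self.next_txn = 1
        self.history_limit = history_limit

    def commit(self, changes):
        """Apply changes; a value of None deletes the row."""
        for key, value in changes.items():
            if value is None:
                self.rows.pop(key, None)
            else:
                self.rows[key] = value
        txn_id = self.next_txn
        self.next_txn += 1
        self.history.append((txn_id, dict(changes)))
        if len(self.history) > self.history_limit:
            self.history.pop(0)
        return txn_id

    def monitor_since(self, last_txn):
        """Return (found, latest_txn_id, updates)."""
        latest = self.next_txn - 1
        oldest = self.history[0][0] if self.history else self.next_txn
        if last_txn is not None and oldest - 1 <= last_txn <= latest:
            delta = {}
            for txn_id, changes in self.history:
                if txn_id > last_txn:
                    delta.update(changes)
            return True, latest, delta
        # Point not in history: fall back to sending everything.
        return False, latest, dict(self.rows)


class Client:
    """Toy client that caches rows and remembers the last txn it saw."""

    def __init__(self):
        self.cache = {}
        self.last_txn = None

    def resync(self, server):
        found, self.last_txn, updates = server.monitor_since(self.last_txn)
        if not found:
            self.cache.clear()  # A full dump replaces the cache.
        for key, value in updates.items():
            if value is None:
                self.cache.pop(key, None)
            else:
                self.cache[key] = value
        return found, len(updates)


server = Server()
server.commit({"port1": "up"})
server.commit({"port2": "up"})

client = Client()
first = client.resync(server)    # initial connection: full dump
server.commit({"port3": "up"})   # change made while the client was away
second = client.resync(server)   # reconnect: only the delta is sent
```

On reconnect the client receives one row instead of the whole table, which is why a restart or failover no longer forces every client to download the full database.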
OVSDB Clustered Mode
● Raft based clustering (experimental support since v2.9)
● Problems at scale
○ High CPU load (solved in master)
○ Follower update latency (solved in master)
○ Leader flapping (WIP, workaround ready)
○ Client reconnect (solved in master)
OVSDB Clustered Mode - High CPU
● OVSDB Raft Implementation
○ Preprocessing on followers before sending to leader, to share some of the leader's load
○ Send preprocessed transaction to leader together with a prerequisite version ID
● Problem
○ Lots of prerequisite check failures and retries at large scale
■ Different HVs update chassis/port_binding at the same time through different follower nodes
○ Continuous retry causes 100% CPU
● Solution (in master -> v2.12)
○ Retry only when the follower has applied the largest local Raft log index
■ Otherwise, the prerequisite is already out-of-date, so don’t waste CPU
OVSDB Clustered Mode - Follower Latency
● Original behavior: leader sends Raft log update to follower nodes when:
○ A new change is proposed, or
○ A heartbeat is sent
● Problem
○ Updates sent through a follower node suffer high latency
● Solution (in master -> v2.12)
○ Send log to followers as soon as a new entry is committed
● Test result: 100 updates through same follower from same client
○ Before: >30 sec
○ After: 500 ms
OVSDB Clustered Mode - Leader Flapping
● Problem: heartbeat timeout, triggering re-election
○ Large transaction execution
○ Raft log compression (snapshot)
● Solution
○ Quick and dirty: Increase election timeout (hardcode)
○ Short term: Make election timeout configurable at cluster level (WIP)
○ Longer term: Separate thread for Raft RPC (WIP)
■ Still need to configure timeout for snapshot scenarios
OVSDB Clustered Mode - Client Reconnect
● Problem: during leader failover, all clients of the new leader will reconnect
○ DB state changes to “disconnected” when there is (temporarily) no leader
○ Client tries to reconnect to a new node
● Solution (in master -> v2.12)
○ Don’t change state to “disconnected” if
■ Current node is candidate, and
■ Election has not timed out yet
Scale Test for Clustered Mode
● Setup
○ 3-node cluster, 1k HVs
○ Election timeout: 10s (hardcoded in the test)
● Test
○ Keep creating and binding ports up to 10k
○ Periodically kill->wait(10s)->start each ovsdb-server randomly
● Test passed at scale!
○ All port creation and binding completed correctly.
○ Fast-resync helped!
Further Improvement: SB-DB Scale-out Replicas (TODO)
● How to support more HVs - 2k? 5k? 10k?
○ More nodes in cluster? Doesn’t scale.
○ Multi-threading OVSDB? Would help, but...
● Precondition: no write to SB from HV
○ Chassis/Encap/Port-binding update by CMS/northd only
○ Does not use dynamic ARP (mac-binding)
● How
○ Use replication mode of OVSDB to create N read-only replicas
○ Shard HV connections across the read-only replicas
○ HV can fail over to other replicas
[Figure: the CMS writes to NB ovsdb; Northd instances write to SB ovsdb; SB Replica 1 ... SB Replica n replicate from SB ovsdb, and each replica serves its own group of HVs.]
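One possible way to shard HV connections across replicas with failover is sketched below. The slide does not say how sharding would be done; rendezvous hashing is a technique chosen here for illustration, and the function names, replica addresses, and the `is_alive` callback are hypothetical.

```python
import hashlib


def replica_order(chassis, replicas):
    """Deterministic per-chassis preference order over the replicas."""
    def weight(replica):
        # Hash of (chassis, replica); each chassis gets its own ordering.
        return hashlib.sha256(f"{chassis}|{replica}".encode()).hexdigest()
    return sorted(replicas, key=weight)


def pick_replica(chassis, replicas, is_alive):
    """Return the first live replica in the chassis's preference order."""
    for replica in replica_order(chassis, replicas):
        if is_alive(replica):
            return replica
    return None


# Hypothetical replica endpoints.
replicas = ["tcp:192.0.2.1:6642", "tcp:192.0.2.2:6642", "tcp:192.0.2.3:6642"]

order = replica_order("hv-1", replicas)
primary = pick_replica("hv-1", replicas, is_alive=lambda r: True)
backup = pick_replica("hv-1", replicas, is_alive=lambda r: r != order[0])
nothing = pick_replica("hv-1", replicas, is_alive=lambda r: False)
```

With this scheme each HV computes its own replica locally, and when a replica fails only the HVs that preferred it move elsewhere.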
OVN-Northd Scaling Challenges
● Factors
○ Size of data
○ Rate of changes
● Problems
○ Recompute
OVN-Northd Incremental Processing (WIP from community)
● OVN-Northd is a perfect target user of Differential Datalog (DDlog)
○ Inputs - NB DB tables (logical routers, switch, port, etc.)
○ Outputs - SB DB tables (logical flows, port-bindings, etc.)
○ Rules to convert inputs to outputs
● Differential Datalog
○ An open-source datalog language for incremental data-flow processing
○ Defining inputs and outputs as relations
○ Defining rules to generate outputs from inputs
● Efforts can be reused by OVN-Controller
○ OVSDB - DDlog wrappers
○ Process framework changes
Recap Scaling Bottlenecks
● OVN-Northd
● OVN-SB DB
● OVN-Controller
Some More Scaling Problems
● Security Group / Network policy using ACLs
● Nested workloads (K8S containers)
ACLs
● Used by Security Group (OpenStack) / Network Policy (K8S)
● Typical use case: members of same group are allowed to access each other
● Naked (IP addresses listed in every ACL) => O(N^2)
outport == <port1_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
outport == <port2_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
...
outport == <portN_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
● Using Address Set => O(N)
outport == <port1_uuid> && ip4 && ip4.src == $as_ip4_sg1
outport == <port2_uuid> && ip4 && ip4.src == $as_ip4_sg1
...
outport == <portN_uuid> && ip4 && ip4.src == $as_ip4_sg1
● #Flows in OVS is always O(M*N) (M = number of ports on the HV)
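The O(N^2) versus O(N) difference in ACL data can be checked with a short script. This is an illustrative sketch: the helper names, the port names, and the 10.0.0.x addresses are made up, and the match strings simply mirror the shape of the examples above.

```python
def naked_acls(ports, ips):
    """One ACL per port, each embedding the full list of member IPs."""
    ip_list = "{" + ", ".join(ips) + "}"
    return [f"outport == <{p}> && ip4 && ip4.src == {ip_list}" for p in ports]


def address_set_acls(ports, set_name):
    """One ACL per port, each referring to a shared address set by name."""
    return [f"outport == <{p}> && ip4 && ip4.src == ${set_name}" for p in ports]


n = 100
ports = [f"port{i}_uuid" for i in range(1, n + 1)]
ips = [f"10.0.0.{i}" for i in range(1, n + 1)]

naked = naked_acls(ports, ips)
with_set = address_set_acls(ports, "as_ip4_sg1")

# Count how many times an IP address is stored across the configuration.
naked_ip_refs = sum(match.count("10.0.0.") for match in naked)
set_ip_refs = len(ips) + sum(match.count("10.0.0.") for match in with_set)
```

For N = 100 members, the naked form stores 10,000 addresses across the ACLs, while the address-set form stores each address once.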
Solution - Port Group (Released in v2.10)
● All-in-one
outport == @port_group1 && ip4 && ip4.src == $port_group1_ip4
○ @port_group1: CMS creates port-group instead of address-set
○ $port_group1_ip4: OVN-Northd generates address-set for you
● Greatly simplified CMS Implementation
○ networking-ovn
○ ovn-kubernetes
● Enables more efficient OVS flow generation with conjunction, when multiple ports on the same HV belong to the same port-group
○ E.g.
■ N members in a port-group, all M ports on HV1 belong to this group
■ Number of OVS flows on HV1 will be M + N, instead of M * N
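The M + N versus M * N arithmetic can be made concrete with a small count. The flow strings below are schematic, not real ovs-ofctl syntax, and the function names are assumptions. The conjunctive variant follows the general shape of OpenFlow conjunctive matching in OVS, which also needs one extra flow that matches the conjunction ID, so the exact count in this model is M + N + 1.

```python
from itertools import product


def cross_product_flows(ports, ips):
    """One flow per (port, source IP) pair: M * N flows."""
    return [f"outport={p},ip_src={ip} actions=allow"
            for p, ip in product(ports, ips)]


def conjunctive_flows(ports, ips, conj_id=1):
    """One flow per port, one per source IP, plus one for the conjunction."""
    flows = [f"outport={p} actions=conjunction({conj_id},1/2)" for p in ports]
    flows += [f"ip_src={ip} actions=conjunction({conj_id},2/2)" for ip in ips]
    flows.append(f"conj_id={conj_id} actions=allow")
    return flows


m, n = 10, 1000  # M local ports on the HV, N members in the port-group
ports = [f"port{i}" for i in range(m)]
ips = [f"ip{i}" for i in range(n)]

cross = cross_product_flows(ports, ips)
conj = conjunctive_flows(ports, ips)
```

With 10 local ports and 1,000 group members, the cross product needs 10,000 flows while the conjunctive form needs 1,011.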
Further Improvement - Group-ID in Packet (TODO)
● Problem - still too many OVS flows
○ Best case: M + N, if all M ports on HV belong to the same group.
○ Worst case: M * N, if ports are distributed randomly.
■ M ports on HV, each belongs to a different group, each group has N members
● Solution (just an idea)
○ Encoding port-group in tunnel metadata
outport == @port_group1 && src_group_id == <group1 id>
(src_group_id is read from tunnel metadata)
■ Only M flows in all cases
■ Best part: no local flow change needed for remote member changes
○ Challenge: what if a port belongs to multiple groups
■ Limit the number of groups for a single port
■ Fall back to the old way if the limit is exceeded
○ Limitation: works for ingress (to-lport) rules only
Scaling Nested Workloads
● Use Case
○ VM overlay networking with OVN (e.g. using OpenStack networking-ovn)
○ Run Kubernetes on top of the VMs
● Problem
○ How to connect the pods at scale?
ARP Proxy
● OVN doesn’t support MAC-learning (MAC-Port binding learning), but IP-MAC binding can be learned through ARP
● How
○ LR sends ARP requests for Pod IPs
○ ARP proxy in the VM replies with VM’s MAC for all Pod IPs on the VM
● Works, but
○ Requires VM and Pods on same subnet
○ Unreliable when SB DB connection fails
○ Scale: O(N), N = number of pods, usually much bigger than number of VMs
■ Note: IP-MAC Binding incremental processing change handler is implemented - no re-compute.
[Figure: a VM at 10.0.0.2 (aa:bb:cc:dd:ee:ff) on an HV runs an ARP proxy and hosts Pods 10.0.0.102 to 10.0.0.105. The OVN Controller on the HV records the learned entries in the SB IP-MAC Binding Table.]
LR ARP Cache (dynamic):
10.0.0.102 => aa:bb:cc:dd:ee:ff
10.0.0.103 => aa:bb:cc:dd:ee:ff
10.0.0.104 => aa:bb:cc:dd:ee:ff
...
LR Static Route
● Assign Pod subnet(s) per VM (minion)
● How
○ Configure static routes in OVN LR for pod subnets: next hop = VM IP
● Considerations
○ De-couples VM and Pod subnets
○ Declarative, more reliable than ARP
○ May waste more IPs, but size of subnet is flexible
○ Scale: O(S), S = number of pod subnets
■ Worst case O(N), N = number of pods, if subnet size is /32.
[Figure: a VM at 172.0.0.2/24 on an HV hosts Pods 10.0.0.2/25 to 10.0.0.5/25.]
LR Routing Table (static):
10.0.0.0/25 => 172.0.0.2
10.0.0.128/25 => 172.0.1.100
10.0.0.1/25 => 172.0.1.3
...
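How a logical router would pick the VM for a pod address from such a table can be sketched as a longest-prefix-match lookup. This is a plain Python illustration, not OVN code: `ROUTES` and `next_hop` are hypothetical names, and only the first two routes from the table above are used, because 10.0.0.1/25 has host bits set and Python's `ipaddress.ip_network` rejects it in its default strict mode.

```python
import ipaddress

# Static routes: pod subnet -> next hop (the VM that hosts the subnet).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/25"), ipaddress.ip_address("172.0.0.2")),
    (ipaddress.ip_network("10.0.0.128/25"), ipaddress.ip_address("172.0.1.100")),
]


def next_hop(dst, routes=ROUTES):
    """Return the next hop of the most specific route containing dst."""
    dst = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in routes if dst in net]
    if not matches:
        return None
    # Longest prefix wins when several routes contain the destination.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


hop_for_pod = next_hop("10.0.0.3")
hop_for_other_pod = next_hop("10.0.0.200")
hop_for_unknown = next_hop("192.168.1.1")
```

The table size depends on the number of pod subnets rather than the number of pods, which is the O(S) scaling noted above.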
References
● OVS/OVN
○ http://www.openvswitch.org/
● Networking-OVN
○ https://docs.openstack.org/networking-ovn/latest/
● OVN-Kubernetes
○ https://github.com/openvswitch/ovn-kubernetes/
● OVN-Scale-Test
○ https://github.com/openvswitch/ovn-scale-test
● GO-OVN library
○ https://github.com/eBay/go-ovn

Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
addressing modes in computer architecture
addressing modes  in computer architectureaddressing modes  in computer architecture
addressing modes in computer architecture
ShahidSultan24
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 


Large scale overlay networks with ovn: problems and solutions

  • 1. Large Scale Overlay Networks with OVN: Problems and Solutions Han Zhou (hzhou8@ebay.com) Open Infrastructure Summit - Denver, 2019
  • 2. Agenda
      ● Background
      ● Control-plane components scaling
          ○ OVN-Controller
          ○ South-bound DB
          ○ OVN-Northd
      ● Scaling ACL
      ● Scaling nested workloads (containers on VMs)
  • 3. Background of OVN
      ● SDN solution developed by the OVS (Open vSwitch) community
      ● OpenStack support - neutron ML2 plugin: networking-ovn
      ● Kubernetes support - CNI plugin: ovn-kubernetes
      ● Main Features
          ○ Full L2/L3 virtualization with overlay networks (Geneve, STT, VXLAN)
          ○ L2 gateway, L3 gateway (centralized/distributed) & NAT with HA
          ○ L4 ACLs (stateful FW) with address-set, port-group and packet logging
          ○ Distributed Load-Balancer
          ○ L2/L3 Port-security
          ○ ARP responder, static/dynamic ARP
          ○ Flat/VLAN physical networks
          ○ Native DHCP, Metadata
          ○ Parent-child ports for nested workloads
          ○ QoS
          ○ IPsec
          ○ Policy-based routing
          ○ ...
  • 4. Distributed Control Plane
      ● Logical/physical separation
      ● Distributed local controllers
      ● Database Approach (ovsdb)
      [Figure: OVN architecture. The CMS (OpenStack/K8S) writes virtual network abstractions to the North-bound ovsdb; Northd (Central) translates them into logical flows in the South-bound ovsdb; OVN-Controller on each HV and GW turns logical flows into OpenFlows for OVS. Components communicate over the OVSDB protocol (RFC7047).]
  • 5. OVN-Controller Scaling Challenges
      ● Factors
          ○ Size of data
          ○ Rate of changes
      ● Challenges
          ○ Large amount of data to be processed
              ■ E.g. 10k logical ports generate >40k logical flows and 10k port-bindings
          ○ Logical flow parsing is CPU intensive
          ○ Cloud workload changes frequently
          ○ Lots of inputs for flow computation
      [Figure: OVN architecture diagram, repeated from slide 4]
  • 6. Dependency Graph of OVN-Controller
      [Figure: dependency graph. Node labels recovered from the diagram:]
      ● SB OVSDB input: logical_flow, chassis, encap, mc_group, dp_binding, port_binding, mac_binding, dhcp, dhcpv6, dns, gw_chassis, addr_set, port_group
      ● Local OVSDB input: OVS qos, OVS open_vswitch, OVS bridge, OVS port
      ● Intermediate nodes
          ○ Address Sets (converted)
          ○ Port Groups (converted)
          ○ MFF OVN Geneve
          ○ Runtime Data: Local_datapath, Local_lports, Local_lport_ids, Active_tunnels, Ct_zone_bitmap, Pending_ct_zones, Ct_zones
      ● Flow Output: Desired_flow_table, Group_table, Meter_table, Conj_id_ofs
  • 7. Original Approach - Recomputing
      ● Compute OVS flows by reprocessing all inputs when
          ○ Any input changes
          ○ Or even when there is no change at all (just unrelated events)
      ● Benefit
          ○ Relatively easy to implement and maintain
      ● Problems
          ○ 100% CPU usage by the ovn-controller process on all compute nodes
          ○ High control plane latency
  • 8. Solution - Incremental Processing Engine
      ● DAG representing dependencies
      ● Each node contains
          ○ Data
          ○ Links to input nodes
          ○ Change-handler for each input
          ○ Full recompute handler
      ● Engine
          ○ DFS post-order traverse the DAG from the final output node
          ○ Invoke change-handlers for inputs that changed
          ○ Fall back to recompute if for ANY of its inputs:
              ■ Change-handler is not implemented for that input, or
              ■ Change-handler cannot handle the particular change (returns false)
      [Figure: example DAG with input, intermediate, and output nodes]
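
The engine rules on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the ovn-controller implementation (which is written in C): the `Node` class, the `run` function, the `delta` field, and the two toy inputs are assumptions made for the example, and the sample change handler only understands additions.

```python
class Node:
    """One node of the dependency DAG (hypothetical structure for illustration)."""

    def __init__(self, name, inputs=(), handlers=None, recompute=None):
        self.name = name
        self.inputs = list(inputs)        # links to input nodes
        self.handlers = handlers or {}    # input name -> change handler(node, input) -> bool
        self.recompute = recompute        # full recompute handler(node)
        self.data = set()
        self.delta = set()                # what changed since the last run (set by the caller)
        self.changed = False


def run(node, log, seen=None):
    """DFS post-order: bring all inputs up to date, then update this node."""
    seen = set() if seen is None else seen
    if id(node) in seen or not node.inputs:
        return                            # leaf inputs are marked changed by the caller
    seen.add(id(node))
    for inp in node.inputs:
        run(inp, log, seen)
    node.changed = False
    changed = [i for i in node.inputs if i.changed]
    if not changed:
        return
    node.changed = True
    # Incremental only if every changed input has a handler and every handler succeeds.
    handled = all(i.name in node.handlers for i in changed) and all(
        node.handlers[i.name](node, i) for i in changed)
    if not handled:
        node.recompute(node)              # fall back to full recompute
    log.append((node.name, "incremental" if handled else "recompute"))


def recompute_flows(node):
    ports, lflows = node.inputs
    node.data = {(f, p) for f in lflows.data for p in ports.data}


def on_port_change(node, ports):
    # Toy handler: only handles newly added ports.
    lflows = node.inputs[1]
    node.data |= {(f, p) for f in lflows.data for p in ports.delta}
    return True


ports = Node("port_binding")
lflows = Node("logical_flow")
output = Node("flow_output", [ports, lflows],
              {"port_binding": on_port_change}, recompute_flows)

log = []
ports.data, lflows.data = {"p1"}, {"f1", "f2"}
ports.changed = lflows.changed = True     # first run: everything is new
run(output, log)                          # logical_flow has no handler -> recompute

ports.changed = lflows.changed = False
ports.data |= {"p2"}
ports.delta = {"p2"}
ports.changed = True
run(output, log)                          # only port_binding changed -> incremental
```

In this sketch the first run recomputes because `logical_flow` has no change handler, while the second run touches only the entries derived from the new port.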
  • 9. Change Handler Implemented
      [Figure: dependency graph of OVN-Controller (as on slide 6), with the inputs that have a change handler implemented highlighted]
  • 10. CPU Efficiency Improvement
      ● Create and bind 10k ports on 1k HVs
          ○ Simulated 1k HVs on 20 BMs x 40 cores (2.50GHz)
          ○ 10k ports all under the same logical router
          ○ Batch size 100 lports
          ○ Bind ports one by one within each batch
          ○ Wait for all ports to be up before the next batch
  • 11. Latency Improvement
      ● End-to-end latency on top of 10k existing logical ports
          ○ Create one more logical port and bind the port on an HV
          ○ Wait until northd generates lflows and creates the port-binding in SB
          ○ Wait until ovn-controller claims the port on the HV
          ○ Wait until northd generates all lflows
          ○ Wait until OVS flows are programmed on all HVs
  • 12. Tests at Larger Scale
      ● Next bottlenecks:
          ○ OVS flow installation
          ○ Port-binding handling when the binding happens locally
  • 13. What’s next for Incremental-Processing (WIP)
      ● Incremental flow installation
          ○ Low hanging fruit - with the help of incremental flow computing
      ● Implement more change handlers as needed
          ○ E.g. support incremental processing when port-binding happens locally - further improve end-to-end latency
      ● New implementation: Differential Datalog (DDlog)
          ○ Data-flow approach
          ○ Reuse the effort taken for Northd improvement (will be discussed in Northd scaling)
      ● Upstream?
          ○ Not in upstream, because DDlog is the preferred long term solution
          ○ For those who need this:
              ■ Rebased on Master: https://github.com/hzhou8/ovs/tree/ovn-controller-inc-proc
              ■ Rebased on 2.11: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.11
              ■ Rebased on 2.10: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.10
  • 14. OVN-Controller Other Improvements (WIP)
      ● Reduce data size per-HV
          ○ Problem: External Provider Network connects everything
          ○ Solution: Don’t cross the external network boundary when calculating connected datapaths
      ● On-demand tunnel port creation
          ○ Problem: Too many OVS ports when there are a lot of HVs
          ○ Solution: Create a tunnel to a remote host only if that host has ports that are logically connected to local ports
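
Both ideas on this slide amount to a bounded graph walk over datapaths. The sketch below is a simplified model under stated assumptions, not the ovn-controller code: `links` (datapath adjacency), `bindings` (datapath, chassis pairs), and the rule "include the external network itself but do not walk past it" are all hypothetical choices made for the illustration.

```python
from collections import deque


def connected_datapaths(start, links, external=frozenset()):
    """BFS over logically connected datapaths, stopping at external networks."""
    seen, queue = set(start), deque(start)
    while queue:
        dp = queue.popleft()
        if dp in external:
            continue                      # do not cross the external network boundary
        for peer in links.get(dp, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def needed_tunnels(local_chassis, bindings, links, external=frozenset()):
    """Remote chassis that host ports on datapaths reachable from local ports."""
    local = {dp for dp, ch in bindings if ch == local_chassis}
    reachable = connected_datapaths(local, links, external)
    return {ch for dp, ch in bindings if dp in reachable and ch != local_chassis}


# Toy topology: ls1 - lr1 - ext - lr2 - ls2
links = {
    "ls1": ["lr1"], "lr1": ["ls1", "ext"],
    "ext": ["lr1", "lr2"], "lr2": ["ext", "ls2"], "ls2": ["lr2"],
}
bindings = [("ls1", "hv1"), ("ls1", "hv2"), ("ls2", "hv3")]

everything = needed_tunnels("hv1", bindings, links)
bounded = needed_tunnels("hv1", bindings, links, external={"ext"})
```

With the boundary in place, `hv1` only needs a tunnel to `hv2`; without it, the provider network pulls in `hv3` as well.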
  • 15. SB DB Scaling Challenges
      ● Factors
          ○ Number of clients (HVs & GWs)
          ○ Size of data
          ○ Rate of changes
      ● Problems
          ○ Probe handling
          ○ Data resync during restart/failover
          ○ Clustered-mode problems
      [Figure: OVN architecture diagram, repeated from slide 4]
  • 16. SB DB Probe
      ● Default 5 sec probe interval causes connection flapping
          ○ Ovsdb-server response can occasionally exceed 5 sec
              ■ DB log compression
              ■ Large transaction handling
          ○ Reconnecting clients add more load to the server - cascading failure
              ■ Clients resync data from server (solved - see next slide)
      ● Solution
          ○ Increase probe interval
              ■ Client side (on HVs)
                  ● ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000
              ■ Server side (DON’T FORGET!!)
                  ● ovn-sbctl -- --id=@conn_uuid create Connection target="ptcp:6642:0.0.0.0" inactivity_probe=0 -- set SB_Global . connections=@conn_uuid
          ○ Rely on external monitoring for HV connectivity
  • 17. Data re-sync during DB reconnect
      ● Problem
          ○ OVSDB client caching => NOT a problem
          ○ Server restart/failover: re-sync data for all clients. => This is the problem!
      ● Solution - OVSDB fast re-sync (in master -> v2.12)
          ○ Track and maintain recent transaction history on disk and in memory.
          ○ New method monitor_cond_since in the OVSDB protocol, to request only the changes since the last point before the connection was lost.
          ○ Note: it currently works for clustered mode only.
      ● Test Result - 1k HVs, 10k ports
          ○ Before: SB DB 100% CPU, >30 min to recover.
          ○ After: No CPU spike, all connections restored in <1 min (probe interval).
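
The idea behind fast re-sync can be illustrated with a toy server that keeps a bounded transaction history. This is a sketch under simplifying assumptions, not the ovsdb-server implementation: the `Server` class and `monitor_since` method are hypothetical names, transaction IDs are plain integers here rather than UUIDs, and only row inserts/updates are modeled (no deletes, no monitor conditions).

```python
from collections import OrderedDict


class Server:
    """Toy database server that remembers its most recent transactions."""

    def __init__(self, history_limit=100):
        self.rows = {}
        self.history = OrderedDict()      # txn id -> {key: value} written by that txn
        self.last_txn = 0
        self.history_limit = history_limit

    def transact(self, changes):
        self.last_txn += 1
        self.rows.update(changes)
        self.history[self.last_txn] = dict(changes)
        while len(self.history) > self.history_limit:
            self.history.popitem(last=False)   # drop the oldest transaction
        return self.last_txn

    def monitor_since(self, last_txn_id):
        """Return (found, last_txn, updates) for a reconnecting client."""
        if last_txn_id == self.last_txn:
            return True, self.last_txn, {}     # client is already up to date
        if last_txn_id is not None and (last_txn_id + 1) in self.history:
            updates = {}
            for txn, changes in self.history.items():
                if txn > last_txn_id:
                    updates.update(changes)
            return True, self.last_txn, updates
        # History does not reach back far enough: send the full contents.
        return False, self.last_txn, dict(self.rows)


server = Server(history_limit=2)
server.transact({"a": 1})
server.transact({"b": 2})
server.transact({"c": 3})

recent = server.monitor_since(1)      # client saw txn 1: history covers the gap
stale = server.monitor_since(0)       # txn 1 already dropped: full re-sync
fresh = server.monitor_since(None)    # brand-new client: full re-sync
```

A client that reconnects with a recent enough transaction ID receives only the delta, which is what removes the full-table download for every HV after a server restart or failover.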
  • 18. OVSDB Clustered Mode
      ● Raft based clustering (experimental support since v2.9)
      ● Problems at scale
          ○ High CPU load (solved in master)
          ○ Follower update latency (solved in master)
          ○ Leader flapping (WIP, workaround ready)
          ○ Client reconnect (solved in master)
  • 19. OVSDB Clustered Mode - High CPU
      ● OVSDB Raft Implementation
          ○ Preprocessing on followers before sending to leader - shares some load with the leader
          ○ Send preprocessed transaction to leader together with a prerequisite version ID
      ● Problem
          ○ Lots of prerequisite check failures and retries at large scale
              ■ Different HVs update chassis/port_binding at the same time through different follower nodes
          ○ Continuous retry causes 100% CPU
      ● Solution (in master -> v2.12)
          ○ Retry only when the follower has applied the largest local Raft log index
              ■ Otherwise, the prerequisite is already out-of-date, so don’t waste CPU
  • 20. OVSDB Clustered Mode - Follower Latency
      ● Original behavior: leader sends Raft log update to follower nodes when:
          ○ A new change is proposed, or
          ○ A heartbeat is sent
      ● Problem
          ○ Updates from a follower node suffer high latency
      ● Solution (in master -> v2.12)
          ○ Send log to followers as soon as a new entry is committed
      ● Test result: 100 updates through same follower from same client
          ○ Before: >30 sec
          ○ After: 500 ms
  • 21. OVSDB Clustered Mode - Leader Flapping
      ● Problem: heartbeat timeout, triggering re-election
          ○ Large transaction execution
          ○ Raft log compression (snapshot)
      ● Solution
          ○ Quick and dirty: Increase election timeout (hardcoded)
          ○ Short term: Make election timeout configurable at cluster level (WIP)
          ○ Longer term: Separate thread for Raft RPC (WIP)
              ■ Still need to configure timeout for snapshot scenarios
  • 22. OVSDB Clustered Mode - Client Reconnect
      ● Problem: during leader failover, all clients of the new leader will reconnect
          ○ DB state changes to “disconnected” when there is (temporarily) no leader
          ○ Client tries to reconnect to a new node
      ● Solution (in master -> v2.12)
          ○ Don’t change state to “disconnected” if
              ■ Current node is candidate, and
              ■ Election has not timed out yet
  • 23. Scale Test for Clustered Mode
      ● Setup
          ○ 3-node cluster, 1k HVs
          ○ Election timeout: 10s (hardcoded in the test)
      ● Test
          ○ Keep creating and binding ports up to 10k
          ○ Periodically kill->wait(10s)->start each ovsdb-server randomly
      ● Test passed at scale!
          ○ All port creation and binding completed correctly.
          ○ Fast-resync helped!
  • 24. Further Improvement: SB-DB Scale-out Replicas (TODO)
      ● How to support more HVs - 2k? 5k? 10k?
          ○ More nodes in cluster? Doesn’t scale.
          ○ Multi-threading OVSDB? Would help, but...
      ● Precondition: no write to SB from HV
          ○ Chassis/Encap/Port-binding updated by CMS/northd only
          ○ Does not use dynamic ARP (mac-binding)
      ● How
          ○ Use replication mode of OVSDB to create N read-only replicas
          ○ Shard HV connections across the read-only replicas
          ○ HV can fail over to other replicas
      [Figure: CMS -> NB ovsdb -> Northd -> SB ovsdb -> SB Replica 1 ... SB Replica n, with each replica serving its own group of HVs]
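
The slide leaves open how HV connections would be sharded across replicas. One possible scheme, shown here purely as an assumption, is rendezvous (highest-random-weight) hashing: each HV derives its own stable preference order over the replicas, so HVs spread across replicas and an HV whose replica fails moves to the next one in its list. The function names and the replica and HV names are hypothetical.

```python
import hashlib


def replica_order(hv_name, replicas):
    """Stable, per-HV preference order over replicas (rendezvous hashing)."""
    def score(replica):
        return hashlib.sha256(f"{hv_name}|{replica}".encode()).hexdigest()
    return sorted(replicas, key=score)


def pick_replica(hv_name, replicas, is_up):
    """First replica in the HV's preference order that is currently reachable."""
    for replica in replica_order(hv_name, replicas):
        if is_up(replica):
            return replica
    return None


replicas = ["sb-replica-1", "sb-replica-2", "sb-replica-3"]
order = replica_order("hv-17", replicas)
primary = pick_replica("hv-17", replicas, lambda r: True)
fallback = pick_replica("hv-17", replicas, lambda r: r != primary)
```

Because the order depends only on the HV and replica names, no central assignment table is needed, and a failed replica affects only the HVs that preferred it.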
  • 25. OVN-Northd Scaling Challenges
      ● Factors
          ○ Size of data
          ○ Rate of changes
      ● Problems
          ○ Recompute
      [Figure: OVN architecture diagram, repeated from slide 4]
  • 26. OVN-Northd Incremental Processing (WIP from community)
      ● OVN-Northd is a perfect target user of Differential Datalog (DDlog)
          ○ Inputs - NB DB tables (logical routers, switch, port, etc.)
          ○ Outputs - SB DB tables (logical flows, port-bindings, etc.)
          ○ Rules to convert inputs to outputs
      ● Differential Datalog
          ○ An open-source datalog language for incremental data-flow processing
          ○ Defining inputs and outputs as relations
          ○ Defining rules to generate outputs from inputs
      ● Efforts can be reused by OVN-Controller
          ○ OVSDB - DDlog wrappers
          ○ Process framework changes
  • 27. Recap Scaling Bottlenecks
      ● OVN-Northd
      ● OVN-SB DB
      ● OVN-Controller
      [Figure: OVN architecture diagram, repeated from slide 4]
  • 28. Some More Scaling Problems ● Security Group / Network policy using ACLs ● Nested workloads (K8S containers)
  • 29. ACLs
      ● Used by Security Group (OpenStack) / Network Policy (K8S)
      ● Typical use case: members of the same group are allowed to access each other
      ● Naked (explicit IP list in every ACL) => O(N^2)
            outport == <port1_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
            outport == <port2_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
            ...
            outport == <portN_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
      ● Using Address Set => O(N)
            outport == <port1_uuid> && ip4 && ip4.src == $as_ip4_sg1
            outport == <port2_uuid> && ip4 && ip4.src == $as_ip4_sg1
            ...
            outport == <portN_uuid> && ip4 && ip4.src == $as_ip4_sg1
      ● #Flows in OVS is always O(M*N) (M = number of ports on the HV)
  • 30. Solution - Port Group (Released in v2.10)
      ● All-in-one
            outport == @port_group1 && ip4 && ip4.src == $port_group1_ip4
          ○ CMS creates port-group instead of address-set
          ○ OVN-Northd generates address-set for you
      ● Greatly simplified CMS Implementation
          ○ networking-ovn
          ○ ovn-kubernetes
      ● Enables more efficient OVS flow generation with conjunction, when multiple ports on the same HV belong to the same port-group
          ○ E.g.
              ■ N members in a port-group, all M ports on HV1 belong to this group
              ■ Number of OVS flows on HV1 will be M + N, instead of M * N
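
To make the size difference between the three ACL styles concrete, the following sketch generates the match strings for each style and counts how many IP literals the CMS has to write into ACLs. The helper names, port names, and addresses are made up for the example; the match syntax follows the expressions shown on the slides.

```python
def inline_acls(ports, ips):
    """One ACL per port, each repeating the full member IP list."""
    ip_set = "{" + ", ".join(ips) + "}"
    return [f'outport == "{p}" && ip4 && ip4.src == {ip_set}' for p in ports]


def address_set_acls(ports, set_name):
    """One ACL per port, all referring to a shared address set."""
    return [f'outport == "{p}" && ip4 && ip4.src == ${set_name}' for p in ports]


def port_group_acl(group):
    """A single ACL for the whole group; northd derives the address set."""
    return [f"outport == @{group} && ip4 && ip4.src == ${group}_ip4"]


n = 50
ports = [f"port{i}" for i in range(n)]
ips = [f"10.0.0.{i + 1}" for i in range(n)]

inline = inline_acls(ports, ips)
with_set = address_set_acls(ports, "as_ip4_sg1")
with_group = port_group_acl("port_group1")

# Count IP literals embedded in the ACLs themselves.
inline_ip_refs = sum(acl.count("10.0.0.") for acl in inline)
set_ip_refs = sum(acl.count("10.0.0.") for acl in with_set)
```

With 50 members, the inline style embeds 2500 IP literals across 50 ACLs, the address-set style needs 50 ACLs plus one 50-entry set, and the port-group style needs a single ACL.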
  • 31. Further Improvement - Group-ID in Packet (TODO)
      ● Problem - still too many OVS flows
          ○ Best case: M + N, if all M ports on the HV belong to the same group.
          ○ Worst case: M * N, if ports are distributed randomly.
              ■ M ports on HV, each belongs to a different group, each group has N members
      ● Solution (just an idea)
          ○ Encode the port-group in tunnel metadata
                outport == @port_group1 && src_group_id == <group1 id>   (src_group_id comes from tunnel metadata)
              ■ Only M flows in all cases
              ■ Best part: no local flow change needed for remote member changes
          ○ Challenge: what if a port belongs to multiple groups
              ■ Limit the number of groups for a single port
              ■ Fall back to the old way if the limit is exceeded
          ○ Limitation: works for ingress (to-lport) rules only
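
The best-case and worst-case numbers above can be reproduced with a small counting model. This is a rough approximation under stated assumptions, not how ovn-controller counts flows: it ignores the extra flow that matches on the conjunction ID, assumes every group has the same number of members N, and assumes conjunction only pays off when more than one local port shares a group. The function and mode names are hypothetical.

```python
from collections import Counter


def flows_per_hv(local_port_groups, n_members, mode):
    """Approximate OVS flow count on one HV.

    local_port_groups: the port-group name of each local port (one entry per port).
    mode: "cross_product", "conjunction", or "group_id" (hypothetical labels).
    """
    if mode == "group_id":
        return len(local_port_groups)           # one flow per local port
    total = 0
    for m in Counter(local_port_groups).values():
        if mode == "conjunction" and m > 1:
            total += m + n_members              # ports clause + source-IP clause
        else:
            total += m * n_members              # every (port, source IP) pair
    return total


M, N = 10, 100
same_group = ["pg1"] * M
different_groups = [f"pg{i}" for i in range(M)]

best = flows_per_hv(same_group, N, "conjunction")
worst = flows_per_hv(different_groups, N, "conjunction")
with_group_id = flows_per_hv(different_groups, N, "group_id")
```

For M = 10 and N = 100 the model gives 110 flows in the best case, 1000 in the worst case, and 10 with the group ID carried in tunnel metadata.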
  • 32. Scaling Nested Workloads ● Use Case ○ VM overlay networking with OVN (e.g. using OpenStack networking-ovn) ○ Run Kubernetes on top of the VMs ● Problem ○ How to connect the pods at scale?
  • 33. ARP Proxy
      ● OVN doesn’t support MAC-learning (MAC-Port binding learning), but IP-MAC binding can be learned through ARP
      ● How
          ○ LR sends ARP requests for Pod IPs
          ○ ARP proxy in the VM replies with the VM’s MAC for all Pod IPs on the VM
      ● Works, but
          ○ Requires VM and Pods on the same subnet
          ○ Unreliable when SB DB connection fails
          ○ Scale: O(N), N = number of pods, usually much bigger than number of VMs
              ■ Note: IP-MAC Binding incremental processing change handler is implemented - no re-compute.
      [Figure: an HV running OVS and OVN Controller hosts a VM at 10.0.0.2 (aa:bb:cc:dd:ee:ff) with an ARP proxy and pods 10.0.0.102, 10.0.0.103, 10.0.0.104, 10.0.0.105; OVN Controller populates the SB IP-MAC Binding Table]
      LR ARP Cache (dynamic):
          10.0.0.102 => aa:bb:cc:dd:ee:ff
          10.0.0.103 => aa:bb:cc:dd:ee:ff
          10.0.0.104 => aa:bb:cc:dd:ee:ff
          ...
  • 34. LR Static Route
      ● Assign Pod subnet(s) per VM (minion)
      ● How
          ○ Configure static routes in OVN LR for pod subnets: next hop = VM IP
      ● Considerations
          ○ De-couples VM and Pod subnets
          ○ Declarative, more reliable than ARP
          ○ May waste more IPs, but size of subnet is flexible
          ○ Scale: O(S), S = number of pod subnets
              ■ Worst case O(N), N = number of pods, if subnet size is /32.
      [Figure: a VM at 172.0.0.2/24 on an HV (OVS) hosts pods 10.0.0.2/25, 10.0.0.3/25, 10.0.0.4/25, 10.0.0.5/25]
      LR Routing Table (static):
          10.0.0.0/25 => 172.0.0.2
          10.0.0.128/25 => 172.0.1.100
          10.0.0.1/25 => 172.0.1.3
          ...
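
The effect of per-VM pod subnets can be illustrated with a small longest-prefix-match lookup. This is only a sketch of how a router resolves a pod IP to the hosting VM, not OVN's implementation; the `next_hop` helper is hypothetical and the table uses the first two entries from the slide's routing table.

```python
import ipaddress

# Pod subnet -> VM IP (next hop), taken from the slide's routing table.
ROUTES = {
    "10.0.0.0/25": "172.0.0.2",
    "10.0.0.128/25": "172.0.1.100",
}


def next_hop(dst, routes):
    """Return the next hop of the most specific route containing dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


pod_on_first_vm = next_hop("10.0.0.5", ROUTES)
pod_on_second_vm = next_hop("10.0.0.130", ROUTES)
unknown = next_hop("192.168.1.1", ROUTES)
```

The table grows with the number of pod subnets rather than the number of pods. In OVN such routes would typically be added with `ovn-nbctl lr-route-add`, for example `ovn-nbctl lr-route-add lr0 10.0.0.0/25 172.0.0.2`, where the router name `lr0` is an assumed placeholder.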
  • 35. References
      ● OVS/OVN
          ○ http://www.openvswitch.org/
      ● Networking-OVN
          ○ https://docs.openstack.org/networking-ovn/latest/
      ● OVN-Kubernetes
          ○ https://github.com/openvswitch/ovn-kubernetes/
      ● OVN-Scale-Test
          ○ https://github.com/openvswitch/ovn-scale-test
      ● GO-OVN library
          ○ https://github.com/eBay/go-ovn