12. www.romana.io
Overcoming VPC network limits
• How do I build clusters larger than 50 nodes?
1. Overlay
2. Route aggregation
3. Secondary IPs
• When can I avoid an overlay?
• Can I apply Network Policy?
• Is it portable across multiple clouds?
13. www.romana.io
Large cluster with overlay (diagram)
VPC 192.168.0.0/16; AZ 1: 192.168.1.0/24, AZ 2: 192.168.2.0/24
Hosts 192.168.1.1–192.168.1.3 and 192.168.2.1–192.168.2.3, each with a TUN device and one pod /24 (10.0.1.0/24 through 10.0.6.0/24)
Pod traffic is tunneled host to host, so the VPC route table carries no pod routes
14. www.romana.io
Route aggregation with Romana
● Topology-aware IPAM (sketched below)
○ Maintains aggregated routes in L3 networks
■ Far fewer routes to manage
○ User chooses the aggregation point
■ Nodes, ToR/Leaf, Core/Spine
○ Allows filtered/restricted route distribution
● Network advertisement in L3 networks
○ EC2 VPC: via the vpc-router service
○ Datacenter: via the bird service (BGP, OSPF)
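To make the aggregation idea concrete, here is a minimal Python sketch with hypothetical names and a toy addressing scheme (not Romana's actual code): each node gets one pod /24 carved from a per-zone block, and the L3 network only needs one aggregate route per group of nodes; the node the route points at holds more specific routes to the rest of its group.

```python
import ipaddress

# Toy topology-aware IPAM: the pod space is split into one block per
# availability zone, each node gets one /24 from its zone's block, and
# the VPC route table only carries one aggregate route per group of
# nodes (here: pairs, i.e. /23 routes).
POD_CIDR = ipaddress.ip_network("10.0.0.0/16")   # whole pod address space
ZONE_PREFIX = 18                                  # one /18 per zone
NODE_PREFIX = 24                                  # one /24 per node
GROUP = 2                                         # nodes covered by one VPC route

zone_blocks = list(POD_CIDR.subnets(new_prefix=ZONE_PREFIX))

def node_pod_cidr(zone: int, node: int) -> ipaddress.IPv4Network:
    """Pod /24 for the node-th node in a zone."""
    return list(zone_blocks[zone].subnets(new_prefix=NODE_PREFIX))[node]

def vpc_routes(zone: int, node_ips: list) -> list:
    """One aggregated route per GROUP nodes, pointing at the first node
    of the group; that node keeps more specific /24 routes to the rest."""
    agg_prefix = NODE_PREFIX - (GROUP.bit_length() - 1)   # /23 for pairs
    routes = []
    for i in range(0, len(node_ips), GROUP):
        agg = node_pod_cidr(zone, i).supernet(new_prefix=agg_prefix)
        routes.append((str(agg), node_ips[i]))
    return routes

az1_nodes = ["192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.1.4"]
for cidr, nexthop in vpc_routes(0, az1_nodes):
    print(f"{cidr} -> {nexthop}")
# 10.0.0.0/23 -> 192.168.1.1
# 10.0.2.0/23 -> 192.168.1.3
```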
15. www.romana.io
Scale up the cluster (diagram)
VPC 192.168.0.0/16; AZ 1: 192.168.1.0/24, AZ 2: 192.168.2.0/24
Hosts 192.168.1.1–192.168.1.3 and 192.168.2.1–192.168.2.3 shown (the cluster extends to 192.168.1.26 and beyond), each with a default route (0.0.0.0 -> eth0) and one pod /24: 10.0.2.0/24, 10.0.4.0/24, 10.0.6.0/24, 10.0.8.0/24, 10.0.10.0/24, 10.0.12.0/24
Host 192.168.1.26 hosts pods 10.0.3.0/24; host 192.168.1.1 carries the more specific route 10.0.3.0/24 -> 192.168.1.26
VPC route table (each /23 covers two hosts' pod /24s):
10.0.2.0/23 -> 192.168.1.1
10.0.4.0/23 -> 192.168.1.2
10.0.6.0/23 -> 192.168.1.3
10.0.8.0/23 -> 192.168.2.1
10.0.10.0/23 -> 192.168.2.2
10.0.12.0/23 -> 192.168.2.3
…
10.0.98.0/23 -> 192.168.1.25
10.0.100.0/23 -> 192.168.2.25
1. Use each route to forward traffic to more than one instance
2. Assign IPs based on zone
3. Install more specific route on random node
4. Let instance forward to destination instance
16. www.romana.io
Scale up the cluster (diagram, continued)
Same topology and route tables as the previous slide, with an annotation on the node holding pods 10.0.3.0/24: some cross zone traffic to this node will be forwarded off-host to another node. Traffic matching the aggregated /23 first lands on host 192.168.1.1 and is relayed to 192.168.1.26 via the more specific route, as sketched below.
1. Use each route to forward traffic to more than one instance
2. Assign IPs based on zone
3. Install more specific route on random node
4. Let instance forward to destination instance
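A small sketch of that two-level lookup, using hypothetical route data taken from the diagram and plain longest-prefix matching: the VPC's aggregated /23 route delivers a packet for 10.0.3.x to host 192.168.1.1, whose more specific /24 route relays it to 192.168.1.26.

```python
import ipaddress

# Hypothetical route tables from the diagram. Longest-prefix match:
# the most specific matching route wins.
VPC_ROUTES = {
    "10.0.2.0/23": "192.168.1.1",   # aggregate covering pods on .1.1 and .1.26
    "10.0.4.0/23": "192.168.1.2",
}
HOST_ROUTES = {
    "192.168.1.1": {"10.0.3.0/24": "192.168.1.26"},  # more specific route
    "192.168.1.26": {},                               # local delivery
}

def lookup(routes: dict, dst: str):
    """Longest-prefix match over a {cidr: next_hop} table."""
    dst_ip = ipaddress.ip_address(dst)
    matches = [ipaddress.ip_network(c) for c in routes
               if dst_ip in ipaddress.ip_network(c)]
    if not matches:
        return None
    best = max(matches, key=lambda n: n.prefixlen)
    return routes[str(best)]

def path_to_pod(dst_pod_ip: str):
    """Trace the hops a packet takes to reach a pod IP."""
    hops = []
    node = lookup(VPC_ROUTES, dst_pod_ip)          # 1. VPC aggregated route
    hops.append(node)
    extra = lookup(HOST_ROUTES[node], dst_pod_ip)  # 2. more specific on-host route
    if extra:                                      # 3. relay off-host if needed
        hops.append(extra)
    return hops

print(path_to_pod("10.0.3.7"))   # ['192.168.1.1', '192.168.1.26']  (extra hop)
print(path_to_pod("10.0.2.9"))   # ['192.168.1.1']                  (direct)
```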
18. www.romana.io
Route aggregation
● Pro
○ Native VPC networking
■ Stable, simple to debug, no overlay
○ Scales beyond 50 nodes
○ Supports Network Policy
○ Portable
● Con
○ Extra router hop for some cross-zone traffic in large clusters (~1-2 ms)
○ vs. encap overhead (~1 ms) on 100% of cross-zone traffic for all nodes with an overlay
■ Fraction of cross-zone traffic taking the extra hop: (n - 50)/n for n > 50 nodes (checked below)
● At 100 nodes, 50% of cross-zone traffic takes the extra hop
● At 200 nodes, 75% of cross-zone traffic takes the extra hop
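A quick check of the arithmetic above, assuming the default limit of 50 VPC routes, so at most 50 of n nodes can be targeted directly by the VPC route table:

```python
def extra_hop_fraction(nodes: int, vpc_routes: int = 50) -> float:
    """Fraction of cross-zone traffic that needs the extra router hop
    when only `vpc_routes` nodes receive traffic directly from the
    VPC route table."""
    if nodes <= vpc_routes:
        return 0.0
    return (nodes - vpc_routes) / nodes

for n in (50, 100, 200):
    print(n, f"{extra_hop_fraction(n):.0%}")   # 0%, 50%, 75%
```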
19. www.romana.io
Secondary IPs
• The VPC route limit no longer applies
• All IPs are on the same subnet
• A default route for the zone is enough
• Add additional IPs and interfaces as necessary
• Depending on instance type, up to 750 IPs (16xlarge instances: 15 ENIs with 50 IPs each)
• Two new CNIs use them for pod networks
• Amazon
• Lyft
21. www.romana.io
AWS L-IPAM
• Whenever available IP addresses drop below the minimum threshold, L-IPAM will:
• Create a new ENI and attach it to the instance
• Allocate all available IP addresses on the new ENI
• Once the IP addresses become available, add them to the warm pool
• Whenever available IP addresses exceed the maximum threshold, L-IPAM will:
• Pick an ENI whose secondary IP addresses are all in the warm pool (i.e. not in use)
• Detach the ENI
• Return it to the ENI pool
• Fragmentation of addresses across ENIs may prevent freeing ENIs even when there are many unused IP addresses (see the sketch below)
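A minimal sketch of the warm-pool behaviour described above, with made-up thresholds and a stand-in EC2 client; the real logic lives in the amazon-vpc-cni-k8s L-IPAM daemon.

```python
from dataclasses import dataclass, field

@dataclass
class ENI:
    eni_id: str
    ips: set = field(default_factory=set)      # secondary IPs on this ENI
    in_use: set = field(default_factory=set)   # IPs currently assigned to pods

@dataclass
class LocalIPAM:
    """Toy warm-pool manager mirroring the behaviour described above."""
    ec2: object                 # stand-in for the EC2 API client
    min_free: int = 5           # hypothetical minimum warm-pool size
    max_free: int = 20          # hypothetical maximum warm-pool size
    enis: list = field(default_factory=list)

    def free_ips(self) -> int:
        return sum(len(e.ips - e.in_use) for e in self.enis)

    def reconcile(self):
        # Below the minimum: attach a new ENI and put all its IPs in the pool.
        if self.free_ips() < self.min_free:
            eni = self.ec2.create_and_attach_eni()
            eni.ips = self.ec2.assign_all_secondary_ips(eni)
            self.enis.append(eni)
        # Above the maximum: detach an ENI only if *none* of its IPs are in use.
        elif self.free_ips() > self.max_free:
            for eni in self.enis:
                if not eni.in_use:              # fully idle ENI
                    self.ec2.detach_and_delete_eni(eni)
                    self.enis.remove(eni)
                    break
            # If every ENI still has at least one pod IP in use, nothing can
            # be freed: this is the fragmentation problem mentioned above.
```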
22. www.romana.io
IPs per instance — sample max pods per instance (see the sketch below)
Instance Type | ENIs | Secondary IPs | Total IPs | Pods/instance
Medium        |  2   |  6            |  12       |  10
Large         |  3   | 10            |  30       |  27
[2-4]xLarge   |  4   | 15            |  45       |  41
8xLarge       |  8   | 30            | 240       | 232
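A rough reconstruction of the table's arithmetic, assuming the secondary-IP count is per ENI and one IP per ENI stays reserved for the host; this is a simplification, and the authoritative per-instance-type ENI and IP limits come from EC2.

```python
def max_pods(enis: int, secondary_ips_per_eni: int) -> int:
    """Rough estimate: total secondary IPs minus one reserved per ENI."""
    total_ips = enis * secondary_ips_per_eni
    return total_ips - enis

print(max_pods(2, 6))    # 10  (Medium row)
print(max_pods(3, 10))   # 27  (Large row)
print(max_pods(8, 30))   # 232 (8xLarge row)
```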
23. www.romana.io
AWS CNI with Local IPAM (diagram)
VPC 192.168.0.0/16; AZ 1: 192.168.0.0/19, AZ 2: 192.168.64.0/19
Hosts 192.168.1.1, 192.168.2.1, 192.168.3.1, 192.168.65.1, 192.168.66.2, 192.168.67.3
Each host runs the CNI plug-in and an L-IPAM daemon, which talk to the EC2 Metadata Service and the EC2 API
Pods get IPs on ENIs, drawn from the zone's VPC address space (192.168.2.0/24, 192.168.3.0/24, 192.168.65.0/24, 192.168.66.0/24, 192.168.67.0/24)
VPC route table shown unchanged (pod IPs are native VPC addresses, so no pod routes are needed)
24. www.romana.io
Lyft ipvlan CNI
• Similar to AWS CNI
• CNI does IPAM
• Designed for low latency
• ipvlan device driver
• L2 mode (mechanism sketched below)
• Pods share a single MAC
• Adds a second IP in each pod for host networking
• Different tradeoffs
• Optimized for intra-VPC traffic
• No network service/daemonset
• Uses kube2IAM
• Pod traffic bypasses host networking
• Can't apply L3 network policy on the host
• Services behave differently
• Pod source IP is lost
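A rough sketch of the ipvlan L2 mechanism the bullets describe, driving standard iproute2 commands from Python with hypothetical interface, namespace, and address values (not the Lyft CNI's actual code): an ipvlan slave of the host's ENI is created, moved into the pod's network namespace, and given a VPC secondary IP, so pods share the parent interface's MAC.

```python
import subprocess

def sh(cmd: str):
    """Run a shell command, raising on failure (sketch only; needs root)."""
    subprocess.run(cmd, shell=True, check=True)

# Hypothetical values: parent ENI, an existing pod network namespace, and
# a VPC secondary IP already allocated to that ENI.
PARENT = "eth0"
NETNS = "pod-1234"
POD_IP = "192.168.2.15/19"        # same subnet as the zone's VPC subnet
GATEWAY = "192.168.0.1"           # VPC subnet router (hypothetical)

# 1. Create an ipvlan slave of the ENI in L2 mode (pods share eth0's MAC).
sh(f"ip link add link {PARENT} name ipvl0 type ipvlan mode l2")
# 2. Move it into the pod's network namespace.
sh(f"ip link set ipvl0 netns {NETNS}")
# 3. Assign the VPC secondary IP and bring the interface up.
sh(f"ip netns exec {NETNS} ip addr add {POD_IP} dev ipvl0")
sh(f"ip netns exec {NETNS} ip link set ipvl0 up")
# 4. Default route straight to the VPC router: pod traffic bypasses the
#    host's own network stack, which is why host-level L3 network policy
#    does not see it.
sh(f"ip netns exec {NETNS} ip route add default via {GATEWAY} dev ipvl0")
```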
25. www.romana.io
Lyft ipvlan CNI (diagram)
VPC 192.168.0.0/16; AZ 1: 192.168.0.0/19, AZ 2: 192.168.64.0/19
Hosts 192.168.1.1, 192.168.2.1, 192.168.3.1, 192.168.65.1, 192.168.66.2, 192.168.67.3
Each host runs the CNI plug-in and kube2IAM; the EC2 Metadata Service and the EC2 API are shown as the sources of ENI and IP information
Pods get IPs on ENIs plus a link-local IP
VPC route table shown unchanged
26. www.romana.io
Secondary IP Limitations and Considerations
• All ENIs share the same subnet and the same security groups
• Requires direct access to the AWS APIs to allocate/deallocate ENIs
• Requires direct access to the Metadata Service for IPAM
• Are there enough IPs per instance?
• How big does the ARP table get?
27. www.romana.io
Native multi-zone VPC Networking
CNI | Deployment (CNI plug-in plus…) | Pod network | IPAM | Max pods per instance | Network Policy API | Multi-cloud
Romana | vpc-router service on master | No overlay. Extra hop for some traffic on large clusters. Pod network: any subnet | Central IPAM service | No limit | Yes | Yes
AWS | L-IPAM daemonset | No overlay. Pod network: VPC subnet with secondary IPs | Per-node daemon. Accesses EC2 Metadata | Depends on instance size | TBD. Security Groups | No
Lyft | kube2IAM daemonset | No overlay. Pod network: VPC subnet with secondary IPs. Link-local IP | Built in to CNI. Accesses EC2 Metadata | Depends on instance size | No. Delegates to Envoy | No