Deeper Dive in Docker Overlay Networks
Laurent Bernaille (@lbernail), CTO D2SI
Agenda
● Reminder on the Docker Overlay
● VXLAN Control Plane options
● Using BGP as a dynamic Control Plane
● What can we do with this?
Reminder on the Docker overlay
The Docker Overlay network
docker0:~$ docker network create --driver overlay --subnet 192.168.0.0/24 dockercon
d099dcc709daddbc0e143c24e7091bef6b13bdc3abb379473af4582bf1e112b1
docker1:~$ docker network ls
NETWORK ID NAME DRIVER SCOPE
d099dcc709da dockercon overlay global
docker0:~$ docker run -d --ip 192.168.0.100 --net dockercon --name C0 debian sleep infinity
docker1:~$ docker run -it --rm --net dockercon debian
root@950d67e96db7:/# ping 192.168.0.100
PING 192.168.0.100 (192.168.0.100): 56 data bytes
64 bytes from 192.168.0.100: seq=0 ttl=64 time=1.153 ms
Docker Overlay: Data plane

[Diagram: C0 (eth0, 192.168.0.100) on docker0 (10.0.0.10) and C1 (eth0, 192.168.0.Y) on docker1 (10.0.0.11); each container's eth0 is attached through a veth to br0, which bridges to the host's vxlan interface]

PING packet as encapsulated on the wire:
IP           src: 10.0.0.11    dst: 10.0.0.10
UDP          src: X            dst: 4789
VXLAN        VNI
Original L2  src: 192.168.0.Y  dst: 192.168.0.100
What is VXLAN?
• Tunneling technology over UDP (L2 in UDP)
• Developed for cloud SDN to create multi-tenancy
• Without the need for L2 connectivity
• Without the normal VLAN limit (4096 VLAN Ids)
• Easy to encrypt: IPsec
• Overhead: 50 bytes
• In Linux
• Started with Open vSwitch
• Native with Kernel >= 3.7 and >=3.16 for Namespace support
Frame format: Outer IP packet | UDP (dst: 4789) | VXLAN Header | Original L2
VXLAN: Virtual eXtensible LAN
VNI: VXLAN Network Identifier
VTEP: VXLAN Tunnel Endpoint
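The 50-byte overhead and the 24-bit VNI are easy to verify with a short sketch that packs the 8-byte VXLAN header (flags, 24-bit VNI, reserved bytes) the way it appears on the wire; the helper names here are illustrative, not from any library:

```python
import struct

# Encapsulation overhead on Ethernet: outer MAC (14) + outer IP (20) + UDP (8) + VXLAN header (8)
OVERHEAD = 14 + 20 + 8 + 8  # = 50 bytes, as stated on the slide

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags byte 0x08 (VNI valid), 24-bit VNI."""
    assert 0 <= vni < 2**24  # 16M networks instead of 4096 VLAN IDs
    return struct.pack("!II", 0x08 << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Extract the VNI from an 8-byte VXLAN header."""
    _flags, rest = struct.unpack("!II", header)
    return rest >> 8

print(OVERHEAD)                     # 50
print(parse_vni(vxlan_header(42)))  # 42
```

The 24-bit VNI field is what lifts the 4096-network VLAN limit mentioned above.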
[Setup: hosts docker0 (10.0.0.10) and docker1 (10.0.1.10) on 10.0.0.0/16]
Let's build an overlay "manually"
Overlay namespaces

[Diagram: on each host (docker0 at 10.0.0.10, docker1 at 10.0.1.10), a dedicated namespace contains bridge br42 with a vxlan42 interface; the host keeps its eth0]
Creating the overlay namespace

ip netns add overns                                           # create overlay NS
ip netns exec overns ip link add dev br42 type bridge         # create bridge in NS
ip netns exec overns ip addr add dev br42 192.168.0.1/24
ip link add dev vxlan42 type vxlan id 42 proxy dstport 4789   # create VXLAN interface
ip link set vxlan42 netns overns                              # move it to NS
ip netns exec overns ip link set vxlan42 master br42          # add it to bridge
ip netns exec overns ip link set vxlan42 up                   # bring all interfaces up
ip netns exec overns ip link set br42 up

(setup_vxlan script)
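The same sequence works for any VNI (we reuse it later with VNI 66). A hypothetical generator of these commands, parameterized by VNI (a sketch of what a script like setup_vxlan might emit, not the actual script from the repo):

```python
def overlay_ns_commands(vni: int, ns: str = "overns", bridge_ip: str = "192.168.0.1/24") -> list[str]:
    """Emit the slide's command sequence for an overlay namespace carrying VNI `vni`."""
    br, vx = f"br{vni}", f"vxlan{vni}"
    in_ns = f"ip netns exec {ns}"
    return [
        f"ip netns add {ns}",                                            # create overlay NS
        f"{in_ns} ip link add dev {br} type bridge",                     # create bridge in NS
        f"{in_ns} ip addr add dev {br} {bridge_ip}",
        f"ip link add dev {vx} type vxlan id {vni} proxy dstport 4789",  # create VXLAN interface
        f"ip link set {vx} netns {ns}",                                  # move it to NS
        f"{in_ns} ip link set {vx} master {br}",                         # add it to bridge
        f"{in_ns} ip link set {vx} up",                                  # bring interfaces up
        f"{in_ns} ip link set {br} up",
    ]

for cmd in overlay_ns_commands(42):
    print(cmd)
```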
[Diagram: containers C0 (eth0, 192.168.0.10) on docker0 and C1 (eth0, 192.168.0.20) on docker1, each attached by a veth to br42 in the overlay namespace; vxlan42 connects the hosts 10.0.0.10 and 10.0.1.10]
Create containers and attach them

On docker0:

docker run -d --net=none --name=demo debian sleep infinity                        # create container without net
ctn_ns_path=$(docker inspect --format="{{ .NetworkSettings.SandboxKey}}" demo)    # get NS for container
ctn_ns=${ctn_ns_path##*/}
ip link add dev veth1 mtu 1450 type veth peer name veth2 mtu 1450                 # create veth
ip link set dev veth1 netns overns                                                # send veth1 to overlay NS
ip netns exec overns ip link set veth1 master br42                                # attach it to overlay bridge
ip netns exec overns ip link set veth1 up
ip link set dev veth2 netns $ctn_ns                                               # send veth2 to container
ip netns exec $ctn_ns ip link set dev veth2 name eth0 address 02:42:c0:a8:00:10   # rename & configure
ip netns exec $ctn_ns ip addr add dev eth0 192.168.0.10
ip netns exec $ctn_ns ip link set dev eth0 up

On docker1: same with 192.168.0.20 / 02:42:c0:a8:00:20

(plumb script)
Does it ping?
docker0:~$ docker exec -it demo ping 192.168.0.20
PING 192.168.0.20 (192.168.0.20): 56 data bytes
92 bytes from 192.168.0.10: Destination Host Unreachable
docker0:~$ sudo ip netns exec overns ip neighbor show
docker0:~$ sudo ip netns exec overns ip neighbor add 192.168.0.20 lladdr 02:42:c0:a8:00:20 dev vxlan42
docker0:~$ sudo ip netns exec overns bridge fdb add 02:42:c0:a8:00:20 dev vxlan42 self dst 10.0.1.10 vni 42 port 4789
docker1: Same with 192.168.0.10, 02:42:c0:a8:00:10 and 10.0.0.10
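This is exactly the state a VXLAN control plane has to maintain: for every overlay IP, a (MAC, VTEP) pair. A toy sketch of such a database, emitting the two commands above for each remote endpoint (hypothetical helper, not Docker's implementation):

```python
from typing import NamedTuple

class Endpoint(NamedTuple):
    mac: str   # container MAC on the overlay
    vtep: str  # underlay IP of the host (VXLAN tunnel endpoint)

# The centralized database a control plane would keep (seen from docker0)
endpoints: dict[str, Endpoint] = {
    "192.168.0.20": Endpoint("02:42:c0:a8:00:20", "10.0.1.10"),
}

def programming_commands(ip: str, vni: int = 42, ns: str = "overns") -> list[str]:
    """ARP (neighbor) and FDB entries to program for one remote endpoint."""
    ep = endpoints[ip]
    dev = f"vxlan{vni}"
    return [
        f"ip netns exec {ns} ip neighbor add {ip} lladdr {ep.mac} dev {dev}",
        f"ip netns exec {ns} bridge fdb add {ep.mac} dev {dev} self dst {ep.vtep} vni {vni} port 4789",
    ]

for cmd in programming_commands("192.168.0.20"):
    print(cmd)
```

Keeping this database in sync across hosts is precisely the job of the control-plane options discussed next.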
[Diagram: with the ARP and FDB entries populated on both hosts, the PING from C0 (192.168.0.10) now reaches C1 (192.168.0.20) through vxlan42 between 10.0.0.10 and 10.0.1.10]

Result: it pings.
VXLAN Control Plane options
VXLAN Control Plane options - 1: Multicast

[Diagram: the vxlan interfaces join a multicast group 239.x.x.x; "ARP: Who has 192.168.0.2?" / "L2 discovery: where is 02:42:c0:a8:00:02?"]

Use a multicast group to send traffic for unknown L3/L2 addresses.
PROS: simple and efficient
CONS: multicast connectivity is not always available (on public clouds, for instance)
VXLAN Control Plane options - 2: Point-to-point

[Diagram: two vxlan interfaces; each sends everything for unknown addresses to the configured remote IP]

Configure a remote IP address where traffic for unknown addresses is sent.
PROS: simple, no need for multicast, very good for two hosts
CONS: difficult to manage with more than 2 hosts
VXLAN Control Plane options - 3: User-land

[Diagram: a daemon on each host modifies the ARP/FDB tables; "ARP: Do you know 192.168.0.2?" / "L2: where is 02:42:c0:a8:00:02?"]

The kernel does nothing; ARP / FDB information is provided from outside by a daemon.
PROS: very flexible
CONS: requires a daemon and a centralized database of addresses
Docker Overlay control plane (3: User-land)

[Diagram: the same data plane as before, with dockerd on each host populating the ARP and FDB entries; endpoint data is stored in consul/swarm and propagated with Serf / Gossip]
"Deep Dive in Docker Overlay Networks", Dockercon Austin 2017: slides, video, blog posts

That was a lot of information.
Using BGP as a dynamic control plane
VXLAN Control Plane options - 4: BGP EVPN

[Diagram: a bgpd runs next to each vxlan interface; endpoint data is distributed with BGP]

Rely on the BGP EVPN address family to distribute L2 and L3 data.
PROS: BGP is a standard way to distribute addresses, supported by SDN vendors
CONS: limited Linux implementations, requires some BGP knowledge
BGP in one slide
● Routing Protocol between network entities ("Autonomous Systems", AS)
Google ASN: 15169 / Amazon ASN: 16509
(both actually have more than one)
● BGP is an EGP: Exterior Gateway Protocol
IGP: Interior Gateway Protocol (OSPF, EIGRP, IS-IS)
IGP: next hop is the IP of a router
BGP: next hop is an Autonomous System
● BGP is what makes the Internet work
● BGP scales very well
500 000+ prefixes for a full Internet table
A quick BGP example

[Diagram: AS1 announces 20.0.0.0/16 to its eBGP neighbors; the AS path grows at each hop (AS1, then AS2-AS1 and AS4-AS1, then AS5-AS4-AS1); a receiver with several candidate routes prefers the shortest AS path; within an AS, routes are redistributed over iBGP]

AS: Autonomous System
eBGP: external (different AS)
iBGP: internal (same AS)
iBGP

iBGP requires a full mesh between all peers:
n peers => n * (n-1) / 2 connections
50 peers => 1225 connections (49 on each host)

Route reflectors (RR) simulate the mesh:
more scalable and simpler, and it is possible to have more than one RR.

RR: distributes BGP information within an Autonomous System
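The mesh arithmetic above is easy to check (helper names are illustrative):

```python
def ibgp_mesh_sessions(n: int) -> int:
    """Full-mesh iBGP: one session per pair of peers."""
    return n * (n - 1) // 2

def sessions_per_host(n: int) -> int:
    """Each peer maintains a session to every other peer."""
    return n - 1

print(ibgp_mesh_sessions(50), sessions_per_host(50))  # 1225 49

# With route reflectors, each client only peers with the RRs:
def rr_sessions(clients: int, reflectors: int) -> int:
    return clients * reflectors

print(rr_sessions(50, 2))  # 100 sessions instead of 1225
```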
BGP EVPN
● Part of MP-BGP (multi-protocol BGP: not only IP prefixes)
● Announce VXLAN information instead of IP prefixes
L3: IP addresses of VXLAN endpoints (VTEP)
L2: Location of MAC addresses
● BUM (Broadcast, Unknown, Multicast) traffic unicasted to all VTEPs
● Get the scalability of BGP
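The two EVPN route types that show up later in `show bgp evpn route` can be sketched as simple formatters, matching the display format the slides show ([2]:[ESI]:[EthTag]:[MAClen]:[MAC] and [3]:[EthTag]:[IPlen]:[OrigIP]); this only mirrors the textual output, it is not a real BGP encoder:

```python
def evpn_type2(mac: str, esi: int = 0, eth_tag: int = 0) -> str:
    """Type-2: MAC/IP advertisement — announces where a MAC address lives."""
    return f"[2]:[{esi}]:[{eth_tag}]:[48]:[{mac}]"

def evpn_type3(orig_ip: str, eth_tag: int = 0) -> str:
    """Type-3: inclusive multicast route — announces which VTEPs want BUM traffic."""
    return f"[3]:[{eth_tag}]:[32]:[{orig_ip}]"

print(evpn_type3("10.0.0.10"))          # [3]:[0]:[32]:[10.0.0.10]
print(evpn_type2("02:42:c0:a8:00:20"))  # [2]:[0]:[0]:[48]:[02:42:c0:a8:00:20]
```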
Environment

[Diagram: route reflectors RR1 (quagga-rr, 10.0.0.5) and RR2 (quagga-rr, 10.0.1.5); hosts docker0 (10.0.0.10) and docker1 (10.0.1.10), all on 10.0.0.0/16]

docker0:~$ docker run -t -d --privileged --name quagga -p 179:179 --hostname docker0 -v $(pwd)/quagga:/etc/quagga cumulusnetworks/quagga
(--privileged: the container needs to modify routing/forwarding)
router bgp 65000
bgp router-id 10.0.0.10
no bgp default ipv4-unicast
neighbor reflectors peer-group
neighbor reflectors remote-as 65000
neighbor reflectors capability extended-nexthop
neighbor 10.0.0.5 peer-group reflectors
neighbor 10.0.1.5 peer-group reflectors
address-family evpn
neighbor reflectors activate
advertise-all-vni
BGP configuration on Docker0
router bgp 65000
bgp router-id 10.0.0.5
bgp cluster-id 111.111.111.111
no bgp default ipv4-unicast
neighbor docker peer-group
neighbor docker remote-as 65000
bgp listen range 10.0.0.0/16 peer-group docker
address-family evpn
neighbor docker activate
neighbor docker route-reflector-client
BGP configuration on Route Reflectors
Creating our BGP clients on Docker hosts
What we have so far

[Diagram: a quagga container on each of docker0 (10.0.0.10) and docker1 (10.0.1.10), peered with route reflectors RR1 (10.0.0.5) and RR2 (10.0.1.5) over 10.0.0.0/16]
Let's look at the BGP data
docker0:~$ docker exec -it quagga vtysh
docker0# show run
docker0# show bgp neighbors
docker0# show bgp evpn summary
BGP router identifier 10.0.0.10, local AS number 65000 vrf-id 0
Peers 2, using 42 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
quagga0(10.0.0.5) 4 65000 42 43 0 0 0 00:02:01 0
quagga1(10.0.1.5) 4 65000 42 43 0 0 0 00:02:01 0
docker0# show bgp evpn route
No EVPN prefixes exist
Configuring VXLAN interfaces

sudo ./setup_vxlan 42 container:quagga dstport 4789 nolearning   <= nolearning: only learn through EVPN

[Diagram: br42 and vxlan42 now live inside the quagga container's namespace on both docker0 and docker1]
Let's look at the BGP data
docker0:~$ docker exec -it quagga vtysh
docker0# show bgp evpn route
BGP table version is 0, local router ID is 10.0.0.10
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10.0.0.10:1
*> [3]:[0]:[32]:[10.0.0.10]
10.0.0.10 32768 i
Route Distinguisher: 10.0.1.10:1
*>i[3]:[0]:[32]:[10.0.1.10]
10.0.1.10 0 100 0 i
docker0# show evpn mac vni all
Let's add containers and try pinging

[Diagram: a demo container on each host (192.168.0.10 on docker0, 192.168.0.20 on docker1), attached to br42 in the quagga namespace]

docker0:~$ sudo ./plumb br42@quagga demo 192.168.0.10/24@192.168.0.1 02:42:c0:a8:00:10
docker1:~$ sudo ./plumb br42@quagga demo 192.168.0.20/24@192.168.0.1 02:42:c0:a8:00:20
What about BGP?
docker0:~$ docker exec -it quagga vtysh
docker0# show bgp evpn route
BGP table version is 0, local router ID is 10.0.0.10
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
Route Distinguisher: 10.0.1.10:1
*>i[2]:[0]:[0]:[48]:[02:42:c0:a8:00:20]
10.0.1.10 0 100 0 i
* i[3]:[0]:[32]:[10.0.1.10]
10.0.1.10 0 100 0 i
docker0# show evpn mac vni all
VNI 42 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:00:10 local veth0pldemo
02:42:c0:a8:00:20 remote 10.0.1.10
Overview

[Diagram: control plane — the quagga containers peer with RR1/RR2, which distribute the EVPN routes; data plane — vxlan42 traffic flows directly between docker0 (10.0.0.10) and docker1 (10.0.1.10)]
What's interesting about this setup?
● Standard VXLAN address distribution (used on many routers)
● Full management of BUM traffic: ARP queries, broadcasts (DHCP), multicast (discovery, keepalived)
● BUM traffic is unicasted (not efficient); possible optimization: ARP suppression (Cumulus Quagga)
What can we do with this?
What if we want a second Overlay?

[Diagram: a second bridge br66 with vxlan66 next to br42/vxlan42 in the quagga namespace on each host; demo66 containers at 192.168.66.10 (docker0) and 192.168.66.20 (docker1)]

docker0:~$ sudo ./setup_vxlan 66 container:quagga dstport 4789 nolearning
docker0:~$ docker run -d --net=none --name=demo66 debian sleep infinity
docker0:~$ sudo ./plumb br66@quagga demo66 192.168.66.10/24 02:42:c0:a8:66:10
What about BGP?
docker0:~$ docker exec -it quagga vtysh
docker0# show evpn vni
Number of VNIs: 2
VNI VxLAN IF VTEP IP # MACs # ARPs # Remote VTEPs
42 vxlan42 0.0.0.0 2 0 1
66 vxlan66 0.0.0.0 2 0 1
docker0# show evpn mac vni all
VNI 42 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:00:10 local veth0pldemo
02:42:c0:a8:00:20 remote 10.0.1.10
VNI 66 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:66:10 local veth0pldemo66
02:42:c0:a8:66:20 remote 10.0.1.10
Taking advantage of broadcast: DHCP

[Diagram: a dhcp container (192.168.0.254) on docker0; on docker1, a demodhcp container broadcasts a DHCP request over the overlay to get its address]
Configuring DHCP
docker0:~$ docker run -d --net=none --name dhcp -v "$(pwd)/dhcp":/data networkboot/dhcpd eth0
docker0:~$ sudo ./plumb br42@quagga dhcp 192.168.0.254/24
docker1:~$ docker run -d --net=none --name=demodhcp debian sleep infinity
docker1:~$ sudo ./plumb br42@quagga demodhcp dhcp
docker1:~$ docker exec -it demodhcp ping 192.168.0.10
PING 192.168.0.10 (192.168.0.10): 56 data bytes
64 bytes from 192.168.0.10: icmp_seq=0 ttl=47 time=1.566 ms
subnet 192.168.0.0 netmask 255.255.255.0 {
range 192.168.0.100 192.168.0.200;
option routers 192.168.0.1;
option domain-name-servers 8.8.8.8;
}
DHCP configuration
Getting out of our Docker environment

[Diagram: a new host gateway0 (10.0.0.20) joins the overlay: br42/vxlan42 plus a veth pair whose host end, vethgw (192.168.0.1), gives the host itself an address in the overlay, next to docker0 (10.0.0.10) and docker1 (10.0.1.10)]
Getting out of our Docker environment

gateway0:~$ ./setup_vxlan 42 host dstport 4789 nolearning
gateway0:~$ ip link add dev vethbr type veth peer name vethgw
gateway0:~$ ip link set vethbr master br42
gateway0:~$ ip addr add 192.168.0.1/24 dev vethgw
gateway0:~$ ping 192.168.0.10
PING 192.168.0.10 (192.168.0.10): 56 data bytes
64 bytes from 192.168.0.10: icmp_seq=0 ttl=47 time=0.866 ms

[Diagram: on gateway0, br42 bridges vxlan42 and vethbr; vethgw (192.168.0.1) is the host end of the veth pair]
Getting out of VXLAN / Quagga

[Diagram: gateway0 (10.0.0.20) routes between 192.168.0.0/24 and 10.0.0.0/16 and NATs everything else out of eth0; a non-VXLAN host (10.0.0.30) can now reach the overlay]
Getting out of VXLAN / Quagga

gateway0:~$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
gateway0:~$ iptables -t nat -A POSTROUTING ! -d 10.0.0.0/16 -s 192.168.0.0/24 -o eth0 -j MASQUERADE
docker1:~$ docker exec -it demodhcp ping 192.168.0.1   <= Local (VXLAN)
docker1:~$ docker exec -it demodhcp ping 10.0.0.30     <= Routed
docker1:~$ docker exec -it demodhcp ping 8.8.8.8       <= NATed
simple1:~$ ping 192.168.0.1
simple1:~$ ping 192.168.0.10
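With forwarding enabled and the MASQUERADE rule in place, where a packet from the overlay ends up depends only on its destination. A small sketch of the decision the gateway effectively makes (illustrative, not netfilter itself):

```python
import ipaddress

OVERLAY = ipaddress.ip_network("192.168.0.0/24")
UNDERLAY = ipaddress.ip_network("10.0.0.0/16")

def gateway_path(dst: str) -> str:
    """How traffic from a 192.168.0.0/24 container leaves via gateway0."""
    ip = ipaddress.ip_address(dst)
    if ip in OVERLAY:
        return "local (VXLAN)"  # stays on br42, never routed
    if ip in UNDERLAY:
        return "routed"         # forwarded, no NAT (excluded by ! -d 10.0.0.0/16)
    return "NATed"              # MASQUERADE out of eth0

print(gateway_path("192.168.0.1"), gateway_path("10.0.0.30"), gateway_path("8.8.8.8"))
```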
Another nice thing we can do

[Diagram: same setup, plus a QEMU VM attached to the overlay through a tap0 interface; dhclient inside the VM gets 192.168.0.10x from the overlay's DHCP server]
What could a real-life setup look like?

[Diagram: many Docker hosts, each running quagga, peer over BGP/EVPN with route reflectors RR1 and RR2; a BGP/EVPN router bridges the VXLAN and routed worlds, announcing routes to the VXLAN networks and learning routes from the non-VXLAN infrastructure, where standard hosts live]
How does it compare to other solutions?

                      Data plane            Control Plane
Swarm Classic         VXLAN                 External KV Store (Consul / Etcd)
SwarmKit              VXLAN                 SwarmKit (Raft / Gossip implementation)
Flannel host-gw       Routing               Etcd / Kubernetes API
Flannel VXLAN         VXLAN                 Etcd / Kubernetes API
Calico                Routing / IPIP        Etcd / BGP (IP prefixes)
Weave Classic         Custom                Custom
Weave Fast Datapath   VXLAN                 Custom
Contiv                VXLAN, Routing, L2    Etcd / BGP (IP and maybe EVPN)

Disclaimer: almost no hands-on experience with these (mostly from documentation and discussions)
Perspectives
● FRRouting
  Quagga fork
  Cumulus has switched to FRRouting and merged EVPN support
● Open vSwitch
  Alternative to the Linux native bridge and VXLAN
  (Possibly) better performance and more features
  Not sure how Quagga/FRRouting would integrate with Open vSwitch
● Performance
  Measure the impact of VXLAN
  Test VXLAN acceleration when available on NICs
● CNI plugin (to test on Kubernetes, and mostly for learning)
Thank you!
Questions?
https://github.com/lbernail/dockercon2017
@lbernail
