David Lapsley (@devlaps), Chet Burgess (@cfbIV), Kahou Lei (@kahou82)
May 20, 2015
OpenStack Vancouver Summit
VXLAN Distributed Service Node
Virtualization in the data center has changed network requirements:
• Number of end hosts ↑
• Number of networks ↑
• Bandwidth requirements ↑
This is a problem for
traditional data center
networks
• L2 Access with L3 Aggregation
• Wasted capacity: STP blocks ports to prevent loops
• VLAN exhaustion: only ~4K networks with the 12-bit 802.1Q VLAN ID
• ToR scalability: hardware tables must scale with the number of endpoints
Traditional Data Centers
L3 to the edge can help
• L3 is Scalable
• Well known and supported
• Equal Cost Multi-Path (ECMP) Routing
• Each link active at all times
L3
How do we scope
tenants/projects?
• MAC over UDP/IP overlay
• Re-uses existing IP core (L3 ECMP, No STP)
• Reduces pressure on ToR L2 tables
• Supports 16M+ virtual networks (VNIs)
• Maintains L2 bridging semantics
VXLAN
VXLAN Encapsulation
• Virtual Network Identifier
• 24 bits → 16+ million networks
• VXLAN Tunnel End Point (VTEP)
• Encapsulation, Decapsulation
• Listen on UDP port 4789 (IANA), 8472 (Linux default) for incoming VXLAN
packets
• VNI to VTEP IP mapping
VXLAN Components
VXLAN Example Deployment
[Diagram: Hypervisor 1 (VM1/VM2 on tenant bridges br100 and br101) and Hypervisor 2 (VM3/VM4 on tenant bridges br100 and br101), each with VTEPs vxlan100 and vxlan101, connect via eth0 over an L3 network carrying VXLAN segments 100 and 101.]
Inner frame: DMAC | SMAC | 802.1Q | EType | Payload | CRC
Encapsulated: Outer MAC | Outer IP | Outer UDP | VXLAN | Payload | CRC
VXLAN header: Flags (8 bits) | Reserved (24 bits) | Network Identifier (24 bits) | Reserved (8 bits)
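The header layout above implies the familiar 50 bytes of VXLAN overhead, which is where the `mtu 1450` used later in the deck comes from. A quick sanity check, assuming an IPv4 underlay with a 1500-byte MTU:

```python
# Bytes of header consumed inside a 1500-byte underlay MTU before the
# inner IP payload (IPv4 underlay assumed):
outer_ip, outer_udp, vxlan_hdr, inner_eth = 20, 8, 8, 14

overhead = outer_ip + outer_udp + vxlan_hdr + inner_eth
assert overhead == 50

# Hence the `mtu 1450` set on the vxlan interface later in the deck:
vxlan_mtu = 1500 - overhead
print(vxlan_mtu)  # 1450
```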
• Broadcast, unknown, and multicast (BUM) packets (e.g. ARP, DHCP) are flooded to all VTEPs for the given VNI
• Two mechanisms are used:
• Multicast:
• A multicast address and VNI are configured for each VXLAN segment
• VTEPs send IGMP joins/leaves as VMs spin up/down
• The broadcast domain is implemented using multicast
• Service Node:
• A “central” service node maintains the mapping of VNIs to VTEP IPs
Broadcast, Unknown and Multicast Packets
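The service-node mechanism can be sketched as a toy Python model (class and method names are hypothetical; a real service node would re-encapsulate each copy and send it over UDP to the destination VTEP):

```python
from collections import defaultdict

class ServiceNode:
    """Toy model of the 'central service node' approach: maintain the
    VNI -> {VTEP IPs} mapping and replicate a BUM packet to every VTEP
    registered for that VNI except the sender."""

    def __init__(self):
        self.vteps = defaultdict(set)  # VNI -> set of VTEP IPs

    def register(self, vni, vtep_ip):
        self.vteps[vni].add(vtep_ip)

    def flood(self, vni, src_vtep, packet):
        # Return (destination VTEP, packet) pairs; a real node would
        # transmit these as VXLAN-encapsulated UDP datagrams.
        return [(ip, packet) for ip in sorted(self.vteps[vni]) if ip != src_vtep]

node = ServiceNode()
node.register(100, "1.1.1.1")
node.register(100, "2.2.2.2")
out = node.flood(100, "1.1.1.1", b"arp-request")
print(out)  # [('2.2.2.2', b'arp-request')]
```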
Service Node
[Diagram: Hypervisor 1 (VMs on br100/br101, VTEPs vxlan100 at 1.1.1.1 and vxlan101 at 3.3.3.3) and Hypervisor 2 (VMs on br100/br101, VTEPs vxlan100 at 2.2.2.2 and vxlan101 at 4.4.4.4) connect via eth0 over an L3 network to a remote Service Node, which holds the mapping:]
VNI | VTEPs
100 | 1.1.1.1, 2.2.2.2
101 | 3.3.3.3, 4.4.4.4
Central Service Node
Distributed Service Node
VXLAN Distributed Service
Node
Design
[Diagram: Controller 1, Controller 2, and Controller 3 each run an mcrouter + memcache pair. Hypervisor 1 through Hypervisor 500 each run a Distributed VXLAN Service Node alongside tenant bridges br100/br101, their VMs, and VTEPs vxlan100/vxlan101, all attached via eth0 to a shared L3 network.]
• Multi-process Python program (multiprocessing module)
• Runs on every hypervisor
• Shares state using a distributed cache
• FB mcrouter – memcached protocol router (5B requests/second at peak!)
• Listens for new VTEP registrations
• Forwards new mappings to the distributed cache
• Listens for broadcast, unknown, and multicast packets
• Floods them to all VTEPs in the virtual network
VXLAN Distributed Service Node
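A minimal sketch of the state-sharing idea, using a hypothetical in-memory stand-in for the mcrouter/memcached cluster. The set maintenance uses only memcached-style `add`/`append`/`get` operations, following the "maintaining a set in Memcached" pattern cited in the references:

```python
class FakeMemcache:
    """In-memory stand-in for mcrouter/memcached (hypothetical; the real
    VDSN speaks the memcached protocol through mcrouter)."""

    def __init__(self):
        self.store = {}

    def add(self, key, value):
        # memcached add: store only if the key does not yet exist.
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def append(self, key, value):
        if key not in self.store:
            return False
        self.store[key] += value
        return True

    def get(self, key):
        return self.store.get(key)

def register_vtep(cache, vni, vtep_ip):
    """Record a VTEP under its VNI as a comma-delimited set."""
    entry = vtep_ip + ","
    if not cache.add("vni:%d" % vni, entry):
        current = cache.get("vni:%d" % vni)
        if vtep_ip not in current.split(","):
            cache.append("vni:%d" % vni, entry)

def vteps_for(cache, vni):
    raw = cache.get("vni:%d" % vni) or ""
    return [ip for ip in raw.split(",") if ip]

cache = FakeMemcache()
register_vtep(cache, 100, "1.1.1.1")
register_vtep(cache, 100, "2.2.2.2")
register_vtep(cache, 100, "1.1.1.1")  # duplicate registration is a no-op
print(vteps_for(cache, 100))  # ['1.1.1.1', '2.2.2.2']
```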
Service Node
Configuring VXLAN
ip link add vxlan1 type vxlan id 1 remote 169.254.1.1 dev eth0
ip addr add 172.16.1.1/24 dev vxlan1
ip link set dev vxlan1 mtu 1450
ip link set dev vxlan1 up
Creating VXLAN interfaces
root@mhv2:~# ip addr show vxlan1
4: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc
noqueue state UNKNOWN group default
link/ether f2:af:3f:62:cf:65 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.5/24 scope global vxlan1
valid_lft forever preferred_lft forever
inet6 fe80::f0af:3fff:fe62:cf65/64 scope link
valid_lft forever preferred_lft forever
Configured VXLAN Interface
iptables -t nat -A OUTPUT -d 169.254.1.1/32 -p udp -m udp --dport 8472 -j DNAT --to-destination 127.0.0.1:8473
The @cfbIV rule
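The rule DNATs VXLAN traffic addressed to the dummy remote 169.254.1.1 into a local listener on 127.0.0.1:8473, where the service node can inspect flooded frames. A sketch of the first thing such a listener would do — peel off the 8-byte VXLAN header (RFC 7348) to recover the VNI (function name is hypothetical):

```python
import struct

def parse_vxlan_header(dgram):
    """Parse the 8-byte VXLAN header off a UDP payload such as the
    DNAT'ed datagrams a listener on 127.0.0.1:8473 would receive.
    Returns (vni, inner_frame)."""
    flags, vni_reserved = struct.unpack("!II", dgram[:8])
    # The I flag (0x08 in the first byte) marks the VNI as valid.
    assert (flags >> 24) & 0x08, "I flag must be set for a valid VNI"
    vni = vni_reserved >> 8  # VNI is the top 24 bits of the second word
    return vni, dgram[8:]

# Example datagram: I flag set, VNI 100, two bytes of inner frame.
hdr = struct.pack("!II", 0x08 << 24, 100 << 8)
vni, inner = parse_vxlan_header(hdr + b"\xde\xad")
print(vni)  # 100
```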
Demo
Demo Setup
[Diagram: Controllers 1–3 (192.168.225.226–.228), each running mcrouter + memcache, and Hypervisor 1 (192.168.225.231) and Hypervisor 500 (192.168.225.232), each running a VXLAN Distributed Service Node with VTEPs on 172.16.1.0/24 and 172.16.3.0/24, all connected over the L3 network.]
• Open source VDSN source code
• Integration with Neutron (if community interest)
• Performance and scalability testing
Future work
References
• Presentation slides: http://bit.ly/vdsn-presentation
• VDSN Source Code and Ansible playbooks:
• Simple, accessible model, horizontal scaling
• http://bit.ly/vdsn-ansible
• VDSN code coming soon (@devlaps, #devlaps)
• Production Code:
• Multi-area VXLAN! Highly optimized, requires expertise to
configure/troubleshoot
• http://bit.ly/multi-area-vxlan
References
• C. Burgess, N. Leake, L3 + VXLAN Made Practical,
OpenStack Summit Spring 2014.
• M. Mahalingam, et al., Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks, https://tools.ietf.org/html/rfc7348
References
• Sanjay K. Hooda, Shyam Kapadia, Padmanabhan
Krishnan, Using TRILL, FabricPath, and VXLAN:
Designing Massively Scalable Data Centers (MSDC) with
Overlays, Cisco Press, 2014.
• Introducing McRouter, http://bit.ly/introducing-mcrouter
References
• McRouter on github,
https://github.com/facebook/mcrouter
• Pyroute2, https://pypi.python.org/pypi/pyroute2
• Maintaining a set in Memcached, http://bit.ly/memcache-sets
• Ansible, http://docs.ansible.com
References
@devlaps, dlapsley@cisco.com
Thank You