Marian Marinov
mm@siteground.com
Chief System Architect
Head of the DevOps department
Challenges with High-density Networks
❖ Who am I?
- Chief System Architect of SiteGround.com
- Sysadmin since 1996
- Organizer of OpenFest, BG Perl Workshops, LUG-BG and others
- Teaching Network Security and Linux System Administration courses at Sofia University and SoftUni
A bit of history
- Physical Machines
  - with a few hundred IPs per machine
- Virtual Machines
  - with tens of IPs per VM
  - with hundreds of VMs per machine
- Containers
  - with several IPs per container
  - with thousands of containers per machine
- ONE 42U Rack
  - with physical machines: 42 * 100 = 4200 IPs
  - with VMs: 42 * (100 * 10) = 42 000 IPs
  - with containers: 42 * (1000 * 2) = 84 000 IPs
❖ Problems
- broadcast domains
- MAC/ARP address tables
- bandwidth
Broadcast domains
- ARP/ICMPv6
- DHCP/mDNS
- HA and Gossip-like protocols
- 42 machines... "Not Great, Not Terrible"
- 420 or 4200... that is a problem
MAC/ARP address tables
- A typical DataCenter switch has around:
  ~ 1 Tbps throughput
  ~ 600-700 Mpps switching
  ~ 48-64k MAC table entries
- Remember: 84k IPs with 1000 containers per machine
❖ What if... we connect 10 racks?
- 84k IPs per rack
- 840k IPs for this network
❖ What about the servers?
- the Linux ARP table defaults to 1024 entries
- If you want to change the defaults on Linux, these are the IP sysctl options:
  net.ipv4.neigh.default.gc_thresh1
  net.ipv4.neigh.default.gc_thresh2
  net.ipv4.neigh.default.gc_thresh3
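A minimal sketch of raising those limits, assuming a host really has to track tens of thousands of neighbours; the file name and values below are illustrative, not a recommendation:

  # /etc/sysctl.d/90-neigh.conf (hypothetical)
  net.ipv4.neigh.default.gc_thresh1 = 16384   # below this, the garbage collector leaves the table alone
  net.ipv4.neigh.default.gc_thresh2 = 32768   # soft limit: GC becomes aggressive above it
  net.ipv4.neigh.default.gc_thresh3 = 65536   # hard limit: new entries start being dropped
  net.ipv6.neigh.default.gc_thresh1 = 16384   # the IPv6 neighbour table has the same knobs
  net.ipv6.neigh.default.gc_thresh2 = 32768
  net.ipv6.neigh.default.gc_thresh3 = 65536

Load the file with "sysctl --system", or set a single value on the fly with "sysctl -w".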
Bandwidth
Assumptions:
- a single physical machine avg. 200Mbps
- a single VM avg. 100Mbps (requires a 10Gbps physical uplink)
- a single container avg. 100Mbps (requires a 100Gbps physical uplink)
Per 42U rack:
- 42 * 200Mbps = 8.4 Gbps
- 42 * 100 VMs * 100Mbps = 420 Gbps
- 42 * 1000 LXC * 100Mbps = 4.2 Tbps
- linking these switches: typical DataCenter switches link with multiple 10 or 40Gbps uplinks
- it is rare to see 100Gbps uplinks
- getting 420 Gbps out of a single switch would require at least four 100Gbps uplinks
❖ Solutions
- VLANs/QinQ/MPLS
- Layered network designs
- VXLAN/NVGRE
VLANs/QinQ/MPLS
- reduces the broadcast domains
- if the switches support per-VLAN MAC address tables, it also reduces the MAC address table issues
- it does not solve the capacity/BW issues
- it introduces complexity in the setup (see the sketch below)
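For illustration, a minimal iproute2 sketch of a plain VLAN and a QinQ (802.1ad) stacked sub-interface on a host; the interface names and VLAN IDs are made up:

  # plain 802.1Q VLAN 100 on eth0
  ip link add link eth0 name eth0.100 type vlan id 100
  ip link set eth0.100 up

  # QinQ: outer 802.1ad S-tag 100 with an inner 802.1Q C-tag 200
  ip link add link eth0 name eth0.100q type vlan proto 802.1ad id 100
  ip link add link eth0.100q name eth0.100q.200 type vlan proto 802.1Q id 200
  ip link set eth0.100q up
  ip link set eth0.100q.200 up

The switches in the path still have to carry the corresponding VLANs/S-tags, which is where most of the added operational complexity lives.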
Layered Designs
- Cisco's Spine and Leaf topology
- Facebook's Data Center Fabric (paper)
- Google's Jupiter fabrics
VXLAN/NVGRE
[Diagram: two hosts with underlay addresses 192.168.0.15 and 192.168.0.28 carrying overlay addresses 10.0.1.12 / 10.0.1.17 and 10.1.5.100 / 10.1.5.204; MAC addresses 5a:7a:de:23:0b:27 and 00:00:5e:00:01:2a shown against 192.168.0.15]
VXLAN
- Point-to-Point or Multicast
- 50 bytes of overhead
- Jumbo frames are preferred
- configured statically with iproute2 (see the Linux Documentation vxlan.txt and the sketch below)
- or dynamically with OpenVswitch
- supported in switches from Arista and Brocade
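A minimal sketch of the static iproute2 setup; the VNI, multicast group, addresses and interface names are illustrative:

  # multicast mode: BUM traffic is flooded to the group and VTEPs learn each other dynamically
  ip link add vxlan42 type vxlan id 42 group 239.1.1.1 dev eth0 dstport 4789
  ip addr add 10.0.1.12/24 dev vxlan42
  ip link set vxlan42 up

  # point-to-point mode: a single remote VTEP, no multicast required
  ip link add vxlan43 type vxlan id 43 local 192.168.0.15 remote 192.168.0.28 dev eth0 dstport 4789

Because of the 50 bytes of encapsulation, the underlay MTU should exceed the overlay MTU by at least that much, hence the preference for jumbo frames.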
NVGRE
- developed by Microsoft, supported only by Microsoft
- works over GRE tunnels
- Point-to-Point or Multicast
- 42 bytes of overhead
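Mainline Linux has no dedicated NVGRE driver; the closest building block is an Ethernet-over-GRE (gretap) tunnel, sketched here with illustrative names, addresses and key:

  # Ethernet over GRE; the 32-bit key plays a role similar to NVGRE's virtual subnet ID
  ip link add gretap1 type gretap local 192.168.0.15 remote 192.168.0.28 key 5000
  ip link set gretap1 up

  # the tunnel can then be bridged together with the VM/container interfaces
  ip link add br-nvgre type bridge
  ip link set gretap1 master br-nvgre
  ip link set br-nvgre up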
Geneve
- Point-to-Point or Multicast
- unifies NVGRE, VXLAN and STT (Stateless Transport Tunneling)
- supported in OpenVswitch
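A minimal Open vSwitch sketch of a point-to-point Geneve port; the bridge name, remote VTEP address and key are illustrative:

  # create a bridge and attach a Geneve tunnel port towards a remote VTEP
  ovs-vsctl add-br br-overlay
  ovs-vsctl add-port br-overlay geneve0 -- set interface geneve0 type=geneve \
      options:remote_ip=192.168.0.28 options:key=5000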
Overlay technologies
Cisco made a good comparison of the overlay technologies; you can read the paper here.
Conclusion
- Layer 2 should end at the ToR switches
- Layer 3 should be used for anything more complex
- Overlays can be used to accommodate specific client requirements
Marian Marinov
mm@siteground.com

Challenges with High-density Networks