Hardware accelerated switching with Linux @ SWLUG Talks May 2014

Nat Morris will take us through the use of Linux on a new generation of hardware accelerated network switches

  • The logos to the right of the team members represent the companies where they previously worked.
  • Cumulus Networks’ HCL focuses on fixed boxes (leaf/spine): the same Broadcom silicon as Arista switches, so the same hardware performance at a lower price point. Arista has additional hardware platforms for special purposes.
    Choice: Cumulus focuses on breadth of platforms and vendors for best of breed; Arista supports black boxes and many different configurations. Cumulus doesn’t need differentiated price points for low-end configurations, as they are already cheaper.
    Cumulus Linux is a Linux OS, and the network services and apps that run on top of it are very rich. Arista, in contrast, is a Linux-based OS: EOS integrates all apps in one image, and control is limited to some Linux containers.
    Cloud networking designs: includes L2/host multi-homing*, L3/ECMP, and L2-over-L3 VXLAN. Customers are moving to L3 Clos fabrics, so L2/host multi-homing is all that’s needed, not MLAG.
    Orchestration: a comprehensive set of tools today, on par with Arista, with rapid innovation; our model offers the same orchestration tools and more (e.g. Midokura). OpenFlow is supported with other OSes such as Big Switch.
    Automation: Cumulus Linux has zero-touch provisioning, automated install, and better DevOps integration (thanks to unmodified Linux and scripting languages).
    Application visibility: leverage server-style tools and hardware counters/functionality. Arista may have stronger networking tracers, advanced mirroring (DANZ), and advanced congestion management (LANZ) tools today. Congestion management counters will be enabled with the switchd file system; more can be done for simplification, but similar capability can be enabled through scripting.
    Programmable foundation: driver abstractions, eAPI, unmodified Linux. Cumulus Linux driver abstractions are unchanged (in contrast, Arista uses SysDB to provide visibility into its own driver), and Cumulus Linux networking data structures are unchanged (Arista uses its own, so the user is limited to management-plane/control-plane changes).
  • Bare metal switches have been around for a while, but each ran a proprietary OS and was not robust. Now we have the best OS to ride along with the best switches.
  • Just as BIOS and PXE allow you to install an OS on a server from a remote image, the combination of U-Boot and ONIE does the same for bare metal switches. We require ONIE preloaded on HCL hardware because U-Boot differs across vendor devices, and U-Boot itself is not very user friendly.
    We created ONIE and gave it to the Open Compute Project (OCP); it facilitates easy installation of any network OS, not just Cumulus Linux (Pica8 is a competitive example). Now you have your choice of installing whatever OS you want, not just what comes with the switch (e.g. Cisco IOS as an OEM example, or FASTPATH, Broadcom’s OS).
    Think of ONIE as PXE on steroids. ONIE is a small BusyBox Linux distribution with a set of fetch-and-execute Bash scripts. It leverages modern ways of discovering networks using what is built into Linux, e.g. IPv6 neighbor discovery, DHCPv6, and DHCPv4.
    U-Boot is very good at probing the bus and takes about 1 MB; it lives in boot flash dedicated to booting the hardware, separate from the operating system flash. ONIE builds on top of this and takes about 3.5 MB. ONIE is extremely well documented, flexible, and embraced by the open source community (source has been on GitHub since summer 2013).
  • 2.1 releases are subject to hardware availability. Make sure to explicitly order the ONIE part, as some models ship without ONIE.
  • swp numbering starts at 1 instead of 0 to match the numbering typically found on the front panel silk screen.
  • Within Linux is a construct called netlink, the communication channel between user space and the kernel. Everything in the user space box talks to the kernel through netlink (not shown on the diagram). switchd snoops the netlink traffic and can react, e.g. whenever you add or remove a route. Color decode: green with an orange border pushes things down to the kernel.
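    One way to watch the same netlink stream that switchd snoops is iproute2's `ip monitor`, which subscribes to the kernel's netlink notification groups. A minimal sketch that runs on any Linux box (the interface name and prefix in the comment are illustrative, and the timeout just stops it after a few seconds):

    ```shell
    # Subscribe to netlink route notifications for 3 seconds, then exit.
    # switchd listens on this same channel to learn about route changes.
    timeout 3 ip monitor route || true

    # In another terminal, adding a route produces an event here, e.g.:
    #   sudo ip route add 192.0.2.0/24 dev eth0
    ```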
  • As packets flow across the merchant silicon, switchd updates the counters in the kernel in real time. The advantage of this approach is that you can deploy any interface monitoring tool you would normally use on Linux. How do you access counters? Use netstat; it works the same as on a server, even on a switch port.
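    Because the counters live in the ordinary kernel locations, they can also be read straight from /sys. A small sketch (uses `lo` so it runs on any Linux machine; on a switch you would pass a port name like `swp1`):

    ```shell
    #!/bin/sh
    # Print the kernel-maintained counters for an interface. These are the
    # same counters netstat and 'ip -s link' report; on Cumulus Linux,
    # switchd keeps them in sync with the switching ASIC in real time.
    IFACE="${1:-lo}"
    for counter in rx_packets tx_packets rx_bytes tx_bytes; do
        printf '%s: %s\n' "$counter" "$(cat /sys/class/net/$IFACE/statistics/$counter)"
    done
    ```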
  • Technically the up and down verses are needed only if the method is manual, but we add them here for consistency. In the forthcoming CL 2.1, the commands for ifup and ifdown will be simplified.
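    For a manual-method port, the /etc/network/interfaces stanza would look like the sketch below (swp2 is an arbitrary example port):

    ```
    auto swp2
    iface swp2 inet manual
        up ip link set $IFACE up
        down ip link set $IFACE down
    ```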
  • The basic use of bridging is to connect all of the physical interfaces in the system into a single Layer 2 domain. This results in a switch that behaves like a Cisco Catalyst device. VLANs are called bridging because under Linux the way you create a VLAN is by creating a bridge; we try to stick to Linux terms because CL is Linux. VLAN tags are implemented as VLAN sub-interfaces. The traffic from multiple bridges or VLAN segments can be multiplexed on the same data link. Cumulus Linux supports the 802.1Q VLAN trunk interface, which carries traffic from multiple VLANs, with each packet encapsulated with an 802.1Q VLAN tag. The VLAN ID carried in the VLAN tag associates the packet with the corresponding VLAN segment. Each VLAN sub-interface of the VLAN trunk can be added as a member interface of the corresponding bridge.
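    A minimal sketch of that basic case, putting untagged physical ports into one bridge in /etc/network/interfaces (port names are illustrative):

    ```
    auto br0
    iface br0 inet manual
        bridge_ports swp1 swp2 swp3 swp4
        up ip link set $IFACE up
        down ip link set $IFACE down
    ```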
  • Add bridge_waitport 0 if you don’t want Cumulus Linux to wait while trying to connect to a switch port; by default it waits 30 seconds. Setting it to 0 is handy if ports are not connected or CL is not licensed.
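    In /etc/network/interfaces the option sits inside the bridge stanza, e.g. (a sketch with illustrative port names):

    ```
    auto br0
    iface br0 inet manual
        bridge_ports swp1 swp2
        bridge_waitport 0
    ```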
  • Cumulus Linux-specific packages are organized into 5 repositories (compiled for the proper architecture: PowerPC, MIPS, ARM, x86) that we manage and support, in contrast to Debian.org (which we don’t control):
    main: all packages in the CL image that support base functionality.
    updates: updates to any packages in main, not security related.
    security-updates: security-related updates (addressing known exploits) to any packages in main; you should prioritize these.
    addons: additional packages not in the image, e.g. Puppet from Puppet Labs.
    testing: packages undergoing development; experimental and not QA’ed.
    The repository is publicly accessible from the internet. We use the same apt-get infrastructure as Debian; see the KB article if you need to set up a local apt-get repository. switchd is the main item whose source you cannot see; most of our commands are written in Python, so you can see the source. We’ve vetted, and in some respect touched, all the daemons we ship. (If some customers are concerned about apparent reliability/support problems with Quagga in the community: we have a large customer that has run Quagga for over a year to manage over 6,000 switches for OSPF and IPv4 without problems.) Knowing what customers use and/or need will help us decide and prioritize what to test and include (e.g. vi, Emacs, Ruby, Perl, Python; some customers want Puppet, Chef, Ansible). The * on the slide means we do not control the source, but we’ll do as much as we can accordingly.
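    Expressed as an apt source, the layout might look like the sketch below. The repository URL and suite name here are placeholders, not the real ones; the component names simply mirror the five repositories above:

    ```
    # Hypothetical /etc/apt/sources.list entries for the Cumulus repositories.
    deb http://repo.example.com/cumulus wheezy main updates security-updates addons
    # Experimental, not QA'ed packages:
    deb http://repo.example.com/cumulus wheezy testing
    ```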
  • The traditional hierarchical network was designed for server traffic that mostly went out and in, not server-to-server traffic within the datacenter. Complexity increases with more protocols, particularly proprietary ones (vs. standard, open protocols). With virtualization, today’s datacenters have many more nodes: each server runs dozens of VMs that need to talk to each other, or VMs need to be able to move between server hosts through vMotion. Just because this is the way Cisco has taught us doesn’t mean it’s the most efficient way.
  • We’ve taken a page out of the playbook of the largest datacenters. Simpler: a single protocol (BGP or OSPF) provides ECMP. Predictable latency: everything is a single hop away. Horizontally scalable: scale beyond two aggregation switches, with higher bandwidth through ECMP and no blocked ports. Better failure behavior: if one spine fails, there is less impact than in a traditional hierarchical topology (or an MLAG-based design) where losing an aggregation switch costs 50% of the bandwidth. The two leaves connecting to the core are known as datacenter leaves. Why don’t we connect core switches to spine switches? We want to isolate external traffic, and core ports are expensive, so hooking the core to all the spines becomes cost prohibitive as you scale.
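    Since the routing suite is stock Quagga, the "single protocol provides ECMP" point reduces to a few lines of configuration. A hedged sketch of a leaf's bgpd.conf peering with four spines (the AS numbers and addresses are made up for illustration):

    ```
    router bgp 65011
     bgp router-id 10.0.0.11
     ! Install up to 4 equal-cost BGP paths in the kernel (ECMP)
     maximum-paths 4
     neighbor 10.1.0.1 remote-as 65000
     neighbor 10.2.0.1 remote-as 65000
     neighbor 10.3.0.1 remote-as 65000
     neighbor 10.4.0.1 remote-as 65000
    ```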


  • 1. Hardware accelerated switching with Linux Nat Morris 26th April 2014 @ South Wales Linux User Group
  • 2. About me Nat Morris • Based in Haverfordwest (beyond the M4) • Team lead, Cumulus Networks • Director & Board Member, UK Network Operators Forum (UKNOF) • Feeder of dogs • Attended first SWLUG meeting in 2001 Twitter • @natmorris cumulusnetworks.com 2
  • 3. About Cumulus Networks Team  JR Rivers, co-founder and CEO  Nolan Leake, co-founder and CTO  Shrijeet Mukherjee, VP Engineering  Reza Malekzadeh, VP Business  Jason Martin, VP Customer Experience Investors  Andreessen Horowitz  Battery Ventures  Sequoia Capital  Wing. VC (Peter Wagner)  Ed Bugnion, Diane Greene and Mendel Rosenblum (VMware founders) cumulusnetworks.com 3
  • 4. cumulusnetworks.com 4
  • 5. IP Fabric Networking Landscape cumulusnetworks.com 5 Network Hardware NetworkOS Open Closed
  • 6. The Expanding Landscape hardware operating system appapp hardware operating system app app Single Vendor Blob Multi-Vendor Ecosystem app app cumulusnetworks.com 6
  • 7. Expanding Ecosystem The missing piece: Cumulus® Linux® , bringing the Linux revolution to networking cumulusnetworks.com 7
  • 8. Understanding Characteristics of a Leaf Switch 8cumulusnetworks.com 10/40 Gigabit spine uplink ports Serial console port Ethernet Out-of-Band Management Port * SFP+ ports can be grouped together into a single QSFP 40G port via reverse connecting breakout cable options * QSFP ports can be broken out into four SFP+ ports via copper or optical transceiver options
  • 9. Understanding Characteristics of a Spine Switch 9cumulusnetworks.com Serial console port Ethernet Out-of-Band Management Port * QSFP ports can be broken out into four SFP+ ports via copper or optical breakout cable options
  • 10. Add leaf switches incrementally Connecting 40G Uplinks to Spine Layer 10cumulusnetworks.com Spine Switch 1 Leaf Switch 1 uplink 1 uplink 2 uplink 3 uplink 4 Spine Switch 2 Spine Switch 3 Spine Switch 4
  • 11. Anatomy of a Network Switch cumulusnetworks.com 11 ( Management Interfaces ) ( Data Plane ) CPU SoC DRAM Boot Flash Mass Storage Switching ASIC Serial Console Ethernet Mgmt Port 10Gb Port 40Gb Port… 10Gb Port 40Gb Port … PCIe
  • 12. Bare Metal Switch Provisioning Similar approach to installing OS on server  BIOS + PXE = U-Boot + ONIE (Open Network Install Environment)  Supported hardware (HCL) preloaded with ONIE  ONIE available on GitHub • http://onie.github.io/onie/ cumulusnetworks.com 12 bare metal server operating system app app app BIOS and PXE bare metal switch operating system app app app U-Boot and ONIE
  • 13. Hardware Vendors cumulusnetworks.com 13
  • 14. Operating System Vendors cumulusnetworks.com 14
  • 15. Hardware Compatibility List (HCL) cumulusnetworks.com 15 Switch Model Number Description Merchant Silicon Cumulus Linux Release Dell S6000-ON 32 x 40G-QSFP+ Trident II 2.1 or later Edge-Core AS6700-32X with ONIE 32 x 40G-QSFP+ Trident II 2.0.1 or later Penguin Computing Arctica 3200XL 32 x 40G-QSFP+ Trident II 2.0 or later Quanta QCT QuantaMesh T5032-LY6 32 x 40G-QSFP+ Trident II 2.0.1 or later Agema AG-7448CU 48 x 10G-SFP+ and 4 x 40G-QSFP+ Trident 1.5.0 or later Dell S4810-ON 48 x 10G-SFP+ and 4 x 40G-QSFP+ Trident 2.0.2 or later Edge-Core AS5600-52X with ONIE 48 x 10G-SFP+ and 4 x 40G-QSFP+ Trident+ 1.5.0 or later Edge-Core AS5610-52X with ONIE 48 x 10G-SFP+ and 4 x 40G-QSFP+ Trident+ 2.0.1 or later Edge-Core AS5710-54X with ONIE 48 x 10G-SFP+ and 6 x 40G-QSFP+ Trident II 2.1.x or later Penguin Computing Arctica 4804X 48 x 10G-SFP+ and 4 x 40G-QSFP+ Trident+ 1.5.1 or later Quanta QCT QuantaMesh T-3048-LY2 48 x 10G-SFP+ and 4 x 40G-QSFP+ Trident+ 1.5.0 or later Quanta QCT QuantaMesh T-3048-LY2R 48 x 10G-SFP+ and 4 x 40G-QSFP+ Trident+ 1.5.0 or later Quanta QCT QuantaMesh T5048-LY8 48 x 10G-SFP+ and 6 x 40G-QSFP+ Trident II 2.1.x or later* Edge-Core AS4600-54T with ONIE 48 x 1G-T and 4 x 10G-SFP+ Apollo2 2.0 or later Penguin Computing Arctica 4804i 48 x 1G-T and 4 x 10G-SFP+ Triumph2 1.5.1 or later Quanta QCT QuantaMesh T1048-LB9 48 x 1G-T and 4 x 10G-SFP+ FireBolt3 1.5.0 or later 40G10G1G
  • 16. Choice cumulusnetworks.com 16
  • 17. Choice cumulusnetworks.com 17
  • 18. ONIE: Bare Metal Install – First Time Boot Up cumulusnetworks.com 18 Boot Loader (HW Vendor Supplied) ONIE (HW Vendor Supplied) Installer (OS Vendor) Boot Loader • Low Level boot loader, configures CPU complex • Loads and boots ONIE ONIE • Linux Kernel with Busybox • Configures management Ethernet interface • Locates and executes an OS installer • Provides tools and environment for installer OS Installer • Available from network or USB • Linux executable • Installs vendor OS into mass storage Network OS (OS Vendor Supplied) Fetches Installs
  • 19. ONIE: Network OS Installer Discovery and Install Behavior cumulusnetworks.com 19 Configure Network Interface Locate Installer Run Installer • Uses DHCPv4, DHCPv6 • Configures Ethernet interface for IPv4 / IPv6 • Configures DNS and hostname • Determines the location of an installer executable • Examines local file systems, e.g. USB flash drives • Uses DHCP options, DNS Service Discovery, Multicast DNS and IPv6 Neighbors • Downloads installer via URL • Passes various environment variables to installer • Launches installer
  • 20. Networking Interfaces in Linux cumulusnetworks.com 20 Interface Description eth0 Physical interface for out-of-band management lo Loopback (logical interface redirecting to switch) in /etc/hosts Debian lists secondary swpN Physical interface for data plane traffic N corresponds to port number bridge Logical interface creating a single Layer 2 broadcast domain Traffic on sub-interfaces can be untagged or tagged Commonly called “VLAN” bond Logical interface aggregating two or more interfaces Commonly called “LAG” or “port channel”
  • 21. Pushing Changes Down cumulusnetworks.com 21 CPU, RAM, Flash, etc. Switch Silicon Front Panel Ports lldpd Routing Tables ARP Table Devices Bridge FDB Filter Tables Bonds VLANs LinuxKernel Virtual Kernel Ports Bridging mstpd ACLRouting Suite Quagga snmpd vconfig iptable ebtable ip6tableiproute2 VXLAN Bridges Switch HAL brctl Switch Driver UserSpace Quagga daemon, Quagga.conf, and vtysh CLI and /etc/network/interfaces switchd
  • 22. Show Interface Statistics cumulusnetworks.com 22 High level statistics for an interface cumulus@switch:~$ ip -s link show dev swp1 3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 500 link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 21780 242 0 0 0 242 TX: bytes packets errors dropped carrier collsns 1145554 11325 0 0 0 0 Low level statistics for an interface cumulus@switch:~$ sudo ethtool -S swp1
  • 23. Deconstructing /etc/network/interfaces  auto swp1  iface swp1 inet static  address  gateway  up ip link set $IFACE up  down ip link set $IFACE down cumulusnetworks.com 23 Bring up interface during boot up or service network reload Interface name Method: manual, static, dhcp ifup verse to bring up interface ifdown verse to bring down interface IP address settings for interface, only if using static Metho d Action manual No IP address configured by default static IP address configured using address and gateway options dhcp Obtain IP address using DHCP server
  • 24. Bridging Bridge = single isolated Layer 2 broadcast domain  Allows hosts connected to bridge ports (members) to discover each other without having to define routes  Traffic on ports is tagged (802.1q VLAN ID) or untagged (native) • Tagging involves using sub-interfaces, e.g. swpN.ID  Commonly called “VLAN” in traditional networking cumulusnetworks.com 24
  • 25. Defining a Bridge  auto br-vlan100  iface br-vlan100 inet manual  bridge_ports swp4.100 swp5.100  up ip link set $IFACE up  down ip link set $IFACE down cumulusnetworks.com 25 Bring up interface during boot up or service network reload Interface name Method: manual, static, dhcp ifup verse to bring up interface ifdown verse to bring down interface Bridge members. swp4, swp4.100, swp5, and swp5.100 must be defined first .100 creates sub-interface (turning swp into trunk port)
  • 26. Show Bridge cumulusnetworks.com 26 Show bridges Show bridge MAC addresses cumulus@switch:~$ brctl showmacs br-red port name mac addr is local? ageing timer swp4 06:90:70:22:a6:2e no 19.47 swp1 12:12:36:43:6f:9d no 40.50 swp1 44:38:39:00:12:9b yes 0.00 swp2 44:38:39:00:12:9c yes 0.00 cumulus@switch:~$ brctl show bridge name bridge id STP enabled interfaces br-vlan100 8000.089e01f89511 no swp5 swp6
  • 27. Cumulus Linux Packaging and Support cumulusnetworks.com 27 main updates security-updates addons testing  250 packages  ~ 20 Cumulus Linux packages  Examples:  Ruby, Perl, Python, Bash, IPtables, LLDP  Updates: packages revised  Security: known concerns, CVEs  User-identified utilities + libraries  Puppet, Facter, Chef, collectd  Early access utilities and libraries  Bird (CL 1.5)  40K+ packages Debian.org Fully Supported Fully Supported* Best Effort Best Effort* *packages not controlled by Cumulus
  • 28. Traditional Hierarchical Network Topology L3 L2 Access Aggregation Core Legacy and limitations  Not designed for today’s data center running modern workloads • Server density • Increased server-to-server traffic  Numerous proprietary protocols • STP/RSTP/PVSTP, VTP, HSRP, MLAG, LACP  “This is what we’ve been taught” 28
  • 29. L3 Is the Future L3 L2 ECMP Clos network (“spine/leaf”) 1. Simpler network (fewer protocols) 2. Standards-based (fewer proprietary features) 3. Predictable latency (every leaf is 1 hop away) 4. Horizontally scalable Leaf Spine Core 29
  • 30. Basic Clos Architecture (2-Tier Spine/Leaf) 30cumulusnetworks.com Optimized for high bandwidth East to West traffic patterns compute and storage network services Core or WAN Spine Layer Leaf Layer
  • 31. Basic Clos Architecture (3-Tier or 5-Stage) 31cumulusnetworks.com Leaf Spine InterPod Spine Network Services Leaf
  • 32. Ansible demo 32 spine 1 swp1 - 4 swp1 - 4 swp1 - 4 swp1 - 4 leaf 2 swp17 - 20 swp17 - 20 swp17 - 20 swp17 - 20 wbench leaf 1 spine 2eth0 eth0 eth0 eth0 eth1 eth0 swp30-33 swp34-37 swp30-33 swp34-37
  • 33. Questions 33
  • 34. © 2014 Cumulus Networks. Cumulus Networks, the Cumulus Networks Logo, and Cumulus Linux are trademarks or registered trademarks of Cumulus Networks, Inc. or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. The registered trademark Linux® is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis. Thank You! Bringing the Linux Revolution to Networking 34