Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

LinuxCon 2015 Linux Kernel Networking Walkthrough


Published on

This presentation features a walk through the Linux kernel networking stack for users and developers. It will cover insights into both, existing essential networking features and recent developments and will show how to use them properly. Our starting point is the network card driver as it feeds a packet into the stack. We will follow the packet as it traverses through various subsystems such as packet filtering, routing, protocol stacks, and the socket layer. We will pause here and there to look into concepts such as networking namespaces, segmentation offloading, TCP small queues, and low latency polling and will discuss how to configure them.

Published in: Technology
  • Login to see the comments

LinuxCon 2015 Linux Kernel Networking Walkthrough

  1. 1. Kernel Networking Walkthrough LinuxCon 2015, Seattle Thomas Graf Kernel & Open vSwitch Team Noiro Networks (Cisco)
  2. 2. Agenda ● Getting packets from/to the NIC ● NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO ● Packet processing ● RX Handler, IP Processing, TCP Processing, TCP Fast Open ● Queuing from/to userspace ● Socket Buffers, Flow Control, TCP Small Queues ● Q&A
  3. 3. Touring the Network Stack Expectation Reality
  4. 4. How does a packet get in and out of the Network Stack?
  5. 5. Receive & Transmit Process Ring Buffer DMA Parse L2 & IP Parse TCP/UDP Socket Buffer Task / Container read() Ring Buffer Construct IP Construct TCP/UDP Local? Socket Buffer Forward Route? write() NIC Network Stack (Kernel Space) Process (User Space)
  6. 6. The 3 ways into the Network Stack Ring Buffer Network Stack Interrupt Driven A Ring Buffer Network Stack NAPI based Polling poll() B Ring Buffer Network Stack Busy Polling busy_poll() Task C
  7. 7. RSS – Receive Side Scaling ● NIC distributes packets across multiple RX queues allowing for parallel processing. ● Separate IRQ per RX queue, thus selects CPU to run hardware interrupt handler on. RX-queue-1 RX-queue-2 RX-queue-3 RX-queue-4 CPU 1 CPU 2 CPU 1 CPU 2 filter
  8. 8. RPS – Receive Packet Steering ● Software filter to select CPU # for processing ● Use it to ... RX-queue-1 RX-queue-2 RX-queue-3 RX-queue-4 CPU 1 CPU 2 CPU 3 CPU 1 CPU 2 CPU 3 ... redo queue - CPU mapping ... distribute single queue to multiple CPUs
  9. 9. Hardware Offload ● RX/TX Checksumming ● Perform CPU intensive checksumming in hardware. ● Virtual LAN filtering and tag stripping ● Strip 802.1Q header and store VLAN ID in network packet meta data. ● Filter out unsubscribed VLANs. ● Segmentation Offload
  10. 10. Generic Receive Offload (ethtool -K eth0 gro on) Ring Buffer Network Stack poll() NAPI based GRO MTU GRO Up to 64K It's more effective to process 1x64K bytes packet instead of 40x1500 bytes packets.
  11. 11. Segmentation Offload (ethtool -K eth0 tso on) (ethtool -K eth0 gso on) Ring Buffer Network Stack Generic Segmentation Offload (GSO) ethtool -K eth0 gso on MTU TCP Segmentation Offload (TSO) ethtool -K eth0 tso on MTU Up to 64K
  12. 12. How does a packet get through the Network Stack? (c) Karen Sagovac
  13. 13. Packet Processing Link Layer Ingress QoS Proto Handler IPv4 IPv6 ARP IPX ... Drop The Feast! RX Handler Open vSwitch Team Bonding Bridge macvlan macvtap Packet Socket ETH_P_ALL tcpdump
  14. 14. IP Processing IP Handler Route Lookup PREROUTING IPv4 Construction Route Lookup Local Output OUTPUT POSTROUTINGLink Layer FORWARD Forwarding L4 (TCP, ...) Local Delivery INPUT User Space
  15. 15. TCP Processing IP Socket Filter Receive TCP Parse TCP Lookup Socket Backlog socket locked Receive Socket Buffer Prequeue task exists process context ← softirq Task poll()read()
  16. 16. TCP Fast Open (net.ipv4.tcp_fastopen) 2nd Req SYN SYN+ACK ACK+HTTP GET Data 2x RTT SYN+Cookie+HTTP GET SYN+ACK+Data 2nd Req 1x RTT Client Server SYN SYN+ACK ACK+HTTP GET 1st Req Data 2x RTT2x RTT Regular Client Server SYN SYN+ACK+Cookie ACK+HTTP GET 1st Req Data 2x RTT Fast Open
  17. 17. Memory Accounting & Flow Control
  18. 18. Socket Buffers & Flow Control (net.ipv4.tcp_{r|w}mem) ssh TX Ring Buffer TCP/IP Socket Buffer wmem overlimit? Block or EWOULDBLOCK wmem += packet-size ssh RX Ring Buffer TCP/IP Socket Buffer rmem -= packet-size rmem overlimit? Reduce TCP Window rmem += packet-size wmem -= packet-size write()
  19. 19. TCP Small Queues (net.ipv4.tcp_limit_output_bytes) ssh TX Ring Buffer Driver TCP/IP Socket Buffer write() Queuing Discipline torrent Socket Buffer write() TSQ: max 128Kb in flight per socket
  20. 20. Q&A Contact: ● E-Mail: ● Twitter: @tgraf__