Networking in Userspace
   Living on the edge




  Stephen Hemminger
  stephen@networkplumber.org
Problem Statement

      [Chart: packets per second (bidirectional) vs. packet size (64–1504 bytes),
       y-axis 0 to 20,000,000; source: Intel DPDK Overview]

Server vs Infrastructure
                    Server Packets     Network Infrastructure
  Packet Size       1024 bytes         64 bytes
  Packets/second    1.2 Million        14.88 Million
  Arrival rate      835 ns             67.2 ns
  2 GHz clock       1670 cycles        135 cycles
  3 GHz clock       2505 cycles        201 cycles

L3 hit on Intel® Xeon® ~40 cycles
L3 miss (memory read): 201 cycles at 3 GHz
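
To make the cycle budgets concrete, the 64-byte column follows directly from the 14.88 Mpps line rate and the clock frequency (a worked check, added here for clarity):

    \[
      t_{\mathrm{arrival}} = \frac{1}{14.88 \times 10^{6}\ \mathrm{pkt/s}} \approx 67.2\ \mathrm{ns},
      \qquad
      67.2\ \mathrm{ns} \times 3\ \mathrm{GHz} \approx 201\ \mathrm{cycles\ per\ packet}
    \]

At 3 GHz a single L3 miss (201 cycles) therefore consumes the entire per-packet budget.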
Traditional Linux networking
TCP Offload Engine

Good old sockets
    –   Flexible, portable but slow

Memory mapped buffers
    –   Efficient, but still constrained by architecture

Run in kernel
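
For contrast with the kernel-bypass designs that follow, a minimal sketch of the "good old sockets" receive path: one recvfrom() system call, and at least one copy, per datagram (the function and names here are illustrative, not from the slides):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int udp_rx_loop(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        char buf[2048];

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;

        for (;;) {
            /* each packet costs at least one syscall and one copy to user space */
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
            if (n < 0)
                break;
            /* ... process n bytes ... */
        }
        close(fd);
        return 0;
    }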
The OpenOnload architecture

      Network hardware provides a user-safe interface which
      can route Ethernet packets to an application context
      based on flow information contained within headers

      [Diagram: per-application user-level protocol stack and driver alongside
       the kernel stack, each with its own DMA path to the network adaptor;
       no new protocols]

The OpenOnload architecture

      Protocol processing can take place both in the
      application and kernel context for a given flow


      [Diagram: same stack split as before; enables persistent / asynchronous
       processing while maintaining the existing network control-plane]

The OpenOnload architecture

      Protocol state is shared between the kernel and
      application contexts through a protected shared
      memory communications channel

      [Diagram: kernel and application protocol state shared over a protected
       shared-memory channel; enables correct handling of protocol state with
       high performance]

Performance metrics

      Overhead
           – Networking overheads take CPU time away from your application

      Latency
           – Holds your application up when it has nothing else to do
           – H/W + flight time + overhead

      Bandwidth
           – Dominates latency when messages are large
           – Limited by: algorithms, buffering and overhead

      Scalability
           – Determines how overhead grows as you add cores, memory, threads, sockets
             etc.


Anatomy of kernel-based networking




A user-level architecture?




Direct & safe hardware access




Some performance results


      Test platform: typical commodity server
           – Intel Clovertown 2.3 GHz quad-core Xeon (x1),
             1.3 GHz FSB, 2 GB RAM
           – Intel 5000X chipset
           – Solarflare Solarstorm SFC4000 (B) controller, CX4
           – Back-to-back
           – Red Hat Enterprise Linux 5 (2.6.18-8.el5)




Performance: Latency and overhead

      TCP ping-pong with 4 byte payload
      70 byte frame: 14 (Ethernet) + 20 (IP) + 20 (TCP) + 12 (TCP options) + 4 (payload)

                        ½ round-trip latency    CPU overhead
                           (microseconds)       (microseconds)
  Hardware                      4.2                  --
  Kernel                       11.2                 7.0
  Onload                        5.3                 1.1

Performance: Streaming bandwidth




Performance: UDP transmit

      Message rate:
           – 4 byte UDP payload (46 byte frame)

                               Kernel      Onload
  1 sender                     473,000     2,030,000
  2 senders                    532,000     3,880,000

Performance: UDP receive




OpenOnload Open Source

      OpenOnload available as Open Source (GPLv2)
            – Please contact us if you’re interested

      Compatible with x86 (ia32, amd64/em64t)

      Currently supports SMC10GPCIe-XFP and SMC10GPCIe-10BT
      NICs
            – Could support other user-accessible network interfaces

      Very interested in user feedback
            – On the technology and project directions


Netmap
        http://info.iet.unipi.it/~luigi/netmap/
●
    BSD (and Linux port)
●
    Good scalability
●
    Libpcap emulation
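
Because of the libpcap emulation, an unmodified pcap application can capture through netmap by being linked (or preloaded) against the netmap-enabled libpcap. A minimal consumer using only standard libpcap calls; the interface name is a placeholder:

    #include <pcap/pcap.h>
    #include <stdio.h>

    static void handler(u_char *user, const struct pcap_pkthdr *h,
                        const u_char *bytes)
    {
        (void)user; (void)bytes;
        printf("packet: %u bytes\n", h->caplen);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 2048, 1, 100, errbuf);  /* "eth0": placeholder */

        if (p == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }
        return pcap_loop(p, -1, handler, NULL);
    }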
Netmap
Netmap API
●
    Access
    –   open("/dev/netmap")
    –   ioctl(fd, NIOCREG, arg)
    –   mmap(..., fd, 0) maps buffers and rings
●
    Transmit
    –   fill up to avail buffers, starting from slot cur.
    –   ioctl(fd,NIOCTXSYNC) queues the packets
●
    Receive
    –   ioctl(fd,NIOCRXSYNC) reports newly received packets
    –   process up to avail buffers, starting from slot cur.


                       These ioctl()s are non-blocking.
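
A sketch of the access and transmit sequence above, written against the netmap API of this era. The register ioctl is spelled NIOCREGIF in the headers (the slide abbreviates it NIOCREG); struct nmreq, NETMAP_IF, NETMAP_TXRING, NETMAP_BUF and the cur/avail ring fields come from <net/netmap_user.h> of that period and may differ in later netmap releases, so treat this as an approximation rather than a drop-in program:

    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/if.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    int tx_burst(const char *ifname, const void *pkt, uint16_t len, int npkts)
    {
        struct nmreq req;
        int fd = open("/dev/netmap", O_RDWR);            /* access */
        if (fd < 0)
            return -1;

        memset(&req, 0, sizeof(req));
        strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
        req.nr_version = NETMAP_API;
        if (ioctl(fd, NIOCREGIF, &req) < 0)              /* bind the fd to the NIC */
            return -1;

        /* mmap(..., fd, 0) maps buffers and rings */
        char *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
        struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);

        /* fill up to 'avail' buffers, starting from slot 'cur' */
        while (npkts-- > 0 && ring->avail > 0) {
            struct netmap_slot *slot = &ring->slot[ring->cur];
            memcpy(NETMAP_BUF(ring, slot->buf_idx), pkt, len);
            slot->len = len;
            ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
            ring->avail--;
        }
        return ioctl(fd, NIOCTXSYNC, NULL);              /* queue the packets */
    }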
Netmap API: synchronization
●   poll() and select(), what else!
    –   POLLIN and POLLOUT decide which sets of rings to
        work on
    –   work as expected, returning when avail>0
    –   interrupt mitigation delays are propagated up to
        the userspace process
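
A receive loop driven by poll(), as described above: the kernel wakes the process when avail > 0, after any interrupt-mitigation delay. Same era-specific API caveats as the transmit sketch:

    #include <poll.h>

    void rx_loop(int fd, struct netmap_ring *rxring)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        for (;;) {
            poll(&pfd, 1, -1);                    /* returns when avail > 0 */
            while (rxring->avail > 0) {
                struct netmap_slot *slot = &rxring->slot[rxring->cur];
                char *buf = NETMAP_BUF(rxring, slot->buf_idx);
                /* ... process slot->len bytes at buf ... */
                (void)buf;
                rxring->cur = NETMAP_RING_NEXT(rxring, rxring->cur);
                rxring->avail--;
            }
            /* consumed slots are handed back to the kernel on the
             * next poll() or NIOCRXSYNC */
        }
    }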
Netmap: multiqueue
●
    Of course.
    –   one netmap ring per physical ring
    –   by default, the fd is bound to all rings
    –   ioctl(fd, NIOCREG, arg) can restrict the binding
        to a single ring pair
    –   multiple fd's can be bound to different rings on the same
        card
    –   the fd's can be managed by different threads
    –   threads mapped to cores with pthread_setaffinity()
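
On Linux the affinity call is pthread_setaffinity_np(); a small helper for pinning one ring-handling thread per core, as suggested above:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    int pin_to_core(pthread_t thread, int core)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(thread, sizeof(set), &set);
    }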
Netmap and the host stack
●
    While in netmap mode, the control path remains unchanged:
    –   ifconfig, ioctl's, etc still work as usual
    –   the OS still believes the interface is there
●
    The data path is detached from the host stack:
    –   packets from NIC end up in RX netmap rings
    –   packets from TX netmap rings are sent to the NIC
●
    The host stack is attached to an extra pair of netmap rings:
    –   packets from the host go to a SW RX netmap ring
    –   packets from a SW TX netmap ring are sent to the host
    –   these rings are managed using the netmap API
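
A sketch of zero-copy forwarding between an RX ring and a TX ring by swapping buffer indices, the technique netmap's bridge example uses; it works the same whether the rings belong to the NIC or to the host stack's SW ring pair described above (same era-specific API caveats as earlier):

    static void forward(struct netmap_ring *rx, struct netmap_ring *tx)
    {
        while (rx->avail > 0 && tx->avail > 0) {
            struct netmap_slot *rs = &rx->slot[rx->cur];
            struct netmap_slot *ts = &tx->slot[tx->cur];
            uint32_t idx = ts->buf_idx;

            ts->buf_idx = rs->buf_idx;        /* hand the received buffer to TX */
            rs->buf_idx = idx;                /* recycle the old TX buffer for RX */
            ts->len = rs->len;
            ts->flags |= NS_BUF_CHANGED;      /* tell the kernel the buffers moved */
            rs->flags |= NS_BUF_CHANGED;

            rx->cur = NETMAP_RING_NEXT(rx, rx->cur);
            tx->cur = NETMAP_RING_NEXT(tx, tx->cur);
            rx->avail--;
            tx->avail--;
        }
    }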
Netmap: Tx performance
Netmap: Rx Performance
Netmap Summary
 Packet Forwarding       Mpps
 FreeBSD bridging        0.690
 Netmap + libpcap        7.500
 Netmap                  14.88

 Open vSwitch            Mpps
 userspace               0.065
 Linux                   0.600
 FreeBSD                 0.790
 FreeBSD + netmap/pcap   3.050
Intel DPDK Architecture
The Intel® DPDK Philosophy


  Intel® DPDK Fundamentals
  •   Implements a run-to-completion model or pipeline model
  •   No scheduler - all devices accessed by polling
  •   Supports 32-bit and 64-bit with/without NUMA
  •   Scales from Intel® Atom™ to Intel® Xeon® processors
  •   Number of cores and processors not limited
  •   Optimal packet allocation across DRAM channels

  [Diagram: Control Plane / Data Plane split]

  •   Must run on any IA CPU
      ‒   From the Intel® Atom™ processor to the latest Intel® Xeon® processor family
      ‒   Essential to the IA value proposition
  •   Focus on the fast path
      ‒   Sending large numbers of packets to the Linux Kernel / GPOS will bog the system down
  •   Provide software examples that address common network performance deficits
      ‒   Best practices for software architecture
      ‒   Tips for data structure design and storage
      ‒   Help the compiler generate optimum code
      ‒   Address the challenges of achieving 80 Mpps per CPU socket
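
As an illustration of the "no scheduler, poll everything" model, a stripped-down run-to-completion loop. It uses present-day DPDK names (rte_eal_init, rte_eth_rx_burst, rte_pktmbuf_free), which postdate the release these slides describe, and omits port and queue setup; read it as a sketch of the model, not the kit's own example code:

    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)       /* hugepages, PCI probe, lcores */
            return -1;

        /* rte_eth_dev_configure(), rte_eth_rx_queue_setup() and
         * rte_eth_dev_start() for port 0 omitted for brevity */

        for (;;) {                              /* run to completion: poll, never block */
            struct rte_mbuf *pkts[BURST];
            uint16_t i, n = rte_eth_rx_burst(0 /* port */, 0 /* queue */, pkts, BURST);

            for (i = 0; i < n; i++) {
                /* ... classify / rewrite / forward the packet here ... */
                rte_pktmbuf_free(pkts[i]);
            }
        }
    }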
Intel® Data Plane Development Kit (Intel® DPDK)
Intel® DPDK embeds optimizations for the IA platform:
 -   Data Plane Libraries and Optimized NIC Drivers in Linux User Space
 -   Run-time Environment
 -   Environment Abstraction Layer and Boot Code
 -   BSD-licensed & source downloadable from Intel and leading ecopartners

      [Diagram: Intel® DPDK libraries (Buffer Management, Queue/Ring Functions,
       Packet Flow Classification, NIC Poll Mode Library) on an Environment
       Abstraction Layer in user space, above the Linux kernel and platform
       hardware, with customer applications on top]

Intel® DPDK Libraries and Drivers

     • Memory Manager: Responsible for allocating pools of objects in memory. A pool is
       created in huge-page memory and uses a ring to store free objects. It also provides
       an alignment helper to ensure that objects are padded so that they spread equally
       across all DRAM channels.
     • Buffer Manager: Significantly reduces the time the operating system spends
       allocating and de-allocating buffers. The Intel® DPDK pre-allocates fixed-size
       buffers which are stored in memory pools.
     • Queue Manager: Implements safe lockless queues, instead of spinlocks, that allow
       different software components to process packets while avoiding unnecessary wait
       times.
     • Flow Classification: Provides an efficient mechanism which uses Intel® Streaming
       SIMD Extensions (Intel® SSE) to produce a hash from tuple information, so that
       packets can be placed into flows quickly for processing, greatly improving
       throughput.
     • Poll Mode Drivers: The Intel® DPDK includes Poll Mode Drivers for 1 GbE and 10 GbE
       Ethernet controllers which are designed to work without asynchronous, interrupt-
       based signaling mechanisms, which greatly speeds up the packet pipeline.
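
To show how the Buffer Manager and Queue Manager pieces fit together, a small sketch: a pre-allocated mbuf pool plus a lockless single-producer/single-consumer ring passing packets between two pipeline stages. Names follow current DPDK (rte_pktmbuf_pool_create, rte_ring_create and friends) and may differ from the release contemporary with these slides:

    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>
    #include <rte_ring.h>

    static struct rte_mempool *pool;   /* fixed-size buffers, pre-allocated in hugepages */
    static struct rte_ring *ring;      /* lockless queue between two pipeline stages */

    /* call once after rte_eal_init() */
    static int setup(void)
    {
        pool = rte_pktmbuf_pool_create("mbufs", 8191, 256, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        ring = rte_ring_create("stage1_to_stage2", 1024, rte_socket_id(),
                               RING_F_SP_ENQ | RING_F_SC_DEQ);
        return (pool != NULL && ring != NULL) ? 0 : -1;
    }

    /* producer stage: grab a buffer from the pool; no malloc on the fast path */
    static int produce(void)
    {
        struct rte_mbuf *m = rte_pktmbuf_alloc(pool);

        if (m == NULL)
            return -1;
        /* ... build the packet in the mbuf's data area ... */
        return rte_ring_enqueue(ring, m);        /* lockless enqueue */
    }

    /* consumer stage: pull a buffer off the ring and return it to the pool */
    static int consume(void)
    {
        void *obj;

        if (rte_ring_dequeue(ring, &obj) != 0)
            return -1;                           /* ring empty */
        /* ... process the packet ... */
        rte_pktmbuf_free((struct rte_mbuf *)obj);
        return 0;
    }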




Intel® DPDK Native and Virtualized Forwarding Performance




Comparison
              Netmap            DPDK            OpenOnload

 License      BSD               BSD             GPL
 API          Packet + pcap     Packet + lib    Sockets
 Kernel       Yes               Yes             Yes
 HW support   Intel, Realtek    Intel           Solarflare
 OS           FreeBSD, Linux    Linux           Linux
Issues
●
    Out-of-tree kernel code
    –   Non-standard drivers
●
    Resource sharing
    –   CPU
    –   NIC
●
    Security
    –   No firewall
    –   DMA isolation
What's needed?
●
    Netmap
    –   Linux version (not port)
    –   Higher level protocols?
●
    DPDK
    –   Wider device support
    –   Ask Intel
●
    OpenOnload
    –   Ask Solarflare
References
●
    OpenOnload
    –   A user-level network stack (Google tech talk)
        ●
            Steve Pope
        ●
            David Riddoch
●
    Netmap - Luigi Rizzo
    –   http://info.iet.unipi.it/~luigi/netmap/talk-atc12.html
●
    DPDK
    –   Intel DPDK Overview
    –   Disruptive IP networking
        ●
            Naoto MATSUMOTO
Thank you
