Understanding DPDK

  1. Understanding DPDK: a description of the techniques used to achieve high throughput on commodity hardware
  2. How fast does software have to work? A 10G interface carries 14.88 million 64-byte packets per second. At 1.8 GHz, 1 cycle = 0.55 ns; 1 packet -> 67.2 ns = about 120 clock cycles. Each 64-byte frame occupies 84 bytes on the wire: IFG (12) + preamble (8) + DST MAC, SRC MAC, type, and payload (60) + CRC (4).
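The arithmetic on slide 2 can be reproduced with a few lines of C. This is a minimal sketch rather than DPDK code: the function names and the `WIRE_OVERHEAD_BYTES` constant are introduced here for illustration, and the frame size is assumed to include the 4-byte CRC.

```c
/* Bytes every Ethernet frame costs on the wire in addition to the frame
 * itself: 8-byte preamble plus 12-byte inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20u

/* Maximum packets per second for a link speed in bits/s and a frame size
 * in bytes (CRC included, e.g. 64 for the smallest frame). */
double max_pps(double link_bps, unsigned frame_bytes)
{
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0);
}

/* Time budget per packet, in nanoseconds. */
double ns_per_packet(double link_bps, unsigned frame_bytes)
{
    return 1e9 / max_pps(link_bps, frame_bytes);
}

/* CPU cycles available per packet at a core frequency given in Hz. */
double cycles_per_packet(double link_bps, unsigned frame_bytes, double cpu_hz)
{
    return cpu_hz / max_pps(link_bps, frame_bytes);
}
```

With `link_bps = 10e9`, `frame_bytes = 64`, and `cpu_hz = 1.8e9`, these return roughly 14.88 million packets per second, 67.2 ns, and about 121 cycles, which the slide rounds to 120.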
  3. Comparative speed values: CPU to memory speed = 6-8 GBytes/s; PCI-Express x16 speed = 5 GBytes/s; access to RAM = 200 ns; access to L3 cache = 4 ns; context switch ~= 1000 ns (3.2 GHz)
  4. Packet processing in Linux (diagram): App in user space; socket, ring buffers, and driver in kernel space; RX/TX queues on the NIC
  5. Linux kernel overhead: system calls; context switching on blocking I/O; data copying from kernel to user space; interrupt handling in kernel
  6. Expense of sendto
       Function       Activity                              Time (ns)
       sendto         system call                           96
       sosend_dgram   lock sock_buff, alloc mbuf, copy in   137
       udp_output     UDP header setup                      57
       ip_output      route lookup, IP header setup         198
       ether_output   MAC lookup, MAC header setup          162
       ixgbe_xmit     device programming                    220
       Total                                                950
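  7. Packet processing with DPDK (diagram): App, DPDK, and ring buffers in user space; UIO driver in kernel space; RX/TX queues on the NIC
  8. Updating a register in Linux (diagram): user space calls ioctl(); in kernel space the request passes through syscall, VFS, copy_from_user(), and iowrite() before it reaches the hardware register
  9. Updating a register with DPDK (diagram): user space assigns directly to the hardware register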
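The direct assignment on slide 9 amounts to a store through a `volatile` pointer into a device region that has been mapped into the process. The sketch below is an assumption-laden illustration, not DPDK's own accessor: `reg_write32` and `reg_read32` are hypothetical names, `base` is assumed to point at an already mapped BAR, and the memory barriers a real driver places around such accesses are left out.

```c
#include <stdint.h>

/* Store a 32-bit value into the register `offset` bytes past `base`.
 * `volatile` keeps the compiler from caching or eliding the access. */
void reg_write32(volatile void *base, uint32_t offset, uint32_t value)
{
    volatile uint32_t *reg =
        (volatile uint32_t *)((volatile uint8_t *)base + offset);
    *reg = value;
}

/* Load a 32-bit value from the register `offset` bytes past `base`. */
uint32_t reg_read32(volatile void *base, uint32_t offset)
{
    volatile uint32_t *reg =
        (volatile uint32_t *)((volatile uint8_t *)base + offset);
    return *reg;
}
```

Compared with the ioctl() path on slide 8, there is no system call and no copy: the cost is one store instruction plus whatever ordering the device requires.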
  10. What is used inside DPDK? Processor affinity (separate cores); huge pages (no swap, TLB); UIO (no copying from kernel); polling (no interrupt overhead); lockless synchronization (avoid waiting); batch packet handling; SSE, NUMA awareness
  11. Linux default scheduling (diagram): Core 0, Core 1, Core 2, Core 3; threads t1, t2, t3, t4
  12. How to isolate a core for a process. To diagnose, run “top”, press “f”, then press “j”. Before boot, use isolcpus: “isolcpus=2,4,6”. After boot, use cpuset: “cset shield -c 1-3”, “cset shield -k on”
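  13. Run-to-completion model (diagram): Core 1 and Core 2 each run an RX/TX thread; ports: Port 1, Port 2
  14. Pipeline model (diagram): an RX thread and a TX thread on separate cores (Core 1, Core 2), connected by a ring; ports: Port 1, Port 2
  15. Page tables tree, Linux paging model (diagram): cr3 -> Page Global Directory -> Page Middle Directory -> Page Table -> Page
  16. TLB (diagram): a virtual address is split into virtual page and offset; the TLB, backed by the page table in RAM, translates the virtual page to a physical page, and the offset is carried over unchanged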
  17. TLB characteristics ($ cpuid | grep -i tlb): size: 12–4,096 entries; hit time: 0.5–1 clock cycle; miss penalty: 10–100 clock cycles; miss rate: 0.01–1%. It is a very expensive resource!
  18. Solution: hugepages. Benefit: optimized TLB usage, no swap. Hugepage size = 2M. Usage: mount hugetlbfs at /mnt/huge, then mmap. Library: libhugetlbfs
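Why hugepages relieve the TLB follows from simple arithmetic: the memory a TLB can cover without a miss is its entry count times the page size. The helpers below are a sketch with hypothetical names, and the 64-entry TLB in the usage note is an assumed figure picked from the range on slide 17, not a measurement.

```c
#include <stdint.h>

/* Memory covered by a TLB holding `entries` translations of `page_bytes`
 * each ("TLB reach"). */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}

/* Number of pages, and therefore translations, needed to map a buffer. */
uint64_t pages_needed(uint64_t buffer_bytes, uint64_t page_bytes)
{
    return (buffer_bytes + page_bytes - 1) / page_bytes;
}
```

For an assumed 64-entry TLB, 4 KiB pages give a reach of 256 KiB, while 2 MiB hugepages give 128 MiB. Mapping a 1 GiB packet pool takes 262,144 translations with 4 KiB pages but only 512 with 2 MiB pages.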
  19. Lockless ring design: a writer can preempt a writer and a reader; a reader cannot preempt a writer; reader and writer can work simultaneously on different cores; barrier; CAS operation; bulk enqueue/dequeue
  20. Lockless ring (Single Producer): three-step diagram of an enqueue, tracking cons_head, cons_tail, prod_head, prod_tail, and prod_next
  21. Lockless ring (Single Consumer): three-step diagram of a dequeue, tracking cons_head, cons_tail, prod_head, prod_tail, and cons_next
  22. Lockless ring (Multiple Producers): five-step diagram of two producers enqueuing concurrently, tracking cons_head, cons_tail, prod_head, prod_tail, prod_next1, and prod_next2
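The head/tail scheme from the ring slides can be sketched with C11 atomics. This is a simplified illustration and not DPDK's `rte_ring` code: the `struct ring` layout, the function names, and the fixed `RING_SIZE` are assumptions made here, only single-element operations are shown, and the default sequentially consistent atomics are used where a tuned implementation would pick weaker orderings.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

struct ring {
    _Atomic uint32_t prod_head, prod_tail;   /* producer side */
    _Atomic uint32_t cons_head, cons_tail;   /* consumer side */
    void *slots[RING_SIZE];
};

void ring_init(struct ring *r)
{
    atomic_init(&r->prod_head, 0);
    atomic_init(&r->prod_tail, 0);
    atomic_init(&r->cons_head, 0);
    atomic_init(&r->cons_tail, 0);
}

/* Enqueue one pointer; returns 1 on success, 0 if the ring is full. */
int ring_enqueue(struct ring *r, void *obj)
{
    uint32_t head = atomic_load(&r->prod_head);
    uint32_t next;

    /* Step 1: reserve a slot by advancing prod_head with CAS.
     * On failure `head` is refreshed and the check is repeated. */
    do {
        uint32_t free_slots = RING_SIZE + atomic_load(&r->cons_tail) - head;
        if (free_slots == 0)
            return 0;
        next = head + 1;
    } while (!atomic_compare_exchange_weak(&r->prod_head, &head, next));

    /* Step 2: fill the reserved slot. */
    r->slots[head & RING_MASK] = obj;

    /* Step 3: wait for earlier producers to publish, then move
     * prod_tail so consumers can see the new element. */
    while (atomic_load(&r->prod_tail) != head)
        ;
    atomic_store(&r->prod_tail, next);
    return 1;
}

/* Dequeue one pointer; returns 1 on success, 0 if the ring is empty. */
int ring_dequeue(struct ring *r, void **obj)
{
    uint32_t head = atomic_load(&r->cons_head);
    uint32_t next;

    do {
        uint32_t used = atomic_load(&r->prod_tail) - head;
        if (used == 0)
            return 0;
        next = head + 1;
    } while (!atomic_compare_exchange_weak(&r->cons_head, &head, next));

    *obj = r->slots[head & RING_MASK];

    while (atomic_load(&r->cons_tail) != head)
        ;
    atomic_store(&r->cons_tail, next);
    return 1;
}
```

The indices run freely and wrap as unsigned 32-bit values; masking happens only when a slot is addressed. The spin in step 3 is where a preempted producer can delay the producers queued behind it.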
  23. Kernel space network driver (diagram): App in user space; IP stack and driver in kernel space; NIC; labels: Data, Desc, Config, Interrupts
  24. UIO “The most important devices can’t be handled in user space, including, but not limited to, network interfaces and block devices.” - LDD3
  25. UIO (diagram): App with a user-space (US) driver in user space; UIO framework and driver in kernel space; interface between them: sysfs and /dev/uioX, used through epoll() and mmap()
  26. Access to device from user space (diagram): NIC configuration registers (Vendor Id, Device Id, Command, Status, Revision Id, ...); I/O and memory regions BAR0 (Mem), BAR1, BAR2 (IO), BAR3, BAR4, BAR5; user-space access paths: /sys/class/uio/uioX/maps/mapX, /sys/class/uio/uioX/portio/portX, /dev/uioX -> mmap (offset), /sys/bus/pci/devices
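The `/dev/uioX -> mmap (offset)` path can be sketched as follows. The UIO convention is that memory mapping N is selected by passing N times the page size as the mmap offset. The helper name `uio_map` and its parameters are hypothetical, and the caller is assumed to have read the region size from `/sys/class/uio/uioX/maps/mapX/size` beforehand.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map memory region `map_index` of a UIO device into this process.
 * Returns the mapped address, or NULL on any failure. */
void *uio_map(const char *dev_path, unsigned map_index, size_t size)
{
    int fd = open(dev_path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    /* UIO selects the mapping by offset: N * page size picks mapping N. */
    long page = sysconf(_SC_PAGESIZE);
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fd, (off_t)map_index * page);

    close(fd);   /* the mapping stays valid after the descriptor is closed */
    return addr == MAP_FAILED ? NULL : addr;
}
```

A call such as `uio_map("/dev/uio0", 0, size)` would return the base of the first memory region, assuming that device exists and the process has permission to open it.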
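  27. DMA RX (diagram): host memory holds the descriptor ring and packet memory; NIC memory holds the RX FIFO, the RX queue, and DMA descriptors; labeled steps: update RDT, DMA descriptor(s), DMA packet
  28. DMA TX (diagram): host memory holds the descriptor ring and packet memory; NIC memory holds the TX FIFO, the TX queue, and DMA descriptors; labeled steps: update TDT, DMA descriptor(s), DMA packet
  29. Receive from SW side (diagram): descriptor ring with RDBA = 0, RDLEN = 6, RDH = 1, RDT = 5; descriptors carry a DD bit and the addresses of mbuf1 and mbuf2; RDH and RDT are shown before and after they advance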
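The receive side can be modeled in plain C to show the polling pattern: check the DD (descriptor done) bit, take the filled buffer, re-arm the slot, and advance the tail once per burst. Everything here is a simplified stand-in: `struct rx_desc` is not the layout of any real NIC, `rdt` is an ordinary variable instead of a device register, and `rx_burst` is a hypothetical name.

```c
#include <stdint.h>

#define RING_LEN  6u      /* RDLEN from the slide, counted in descriptors */
#define STATUS_DD 0x1u    /* "descriptor done": the NIC has filled it */

struct rx_desc {
    uint64_t buf_addr;    /* address of the packet buffer (mbuf data) */
    uint16_t length;      /* bytes written by the NIC */
    uint8_t  status;      /* holds STATUS_DD once the packet is in memory */
};

struct rx_queue {
    struct rx_desc ring[RING_LEN];
    uint32_t next;        /* next descriptor software will inspect */
    uint32_t rdt;         /* stand-in for the NIC's RDT register */
};

/* Poll up to `max` completed descriptors. Filled buffer addresses go to
 * `done`; each slot is re-armed with an empty buffer from `fresh`.
 * Returns the number of packets received. */
unsigned rx_burst(struct rx_queue *q, uint64_t *done,
                  const uint64_t *fresh, unsigned max)
{
    unsigned n = 0;

    while (n < max) {
        struct rx_desc *d = &q->ring[q->next];
        if (!(d->status & STATUS_DD))
            break;                     /* the NIC has not filled this one */
        done[n] = d->buf_addr;         /* hand the packet to the caller */
        d->buf_addr = fresh[n];        /* re-arm with an empty buffer */
        d->status = 0;                 /* clear DD for the next pass */
        q->next = (q->next + 1) % RING_LEN;
        n++;
    }

    /* One tail update per burst: point RDT at the last re-armed slot. */
    if (n > 0)
        q->rdt = (q->next + RING_LEN - 1) % RING_LEN;
    return n;
}
```

Writing the tail once per burst instead of once per packet is one form of the batch handling listed on slide 10. On real hardware that write is a register access over PCIe and needs a barrier so the descriptor updates become visible first.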
  30. Transmit from SW side (diagram): descriptor ring with TDBA = 0, TDLEN = 6, TDH = 1, TDT = 5; descriptors carry a DD bit and the addresses of mbuf1 and mbuf2; TDH and TDT are shown before and after they advance
  31. NUMA (diagram): Socket 0 (CPU 0) and Socket 1 (CPU 1) linked by QPI; each socket has cores, a memory controller with its memory, and an I/O controller with PCI-E links
  32. RSS (Receive Side Scaling) (diagram): incoming traffic -> hash function -> indirection table -> Queue 0 ... Queue N, each served by a CPU
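The hash-then-indirection step of RSS can be sketched as below. RSS implementations commonly use the Toeplitz hash, which is what this sketch assumes; the slide does not name the hash function. The function names and the 128-entry table size (`RETA_SIZE`) are also assumptions. The key must be at least `len + 4` bytes long.

```c
#include <stddef.h>
#include <stdint.h>

#define RETA_SIZE 128u    /* assumed indirection table size, power of two */

/* Toeplitz hash: for every set bit of the input, taken MSB first, XOR in
 * the 32-bit window of the key that starts at that bit position. */
uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t len)
{
    uint32_t hash = 0;
    /* Window over the first 32 key bits. */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];

    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window one bit left and pull in the next key bit. */
            window = (window << 1) | ((uint32_t)(key[i + 4] >> bit) & 1u);
        }
    }
    return hash;
}

/* Pick the RX queue: the low bits of the hash index the indirection table. */
unsigned rss_queue(uint32_t hash, const uint8_t *reta)
{
    return reta[hash & (RETA_SIZE - 1u)];
}
```

The input is typically the packet's source and destination addresses and ports, so every packet of one flow lands on the same queue and therefore the same core. Rewriting entries of the indirection table rebalances flows across queues without changing the hash.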
  33. Flow director (diagram): incoming traffic -> hash function -> filter table -> Queue 0 ... Queue N, each served by a CPU; other labels: outgoing traffic, drop, route
  34. Virtualization - SR-IOV (diagram): VM1 and VM2 each run a VF driver; the VMM runs the PF driver; the NIC exposes the PF and the VFs, joined by a virtual bridge
  35. Slow path using bifurcated driver (diagram): kernel and DPDK on top of one NIC; NIC labels: PF, VF, virtual bridge, filter table
  36. Slow path using TAP (diagram): App, DPDK, and ring buffers in user space; TAP device and TCP/IP stack in kernel space; RX/TX queues on the NIC
  37. Slow path using KNI (diagram): App, DPDK, and ring buffers in user space; KNI device and TCP/IP stack in kernel space; RX/TX queues on the NIC
  38. Application 1 - Traffic generator (diagram): x86 HW running a streams generator and a traffic analyzer in user space, connected to the DUT
  39. Application 2 - Router (diagram): x86 HW between DUT1 and DUT2; labels: Kernel, User space, Routing table, Routing table cache
  40. Application 3 - Middlebox (diagram): x86 HW running DPI in user space, between DUT1 and DUT2
  41. References: Device Drivers in User Space; Userspace I/O drivers in a realtime context; The Userspace I/O HOWTO; The anatomy of a PCI/PCI Express kernel driver; From Intel® Data Plane Development Kit to Wind River Network Acceleration Platform; DPDK Design Tips (Part 1 - RSS); Getting the Best of Both Worlds with Queue Splitting (Bifurcated Driver); Design considerations for efficient network applications with Intel® multi-core processor-based systems on Linux; Introduction to Intel Ethernet Flow Director
  42. My blog: Learning Network Programming