DPDK performance

FOSDEM15 SDN developer room talk

DPDK performance
How to not just do a demo with DPDK

The Intel DPDK provides a platform for building high-performance Network Function Virtualization applications, but it is hard to get high performance unless certain design tradeoffs are made. This talk focuses on the lessons learned in creating the Brocade vRouter using DPDK. It covers the architecture, locking, and low-level issues that all have to be dealt with to achieve 80 million packets per second of forwarding.

  1. DPDK performance
     How to not just do a demo with the DPDK
     Stephen Hemminger
     stephen@networkplumber.org
     @networkplumber
  2. Agenda
     ● DPDK
       – Background
       – Examples
     ● Performance
       – Lessons learned
     ● Open Issues
  3. Packet time vs size
     [chart: time (ns) versus packet size (bytes) for server packets and
      network infrastructure packets]
  4. Demo Applications
     ● Examples
       – L2fwd → bridge/switch
       – L3fwd → router
       – L3fwd-acl → firewall
       – Load_balancer
       – qos_sched → Quality of Service
  5. L3fwd
     [diagram: multiple receive queues feed a route lookup stage, which feeds
      multiple transmit queues]
  6. Forwarding thread
     [diagram: loop of Read Burst → Process Burst]
     Statistics: received packets, transmitted packets, iterations,
     packets processed
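
      To make the read-burst/process-burst loop on this slide concrete, here is a
      minimal sketch of such a forwarding thread using the standard DPDK burst
      calls (rte_eth_rx_burst/rte_eth_tx_burst). The port and queue numbers, the
      fwd_stats struct, and the drop-on-full-TX policy are illustrative
      assumptions, not the vRouter's code.

          #include <rte_ethdev.h>
          #include <rte_mbuf.h>

          #define BURST_SIZE 32

          /* Illustrative per-thread counters matching the slide's statistics. */
          struct fwd_stats {
                  uint64_t rx_packets;
                  uint64_t tx_packets;
                  uint64_t iterations;
          };

          static void forwarding_loop(uint16_t rx_port, uint16_t tx_port,
                                      struct fwd_stats *stats)
          {
                  struct rte_mbuf *burst[BURST_SIZE];

                  for (;;) {
                          /* Read a burst of packets from the receive queue. */
                          uint16_t n = rte_eth_rx_burst(rx_port, 0, burst, BURST_SIZE);

                          stats->iterations++;
                          stats->rx_packets += n;
                          if (n == 0)
                                  continue;

                          /* Process the burst here (lookup, filtering, ...),
                           * then hand it to the transmit queue. */
                          uint16_t sent = rte_eth_tx_burst(tx_port, 0, burst, n);
                          stats->tx_packets += sent;

                          /* Drop whatever the TX queue could not take. */
                          while (sent < n)
                                  rte_pktmbuf_free(burst[sent++]);
                  }
          }

      Working in bursts amortizes the per-packet driver overhead across up to 32
      packets at a time, which is why every loop above reads and writes bursts
      rather than single packets.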
  7. Time Budget
     ● Packet
       – 67.2 ns = 201 cycles @ 3 GHz
     ● Cache
       – L3 = 8 ns
       – L2 = 4.3 ns
     ● Atomic operations
       – Lock = 8.25 ns
       – Lock/Unlock = 16.1 ns
     (source: "Network stack challenges at increasing speeds", LCA 2015,
      Jesper Dangaard Brouer)
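
      The 67.2 ns budget above is the per-packet time at 10 GbE line rate with
      minimum-size (64-byte) frames. The short program below is not part of the
      deck; it just reproduces that arithmetic.

          #include <stdio.h>

          int main(void)
          {
              /* Minimum-size Ethernet frame on the wire: 64 bytes of frame plus
               * 8 bytes of preamble and 12 bytes of inter-frame gap = 84 bytes. */
              double pps = 10e9 / (84 * 8);          /* ~14.88 million packets/s */
              double ns_per_pkt = 1e9 / pps;         /* ~67.2 ns per packet      */
              double cycles = ns_per_pkt * 3.0;      /* ~201 cycles at 3 GHz     */

              printf("%.2f Mpps, %.1f ns/packet, %.1f cycles @ 3 GHz\n",
                     pps / 1e6, ns_per_pkt, cycles);
              return 0;
          }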
  8. Architecture choices
     ● Legacy
       – Existing proprietary code
     ● BSD clone
     ● DIY
       – Build forwarding engine from scratch
  9. Test fixture
     [diagram: FreeBSD (netmap) host, router, and Linux desktop connected by a
      10G fibre link (192.18.1.0/27) and a 10G Cat6 link (192.18.0.0/27), plus
      a separate management network]
  10. Dataplane CPU activity (internal instrumentation)
      Core  Interface  RX Rate  TX Rate  Idle
      ----------------------------------------
      1     p1p1       14.9M    0
      2     p1p1       0        250
      3     p33p1      0        250
      4     p33p1      1        250
      5     p1p1       0        250
      6     p33p1      11.9M    1
  11. Linux perf
      ● perf tool is part of the kernel source
      ● Can look at the kernel and the application
        – Drill down if it has symbols
  12. Perf – active thread
      Samples: 16K of event 'cycles', Event count (approx.): 11763536471
        14.93%  dataplane  [.] ip_input
        10.04%  dataplane  [.] ixgbe_xmit_pkts
         7.69%  dataplane  [.] ixgbe_recv_pkts
         7.05%  dataplane  [.] T.240
         6.82%  dataplane  [.] fw_action_in
         6.61%  dataplane  [.] fifo_enqueue
         6.44%  dataplane  [.] flow_action_fw
         6.35%  dataplane  [.] fw_action_out
         3.92%  dataplane  [.] ip_hash
         3.69%  dataplane  [.] cds_lfht_lookup
         2.45%  dataplane  [.] send_packet
         2.45%  dataplane  [.] bit_reverse_ulong
  13. Original model
      [diagram: RX queues Rx0.0, Rx0.1, Rx1.0, Rx1.1 assigned to cores 1-4;
       each core owns a TX queue on every port (Tx0.0-Tx0.3, Tx1.0-Tx1.3)]
  14. Split thread model
      [diagram: RX queues Rx0.0, Rx0.1, Rx1.0, Rx1.1 handled by cores 1-4;
       transmit moved to dedicated TX cores 5 and 6]
  15. Speed killers
      ● I/O
      ● VM exits
      ● System calls
      ● PCI access
      ● HPET
      ● TSC
      ● Floating point
      ● Cache misses
      ● CPU pipeline stalls
  16. TSC counter
      while (1)
          cur_tsc = rte_rdtsc();
          diff_tsc = cur_tsc - prev_tsc;
          if (unlikely(diff_tsc > drain_tsc)) {
              for (portid = 0; portid < RTE_MAX_ETHPORTS; portid++) {
                  send_burst(qconf, qconf->tx_mbufs[portid].len, portid);
      CPU stall
      Heisenberg: observing performance slows it down
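
      One common way to reduce the cost this slide highlights, reading the TSC on
      every loop iteration, is to amortize the check over many iterations. The
      sketch below illustrates that pattern; prev_tsc, drain_tsc, and the drain
      action are assumed to exist as in the l2fwd-style loop above, and the
      64-iteration mask is an arbitrary choice.

          #include <rte_cycles.h>

          /* Sketch: only consult the TSC every 64th iteration, so the cost of
           * reading it is paid occasionally instead of on every packet burst. */
          #define TSC_CHECK_MASK 0x3f

          static uint64_t prev_tsc;      /* last time the TX buffers were drained */
          static uint64_t drain_tsc;     /* drain interval, in TSC cycles         */

          static inline void maybe_drain(uint32_t iteration)
          {
                  if ((iteration & TSC_CHECK_MASK) != 0)
                          return;

                  uint64_t cur_tsc = rte_rdtsc();
                  if (cur_tsc - prev_tsc > drain_tsc) {
                          /* drain partially filled TX buffers here */
                          prev_tsc = cur_tsc;
                  }
          }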
  17. fw_action_in
            │     struct ip_fw_args fw_args = { .m = m,
            │                                   .client = client,
            │                                   .oif = NULL };
       1.54 │1d:  movzbl %sil,%esi
       0.34 │     mov    %rsp,%rdi
       0.04 │     mov    $0x13,%ecx
       0.16 │     xor    %eax,%eax
      57.66 │     rep stos %rax,%es:(%rdi)
       4.68 │     mov    %esi,0x90(%rsp)
      20.45 │     mov    %r9,(%rsp)
      (perf annotate: the rep stos that zero-fills the rest of the on-stack
       struct initializer accounts for 57.66% of the samples in this function)
  18. Why is QoS slow?
      static inline void
      rte_sched_port_time_resync(struct rte_sched_port *port)
      {
          uint64_t cycles = rte_get_tsc_cycles();
          uint64_t cycles_diff = cycles - port->time_cpu_cycles;
          double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;

          /* Advance port time */
          port->time_cpu_cycles = cycles;
          port->time_cpu_bytes += (uint64_t) bytes_diff;
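
      The expensive parts of this routine are the per-call TSC read and, above
      all, the double-precision divide in the forwarding path (floating point is
      on the speed-killers list of slide 15). A frequent mitigation, sketched
      below with illustrative names rather than the actual DPDK or vRouter fix,
      is to precompute a fixed-point reciprocal once at configuration time and
      keep the hot path integer-only.

          #include <stdint.h>
          #include <rte_cycles.h>

          /* Illustrative fields; the real struct rte_sched_port differs. */
          struct sched_time {
                  uint64_t time_cpu_cycles;
                  uint64_t time_cpu_bytes;
                  uint64_t bytes_per_cycle_q32;  /* bytes per TSC cycle, Q32 fixed point */
          };

          /* One-time setup; uses the GCC/Clang 128-bit integer extension. */
          static void sched_time_init(struct sched_time *st, uint64_t bytes_per_sec)
          {
                  st->time_cpu_cycles = rte_rdtsc();
                  st->time_cpu_bytes = 0;
                  st->bytes_per_cycle_q32 = (uint64_t)
                          (((unsigned __int128)bytes_per_sec << 32) / rte_get_tsc_hz());
          }

          /* Hot path: integer multiply and shift, no floating point. */
          static inline void sched_time_resync(struct sched_time *st)
          {
                  uint64_t cycles = rte_rdtsc();
                  uint64_t diff = cycles - st->time_cpu_cycles;

                  st->time_cpu_cycles = cycles;
                  st->time_cpu_bytes += (uint64_t)
                          (((unsigned __int128)diff * st->bytes_per_cycle_q32) >> 32);
          }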
  19. The value of idle
      ● All CPUs are not equal
        – Hyperthreading
        – Cache sharing
        – Thermal
      If one core is idle, the others can go faster
  20. Idle sleep
      ● 100% poll → 100% CPU
        – CPU power limits
        – No Turbo Boost
        – PCI bus overhead
      ● Small sleeps
        – 0-250 µs
        – Based on activity
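
      A sketch of the kind of adaptive idle sleep described here: when a poll
      returns no packets, sleep for a short, growing interval capped at 250 µs
      (matching the slide), and drop back to pure polling as soon as traffic
      reappears. The specific backoff policy and names are assumptions for
      illustration.

          #include <unistd.h>
          #include <stdint.h>

          #define IDLE_SLEEP_MAX_US 250          /* cap from the slide */

          static unsigned int idle_sleep_us;     /* current sleep interval */

          /* Call after each poll with the number of packets it returned. */
          static void idle_sleep_update(uint16_t nb_rx)
          {
                  if (nb_rx > 0) {
                          /* Traffic: go back to pure polling. */
                          idle_sleep_us = 0;
                          return;
                  }

                  /* No traffic: back off gradually, then sleep. */
                  if (idle_sleep_us < IDLE_SLEEP_MAX_US)
                          idle_sleep_us += 10;
                  usleep(idle_sleep_us);
          }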
  21. Memory Layout
      ● Cache killers
        – Linked lists
        – Poor memory layout
        – Global statistics
        – Atomic operations
      ● Doesn't help
        – Prefetching
        – Inlining
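
      Global statistics are a cache killer because every core updates the same
      cache line; the usual cure is per-lcore, cache-line-aligned counters that
      are only summed when somebody actually reads them. A minimal sketch, with
      illustrative struct and field names:

          #include <rte_lcore.h>
          #include <rte_memory.h>

          /* One counter block per lcore, each on its own cache line, so cores
           * never contend for the same line when updating statistics. */
          struct if_stats {
                  uint64_t rx_packets;
                  uint64_t tx_packets;
          } __rte_cache_aligned;

          static struct if_stats stats[RTE_MAX_LCORE];

          static inline void count_rx(uint16_t n)
          {
                  /* Plain add, no atomics: only this lcore writes its slot. */
                  stats[rte_lcore_id()].rx_packets += n;
          }

          /* Slow path (e.g. a CLI "show stats"): sum across all cores. */
          static uint64_t total_rx(void)
          {
                  uint64_t sum = 0;
                  for (unsigned i = 0; i < RTE_MAX_LCORE; i++)
                          sum += stats[i].rx_packets;
                  return sum;
          }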
  22. Mutual Exclusion
      ● Locking
        – Reader/writer locks are expensive
        – A read lock has more overhead than a spin lock
      ● Userspace RCU
        – Don't modify in place; create and destroy
        – Impacts the thread model
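
      The perf output on slide 12 already shows liburcu's lock-free hash table
      (cds_lfht_lookup), so the read side of a lookup looks roughly like the
      sketch below. The flow struct and hashing are illustrative, not the
      vRouter's data structures; updates would insert a new entry and free the
      replaced one only after a grace period, which is what shapes the thread
      model.

          #include <stdint.h>
          #include <urcu.h>               /* rcu_read_lock/unlock           */
          #include <urcu/rculfhash.h>     /* cds_lfht, lock-free hash table */
          #include <urcu/compiler.h>      /* caa_container_of               */

          /* Illustrative flow entry stored in the RCU hash table. */
          struct flow {
                  struct cds_lfht_node node;
                  uint32_t daddr;
                  uint32_t next_hop;
          };

          static int flow_match(struct cds_lfht_node *node, const void *key)
          {
                  const struct flow *f = caa_container_of(node, struct flow, node);
                  return f->daddr == *(const uint32_t *)key;
          }

          /* Read side: no lock taken, only an RCU read-side critical section. */
          static uint32_t flow_lookup(struct cds_lfht *table, uint32_t daddr,
                                      uint32_t hash)
          {
                  struct cds_lfht_iter iter;
                  struct cds_lfht_node *node;
                  uint32_t nh = 0;

                  rcu_read_lock();
                  cds_lfht_lookup(table, hash, flow_match, &daddr, &iter);
                  node = cds_lfht_iter_get_node(&iter);
                  if (node)
                          nh = caa_container_of(node, struct flow, node)->next_hop;
                  rcu_read_unlock();

                  return nh;
          }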
  23. Longest Prefix Match
      [diagram: the /24 table entry for 1.1.1.X points at a nexthop
       (if = dp0p9p1, gw = 2.2.33.5); example addresses 1.1.1.1 and 1.1.3.6]
  24. LPM issues
      ● Prefix → 8-bit next hop
      ● Missing barriers
      ● Rule updates
      ● Fixed-size /8 table
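
      For context, the rte_lpm API of that era returned the next hop as a
      uint8_t, which is the 8-bit limitation the slide refers to: a lookup can
      only yield 256 distinct next-hop indices, so anything richer needs an
      indirection table. The sketch below uses that older signature (later DPDK
      releases widened the next hop to 32 bits), so treat it as illustrative.

          #include <rte_lpm.h>

          /* Older DPDK rte_lpm: the next hop is a uint8_t, so at most 256
           * distinct next hops; routes typically store an index into a
           * separate next-hop table rather than the next hop itself. */
          static uint8_t lookup_next_hop(struct rte_lpm *lpm, uint32_t dst_ip)
          {
                  uint8_t nh_index = 0;

                  if (rte_lpm_lookup(lpm, dst_ip, &nh_index) != 0)
                          return 0;       /* no route: default/drop index */

                  return nh_index;
          }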
  25. PCI passthrough
      ● I/O TLB size
        – The hypervisor uses the IOMMU to map guest memory
        – The IOMMU has a small TLB cache
        – Guest I/O exceeds the TLB reach
      ● Solution
        – 1G hugepages on the KVM host
        – Put the guest in hugepages
        – Only on KVM; requires manual configuration
  26. DPDK Issues
      ● Static configuration
        – Features
        – CPU architecture
        – Table sizes
      ● Machine-specific initialization
        – Number of cores, memory channels
      ● Poor device model
        – Works for Intel E1000-like devices
  27. Conclusion
      ● DPDK can be used to build a fast router
        – 12M pps per core
      ● Lots of ways to go slow
        – Fewer ways to go fast
  28. 28. Q & A
  29. Thank you
      Stephen Hemminger
      stephen@networkplumber.org
      @networkplumber
