Dpdk applications


  1. 1. share, discuss, & ask
        DPDK: What is it, and what is it not? How to port an application? Where can we run it?
        ISSUES: What are they? Why did they occur?
        TOOLS: General, Debug guide, Custom
  2. 2. a) Platform: general-purpose processor vs. network processor (NPU)
           • An NPU has smaller caches and a lower clock speed (around 1 GHz).
           • Specialized ISA (Instruction Set Architecture) for single-clock parsing (e.g. IP, MPLS, VLAN).
           • Specialized schedulers keep OS and worker threads separate.
        b) Processing architecture:
           • HW: dedicated peripheral interrupt processing, reduced TLB misses.
           • SW: schedulers, locks.
           • Timers: scheduler tick; remove the SW watchdog on data-plane (DP) cores.
           • Inter-processor communication: LRU drain, rcu_barrier, paging, per-CPU drain.
        c) Locking overheads by CPU or memory:
           • IPIs between a large number of cores are expensive.
           • Locking memory or memory regions is expensive.
           • vmstat_update runs every second and updates virtual memory statistics.
        d) Pipeline latency: hide it in both HW and SW
           • Fetch packets in bulk for processing.
           • Prefetch into cache.
           • Allow bulk lookup for multiple packets in a burst.
        e) Bus interface: improve the PCIe (NIC) to CPU cache path
           • Design the cache to accommodate more frames per core (larger L2 or L3).
           • Use HW-assisted caching to manage incoming packets.
        f) Meta buffer:
           • Holding and accessing millions of packets per second in CPU cache or memory is not feasible, because of both latency overhead and size.
           • In a pre-parse stage, prepare metadata that holds the essential headers and fits in cache.
           • Prefetch bytes in an interleaved fashion so that the pipeline hides the latency.
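The latency-hiding idea in item d) (process a burst, prefetch ahead) can be sketched in plain C. This is only an illustration under stated assumptions: `struct pkt_meta`, `process_burst` and `PREFETCH_AHEAD` are hypothetical names that do not come from the slides, and `__builtin_prefetch` is the GCC/Clang builtin rather than a DPDK call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 4  /* hypothetical look-ahead distance; tune per platform */

/* Hypothetical per-packet metadata: pointer to the frame and its length. */
struct pkt_meta {
    const uint8_t *data;
    uint16_t len;
};

/*
 * Walk a burst of packets and sum the first byte of each one.
 * While packet i is processed, packet i + PREFETCH_AHEAD is prefetched,
 * so its cache miss overlaps with useful work.
 */
uint64_t process_burst(const struct pkt_meta *pkts, size_t n)
{
    uint64_t acc = 0;

    assert(pkts != NULL || n == 0);

    /* Warm the cache for the first few packets before touching any of them. */
    for (size_t i = 0; i < n && i < PREFETCH_AHEAD; i++)
        __builtin_prefetch(pkts[i].data, 0, 3);

    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(pkts[i + PREFETCH_AHEAD].data, 0, 3);
        if (pkts[i].len > 0)
            acc += pkts[i].data[0];
    }
    return acc;
}
```

The "sum the first byte" body merely stands in for real per-packet work such as a header parse or table lookup; the useful look-ahead distance depends on how long that work takes relative to memory latency.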
  3. 3. Measuring cross socket bandwidth
        [Diagram: two CPU sockets, each with CPU cores, LLC, memory controller, RAM and a NIC, connected through UPI controllers; the NIC writes the packet and RX descriptor (RXd)]
  4. 4. 4
  5. 5. Part 1: DPDK What we need to know!
  6. 6. Problem? 6
  7. 7. Problem?
        [Diagram, generic driver path: User App with User Buffer, Network Stack (SKB), generic driver (_rcv_ISR(), _hard_start_xmit()), RX DMA buffer and TX DMA buffer]
        [Diagram, SKB layout: head, data, tail, end, len; head room, user data, tail room, SKB shared info; frame]
  8. 8. Problem?
        [Diagram, XDP driver path: User App, Network Stack (SKB), XDP driver (_rcv_ISR(), _hard_start_xmit()), RX DMA buffer and TX DMA buffer; BPF actions XDP_DROP, XDP_TX, XDP_PASS and XDP_REDIRECT; map for XDP socket; XDP user buffer]
  9. 9. Application – Packet Life
        • Read from NIC
        • Check content
        • Ensure integrity
        • Do lookup / hash
        • Identify processing
        • Map to queue
        • Action per queue/schedule
        • Update stats counters
        • Send burst to NIC
        [Diagram: CPU, NIC and programmable NIC placed on a slow-to-fast scale]
  10. 10. What is DPDK? What is not DPDK?
        • HW support: huge page size, Data Direct I/O, SIMD
        • Conserve power or cycles as required
        • Allow multi-process data sharing without SYSCALL IPC (SHM, sockets, FIFO)
        • Either burst or low-latency polls
        • Adapt to small, big or hybrid cases
        • Prototype and deploy quickly
        • HW offload with SW fallback
        • Runs in user space (bypasses the kernel path)
        • Library of functions
  11. 11. Where can we run DPDK?
        [Diagram of deployment combinations: a host user-space application with DPDK + Ext on a NIC; Docker + application on a NIC; a VM guest with DPDK + Ext on a vNIC; Docker + application inside a VM guest on a vNIC]
  12. 12. Part 2: Porting Apps to DPDK
  13. 13. Other Applications
        [Diagram: network I/O over multiple 10 Gbit/s interfaces; RX NIC with RSS hash feeding parallel Capture, Decode, Stream, Detect, Output pipelines in user space; clear-text and encrypted traffic; control, configuration and stats]
        DPDK provides: parse for metadata, match for rule set, buffer and zero copy.
  14. 14. PMD 14 MEM-COPY ZERO-COPY https://www.youtube.com/watch?v=rsr_eIDCm8M
  15. 15. 15
  16. 16. [Charts]
        Chart 1: DPDK vs. AF-Workers at 64-byte and 1500-byte packets (y-axis 0 to 1200)
                 DPDK 64B   AF-Workers 64B   DPDK 1500B   AF-Workers 1500B
        P1 RX    1000       499              1000         826
        P1 TX    382        251              1000         416
        P2 RX    1000       475              1000         825
        P2 TX    382        213              1000         472
        Chart 2: igb_uio, xdp_memcpy, xdp_zc and xdp_zc (no offload) compared for rx drop, tx drop and rx-tx (l2fwd) (y-axis 0 to 16)
        Chart 3: connections/sec against key size (1024, 2048, 4096) for Linked List, Array and Hash Array (y-axis 0 to 35000)
  17. 17. Issues
        Feedback: works partially, with worse throughput
  18. 18. Overview 18 70% 20% 10% Interaction with teams for debug, live terminal, reproducing steps Let’s think and Identify where issue is in Application, DPDK, OVS, Platform, Kernel
  19. 19. Bottleneck Analysis
        • Is there a mismatch in packet rates (received < desired)?
        • Do the RX lcore threads get enough cycles?
        • Are packets dropped at receive or transmit?
        • What is the packet or object processing rate in the pipeline?
        • Is the performance of user functions not as expected?
        • Are execution cycles for dynamic service functions not frequent enough?
        • Is the packet not in the expected format?
  20. 20. Why are there various drops?
        • Stress & Regress: pkt-gen, trex
        • Generic tools: lstopo, dmidecode, libunwind
        • dpdk apps: proc-info, pdump
        • Isolate: Debug guide, numa, huge page, pinning
        • Characterize & Quantize: perf top, perf stats, vtune
        • Custom tools: malloc scanner, Memzone monitor, Thread Stack Tracer
  21. 21. Part 3: Tips for quick debug
  22. 22. 22 lstopo --pid 2 --fontsize 15 --gridsize 12 --no-collapse Somewhat Helpful!
  23. 23. Hardware related items
        • NIC details, configurations, firmware version via Linux
        • PCIe capability and current configurations
        • PCIe advertised speed and configurations
        • SFP and SFP+ details fetch
          lshw -c network -businfo
          lshw -c network | egrep 'firmware|pci@'
          ethtool -m | -k | -P | -S
        • CPU flags and features
          lscpu
          cat /proc/cpuinfo
        • HW performance counters (use perf and vtune on IA)
        Helpful!
  24. 24. Linux Signals
        Where three values are given, the first is usually valid for alpha and sparc, the middle one for x86, arm and most other architectures, and the last one for mips.
        Signal     Value      Action  Comment
        SIGHUP     1          Term    Hangup detected on controlling terminal or death of controlling process
        SIGINT     2          Term    Interrupt from keyboard
        SIGQUIT    3          Core    Quit from keyboard
        SIGILL     4          Core    Illegal Instruction
        SIGABRT    6          Core    Abort signal from abort(3)
        SIGFPE     8          Core    Floating point exception
        SIGKILL    9          Term    Kill signal
        SIGSEGV    11         Core    Invalid memory reference
        SIGPIPE    13         Term    Broken pipe: write to pipe with no readers
        SIGALRM    14         Term    Timer signal from alarm(2)
        SIGTERM    15         Term    Termination signal
        SIGUSR1    30,10,16   Term    User-defined signal 1
        SIGUSR2    31,12,17   Term    User-defined signal 2
        SIGCHLD    20,17,18   Ign     Child stopped or terminated
        SIGCONT    19,18,25   Cont    Continue if stopped
        SIGSTOP    17,19,23   Stop    Stop process
        SIGTSTP    18,20,24   Stop    Stop typed at terminal
        SIGTTIN    21,21,26   Stop    Terminal input for background process
        SIGTTOU    22,22,27   Stop    Terminal output for background process
        SIGBUS     10,7,10    Core    Bus error (bad memory access)
        The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
        Next, the signals not in the POSIX.1-1990 standard but described in SUSv2 and POSIX.1-2001:
        Signal     Value      Action  Comment
        SIGPOLL               Term    Pollable event (Sys V). Synonym for SIGIO
        SIGPROF    27,27,29   Term    Profiling timer expired
        SIGSYS     12,31,12   Core    Bad argument to routine (SVr4)
        SIGTRAP    5          Core    Trace/breakpoint trap
        SIGURG     16,23,21   Ign     Urgent condition on socket (4.2BSD)
        SIGVTALRM  26,26,28   Term    Virtual alarm clock (4.2BSD)
        SIGXCPU    24,24,30   Core    CPU time limit exceeded (4.2BSD)
        SIGXFSZ    25,25,31   Core    File size limit exceeded (4.2BSD)
        Not Sure!
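The rule that SIGKILL and SIGSTOP cannot be caught can be checked directly with `sigaction(2)`, which rejects both with `EINVAL`, while SIGUSR1 accepts a handler. A minimal sketch, assuming a Linux/POSIX system; `try_install`, `usr1_roundtrip` and `on_usr1` are hypothetical helper names introduced only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_seen;

/* Handler for SIGUSR1: only async-signal-safe work belongs here. */
static void on_usr1(int signo)
{
    (void)signo;
    usr1_seen = 1;
}

/* Try to set the disposition of signo; returns 0 on success, errno on failure. */
int try_install(int signo, void (*handler)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL) == 0 ? 0 : errno;
}

/* Install the SIGUSR1 handler, raise the signal, and report whether it ran. */
int usr1_roundtrip(void)
{
    int rc = try_install(SIGUSR1, on_usr1);

    assert(rc == 0);
    usr1_seen = 0;
    raise(SIGUSR1);   /* the handler runs before raise() returns */
    return usr1_seen;
}
```

Calling `try_install(SIGKILL, SIG_IGN)` or `try_install(SIGSTOP, SIG_IGN)` returns `EINVAL`, which is why a debug hook has to be attached to a catchable signal such as SIGUSR1 or SIGSEGV.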
  25. 25. STRACE
        strace -e trace=open,read <executable>
        strace -t -e open <executable>
        strace -r -e open <executable>
        strace -c <executable>
        strace -i <executable>
        strace -T -e read <executable>
        strace -e trace=network|signal|memory <executable>
        strace is a user-space utility for Linux used for diagnostics, debugging and instruction; it monitors system calls and signals. Its operation is made possible by the kernel feature known as ptrace.
        It can also:
        • Restrict tracing to a list of paths (-P /etc/ld.so.cache, for example).
        • Modify the return and error codes of the specified syscalls, and inject signals upon their execution (since strace 4.15, -e inject= option).
        • Extract information about file descriptors (including sockets, -y option).
        Not Helpful!
  26. 26. objdump
        • File header: -f
        • File format: -p
        • Section header: -h
        • All headers: -x
        • Executable sections: -d
        • Assembler sections: -D
        • Full contents: -s
        • Debug: -g
        • Symbol table: -t
        • Dynamic symbol table: -T
        • Dynamic relocation: -R
        • Function content via name: -s -j.rodata, -D --prefix-addresses
        • readelf --relocs
        Somewhat Helpful!
  27. 27. nm <executable>
        t|T – the symbol is in the .text code section
        b|B – the symbol is in the uninitialized data section (.bss)
        D|d – the symbol is in the initialized .data section
        nm -A ./*.o
        nm -u: undefined symbols
        nm -n: symbols sorted numerically by address
        nm -S: symbols with size
        nm -D: dynamic symbols
        A : Global absolute symbol.
        a : Local absolute symbol.
        B : Global bss symbol.
        b : Local bss symbol.
        D : Global data symbol.
        d : Local data symbol.
        f : Source file name symbol.
        L : Global thread-local symbol (TLS).
        l : Static thread-local symbol (TLS).
        T : Global text symbol.
        t : Local text symbol.
        U : Undefined symbol.
        Somewhat Helpful!
  28. 28. CPU utilization
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/sysinfo.h>

        /* Print the average CPU usage (percent) of the PID given in argv[1]. */
        int main(int argc, char *argv[])
        {
            char buf[1000];
            char *stat_result[5] = {0};   /* utime, stime, cutime, cstime, starttime */
            struct sysinfo info = {0};
            int res = 0;

            if (argc < 2 || sysinfo(&info) != 0)
                return EXIT_FAILURE;
            fprintf(stdout, "Process to fetch stat: %s\n", argv[1]);
            fprintf(stdout, "sysinfo \n");

            /* Fields 14-17 and 22 of /proc/<pid>/stat, all in clock ticks. */
            snprintf(buf, sizeof(buf),
                     "awk '{print $14 \",\" $15 \",\" $16 \",\" $17 \",\" $22}' /proc/%s/stat",
                     argv[1]);
            FILE *fp = popen(buf, "r");
            if (fp == NULL)
                return EXIT_FAILURE;
            char *parse = fgets(buf, sizeof(buf), fp);
            pclose(fp);
            for (char *p = parse ? strtok(parse, ",") : NULL; p && res < 5; p = strtok(NULL, ","))
                stat_result[res++] = p;
            if (res < 5)
                return EXIT_FAILURE;

            fprintf(stdout, " --- Calculation --- \n");
            unsigned long hertz = sysconf(_SC_CLK_TCK);
            unsigned long total_time = atol(stat_result[0]) + atol(stat_result[1]) +
                                       atol(stat_result[2]) + atol(stat_result[3]);
            /* Seconds the process has existed: uptime minus its start time. */
            unsigned long sec = info.uptime - (atol(stat_result[4]) / hertz);
            unsigned long cpu_usage = sec ? (100 * total_time) / (sec * hertz) : 0;
            fprintf(stdout, "cpu_usgae (%lu) for process (%s)\n", cpu_usage, argv[1]);
            return EXIT_SUCCESS;
        }
        Not Helpful!
  29. 29. GDB
        • Call library functions, or functions from within the debugged program, using the command call.
        • Start GDB with gdbtui or gdb -tui; switch views using 'layout src|asm|regs'.
        • shell allows you to execute commands in the shell.
        • print, examine and display.
        • info file: shows the entry point.
        • set disassembly-flavor intel
        • set print pretty
        • set print addr off
        • set print array / set print array on / set print array off
        • Display the next 5 instructions: x/5i $pc
        • disassemble <function name>
        Example .gdbinit file:
          file exe
          break *0x400710
          set disassembly-flavor intel
          layout asm
          layout regs
          run argument1 argument2
        We can use set to do the magic for us. Let's first inspect the instruction bytes, then overwrite them with NOPs:
          (gdb) x/10b $pc
          (gdb) set write
          (gdb) set {unsigned int}$pc = 0x90909090
          (gdb) set {unsigned char}($pc+4) = 0x90
          (gdb) set write off
          (gdb) x/10i $pc
        Result of x/6i $pc:
          => 0x40911f: nop
             0x409120: nop
             0x409121: nop
             0x409122: nop
             0x409123: nop
             0x409124: push rbp
        The same with absolute addresses:
          set {unsigned int}0x40911f = 0x90909090
          set {unsigned char}0x409123 = 0x90
        To skip the instructions instead of patching them:
          set $pc+=5
          jump *$pc+5
        Somewhat Helpful!
  30. 30. DPDK packet processing using Data Direct I/O
        1. Core writes RXd, preparing to receive a packet
        2. NIC reads RXd to get the buffer address
        3. NIC writes the packet
        4. NIC writes RXd
        5. Core reads RXd (polling)
        6. Core reads the packet and performs some action
        [Diagram: CPU socket with CPU cores, LLC, memory controller and RAM; the NIC exchanges packet and RXd with the LLC in steps 1 to 6]
        Easy!
  31. 31. Resource Director
        Monitoring:
        • Cache Monitoring Tech (CMT): per-thread L3 occupancy monitoring
        • Memory BW Monitoring (MBM): per-thread memory bandwidth monitoring
        Allocation:
        • Cache Allocation Tech (CAT): per-thread L3 occupancy control; new Code/Data Prioritization (CDP) extension
        • Memory Bandwidth Allocation: per-thread bandwidth control; new on Purley
        Somewhat Easy!
  32. 32. Packet generator Multi thread PDUMP 32 DPDK-0 DPDK-1 DPDK-1 librte_pdump Primary Application DPDK-PDUMP pkts-1.pcap pkts-1.pcap Easy!
  33. 33. PROCINFO 33 Easy!
  34. 34. Stack, register, variable trace for all threads
        When to use: an unexpected signal or crash occurs.
        What to do: dump stack and register information for all threads in an environment where GDB is not present or cannot be run.
        Where it works:
        • Binaries are stripped.
        • Binary and application have no debug symbols.
        • Rare cases and combinations in which faults occur.
        • Errors or faults that are difficult to reproduce.
        • There is no access to GDB or remote GDB, ptrace or pstack-dump.
        • Inspect the stack of each thread.
        • Inspect and dump global and debug variables.
        • DPDK, when the secondary causes the primary to segfault, or when running GDB on the primary causes the secondary to segfault.
        Q & A:
        • Does this work for all shared libraries? Yes
        • Does this work for mixed static and shared libraries? Yes
        • Does this work for all stripped libraries? Yes
        • Can we register SIGUSR1 to dump intermediate state? Yes
        How to make it work:
        Build:
        • LIB: libunwind-dev
        • CFLAGS: -DDUMPSTACK_EXTRAREG -DDUMPSTACK_EXTRASTACK -DDUMPSTACK -L/usr/lib/x86_64-linux-gnu/ -lunwind
        • LDFLAGS: -L/usr/lib/x86_64-linux-gnu/ -lunwind
        Application code change: add a signal handler that calls the custom signal handler.
        Somewhat Easy!
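The "add a signal handler" step can be illustrated with a much smaller stand-in than the libunwind-based tool the slide describes. The sketch below swaps in glibc's `backtrace()` from `<execinfo.h>` and dumps only the stack of the thread that receives the signal; it does not walk all threads or registers. `install_stack_dump`, `stack_dumps_done` and `dump_stack_handler` are hypothetical names for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_FRAMES 64

static volatile sig_atomic_t dumps_done;

/* Write the current thread's call stack to stderr without using malloc. */
static void dump_stack_handler(int signo)
{
    void *frames[MAX_FRAMES];
    int depth = backtrace(frames, MAX_FRAMES);

    (void)signo;
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    dumps_done++;
}

/* Register the dump handler for signo (e.g. SIGUSR1); returns 0 on success. */
int install_stack_dump(int signo)
{
    struct sigaction sa;
    void *warmup[1];

    /* Call backtrace() once up front so any lazy loading it needs
     * happens outside signal context. */
    backtrace(warmup, 1);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = dump_stack_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Number of dumps produced so far. */
int stack_dumps_done(void)
{
    return dumps_done;
}
```

With a stripped binary the output is mostly raw addresses, which can still be resolved offline against an unstripped copy; linking with `-rdynamic` makes more symbol names available at run time.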
  35. 35. trace stack
        ----------------- THREAD NAME BEGIN -----------------
        /proc/41253/task/41248/comm
        /proc/41253/task/41249/comm
        /proc/41253/task/41250/comm
        /proc/41253/task/41251/comm
        /proc/41253/task/41252/comm
        /proc/41253/task/41253/comm
        l2fwd
        eal-intr-thread
        lcore-slave-3
        lcore-slave-4
        lcore-slave-5
        pdump-thread
        ----------------- THREAD NAME DONE -----------------
        DPDK Version 0x11080010
        Config: msater 2 lcore count 4 process 0
        rte_sys_gettid 41253
  36. 36. Memzone Monitor 36 Lookup Table Direct Table Counters PRIMARY PROCESS Lookup Table Direct Table Counters SECONDARY PROCESS MMAP Huge Pages
  37. 37. Memzone Monitor
        When to use: the memory layout is shared across multiple processes, which can lead to
        • unintended changes within the same process,
        • unintended changes from another process,
        • application logic or function pointers modifying unintended areas.
        What to do: dump stack and register information for all threads in an environment where GDB is not present or cannot be run.
        Where it works:
        • Control and data plane are in the same or in different processes.
        • Tables are close together in memory.
        • Table entries are allocated dynamically with malloc.
        • Isolate the table or counter where the change is occurring.
        • Can monitor multiple tables.
        • Program errors.
        • Keys or values are read without const.
        • Values are modified using pointer arithmetic.
        Table layouts covered: Lookup; Lookup + Result; Lookup + Result + Counters; Counters or Index to Counters; Reference to Lookup, to Lookup + Result, or to Lookup + Result + Counter.
        Q & A:
        • Does this work for all shared libraries? Yes
        • Does this work for mixed static and shared libraries? Yes
        • Does this work for all stripped libraries? Yes
        • Can we register SIGUSR1 to dump intermediate state? Yes
        How it works: runs as a secondary application that periodically monitors the selected tables or memory regions and reports the offset at which a change occurred.
        Build:
        • LIB: libunwind-dev
        • CFLAGS: -DDUMPSTACK_EXTRAREG -DDUMPSTACK_EXTRASTACK -DDUMPSTACK -L/usr/lib/x86_64-linux-gnu/ -lunwind
        • LDFLAGS: -L/usr/lib/x86_64-linux-gnu/ -lunwind
        Application code change: add a signal handler that calls the custom signal handler.
        Somewhat Easy!
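The core of "periodically monitor a region and report the offset of the change" is a snapshot comparison. The sketch below shows that step on ordinary buffers; it is an assumption-laden illustration, not the linked tool's code. In a DPDK secondary process the `live` pointer would come from the shared memzone, and `region_first_change` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Compare the live region against the previous snapshot.
 * Returns the byte offset of the first difference, or -1 if nothing changed.
 * On a change the snapshot is refreshed from that offset onward, so the
 * next call reports only newer modifications.
 */
long region_first_change(uint8_t *snapshot, const uint8_t *live, size_t len)
{
    assert(snapshot != NULL && live != NULL);

    for (size_t i = 0; i < len; i++) {
        if (snapshot[i] != live[i]) {
            memcpy(snapshot + i, live + i, len - i);
            return (long)i;
        }
    }
    return -1;
}
```

A monitor would call this from a timer loop for each watched table and map the returned offset back to a table entry (offset divided by entry size) to name the entry that was overwritten.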
  38. 38. 38 1 2 3 4 5
  39. 39. 39
  40. 40. MALLOC-FREE Scanner
        When to use: as a quick and dirty valgrind-like reporting tool.
        What to do:
        • For every successful malloc, calloc or zalloc, create a container that holds name, pointer and size.
        • For every free of an allocated entry, remove the container.
        How it works:
        • Create a 'struct rte_fbarray' in memory reserved with 'rte_memzone_reserve'.
        • In the primary process, call 'rte_fbarray_init'.
        • In the secondary process, call 'rte_fbarray_attach'.
        • In the primary process, for each alloc, retrieve a container with 'rte_fbarray_find_next_free'.
        • For each successful alloc, mark it with 'rte_fbarray_set_used'.
        • For each free, call 'rte_fbarray_set_free'.
        • In the secondary process, fetch the details back with 'rte_fbarray_find_next_used' or 'rte_fbarray_find_next_n_used'.
        Where it works:
        • rte_malloc, rte_calloc and rte_zalloc do not map the allocation region name to an address.
        • This makes it difficult to track usage by dynamically allocated instances.
        [Diagram: memory segments Seg-0, Seg-1, Seg-2 ... Seg-n; memzone container holding Alloc-1, Alloc-2, Alloc-3; secondary attaches through rte_fbarray_attach]
        Easy!
  41. 41. Dynamic DEBUG with eBPF (user-space)
        [Diagram: Lookup Table, Counters]
        API:
        I. Application specific
        II. DPDK eBPF functions for Debug API
        When to use: for dynamic debug
        What to do: load eBPF into existing applications
        How it works: same as user-space eBPF
        Where:
        1. Applications in the field
        2. Recompile not possible
        3. Compiler MACROs not possible
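The bookkeeping described above (one container per live allocation, removed on free) can be sketched without DPDK. This is a single-process stand-in under stated assumptions: a fixed static array with a `used` flag plays the role of the `rte_fbarray` used/free marking, plain `malloc`/`free` replace the `rte_malloc` family, and `tracked_malloc`, `tracked_free` and `tracked_outstanding` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TRACK_SLOTS 128   /* hypothetical capacity of the tracking table */

/* One container per live allocation: name, pointer and size. */
struct alloc_rec {
    const char *name;
    void *ptr;
    size_t size;
    int used;
};

static struct alloc_rec recs[TRACK_SLOTS];

/* Allocate memory and record it in the first free slot. */
void *tracked_malloc(const char *name, size_t size)
{
    void *p = malloc(size);

    if (p == NULL)
        return NULL;
    for (int i = 0; i < TRACK_SLOTS; i++) {
        if (!recs[i].used) {
            recs[i] = (struct alloc_rec){ name, p, size, 1 };
            return p;
        }
    }
    return p;   /* table full: the allocation succeeds but goes untracked */
}

/* Release memory and drop its record, if one exists. */
void tracked_free(void *p)
{
    for (int i = 0; i < TRACK_SLOTS; i++) {
        if (recs[i].used && recs[i].ptr == p) {
            recs[i].used = 0;
            break;
        }
    }
    free(p);
}

/* Count live allocations; optionally report their total size in bytes. */
size_t tracked_outstanding(size_t *bytes)
{
    size_t count = 0, total = 0;

    for (int i = 0; i < TRACK_SLOTS; i++) {
        if (recs[i].used) {
            count++;
            total += recs[i].size;
        }
    }
    if (bytes != NULL)
        *bytes = total;
    return count;
}
```

Anything still reported by `tracked_outstanding` at shutdown is a leak candidate, and the stored name says which subsystem allocated it. The sketch is not thread-safe; a real multi-lcore version would need per-lcore tables or locking.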
  42. 42. # llvm-objdump -S t3.o
        t3.o: file format ELF64-BPF
        Disassembly of section .text:
        entry:
           0: bf 12 00 00 00 00 00 00  r2 = r1
           1: 69 21 10 00 00 00 00 00  r1 = *(u16 *)(r2 + 16)
           2: 79 23 00 00 00 00 00 00  r3 = *(u64 *)(r2 + 0)
           3: 0f 13 00 00 00 00 00 00  r3 += r1
           4: 69 31 0c 00 00 00 00 00  r1 = *(u16 *)(r3 + 12)
           5: 15 01 01 00 08 06 00 00  if r1 == 1544 goto +1 <LBB0_2>
           6: 55 01 05 00 08 00 00 00  if r1 != 8 goto +5 <LBB0_3>
        LBB0_2:
           7: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00  r1 = 0 ll
           9: 79 11 00 00 00 00 00 00  r1 = *(u64 *)(r1 + 0)
          10: b7 03 00 00 40 00 00 00  r3 = 64
          11: 85 10 00 00 ff ff ff ff  call -1
        LBB0_3:
          12: b7 00 00 00 01 00 00 00  r0 = 1
          13: 95 00 00 00 00 00 00 00  exit
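Read as C, the listing appears to load a 16-bit value 12 bytes into the packet data (the EtherType position in an Ethernet header) and to call a helper when it equals 1544 or 8. Those constants are 0x0806 (ARP) and 0x0800 (IPv4) as seen by a little-endian 16-bit load. The loads at offsets 0 and 16 are consistent with the `buf_addr` and `data_off` fields of `struct rte_mbuf`, but that reading is an assumption, and the helper behind `call -1` is unresolved in the listing. The sketch below reproduces only the comparison on a raw frame; `frame_matches` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_HDR_LEN    14   /* Ethernet header: 6 + 6 + 2 bytes */
#define FRAME_TYPE_OFF   12   /* offset of the EtherType field */

/*
 * Return 1 if the frame's EtherType is ARP or IPv4, else 0.
 * The two bytes are combined the way a little-endian u16 load would see
 * them, so the constants match the ones in the BPF listing.
 */
int frame_matches(const uint8_t *frame, size_t len)
{
    uint16_t raw;

    if (frame == NULL || len < FRAME_HDR_LEN)
        return 0;

    raw = (uint16_t)(frame[FRAME_TYPE_OFF] | (frame[FRAME_TYPE_OFF + 1] << 8));

    return raw == 1544   /* 0x0608: bytes 08 06, EtherType ARP  */
        || raw == 8;     /* 0x0008: bytes 08 00, EtherType IPv4 */
}
```

In the listing both branches end at `r0 = 1; exit`, so the filter always returns 1 and the EtherType test only decides whether the helper is called first.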

Editor's Notes

  • Packet Classifier – stateful or stateless flow pinning, load balancing
    Kernel interface for IP for PMD ports
    Node aware resource allocation
  • https://github.com/vipinpv85/DPDK_SURICATA-4_1_1
    https://github.com/vipinpv85/DPDK-Suricata_3.0
  • Mem-copy:
    Pros: XDP buffers are released to the pool immediately after the copy.
    Cons: Limited by vector instructions (a large copy is split into multiple smaller copies; HW is limited to 2 loads and 1 store per vector operation). With SIMD-512 a single copy moves only 64 B (512 bits).

    Zero-Copy:
    Pros: The buffer is already in DPDK buffer format; no copy or external buffer is needed.
    Cons: All buffers need to be page aligned; applications need to be adapted; the buffer is held by the application until the packet is dropped or TX completes.
  • single or multiple primary processes.
    single primary and single secondary.
    single primary and multiple secondaries.
  • https://patches.dpdk.org/cover/50379
    https://patches.dpdk.org/cover/50380
    https://patches.dpdk.org/cover/50381
  • https://en.wikibooks.org/wiki/X86_Assembly/GAS_Syntax
  • If LD_PRELOAD is set to the path of a shared object, that file is loaded before any other library (including the C runtime, libc.so).

    To run with a special library (for example a custom malloc): 'LD_PRELOAD=/path/to/my/malloc.so /bin/ls'
  • Process to fetch stat: 6871
    sysinfo
    uptime: 40936
    loads: 1min (127424) 5min (77472) 15min (42688)
    RAM: free (49962647552) shared (15323136) buffer (440262656)
    swap: total (0) free (0)
    procs: 919
    uptime: 40935.02 3598954.44

    utime: 12301
    stime: 269
    cutime: 0
    cstime: 0
    starttime: 4089122

    --- Calculation ---
    Hertz: 100
    total time (12570)
    sec (45)
    cpu_usgae (279)
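The figures in this sample output can be checked by hand. With 100 clock ticks per second, and all divisions being integer divisions as in the C code:

```latex
\begin{aligned}
\text{total\_time} &= 12301 + 269 + 0 + 0 = 12570 \ \text{ticks} \\
\text{sec} &= 40936 - \left\lfloor \frac{4089122}{100} \right\rfloor = 40936 - 40891 = 45 \\
\text{cpu\_usage} &= \left\lfloor \frac{100 \times 12570}{45 \times 100} \right\rfloor
                   = \left\lfloor 279.33 \right\rfloor = 279
\end{aligned}
```

A value above 100 is expected for a multi-threaded process: utime and stime accumulate ticks from all threads, so 279 corresponds to roughly 2.8 cores kept busy over the 45 seconds the process had been running.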
  • https://github.com/vipinpv85/DPDK-THREADTRACE-WITHOUTGDB
  • https://github.com/vipinpv85/DPDK-MEMZONEMONITOR
  • https://github.com/vipinpv85/DPDK-MALLOCFREE-SCANNER
