Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

XPDDS17: uniprof: Transparent Unikernel Performance Profiling and Debugging - Florian Schmidt, NEC

172 views

Published on

Unikernels are increasingly gaining traction as they provide lightweight, low-overhead, high-performance execution of applications, while keeping the high isolation guarantees of virtualization crucial to multi-tenant deployments. However, developers still lack tools to support their development, especially compared to the rich toolsets that full operating systems such as Linux or BSD provide.

In this talk, Florian will present uniprof, a unikernel profiler and performance debugger that gives developers insight into their unikernel behavior transparently, without having to instrument the unikernel itself. Uniprof works on both x86 and ARM, can profile even when frame pointers are unavailable, and can be used with visualization tools such as flame graphs. It incurs only minimal overhead (~0.1% at 100 samples/s) to the unikernel, making it ideal for profiling even on production systems.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

XPDDS17: uniprof: Transparent Unikernel Performance Profiling and Debugging - Florian Schmidt, NEC

  1. 1. uniprof: Transparent Unikernel Performance Profiling & Debugging Florian Schmidt, Research Scientist, NEC Europe Ltd.
  2. 2. 2 Unikernels? ▌Faster, smaller, better!
  3. 3. 3 Unikernels? ▌Faster, smaller, better! ▌But ever heard this? Unikernels are hard to debug. Kernel debugging is horrible! cliparts:clipproject.info
  4. 4. 4 Unikernels? ▌Faster, smaller, better! ▌But ever heard this? ▌Then you might say Unikernels are hard to debug. Kernel debugging is horrible! But that’s not really true! Unikernels are a single linked binary. They have a shared address space. You can just use gdb! cliparts:clipproject.info
  5. 5. 5 Unikernels? ▌Faster, smaller, better! ▌But ever heard this? ▌Then you might say ▌And while that is true… ▌… we are admittedly lacking tools Unikernels are hard to debug. Kernel debugging is horrible! But that’s not really true! Unikernels are a single linked binary. They have a shared address space. You can just use gdb! cliparts:clipproject.info
  6. 6. 6 Unikernels? ▌Faster, smaller, better! ▌But ever heard this? ▌Then you might say ▌And while that is true… ▌… we are admittedly lacking tools ▌Such as effective profilers Unikernels are hard to debug. Kernel debugging is horrible! But that’s not really true! Unikernels are a single linked binary. They have a shared address space. You can just use gdb! cliparts:clipproject.info
  7. 7. 7 Enter uniprof ▌Goals: Performance profiler No changes to profiled code necessary Minimal overhead
  8. 8. 8 Enter uniprof ▌Goals: Performance profiler No changes to profiled code necessary Minimal overhead  Useful in production environments
  9. 9. 9 Enter uniprof ▌Goals: Performance profiler No changes to profiled code necessary Minimal overhead  Useful in production environments ▌So, a stack profiler Collect stack traces at regular intervals call_main+0x278 main+0x1c schedule+0x3a monotonic_clock+0x1a
  10. 10. 10 Enter uniprof ▌Goals: Performance profiler No changes to profiled code necessary Minimal overhead  Useful in production environments ▌So, a stack profiler Collect stack traces at regular intervals Many of them call_main+0x278 main+0x1c schedule+0x3a monotonic_clock+0x1a call_main+0x278 main+0x1c netfront_rx+0xa call_main+0x278 main+0x1c netfront_rx+0xa netfront_get_responses+0x1c netfrontif_rx_handler+0x20 netfrontif_transmit+0x1a0 netfront_xmit_pbuf+0xa4 call_main+0x278 main+0x1c blkfront_aio_poll+0x32
  11. 11. 11 Enter uniprof ▌Goals: Performance profiler No changes to profiled code necessary Minimal overhead  Useful in production environments ▌So, a stack profiler Collect stack traces at regular intervals Many of them Analyze which code paths show up often • Either because they take a long time • Or because they are hit often Point towards potential bottlenecks call_main+0x278 main+0x1c schedule+0x3a monotonic_clock+0x1a call_main+0x278 main+0x1c netfront_rx+0xa call_main+0x278 main+0x1c netfront_rx+0xa netfront_get_responses+0x1c netfrontif_rx_handler+0x20 netfrontif_transmit+0x1a0 netfront_xmit_pbuf+0xa4 call_main+0x278 main+0x1c blkfront_aio_poll+0x32
  12. 12. 12 xenctx ▌Turns out, a stack profiler for Xen already exists Well, kinda
  13. 13. 13 xenctx ▌Turns out, a stack profiler for Xen already exists Well, kinda ▌xenctx is bundled with Xen Introspection tool Option to print call stack $ xenctx -f -s <symbol table file> <DOMID> [...] Call Trace: [<0000000000004868>] three+0x58 <-- 00000000000ffea0: [<00000000000044f2>] two+0x52 00000000000ffef0: [<00000000000046a6>] one+0x12 00000000000fff40: [<000000000002ff66>] 00000000000fff80: [<0000000000012018>] call_main+0x278
  14. 14. 14 xenctx ▌Turns out, a stack profiler for Xen already exists Well, kinda ▌xenctx is bundled with Xen Introspection tool Option to print call stack ▌So if we run this over and over, we have a stack profiler Well, kinda $ xenctx -f -s <symbol table file> <DOMID> [...] Call Trace: [<0000000000004868>] three+0x58 <-- 00000000000ffea0: [<00000000000044f2>] two+0x52 00000000000ffef0: [<00000000000046a6>] one+0x12 00000000000fff40: [<000000000002ff66>] 00000000000fff80: [<0000000000012018>] call_main+0x278
  15. 15. 15 xenctx ▌Downside: xenctx is slow Very slow: 3ms+ per trace Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s) Can’t really blame it, not designed as a fast stack profiler
  16. 16. 16 xenctx ▌Downside: xenctx is slow Very slow: 3ms+ per trace Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s) Can’t really blame it, not designed as a fast stack profiler ▌Performance isn’t just a nice-to-have We interrupt the guest all the time Can’t walk stack while guest is running: race conditions High overhead can influence results! Low overhead is imperative for use on production unikernels
  17. 17. 17 xenctx ▌Downside: xenctx is slow Very slow: 3ms+ per trace Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s) Can’t really blame it, not designed as a fast stack profiler ▌Performance isn’t just a nice-to-have We interrupt the guest all the time Can’t walk stack while guest is running: race conditions High overhead can influence results! Low overhead is imperative for use on production unikernels ▌First question: extend xenctx or write something from scratch? Spoiler: look at the talk title More insight when I come to the evaluation
  18. 18. 18 What do we need?
  19. 19. 19 What do we need? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall
  20. 20. 20 What do we need? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall ▌Access to stack memory (to read return addresses and next FPs) This is the complicated step We need to do address resolution
  21. 21. 21 What do we need? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall ▌Access to stack memory (to read return addresses and next FPs) This is the complicated step We need to do address resolution • Memory introspection requires mapping memory over • We’re looking at (uni)kernel code • But there’s still a virtual  (guest) physical resolution
  22. 22. 22 What do we need? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall ▌Access to stack memory (to read return addresses and next FPs) This is the complicated step We need to do address resolution • Memory introspection requires mapping memory over • We’re looking at (uni)kernel code • But there’s still a virtual  (guest) physical resolution • We can’t use hardware support (PVH), because we’re looking in from outside • So we need to manually walk page tables
  23. 23. 23 What do we need? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall ▌Access to stack memory (to read return addresses and next FPs) This is the complicated step We need to do address resolution • Memory introspection requires mapping memory over • We’re looking at (uni)kernel code • But there’s still a virtual  (guest) physical resolution • We can’t use hardware support (PVH), because we’re looking in from outside • So we need to manually walk page tables ▌Symbol table (to resolve function names) Thankfully, this is easy again: extract symbols from ELF with nm
  24. 24. 24 Stack Local variables RegistersIP FP … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace:
  25. 25. 25 Stack Local variables RegistersIP FP function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace:
  26. 26. 26 Stack Local variables RegistersIP FP function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xcaIP
  27. 27. 27 Stack Local variables RegistersIP FP function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xcaIP
  28. 28. 28 Stack Local variables RegistersIP FP function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xcaIP
  29. 29. 29 Stack Local variables RegistersIP FP function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 IP FP+1word
  30. 30. 30 Stack Local variables RegistersIP FP function one() { […] two(); […] } function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 IP FP+1word
  31. 31. 31 Stack Local variables RegistersIP FP function one() { […] two(); […] } function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 one +0x0d IP FP+1word *FP+1word
  32. 32. 32 Stack Local variables RegistersIP FP function one() { […] two(); […] } function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 one +0x0d IP FP+1word *FP+1word
  33. 33. 33 Stack Local variables RegistersIP FP function one() { […] two(); […] } function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 one +0x0d [done] IP FP+1word *FP+1word **FP==NULL
  34. 34. 34 Walking the page tables (x86-64) CR3 virtual address
  35. 35. 35 Walking the page tables (x86-64) CR3 virtual address
  36. 36. 36 Walking the page tables (x86-64) CR3 virtual address
  37. 37. 37 Walking the page tables (x86-64) CR3 L1 virtual address
  38. 38. 38 Walking the page tables (x86-64) CR3 L1 virtual address
  39. 39. 39 Walking the page tables (x86-64) CR3 L1 L2 virtual address
  40. 40. 40 Walking the page tables (x86-64) CR3 L1 L2 L3 virtual address
  41. 41. 41 Walking the page tables (x86-64) CR3 L1 L2 L3 L4 virtual address
  42. 42. 42 Walking the page tables (x86-64) CR3 L1 L2 L3 L4 (guest) physical address virtual address
  43. 43. 43 Walking the page tables (x86-64) CR3 L1 L2 L3 L4 (guest) physical address ▌So many maps: 5 per entry * stack depth map! map! map! map! map! virtual address
  44. 44. 44 Walking the page tables (x86-64) CR3 L1 L2 L3 L4 (guest) physical address ▌So many maps: 5 per entry * stack depth ▌Then again, page table locations don’t change… Neither do stack locations (exception: lots of thread spawning) Effective caching map! map! map! map! map! virtual address
  45. 45. 45 Walking the page tables (x86-64) CR3 L1 L2 L3 L4 (guest) physical address ▌So many maps: 5 per entry * stack depth ▌Then again, page table locations don’t change… Neither do stack locations (exception: lots of thread spawning) Effective caching virtual address map! map! map! map! map!
  46. 46. 46 Create Symbol Table ▌Stack only contains addresses ▌Symbol resolution necessary
  47. 47. 47 Create Symbol Table ▌Stack only contains addresses ▌Symbol resolution necessary ▌Trivial Virtual addresses mapped 1:1 into unikernel address space nm is your friend $ nm -n <ELF> > symtab $ head symtab 0000000000000000 T _start 0000000000000000 T _text 0000000000000008 a RSP_OFFSET 0000000000000017 t stack_start 00000000000000fc a KERNEL_CS_MASK 0000000000001000 t shared_info 0000000000002000 t hypercall_page 0000000000003000 t error_entry 000000000000304f t error_call_handler 0000000000003069 t hypervisor_callback
  48. 48. 48 Create Symbol Table ▌Stack only contains addresses ▌Symbol resolution necessary ▌Trivial Virtual addresses mapped 1:1 into unikernel address space nm is your friend ▌Needs unstripped binary You’re welcome to strip it afterwards $ nm -n <ELF> > symtab $ head symtab 0000000000000000 T _start 0000000000000000 T _text 0000000000000008 a RSP_OFFSET 0000000000000017 t stack_start 00000000000000fc a KERNEL_CS_MASK 0000000000001000 t shared_info 0000000000002000 t hypercall_page 0000000000003000 t error_entry 000000000000304f t error_call_handler 0000000000003069 t hypervisor_callback
  49. 49. 49 What do we get?
  50. 50. 50 What do we get? Flamegraphs! ▌ Y Axis: call trace  Bottom: main function, each layer: one call depth ▌ X Axis: relative run time  Call paths are aggregated, no same call path twice in graph https://github.com/brendangregg/flamegraph
  51. 51. 51 What do we get? Flamegraphs! ▌ Y Axis: call trace  Bottom: main function, each layer: one call depth ▌ X Axis: relative run time  Call paths are aggregated, no same call path twice in graph ▌ In this example: netfront functions “heavy hitters”  netfront_xmit_pbuf  netfront_rx  but also blkfront_aio_poll 1 3 2 1 3 2 2 1 3 https://github.com/brendangregg/flamegraph
  52. 52. 52 What do we get? Flamegraphs! ▌ Y Axis: call trace  Bottom: main function, each layer: one call depth ▌ X Axis: relative run time  Call paths are aggregated, no same call path twice in graph ▌ In this example: netfront functions “heavy hitters”  netfront_xmit_pbuf  netfront_rx  but also blkfront_aio_poll 1 3 2 1 3 2 2 1 3 Yep, it’s a MiniOS* doing network communication *with lwip for TCP/IP https://github.com/brendangregg/flamegraph
  53. 53. 53 Try 1: improving xenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations
  54. 54. 54 Try 1: improving xenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations ▌xenctx resolves symbols via linear search  Solution: use binary search
  55. 55. 55 Try 1: improving xenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations ▌xenctx resolves symbols via linear search  Solution: use binary search  (Or, even better, do resolutions offline after tracing)
  56. 56. 56 Try 1: improving xenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations ▌xenctx resolves symbols via linear search  Solution: use binary search  (Or, even better, do resolutions offline after tracing)
  57. 57. 57 Try 1: improving xenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations ▌xenctx resolves symbols via linear search  Solution: use binary search  (Or, even better, do resolutions offline after tracing) At this point, I abandoned xenctx and (re)wrote uniprof from scratch.
  58. 58. 58 Try 2: uniprof ▌100-fold improvement is nice! But we can do better: Xen 4.7 introduced low-level libraries (libxencall, libxenforeigmemory) Another significant reduction by ~ factor of 3
  59. 59. 59 Try 2: uniprof ▌100-fold improvement is nice! But we can do better: Xen 4.7 introduced low-level libraries (libxencall, libxenforeigmemory) Another significant reduction by ~ factor of 3 ▌End result: overhead of ~0.1% @101 samples/s
  60. 60. 60 Performance on ARM ▌uniprof supports ARM (xenctx doesn’t) Main challenge: different page table design
  61. 61. 61 Performance on ARM ▌uniprof supports ARM (xenctx doesn’t) Main challenge: different page table design ▌ARM: much slower, overhead higher But the CPU is much slower, too (Intel Xeon @3.7GHz vs. Cortex A20 @1GHz) So fewer samples/s needed for same effective resolution
  62. 62. 62 No Frame Pointer? No Problem! ▌Stack walking relies on frame pointer Optimizations can reuse FP as general-purpose register (-fomit-frame-pointer)
  63. 63. 63 No Frame Pointer? No Problem! ▌Stack walking relies on frame pointer Optimizations can reuse FP as general-purpose register (-fomit-frame-pointer) ▌But we can do without FPs Use stack unwinding information • It’s already included if you use C++ (for exception handling) • It doesn’t change performance • Only binary size DWARF standard $ readelf –S <ELF> There are 13 section headers, starting at offset 0x40d58: Section Headers: [Nr] Name Type Address Offset Size EntSize Flags Link Info Align [...] [ 4] .eh_frame PROGBITS 0000000000035860 00036860 00000000000066f8 0000000000000000 A 0 0 8 [ 5] .eh_frame_hdr PROGBITS 000000000003bf58 0003cf58 000000000000128c 0000000000000000 A 0 0 4 [...]
  64. 64. 64 Unwinding without Frame Pointers ▌How does it work?
  65. 65. 65 Unwinding without Frame Pointers ▌How does it work?
  66. 66. 66 Unwinding without Frame Pointers ▌How does it work? ▌Lookup table For every program address • The current frame size • Locations of registers to restore (GP and IP)
  67. 67. 67 Unwinding without Frame Pointers ▌How does it work? ▌Lookup table For every program address • The current frame size • Locations of registers to restore (GP and IP) Important for exception handling • Exit functions immediately until handler is found
  68. 68. 68 Unwinding without Frame Pointers ▌How does it work? ▌Lookup table For every program address • The current frame size • Locations of registers to restore (GP and IP) Important for exception handling • Exit functions immediately until handler is found ▌Index to quickly find table entry
  69. 69. 69 Unwinding without Frame Pointers ▌How does it work? ▌Lookup table For every program address • The current frame size • Locations of registers to restore (GP and IP) Important for exception handling • Exit functions immediately until handler is found ▌Index to quickly find table entry ▌Several library implementations uniprof uses libunwind Actually, a libunwind patched for Xen guest introspection support Might be useful for other tools?
  70. 70. 70 Performance: uniprof w/ libunwind ▌Performance lower than with frame pointer Reason: libunwind does more than we need (full register reconstruction etc.) ▌Different library or own implementation promising But “good enough” for many cases And a good area for future work
  71. 71. 71 Enter the title. Thank you! Questions? uniprof: https://github.com/cnplab/uniprof libunwind-xen: https://github.com/cnplab/libunwind FlameGraphs: https://github.com/brendangregg/flamegraph

×