uniprof: Transparent Unikernel
Performance Profiling & Debugging
Florian Schmidt, Research Scientist, NEC Europe Ltd.
2
Unikernels?
▌Faster, smaller, better!
3
Unikernels?
▌Faster, smaller, better!
▌But ever heard this? Unikernels are hard to debug.
Kernel debugging is horrible!
cliparts:clipproject.info
4
Unikernels?
▌Faster, smaller, better!
▌But ever heard this?
▌Then you might say
Unikernels are hard to debug.
Kernel debugging is horrible!
But that’s not really true!
Unikernels are a single linked binary.
They have a shared address space.
You can just use gdb!
cliparts:clipproject.info
5
Unikernels?
▌Faster, smaller, better!
▌But ever heard this?
▌Then you might say
▌And while that is true…
▌… we are admittedly lacking tools
Unikernels are hard to debug.
Kernel debugging is horrible!
But that’s not really true!
Unikernels are a single linked binary.
They have a shared address space.
You can just use gdb!
cliparts:clipproject.info
6
Unikernels?
▌Faster, smaller, better!
▌But ever heard this?
▌Then you might say
▌And while that is true…
▌… we are admittedly lacking tools
▌Such as effective profilers
Unikernels are hard to debug.
Kernel debugging is horrible!
But that’s not really true!
Unikernels are a single linked binary.
They have a shared address space.
You can just use gdb!
cliparts:clipproject.info
7
Enter uniprof
▌Goals:
Performance profiler
No changes to profiled code necessary
Minimal overhead
8
Enter uniprof
▌Goals:
Performance profiler
No changes to profiled code necessary
Minimal overhead
 Useful in production environments
9
Enter uniprof
▌Goals:
Performance profiler
No changes to profiled code necessary
Minimal overhead
 Useful in production environments
▌So, a stack profiler
Collect stack traces at regular intervals
call_main+0x278
main+0x1c
schedule+0x3a
monotonic_clock+0x1a
10
Enter uniprof
▌Goals:
Performance profiler
No changes to profiled code necessary
Minimal overhead
 Useful in production environments
▌So, a stack profiler
Collect stack traces at regular intervals
Many of them
call_main+0x278
main+0x1c
schedule+0x3a
monotonic_clock+0x1a
call_main+0x278
main+0x1c
netfront_rx+0xa
call_main+0x278
main+0x1c
netfront_rx+0xa
netfront_get_responses+0x1c
netfrontif_rx_handler+0x20
netfrontif_transmit+0x1a0
netfront_xmit_pbuf+0xa4
call_main+0x278
main+0x1c
blkfront_aio_poll+0x32
11
Enter uniprof
▌Goals:
Performance profiler
No changes to profiled code necessary
Minimal overhead
 Useful in production environments
▌So, a stack profiler
Collect stack traces at regular intervals
Many of them
Analyze which code paths show up often
• Either because they take a long time
• Or because they are hit often
Point towards potential bottlenecks
call_main+0x278
main+0x1c
schedule+0x3a
monotonic_clock+0x1a
call_main+0x278
main+0x1c
netfront_rx+0xa
call_main+0x278
main+0x1c
netfront_rx+0xa
netfront_get_responses+0x1c
netfrontif_rx_handler+0x20
netfrontif_transmit+0x1a0
netfront_xmit_pbuf+0xa4
call_main+0x278
main+0x1c
blkfront_aio_poll+0x32
12
xenctx
▌Turns out, a stack profiler for Xen already exists
Well, kinda
13
xenctx
▌Turns out, a stack profiler for Xen already exists
Well, kinda
▌xenctx is bundled with Xen
Introspection tool
Option to print call stack
$ xenctx -f -s <symbol table file> <DOMID>
[...]
Call Trace:
[<0000000000004868>] three+0x58 <--
00000000000ffea0: [<00000000000044f2>] two+0x52
00000000000ffef0: [<00000000000046a6>] one+0x12
00000000000fff40: [<000000000002ff66>]
00000000000fff80: [<0000000000012018>] call_main+0x278
14
xenctx
▌Turns out, a stack profiler for Xen already exists
Well, kinda
▌xenctx is bundled with Xen
Introspection tool
Option to print call stack
▌So if we run this over and over, we have a stack profiler
Well, kinda
$ xenctx -f -s <symbol table file> <DOMID>
[...]
Call Trace:
[<0000000000004868>] three+0x58 <--
00000000000ffea0: [<00000000000044f2>] two+0x52
00000000000ffef0: [<00000000000046a6>] one+0x12
00000000000fff40: [<000000000002ff66>]
00000000000fff80: [<0000000000012018>] call_main+0x278
15
xenctx
▌Downside: xenctx is slow
Very slow: 3ms+ per trace
Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s)
Can’t really blame it, not designed as a fast stack profiler
16
xenctx
▌Downside: xenctx is slow
Very slow: 3ms+ per trace
Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s)
Can’t really blame it, not designed as a fast stack profiler
▌Performance isn’t just a nice-to-have
We interrupt the guest all the time
Can’t walk stack while guest is running: race conditions
High overhead can influence results!
Low overhead is imperative for use on production unikernels
17
xenctx
▌Downside: xenctx is slow
Very slow: 3ms+ per trace
Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s)
Can’t really blame it, not designed as a fast stack profiler
▌Performance isn’t just a nice-to-have
We interrupt the guest all the time
Can’t walk stack while guest is running: race conditions
High overhead can influence results!
Low overhead is imperative for use on production unikernels
▌First question: extend xenctx or write something from scratch?
Spoiler: look at the talk title
More insight when I come to the evaluation
18
What do we need?
19
What do we need?
▌Registers (for FP, IP)
This is pretty easy: getvcpucontext() hypercall
20
What do we need?
▌Registers (for FP, IP)
This is pretty easy: getvcpucontext() hypercall
▌Access to stack memory (to read return addresses and next FPs)
This is the complicated step
We need to do address resolution
21
What do we need?
▌Registers (for FP, IP)
This is pretty easy: getvcpucontext() hypercall
▌Access to stack memory (to read return addresses and next FPs)
This is the complicated step
We need to do address resolution
• Memory introspection requires mapping memory over
• We’re looking at (uni)kernel code
• But there’s still a virtual  (guest) physical resolution
22
What do we need?
▌Registers (for FP, IP)
This is pretty easy: getvcpucontext() hypercall
▌Access to stack memory (to read return addresses and next FPs)
This is the complicated step
We need to do address resolution
• Memory introspection requires mapping memory over
• We’re looking at (uni)kernel code
• But there’s still a virtual  (guest) physical resolution
• We can’t use hardware support (PVH), because we’re looking in from outside
• So we need to manually walk page tables
23
What do we need?
▌Registers (for FP, IP)
This is pretty easy: getvcpucontext() hypercall
▌Access to stack memory (to read return addresses and next FPs)
This is the complicated step
We need to do address resolution
• Memory introspection requires mapping memory over
• We’re looking at (uni)kernel code
• But there’s still a virtual  (guest) physical resolution
• We can’t use hardware support (PVH), because we’re looking in from outside
• So we need to manually walk page tables
▌Symbol table (to resolve function names)
Thankfully, this is easy again: extract symbols from ELF with nm
24
Stack
Local variables
RegistersIP
FP
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
25
Stack
Local variables
RegistersIP
FP
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
26
Stack
Local variables
RegistersIP
FP
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
three +0xcaIP
27
Stack
Local variables
RegistersIP
FP
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
three +0xcaIP
28
Stack
Local variables
RegistersIP
FP
function two() {
[…]
three();
}
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
three +0xcaIP
29
Stack
Local variables
RegistersIP
FP
function two() {
[…]
three();
}
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
three +0xca
two +0xc1
IP
FP+1word
30
Stack
Local variables
RegistersIP
FP
function one() {
[…]
two();
[…]
}
function two() {
[…]
three();
}
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
three +0xca
two +0xc1
IP
FP+1word
31
Stack
Local variables
RegistersIP
FP
function one() {
[…]
two();
[…]
}
function two() {
[…]
three();
}
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
three +0xca
two +0xc1
one +0x0d
IP
FP+1word
*FP+1word
32
Stack
Local variables
RegistersIP
FP
function one() {
[…]
two();
[…]
}
function two() {
[…]
three();
}
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
three +0xca
two +0xc1
one +0x0d
IP
FP+1word
*FP+1word
33
Stack
Local variables
RegistersIP
FP
function one() {
[…]
two();
[…]
}
function two() {
[…]
three();
}
function three() {
[…]
}
…
… …
Frame pointer
NULL
Return address
Other registers
Local variables
Frame pointer
Return address
Other registers
Local variables
Stack trace:
three +0xca
two +0xc1
one +0x0d
[done]
IP
FP+1word
*FP+1word
**FP==NULL
34
Walking the page tables (x86-64)
CR3
virtual address
35
Walking the page tables (x86-64)
CR3
virtual address
36
Walking the page tables (x86-64)
CR3
virtual address
37
Walking the page tables (x86-64)
CR3
L1
virtual address
38
Walking the page tables (x86-64)
CR3
L1
virtual address
39
Walking the page tables (x86-64)
CR3
L1 L2
virtual address
40
Walking the page tables (x86-64)
CR3
L1 L2 L3
virtual address
41
Walking the page tables (x86-64)
CR3
L1 L2 L3 L4
virtual address
42
Walking the page tables (x86-64)
CR3
L1 L2 L3 L4
(guest) physical address
virtual address
43
Walking the page tables (x86-64)
CR3
L1 L2 L3 L4
(guest) physical address
▌So many maps:
5 per entry * stack depth
map! map! map! map!
map!
virtual address
44
Walking the page tables (x86-64)
CR3
L1 L2 L3 L4
(guest) physical address
▌So many maps:
5 per entry * stack depth
▌Then again, page table locations don’t change…
Neither do stack locations (exception: lots of thread spawning)
Effective caching
map! map! map! map!
map!
virtual address
45
Walking the page tables (x86-64)
CR3
L1 L2 L3 L4
(guest) physical address
▌So many maps:
5 per entry * stack depth
▌Then again, page table locations don’t change…
Neither do stack locations (exception: lots of thread spawning)
Effective caching
virtual address
map! map! map! map!
map!
46
Create Symbol Table
▌Stack only contains addresses
▌Symbol resolution necessary
47
Create Symbol Table
▌Stack only contains addresses
▌Symbol resolution necessary
▌Trivial
Virtual addresses mapped 1:1 into unikernel address space
nm is your friend $ nm -n <ELF> > symtab
$ head symtab
0000000000000000 T _start
0000000000000000 T _text
0000000000000008 a RSP_OFFSET
0000000000000017 t stack_start
00000000000000fc a KERNEL_CS_MASK
0000000000001000 t shared_info
0000000000002000 t hypercall_page
0000000000003000 t error_entry
000000000000304f t error_call_handler
0000000000003069 t hypervisor_callback
48
Create Symbol Table
▌Stack only contains addresses
▌Symbol resolution necessary
▌Trivial
Virtual addresses mapped 1:1 into unikernel address space
nm is your friend
▌Needs unstripped binary
You’re welcome to strip it afterwards
$ nm -n <ELF> > symtab
$ head symtab
0000000000000000 T _start
0000000000000000 T _text
0000000000000008 a RSP_OFFSET
0000000000000017 t stack_start
00000000000000fc a KERNEL_CS_MASK
0000000000001000 t shared_info
0000000000002000 t hypercall_page
0000000000003000 t error_entry
000000000000304f t error_call_handler
0000000000003069 t hypervisor_callback
49
What do we get?
50
What do we get? Flamegraphs!
▌ Y Axis: call trace
 Bottom: main function, each layer: one call depth
▌ X Axis: relative run time
 Call paths are aggregated, no same call path twice in graph
https://github.com/brendangregg/flamegraph
51
What do we get? Flamegraphs!
▌ Y Axis: call trace
 Bottom: main function, each layer: one call depth
▌ X Axis: relative run time
 Call paths are aggregated, no same call path twice in graph
▌ In this example: netfront functions “heavy hitters”
 netfront_xmit_pbuf
 netfront_rx
 but also blkfront_aio_poll
1
3 2
1
3 2
2
1
3
https://github.com/brendangregg/flamegraph
52
What do we get? Flamegraphs!
▌ Y Axis: call trace
 Bottom: main function, each layer: one call depth
▌ X Axis: relative run time
 Call paths are aggregated, no same call path twice in graph
▌ In this example: netfront functions “heavy hitters”
 netfront_xmit_pbuf
 netfront_rx
 but also blkfront_aio_poll
1
3 2
1
3 2
2
1
3
Yep, it’s a MiniOS*
doing network
communication
*with lwip for TCP/IP
https://github.com/brendangregg/flamegraph
53
Try 1: improving xenctx performance
▌xenctx translates and maps memory addresses every stack walk
 Huge overhead
 Solution: cache mapped memory and virtualmachine translations
54
Try 1: improving xenctx performance
▌xenctx translates and maps memory addresses every stack walk
 Huge overhead
 Solution: cache mapped memory and virtualmachine translations
▌xenctx resolves symbols via linear search
 Solution: use binary search
55
Try 1: improving xenctx performance
▌xenctx translates and maps memory addresses every stack walk
 Huge overhead
 Solution: cache mapped memory and virtualmachine translations
▌xenctx resolves symbols via linear search
 Solution: use binary search
 (Or, even better, do resolutions offline after tracing)
56
Try 1: improving xenctx performance
▌xenctx translates and maps memory addresses every stack walk
 Huge overhead
 Solution: cache mapped memory and virtualmachine translations
▌xenctx resolves symbols via linear search
 Solution: use binary search
 (Or, even better, do resolutions offline after tracing)
57
Try 1: improving xenctx performance
▌xenctx translates and maps memory addresses every stack walk
 Huge overhead
 Solution: cache mapped memory and virtualmachine translations
▌xenctx resolves symbols via linear search
 Solution: use binary search
 (Or, even better, do resolutions offline after tracing)
At this point, I abandoned xenctx and (re)wrote uniprof from scratch.
58
Try 2: uniprof
▌100-fold improvement is nice! But we can do better:
Xen 4.7 introduced low-level libraries (libxencall, libxenforeigmemory)
Another significant reduction by ~ factor of 3
59
Try 2: uniprof
▌100-fold improvement is nice! But we can do better:
Xen 4.7 introduced low-level libraries (libxencall, libxenforeigmemory)
Another significant reduction by ~ factor of 3
▌End result: overhead of ~0.1% @101 samples/s
60
Performance on ARM
▌uniprof supports ARM (xenctx doesn’t)
Main challenge: different page table design
61
Performance on ARM
▌uniprof supports ARM (xenctx doesn’t)
Main challenge: different page table design
▌ARM: much slower, overhead higher
But the CPU is much slower, too (Intel Xeon @3.7GHz vs. Cortex A20 @1GHz)
So fewer samples/s needed for same effective resolution
62
No Frame Pointer? No Problem!
▌Stack walking relies on frame pointer
Optimizations can reuse FP as general-purpose register (-fomit-frame-pointer)
63
No Frame Pointer? No Problem!
▌Stack walking relies on frame pointer
Optimizations can reuse FP as general-purpose register (-fomit-frame-pointer)
▌But we can do without FPs
Use stack unwinding information
• It’s already included if you use C++ (for exception handling)
• It doesn’t change performance
• Only binary size
DWARF standard
$ readelf –S <ELF>
There are 13 section headers, starting at offset 0x40d58:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[ 4] .eh_frame PROGBITS 0000000000035860 00036860
00000000000066f8 0000000000000000 A 0 0 8
[ 5] .eh_frame_hdr PROGBITS 000000000003bf58 0003cf58
000000000000128c 0000000000000000 A 0 0 4
[...]
64
Unwinding without Frame Pointers
▌How does it work?
65
Unwinding without Frame Pointers
▌How does it work?
66
Unwinding without Frame Pointers
▌How does it work?
▌Lookup table
For every program address
• The current frame size
• Locations of registers to restore (GP and IP)
67
Unwinding without Frame Pointers
▌How does it work?
▌Lookup table
For every program address
• The current frame size
• Locations of registers to restore (GP and IP)
Important for exception handling
• Exit functions immediately until handler is found
68
Unwinding without Frame Pointers
▌How does it work?
▌Lookup table
For every program address
• The current frame size
• Locations of registers to restore (GP and IP)
Important for exception handling
• Exit functions immediately until handler is found
▌Index to quickly find table entry
69
Unwinding without Frame Pointers
▌How does it work?
▌Lookup table
For every program address
• The current frame size
• Locations of registers to restore (GP and IP)
Important for exception handling
• Exit functions immediately until handler is found
▌Index to quickly find table entry
▌Several library implementations
uniprof uses libunwind
Actually, a libunwind patched for Xen guest introspection support
Might be useful for other tools?
70
Performance: uniprof w/ libunwind
▌Performance lower than with frame pointer
Reason: libunwind does more than we need (full register reconstruction etc.)
▌Different library or own implementation promising
But “good enough” for many cases
And a good area for future work
71
Enter the title.
Thank you!
Questions?
uniprof: https://github.com/cnplab/uniprof
libunwind-xen: https://github.com/cnplab/libunwind
FlameGraphs: https://github.com/brendangregg/flamegraph

XPDDS17: uniprof: Transparent Unikernel Performance Profiling and Debugging - Florian Schmidt, NEC

  • 1.
    uniprof: Transparent Unikernel PerformanceProfiling & Debugging Florian Schmidt, Research Scientist, NEC Europe Ltd.
  • 2.
  • 3.
    3 Unikernels? ▌Faster, smaller, better! ▌Butever heard this? Unikernels are hard to debug. Kernel debugging is horrible! cliparts:clipproject.info
  • 4.
    4 Unikernels? ▌Faster, smaller, better! ▌Butever heard this? ▌Then you might say Unikernels are hard to debug. Kernel debugging is horrible! But that’s not really true! Unikernels are a single linked binary. They have a shared address space. You can just use gdb! cliparts:clipproject.info
  • 5.
    5 Unikernels? ▌Faster, smaller, better! ▌Butever heard this? ▌Then you might say ▌And while that is true… ▌… we are admittedly lacking tools Unikernels are hard to debug. Kernel debugging is horrible! But that’s not really true! Unikernels are a single linked binary. They have a shared address space. You can just use gdb! cliparts:clipproject.info
  • 6.
    6 Unikernels? ▌Faster, smaller, better! ▌Butever heard this? ▌Then you might say ▌And while that is true… ▌… we are admittedly lacking tools ▌Such as effective profilers Unikernels are hard to debug. Kernel debugging is horrible! But that’s not really true! Unikernels are a single linked binary. They have a shared address space. You can just use gdb! cliparts:clipproject.info
  • 7.
    7 Enter uniprof ▌Goals: Performance profiler Nochanges to profiled code necessary Minimal overhead
  • 8.
    8 Enter uniprof ▌Goals: Performance profiler Nochanges to profiled code necessary Minimal overhead  Useful in production environments
  • 9.
    9 Enter uniprof ▌Goals: Performance profiler Nochanges to profiled code necessary Minimal overhead  Useful in production environments ▌So, a stack profiler Collect stack traces at regular intervals call_main+0x278 main+0x1c schedule+0x3a monotonic_clock+0x1a
  • 10.
    10 Enter uniprof ▌Goals: Performance profiler Nochanges to profiled code necessary Minimal overhead  Useful in production environments ▌So, a stack profiler Collect stack traces at regular intervals Many of them call_main+0x278 main+0x1c schedule+0x3a monotonic_clock+0x1a call_main+0x278 main+0x1c netfront_rx+0xa call_main+0x278 main+0x1c netfront_rx+0xa netfront_get_responses+0x1c netfrontif_rx_handler+0x20 netfrontif_transmit+0x1a0 netfront_xmit_pbuf+0xa4 call_main+0x278 main+0x1c blkfront_aio_poll+0x32
  • 11.
    11 Enter uniprof ▌Goals: Performance profiler Nochanges to profiled code necessary Minimal overhead  Useful in production environments ▌So, a stack profiler Collect stack traces at regular intervals Many of them Analyze which code paths show up often • Either because they take a long time • Or because they are hit often Point towards potential bottlenecks call_main+0x278 main+0x1c schedule+0x3a monotonic_clock+0x1a call_main+0x278 main+0x1c netfront_rx+0xa call_main+0x278 main+0x1c netfront_rx+0xa netfront_get_responses+0x1c netfrontif_rx_handler+0x20 netfrontif_transmit+0x1a0 netfront_xmit_pbuf+0xa4 call_main+0x278 main+0x1c blkfront_aio_poll+0x32
  • 12.
    12 xenctx ▌Turns out, astack profiler for Xen already exists Well, kinda
  • 13.
    13 xenctx ▌Turns out, astack profiler for Xen already exists Well, kinda ▌xenctx is bundled with Xen Introspection tool Option to print call stack $ xenctx -f -s <symbol table file> <DOMID> [...] Call Trace: [<0000000000004868>] three+0x58 <-- 00000000000ffea0: [<00000000000044f2>] two+0x52 00000000000ffef0: [<00000000000046a6>] one+0x12 00000000000fff40: [<000000000002ff66>] 00000000000fff80: [<0000000000012018>] call_main+0x278
  • 14.
    14 xenctx ▌Turns out, astack profiler for Xen already exists Well, kinda ▌xenctx is bundled with Xen Introspection tool Option to print call stack ▌So if we run this over and over, we have a stack profiler Well, kinda $ xenctx -f -s <symbol table file> <DOMID> [...] Call Trace: [<0000000000004868>] three+0x58 <-- 00000000000ffea0: [<00000000000044f2>] two+0x52 00000000000ffef0: [<00000000000046a6>] one+0x12 00000000000fff40: [<000000000002ff66>] 00000000000fff80: [<0000000000012018>] call_main+0x278
  • 15.
    15 xenctx ▌Downside: xenctx isslow Very slow: 3ms+ per trace Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s) Can’t really blame it, not designed as a fast stack profiler
  • 16.
    16 xenctx ▌Downside: xenctx isslow Very slow: 3ms+ per trace Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s) Can’t really blame it, not designed as a fast stack profiler ▌Performance isn’t just a nice-to-have We interrupt the guest all the time Can’t walk stack while guest is running: race conditions High overhead can influence results! Low overhead is imperative for use on production unikernels
  • 17.
    17 xenctx ▌Downside: xenctx isslow Very slow: 3ms+ per trace Doesn’t sound like much, but really adds up (e.g., 100 samples/s = 300ms/s) Can’t really blame it, not designed as a fast stack profiler ▌Performance isn’t just a nice-to-have We interrupt the guest all the time Can’t walk stack while guest is running: race conditions High overhead can influence results! Low overhead is imperative for use on production unikernels ▌First question: extend xenctx or write something from scratch? Spoiler: look at the talk title More insight when I come to the evaluation
  • 18.
  • 19.
    19 What do weneed? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall
  • 20.
    20 What do weneed? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall ▌Access to stack memory (to read return addresses and next FPs) This is the complicated step We need to do address resolution
  • 21.
    21 What do weneed? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall ▌Access to stack memory (to read return addresses and next FPs) This is the complicated step We need to do address resolution • Memory introspection requires mapping memory over • We’re looking at (uni)kernel code • But there’s still a virtual  (guest) physical resolution
  • 22.
    22 What do weneed? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall ▌Access to stack memory (to read return addresses and next FPs) This is the complicated step We need to do address resolution • Memory introspection requires mapping memory over • We’re looking at (uni)kernel code • But there’s still a virtual  (guest) physical resolution • We can’t use hardware support (PVH), because we’re looking in from outside • So we need to manually walk page tables
  • 23.
    23 What do weneed? ▌Registers (for FP, IP) This is pretty easy: getvcpucontext() hypercall ▌Access to stack memory (to read return addresses and next FPs) This is the complicated step We need to do address resolution • Memory introspection requires mapping memory over • We’re looking at (uni)kernel code • But there’s still a virtual  (guest) physical resolution • We can’t use hardware support (PVH), because we’re looking in from outside • So we need to manually walk page tables ▌Symbol table (to resolve function names) Thankfully, this is easy again: extract symbols from ELF with nm
  • 24.
    24 Stack Local variables RegistersIP FP … … … Framepointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace:
  • 25.
    25 Stack Local variables RegistersIP FP function three(){ […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace:
  • 26.
    26 Stack Local variables RegistersIP FP function three(){ […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xcaIP
  • 27.
    27 Stack Local variables RegistersIP FP function three(){ […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xcaIP
  • 28.
    28 Stack Local variables RegistersIP FP function two(){ […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xcaIP
  • 29.
    29 Stack Local variables RegistersIP FP function two(){ […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 IP FP+1word
  • 30.
    30 Stack Local variables RegistersIP FP function one(){ […] two(); […] } function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 IP FP+1word
  • 31.
    31 Stack Local variables RegistersIP FP function one(){ […] two(); […] } function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 one +0x0d IP FP+1word *FP+1word
  • 32.
    32 Stack Local variables RegistersIP FP function one(){ […] two(); […] } function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 one +0x0d IP FP+1word *FP+1word
  • 33.
    33 Stack Local variables RegistersIP FP function one(){ […] two(); […] } function two() { […] three(); } function three() { […] } … … … Frame pointer NULL Return address Other registers Local variables Frame pointer Return address Other registers Local variables Stack trace: three +0xca two +0xc1 one +0x0d [done] IP FP+1word *FP+1word **FP==NULL
  • 34.
    34 Walking the pagetables (x86-64) CR3 virtual address
  • 35.
    35 Walking the pagetables (x86-64) CR3 virtual address
  • 36.
    36 Walking the pagetables (x86-64) CR3 virtual address
  • 37.
    37 Walking the pagetables (x86-64) CR3 L1 virtual address
  • 38.
    38 Walking the pagetables (x86-64) CR3 L1 virtual address
  • 39.
    39 Walking the pagetables (x86-64) CR3 L1 L2 virtual address
  • 40.
    40 Walking the pagetables (x86-64) CR3 L1 L2 L3 virtual address
  • 41.
    41 Walking the pagetables (x86-64) CR3 L1 L2 L3 L4 virtual address
  • 42.
    42 Walking the pagetables (x86-64) CR3 L1 L2 L3 L4 (guest) physical address virtual address
  • 43.
    43 Walking the pagetables (x86-64) CR3 L1 L2 L3 L4 (guest) physical address ▌So many maps: 5 per entry * stack depth map! map! map! map! map! virtual address
  • 44.
    44 Walking the pagetables (x86-64) CR3 L1 L2 L3 L4 (guest) physical address ▌So many maps: 5 per entry * stack depth ▌Then again, page table locations don’t change… Neither do stack locations (exception: lots of thread spawning) Effective caching map! map! map! map! map! virtual address
  • 45.
    45 Walking the pagetables (x86-64) CR3 L1 L2 L3 L4 (guest) physical address ▌So many maps: 5 per entry * stack depth ▌Then again, page table locations don’t change… Neither do stack locations (exception: lots of thread spawning) Effective caching virtual address map! map! map! map! map!
  • 46.
    46 Create Symbol Table ▌Stackonly contains addresses ▌Symbol resolution necessary
  • 47.
    47 Create Symbol Table ▌Stackonly contains addresses ▌Symbol resolution necessary ▌Trivial Virtual addresses mapped 1:1 into unikernel address space nm is your friend $ nm -n <ELF> > symtab $ head symtab 0000000000000000 T _start 0000000000000000 T _text 0000000000000008 a RSP_OFFSET 0000000000000017 t stack_start 00000000000000fc a KERNEL_CS_MASK 0000000000001000 t shared_info 0000000000002000 t hypercall_page 0000000000003000 t error_entry 000000000000304f t error_call_handler 0000000000003069 t hypervisor_callback
  • 48.
    48 Create Symbol Table ▌Stackonly contains addresses ▌Symbol resolution necessary ▌Trivial Virtual addresses mapped 1:1 into unikernel address space nm is your friend ▌Needs unstripped binary You’re welcome to strip it afterwards $ nm -n <ELF> > symtab $ head symtab 0000000000000000 T _start 0000000000000000 T _text 0000000000000008 a RSP_OFFSET 0000000000000017 t stack_start 00000000000000fc a KERNEL_CS_MASK 0000000000001000 t shared_info 0000000000002000 t hypercall_page 0000000000003000 t error_entry 000000000000304f t error_call_handler 0000000000003069 t hypervisor_callback
  • 49.
  • 50.
    50 What do weget? Flamegraphs! ▌ Y Axis: call trace  Bottom: main function, each layer: one call depth ▌ X Axis: relative run time  Call paths are aggregated, no same call path twice in graph https://github.com/brendangregg/flamegraph
  • 51.
    51 What do weget? Flamegraphs! ▌ Y Axis: call trace  Bottom: main function, each layer: one call depth ▌ X Axis: relative run time  Call paths are aggregated, no same call path twice in graph ▌ In this example: netfront functions “heavy hitters”  netfront_xmit_pbuf  netfront_rx  but also blkfront_aio_poll 1 3 2 1 3 2 2 1 3 https://github.com/brendangregg/flamegraph
  • 52.
    52 What do weget? Flamegraphs! ▌ Y Axis: call trace  Bottom: main function, each layer: one call depth ▌ X Axis: relative run time  Call paths are aggregated, no same call path twice in graph ▌ In this example: netfront functions “heavy hitters”  netfront_xmit_pbuf  netfront_rx  but also blkfront_aio_poll 1 3 2 1 3 2 2 1 3 Yep, it’s a MiniOS* doing network communication *with lwip for TCP/IP https://github.com/brendangregg/flamegraph
  • 53.
    53 Try 1: improvingxenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations
  • 54.
    54 Try 1: improvingxenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations ▌xenctx resolves symbols via linear search  Solution: use binary search
  • 55.
    55 Try 1: improvingxenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations ▌xenctx resolves symbols via linear search  Solution: use binary search  (Or, even better, do resolutions offline after tracing)
  • 56.
    56 Try 1: improvingxenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations ▌xenctx resolves symbols via linear search  Solution: use binary search  (Or, even better, do resolutions offline after tracing)
  • 57.
    57 Try 1: improvingxenctx performance ▌xenctx translates and maps memory addresses every stack walk  Huge overhead  Solution: cache mapped memory and virtualmachine translations ▌xenctx resolves symbols via linear search  Solution: use binary search  (Or, even better, do resolutions offline after tracing) At this point, I abandoned xenctx and (re)wrote uniprof from scratch.
  • 58.
    58 Try 2: uniprof ▌100-foldimprovement is nice! But we can do better: Xen 4.7 introduced low-level libraries (libxencall, libxenforeigmemory) Another significant reduction by ~ factor of 3
  • 59.
    59 Try 2: uniprof ▌100-foldimprovement is nice! But we can do better: Xen 4.7 introduced low-level libraries (libxencall, libxenforeigmemory) Another significant reduction by ~ factor of 3 ▌End result: overhead of ~0.1% @101 samples/s
  • 60.
    60 Performance on ARM ▌uniprofsupports ARM (xenctx doesn’t) Main challenge: different page table design
  • 61.
    61 Performance on ARM ▌uniprofsupports ARM (xenctx doesn’t) Main challenge: different page table design ▌ARM: much slower, overhead higher But the CPU is much slower, too (Intel Xeon @3.7GHz vs. Cortex A20 @1GHz) So fewer samples/s needed for same effective resolution
  • 62.
    62 No Frame Pointer?No Problem! ▌Stack walking relies on frame pointer Optimizations can reuse FP as general-purpose register (-fomit-frame-pointer)
  • 63.
    63 No Frame Pointer?No Problem! ▌Stack walking relies on frame pointer Optimizations can reuse FP as general-purpose register (-fomit-frame-pointer) ▌But we can do without FPs Use stack unwinding information • It’s already included if you use C++ (for exception handling) • It doesn’t change performance • Only binary size DWARF standard $ readelf –S <ELF> There are 13 section headers, starting at offset 0x40d58: Section Headers: [Nr] Name Type Address Offset Size EntSize Flags Link Info Align [...] [ 4] .eh_frame PROGBITS 0000000000035860 00036860 00000000000066f8 0000000000000000 A 0 0 8 [ 5] .eh_frame_hdr PROGBITS 000000000003bf58 0003cf58 000000000000128c 0000000000000000 A 0 0 4 [...]
  • 64.
    64 Unwinding without FramePointers ▌How does it work?
  • 65.
    65 Unwinding without FramePointers ▌How does it work?
  • 66.
    66 Unwinding without FramePointers ▌How does it work? ▌Lookup table For every program address • The current frame size • Locations of registers to restore (GP and IP)
  • 67.
    67 Unwinding without FramePointers ▌How does it work? ▌Lookup table For every program address • The current frame size • Locations of registers to restore (GP and IP) Important for exception handling • Exit functions immediately until handler is found
  • 68.
    68 Unwinding without FramePointers ▌How does it work? ▌Lookup table For every program address • The current frame size • Locations of registers to restore (GP and IP) Important for exception handling • Exit functions immediately until handler is found ▌Index to quickly find table entry
  • 69.
    69 Unwinding without FramePointers ▌How does it work? ▌Lookup table For every program address • The current frame size • Locations of registers to restore (GP and IP) Important for exception handling • Exit functions immediately until handler is found ▌Index to quickly find table entry ▌Several library implementations uniprof uses libunwind Actually, a libunwind patched for Xen guest introspection support Might be useful for other tools?
  • 70.
    70 Performance: uniprof w/libunwind ▌Performance lower than with frame pointer Reason: libunwind does more than we need (full register reconstruction etc.) ▌Different library or own implementation promising But “good enough” for many cases And a good area for future work
  • 71.
    71 Enter the title. Thankyou! Questions? uniprof: https://github.com/cnplab/uniprof libunwind-xen: https://github.com/cnplab/libunwind FlameGraphs: https://github.com/brendangregg/flamegraph