Performance Profiling of Virtual Machines
Jiaqing Du+, Nipun Sehrawat*, Willy Zwaenepoel+
+EPFL, Switzerland
*University of Illinois at Urbana-Champaign
Performance Profiling
•    Use CPU performance counters
•    Monitor software runtime behavior
•    Incur very low overhead
•    Used extensively: OProfile, VTune, …

                    %CYCLE       Function            Module
                    98.5529      vmx_vcpu_run        kvm-intel.ko
                    0.2226       (no symbols)        libc.so
                    0.1034       hpet_cpuhp_notify   vmlinux
                    0.1034       native_patch        vmlinux


Jiaqing Du, VEE, March 9, 2011                                      2
Terminology

• (1) Native profiling: the profiler runs in the OS, directly on the CPU and its PMU
• (2) Guest-wide profiling: the profiler runs in the guest, above the VMM
• (3) System-wide profiling: profilers run in both the guest and the VMM




Profiling with Virtual Machines

                                 Para-            Hardware     Binary
                                 virtualization   assistance   translation
      Guest-wide profiling             ?                ?            ?
      System-wide profiling        XenOprof             ?            ?


         Profilers do not work well with virtual machines.



Contributions

                                 (1) Give solutions

                                 Para-            Hardware     Binary
                                 virtualization   assistance   translation
      Guest-wide profiling             ?                ?            ?
      System-wide profiling        XenOprof             ?            ?



                                              (2) Implement prototypes
Outline
•    Native profiling
•    Guest-wide profiling
•    System-wide profiling
•    Evaluation




Native Profiling
• Performance monitoring unit (PMU)
       – consists of a set of event counters
       – generates an interrupt when a counter overflows
• PMU-based profiler
       – user level: Control and Interpret
       – kernel level: Configure and Collect, on top of the CPU and PMU
       – each sample records the previous PC value and the process identifier
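The overflow-driven sampling described above can be sketched in C as a small software model. This is not the deck's implementation: `struct profiler`, `pmu_count`, and the other names are hypothetical, and the "PMU" here is a plain counter advanced by the caller rather than real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 1024

/* One sample: the interrupted PC and the process that was running. */
struct sample {
    uintptr_t pc;
    int pid;
};

/* Hypothetical software model of a profiler driving one PMU counter. */
struct profiler {
    uint64_t count;                    /* events since the last overflow */
    uint64_t period;                   /* events per sample */
    struct sample buf[MAX_SAMPLES];    /* collected samples */
    size_t n;                          /* number of valid entries in buf */
};

/* Configure: set the sampling period and clear all state. */
static void profiler_configure(struct profiler *p, uint64_t period)
{
    assert(period > 0);
    p->count = 0;
    p->period = period;
    p->n = 0;
}

/* Collect: called on counter overflow, like an interrupt handler. */
static void profiler_on_overflow(struct profiler *p, uintptr_t pc, int pid)
{
    if (p->n < MAX_SAMPLES) {          /* silently drop samples when full */
        p->buf[p->n].pc = pc;
        p->buf[p->n].pid = pid;
        p->n++;
    }
}

/* Model `events` hardware events occurring while `pid` executes at `pc`. */
static void pmu_count(struct profiler *p, uint64_t events, uintptr_t pc, int pid)
{
    p->count += events;
    while (p->count >= p->period) {    /* one overflow per full period */
        p->count -= p->period;
        profiler_on_overflow(p, pc, pid);
    }
}

/* Interpret: count the samples whose PC falls in [lo, hi). */
static size_t profiler_samples_in(const struct profiler *p, uintptr_t lo, uintptr_t hi)
{
    size_t hits = 0;
    for (size_t i = 0; i < p->n; i++)
        if (p->buf[i].pc >= lo && p->buf[i].pc < hi)
            hits++;
    return hits;
}
```

With a period of 100, feeding 250 events at one PC and then 50 at another yields three samples, two of them at the first PC. A real profiler would map each PC to a function through the symbol table of the sampled process.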

Guest-wide Profiling
• Profiler runs in the guest and only profiles the guest
• The guest profiler (Control, Interpret, Configure, Collect) runs above the VMM, which runs on the CPU and PMU
• Injected interrupts should be handled right after the guest resumes execution



              Challenge: synchronous interrupt delivery to the guest
System-wide Profiling (1/3)
• Reveal runtime behavior of both VMM and guest(s)

• The profiler (Control, Interpret, Configure, Collect) runs in the VMM, below Guest1 and Guest2
• The VMM profiler does not know the internals of a guest



               Challenge: interpret samples belonging to the guest
System-wide Profiling (2/3)
• Interpret guest samples: full delegation

[Figure: the guest and the VMM each contain a full profiler (Control, Interpret, Configure, Collect), running on the CPU and PMU.]



System-wide Profiling (3/3)
• Interpret guest samples: interpretation delegation

[Figure: the guest and the VMM each contain a profiler (Control, Interpret, Configure, Collect), running on the CPU and PMU and linked by a shared buffer.]
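One plausible reading of interpretation delegation is that the VMM-side profiler collects raw samples and hands the guest's samples to the guest through the shared buffer, so that the guest can interpret them. Under that assumption, the buffer can be sketched in C as a single-producer, single-consumer ring. All names here (`struct sample_ring`, `ring_put`, `ring_get`) are hypothetical, and the sketch is single-threaded: a real shared buffer would also need memory barriers and a notification mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u   /* must be a power of two */

/* Raw sample as recorded by the VMM; the guest attaches symbols later. */
struct raw_sample {
    uint64_t guest_pc;
    uint32_t vcpu;
};

/* Hypothetical buffer shared between the VMM (writer) and guest (reader). */
struct sample_ring {
    struct raw_sample slot[RING_SLOTS];
    uint32_t head;   /* next slot to write; advanced only by the VMM */
    uint32_t tail;   /* next slot to read; advanced only by the guest */
};

static void ring_init(struct sample_ring *r)
{
    r->head = 0;
    r->tail = 0;
}

/* VMM side: store one sample; returns false (sample dropped) when full. */
static bool ring_put(struct sample_ring *r, struct raw_sample s)
{
    assert(r != 0);
    if (r->head - r->tail == RING_SLOTS)
        return false;
    r->slot[r->head % RING_SLOTS] = s;
    r->head++;
    return true;
}

/* Guest side: fetch the oldest sample; returns false when empty. */
static bool ring_get(struct sample_ring *r, struct raw_sample *out)
{
    assert(r != 0 && out != 0);
    if (r->head == r->tail)
        return false;
    *out = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return true;
}
```

Dropping samples when the ring is full keeps the writer from ever blocking, at the cost of losing data if the reader falls behind. That trade-off is a design choice of this sketch, not something the slides specify.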



PMU Multiplexing
• When to save & restore performance counters?
• CPU switch
       – only in-guest execution is accounted to the guest
       – timeline: guest1 | VMM (I/O for guest1) | guest2 | VMM (I/O for guest2) | guest2
       – accounting: the guest1 segment to guest 1, each guest2 segment to guest 2
• Domain switch
       – in-VMM execution is also accounted to the guest
       – timeline: guest1 | VMM (I/O for guest1) | guest2 | VMM (I/O for guest2) | guest2
       – accounting: guest1 and its VMM I/O to guest 1, the guest2 segments and their VMM I/O to guest 2
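The difference between the two accounting policies can be sketched in C by replaying a timeline of execution segments. The types, the `account` function, and any event counts fed to it are illustrative assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_GUESTS 2

enum mode   { IN_GUEST, IN_VMM };          /* where the CPU is executing */
enum policy { CPU_SWITCH, DOMAIN_SWITCH }; /* when counters are switched */

/* One stretch of execution: guest code, or VMM code on a guest's behalf. */
struct segment {
    enum mode mode;
    int guest;         /* guest running, or guest the VMM is serving */
    uint64_t events;   /* PMU events counted during the segment */
};

/*
 * Sum the events charged to each guest under the given policy.
 * CPU switch: only in-guest segments are charged to the guest.
 * Domain switch: VMM segments serving a guest are charged to it as well.
 */
static void account(const struct segment *tl, size_t n, enum policy pol,
                    uint64_t per_guest[NUM_GUESTS])
{
    for (int g = 0; g < NUM_GUESTS; g++)
        per_guest[g] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(tl[i].guest >= 0 && tl[i].guest < NUM_GUESTS);
        if (tl[i].mode == IN_GUEST || pol == DOMAIN_SWITCH)
            per_guest[tl[i].guest] += tl[i].events;
    }
}
```

Replaying the slide's timeline (guest1, VMM I/O for guest1, guest2, VMM I/O for guest2, guest2) with made-up event counts shows that a domain switch always charges each guest at least as much as a CPU switch does, and the gap is exactly the in-VMM work done on that guest's behalf.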

Implementation


                                 Para-
                                 virtualization   KVM   QEMU
      Guest-wide profiling             ?           √      ?
      System-wide profiling        XenOprof        √      √




Evaluation question #1

How much does profiling slow down programs?




Profiling Overhead
• Measure execution time
       – a computation-intensive program
       – with and without profiling
       – about 400 counter overflows per second

                     Profiling environment   Increased execution time
                     Native Linux                0.04% ± 0.004%
                     KVM guest-wide              0.39% ± 0.045%
                     KVM system-wide             0.44% ± 0.043%
                     QEMU system-wide            0.94% ± 0.044%
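As a rough reading of the table above, the overhead can be converted into an approximate cost per counter overflow. The sketch below assumes that all of the extra execution time is spent handling overflows and that the rate is exactly 400 overflows per second. Both are simplifications, and the function name is hypothetical.

```c
#include <assert.h>

/*
 * Estimated cost of one counter overflow, in microseconds.
 * overhead_percent: increase in execution time relative to the unprofiled run.
 * overflows_per_sec: counter overflows per second of execution.
 */
static double per_overflow_cost_us(double overhead_percent, double overflows_per_sec)
{
    assert(overflows_per_sec > 0.0);
    /* Extra seconds spent per second of unprofiled execution. */
    double extra_sec_per_sec = overhead_percent / 100.0;
    return extra_sec_per_sec / overflows_per_sec * 1e6;
}
```

Under these assumptions, `per_overflow_cost_us(0.04, 400.0)` gives roughly 1 microsecond per overflow for native Linux, and `per_overflow_cost_us(0.94, 400.0)` gives roughly 23.5 microseconds for QEMU system-wide profiling.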




Evaluation question #2

                       Are profiling results accurate?




Profiling Accuracy (1/4)
• A computation-intensive benchmark
• compute_{a|b}() does floating point arithmetic
• Monitor CPU cycles

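                           /* Each routine performs floating-point arithmetic. */
                           void compute_a(void);
                           void compute_b(void);

                           int main(int argc, char *argv[])
                           {
                               /* Alternate between the two routines forever. */
                               while (1) {
                                   compute_a();
                                   compute_b();
                               }
                           }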




Profiling Accuracy (2/4)
• Comparison with native profiling
[Figure: bar chart of Cycle % (0 to 90) by routine name (compute_a, compute_b) for Native, KVM guest-wide, KVM system-wide, and QEMU system-wide profiling.]
Profiling Accuracy (3/4)
• A memory-intensive benchmark
• Randomly access a fixed-size region of memory
• Monitor last level cache misses

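                        #include <stddef.h>   /* NULL */

                        struct item {
                            struct item *next;
                            long pad[NUM_PAD];   /* padding that sets the item size */
                        };

                        /* Head of a list whose items are linked in random order. */
                        extern struct item randomly_connected_items;

                        void chase_pointer(void)
                        {
                            struct item *p = &randomly_connected_items;
                            while (p != NULL)
                                p = p->next;
                        }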
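The slide does not show how the items are connected. One way to build such a list, sketched here in C as an assumption rather than the authors' method, is to shuffle the item indices with a Fisher-Yates shuffle and link the items in shuffled order. `NUM_PAD`'s value, `connect_randomly`, and `chase_and_count` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_PAD 7   /* assumed value; the slide does not give one */

struct item {
    struct item *next;
    long pad[NUM_PAD];
};

/*
 * Link the n items of `items` into one NULL-terminated list that visits
 * them in random order. Returns the list head, or NULL if out of memory.
 */
static struct item *connect_randomly(struct item *items, size_t n, unsigned seed)
{
    assert(n > 0);
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        order[i] = i;

    /* Fisher-Yates shuffle; rand() is adequate for a benchmark layout. */
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }

    /* Chain the items in shuffled order and terminate the list. */
    for (size_t i = 0; i + 1 < n; i++)
        items[order[i]].next = &items[order[i + 1]];
    items[order[n - 1]].next = NULL;

    struct item *head = &items[order[0]];
    free(order);
    return head;
}

/* Walk the list as the benchmark does, returning the number of items visited. */
static size_t chase_and_count(const struct item *p)
{
    size_t hops = 0;
    while (p != NULL) {
        p = p->next;
        hops++;
    }
    return hops;
}
```

Because every item appears exactly once in the shuffled order, a full walk touches the whole working set once, and the random order defeats simple hardware prefetching.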


Profiling Accuracy (4/4)
 • Comparison with native profiling
[Figure: cache misses per memory access (0 to 1.6) versus working set size (256 KB to 3072 KB) for Native, KVM guest-wide, KVM system-wide, and QEMU system-wide profiling.]

Evaluation question #3

                     What is the difference between
                     CPU switch and domain switch?




Recap
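• CPU switch
       – timeline: guest1 | VMM (I/O for guest1) | guest2 | VMM (I/O for guest2) | guest2
       – accounting: the guest1 segment to guest 1, each guest2 segment to guest 2
• Domain switch
       – timeline: guest1 | VMM (I/O for guest1) | guest2 | VMM (I/O for guest2) | guest2
       – accounting: guest1 and its VMM I/O to guest 1, the guest2 segments and their VMM I/O to guest 2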




Profiling Packet Receive (1/2)
• Experiment
       – push packets to a Linux guest in KVM
       – run OProfile in the guest
       – monitor instruction retirements

[Figure: two machines, each with a hardware NIC. One runs a Linux guest on KVM with a virtual NIC; the other runs Linux natively.]




Profiling Packet Receive (2/2)
                 CPU Switch                           Domain Switch
                 INSTR   Function                     INSTR   Function                     Category
                 167     csum_partial                 2261    cp_interrupt                 I/O related
                 106     csum_partial_copy_generic    1336    cp_rx_poll                   I/O related
                 74      copy_to_user                 1034    cp_start_xmit                I/O related
                 47      ipt_do_table                 421     native_apic_mem_write        I/O related
                 38      tcp_v4_rcv                   374     native_apic_mem_read         I/O related
                 …       …                            …       …
                                                      191     csum_partial                 packet processing
                                                      105     csum_partial_copy_generic    packet processing
                                                      94      copy_to_user                 packet processing
                                                      79      ipt_do_table                 packet processing
                                                      51      tcp_v4_rcv                   packet processing
                                                      …       …

                 All CPU Switch rows shown are packet processing functions.


                      Domain switch gives more insight for I/O operations.
Related Work
• XenOprof
       – first profiler targeting virtual machines
       – system-wide profiling for Xen
• Linux perf
       – a profiling infrastructure for Linux
       – limited support for profiling KVM Linux guests
• VMware vmkperf
       – can only read and write CPU performance counters



Conclusions


                                 Para-            Hardware     Binary
                                 virtualization   assistance   translation
      Guest-wide profiling             √                √            √
      System-wide profiling        XenOprof             √            √




