Minimizing I/O Latency in Xen-ARM




       Seehwan Yoo, Chuck Yoo
            and OSVirtual
           Korea University
I/O Latency Issues in Xen-ARM
•    Guaranteed I/O latency is essential to Xen-ARM
     –    For mobile communication,
           •    dropped or missed phone calls and long network-detection delays are unacceptable
           •    separate AP and CP (application/communication processors) are used to guarantee communication
•  Virtualization promises transparent execution
      –  Not only instruction execution (access isolation)
      –  But also performance (performance isolation)

•    In practice, I/O latency is still an obstacle
      –  Lack of scheduler support
            •  Hierarchical scheduling nature,
               i.e. the hypervisor cannot impose task scheduling inside a guest OS
            •  The Credit scheduler is not well suited to time-sensitive applications
      –  Latency due to the split driver model
            •  Enhances reliability, but degrades performance (w.r.t. I/O latency)
            •  Requires inter-domain communication between the IDD (isolated driver domain) and the user domain


•    We investigate these issues, and present possible remedies



Operating Systems Lab.                                                        http://os.korea.ac.kr   2
Related Work
•    Credit schedulers in Xen
      –  Task-aware scheduler (Hwanjoo et al., VEE ’08)
           •  Inspection of task execution at guest OS
           •  Adaptively use boost mechanism by VM workload characteristics
      –  Laxity-based soft real-time scheduler (Min Lee et al., VEE ’10)
           •  Laxity calculation by execution profile
           •  Assign priority based on the remaining time to deadline (laxity)
      –  Dynamic core allocation (Y. Hu et al., HPDC ’10)
           •  Core-allocation by workload characteristics (driver core, fast/slow tick core)
•    Mobile/embedded hypervisors
      –  OKL4 microvisor: microkernel-based hypervisor
           •  Has a verified kernel – from seL4
           •  Presents good performance – commercially successful
           •  Real-time optimizations – slice donation, direct switching, reflective scheduling,
              threaded interrupt handling, etc.
      –  VirtualLogix – VLX: mobile hypervisor for real-time support
           •  Shared driver model – device sharing among an RT guest OS and
              non-RT guest OSs
           •  Good PV performance



Operating Systems Lab.                                                           http://os.korea.ac.kr   3
Background – the Credit Scheduler in Xen

•  Weighted round-robin based fair scheduler
•  Priority – BOOST, UNDER, OVER
     –  Basic principle: preserve fairness among all
        VCPUs
          •  Each vcpu gets credit periodically
          •  Credit is debited as vcpu consumes execution time
     –  Priority assignment (sketched below)
           •  Remaining credit <= 0 → OVER (lowest)
           •  Remaining credit > 0 → UNDER
          •  Event-pending VCPU : BOOST (highest)
     –  BOOST: for providing low response time
          •  Allows immediate preemption of the current vcpu
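To make the rule above concrete, here is a minimal, self-contained C model of the priority assignment described on this slide. It is an illustration, not Xen source; the struct and function names are ours.

```c
#include <stdio.h>

/* Priority classes of the Credit scheduler, highest first. */
enum prio { PRIO_BOOST, PRIO_UNDER, PRIO_OVER };

struct vcpu_model {
    int credit;          /* remaining credit, refilled periodically            */
    int event_pending;   /* woken by an I/O event and waiting to be scheduled  */
};

/* Debit the credit consumed by execution, then classify the vcpu. */
static enum prio classify(struct vcpu_model *v, int consumed)
{
    v->credit -= consumed;                    /* credit is debited as the vcpu runs */
    if (v->event_pending) return PRIO_BOOST;  /* event-pending vcpu -> BOOST        */
    if (v->credit > 0)    return PRIO_UNDER;  /* remaining credit > 0 -> UNDER      */
    return PRIO_OVER;                         /* remaining credit <= 0 -> OVER      */
}

int main(void)
{
    struct vcpu_model v = { .credit = 30, .event_pending = 0 };
    printf("after consuming 40 units: %d (OVER)\n", classify(&v, 40));
    v.event_pending = 1;
    printf("with a pending I/O event: %d (BOOST)\n", classify(&v, 0));
    return 0;
}
```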

Operating Systems Lab.                             http://os.korea.ac.kr   4
Fallacies about BOOST in the Credit Scheduler
•    Fallacy 1) A VCPU is always boosted by an I/O event
      –  In fact, BOOST is sometimes ignored, because a
         VCPU is boosted only when doing so does not break fairness
            •  ‘Not-boosted’ vcpus are observed when the vcpu is already runnable
      –  Example 1)
            •  If a user domain has a CPU job and is waiting for execution,
            •  it is not boosted: it will run soon anyway, and a tentative BOOST could easily break
               the fairness


•    Fallacy 2) BOOST always prioritizes the VCPU
      –  In fact, BOOST is easily negated, because
         multiple vcpus can be boosted simultaneously
            •  ‘Multi-boost’ happens quite often in the split driver model (see the sketch below)
      –  Example 1)
            •  The driver domain has to be boosted, and then
            •  the user domain also needs to be boosted
      –  Example 2)
            •  If the driver domain has multiple packets destined for multiple user domains,
            •  all of the destination user domains are boosted
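A minimal sketch (ours, not Xen source) of the wake-time decision that both fallacies refer to: a vcpu that is already runnable is not boosted, and nothing prevents several vcpus from holding BOOST at once.

```c
/* Illustrative model of the wake path; types and field names are ours. */
enum prio { PRIO_BOOST, PRIO_UNDER, PRIO_OVER };

struct vcpu_model {
    enum prio pri;
    int on_runqueue;     /* already runnable, e.g. it has a pending CPU job */
};

static void wake_on_io_event(struct vcpu_model *v)
{
    /* Fallacy 1: a vcpu that is already runnable is NOT boosted - it will run
     * soon anyway, and a tentative BOOST could break fairness. */
    if (!v->on_runqueue && v->pri == PRIO_UNDER)
        v->pri = PRIO_BOOST;

    /* Fallacy 2: this decision is per vcpu, so the driver domain and every
     * user domain it delivers packets to can all end up in BOOST at the same
     * time ("multi-boost"), and BOOST no longer singles anyone out. */
}
```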


Operating Systems Lab.                                                       http://os.korea.ac.kr     5
Xen-ARM’s Latency Characteristics
•  I/O latency is measured along the interrupt
   path
   –  Preemption latency :
            until the code is preemptible
   –  VCPU dispatch latency :
            until the designated VCPU is scheduled
   –  Intra-VM latency :
            until I/O completion (the decomposition is summarized below)
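Summing the segments shown in the figure below, the end-to-end I/O latency along the split-driver path decomposes as (notation ours):

```latex
L_{\mathrm{I/O}} = L_{\mathrm{preempt}}
                 + L^{\mathrm{Dom0}}_{\mathrm{dispatch}} + L^{\mathrm{Dom0}}_{\mathrm{intra}}
                 + L^{\mathrm{DomU}}_{\mathrm{dispatch}} + L^{\mathrm{DomU}}_{\mathrm{intra}}
```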
   [Figure: interrupt-path timeline – a physical interrupt is handled by the Xen-ARM
    interrupt handler, then the Dom0 vcpu is dispatched and runs its I/O task, then the
    DomU vcpu is dispatched and runs its I/O task; the segments correspond to the
    preemption latency, the Dom0 vcpu dispatch and intra-VM latencies, and the DomU
    vcpu dispatch and intra-VM latencies]



 Operating Systems Lab.                                                     http://os.korea.ac.kr   6
Observed Latency through the Interrupt Path
•  I/O latency is measured along the interrupt path
    –  Ping requests are sent from an external server to dom1
    –  VM settings
          •  Dom0 : the driver domain
          •  Dom1 : 20% CPU load + ping recv.
          •  Dom2 : 100% CPU load (CPU-burning workload)

 –  Xen-to-netback latency
       •  Large worst-case latency
       •  = Dom0 vcpu dispatch latency + intra-dom0 latency
 –  Netback-to-domU latency
       •  Large average latency
       •  = Dom1 vcpu dispatch latency


 Operating Systems Lab.                              http://os.korea.ac.kr   7
Observed VCPU Dispatch Latency : not-boosted vcpu
•    Experimental setting
       –    Dom1: varying CPU workload
       –    Dom2: CPU-burning workload
       –    An external host sends ping to Dom1
•    Not-boosted vcpus affect the I/O latency distribution
       –    Dom1 CPU workload 20% : almost 90% of ping requests are handled within 1 ms
       –    Dom1 CPU workload 40% : 75% of ping requests are handled within 1 ms
       –    Dom1 CPU workload 60% : 65% of ping requests are handled within 1 ms
       –    Dom1 CPU workload 80% : only 60% of ping requests are handled within 1 ms

   [Figure: latency distribution by CPU load – cumulative latency distribution (%) vs.
    latency (ms, 0–14) of ping requests to Dom1, one curve per Dom1 CPU load
    (20%, 40%, 60%, 80%)]


•    When Dom1 has more CPU load
     → larger I/O latency (Dom1 disturbs its own I/O because its vcpu is not boosted)

     Operating Systems Lab.                                                                                     http://os.korea.ac.kr              8
Observed Multi-boost
•  At the hypervisor, we counted the number of “schedule out” events of the
   driver domain, classified by vcpu state and priority:

      vcpu state   priority   sched. out count
      Blocked      BOOST                  275
      Blocked      UNDER                   13
      Unblocked    BOOST              664,190
      Unblocked    UNDER                   49

•  Multi-boost is specifically the case when
      –  the current VCPU is in the BOOST state and is unblocked (still runnable)
         when it is scheduled out
      –  implying that another BOOST vcpu exists and has preempted the current vcpu

•  Multi-boosts occur in large numbers (the classification is sketched below)
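A small sketch (ours, for illustration only) of how such schedule-out events can be classified; the table above corresponds to the four counters below.

```c
/* Classify a driver-domain schedule-out by the vcpu's state and priority.
 * Counter and parameter names are illustrative, not the authors' code. */
struct sched_out_stats {
    unsigned long blocked_boost, blocked_under;
    unsigned long unblocked_boost, unblocked_under;
};

static void account_sched_out(struct sched_out_stats *s, int blocked, int boosted)
{
    if (blocked) {
        if (boosted) s->blocked_boost++;   else s->blocked_under++;
    } else {
        /* Unblocked + BOOST is the multi-boost case: the vcpu is still runnable
         * and still boosted, so only another BOOSTed vcpu can have displaced it. */
        if (boosted) s->unblocked_boost++; else s->unblocked_under++;
    }
}
```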

Operating Systems Lab.                             http://os.korea.ac.kr   9
Intra-VM latency
•  Latency from usb_irq to netback
      –  Schedule-outs occur during dom0 execution

•  Reasons
      –  Dom0 is not always the highest priority
      –  Asynchronous I/O handling: bottom halves, softirqs, tasklets, etc.

   [Figure: two cumulative latency distributions (0–80 ms) of Xen-to-usb_irq @ dom0,
    Xen-to-netback @ dom0, and Xen-to-icmp @ dom1; Dom0 is the IDD, Dom2 runs 100% CPU;
    top panel: Dom1 at 20% CPU + ping recv, bottom panel: Dom1 at 80% CPU + ping recv]

Operating Systems Lab.                                                  http://os.korea.ac.kr   10
Virtual Interrupt Preemption @ Guest OS

•    Interrupt enable/disable is not physically operated within a guest OS
       –  local_irq_disable() disables only virtual interrupts
•    Physical interrupts can still occur
       –  and might trigger inter-VM scheduling
•    Virtual interrupt preemption
       –  Similar to lock-holder preemption
       –  The driver domain performs driver functions with (virtual) interrupts disabled
       –  A physical timer interrupt triggers inter-VM scheduling, and the driver domain
          is scheduled out with a pending IRQ
       –  The virtual interrupt can be received only after the domain is scheduled again
          (which can take tens of ms)

     <At the driver domain>

     void default_idle(void) {
         local_irq_disable();     /* disables only the *virtual* interrupt;          */
                                  /* a hardware interrupt arriving here only sets a  */
                                  /* pending bit, because virtual intr. is disabled  */
         if (!need_resched())
             arch_idle();         /* blocks this domain */
         local_irq_enable();
     }

     Note that the driver domain performs extensive I/O operations with interrupts disabled
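For reference, a simplified sketch (ours) of the delivery rule the code above illustrates: while the guest keeps virtual interrupts disabled, the hypervisor only records the event, and the guest sees it after it both re-enables virtual interrupts and is scheduled again.

```c
/* Simplified model of virtual interrupt delivery; names are ours. */
struct vguest {
    int virq_enabled;        /* guest-visible "interrupts enabled" flag      */
    int running;             /* the guest's vcpu currently holds the CPU     */
    unsigned int pending;    /* bitmask of virtual interrupts not yet seen   */
};

static void on_physical_irq(struct vguest *g, unsigned int irq)
{
    g->pending |= 1u << irq;            /* always record the event           */
    if (g->virq_enabled && g->running) {
        /* inject now: upcall into the guest's virtual interrupt handler */
    }
    /* Otherwise the event stays pending. If the guest was scheduled out with
     * virtual interrupts disabled (virtual interrupt preemption), it sees the
     * event only after it is scheduled again - potentially tens of ms later. */
}
```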
Operating Systems Lab.                                                        http://os.korea.ac.kr          11
Resolving I/O Latency Problems for Xen-ARM
1.    Fixed priority assignment
      –  Let the driver domain always run with
         DRIVER_BOOST, the highest priority
           •  regardless of its CPU workload
           •  Resolves not-boosted VCPUs and multi-boost
      –  RT_BOOST (between DRIVER_BOOST and BOOST) for real-time I/O domains
2.    Virtual FIQ support
      –  ARM-specific interrupt optimization
      –  Higher priority than normal IRQ interrupts
      –  Uses the virtual PSR (program status register)
3.    Do not schedule out the driver domain while it has virtual
      interrupts disabled
      –  It will finish soon, so the hypervisor should give the driver domain
         a chance to run
           •  Resolves virtual interrupt preemption (see the sketch below)
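A minimal sketch of how remedies 1 and 3 change the scheduling decisions; the structs, field names, and helper functions are ours for illustration, not the actual Xen-ARM patch.

```c
/* Priority classes after the change, highest first. */
enum prio { PRIO_DRIVER_BOOST, PRIO_RT_BOOST, PRIO_BOOST, PRIO_UNDER, PRIO_OVER };

struct vcpu_model {
    int is_driver_domain;   /* dom0 / IDD                                  */
    int is_rt_io_domain;    /* real-time I/O guest                         */
    int event_pending;      /* woken by an I/O event                       */
    int virq_disabled;      /* guest currently has virtual interrupts off  */
    int credit;
};

/* Remedy 1: fixed priority assignment. The driver domain always gets
 * DRIVER_BOOST regardless of its CPU consumption; a real-time I/O guest
 * gets RT_BOOST, between DRIVER_BOOST and BOOST. */
static enum prio classify(const struct vcpu_model *v)
{
    if (v->is_driver_domain)                    return PRIO_DRIVER_BOOST;
    if (v->is_rt_io_domain && v->event_pending) return PRIO_RT_BOOST;
    if (v->event_pending)                       return PRIO_BOOST;
    return (v->credit > 0) ? PRIO_UNDER : PRIO_OVER;
}

/* Remedy 3: never schedule out the driver domain while its virtual interrupts
 * are disabled - it will re-enable them soon, and preempting it here is what
 * causes virtual interrupt preemption. */
static int may_preempt(const struct vcpu_model *cur)
{
    return !(cur->is_driver_domain && cur->virq_disabled);
}
```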



Operating Systems Lab.                                       http://os.korea.ac.kr   12
Enhanced Latency :
   no dom1 vcpu dispatch latency
   [Figure: four cumulative latency distributions (0–80 ms) of Xen-to-netback @ dom0 and
    Xen-to-icmp @ dom1 with the enhanced scheduler; Dom0 is the IDD, Dom2 runs 100% CPU,
    and Dom1 runs 20%, 40%, 60%, or 80% CPU + ping recv (one panel per load)]

Operating Systems Lab.                                                                                                           http://os.korea.ac.kr          13
Enhanced Interrupt Latency :
                    @ Driver Domain
•  Vcpu dispatch latency
    –  From the intr. handler @ Xen
       to the hard-IRQ handler @ IDD
    –  Fixed-priority dom0 vcpu
    –  Virtual FIQ
    –  No latency is observed!

•  Intra-VM latency
    –  From the ISR to netback @ IDD
    –  No virt. intr. preemption
       (dom0 has the highest prio.)

•  Among 13M interrupts,
    –  56K cases of virtual interrupt preemption were caught
    –  8.5M preemptions occurred with the FIQ optimization

   [Figure: cumulative latency distribution (0–80 ms) of Xen-to-usb_irq and Xen-to-netback,
    original vs. new scheduler; Dom0: IDD, Dom1: 80% CPU + ping recv, Dom2: 100% CPU]


Operating Systems Lab.                                                           http://os.korea.ac.kr        14
Enhanced end-user Latency :
                       overall result
•  Over 1 million ping tests,
      –  the fixed priority lets the driver domain run
         without additional latency
         (from inter-VM scheduling)
           •  Largely reduces overall latency
      –  99% of interrupts are handled within 1 ms

   [Figure: cumulative latency distribution (%) over 0–60 ms, Xen-domU enhanced vs.
    original, showing the reduced latency of the enhanced scheduler]

Operating Systems Lab.                                                                                       http://os.korea.ac.kr           15
Conclusion and
                                  Possible Future work
•    We analyzed I/O latency in the Xen-ARM virtual machine
      –  throughout the interrupt handling path of the split driver model
•    Two main reasons for long latency
      –  Limitations of ‘BOOST’ in Xen-ARM’s Credit scheduler
           •  Not-boosted vcpus
           •  Multi-boost
      –  The driver domain’s virtualized interrupt handling
           •  Virtual interrupt preemption (akin to lock-holder preemption)
•    Achieved sub-millisecond latency for 99% of network packet interrupts
      –  DRIVER_BOOST: the highest priority for the driver domain
      –  Modified the scheduler so that the driver domain is not scheduled out while
         its virtual interrupts are disabled
      –  Further optimizations (incl. virtual FIQ mode, softirq awareness)
•    Possible future work
      –  Multi-core consideration/extension (core allocation, etc.)
           •  Integration with other schedulers
      –  Tight latency guarantees for real-time guest OSs
           •  The remaining 1% holds the key


Operating Systems Lab.                                                     http://os.korea.ac.kr   16
Thanks for your attention




Credits to OSvirtual @ oslab, KU



                                   17
Appendix. Native comparison
  •  Comparison with the native system – no CPU workload
       –  Slightly reduced handling latency, largely reduced maximum (vs. the original scheduler)
             (latency, µs)     Native    Orig. dom0    Orig. domU    New sched. dom0    New sched. domU
             Min                  375           459           575                444                584
             Avg               532.52        821.48        912.42             576.54             736.06
             Max               107456        100782        100964               1883               2208
             Stdev            1792.34       4656.26       4009.78              41.84              45.95
   [Figure: cumulative latency distribution (300–1000 µs) of native ping (eth0), orig. dom0,
    orig. domU, new sched. dom0, and new sched. domU]
    Operating Systems Lab.                                                                                                    http://os.korea.ac.kr            18
Appendix. Fairness?
   •  Fairness is still good, and high utilization is achieved
      [Figure: CPU-burning jobs’ utilization of dom1 and dom2 under the ideal no-I/O case,
       the original Credit scheduler, and the new scheduler]

* Setting
Dom1: 20, 40, 60, 80, 100% CPU load
       + ping receiver
Dom2: 100% CPU load
Note that the Credit scheduler is work-conserving

Normalized throughput = measured throughput / ideal throughput
      [Figure: normalized throughput of the original Credit scheduler vs. the new scheduler]
    Operating Systems Lab.                                               http://os.korea.ac.kr   19
Appendix. Latency in multi-OS env.
   •  3-domain cases with differentiated service
         –  added an RT_BOOST priority between
            DRIVER_BOOST and BOOST
         –  Dom1 is assumed to be RT
         –  Dom2 and Dom3 are GP (general-purpose)
      [Figure: two panels – “Credit’s latency dist. 3 domains for 10K ping tests
       (interval 10~40ms)” (left) vs. “Enhanced latency dist. 3 doms. for 10K ping tests
       (interval 10~40ms)” (right); each shows the cumulative latency distribution (0–100%)
       of Dom1, Dom2, and Dom3, with latency ticks from 0 to 120000]
    Operating Systems Lab.                                                                                                                                                                                                                                                      http://os.korea.ac.kr                                                                    20
Appendix. Latency in multi-OS env.
•  3-domain cases with differentiated service
     –  added an RT_BOOST priority between DRIVER_BOOST and BOOST
     –  Dom1 is assumed to be RT
     –  Dom2 and Dom3 are GP (general-purpose)
           [Figure: combined cumulative latency distribution (0–100%) comparing
            Enh. Dom1, Enh. Dom2, and Enh. Dom3 with Credit Dom1, Credit Dom2, and
            Credit Dom3]
Operating Systems Lab.                           http://os.korea.ac.kr   21
