Minimizing I/O Latency in Xen-ARM
Presentation Transcript

  • Minimizing I/O Latency in Xen-ARM
    Seehwan Yoo, Chuck Yoo, and OSVirtual
    Korea University
  • I/O Latency Issues in Xen-ARM
    – Guaranteed I/O latency is essential to Xen-ARM
      • For mobile communication: broken/missed phone calls, long network detection
      • AP and CP are used for guaranteed communication
    – Virtualization promises transparent execution
      • Not only instruction execution (access isolation)
      • But also performance (performance isolation)
    – In practice, I/O latency is still an obstacle
      • Lack of scheduler support
        – Scheduling is hierarchical: the hypervisor cannot impose task scheduling inside a guest OS
        – The credit scheduler does not match time-sensitive applications
      • Latency due to the split driver model
        – Enhances reliability, but degrades performance (w.r.t. I/O latency)
        – Inter-domain communication between the IDD and the user domain
    – We investigate these issues and present possible remedies
  • Related Work
    – Credit scheduler variants in Xen
      • Task-aware scheduler (Hwanjoo et al., VEE '08)
        – Inspects task execution inside the guest OS
        – Adaptively uses the boost mechanism according to VM workload characteristics
      • Laxity-based soft real-time scheduler (Min Lee et al., VEE '10)
        – Calculates laxity from execution profiles
        – Assigns priority based on the remaining time to deadline (laxity)
      • Dynamic core allocation (Y. Hu et al., HPDC '10)
        – Allocates cores by workload characteristics (driver core, fast/slow tick cores)
    – Mobile/embedded hypervisors
      • OKL4 microvisor: microkernel-based hypervisor
        – Has a verified kernel (from seL4)
        – Good performance; commercially successful
        – Real-time optimizations: slice donation, direct switching, reflective scheduling, threaded interrupt handling, etc.
      • VirtualLogix VLX: mobile hypervisor with real-time support
        – Shared driver model: device sharing between the RT guest OS and non-RT guest OSs
        – Good paravirtualization performance
  • Background – the Credit Scheduler in Xen
    – Weighted round-robin based fair scheduler
    – Priorities: BOOST, UNDER, OVER
      • Basic principle: preserve fairness among all VCPUs
        – Each VCPU receives credit periodically
        – Credit is debited as the VCPU consumes execution time
      • Priority assignment (see the sketch below)
        – Remaining credit <= 0 → OVER (lowest)
        – Remaining credit > 0 → UNDER
        – Event-pending VCPU → BOOST (highest)
      • BOOST provides low response time by allowing immediate preemption of the current VCPU
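    To make the rules above concrete, here is a minimal C sketch of the priority assignment as described on this slide. All names (vcpu_t, credit, event_pending) are hypothetical illustrations, not actual Xen source:

        /* Minimal sketch of credit-scheduler priority assignment. */
        typedef enum { PRIO_BOOST, PRIO_UNDER, PRIO_OVER } prio_t;

        typedef struct {
            int    credit;         /* refilled periodically, debited while running */
            int    event_pending;  /* set when an I/O event arrives for this vcpu  */
            prio_t priority;       /* current scheduling priority                  */
        } vcpu_t;

        static prio_t credit_priority(const vcpu_t *v)
        {
            if (v->event_pending)
                return PRIO_BOOST;                        /* highest         */
            return (v->credit > 0) ? PRIO_UNDER
                                   : PRIO_OVER;           /* OVER is lowest  */
        }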
  • Fallacies about BOOST in the Credit Scheduler
    – Fallacy 1: a VCPU is always boosted by an I/O event
      • In fact, BOOST is sometimes ignored: a VCPU is boosted only when doing so does not break fairness
      • "Not-boosted" VCPUs are observed when the VCPU is already in execution or runnable
      • Example: if a user domain has a CPU job and is waiting for execution, it is not boosted, since it will run soon anyway and a tentative BOOST could easily break fairness
    – Fallacy 2: BOOST always prioritizes the VCPU
      • In fact, BOOST is easily negated because multiple VCPUs can be boosted simultaneously
      • "Multi-boost" happens quite often in the split driver model
      • Example 1: the driver domain has to be boosted, and then the user domain also needs to be boosted
      • Example 2: the driver domain has multiple packets destined for multiple user domains; all the designated user domains are boosted
    (A sketch of both fallacies follows.)
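    The two fallacies can be seen in a simplified event-wakeup path. This is a hedged sketch of the behavior described above, reusing vcpu_t and prio_t from the previous sketch; the helpers below are hypothetical stand-ins for scheduler internals, not Xen's actual code:

        extern int     vcpu_is_runnable(const vcpu_t *v);
        extern void    make_runnable(vcpu_t *v);
        extern vcpu_t *current_vcpu(void);
        extern void    preempt_current(void);

        void on_io_event(vcpu_t *v)
        {
            /* Fallacy 1: an already-runnable vcpu (e.g., one with pending
             * CPU work) is NOT boosted, to avoid breaking fairness. */
            if (vcpu_is_runnable(v))
                return;

            v->priority = PRIO_BOOST;   /* boost the blocked, event-pending vcpu */
            make_runnable(v);

            /* Fallacy 2: if the current vcpu is itself BOOST (say, the
             * driver domain), the new BOOST does not win immediately;
             * multiple BOOST vcpus share the same round-robin priority. */
            if (current_vcpu()->priority != PRIO_BOOST)
                preempt_current();
        }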
  • Xen-ARM's Latency Characteristics
    – I/O latency is measured along the whole interrupt path:
      • Preemption latency: until the hypervisor code is preemptible
      • VCPU dispatch latency: until the designated VCPU is scheduled
      • Intra-VM latency: until I/O completion inside the VM
    – End to end: physical interrupt → Xen-ARM interrupt handler → (preemption latency) → Dom0 VCPU dispatch → Dom0 I/O task (intra-Dom0 latency) → DomU VCPU dispatch → DomU I/O task (intra-DomU latency)
    [Figure: interrupt path through Xen-ARM, Dom0, and DomU, annotated with preemption, VCPU dispatch, and intra-VM latencies]
  • Observed Latency Along the Interrupt Path
    – Setup: ping requests sent from an external server to Dom1
    – VM settings
      • Dom0: the driver domain
      • Dom1: 20% CPU load + ping receiver
      • Dom2: 100% CPU load (CPU-burning workload)
    – Xen-to-netback latency
      • Large worst-case latency
      • = Dom0 VCPU dispatch latency + intra-Dom0 latency
    – Netback-to-DomU latency
      • Large average latency
      • = Dom1 VCPU dispatch latency
  • Observed VCPU Dispatch Latency: Not-Boosted VCPUs
    – Experimental setting
      • Dom1: varying CPU workload; Dom2: CPU-burning workload
      • An external host sends ping to Dom1
    – Not-boosted VCPUs affect the I/O latency distribution
      • Dom1 at 20% CPU load: almost 90% of ping requests are handled within 1 ms
      • Dom1 at 40% CPU load: 75% handled within 1 ms
      • Dom1 at 60% CPU load: 65% handled within 1 ms
      • Dom1 at 80% CPU load: only 60% handled within 1 ms
    – The more CPU load Dom1 has, the larger its own I/O latency → self-disturbance by not-boosted VCPUs
    [Figure: cumulative latency distribution (%) vs. latency (ms) for 20/40/60/80% CPU load at Dom1]
  • Observed Multi-Boost
    – At the hypervisor, we counted the number of "schedule out" events of the driver domain:

      VCPU state | Priority | Schedule-out count
      -----------+----------+-------------------
      Blocked    | BOOST    |       275
      Blocked    | UNDER    |        13
      Unblocked  | BOOST    |   664,190
      Unblocked  | UNDER    |        49

    – Multi-boost is specifically the case where the current VCPU is in BOOST state and is unblocked when scheduled out
      • This implies another BOOST VCPU exists, and the current one is scheduled out in its favor
    – The counts show a very large number of multi-boosts
  • Intra-VM Latency
    – Latency from usb_irq to netback inside Dom0
      • Schedule-outs occur during Dom0 execution
    – Reasons
      • Dom0 does not always have the highest priority
      • Asynchronous I/O handling: bottom halves, softirqs, tasklets, etc.
    [Figures: cumulative distributions of Xen→usb_irq and Xen→netback latencies in Dom0 and Xen→icmp latency in Dom1; Dom0: IDD, Dom1: 20% or 80% CPU + ping receiver, Dom2: 100% CPU]
  • Virtual Interrupt Preemption at the Guest OS (driver domain)
    – Interrupt enable/disable is not physically operated within a guest OS
      • local_irq_disable() disables only the virtual interrupt
      • A physical interrupt can still occur and might trigger inter-VM scheduling; the hypervisor then only sets a pending bit, because the virtual interrupt is disabled (sketched below)
    – The idle path in the paravirtualized driver domain:

        void default_idle(void)
        {
            local_irq_disable();   /* disables only virtual interrupts */
            if (!need_resched())
                arch_idle();       /* blocks this domain */
            local_irq_enable();
        }

    – Virtual interrupt preemption (similar to lock-holder preemption)
      • The driver domain performs driver functions with interrupts disabled
      • A physical timer interrupt triggers inter-VM scheduling; the driver domain is scheduled out holding a pending IRQ
      • The virtual interrupt can be received only after the domain is scheduled again (can be tens of ms later)
    – Note that the driver domain performs extensive I/O operations with interrupts disabled
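    For illustration, the hypervisor side of this pending-bit behavior might look like the following sketch; all names (deliver_virq, set_pending_bit, inject_virq, domain_is_running, the struct fields) are hypothetical, not actual Xen-ARM code:

        struct domain {
            int           virq_enabled;   /* guest's virtual interrupt flag */
            unsigned long pending_virqs;  /* bitmap of pending virtual IRQs */
        };

        extern void set_pending_bit(struct domain *d, unsigned int virq);
        extern void inject_virq(struct domain *d, unsigned int virq);
        extern int  domain_is_running(const struct domain *d);

        /* With virtual interrupts disabled, the event is only marked
         * pending; delivery waits until the domain runs again with
         * virtual interrupts enabled, potentially tens of ms later. */
        void deliver_virq(struct domain *d, unsigned int virq)
        {
            set_pending_bit(d, virq);              /* always record the event */
            if (d->virq_enabled && domain_is_running(d))
                inject_virq(d, virq);              /* immediate delivery      */
            /* else: virtual interrupt preemption window */
        }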
  • Resolving the I/O Latency Problems in Xen-ARM
    1. Fixed priority assignment
      – The driver domain always runs with DRIVER_BOOST, the highest priority, regardless of CPU workload
        • Resolves not-boosted VCPUs and multi-boost
      – RT_BOOST and BOOST for real-time I/O domains
    2. Virtual FIQ support
      – ARM-specific interrupt optimization
      – Higher priority than normal IRQ interrupts
      – Uses the virtual PSR (program status register)
    3. Do not schedule out the driver domain while it has virtual interrupts disabled
      – The critical section will finish soon, so the hypervisor should give the driver domain a chance to run
        • Resolves virtual interrupt preemption
    (A sketch of changes 1 and 3 follows.)
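    A minimal sketch of changes 1 and 3, extending the earlier prio_t with the two new levels; names (do_schedule, is_driver_domain, virq_disabled, highest_priority_runnable) are hypothetical, and the real Xen-ARM modification differs in detail:

        typedef enum {
            PRIO_DRIVER_BOOST,   /* fixed, highest: the driver domain (IDD) */
            PRIO_RT_BOOST,       /* real-time I/O domains                   */
            PRIO_BOOST,
            PRIO_UNDER,
            PRIO_OVER
        } prio_t;

        extern int     is_driver_domain(const vcpu_t *v);
        extern vcpu_t *highest_priority_runnable(void);

        vcpu_t *do_schedule(vcpu_t *cur)
        {
            /* Change 3: the driver domain's interrupt-disabled section
             * will finish soon; let it keep running instead of scheduling
             * it out (avoids virtual interrupt preemption). */
            if (is_driver_domain(cur) && cur->virq_disabled)
                return cur;

            /* Change 1: pick by fixed priority; DRIVER_BOOST always wins. */
            return highest_priority_runnable();
        }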
  • Enhanced Latency: No Dom1 VCPU Dispatch Latency
    [Figures: cumulative latency distributions of Xen→netback (Dom0) and Xen→icmp (Dom1) for Dom1 CPU loads of 20%, 40%, 60%, and 80%; Dom0: IDD, Dom2: 100% CPU]
  • Enhanced Interrupt Latency at the Driver Domain
    – VCPU dispatch latency (from the interrupt handler in Xen to the hard-IRQ handler in the IDD)
      • With the fixed-priority Dom0 VCPU and virtual FIQ, no dispatch latency is observed
    – Intra-VM latency (from the ISR to netback in the IDD)
      • No virtual interrupt preemption, since Dom0 has the highest priority
    – Among 13M interrupts, 56K cases of virtual interrupt preemption were caught, and 8.5M preemptions occurred with the FIQ optimization
    [Figure: cumulative distributions of Xen→usb_irq and Xen→netback latencies, original vs. new scheduler; Dom0: IDD, Dom1: 80% CPU + ping receiver, Dom2: 100% CPU]
  • Enhanced End-User Latency: Overall Result
    – Over 1 million ping tests,
      • The fixed priority lets the driver domain run without additional latency from inter-VM scheduling, largely reducing overall latency
      • 99% of interrupts are handled within 1 ms
    [Figure: cumulative latency distribution (%), Xen-domU enhanced vs. original, showing the reduced latency]
  • Conclusion and Possible Future Work
    – We analyzed I/O latency in the Xen-ARM virtual machine, along the interrupt handling path of the split driver model
    – Two main reasons for long latency
      • Limitations of BOOST in Xen-ARM's credit scheduler: not-boosted VCPUs and multi-boost
      • The driver domain's virtualized interrupt handling: virtual interrupt preemption (akin to lock-holder preemption)
    – We achieve sub-millisecond latency for 99% of network packet interrupts
      • DRIVER_BOOST: the highest, fixed priority for the driver domain
      • A modified scheduler that does not schedule out the driver domain while its virtual interrupts are disabled
      • Further optimizations (incl. virtual FIQ mode, softirq awareness)
    – Possible future work
      • Multi-core consideration/extension (core allocation, etc.); integration with other schedulers
      • Tight latency guarantees for real-time guest OSs: the remaining 1% holds the key
  • Thanks for your attention
    Credits to OSvirtual @ oslab, KU
  • Appendix: Native Comparison
    – Comparison with the native system (no CPU workload)
    – Slightly reduced handling latency; largely reduced maximum

      Latency (us) | Native  | Orig. dom0 | Orig. domU | New sched. dom0 | New sched. domU
      -------------+---------+------------+------------+-----------------+----------------
      Min          |     375 |        459 |        575 |             444 |            584
      Avg          |  532.52 |     821.48 |     912.42 |          576.54 |         736.06
      Max          |  107456 |     100782 |     100964 |            1883 |           2208
      Stdev        | 1792.34 |    4656.26 |    4009.78 |           41.84 |          45.95

    [Figure: cumulative latency distribution (300-1000 us) for native ping (eth0), original dom0/domU, and new-scheduler dom0/domU]
  • Appendix: Fairness?
    – Fairness is still good, and the scheduler achieves high utilization
    – Setting: Dom1 runs 20, 40, 60, 80, or 100% CPU load + ping receiver; Dom2 runs 100% CPU load
    – Note that the credit scheduler is work-conserving
    – Normalized throughput = measured throughput / ideal throughput
    [Figures: CPU-burning jobs' utilization (dom1, dom2) for the no-I/O ideal case, the original credit scheduler, and the new scheduler; normalized throughput for the original credit scheduler vs. the new scheduler]
  • Appendix: Latency in a Multi-OS Environment
    – Three-domain case with differentiated service
      • Added an RT_BOOST priority between DRIVER_BOOST and BOOST
      • Dom1 assumed to be RT; Dom2 and Dom3 are general-purpose
    – 10K ping tests per domain (interval 10-40 ms)
    [Figures: the credit scheduler's latency distribution vs. the enhanced latency distribution for Dom1-Dom3, over 0-120000 us]
  • Appendix: Latency in a Multi-OS Environment (continued)
    – Same three-domain case with differentiated service
      • Added an RT_BOOST priority between DRIVER_BOOST and BOOST
      • Dom1 assumed to be RT; Dom2 and Dom3 for general-purpose use
    [Figure: combined cumulative latency distributions, enhanced Dom1-Dom3 vs. credit Dom1-Dom3]