Extending I/O Scalability


Xen.org community presentation from Xen Summit Asia 2009


  1. Extending I/O Scalability in Xen
     Dong Eddie, Zhang Xiantao, Xu Dongxiao, Yang Xiaowei
     Xen Summit Asia 2009
  2. Legal Information
     INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel may make changes to specifications and product descriptions at any time, without notice. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel is a trademark of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2009, Intel Corporation. All rights are protected.
  3. Agenda
     • Scalability challenges of I/O virtualization
     • VNIF optimizations
     • VT-d overhead reduction
     • SR-IOV
     • Per-CPU vector
  4. Scalability challenges of I/O virtualization
     • Para-virtualized device driver
       • Software bottleneck; high CPU utilization
     • Direct I/O
       • Interrupt overhead; device-count limit
     • SR-IOV
       • Driver not optimal
       • Limitations inside the VMM
  5. VNIF optimization
  6. Improving scalability in a 10G network
     [Charts: UDP/RX bandwidth (Mbps) and CPU util% (dom0%, vm%) vs. VM# (10/20/40/60), original vs. with hack]
     • Issue: total throughput is limited by the netback driver bottleneck in dom0 - a single TX/RX tasklet serves all netfront interfaces.
     • Hack: duplicate the netback driver into 10 instances, each serving the VNIFs of domX (X = domid % 10).
     • Benefit: near-10G throughput is reached, at the cost of high CPU util%.
     • Ongoing: multiple TX/RX tasklets.
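The hack above can be sketched as a static mapping from domain ID to netback instance. This is an illustrative model, not the actual Xen netback code: the name `netback_group_for` and the constant are assumptions for exposition.

```c
#include <assert.h>

/* Sketch of the "duplicate netback" hack: instead of one TX/RX tasklet
 * pair serving every netfront, run NR_NETBACK_GROUPS copies of the
 * netback driver and statically map each guest to one of them by domid.
 * Each group has its own tasklet pair, so up to NR_NETBACK_GROUPS
 * tasklets can run in parallel on different dom0 vCPUs. */
#define NR_NETBACK_GROUPS 10

static int netback_group_for(int domid)
{
    return domid % NR_NETBACK_GROUPS;
}
```

Because the mapping is static, load can still skew if many busy guests happen to share a group; the "multiple TX/RX tasklets" work mentioned above replaces this with a proper dynamic scheme.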
  7. Notification frequency reduction
     [Charts: UDP/RX and TCP/RX CPU utilization (dom0%, vm%) for 1/3/9 VMs, original vs. FE-BE modified]
     • Issue: evtchn frequency ~= physical NIC interrupt frequency * n (n: number of VNIFs sharing the NIC). Per our tests, a 1 kHz evtchn frequency already sustains 1G throughput (TCP/UDP).
     • Solution: add an evtchn frequency-control policy inside 1) netback or 2) netfront, plus an interface to userspace.
     • Benefit: reduced CPU util% in guests.
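One way such an evtchn frequency-control policy could work is a simple per-VNIF rate limiter that drops notifications arriving faster than a configured cap. This is a hypothetical sketch, not the actual patch; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Throttle event-channel notifications to at most max_hz per second per
 * VNIF (the slide's data suggests ~1 kHz sustains 1 Gb/s).  A skipped
 * notification is safe because netfront will still see the pending work
 * on the next notification that does go through. */
struct vnif_throttle {
    uint64_t last_notify_ns;
    uint64_t min_interval_ns;   /* 1e9 / max_hz */
};

static int should_notify(struct vnif_throttle *t, uint64_t now_ns)
{
    if (now_ns - t->last_notify_ns < t->min_interval_ns)
        return 0;               /* coalesce: suppress this notification */
    t->last_notify_ns = now_ns;
    return 1;
}
```

The userspace interface mentioned on the slide would then just set `min_interval_ns` per interface.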
  8. VT-d overhead reduction
  9. Network virtualization overhead with Direct I/O - 10G NIC
     (‘%’ here means system-wide CPU utilization normalized to 100%; 4 CPUs are used in the test.)
     • Interrupt frequency is very high in a 10G network: 4K/8K HZ for each TX/RX queue, with 8 TX/RX queues in total.
     • APIC-access VMExits cause the most overhead, within which EOI accesses occupy 90%/50% for the TX/RX cases.
     • Interrupt VMExits (including external interrupts and IPIs) cause the second-most overhead, within which IPIs occupy >50%.
  10. APIC-access VMExit optimization - vEOI
     APIC-access VMExit handling stages: VMExit -> instruction fetch -> instruction emulation -> vAPIC emulation -> VMEntry
     • EOI usage:
       • Software writes “0” into EOI to signal interrupt-servicing completion.
       • Upon receiving an EOI, the LAPIC clears the highest-priority bit in the ISR and dispatches the next-priority interrupt.
     • Most OSes (Windows and Linux) only use ‘mov’ for this purpose, which has no side effect on other CPU state.
     • Benefit: by bypassing the instruction fetch/emulation stages, each vEOI’s cost decreases from 7.6k to 3.1k cycles; CPU util% can decrease by 15+% (12k * 8 * 4.5k / 2.8G) in the 10G NIC case.
     • If a guest uses a complex instruction (e.g. stos) for the EOI access, there will be side effects in the guest, but the host is unaffected. A better/complete solution is PV, or virtualizing x2APIC (which uses MSRs).
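The fast path above can be modeled in a few lines. This is an illustrative sketch, not Xen's actual vLAPIC code: the ISR is modeled as a 32-bit bitmap (real LAPICs track 256 vectors), and on an APIC-access exit at the EOI offset the handler assumes a plain 'mov' of 0 and completes the EOI directly, skipping instruction fetch and emulation.

```c
#include <assert.h>
#include <stdint.h>

/* EOI lives at offset 0xB0 in the memory-mapped APIC page. */
#define APIC_EOI_OFFSET 0xB0u

/* Fast-path test: is this APIC-access exit a write to the EOI register?
 * If so, we can act on vLAPIC state directly instead of decoding the
 * guest instruction. */
static int is_eoi_access(uint32_t apic_offset)
{
    return apic_offset == APIC_EOI_OFFSET;
}

/* EOI completion: clear the highest-priority (highest-numbered) set bit
 * in the in-service register, allowing the next pending interrupt to be
 * dispatched. */
static uint32_t vlapic_eoi(uint32_t isr)
{
    uint32_t v = isr, top = 0;
    if (isr == 0)
        return 0;
    while (v >>= 1)
        top++;
    return isr & ~(1u << top);
}
```

Since the guest's 'mov' has no architectural side effects beyond the store itself, skipping emulation cannot corrupt host state, which is why the slide notes the host is immune even to guests using unexpected instructions.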
  11. vIntr delivery - before optimization
     0. The VT-d device’s pIntr affinity is set to where vCPU0 is (pCPU0) at boot time; the guest OS sets vCPU2 to receive the vIntr.
     1. When the device generates the pIntr, it is delivered to pCPU0.
     2. Xen sends an IPI to where vCPU0 is now running (pCPU2) for vIntr delivery.
     3. Before vCPU0’s VMentry, Xen delivers the vIntr by sending an IPI to where vCPU2 is (pCPU3).
     4. On vCPU2’s VMentry, the vIntr is injected into the guest.
  12. vIntr delivery - after optimization
     0. Most of the work is done before the pIntr happens:
        • When the guest OS sets vCPU2 to receive the vIntr, Xen sets the corresponding pCPU (pCPU3) to receive the pIntr.
        • When vCPU2 migrates, the pIntr migrates with it.
     1. When the device generates the pIntr, it is delivered to pCPU3 directly.
     2. On vCPU2’s VMentry, the vIntr is injected into the guest.
     • Benefit: all IPIs for vIntr delivery (~4%-9% CPU overhead) are removed.
     • Limit: for simplicity, our first patch only handles MSI (no IOAPIC) and the single-vIntr-destination case, which covers typical usage.
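The core of the optimization is keeping the physical MSI destination in sync with wherever the target vCPU is running, so the pIntr always lands on the right pCPU and the delivery IPIs disappear. A minimal sketch, with illustrative (not real Xen) names:

```c
#include <assert.h>

/* Binding between a guest-routed virtual interrupt and the physical MSI
 * destination programmed into the device's MSI address register. */
struct virq_binding {
    int target_vcpu;       /* vCPU the guest routed the vIntr to */
    int pirq_dest_pcpu;    /* pCPU the physical MSI is aimed at */
};

/* Hook called by the scheduler whenever a vCPU moves to another pCPU:
 * if it is the interrupt's target, re-program the MSI destination so
 * the pIntr follows the vCPU. */
static void pirq_follow_vcpu(struct virq_binding *b, int vcpu, int new_pcpu)
{
    if (vcpu == b->target_vcpu)
        b->pirq_dest_pcpu = new_pcpu;
}
```

Handling only MSI with a single destination (as the slide's first patch does) keeps this update a single register write; IOAPIC-routed or multi-destination interrupts would need redirection-table rewrites and arbitration, hence the stated limitation.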
  13. SR-IOV
  14. SR-IOV vs. VNIF scalability
     [Charts: bandwidth (Mbps) and CPU util% (dom0%, vm%) vs. VM# (10/20/40/60), domU+VF vs. domU+VNIF (w/ hack)]
     • SR-IOV has a significant advantage over the VNIF solution even after the tasklet bottleneck is fixed, in terms of:
       • lower CPU utilization
       • stable latency
  15. Interrupt coalescing optimization
     • Interrupt coalescing is more critical in a VM than on native hardware: interrupt handling inside a VM has more overhead.
     • The native interrupt-coalescing policy is not efficient for SR-IOV, as the bandwidth of one VF can be:
       • << line speed, e.g. several VFs within one physical port competing at the same time;
       • >> line speed, e.g. inter-VF communication within one port.
     • Proposal: adaptive interrupt coalescing (AIC), based on the packet generation rate over the last few seconds and the buffer size in the system.
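The slides do not give the AIC formula, so the following is only a plausible sketch of the idea under stated assumptions: choose an interrupt-rate cap from the measured packet rate (here, roughly one interrupt per 16 packets, an arbitrary illustrative batch size), with a floor that guarantees an interrupt fires before half the RX ring can fill.

```c
#include <assert.h>

/* Hypothetical adaptive interrupt coalescing policy.  pkts_per_sec is
 * the packet rate measured over the recent window; ring_slots is the RX
 * buffer size.  Returns a maximum interrupt rate (Hz) to program into
 * the device's interrupt throttle register. */
static unsigned aic_max_intr_hz(unsigned pkts_per_sec, unsigned ring_slots)
{
    unsigned half_ring = ring_slots / 2 ? ring_slots / 2 : 1;
    /* Floor: interrupt often enough that the ring cannot overflow. */
    unsigned floor_hz = pkts_per_sec / half_ring + 1;
    /* Target: about one interrupt per 16 packets. */
    unsigned target_hz = pkts_per_sec / 16 + 1;
    return target_hz > floor_hz ? target_hz : floor_hz;
}
```

A policy like this adapts in both directions the slide describes: a starved VF (rate far below line speed) gets a low interrupt rate, while a VF doing fast inter-VF traffic gets batching proportional to its actual rate rather than to line speed.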
  16. Per-CPU vector
  17. Issues
     • The interrupt vector space in Xen was global: one vector number was shared by all pCPUs.
     • Fewer than 200 vectors were available for devices in total:
       • the vector range is 0-255;
       • the lowest 32 vectors are reserved by the x86 SDM;
       • 16 vectors are reserved for the legacy PIC;
       • the highest 16 vectors are reserved for high-priority cases;
       • special case: 0x82.
     • In the SR-IOV case, vectors are easily exhausted: e.g. one Niantic NIC can use 384 vectors = 2 (ports) * 64 (VFs+PF) * 3 (TX/RX/mailbox).
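The budget arithmetic on this slide is worth making explicit: the reservations leave fewer than 200 usable device vectors, while a single Niantic (82599) NIC alone can want 384. A small calculation following the slide's numbers:

```c
#include <assert.h>

enum {
    NR_VECTORS   = 256,  /* x86 vector range 0-255 */
    SDM_RESERVED = 32,   /* vectors 0-31: exceptions, per the x86 SDM */
    LEGACY_PIC   = 16,   /* reserved for the legacy PIC */
    HIGH_PRIO    = 16,   /* topmost vectors kept for high-priority use */
};

/* Device vectors left in a single global vector space (the 0x82 special
 * case from the slide shaves off one more). */
static int device_vectors_global(void)
{
    return NR_VECTORS - SDM_RESERVED - LEGACY_PIC - HIGH_PRIO;
}

/* Vector demand of one Niantic NIC: ports * functions * queues. */
static int niantic_vectors(int ports, int fns, int queues_per_fn)
{
    return ports * fns * queues_per_fn;
}
```

With 384 demanded against fewer than 200 available, a single global vector space cannot even serve one fully-enabled SR-IOV NIC, which motivates the per-CPU scheme on the next slide.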
  18. Solution
     • Back-port the per-CPU vector solution from the Linux kernel.
     • After the change, max vector# = 256 * pCPU#.
     • Code changes:
       • precondition: all interrupts are indexed by ‘irq’;
       • vector allocation/management functions and related structures;
       • interrupt migration needs special care: after migration, the vector# may differ.
     • Evaluation: with per-CPU vectors, vectors can be assigned to all Niantic VFs successfully.
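The indexing idea behind the back-port can be sketched as follows. This is a simplified model, not the real Xen or Linux allocator: each pCPU owns a private vector-to-irq table, so an interrupt is identified by the pair (pCPU, vector) and the global limit becomes 256 * pCPU#. Real code also handles priority spacing, IRQ affinity, and the migration corner case the slide mentions.

```c
#include <assert.h>
#include <string.h>

#define NR_VECTORS          256
#define FIRST_DEVICE_VECTOR 32   /* below this are SDM-reserved vectors */

/* Per-pCPU vector table: maps vector -> irq, -1 meaning free. */
struct pcpu_vectors {
    int vector_to_irq[NR_VECTORS];
};

/* Assign the lowest free device vector on this pCPU to 'irq'.
 * Returns the vector, or -1 if this pCPU's table is full (the caller
 * would then try another pCPU). */
static int assign_vector(struct pcpu_vectors *p, int irq)
{
    for (int v = FIRST_DEVICE_VECTOR; v < NR_VECTORS; v++) {
        if (p->vector_to_irq[v] == -1) {
            p->vector_to_irq[v] = irq;
            return v;
        }
    }
    return -1;
}
```

Because the same vector number can now back different irqs on different pCPUs, migrating an interrupt between pCPUs may change its vector, which is exactly why the slide flags interrupt migration as needing special care.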