  • Good afternoon everyone, my name is Shingyu Kim. My presentation today is about various ways of optimizing network virtualization in Xen. These are the authors of the paper, which was awarded Best Paper at USENIX ATC 2006.
  • In recent years, there has been a trend towards running network-intensive applications, such as Internet servers, in virtual machine environments. Despite the advances in virtualization technology, the overhead of network I/O virtualization can still significantly affect the performance of network-intensive applications. For example, the CPU utilization required to saturate a 100 Mbps network under Linux running on VMware Workstation was about 5 times higher than under native Linux. Even the paravirtualized Xen shows similar results.
  • This graph shows the performance comparison between XenoLinux and native Linux. The upper line is for native Linux, and the other is for XenoLinux. You can see that there is considerable performance degradation in XenoLinux.
  • So, this paper proposes a number of optimizations for improving networking performance under the Xen VMM. The optimizations fall into three categories. First, they add three capabilities to the virtualized network interface: scatter/gather I/O, TCP/IP checksum offload, and TCP segmentation offload. Second, they introduce a faster I/O channel for transferring network packets between the guest and driver domains; these optimizations include transmit mechanisms that avoid a data remap or copy in the common case and a receive mechanism optimized for small data receives. Third, they present VMM support mechanisms that allow the use of superpage and global page mappings on the Intel x86 architecture, which reduce the number of TLB misses incurred by guest domains.
  • This figure shows the network architecture used in Xen. Xen provides each guest domain with virtual network interfaces, which the guest domain uses for its network communication. Each virtual interface in a guest domain has a backend interface in the driver domain. The virtual and backend interfaces are connected over an I/O channel. The I/O channel implements a zero-copy data transfer mechanism by remapping physical pages. This table shows the network performance under Xen. The driver domain configuration shows performance comparable to native Linux for the transmit case and a degradation of 30% for the receive case. But in the guest domain, which uses the virtualized network interfaces, the degradation is much larger. This performance degradation comes from a significantly higher TLB miss rate and a higher L2 cache miss rate.
  • Here, I’m going to explain the virtual interface optimization. The network I/O operations supported on the virtual interface consist of simple transmit and receive operations. This allows the virtualized interface to be easily supported across a large number of physical interfaces. But it also prevents the virtual interface from taking advantage of the offload capabilities of the physical network interface, such as checksum offload, scatter/gather I/O, and TCP segmentation offload. The authors propose a new virtual interface architecture that supports high-level network offload features even when these features are not supported by the physical network interface. Of course, it makes use of the offload features of the physical interface if they match the offload requirements of the virtual network interface.
  • This figure shows the design of the new I/O architecture. The architecture introduces a new component, a software offload driver, which supports scatter/gather I/O, TCP/IP checksum offload, and TCP segmentation offload. It intercepts all packets to be transmitted on the network interface. When the guest domain transmits a packet, the offload driver determines whether the offload requirements of the packet are compatible with the capabilities of the NIC. In the absence of support from the physical NIC, the offload driver performs the necessary offload actions in software.
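    As an illustration, here is a minimal C sketch of the per-packet decision the software offload driver makes; every structure and name below is an assumption for illustration, not Xen's actual code.

        #include <stdbool.h>
        #include <stddef.h>

        /* Hypothetical per-packet offload requirements and NIC capabilities. */
        struct pkt {
            bool   needs_csum;   /* TCP/IP checksum not yet computed         */
            bool   needs_tso;    /* payload larger than the MTU, needs split */
            size_t nr_frags;     /* >1 means scatter/gather is required      */
        };

        struct nic_caps {
            bool hw_csum, hw_tso, hw_sg;
        };

        /* True if the physical NIC can transmit the packet as-is. */
        static bool nic_can_offload(const struct pkt *p, const struct nic_caps *c)
        {
            if (p->needs_csum && !c->hw_csum)
                return false;
            if (p->needs_tso && !c->hw_tso)
                return false;
            if (p->nr_frags > 1 && !c->hw_sg)
                return false;
            return true;
        }

        /* Transmit hook: use the hardware offloads when possible, otherwise
         * emulate the missing features in software before handing the packet
         * to the real NIC driver.                                            */
        static void offload_driver_xmit(struct pkt *p, const struct nic_caps *c)
        {
            if (!nic_can_offload(p, c)) {
                /* Software fallback (details omitted):
                 *  - compute the TCP/IP checksum in software,
                 *  - segment the packet into MTU-sized pieces (no TSO),
                 *  - linearize fragments if scatter/gather is unsupported.   */
            }
            /* ... queue the (possibly transformed) packet on the NIC driver. */
        }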
  • Scatter/gather I/O is especially useful for doing zero-copy network transmits. Gather I/O allows the OS to construct network packets consisting of multiple fragments directly from the file system buffers, without copying them to a contiguous location. This benefit accrues to the guest domain.
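    The corresponding slide names sendfile in Linux as the typical user of this zero-copy transmit path. As a reminder of what that looks like from the application side (standard Linux sendfile(2); the helper name and the trimmed error handling are mine):

        #include <sys/types.h>
        #include <sys/sendfile.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <unistd.h>

        /* Send a whole file over an already-connected socket without copying it
         * through user space: the kernel builds packets directly from the page
         * cache, which is where gather I/O in the virtual interface pays off.  */
        static int send_file_zero_copy(int sock_fd, const char *path)
        {
            int fd = open(path, O_RDONLY);
            if (fd < 0)
                return -1;

            struct stat st;
            if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
            }

            off_t offset = 0;
            while (offset < st.st_size) {
                ssize_t n = sendfile(sock_fd, fd, &offset, st.st_size - offset);
                if (n <= 0)
                    break;   /* error, or nothing left to send */
            }
            close(fd);
            return offset == st.st_size ? 0 : -1;
        }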
  • The most significant advantage of a high-level interface is in the driver domain and the Xen VMM. Roughly 60% of the execution time for a transmit operation is spent in the VMM and driver domain for multiplexing the packet from the virtual to the physical network interface. Most of this overhead is a per-packet overhead: each packet requires one page remap operation and one forwarding operation.
  • Using TSO reduces the number of packet transfers needed to transmit the same amount of data, and therefore reduces the per-byte virtualization overhead incurred by the driver domain and the VMM. Supporting larger packets in the virtual network interface (using TSO) can thus significantly reduce the overheads incurred in network virtualization along the transmit path.
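    To make the saving concrete, a small back-of-the-envelope calculation (the 64 KB packet size is an assumption for illustration; the per-packet costs are the remap and forwarding operations mentioned above):

        #include <stdio.h>

        int main(void)
        {
            const unsigned tso_bytes = 64 * 1024; /* one large TSO packet from the guest */
            const unsigned mtu_bytes = 1500;      /* standard Ethernet MTU               */

            /* Packets crossing the I/O channel, each paying one page remap and
             * one bridge forwarding operation, with and without TSO.           */
            unsigned without_tso = (tso_bytes + mtu_bytes - 1) / mtu_bytes; /* 44 */
            unsigned with_tso    = 1;

            printf("per-packet operations for 64 KB: %u without TSO, %u with TSO\n",
                   without_tso, with_tso);
            return 0;
        }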
  • Here, I’ll explain the I/O channel optimization. The I/O channel implements a zero-copy page remapping mechanism for transferring packets between the guest and driver domain: the physical page containing the packet is remapped into the address space of the target domain. The implementation of the I/O channel in Xen requires several address remaps and memory allocations. Each receive operation requires three address remaps and two memory allocation/deallocation operations, and each transmit operation requires two address remaps.
  • But in a transmit operation, the network packet needs to be mapped into the driver domain only when the destination of the packet is the driver domain itself, or when it is a broadcast packet. Therefore, the page remapping operation over the I/O channel can be avoided in the common transmit case. So the authors augmented the I/O channel with an out-of-band ‘header’ channel. The guest domain uses it to supply the header of the packet to be transmitted to the backend driver, and the backend driver reads the header from this channel to determine whether it needs to map in the entire packet.
  • This shows the optimization process.
  • With this mechanism, the two page remap operations are replaced, in the common case, by the cost of copying a small header. The ‘header’ channel for transferring the packet header is implemented using a separate set of shared pages between the guest and driver domain.
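    A minimal C sketch of the transmit-path decision described above; the structures and helpers are assumed for illustration and are not the real backend code.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct pkt_header { uint8_t bytes[64]; };   /* copied via the shared header pages */

        /* Assumed check: the Ethernet destination MAC starts the header, and the
         * low bit of its first byte marks broadcast/multicast frames. A real
         * backend would also compare against the driver domain's own address.  */
        static bool needs_full_packet(const struct pkt_header *h)
        {
            return (h->bytes[0] & 1) != 0;
        }

        static void bridge_forward(const struct pkt_header *h, uint32_t page_ref, bool mapped)
        {
            (void)h;
            printf("forwarding packet from guest page %u (%s)\n",
                   (unsigned)page_ref, mapped ? "payload mapped" : "header copy only");
        }

        /* Backend transmit: map the guest's page only in the rare case that the
         * driver domain itself must see the payload.                            */
        static void backend_xmit(const struct pkt_header *h, uint32_t guest_page_ref)
        {
            if (needs_full_packet(h)) {
                /* rare case: remap the packet page into the driver domain */
                bridge_forward(h, guest_page_ref, true);
            } else {
                /* common case: the small header copy replaces the page remaps */
                bridge_forward(h, guest_page_ref, false);
            }
        }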
  • The Xen I/O channel uses page remapping on the receive path to avoid the cost of an extra data copy. However, remapping is expensive when the packet is small, for example a few hundred bytes. In addition, transferring data by page transfer from the driver domain to the guest domain incurs further overheads. First, each network packet has to be allocated on a separate page so that it can be remapped. Second, the driver domain has to ensure that the remapping cannot leak information; it does this by zeroing out pages at initialization, whenever new pages are added to the memory allocator in the driver domain.
  • So this paper employs a data copy instead of page remapping. This results in a small improvement in receive performance. Additionally, it allows regular MTU-sized buffers to be used for network packets, which avoids the overhead of zeroing out pages. The data copy in the I/O channel is also implemented using a set of shared pages between the guest and driver domains.
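    A minimal sketch of the copy-based receive path, assuming MTU-sized shared buffers (names are illustrative, not Xen's):

        #include <string.h>
        #include <stddef.h>

        #define MTU 1500

        /* Lives in pages shared between the driver and guest domains. */
        struct shared_rx_slot {
            size_t        len;
            unsigned char data[MTU];
        };

        /* Driver-domain side: copy the received packet into a regular MTU-sized
         * shared buffer instead of remapping a whole (zeroed) page to the guest. */
        static void backend_deliver(struct shared_rx_slot *slot,
                                    const unsigned char *pkt, size_t len)
        {
            if (len > MTU)
                len = MTU;                  /* sketch: assume MTU-sized packets  */
            memcpy(slot->data, pkt, len);   /* this copy replaces the page remap */
            slot->len = len;
            /* ... then notify the guest over the event channel (omitted). */
        }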
  • From now on, I’ll explain the virtual memory optimizations. Guest operating systems running on Xen incur a significantly higher number of TLB misses for network workloads than native Linux. This is due to the increase in working set size when running on Xen. To address this, superpage and global page mappings are added to the Xen VMM.
  • A superpage mapping maps a contiguous virtual address range to a contiguous physical address range. On the x86 platform, one superpage entry covers 1024 pages of physical memory, which greatly increases the virtual memory coverage of the TLB and thus greatly reduces the capacity misses incurred in the TLB. But there are two issues. First, in order to use a superpage mapping, the guest OS must be able to determine the physical contiguity of its page frames within a superpage block, which is not possible in a fully virtualized system. Second, all page frames within a superpage block must have identical memory protection permissions.
  • In the Xen VMM, supporting superpages is simplified by the paravirtualization approach. The Xen VMM provides the guest OS with a pseudo-physical to physical page translation table, which the guest OS can use to determine the physical contiguity of pages.
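    A minimal sketch of the contiguity check a paravirtualized guest could perform with such a translation table; the table layout and names are assumptions for illustration.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stddef.h>

        #define PAGES_PER_SUPERPAGE 1024U   /* 4 MB / 4 KB pages on x86 */

        /* p2m: pseudo-physical frame number -> machine (physical) frame number,
         * as provided by the VMM; first_pfn is the first frame of a 4 MB block. */
        static bool superpage_block_contiguous(const uint32_t *p2m, uint32_t first_pfn)
        {
            uint32_t base_mfn = p2m[first_pfn];
            for (size_t i = 1; i < PAGES_PER_SUPERPAGE; i++) {
                if (p2m[first_pfn + i] != base_mfn + i)
                    return false;           /* a hole: fall back to 4 KB mappings */
            }
            return true;                    /* safe to install one superpage entry */
        }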
  • The support for global page table mappings allows certain page table entries to be marked ‘global’, which are then kept persistent in the TLB across TLB flushes. Xen does not allow guest OSes to use global mappings since it needs to fully flush the TLB when switching between domains. But, Xen itself uses global page mappings.
  • The authors modified Xen to allow guest OSes to use global page mappings. The VMM is modified to flush all TLB entries on each domain switch, which has the side effect that the VMM’s own global page mappings are also flushed on a domain switch. The use of global page mappings potentially improves TLB performance in the absence of domain switches.
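    For context, on x86 a full TLB flush that also drops global entries is commonly done by toggling CR4.PGE; the sketch below shows that general technique (GCC-style inline assembly, privileged code only) and is not necessarily how Xen implements it.

        #define X86_CR4_PGE (1UL << 7)      /* CR4.PGE: page global enable */

        static inline unsigned long read_cr4(void)
        {
            unsigned long v;
            __asm__ __volatile__("mov %%cr4, %0" : "=r"(v));
            return v;
        }

        static inline void write_cr4(unsigned long v)
        {
            __asm__ __volatile__("mov %0, %%cr4" : : "r"(v) : "memory");
        }

        /* Flush the whole TLB, including entries marked global: clearing and then
         * restoring CR4.PGE invalidates global translations as well.             */
        static void flush_tlb_all(void)
        {
            unsigned long cr4 = read_cr4();
            write_cr4(cr4 & ~X86_CR4_PGE);
            write_cr4(cr4);
        }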
  • From now on, I’ll show the evaluation results of the optimizations. They use two micro-benchmarks, a transmit benchmark and a receive benchmark. The server is a Dell PowerEdge with a 2.4 GHz Intel Xeon and four gigabit NICs. The clients are Intel Xeon machines with a similar configuration, each with one gigabit NIC that supports TSO, scatter/gather I/O, and checksum offload.
  • This graph compares the transmit throughput under five configurations. The performance of the Linux, Xen-driver, and Xen-driver-opt configurations is limited by the network interface bandwidth and does not fully saturate the CPU. The optimized guest domain configuration improves on the performance of the unoptimized configuration by a factor of 4. The optimizations yield a large improvement.
  • For the receive benchmark, the optimized Xen-driver version improves performance by 35%. But in the guest domain, the optimizations improve performance only slightly.
  • The use of the high-level virtual interface reduces the execution cost in the guest domain by almost a factor of 4, in the driver domain by a factor of 2.1, and in the Xen VMM by a factor of 1.9. These reductions come from the reduced per-byte overhead.
  • This shows the improvement from the I/O channel optimization. The I/O channel optimization reduces the execution overhead incurred in the Xen VMM by 38%.
  • This shows the improvement from the virtual memory optimizations. The use of superpages alone is sufficient to bring down data TLB misses by a factor of 3.8. The use of global mappings does not have a significant impact on data TLB misses, because frequent switches between the guest and driver domain cancel out any benefits of using global pages.

Presentation Transcript

  • Optimizing Network Virtualization in Xen. Aravind Menon (EPFL), Alan Cox (Rice University), Willy Zwaenepoel (EPFL). USENIX ATC ’06. Presented by Shingyu Kim, 2006. 12. 6
  • Motivation
    • Despite the advances in virtualization technology, the overhead of network I/O virtualization can still significantly affect the performance of network-intensive applications.
      • CPU utilization required to saturate a 100 Mbps network under Linux 2.2.17 running on VMware Workstation 2.0 was 5 to 6 times higher compared to the utilization under native Linux 2.2.17.
      • Even in the paravirtualized Xen 2.0 VMM, it shows similar results.
  • Motivation
  • Optimizations
    • Adding three capabilities to the virtualized network interface:
      • scatter/gather I/O
      • TCP/IP checksum offload
      • TCP segmentation offload (TSO)
    • A faster I/O channel for transferring network packets between the guest and driver domains.
      • avoid a data remap or copy in the common case, and a receive mechanism optimized for small data receives.
    • VMM support
      • Allow guest OSes to make use of superpage and global page mappings on the Intel x86 architecture.
  • Xen I/O Architecture. The I/O channel implements a zero-copy data transfer mechanism by remapping the physical page. Both the guest domains and the driver domain suffered from a significantly higher TLB miss rate compared to execution in native Linux. Additionally, guest domains suffered from much higher L2 cache misses compared to native Linux.
  • Virtual Interface Optimization
    • The authors propose a new virtual interface architecture in which the virtual network interface always supports a fixed set of high level network offload features , irrespective of whether these features are supported in the physical network interfaces.
      • The architecture makes use of offload features of the physical NIC itself if they match the offload requirements of the virtual network interface.
  • Virtual Interface Optimization -Software Offload Driver
    • The architecture introduces a ‘software offload’ driver , which intercepts all packets to be transmitted on the network interface.
    • When the guest domain’s packet arrives at the physical interface, the offload driver determines whether the offload requirements of the packet are compatible with the capabilities of the NIC.
    • In the absence of support from the physical NIC, the offload driver performs the necessary offload actions in software
  • Advantages of a High-level Interface -Scatter-Gather I/O
    • Support for scatter/gather I/O is especially useful for doing zero-copy network transmits, such as sendfile in Linux.
    • Gather I/O allows the OS to construct network packets consisting of multiple fragments directly from the file system buffers.
  • Advantages of a High-level Interface - TCP Segmentation Offload
    • Roughly 60% of the execution time for a transmit operation is spent in the VMM and driver domain for multiplexing the packet from the virtual to the physical network interface.
    • In the absence of support for TSO in the virtual interface, each 1500 (MTU) byte packet transmitted by the guest domain requires one page remap operation over the I/O channel and one forwarding operation over the network bridge.
  • Advantages of a High-level Interface - TCP Segmentation Offload
    • Supporting larger sized packets in the virtual network interface (using TSO) can thus significantly reduce the overheads incurred in network virtualization along the transmit path.
  • I/O Channel Optimization
    • The I/O channel implements a zero-copy page remapping mechanism for transferring packets between the guest and driver domain.
      • The physical page containing the packet is remapped into the address space of the target domain.
    • The implementation of the I/O channel in Xen
      • three address remaps and two memory allocation/deallocation operations are required for each packet receive operation.
      • two address remaps are required for each packet transmit operation .
  • I/O Channel Optimization -Transmit Path Optimization
    • The network packet needs to be mapped into the driver domain only when the destination of the packet is the driver domain itself, or when it is a broadcast packet.
    • Augmenting the I/O channel with an out-of-band ‘header’ channel
      • The guest domain uses it to supply the header of the packet to be transmitted to the backend driver.
      • The backend driver reads the header of the packet from this channel to determine if it needs to map in the entire packet.
  • I/O Channel Optimization -Transmit Path Optimization
    • The ‘header’ channel for transferring the packet header is implemented using a separate set of shared pages between the guest and driver domain.
    (Figure: guest OS, driver domain, and VMM; the packet header is copied over shared page frames while the payload remains in the guest's physical page frames.)
  • I/O Channel Optimization -Receive Path Optimization
    • The Xen I/O channel uses page remapping on the receive path to avoid the cost of an extra data copy.
    • However, remapping costs a lot when the packet size is small.
    • Data transfer by page transfer from the driver domain to the guest domain incurs some additional overheads.
      • Each network packet has to be allocated on a separate page.
      • No potential leakage of information by the remapping → zeroing at initialization
  • I/O Channel Optimization -Receive Path Optimization
    • This paper employs data copy instead of page remapping. → small improvement
    • The data copy in the I/O channel is also implemented using a set of shared pages between the guest and driver domains.
  • Virtual Memory Optimization
    • Problem
      • Guest operating systems running on Xen incurred a significantly higher number of TLB misses for network workloads relative to the TLB misses in native Linux execution.
      • Increased working set size lead to higher TLB misses in the guest OS.
  • Virtual Memory Optimization -Superpage Mappings
    • Definition. A superpage mapping maps a contiguous virtual address range to a contiguous physical address range.
      • In order to use a superpage mapping, the guest OS must be able to determine physical contiguity of its page frames within a superpage block. → can be problematic with the “pseudo-physical” address mapping.
      • All page frames within a superpage block must have identical memory protection permissions.
        • The guest OS cannot modify them without notifying the VMM. Similarly, the GDT and LDT pages on the x86 architecture must be set read-only.
  • Virtual Memory Optimization -Superpage Mappings
    • In the Xen VMM, supporting superpages for guest OSes is simplified because of the use of the paravirtualization approach.
    • The Xen VMM provides the guest OS with a pseudo-physical to physical page translation table, which can be used by the guest OS to determine the physical contiguity of pages.
      • The VMM tries to allocate memory to the guest OS in chunks of superpage size, i.e., 4 MB.
      • The guest OS is modified so that it uses superpage mappings for a virtual address range only if it determines that the underlying set of physical pages is also contiguous.
  • Virtual Memory Optimization -Global Mappings
    • The support for global page table mappings in the processor allows certain page table entries to be marked ‘global’, which are then kept persistent in the TLB across TLB flushes (for example, on context switch).
      • Xen does not allow guest OSes to use global mappings since it needs to fully flush the TLB when switching between domains.
  • Virtual Memory Optimization -Global Mappings
    • On each domain switch, the VMM is modified to flush all TLB entries. → side effect: the VMM’s global page mappings are also invalidated on a domain switch.
    • The use of global page mappings potentially improves the TLB performance in the absence of domain switches.
  • Evaluation
    • Two micro-benchmarks
      • A transmit and a receive benchmark, to evaluate the networking performance of guest and driver domains.
      • It is similar to the netperf TCP streaming benchmark.
      • The benchmark is modified to use the zero-copy sendfile system call for transmit operations.
    • Server
      • Dell PowerEdge 1600 SC, 2.4 GHz Intel Xeon machine.
      • This machine has four Intel Pro-1000 Gigabit NICs.
    • Clients
      • Intel Xeon with Intel Pro-1000 Gigabit NIC per machine.
      • All the NICs have support for TSO, scatter/gather I/O and checksum offload.
    • OS
      • Linux 2.6.11
      • Xen-2
  • Evaluation -Transmit Throughput (CPU utilization: 40%, 46%, and 43%)
  • Evaluation -Receive Throughput
  • Evaluation -High-level Interface (execution cost reduced by a factor of 4 in the guest domain, 2.1 in the driver domain, and 1.9 in the Xen VMM)
  • Evaluation -I/O Channel Optimization. The I/O channel optimization reduces the execution overhead incurred in the Xen VMM by 38%.
  • Evaluation -Super Page & Global Mapping
    • The use of superpages alone is sufficient to bring down the data TLB misses by a factor of 3.8.
    • The use of global mappings does not have a significant impact on data TLB misses since frequent switches between the guest and driver domain cancel out any benefits of using global pages.
  • Conclusion
    • This paper presented a number of optimizations to the Xen network virtualization architecture.
    • The optimizations improve the transmit throughput of guest domains by a factor of 4.4, and the receive throughput in the driver domain by 35%.
    • The receive performance of guest domains is still a significant bottleneck that remains to be solved.
    • Any questions?