Transcript of "Optimizing Network Virtualization in Xen"
Optimizing Network Virtualization in Xen
Aravind Menon (EPFL, Switzerland), Alan L. Cox (Rice University, Houston), Willy Zwaenepoel (EPFL, Switzerland)
Abstract

In this paper, we propose and evaluate three techniques for optimizing network performance in the Xen virtualized environment. Our techniques retain the basic Xen architecture of locating device drivers in a privileged 'driver' domain with access to I/O devices, and providing network access to unprivileged 'guest' domains through virtualized network interfaces.
First, we redefine the virtual network interfaces of guest domains to incorporate high-level network offload features available in most modern network cards. We demonstrate the performance benefits of high-level offload functionality in the virtual interface, even when such functionality is not supported in the underlying physical interface. Second, we optimize the implementation of the data transfer path between guest and driver domains. The optimization avoids expensive data remapping operations on the transmit path, and replaces page remapping by data copying on the receive path. Finally, we provide support for guest operating systems to effectively utilize advanced virtual memory features such as superpages and global page mappings.
The overall impact of these optimizations is an improvement in transmit performance of guest domains by a factor of 4.4. The receive performance of the driver domain is improved by 35% and reaches within 7% of native Linux performance. The receive performance in guest domains improves by 18%, but still trails the native Linux performance by 61%. We analyse the performance improvements in detail, and quantify the contribution of each optimization to the overall performance.

1 Introduction

In recent years, there has been a trend towards running network intensive applications, such as Internet servers, in virtual machine (VM) environments, where multiple VMs running on the same machine share the machine's network resources. In such an environment, the virtual machine monitor (VMM) virtualizes the machine's network I/O devices to allow multiple operating systems running in different VMs to access the network concurrently.
Despite the advances in virtualization technology [15, 4], the overhead of network I/O virtualization can still significantly affect the performance of network-intensive applications. For instance, Sugerman et al. report that the CPU utilization required to saturate a 100 Mbps network under Linux 2.2.17 running on VMware Workstation 2.0 was 5 to 6 times higher compared to the utilization under native Linux 2.2.17. Even in the paravirtualized Xen 2.0 VMM, Menon et al. report significantly lower network performance under a Linux 2.6.10 guest domain, compared to native Linux performance. They report performance degradation by a factor of 2 to 3 for receive workloads, and by a factor of 5 for transmit workloads. The latter study, unlike previously reported results on Xen [4, 5], reports network performance under configurations leading to full CPU saturation.
In this paper, we propose and evaluate a number of optimizations for improving the networking performance under the Xen 2.0 VMM. These optimizations address many of the performance limitations identified by Menon et al. Starting with version 2.0, the Xen VMM adopted a network I/O architecture that is similar to the hosted virtual machine model. In this architecture, a physical network interface is owned by a special, privileged VM called a driver domain that executes the native Linux network interface driver. In contrast, an ordinary VM called a guest domain is given access to a virtualized network interface. The virtualized network interface has a front-end device in the guest domain and a back-end device in the corresponding driver domain. The front-end and back-end devices transfer network packets between their domains over an I/O channel that is provided by the Xen VMM. Within the driver domain, either Ethernet bridging or IP routing is used to demultiplex incoming network packets and to multiplex outgoing network packets between the physical network interface and the guest domain through its corresponding back-end device.
Our optimizations fall into the following three categories:

1. We add three capabilities to the virtualized network interface: scatter/gather I/O, TCP/IP checksum offload, and TCP segmentation offload (TSO). Scatter/gather I/O and checksum offload improve performance in the guest domain. Scatter/gather I/O eliminates the need for data copying by the Linux implementation of sendfile(). TSO improves performance throughout the system. In addition to its well-known effects on TCP performance [11, 9] benefiting the guest domain, it improves performance in the Xen VMM and driver domain by reducing the number of network packets that they must handle.

2. We introduce a faster I/O channel for transferring network packets between the guest and driver domains. The optimizations include transmit mechanisms that avoid a data remap or copy in the common case, and a receive mechanism optimized for small data receives.

3. We present VMM support mechanisms for allowing the use of efficient virtual memory primitives in guest operating systems. These mechanisms allow guest OSes to make use of superpage and global page mappings on the Intel x86 architecture, which significantly reduce the number of TLB misses incurred by guest domains.

Overall, our optimizations improve the transmit performance in guest domains by a factor of 4.4. Receive side network performance is improved by 35% for driver domains, and by 18% for guest domains. We also present a detailed breakdown of the performance benefits resulting from each individual optimization. Our evaluation demonstrates the performance benefits of TSO support in the virtual network interface even in the absence of TSO support in the physical network interface. In other words, emulating TSO in software in the driver domain results in higher network performance than performing TCP segmentation in the guest domain.
The outline for the rest of the paper is as follows. In section 2, we describe the Xen network I/O architecture and summarize its performance overheads as described in previous research. In section 3, we describe the design of the new virtual interface architecture. In section 4, we describe our optimizations to the I/O channel. Section 5 describes the new virtual memory optimization mechanisms added to Xen. Section 6 presents an evaluation of the different optimizations described. In section 7, we discuss related work, and we conclude in section 8.

2 Xen Network I/O Architecture

The network I/O virtualization architecture in Xen can be a significant source of overhead for networking performance in guest domains. In this section, we describe the overheads associated with different aspects of the Xen network I/O architecture, and their overall impact on guest network performance. To understand these overheads better, we first describe the network virtualization architecture used in Xen.
The Xen VMM uses an I/O architecture which is similar to the hosted VMM architecture. Privileged domains, called 'driver' domains, use their native device drivers to access I/O devices directly, and perform I/O operations on behalf of other unprivileged domains, called guest domains. Guest domains use virtual I/O devices controlled by paravirtualized drivers to request device access from the driver domain.
The network architecture used in Xen is shown in figure 1. Xen provides each guest domain with a number of virtual network interfaces, which are used by the guest domain for all its network communications. Corresponding to each virtual interface in a guest domain, a 'backend' interface is created in the driver domain, which acts as the proxy for that virtual interface in the driver domain. The virtual and backend interfaces are 'connected' to each other over an 'I/O channel'. The I/O channel implements a zero-copy data transfer mechanism for exchanging packets between the virtual and backend interfaces by remapping the physical page containing the packet into the target domain.

Figure 1: Xen Network I/O Architecture (the guest domain's virtual interface connects over the I/O channel to a backend interface in the driver domain, where a bridge and the NIC driver connect it to the physical NIC)
All the backend interfaces in the driver domain (corresponding to the virtual interfaces) are connected to the physical NIC and to each other through a virtual network bridge. The combination of the I/O channel and network bridge thus establishes a communication path between the guest's virtual interfaces and the physical interface.
Table 1 compares the transmit and receive bandwidth achieved in a Linux guest domain using this virtual interface architecture, with the performance achieved with the benchmark running in the driver domain and in native Linux. These results are obtained using a netperf-like benchmark which used the zero-copy sendfile() for transmit.

Configuration   Receive (Mb/s)   Transmit (Mb/s)
Linux           2508 (100%)      3760 (100%)
Xen driver      1738 (69.3%)     3760 (100%)
Xen guest        820 (32.7%)      750 (19.9%)

Table 1: Network performance under Xen

The driver domain configuration shows performance comparable to native Linux for the transmit case and a degradation of 30% for the receive case. However, this configuration uses native device drivers to directly access the network device, and thus the virtualization overhead is limited to some low-level functions such as interrupts. In contrast, in the guest domain configuration, which uses virtualized network interfaces, the impact of network I/O virtualization is much more pronounced. The receive performance in guest domains suffers a degradation of 67% relative to native Linux, and the transmit performance achieves only 20% of the throughput achievable under native Linux.
Menon et al. report similar results. Moreover, they introduce Xenoprof, a variant of Oprofile for Xen, and apply it to break down the performance of a similar benchmark. Their study made the following observations about the overheads associated with different aspects of the Xen network virtualization architecture.
Since each packet transmitted or received on a guest domain's virtual interface had to pass through the I/O channel and the network bridge, a significant fraction of the network processing time was spent in the Xen VMM and the driver domain. For instance, 70% of the execution time for receiving a packet in the guest domain was spent in transferring the packet through the driver domain and the Xen VMM from the physical interface. Similarly, 60% of the processing time for a transmit operation was spent in transferring the packet from the guest's virtual interface to the physical interface. The breakdown of this processing overhead was roughly 40% Xen, 30% driver domain for receive traffic, and 30% Xen, 30% driver domain for transmit traffic.
In general, both the guest domains and the driver domain were seen to suffer from a significantly higher TLB miss rate compared to execution in native Linux. Additionally, guest domains were seen to suffer from much higher L2 cache misses compared to native Linux.

3 Virtual Interface Optimizations

Network I/O is supported in guest domains by providing each guest domain with a set of virtual network interfaces, which are multiplexed onto the physical interfaces using the mechanisms described in section 2. The virtual network interface provides the abstraction of a simple, low-level network interface to the guest domain, which uses paravirtualized drivers to perform I/O on this interface. The network I/O operations supported on the virtual interface consist of simple network transmit and receive operations, which are easily mappable onto corresponding operations on the physical NIC.
Choosing a simple, low-level interface for virtual network interfaces allows the virtualized interface to be easily supported across the large number of physical interfaces available, each with different network processing capabilities. However, this also prevents the virtual interface from taking advantage of the network offload capabilities of the physical NIC, such as checksum offload, scatter/gather DMA support, and TCP segmentation offload (TSO).

3.1 Virtual Interface Architecture

We propose a new virtual interface architecture in which the virtual network interface always supports a fixed set of high-level network offload features, irrespective of whether these features are supported in the physical network interfaces. The architecture makes use of the offload features of the physical NIC itself if they match the offload requirements of the virtual network interface. If the required features are not supported by the physical interface, the new interface architecture provides support for these features in software.
Figure 2 shows the top-level design of the virtual interface architecture. The virtual interface in the guest domain supports a set of high-level offload features, which are reported to the guest OS by the front-end driver controlling the interface. In our implementation, the features supported by the virtual interface are checksum offloading, scatter/gather I/O and TCP segmentation offloading.
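In a Linux guest, such capabilities are advertised to the network stack through feature flags on the driver's net_device structure. The following user-space sketch illustrates the idea; the flag names are abbreviated stand-ins for Linux's NETIF_F_SG, NETIF_F_IP_CSUM and NETIF_F_TSO, and xennet_init is a hypothetical front-end initialization hook, not the actual XenoLinux code.

    #include <stdio.h>

    /* Offload feature bits, modeled on Linux's NETIF_F_* flags. */
    #define F_SG      0x1u  /* scatter/gather I/O       */
    #define F_IP_CSUM 0x2u  /* TCP/IP checksum offload  */
    #define F_TSO     0x4u  /* TCP segmentation offload */

    struct net_device {
        const char *name;
        unsigned    features;
    };

    /* The front-end driver always reports the same fixed set of
     * high-level offload features to the guest OS; features the
     * physical NIC lacks are emulated later by the offload driver. */
    static void xennet_init(struct net_device *dev)
    {
        dev->features = F_SG | F_IP_CSUM | F_TSO;
    }

    int main(void)
    {
        struct net_device vif = { .name = "veth0" };
        xennet_init(&vif);
        printf("%s features: 0x%x\n", vif.name, vif.features);
        return 0;
    }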
Supporting high-level features like scatter/gather I/O and TSO allows the guest domain to transmit network packets in sizes much bigger than the network MTU, which can consist of multiple fragments. These large, fragmented packets are transferred from the virtual to the physical interface over a modified I/O channel and network bridge (the I/O channel is modified to allow transfer of packets consisting of multiple fragments, and the network bridge is modified to support forwarding of packets larger than the MTU).

Figure 2: New I/O Architecture (as figure 1, but with a software offload driver interposed between the bridge and the NIC driver in the driver domain, and a high-level virtual interface in the guest domain)

The architecture introduces a new component, a 'software offload' driver, which sits above the network device driver in the driver domain, and intercepts all packets to be transmitted on the network interface. When the guest domain's packet arrives at the physical interface, the offload driver determines whether the offload requirements of the packet are compatible with the capabilities of the NIC, and takes different actions accordingly.
If the NIC supports the offload features required by the packet, the offload driver simply forwards the packet to the network device driver for transmission.
In the absence of support from the physical NIC, the offload driver performs the necessary offload actions in software. Thus, if the NIC does not support TSO, the offload driver segments the large packet into appropriately MTU sized packets before sending them for transmission. Similarly, it takes care of the other offload requirements, scatter/gather I/O and checksum offloading, if these are not supported by the NIC.
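A compilable sketch of this dispatch logic is shown below, under the assumption of simple capability bits and stubbed software fallbacks; the names and the struct pkt layout are ours, for illustration only.

    #include <stdbool.h>
    #include <stdio.h>

    #define NIC_SG   0x1u  /* gather DMA        */
    #define NIC_CSUM 0x2u  /* checksum offload  */
    #define NIC_TSO  0x4u  /* TSO               */

    struct pkt { unsigned len, mtu, nfrags; bool needs_csum; };

    /* Stubbed software fallbacks; a real driver transforms the
     * packet here. */
    static void sw_segment_tso(struct pkt *p) { puts("segmenting in sw");  p->len = p->mtu; }
    static void sw_checksum(struct pkt *p)    { puts("checksumming in sw"); p->needs_csum = false; }
    static void sw_linearize(struct pkt *p)   { puts("linearizing in sw");  p->nfrags = 1; }
    static void nic_xmit(struct pkt *p)       { printf("xmit len=%u frags=%u\n", p->len, p->nfrags); }

    static void offload_xmit(struct pkt *p, unsigned nic_caps)
    {
        if (p->len > p->mtu && !(nic_caps & NIC_TSO))
            sw_segment_tso(p);               /* NIC cannot segment   */
        if (p->needs_csum && !(nic_caps & NIC_CSUM))
            sw_checksum(p);                  /* NIC cannot checksum  */
        if (p->nfrags > 1 && !(nic_caps & NIC_SG))
            sw_linearize(p);                 /* NIC lacks gather DMA */
        nic_xmit(p);
    }

    int main(void)
    {
        struct pkt p = { .len = 64 * 1024, .mtu = 1500,
                         .nfrags = 16, .needs_csum = true };
        offload_xmit(&p, 0);    /* a NIC with no offload features */
        return 0;
    }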
3.2 Advantage of a high level interface

A high level virtual network interface can reduce the network processing overhead in the guest domain by allowing it to offload some network processing to the physical NIC. Support for scatter/gather I/O is especially useful for doing zero-copy network transmits, such as sendfile in Linux. Gather I/O allows the OS to construct network packets consisting of multiple fragments directly from the file system buffers, without having to first copy them out to a contiguous location in memory. Support for TSO allows the OS to do transmit processing on network packets of size much larger than the MTU, thus reducing the per-byte processing overhead by requiring fewer packets to transmit the same amount of data.
Apart from its benefits for the guest domain, a significant advantage of a high level interface comes from its impact on improving the efficiency of the network virtualization architecture in Xen.
As described in section 2, roughly 60% of the execution time for a transmit operation from a guest domain is spent in the VMM and driver domain for multiplexing the packet from the virtual to the physical network interface. Most of this overhead is a per-packet overhead incurred in transferring the packet over the I/O channel and the network bridge, and in packet processing overheads at the backend and physical network interfaces. Using a high level virtual interface improves the efficiency of the virtualization architecture by reducing the number of packet transfers required over the I/O channel and network bridge for transmitting the same amount of data (by using TSO), and thus reducing the per-byte virtualization overhead incurred by the driver domain and the VMM.
For instance, in the absence of support for TSO in the virtual interface, each 1500 (MTU) byte packet transmitted by the guest domain requires one page remap operation over the I/O channel and one forwarding operation over the network bridge. In contrast, if the virtual interface supports TSO, the OS uses much bigger packets comprising multiple pages, and in this case each remap operation over the I/O channel can potentially transfer 4096 bytes of data. Similarly, with larger packets, far fewer packet forwarding operations are required over the network bridge.
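As a worked example (our arithmetic, using the MTU and page size quoted above): transmitting 64 KB of TCP data without TSO requires ceil(65536/1500) = 44 MTU-sized packets, hence 44 remap operations over the I/O channel and 44 forwarding operations over the network bridge. With TSO, the same 64 KB can be sent as a single packet spanning 65536/4096 = 16 pages, costing 16 page remaps and a single bridge forwarding operation: roughly a 3x reduction in remaps and a 44x reduction in bridge operations.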
Supporting larger sized packets in the virtual network interface (using TSO) can thus significantly reduce the overheads incurred in network virtualization along the transmit path. It is for this reason that the software offload driver is situated in the driver domain at a point where the transmit packet has already covered much of the multiplexing path and is ready for transmission on the physical interface.
Thus, even when the offload driver has to perform the network offloading for the packet in software, we expect the benefits of reduced virtualization overhead along the transmit path to show up as overall performance benefits. Our evaluation in section 6 shows that this is indeed the case.

4 I/O Channel Optimizations

Previous research noted that 30-40% of the execution time for a network transmit or receive operation was spent in the Xen VMM, which included the time for page remapping and ownership transfers over the I/O channel, and the switching overheads between the guest and driver domain.
The I/O channel implements a zero-copy page remapping mechanism for transferring packets between the guest and driver domain. The physical page containing the packet is remapped into the address space of the target domain. In the case of a receive operation, the ownership of the physical page itself, which contains the packet, is transferred from the driver to the guest domain. (Both these operations require each network packet to be allocated on a separate physical page.)
A study of the implementation of the I/O channel in Xen reveals that three address remaps and two memory allocation/deallocation operations are required for each packet receive operation, and two address remaps are required for each packet transmit operation. On the receive path, for each network packet (physical page) transferred from the driver to the guest domain, the guest domain releases ownership of a page to the hypervisor, and the driver domain acquires a replacement page from the hypervisor, to keep the overall memory allocation constant. Further, each page acquire or release operation requires a change in the virtual address mapping. Thus overall, for each receive operation on a virtual interface, two page allocation/freeing operations and three address remapping operations are required. On the transmit path, the ownership transfer operation is avoided, as the packet can be directly mapped into the privileged driver domain; each transmit operation thus incurs two address remapping operations.
We now describe alternate mechanisms for packet transfer on the transmit path and the receive path.
4.1 Transmit Path Optimization

We observe that for network transmit operations, the driver domain does not need to map in the entire packet from the guest domain if the destination of the packet is any host other than the driver domain itself. In order to forward the packet over the network bridge (from the guest domain's backend interface to its target interface), the driver domain only needs to examine the MAC header of the packet. Thus, if the packet header can be supplied to the driver domain separately, the rest of the network packet does not need to be mapped in. The network packet needs to be mapped into the driver domain only when the destination of the packet is the driver domain itself, or when it is a broadcast packet.
We use this observation to avoid the page remapping operation over the I/O channel in the common transmit case. We augment the I/O channel with an out-of-band 'header' channel, which the guest domain uses to supply the header of the packet to be transmitted to the backend driver. The backend driver reads the header of the packet from this channel to determine whether it needs to map in the entire packet (i.e., whether the destination of the packet is the driver domain itself or the broadcast address). It then constructs a network packet from the packet header and the (possibly unmapped) packet fragments, and forwards this over the network bridge.
To ensure correct execution of the network driver (or the backend driver) for this packet, the backend ensures that the pseudo-physical to physical mappings of the packet fragments are set correctly (when a foreign page frame is mapped into the driver domain, the pseudo-physical to physical mappings for the page have to be updated). Since the network driver uses only the physical addresses of the packet fragments, and not their virtual addresses, it is safe to pass an unmapped page fragment to the driver.
The 'header' channel for transferring the packet header is implemented using a separate set of pages, shared between the guest and driver domain, for each virtual interface of the guest domain. With this mechanism, the cost of two page remaps per transmit operation is replaced, in the common case, by the cost of copying a small header.
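A sketch of the backend's destination test follows, assuming a simplified Ethernet header type and that local_mac holds the driver domain's own address; this is illustrative code, not the actual backend driver.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    struct ethhdr { uint8_t dst[6], src[6]; uint16_t proto; };

    static bool is_broadcast(const uint8_t mac[6])
    {
        static const uint8_t bcast[6] = { 0xff, 0xff, 0xff,
                                          0xff, 0xff, 0xff };
        return memcmp(mac, bcast, 6) == 0;
    }

    /* The backend reads only the header from the out-of-band
     * 'header' channel. The packet body must be mapped in only if
     * the packet is addressed to the driver domain itself or is a
     * broadcast; in the common case the bridge forwards the packet
     * using the header alone and the body stays unmapped. */
    static bool must_map_full_packet(const struct ethhdr *h,
                                     const uint8_t local_mac[6])
    {
        return is_broadcast(h->dst) ||
               memcmp(h->dst, local_mac, 6) == 0;
    }

    int main(void)
    {
        const uint8_t local[6] = { 0, 1, 2, 3, 4, 5 };
        struct ethhdr h = { .dst = { 0, 1, 2, 3, 4, 9 } };
        return must_map_full_packet(&h, local); /* 0: forward as-is */
    }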
We note that this mechanism requires the physical network interface to support gather DMA, since the transmitted packet consists of a packet header and unmapped fragments. If the NIC does not support gather DMA, the entire packet needs to be mapped in before it can be copied out to a contiguous memory location. This check is performed by the software offload driver (described in the previous section), which performs the remap operation if necessary.

4.2 Receive Path Optimization

The Xen I/O channel uses page remapping on the receive path to avoid the cost of an extra data copy. Previous research has also shown that avoiding data copy in network operations significantly improves system performance.
However, we note that the I/O channel mechanism can incur significant overhead if the size of the packet transferred is very small, for example a few hundred bytes. In this case, it may not be worthwhile to avoid the data copy.
Further, data transfer by page transfer from the driver domain to the guest domain incurs some additional overheads. Firstly, each network packet has to be allocated on a separate page, so that it can be remapped. Additionally, the driver domain has to ensure that there is no potential leakage of information through the remapping of a page from the driver to a guest domain. The driver domain ensures this by zeroing out, at initialization, all pages which can potentially be transferred out to guest domains. (The zeroing is done whenever new pages are added to the memory allocator in the driver domain used for allocating network socket buffers.)
We investigate the alternate mechanism, in which packets are transferred to the guest domain by data copy.
As our evaluation shows, packet transfer by data copy instead of page remapping in fact results in a small improvement in the receive performance of the guest domain. In addition, using copying instead of remapping allows us to use regular MTU sized buffers for network packets, which avoids the overheads incurred by the need to zero out pages in the driver domain.
The data copy in the I/O channel is implemented using a set of shared pages between the guest and driver domains, which is established at setup time. A single set of pages is shared for all the virtual network interfaces in the guest domain (in order to restrict the copying overhead for a large working set size). Only one extra data copy is involved, from the driver domain's network packets to the shared memory pool.
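A minimal sketch of the receive-side copy under these assumptions; the fixed-size pool and round-robin index are our own simplification, and the ring and event notification machinery is omitted.

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE  4096u
    #define POOL_PAGES 256u

    /* One pool shared by all virtual interfaces of a guest, keeping
     * the copy working set small. */
    struct shared_pool {
        unsigned char pages[POOL_PAGES][PAGE_SIZE];
        unsigned      next;     /* next free page (illustrative) */
    };

    /* Copy a received packet into the shared pool instead of
     * remapping its page: one extra copy, but no page-ownership
     * transfer, no page zeroing, and ordinary MTU-sized socket
     * buffers suffice. */
    static void *deliver_rx(struct shared_pool *pool,
                            const void *pkt, size_t len)
    {
        if (len > PAGE_SIZE)
            return NULL;                     /* MTU-sized packets    */
        void *dst = pool->pages[pool->next];
        pool->next = (pool->next + 1) % POOL_PAGES;
        memcpy(dst, pkt, len);               /* the single data copy */
        return dst;                          /* guest reads from here */
    }

    int main(void)
    {
        static struct shared_pool pool;
        char pkt[1500] = "payload";
        return deliver_rx(&pool, pkt, sizeof pkt) ? 0 : 1;
    }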
Sharing memory pages between the guest and the driver domain (for both the receive path and for the header channel in the transmit path) does not introduce any new vulnerability for the driver domain. The guest domain cannot crash the driver domain by doing invalid writes to the shared memory pool since, on the receive path, the driver domain does not read from the shared pool, and on the transmit path, it would need to read the packet headers from the guest domain even in the original I/O channel implementation.

5 Virtual Memory Optimizations

It was noted in a previous study that guest operating systems running on Xen (both driver and guest domains) incurred a significantly higher number of TLB misses for network workloads (more than an order of magnitude higher) relative to the TLB misses in native Linux execution. It was conjectured that this was due to the increase in working set size when running on Xen.
We show that it is the absence of support for certain virtual memory primitives, such as superpage mappings and global page table mappings, that leads to a marked increase in TLB miss rate for guest OSes running on Xen.

5.1 Virtual Memory features

Superpage and global page mappings are features introduced in the Intel x86 processor series starting with the Pentium and Pentium Pro processors respectively.
A superpage mapping allows the operating system to specify the virtual address translation for a large set of pages, instead of at the granularity of individual pages, as with regular paging. A superpage page table entry provides address translation from a set of contiguous virtual address pages to a set of contiguous physical address pages. On the x86 platform, one superpage entry covers 1024 pages of physical memory, which greatly increases the virtual memory coverage of the TLB and thus greatly reduces the capacity misses incurred in the TLB. Many operating systems use superpages to improve their overall TLB performance: Linux uses superpages to map the 'linear' (lowmem) part of the kernel address space; FreeBSD supports superpage mappings for both user-space and kernel-space translations.
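To put the coverage gain in numbers (our arithmetic, assuming the 4 KB base page size of x86): one superpage entry maps 1024 x 4 KB = 4 MB. A data TLB with, say, 64 entries therefore reaches 64 x 4 KB = 256 KB of memory with regular pages, but 64 x 4 MB = 256 MB with superpages, a 1024-fold increase in reach for the same number of entries.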
The support for global page table mappings in the processor allows certain page table entries to be marked 'global'; these are kept persistent in the TLB across TLB flushes (for example, on a context switch). Linux uses global page mappings to map the kernel part of the address space of each process, since this part is common to all processes. By doing so, the kernel address mappings are not flushed from the TLB when the processor switches from one process to another.

5.2 Issues in supporting VM features

5.2.1 Superpage Mappings

A superpage mapping maps a contiguous virtual address range to a contiguous physical address range. Thus, in order to use a superpage mapping, the guest OS must be able to determine the physical contiguity of its page frames within a superpage block. This is not possible in a fully virtualized system like VMware ESX server, where the guest OS uses contiguous 'pseudo-physical' addresses, which are transparently mapped to discontiguous physical addresses.
A second issue with the use of superpages is the fact that all page frames within a superpage block must have identical memory protection permissions. This can be problematic in a VM environment because the VMM may want to set special protection bits for certain page frames. As an example, page frames containing page tables of the guest OS must be set read-only so that the guest OS cannot modify them without notifying the VMM. Similarly, the GDT and LDT pages on the x86 architecture must be set read-only. These special permission pages interfere with the use of superpages: a superpage block containing such special pages must use regular granularity paging to support different permissions for different constituent pages.

5.2.2 Global Mappings

Xen does not allow guest OSes to use global mappings since it needs to fully flush the TLB when switching between domains. Xen itself uses global page mappings to map its own address range.
5.3 Support for Virtual Memory primitives in Xen

5.3.1 Superpage Mappings

In the Xen VMM, supporting superpages for guest OSes is simplified by the use of the paravirtualization approach: the guest OS is already aware of the physical layout of its memory pages. The Xen VMM provides the guest OS with a pseudo-physical to physical page translation table, which can be used by the guest OS to determine the physical contiguity of its pages.
To allow the use of superpage mappings in the guest OS, we modify the Xen VMM and the guest OS to cooperate with each other over the allocation of physical memory and the use of superpage mappings.
The VMM is modified so that, for each guest OS, it tries to give the OS an initial memory allocation such that page frames within a superpage block are also physically contiguous (basically, the VMM tries to allocate memory to the guest OS in chunks of superpage size, i.e., 4 MB). Since this is not always possible, the VMM does not guarantee that the page allocation for each superpage range is physically contiguous. The guest OS is modified so that it uses superpage mappings for a virtual address range only if it determines that the underlying set of physical pages is also contiguous.
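A sketch of the contiguity test the modified guest OS can perform, with p2m[] standing in for the pseudo-physical to physical translation table that Xen provides (illustrative code):

    #include <stdbool.h>
    #include <stddef.h>

    #define SUPERPAGE_PAGES 1024u   /* 4 MB / 4 KB on x86 */

    /* Return true if the SUPERPAGE_PAGES pseudo-physical frames
     * starting at pfn map to physically contiguous machine frames,
     * i.e. the range may safely be mapped by one superpage entry. */
    static bool superpage_contiguous(const unsigned long *p2m,
                                     size_t nframes, unsigned long pfn)
    {
        if (pfn % SUPERPAGE_PAGES || pfn + SUPERPAGE_PAGES > nframes)
            return false;           /* must be superpage-aligned */
        for (unsigned i = 1; i < SUPERPAGE_PAGES; i++)
            if (p2m[pfn + i] != p2m[pfn] + i)
                return false;
        return true;
    }

    int main(void)
    {
        static unsigned long p2m[2048];
        for (unsigned long i = 0; i < 2048; i++)
            p2m[i] = i + 100;       /* contiguous machine frames */
        return superpage_contiguous(p2m, 2048, 1024) ? 0 : 1;
    }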
As noted above in section 5.2.1, the use of pages with restricted permissions, such as pagetable (PT) pages, prevents the guest OS from using a superpage to cover the physical pages in that address range. As new processes are created in the system, the OS frequently needs to allocate (read-only) pages for the process's page tables. Each such allocation of a read-only page potentially forces the OS to convert the superpage mapping covering that page to a two-level regular page mapping. With the proliferation of such read-only pages over time, the OS would end up using regular paging to address much of its address space.
The basic problem here is that the PT page frames for new processes are allocated randomly from all over memory, without any locality, thus breaking multiple superpage mappings. We solve this problem by using a special memory allocator in the guest OS for allocating page frames with restricted permissions. This allocator tries to group together all memory pages with restricted permissions into a contiguous range within a superpage. When the guest OS allocates a PT frame from this allocator, the allocator reserves the entire superpage containing this PT frame for future use. It then marks the entire superpage as read-only, and reserves the pages of this superpage for read-only use. On subsequent requests for PT frames from the guest OS, the allocator returns pages from the reserved set of pages. PT page frames freed by the guest OS are returned to this reserved pool. Thus, this mechanism collects all pages with the same permission into a separate superpage, and avoids the breakup of superpages into regular pages.
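The reservation policy can be sketched as follows; alloc_superpage and mark_readonly are stand-ins for the guest kernel's allocator and pagetable-protection hooks, and the bounded pool is a simplification (a sketch, not the actual allocator):

    #include <stdlib.h>

    #define PAGE_SIZE       4096u
    #define SUPERPAGE_PAGES 1024u

    /* Stand-ins for kernel services. */
    static void *alloc_superpage(void)
    {
        return aligned_alloc(SUPERPAGE_PAGES * PAGE_SIZE,
                             SUPERPAGE_PAGES * PAGE_SIZE);
    }
    static void mark_readonly(void *page) { (void)page; }

    static void *reserved[SUPERPAGE_PAGES];
    static unsigned nreserved;

    /* Allocate a pagetable (read-only) frame. The first request
     * carves out a whole superpage, write-protects every page in
     * it, and keeps the rest in a reserved pool, so read-only pages
     * cluster in one superpage instead of breaking many of them. */
    static void *alloc_pt_frame(void)
    {
        if (nreserved == 0) {
            char *sp = alloc_superpage();
            if (!sp)
                return NULL;
            for (unsigned i = 0; i < SUPERPAGE_PAGES; i++) {
                mark_readonly(sp + i * PAGE_SIZE);
                reserved[nreserved++] = sp + i * PAGE_SIZE;
            }
        }
        return reserved[--nreserved];
    }

    /* Freed PT frames return to the reserved pool, not the general
     * allocator, preserving the read-only superpage. */
    static void free_pt_frame(void *page)
    {
        if (nreserved < SUPERPAGE_PAGES)
            reserved[nreserved++] = page;
    }

    int main(void)
    {
        void *pt = alloc_pt_frame();
        free_pt_frame(pt);
        return pt ? 0 : 1;
    }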
Certain read-only pages in the Linux guest OS, such as the boot time GDT, LDT, and initial page table (init_mm), are currently allocated at static locations in the OS binary. In order to allow the use of superpages over the entire kernel address range, the guest OS is modified to relocate these read-only pages to within a read-only superpage allocated by the special allocator.

5.3.2 Global Page Mappings

Supporting global page mappings for guest domains running in a VM environment is quite simple on the x86 architecture. The x86 architecture provides mechanisms by which global page mappings can be invalidated from the TLB (this can be done, for example, by disabling and re-enabling the global paging flag in the processor control registers, specifically the PGE flag in the CR4 register).
We modify the Xen VMM to allow guest OSes to use global page mappings in their address space. On each domain switch, the VMM is modified to flush all TLB entries, using the mechanism described above. This has the additional side effect that the VMM's global page mappings are also invalidated on a domain switch.
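The mechanism referred to above is the standard x86 idiom of toggling CR4.PGE: clearing the bit invalidates all TLB entries, global ones included. A ring-0-only sketch in GCC inline assembly (read_cr4/write_cr4 are written out here; a real kernel or VMM provides its own accessors):

    #define X86_CR4_PGE (1ul << 7)  /* page global enable bit */

    static inline unsigned long read_cr4(void)
    {
        unsigned long v;
        __asm__ __volatile__("mov %%cr4, %0" : "=r"(v));
        return v;
    }

    static inline void write_cr4(unsigned long v)
    {
        __asm__ __volatile__("mov %0, %%cr4" : : "r"(v) : "memory");
    }

    /* Flush the entire TLB, including global entries: writing CR4
     * with PGE cleared invalidates everything; restoring PGE then
     * re-enables global pages. This is why allowing guests to use
     * global mappings costs the VMM its own global TLB entries on
     * every domain switch. */
    static inline void flush_tlb_global(void)
    {
        unsigned long cr4 = read_cr4();
        write_cr4(cr4 & ~X86_CR4_PGE);
        write_cr4(cr4);
    }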
The use of global page mappings potentially improves the TLB performance in the absence of domain switches. However, when multiple domains have to be switched frequently, the benefits of global mappings may be much reduced. In this case, the use of global mappings in the guest OS forces both the guest's and the VMM's TLB mappings to be invalidated on each switch. Our evaluation in section 6 shows that global mappings are beneficial only when there is a single driver domain, and not in the presence of guest domains.
We make one additional optimization to the domain switching code in the VMM. Currently, the VMM flushes the TLB whenever it switches to a non-idle domain. We notice that this incurs an unnecessary TLB flush when the VMM switches from a domain d to the idle domain and then back to the domain d. We make a modification to avoid the unnecessary flush in this case.

5.4 Outstanding issues

There remain a few aspects of virtualization which are difficult to reconcile with the use of superpages in the guest OS. We briefly mention them here.

5.4.1 Transparent page sharing

Transparent page sharing between virtual machines is an effective mechanism to reduce memory usage when there are a large number of VMs. Page sharing uses the address translation from pseudo-physical to physical addresses to transparently share pseudo-physical pages which have the same content.
Page sharing potentially breaks the contiguity of physical page frames. With the use of superpages, either the entire superpage must be shared between the VMs, or no page within the superpage can be shared. This can significantly reduce the scope for memory sharing between VMs. Although Xen does not make use of page sharing currently, this is a potentially important issue for superpages.

5.4.2 Ballooning driver

The ballooning driver is a mechanism by which the VMM can efficiently vary the memory allocation of guest OSes. Since the use of a ballooning driver potentially breaks the contiguity of physical pages allocated to a guest, this invalidates the use of superpages for that address range.
A possible solution is to force memory allocation/deallocation to be in units of superpage size for coarse grained ballooning operations, and to invalidate superpage mappings only for fine grained ballooning operations. This functionality is not implemented in the current prototype.

6 Evaluation

The optimizations described in the previous sections have been implemented in Xen version 2.0.6, running Linux guest operating systems version 2.6.11.

6.1 Experimental Setup

We use two micro-benchmarks, a transmit and a receive benchmark, to evaluate the networking performance of guest and driver domains. These benchmarks are similar to the netperf TCP streaming benchmark, which measures the maximum TCP streaming throughput over a single TCP connection. Our benchmark is modified to use the zero-copy sendfile system call for transmit operations.
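The transmit side of such a benchmark can be approximated by the user-space loop below, using Linux's zero-copy sendfile(2); the port number and overall structure are illustrative, not the benchmark's actual source.

    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <file> <server-ip>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        int s  = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port   = htons(5001) };
        inet_pton(AF_INET, argv[2], &dst.sin_addr);
        if (fd < 0 || s < 0 ||
            connect(s, (struct sockaddr *)&dst, sizeof dst) < 0) {
            perror("setup");
            return 1;
        }
        off_t size = lseek(fd, 0, SEEK_END);
        for (;;) {              /* stream the file repeatedly      */
            off_t off = 0;
            while (off < size)  /* zero-copy: no user-level buffer */
                if (sendfile(s, fd, &off, size - off) < 0) {
                    perror("sendfile");
                    return 1;
                }
        }
    }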
The 'server' system for running the benchmark is a Dell PowerEdge 1600SC, 2.4 GHz Intel Xeon machine. This machine has four Intel Pro-1000 Gigabit NICs. The 'clients' for running an experiment consist of Intel Xeon machines with a similar CPU configuration, each having one Intel Pro-1000 Gigabit NIC. All the NICs have support for TSO, scatter/gather I/O and checksum offload. The client and server machines are connected over a Gigabit switch.
The experiments measure the maximum throughput achievable with the benchmark (either transmitter or receiver) running on the server machine. The server is connected to each client machine over a different network interface, and uses one TCP connection per client. We use as many clients as required to saturate the server CPU, and measure the throughput under different Xen and Linux configurations. All the profiling results presented in this section are obtained using the Xenoprof system wide profiler in Xen.
6.2 Overall Results

We evaluate the following configurations: 'Linux' refers to the baseline unmodified Linux version 2.6.11 running in native mode. 'Xen-driver' refers to the unoptimized XenoLinux driver domain. 'Xen-driver-opt' refers to the Xen driver domain with our optimizations. 'Xen-guest' refers to the unoptimized, existing Xen guest domain. 'Xen-guest-opt' refers to the optimized version of the guest domain.
Figure 3 compares the transmit throughput achieved under the above 5 configurations. Figure 4 shows receive performance in the different configurations.

Figure 3: Transmit throughput in different configurations

For the transmit benchmark, the performance of the Linux, Xen-driver and Xen-driver-opt configurations is limited by the network interface bandwidth, and does not fully saturate the CPU. All three configurations achieve an aggregate link speed throughput of 3760 Mb/s. The CPU utilization values for saturating the network in the three configurations are, respectively, 40%, 46% and 43%.

Figure 4: Receive throughput in different configurations

The optimized guest domain configuration, Xen-guest-opt, improves on the performance of the unoptimized Xen guest by a factor of 4.4, increasing transmit throughput from 750 Mb/s to 3310 Mb/s. However, Xen-guest-opt becomes CPU saturated at this throughput, whereas the Linux configuration reaches a CPU utilization of only roughly 40% to saturate 4 NICs.
For the receive benchmark, the unoptimized Xen driver domain configuration, Xen-driver, achieves a throughput of 1738 Mb/s, which is only 69% of the native Linux throughput of 2508 Mb/s. The optimized Xen-driver version, Xen-driver-opt, improves upon this performance by 35%, and achieves a throughput of 2343 Mb/s. The optimized guest configuration Xen-guest-opt improves the guest domain receive performance only slightly, from 820 Mb/s to 970 Mb/s.
6.3 Transmit Workload

We now examine the contribution of individual optimizations for the transmit workload. Figure 5 shows the transmit performance under different combinations of optimizations. 'Guest-none' is the guest domain configuration with no optimizations. 'Guest-sp' is the guest domain configuration using only the superpage optimization. 'Guest-ioc' uses only the I/O channel optimization. 'Guest-high' uses only the high level virtual interface optimization. 'Guest-high-ioc' uses both high level interfaces and the optimized I/O channel. 'Guest-high-ioc-sp' uses all the optimizations: high level interface, I/O channel optimizations and superpages.

Figure 5: Contribution of individual transmit optimizations

The single biggest contribution to guest transmit performance comes from the use of a high-level virtual interface (Guest-high configuration). This optimization improves guest performance by 272%, from 750 Mb/s to 2794 Mb/s.
The I/O channel optimization yields an incremental improvement of 439 Mb/s (15.7%) over the Guest-high configuration to yield 3230 Mb/s (configuration Guest-high-ioc). The superpage optimization improves this further to achieve 3310 Mb/s (configuration Guest-high-ioc-sp). The impact of these two optimizations by themselves, in the absence of the high level interface optimization, is insignificant.

6.3.1 High level Interface

We explain the improved performance resulting from the use of a high level interface in figure 6. The figure compares the execution overhead (in millions of CPU cycles per second) of the guest domain, driver domain and the Xen VMM, incurred while running the transmit workload on a single NIC, between the high-level Guest-high configuration and the Guest-none configuration.

Figure 6: Breakdown of execution cost in high-level and unoptimized virtual interface (CPU cycles of execution, in millions, for the guest domain, driver domain and Xen VMM)

The use of the high level virtual interface reduces the execution cost of the guest domain by almost a factor of 4 compared to the execution cost with a low-level interface. Further, the high-level interface also reduces the Xen VMM execution overhead by a factor of 1.9, and the driver domain overhead by a factor of 2.1. These reductions can be explained by the reasoning given in section 3, namely the absence of data copy because of the support for scatter/gather, and the reduced per-byte processing overhead because of the use of larger packets.
The use of a high level virtual interface gives performance improvements for the guest domain even when the offload features are not supported in the physical NIC. Figure 7 shows the performance of the guest domain using a high level interface with the physical network interface supporting varying capabilities. The capabilities of the physical NIC form a spectrum: at one end the NIC supports TSO, SG I/O and checksum offload; at the other end it supports no offload feature; intermediate configurations support partial offload capabilities.

Figure 7: Advantage of higher-level interface

Even in the case when the physical NIC supports no offload feature (bar labeled 'None'), the guest domain with a high level interface performs nearly twice as well as the guest using the default interface (bar labeled Guest-none), viz. 1461 Mb/s vs. 750 Mb/s. Thus, even in the absence of offload features in the NIC, by performing the offload computation just before transmitting the packet over the NIC (in the offload driver) instead of performing it in the guest domain, we significantly reduce the overheads incurred in the I/O channel and the network bridge on the packet's transmit path.
The performance of the guest domain with just checksum offload support in the NIC is comparable to the performance without any offload support. This is because, in the absence of support for scatter/gather I/O, data copying with or without checksum computation incurs effectively the same cost. With scatter/gather I/O support, transmit throughput increases to 2007 Mb/s.

6.3.2 I/O Channel Optimizations

Figure 5 shows that using the I/O channel optimizations in conjunction with the high level interface (configuration Guest-high-ioc) improves the transmit performance from 2794 Mb/s to 3239 Mb/s, an improvement of 15.7%. This can be explained by comparing the execution profile of the two guest configurations, as shown in figure 8. The I/O channel optimization reduces the execution overhead incurred in the Xen VMM by 38% (Guest-high-ioc configuration), and this accounts for the corresponding improvement in transmit throughput.

Figure 8: I/O channel optimization benefit (CPU cycles of execution, in millions, for the Guest-high and Guest-high-ioc configurations)

6.3.3 Superpage mappings

Figure 5 shows that superpage mappings improve guest domain performance slightly, by 2.5%. Figure 9 shows the data and instruction TLB misses incurred for the transmit benchmark for three sets of virtual memory optimizations. 'Base' refers to a configuration which uses the high-level interface and I/O channel optimizations, but with regular 4K sized pages. 'SP-GP' is the Base configuration with the superpage and global page optimizations in addition. 'SP' is the Base configuration with superpage mappings only.
The TLB misses are grouped into three categories: data TLB misses (D-TLB), instruction TLB misses incurred in the guest and driver domains (Guest OS I-TLB), and instruction TLB misses incurred in the Xen VMM (Xen I-TLB).

Figure 9: TLB misses for transmit benchmark (TLB misses in millions/s, broken into D-TLB, Guest OS I-TLB and Xen I-TLB)

The use of superpages alone is sufficient to bring down the data TLB misses by a factor of 3.8. The use of global mappings does not have a significant impact on data TLB misses (configurations SP-GP and SP), since frequent switches between the guest and driver domain cancel out any benefits of using global pages.
The use of global mappings, however, does have a negative impact on the instruction TLB misses in the Xen VMM. As mentioned in section 5.3.2, the use of global page mappings forces the Xen VMM to flush out all TLB entries, including its own TLB entries, on a domain switch. This shows up as a significant increase in the number of Xen instruction TLB misses (SP-GP vs. SP).
The overall impact of using global mappings on transmit performance, however, is not very significant (throughput drops from 3310 Mb/s to 3302 Mb/s). The optimal guest domain performance shown in section 6.3 therefore uses only the superpage optimization.

6.3.4 Non Zero-copy Transmits

So far we have shown the performance of the guest domain when it uses a zero-copy transmit workload. This workload benefits from the scatter/gather I/O capability in the network interface, which accounts for a significant part of the improvement in performance when using a high level interface.
We now show the performance benefits of using a high level interface, and the other optimizations, when using a benchmark which uses copying writes instead of the zero-copy sendfile. Figure 10 shows the transmit performance of the guest domain for this benchmark under the different combinations of optimizations.

Figure 10: Transmit performance with writes

The breakup of the contributions of individual optimizations in this benchmark is similar to that for the sendfile benchmark. The best case transmit performance in this case is 1848 Mb/s, which is much less than the best sendfile throughput (3310 Mb/s), but still significantly better than the unoptimized guest throughput.

6.4 Receive Benchmark

Figure 4 shows that the optimized driver domain configuration, Xen-driver-opt, improves the receive performance from 1738 Mb/s to 2343 Mb/s. The Xen-guest-opt configuration shows a much smaller improvement in performance, from 820 Mb/s to 970 Mb/s.

6.4.1 Driver domain

We examine the individual contribution of the different optimizations for the receive workload. We evaluate the following configurations: 'Driver-none' is the driver domain configuration with no optimizations. 'Driver-MTU' is the driver configuration with the I/O channel optimization, which allows the use of MTU sized socket buffers. 'Driver-MTU-GP' uses both MTU sized buffers and global page mappings. 'Driver-MTU-SP' uses MTU sized buffers with superpages, and 'Driver-MTU-SP-GP' uses all three optimizations. 'Linux' is the baseline native Linux configuration.
Figure 11 shows the performance of the receive benchmark under the different configurations.

Figure 11: Receive performance in different driver configurations

It is interesting to note that the use of MTU sized socket buffers improves network throughput from 1738 Mb/s to 2033 Mb/s, an increase of 17%. Detailed profiling reveals that the Driver-none configuration, which