Transcript of "Solaris Operating System Hardware Virtualization Product ..."
SOLARIS™ OPERATING SYSTEM
Chien-Hua Yen, ISV Engineering
Sun BluePrints™ On-Line — November 2007
Part No 820-3703-10
Revision 1.0, 11/27/07
Edition: November 2007
1 Introduction Sun Microsystems, Inc.
In the IT industry, virtualization is a mechanism of presenting a set of logical computing
resources over a fixed hardware configuration so that these logical resources can be
accessed in the same manner as the original hardware configuration. The concept of
virtualization is not new. First introduced in the late 1960s on mainframe computers,
virtualization has recently become popular as a means to consolidate servers and
reduce the costs of hardware acquisition, energy consumption, and space utilization.
The hardware resources that can be virtualized include computer systems, storage, and
Server virtualization can be implemented at different levels on the computing stack,
including the application level, operating system level, and hardware level:
• An example of application level virtualization is the Virtual Machine for the Java™
platform (Java Virtual Machine or JVM™ machine)1. The JVM implementation
provides an application execution environment as a layer between the application
and the OS, removing application dependency on OS-specific APIs and hardware-
• OS level virtualization abstracts OS services such as file systems, devices,
networking, and security, and provides a virtualized operating environment to
applications. Typically, OS level virtualization is implemented by the OS kernel.
Only one instance of the kernel runs on the system, and it provides multiple
virtualized operating environments to applications. Examples of OS level
virtualization include Solaris™ Containers technology, Linux VServers, and FreeBSD
Jails. OS level virtualization has less performance overhead and better system
resource utilization than hardware level virtualization. Since one OS kernel is
shared among all virtual operating environments, isolation among all virtualized
operating environments is as good as the OS provides.
• Hardware level virtualization, discussed in detail in this paper, has become popular
recently because of increasing CPU power and low utilization of CPU resources in the
IT data center. Hardware level virtualization allows a system to run multiple OS
instances. With less sharing of system resources than OS level virtualization,
hardware virtualization provides stronger isolation of operating environments.
1. The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java(TM) platform.
The Solaris OS includes bundled support for application and OS level virtualization with
its JVM software and Solaris Containers offerings. Sun first added support for hardware
virtualization in the Solaris 10 11/06 release with Sun Logical Domains (LDoms)
technology, supported on Sun servers that utilize UltraSPARC T1 or UltraSPARC T2
processors. VMware also supports the Solaris OS as a guest OS in its VMware Server and
Virtual Infrastructure products starting with the Solaris 10 1/06 release. In October
2007, Sun announced the Sun xVM family of products that includes the Sun xVM Server
and the Sun xVM Ops Center management system:
• Sun xVM Server — includes support for the Xen open source community work  on
the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform
• Sun xVM Ops Center — a management suite for the Sun xVM Server
Note – In this paper, in order to distinguish the discussion of x86 and UltraSPARC
T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware
virtualization product for the x86 platform, and LDoms is used to refer to the Sun
hardware virtualization product for the UltraSPARC T1 and T2 platforms.
The hardware virtualization technology and new products built around this technology
have expanded options and opportunities for deploying servers with better utilization,
more flexibility, and enhanced functionality. In reaping the benefits of the hardware
virtualization, IT professionals also face the challenges of operating within the
limitations of a virtualized environment while delivering the same service level
as the physical operating environment. Meeting this requirement demands a
good understanding of virtualization technologies, CPU architecture, and software
implementations, and awareness of their strengths and limitations.
Hardware Level Virtualization
Hardware level virtualization is a mechanism of virtualizing the system hardware
resources such as CPU, memory, and I/O, and creating multiple execution
environments on a single system. Each of these execution environments runs an
instance of the operating system.
A hardware level virtualization implementation typically consists of several virtual
machines (VMs), as shown in Figure 1. A layer of software, the virtual machine monitor
(VMM), manages system hardware resources and presents an abstraction of these
resources to each VM. The VMM runs in privileged mode and has full control of system
hardware. A guest operating system (GOS) runs in each VM. A GOS is to a VM what a
program is to a process, with the VMM playing the role that the OS plays for processes.
[Figure 1 diagram: three VMs, each running a GOS, on top of the Virtual Machine Monitor (VMM).]
Figure 1. In hardware level virtualization, the VMM software manages hardware resources and
presents an abstraction of these resources to one or more virtual machines.
Hardware resource virtualization can take the form of sharing, partitioning, or
delegating:
• Sharing — Resources are shared among VMs. The VMM coordinates the use of
resources by VMs. For example, the VMM may include a CPU scheduler to run threads
of VMs based on a pre-determined scheduling policy and VM priority.
• Partitioning — Resources are partitioned so that each VM gets the portion of
resources allocated to it. Partitioning can be dynamically adjusted by the VMM based
on the utilization of each VM. Examples of resource partitioning include the
ballooning memory technique employed in Sun xVM Server and VMware, and the
allocation of CPU resources in Logical Domains technology.
• Delegating — With delegating, resources are not directly accessible by a VM.
Instead, all resource accesses are made through a control VM that has direct access
to the resource. I/O device virtualization is normally accessed via delegation.
The distinction and boundaries between the virtualization methods are often not clear.
For example, sharing may be used for one component and partitioning used in others,
and together they make up an integral functional module.
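As an illustration only, the three forms can be modeled in a few lines of Python. This sketch is not taken from Sun xVM Server, LDoms, or VMware. The names `share_cpu`, `partition_memory`, and `ControlVM` are hypothetical, and the round-robin policy and weight-based split are assumptions chosen for brevity.

```python
from collections import deque


def share_cpu(vms, time_slices):
    """Sharing: a round-robin scheduler gives each VM a CPU time slice in turn."""
    queue = deque(vms)
    schedule = []
    for _ in range(time_slices):
        vm = queue.popleft()
        schedule.append(vm)
        queue.append(vm)  # the VM goes to the back of the run queue
    return schedule


def partition_memory(total_pages, weights):
    """Partitioning: each VM gets a private portion proportional to its weight."""
    total_weight = sum(weights.values())
    return {vm: total_pages * weight // total_weight
            for vm, weight in weights.items()}


class ControlVM:
    """Delegating: only the control VM touches the device; other VMs ask it."""

    def __init__(self):
        self.device_log = []  # requests that reached the (simulated) device

    def submit(self, vm, request):
        self.device_log.append((vm, request))
        return f"done:{request}"


if __name__ == "__main__":
    print(share_cpu(["vm1", "vm2", "vm3"], 5))
    print(partition_memory(1024, {"vm1": 1, "vm2": 3}))
    print(ControlVM().submit("vm2", "read block 7"))
```

A real VMM would adjust the partition sizes dynamically and use a richer scheduling policy; the sketch only shows where each form places control of the resource.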
Benefits of Hardware Level Virtualization
Hardware level virtualization allows multiple operating systems to run on a single
server system. This ability offers many benefits that are not available in a single OS
server. These benefits can be summarized in three functional categories:
• Workload Consolidation
According to Gartner, “Intel servers running at 10 percent to 15 percent
utilization are common.” Many IT organizations run out and buy a new server every
time they deploy a new application. With virtualization, computers no longer have to
be dedicated to a particular task. Applications and users can share computing
resources, remaining blissfully unaware that they are doing so. Companies can shift
computing resources around to meet demand at a given time, and get by with less
infrastructure overall. When used for consolidation, virtualization can also save
hardware and maintenance expenses, floor space, cooling costs, and power
• Workload Migration
Hardware level virtualization decouples the OS from the underlying physical platform
resources. A guest OS state, along with the user applications running on top of it, can
be encapsulated into an entity and moved to another system. This capability is useful
for migrating a legacy OS system from an old under-powered server to a more
powerful server while preserving the investment in software. When a server needs to
be maintained, a VM can be dynamically migrated to a new server with no downtime,
further enhancing availability. Changes in workload intensity levels can be addressed
by dynamically shifting underlying resources to the starving VMs. Legacy applications
that ran natively on a server continue to run on the same OS running inside a VM,
leveraging the existing investment in applications and tools.
• Workload Isolation
Workload isolation includes fault and security isolations. Multiple guest OSes run
independently, and thus a software failure in one VM does not affect other VMs.
However, the VMM layer introduces a single point of failure that can bring down all
VMs on the system. A VMM failure, although potentially catastrophic, is less probable
than a failure in the OS because a VMM is much less complex than an OS.
Multiple VMs also provide strong security isolation among themselves with each VM
running an independent OS. Security intrusions are confined to the VM in which they
occur. The boundary around each VM is enforced by the VMM and the inter-domain
communication, if provided by the VMM, is restricted to specific kernel modules only.
One distinct feature of hardware level virtualization is the ability to run multiple
instances of heterogeneous operating systems on a single hardware platform. This
feature is important for the following reasons:
• Better security and fault containment among application services can be achieved
through OS isolation.
• Applications written for one OS can run on a system that supports a different OS.
• Better management of system resource utilization is possible among the virtualized
This paper explores the underlying hardware architecture and software implementation
for enabling hardware virtualization. Great emphasis has been placed on the CPU
hardware architecture limitations for virtualizing CPU services and their software
workarounds. In addition, this paper discusses in detail the software architecture for
implementing the following types of virtualization:
• CPU virtualization — uses processor privileged mode to control resource usage by
the VM, and relays hardware traps and interrupts to VMs
• Memory virtualization — partitions physical memory among multiple VMs and
handles page translations for each VM
• I/O virtualization — uses a dedicated VM with direct access to I/O devices to provide
The paper is organized into three sections. Section I, Background Information, contains
information on VMMs and provides details on the x86 and SPARC processors:
• “Virtual Machine Monitor Basics” discusses the core of hardware
virtualization, the VMM, as well as requirements for the VMM and several types of
• “The x86 Processor Architecture” describes features of the x86 processor
architecture that are pertinent to virtualization.
• “SPARC Processor Architecture” describes features of the SPARC processor
that affect virtualization implementations.
Section II, Hardware Virtualization Implementations, provides details on the Sun xVM
Server, Logical Domains, and VMware implementations:
• “Sun xVM Server” discusses a paravirtualized Solaris OS that is based on
an open source VMM implementation for x86 processors and is planned for
inclusion in a future Solaris release.
• “Sun xVM Server with Hardware VM (HVM)” continues the discussion of
Sun xVM Server for the x86 processors that support hardware virtual machines:
Intel-VT and AMD-V.
• “Logical Domains” discusses Logical Domains (LDoms), supported on Sun
servers that utilize UltraSPARC T1 or T2 processors, and describes Solaris OS support
for this feature.
• “VMware” discusses the VMware implementation for the VMM.
Section III, Additional Information, contains a concluding comparison, references, and
• “VMM Comparison” presents a summary of the VMM implementations
discussed in this paper.
• “References” provides a comprehensive listing of related references.
• “Terms and Definitions” contains a glossary of terms.
• “Author Biography” provides information on the author.
9 Virtual Machine Monitor Basics Sun Microsystems, Inc.
Virtual Machine Monitor Basics
At the heart of hardware level virtualization is the VMM. The VMM is a software layer
that abstracts computer hardware resources so that multiple OS instances can run on a
physical system. Hardware resources are normally controlled and managed by the OS.
In a virtualized environment the VMM takes this role, managing and coordinating
hardware resources. There is no clear boundary between an OS and the VMM from the
definition point of view. The division of functions between OS and the VMM can be
influenced by factors such as processor architecture, performance, OS, and non-
technical requirements such as ease of installation and migration.
Certain VMM requirements exist for running multiple OS instances on a system. These
requirements, discussed in detail in the next section, stem primarily from processor
architecture design that is inherently an impediment to hardware virtualization. Based
on these requirements, two types of VMMs have emerged, each with distinct
characteristics in defining the relationship between the VMM and an OS. This
relationship determines the privilege level of the VMM and an OS, and the control and
sharing of hardware resources.
A software program communicates with the computer hardware through instructions.
Instructions, in turn, operate on registers and memory. If any of the instructions,
registers, or memory involved in an action is privileged, that instruction results in a
privileged action. Sometimes an action, which is not necessarily privileged, attempts to
change the configuration of resources in the system. Subsequently, this action would
impact other actions whose behavior or result depends on the configuration of
resources. The instructions that result in such operations are called sensitive
instructions.
In the context of the virtualization discussion, a processor's instructions can be
classified into three groups:
• Privileged instructions are those that trap if the processor is in non-privileged mode
and do not trap if it is in privileged mode.
• Sensitive instructions are those that change or reference the configuration of
resources (memory), affect the processor mode without going through the memory
trap sequence (page fault), or reference the sensitive registers whose contents
change when the processor switches to run another VM.
• Non-privileged and non-sensitive instructions are those that do not fall into either
the privileged or sensitive categories described above.
Sensitive instructions have “a major bearing on the virtualizability of a machine”
because of their system-wide impact. In a virtualized environment, a GOS should only
contain non-privileged and non-sensitive instructions.
If sensitive instructions are a subset of privileged instructions, it is relatively easy to
build a VM because all sensitive instructions will result in a trap. In this case a VMM can
be constructed to catch all traps that result from execution of sensitive instructions by a
GOS. All privileged and sensitive actions from VMs would be caught by the VMM, and
resources could be allocated and managed accordingly (a technique called trap-and-
emulate). A GOS's trap handler could then be called by the VMM trap handler to
perform the GOS-specific actions for the trap.
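The subset condition and the trap-and-emulate flow described above can be sketched in Python. This is a toy model under stated assumptions, not the mechanism of any particular VMM: the instruction names are placeholder strings, and `ToyCPU`, `ToyVMM`, and `TrapToVMM` are hypothetical names introduced for illustration.

```python
class TrapToVMM(Exception):
    """Raised by the toy CPU when a privileged instruction runs deprivileged."""


def is_trap_and_emulate_virtualizable(sensitive, privileged):
    """True when every sensitive instruction is also privileged (and so traps)."""
    return set(sensitive) <= set(privileged)


class ToyCPU:
    def __init__(self, privileged):
        self.privileged = set(privileged)
        self.in_privileged_mode = False  # the GOS runs deprivileged

    def execute(self, instruction):
        if instruction in self.privileged and not self.in_privileged_mode:
            raise TrapToVMM(instruction)
        return f"executed {instruction}"


class ToyVMM:
    def __init__(self, cpu):
        self.cpu = cpu
        self.emulated = []  # instructions the VMM handled on the guest's behalf

    def run_guest(self, instructions):
        results = []
        for instruction in instructions:
            try:
                results.append(self.cpu.execute(instruction))
            except TrapToVMM:
                # The VMM catches the trap and emulates the instruction.
                self.emulated.append(instruction)
                results.append(f"emulated {instruction}")
        return results


if __name__ == "__main__":
    privileged = {"set_page_table", "disable_interrupts", "halt"}
    sensitive = {"set_page_table", "disable_interrupts"}
    print(is_trap_and_emulate_virtualizable(sensitive, privileged))
    print(ToyVMM(ToyCPU(privileged)).run_guest(["add", "set_page_table"]))
```

In this model, ordinary instructions run directly on the CPU, and only the privileged ones cost a round trip through the VMM.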
If a sensitive instruction is a non-privileged instruction, its execution by a
VM does not trap and goes unnoticed by the VMM. Robin and Irvine identified several x86 instructions in this
category. These instructions cannot be safely executed by a GOS as they can impact the
operations of other VMs or adversely affect the operation of its own GOS. Instead, these
instructions must be substituted by the VMM service. The substitution can be in the
form of an API for the GOS to call, or a dynamic conversion of these instructions to
explicit processor traps.
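The dynamic conversion approach can be sketched as a rewriting pass over guest code. The following Python model is illustrative only: `rewrite_guest_code` and `run_rewritten` are hypothetical names, instructions are represented as plain strings, and `popf` is used as the example because it is commonly cited as an x86 instruction that is sensitive but does not trap in user mode.

```python
def rewrite_guest_code(instructions, sensitive_unprivileged):
    """Replace each sensitive, non-privileged instruction with an explicit trap."""
    rewritten = []
    for instruction in instructions:
        if instruction in sensitive_unprivileged:
            rewritten.append(("trap", instruction))    # hand over to the VMM
        else:
            rewritten.append(("native", instruction))  # safe to run directly
    return rewritten


def run_rewritten(code, vmm_services):
    """Run rewritten code; trapped instructions are served by the VMM."""
    log = []
    for kind, instruction in code:
        if kind == "trap":
            log.append(vmm_services[instruction]())
        else:
            log.append(f"native {instruction}")
    return log


if __name__ == "__main__":
    code = rewrite_guest_code(["add", "popf", "mov"], {"popf"})
    print(code)
    print(run_rewritten(code, {"popf": lambda: "vmm emulated popf"}))
```

The API form of substitution differs only in when the replacement happens: the GOS source is changed ahead of time to call the VMM service instead of being rewritten at run time.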
Types of VMM
In a virtualized environment, the VMM controls the hardware resources. VMMs can be
categorized into two types, based on this control of resources:
• Type I — maintains exclusive control of hardware resources
• Type II —leverages the host OS by running inside the OS kernel
The Type I VMM has several distinct characteristics: it is the first software to run
(besides BIOS and the boot loader), it has full and exclusive control of system hardware,
and it runs in privileged mode directly on the physical processor. The GOS on a Type I
VMM implementation runs in a less privileged mode than the VMM to avoid conflicts
managing the hardware resources.
An example of a Type I VMM is Sun xVM Server. Sun xVM Server includes a bundled
VMM, the Sun xVM Hypervisor for x86. The Sun xVM Hypervisor for x86 is the first
software, besides the BIOS and boot loader, to run during boot as shown in the GRUB
title Sun xVM Server
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
The GRUB bootloader first loads the Sun xVM Hypervisor for x86, xen.gz. After the
VMM gains control of the hardware, it loads the Solaris kernel,
/platform/i86xpv/kernel/$ISADIR/unix, to run as a GOS.
Sun's Logical Domains and VMware's Virtual Infrastructure 3 (formerly known as
VMware ESX Server), described in detail in Chapter 7 “Logical Domains” on page 79 and
Chapter 8 “VMware” on page 97, are also Type I VMMs.
A Type II VMM typically runs inside a host OS kernel as an add-on module, and the host
OS maintains control of the hardware resources. The GOS in a Type II VMM is a process
of the host OS. A Type II VMM leverages the kernel services of the host OS to access
hardware, and intercepts a GOS's privileged operations and performs these operations
in the context of the host OS. Type II VMMs have the advantage of preserving the
existing installation by allowing a new GOS to be added to a running OS.
An example of a Type II VMM is VMware's VMware Server (formerly known as VMware
Figure 2 illustrates the relationships among hardware, VMM, GOS, host OS, and user
application in virtualized environments.
[Figure 2 diagram: three panels (Type I VMM server, Type II VMM server, physical server) showing applications in user space, GOS instances, the VMM or host OS, and the platform hardware.]
Figure 2. Virtual machine monitors vary in how they support guest OS, host OS, and user applications
in virtualized environments.
As discussed in “VMM Requirements” on page 9, the VMM performs some of the
functions that an OS normally does: namely, it controls and arbitrates CPU and memory
resources, and provides services to upper layer software for sensitive and privileged
operations. These functions require the VMM to run in privileged mode and the OS to
relinquish the privileged and sensitive operations to the VMM. In addition to processor
and memory operation, I/O device support also has a large impact on VMM
VMM in Privileged Mode
A processor typically has two or more privilege modes. The operating system kernel
runs in the privileged mode. The user applications run in a non-privileged mode and
trap to the kernel when they need to access system resources or services from the
The GOS normally assumes it runs in the most privileged mode of the processor.
Running a VMM in a privileged mode can be accomplished with one of the following
methods:
• Deprivileging the GOS — This method usually requires a modification to the OS to
run at a lower privilege level. For x86 systems, the OS normally runs at protected ring
0, the most privileged level. In Sun xVM Server, ring 0 is reserved to run the VMM.
This requires the GOS to be modified, or paravirtualized, to run outside of ring 0 at a
lower privilege level.
• Hyperprivileging the VMM — Instead of changing the GOS to run at lower privilege,
another approach taken by the chip vendors is to create a hyperprivileged processor
mode for the VMM. The hyperprivileged mode of the Sun UltraSPARC T1 and T2 processors,
Intel-VT's VMX-root operation (see Volume 3B, Chapter 19), and AMD-V’s
VMRUN-Exit state (see Chapter 15) are examples of a hyperprivileged processor for
• Both VMM and GOS run in same privileged mode — It is possible to have both the
VMM and GOS run in the same privileged mode. In this case, the VMM intercepts all
privileged and sensitive operations of a GOS before passing them to the processor. For
example, VMware allows both the GOS and the VMM to run in privileged mode.
VMware dynamically examines each instruction to decide whether the processor
state and the segment reversibility (see “Segmented Architecture” on page 23) allow
the instruction to be executed directly without the involvement of the VMM. If the
GOS is in privileged mode or the code segment is non-reversible, the VMM performs
necessary conversions of the core execution path.
Removing Sensitive Instructions in the GOS
Privileged and sensitive operations are normally executed by the OS kernel. In a
virtualized environment, the GOS has to relinquish the privileged and sensitive
operations to the VMM. This is accomplished by one of the following approaches:
• Modifying the GOS source code to use the VMM services for handling sensitive
This method is used by Sun xVM Server and Sun's Logical Domains (LDoms). Sun xVM
Server and LDoms provide a set of hypercalls for an OS to request VMM services. The
VMM-aware Solaris OS uses these hypercalls to replace its sensitive instructions.
• Dynamically translating the GOS sensitive instructions by software
As described in a previous section, VMware uses binary translation to replace the GOS
sensitive instructions with VMM instructions.
• Dynamically translating the GOS sensitive instructions by hardware
This method requires the processor to provide a special mode of operation that is
entered when a sensitive instruction is executed in reduced privileged mode.
The first approach, which involves modifying the GOS source code, is called
paravirtualization, because the VMM provides only partial virtualization of the
processor. The GOS must replace its sensitive and privileged operations with the VMM
service. The remaining two approaches provide full virtualization to the VM, enabling
the GOS to run without modification.
In addition to OS modification, performance requirements, processor architecture
design, tolerance of a single point of failure, and support for legacy OS installations
have an impact on the design of VMM architecture.
Physical Memory Virtualization
Memory management by the VMM involves two tasks: partitioning physical memory
for VMs, and supporting page translations in a VM.
Each OS assumes physical memory starts from page frame number (PFN) 0 and is
contiguous to the size configured for that VM. An OS uses physical addresses in
operations like page table updates and Direct Memory Access (DMA). In reality, the
starting PFN of the memory exported to a VM may not be PFN 0, and the memory may not be
contiguous.
The VMM virtualizes physical addresses by adding another layer of addressing,
the machine address (MA). Within a GOS, a virtual address
(VA) is used by applications, and a physical address (PA) is used by the OS in DMA and
page tables. The VMM maps a PA from a VM to an MA, which is used on hardware. The
VMM maintains translation tables, one for each VM, for mapping PAs to MAs.
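A per-VM translation table of this kind can be sketched as follows. This is a minimal Python model, not the data structure used by any of the VMMs discussed in this paper: the class name `PhysToMachineMap`, the 4 KB page size, and the frame numbers are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size for the example


class PhysToMachineMap:
    """One table per VM: index is the guest PFN, value is the machine PFN."""

    def __init__(self, machine_frames):
        self.p2m = list(machine_frames)

    def translate(self, pa):
        """Map a guest physical address (PA) to a machine address (MA)."""
        if pa < 0:
            raise ValueError("physical address must be non-negative")
        pfn, offset = divmod(pa, PAGE_SIZE)
        if pfn >= len(self.p2m):
            raise ValueError("physical address outside the VM's memory")
        return self.p2m[pfn] * PAGE_SIZE + offset


if __name__ == "__main__":
    # Each VM sees contiguous physical memory starting at PFN 0, while the
    # machine frames behind it are scattered and do not start at frame 0.
    vm1 = PhysToMachineMap([7, 8, 42])
    vm2 = PhysToMachineMap([3, 100])
    print(vm1.translate(0), vm1.translate(2 * PAGE_SIZE + 16))
    print(vm2.translate(PAGE_SIZE + 1))
```

Both VMs can use physical address 0 without conflict because each lookup goes through that VM's own table.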
Figure 3 depicts the scheme to partition machine memory to physical memory for each
[Figure 3 diagram: physical memory, starting at PFN 0, mapped onto machine memory, starting at MPFN 0.]
Figure 3. Example physical-to-machine memory mapping.
A ballooning technique has been used in some virtualization products to achieve
better utilization of physical memory among VMs. The idea behind the ballooning
technique is simple. The VMM controls a balloon module in a GOS. When the VMM
wants to reclaim memory, it inflates the balloon to increase pressure on memory,
forcing the GOS to page out memory to disk. If the demand for physical memory
decreases, the VMM deflates the balloon in a VM, enabling the GOS to claim more
memory.
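The effect of inflating and deflating the balloon can be modeled with simple page counts. The Python sketch below is an assumption-laden illustration rather than the balloon driver of any product: `BalloonedGuest` is a hypothetical name, and the guest's paging policy is reduced to moving a count of pages between memory and disk.

```python
class BalloonedGuest:
    """Toy GOS whose usable memory shrinks as the VMM inflates the balloon."""

    def __init__(self, total_pages, in_use):
        self.total_pages = total_pages  # memory configured for the VM
        self.in_use = in_use            # pages the guest currently keeps in memory
        self.balloon = 0                # pages pinned by the balloon module
        self.paged_out = 0              # pages the guest pushed out to disk

    def inflate(self, pages):
        """VMM reclaims memory: the balloon grows and the guest pages out."""
        pages = min(pages, self.total_pages - self.balloon)
        self.balloon += pages
        overflow = self.in_use - (self.total_pages - self.balloon)
        if overflow > 0:
            self.in_use -= overflow
            self.paged_out += overflow
        return pages  # pages now available to the VMM for other VMs

    def deflate(self, pages):
        """VMM returns memory: the balloon shrinks and the guest pages back in."""
        pages = min(pages, self.balloon)
        self.balloon -= pages
        free = self.total_pages - self.balloon - self.in_use
        restored = min(free, self.paged_out)
        self.in_use += restored
        self.paged_out -= restored
        return pages


if __name__ == "__main__":
    guest = BalloonedGuest(total_pages=100, in_use=80)
    guest.inflate(30)
    print(guest.in_use, guest.paged_out)
    guest.deflate(20)
    print(guest.in_use, guest.paged_out)
```

The VMM never chooses which guest pages to evict; it only creates memory pressure and leaves the choice to the GOS's own paging policy.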
Page Translations Virtualization
Access to the processor's page translation hardware is a privileged operation, and this
operation is performed by the privileged VMM. Exactly what the VMM needs to perform
depends on the processor architecture. For example, x86 hardware automatically loads
translations from the page table to the Translation Lookaside Buffer (TLB). The software
has no control of loading page translations to the TLB. Therefore, the VMM is
responsible for updating the page table that is seen by the hardware. The SPARC
processor uses software through traps to load page translations to the TLB. A GOS
maintains its page tables in its own memory, and the VMM gets page translations from
the VM and loads them to the TLB.
VMMs typically support the following two methods to support page translations:
• Hypervisor calls — The GOS makes a call to the VMM for page translation
operations. This method is commonly used by paravirtualized OSes, as it provides
• Shadow page table — The VMM maintains an independent copy of page tables,
called shadow page tables, from the guest page tables. When a page fault occurs,
the VMM propagates changes made to the GOS's page table to the shadow page
table. This method is commonly used by VMMs that support full virtualization, as the
GOS continues to update its own page table and the synchronization of the guest
page table and the shadow page table is handled by the VMM when page faults
occur.
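The shadow page table method can be sketched as a lazily filled table that the VMM updates on page faults. This Python model is a simplification under stated assumptions: `ShadowPaging` is a hypothetical name, page tables are flat dictionaries, and the handling of guest updates to entries that are already shadowed is omitted.

```python
class ShadowPaging:
    """Toy VMM that fills a shadow page table from the guest table on faults."""

    def __init__(self, guest_page_table, p2m):
        self.guest_page_table = guest_page_table  # virtual page -> guest PFN (owned by the GOS)
        self.p2m = p2m                            # guest PFN -> machine PFN (owned by the VMM)
        self.shadow = {}                          # virtual page -> machine PFN (seen by hardware)
        self.faults = 0

    def access(self, vpn):
        """Model a memory access: the hardware walks only the shadow table."""
        if vpn not in self.shadow:
            self.handle_page_fault(vpn)
        return self.shadow[vpn]

    def handle_page_fault(self, vpn):
        """Synchronize one entry from the guest page table into the shadow table."""
        self.faults += 1
        if vpn not in self.guest_page_table:
            # Not mapped by the guest either: a real VMM would forward this
            # fault to the GOS's own page fault handler.
            raise KeyError(f"guest has no mapping for virtual page {vpn}")
        guest_pfn = self.guest_page_table[vpn]
        self.shadow[vpn] = self.p2m[guest_pfn]


if __name__ == "__main__":
    guest_pt = {0: 5}
    vmm = ShadowPaging(guest_pt, {5: 90, 6: 91})
    print(vmm.access(0), vmm.faults)
    guest_pt[1] = 6  # the GOS updates its own table without calling the VMM
    print(vmm.access(1), vmm.faults)
```

The GOS is unmodified in this scheme: it writes only to its own table, and the cost of keeping the two tables consistent is paid in the VMM's fault handler.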
Figure 4 shows three different page translation implementations in the Solaris OS on
x86 and SPARC platforms.
1. The paravirtualized Sun xVM Server uses the following approach on x86 platforms:
   [1] The GOS uses the hypervisor call method to update the page tables
       maintained by the VMM.
2. The Sun xVM Server with HVM and VMware use the following approach:
   [2a] The GOS maintains its own guest page table. The synchronization between
        the guest page table and the hardware page table (shadow page table) is
        handled by the VMM when page faults occur.
   [2b] The x86 CPU loads the page translation from the hardware page table to
        the TLB.
3. On SPARC systems, the Solaris OS uses the following approach for Logical Domains:
   [3a] The GOS maintains its own page table. The GOS takes an entry from the
        page table as an argument to the hypervisor call that loads the translations
        to the TLB.
   [3b] The VMM gets the page translation from the GOS and loads the translation
        to the TLB.
[Figure 4 diagram: two panels, "X86 Page Translations" and "SPARC Page Translations", showing guest page tables, HV calls, the HW page table, and TLB operations.]
Figure 4. Page translation schemes used on x86 and SPARC architectures.
The memory management implementation for Sun xVM Server, Sun xVM Server with
HVM, VMware, and Logical Domains using these mechanisms is discussed in detail in
later sections of this paper.
I/O devices are typically managed by a special software module called the device driver
running in the kernel context. Because device types and device drivers vary widely,
the VMM either includes only a few device drivers or leaves device
management entirely to the GOS. In the latter case, because of existing device
architecture limitations (discussed later in the section), devices can only be exclusively
managed by one VM.
This constraint creates some challenges for I/O access by a VM, and limits the
• What devices are exported to a VM
• How devices are exported to a VM
• How each I/O transaction is handled by a VM and the VMM
Consequently, I/O has the most challenges in the areas of compatibility and
performance for virtual machines. In order to explain what devices are exported and
how they are exported, it is first necessary to understand the options available to
handle I/O transactions in a VM.
There are, in general, three approaches for I/O virtualization, as illustrated in Figure 5:
• Direct I/O (VM1 and VM3)
• Virtual I/O using I/O transaction emulation (VM2)
• Virtual I/O using device emulation (VM4)
[Figure 5 diagram: VM1 and VM3 (direct I/O, native drivers) and VM2 and VM4 (virtual I/O, virtual drivers) running over the VMM on a Sun x64 server with a network chip and a SCSI controller.]
Figure 5. Different I/O virtualization techniques used by virtual machine monitors.
For direct I/O, the VMM exports all or a portion of the physical devices attached to the
system to a VM, and relies on VMs to manage devices. The VM that has direct I/O
access uses the existing driver in the GOS to communicate directly with the device.
VM1 and VM3 in Figure 5 have direct I/O access to devices. VM1 is also a special I/O VM
that provides virtual I/O for other VMs, such as VM2, to access devices.
Virtual I/O is made possible by controlling the device types exported to a VM. There are
two different methods of implementing virtual I/O: I/O transaction emulation (shown
in VM2 in Figure 5) and device emulation (shown in VM4).
• I/O transaction emulation requires virtual drivers on both ends for each type of I/O
transaction (data and control functions). As shown in Figure 5, the virtual driver on
the client side (VM2) receives I/O requests from applications and forwards requests
through the VMM to the virtual driver on the server side (VM1); the virtual driver on
the server side then sends out the request to the device.
I/O transaction emulation is typically used in paravirtualization because the OS on
the client side needs to include the special drivers to communicate with its
corresponding driver in the OS on the server side, and needs to add kernel interfaces
for inter-domain communication using the VMM services. However, it is possible to
have PV drivers in an un-paravirtualized OS (full virtualization) for better I/O
performance. For example, Solaris 10, which is not paravirtualized, can include PV
drivers on a HVM-capable system to get better performance than that achieved using
device emulation drivers such as QEMU. (See “Sun xVM Server with HVM I/O
Virtualization (QEMU)” on page 71.)
I/O transaction emulation may cause application compatibility issues if the virtual
driver does not provide all data and control functions (for example, ioctl(2)) that
the existing driver does.
• Device emulation provides an emulation of a device type, enabling the existing
driver for the emulated device in a GOS to be used. The VMM exports emulated
device nodes to a VM so that the existing drivers for the emulated devices in a GOS
are used. By doing this, the VMM controls the driver used by a GOS for a particular
device type; for example, using the e1000g driver for all network devices. Thus, the
VMM can focus on the emulation of underlying hardware using one driver interface.
Driver accesses to I/O registers and ports in a GOS, which trap because the address is
invalid, are caught by the VMM and converted into accesses to the real device hardware. VM4 in
Figure 5 uses native OS drivers to access emulated devices exported by the VMM.
Device emulation is in general less efficient than I/O transaction emulation and more
limited in the platforms it supports. Device emulation does not require changes in the
GOS and, therefore, is typically used to provide full virtualization to a VM.
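The I/O transaction emulation path, with a virtual driver on each side, can be sketched as a pair of cooperating objects. The Python model below is illustrative only and does not reproduce the driver interfaces of Sun xVM Server or LDoms: `Channel`, `FrontendDriver`, and `BackendDriver` are hypothetical names, the disk is a dictionary, and the VMM's inter-domain notification is replaced by a direct method call.

```python
from collections import deque


class Channel:
    """Stand-in for the inter-domain communication service of the VMM."""

    def __init__(self):
        self.requests = deque()
        self.responses = deque()


class BackendDriver:
    """Virtual driver in the I/O VM (server side); it fronts the native driver."""

    def __init__(self, channel, disk):
        self.channel = channel
        self.disk = disk  # dict of block number -> data, standing in for the device

    def service(self):
        while self.channel.requests:
            op, block, data = self.channel.requests.popleft()
            if op == "write":
                self.disk[block] = data
                self.channel.responses.append(("ok", None))
            elif op == "read":
                self.channel.responses.append(("ok", self.disk.get(block)))
            else:
                # A transaction type the virtual driver does not implement.
                self.channel.responses.append(("unsupported", None))


class FrontendDriver:
    """Virtual driver in the client VM; it forwards requests instead of touching hardware."""

    def __init__(self, channel, backend):
        self.channel = channel
        self.backend = backend

    def submit(self, op, block, data=None):
        self.channel.requests.append((op, block, data))
        self.backend.service()  # a real VMM would notify the I/O VM instead
        return self.channel.responses.popleft()


if __name__ == "__main__":
    channel = Channel()
    frontend = FrontendDriver(channel, BackendDriver(channel, disk={}))
    print(frontend.submit("write", 7, b"payload"))
    print(frontend.submit("read", 7))
    print(frontend.submit("ioctl", 7))
```

The `unsupported` branch mirrors the compatibility concern noted above: a control function that the virtual driver pair does not carry is simply unavailable to the client VM, even though the native driver supports it.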
Virtual I/O, unlike direct I/O, requires additional drivers in either the I/O VM or the
VMM to provide I/O virtualization. This constraint:
• Limits the type of devices that are made available to a VM
• Limits device functionality
• Causes significant I/O performance overhead
While virtualization provides full application binary compatibility, I/O becomes a
trouble area in terms of application compatibility and performance in a VM. One
solution to the I/O virtualization issues is to allow VMs to directly access I/O, as shown
by VM3 in Figure 5.
Direct I/O access by VMs requires additional hardware support to ensure device
accesses by a VM are isolated and restricted to resources owned by the assigned VM. In
order to understand the industry effort to allow an I/O device to be shared among VMs,
it is necessary to examine device operations from an OS point of view.
The interactions between an OS and a device consist, in general, of three operations:
1. Programmed I/O (PIO): host-initiated data transfer. In PIO, a host OS maps a
virtual address to a piece of device memory and accesses the device memory using
CPU load/store instructions.
2. Direct Memory Access (DMA): device-initiated data transfer without CPU
involvement. In DMA, a host OS writes an address of its memory and the transfer
size to a device's DMA descriptor. After receiving an enable DMA instruction from
the host driver, the device performs the data transfer at a time it chooses and uses
interrupts to notify the host OS of DMA completion.
3. Interrupt: a device-generated asynchronous event notification.
Interrupts are already virtualized by all VMM implementations as is shown in the later
discussions for Sun xVM Server, Logical Domains, and VMware. The challenge of I/O
sharing among VMs therefore lies in the device handling for PIO and DMA. To meet the
challenges, PCI-SIG has released a suite of IOV specifications for PCI Express (PCIe)
devices, in particular the “Single Root I/O Virtualization and Sharing Specification”
(SRIOV) for device sharing and PIO operation, and the “Address
Translation Services (ATS)” specification for DMA operation.
Device Configuration and PIO
A PCI device exports its memory to the host through Base Address Registers (BARs) in its
configuration space. A device's configuration space is identified in the PCI configuration
address space as shown in Figure 6.
Bits 31-24  Reserved
Bits 23-16  Bus Number
Bits 15-11  Device Number
Bits 10-8   Function Number
Bits 7-2    Register Number
Bits 1-0    00
Figure 6. PCI configuration address space.
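The field layout in Figure 6 can be illustrated with a short sketch. The function name is hypothetical and the code only packs the four fields into the bit positions given in the figure; bits 31-24 and bits 1-0 are left as zero.

```python
def pci_config_address(bus, device, function, register):
    """Pack the Figure 6 fields into a 32-bit configuration address.

    bus:      bits 23-16 (0-255)
    device:   bits 15-11 (0-31)
    function: bits 10-8  (0-7)
    register: bits 7-2   (0-63), a doubleword index into the header
    """
    for name, value, limit in (("bus", bus, 256), ("device", device, 32),
                               ("function", function, 8),
                               ("register", register, 64)):
        if not 0 <= value < limit:
            raise ValueError(f"{name} out of range: {value}")
    return (bus << 16) | (device << 11) | (function << 8) | (register << 2)


# Register number 4 corresponds to byte offset 0x10 in the header.
print(hex(pci_config_address(bus=0, device=3, function=1, register=4)))
```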
A PCI device can have up to 8 physical functions (PFs). Each PF has its own 256-byte
configuration header. The BARs of a PCI function, which are 32 bits wide, are located at
offsets 0x10-0x24 in the configuration header. The host determines the size of the memory
region mapped by a BAR by writing a value of all 1's to the BAR and then reading the
value back. The address written to a BAR is the assigned starting address of the memory
region mapped to the BAR.
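The sizing step can be sketched as follows. This is a minimal illustration, not driver code: the function name is hypothetical, and it assumes a 32-bit memory BAR whose low four bits carry type information rather than address bits (a detail of the PCI BAR format that the text above does not spell out).

```python
def bar_region_size(readback):
    """Return the size in bytes of the region behind a 32-bit memory BAR.

    readback is the value read from the BAR after writing all 1's to it.
    The device hardwires the low address bits to zero, so the lowest
    writable address bit gives the region size.
    """
    address_bits = readback & 0xFFFFFFF0   # assumption: low 4 bits are type flags
    if address_bits == 0:
        return 0                           # BAR not implemented
    # Two's complement of the mask, truncated to 32 bits.
    return (~address_bits + 1) & 0xFFFFFFFF


print(hex(bar_region_size(0xFFFF0000)))
```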
To allow multiple VMs to share a PF, the SRIOV specification introduces the notion of a
Virtual Function (VF). Each VF shares some common configuration header fields with
the PF and other VFs. The VF BARs are defined in the PCIe's SRIOV extended capabilities
structure. A VF contains a set of non-shared physical resources, such as work queue and
data buffer, which are required to deliver function specific services. These resources are
exported through the VF BARs and are directly accessible by a VM.
The starting address of a VF's memory space is derived from the first VF's memory
space address and the size of VF's BAR. For any given VFx, the starting address of its
memory space mapped to BARa is calculated according to the following formula:
addr(VFx, BARa) = addr(VF1, BARa) + (x - 1) × (VF BARa aperture size)
where addr(VF1, BARa) is the starting address of BARa for the first VF and (VF BARa
aperture size) is the size of the VF BARa as determined by writing a value of all 1's to BARa
and reading the value back. Using this mechanism, a GOS in a VM is able to share the
device with other VMs while performing device operations that pertain only to the VM.
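As a worked example of the VF address formula, the sketch below computes the start of BARa for a given VF. The function name and the sample addresses are illustrative assumptions.

```python
def vf_bar_address(first_vf_bar_addr, aperture_size, x):
    """Starting address of BARa for VF number x (VFs are numbered from 1).

    first_vf_bar_addr: addr(VF1, BARa)
    aperture_size:     VF BARa aperture size in bytes
    """
    if x < 1:
        raise ValueError("VF numbers start at 1")
    return first_vf_bar_addr + (x - 1) * aperture_size


# Hypothetical values: VF1's BAR at 0xF0000000 with a 16 KB aperture.
for x in (1, 2, 3):
    print(x, hex(vf_bar_address(0xF0000000, 0x4000, x)))
```

Because every VF uses the same aperture size, the VF memory regions are laid out back to back, and each VM can be given access only to the region of its own VF.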
In many current implementations (especially in most x86 platforms), physical addresses
are used in DMA. Since a VM shares the same physical address space on the system
with other VMs, a VM might read/write to another VM's memory through DMA. For
example, a device driver in a VM might write the memory contents that belong to other
VMs to a disk and read the data back into the VM's memory. This causes a potential
breach in security and fault isolation among VMs.
To provide isolation during DMA operation, the ATS specification defines a scheme for a
VM to use the address mapped to its own physical memory for DMA operation. (This
approach is used in similar designs such as the IOMMU Specification and DMA
Remapping.) This DMA ATS enables DMA memory to be partitioned into multiple
domains, and keeps DMA transactions on one domain isolated from other domains.
Figure 7 shows device DMA with and without ATS. With DMA ATS, the DMA address is
like a virtual address that is associated with a context (VM). DMA transactions initiated
by a VM can only be associated with the memory owned by the VM. DMA ATS is a
chipset function that resides outside of the processor.
[Figure 7 diagram: two panels, "DMA without ATS" and "DMA with ATS". Each panel shows
a CPU, system memory containing DMA buffers, a north bridge, a south bridge, and PCI
devices. Without ATS, all addresses are labeled PA. With ATS, an IOMMU is present, the
device side is labeled DVA/GPA, and the memory side is labeled HPA.]
PA - Physical Address
HPA - Host Physical Address
DVA - Device Virtual Address
GPA - Guest Physical Address
Figure 7. DMA with and without address translation service (ATS).
As shown in Figure 7, the physical address (PA) is used on the hardware platform
without hardware support for ATS. For platforms with hardware support for ATS, a GOS
in a VM writes either a device virtual address (DVA) or a guest physical address (GPA) to
the device’s DMA engine. The device driver in the GOS loads the mappings of either the
DVA or GPA to the host physical address (HPA) in the hardware IOMMU. The HPA is the
address understood by the memory controller.
Note – The distinction between the HPA and GPA is described in detail in later
sections for Sun xVM Server (see “Physical Memory Management” on page 52), for
UltraSPARC LDoms (see “Physical Memory Allocation” on page 88), and for VMware
(see “Physical Memory Management” on page 103).
When the device performs a DMA operation, a DVA/GPA address appears on the PCI bus
and is intercepted by the hardware IOMMU. The hardware IOMMU looks up the
mapping for the DVA/GPA, finds the corresponding HPA, and moves the PCI data to
system memory pointed to by the HPA. Since either DVA or GPA of a VM has its own
address space, ATS allows system memory for DMA to be partitioned and, thus,
prevents a VM from accessing another VM’s DMA buffer.
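The lookup performed by the hardware IOMMU can be modeled in a few lines. This is a toy software model under stated assumptions: the class and method names are hypothetical, a 4 KB page size is assumed, and a real IOMMU uses hardware page tables rather than a dictionary per domain.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ToyIommu:
    """Per-domain DVA/GPA to HPA translation, as in the "DMA with ATS" case."""

    def __init__(self):
        # domain id -> {device page number -> host page number}
        self._tables = {}

    def map(self, domain, dva, hpa):
        """Install a page mapping on behalf of a domain (VM)."""
        if dva % PAGE_SIZE or hpa % PAGE_SIZE:
            raise ValueError("addresses must be page aligned")
        self._tables.setdefault(domain, {})[dva // PAGE_SIZE] = hpa // PAGE_SIZE

    def translate(self, domain, dva):
        """Translate a DMA address; fail if the domain has no mapping for it."""
        page, offset = divmod(dva, PAGE_SIZE)
        try:
            host_page = self._tables[domain][page]
        except KeyError:
            raise PermissionError(f"no mapping for {dva:#x} in {domain}") from None
        return host_page * PAGE_SIZE + offset


iommu = ToyIommu()
iommu.map("vm1", 0x1000, 0x7A000)
print(hex(iommu.translate("vm1", 0x1234)))
```

The same DVA presented on behalf of a different domain has no entry in that domain's table, which is the isolation property described above.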
The x86 Processor Architecture
This chapter provides background information on the x86 processor architecture that is
relevant to later discussions on Sun xVM Server (Chapter 5 on page 39), Sun xVM Server
with HVM (Chapter 6 on page 63), and VMware (Chapter 8 on page 97).
The x86 processor was not designed to run in a virtualized environment, and the x86
architecture presents some challenges for CPU and memory virtualization. This chapter
discusses the following x86 architecture features that are pertinent to virtualization:
• Protected Mode
The protected mode in the x86 processor utilizes two mechanisms, segmentation and
paging, to prevent a program from accessing a segment or a page with a higher
privilege level. Privilege level controls how the VMM and a GOS work together to
provide CPU virtualization.
• Segmented Architecture
The x86 segmented architecture converts a program's virtual addresses into linear
addresses that are used by the paging mechanism to map into physical memory.
During the conversion, the processor's privilege level is checked against the privilege
level of the segment for the address. Because of the segment cache technique
employed by the x86 processor, the VMM must ensure segment cache consistency
with the VM descriptor table updates. This x86 feature results in a significant amount
of work for the VMM of full virtualization products such as VMware.
• Paging Architecture
The x86 paging architecture provides page translations to the TLB and page tables.
Because the loading of page translations from page table to TLB is done
automatically by hardware on the x86 platform, page table updates have to be
performed by the privileged VMM. Several mechanisms are available for updating
this “hardware” page table by a VM.
• I/O and Interrupts
A device interacts with a host processor through PIO, DMA, and interrupts. PIO in the
x86 processor can be performed through either I/O ports using special I/O
instructions or through memory-mapped addresses with general purpose MOV and
string instructions. DMA in most x86 platforms is performed with physical
addresses. This can cause a security and isolation breach in a virtualized environment
because a VM may read or write another VM's memory contents. Interrupts and
exceptions are handled through the Interrupt Descriptor Table (IDT). There is only one
IDT on the system and access to the IDT is privileged. Therefore, interrupts have to be
handled by the VMM and virtualized to be delivered to a VM.
• Timer Devices
The x86 platform includes several timer devices for time keeping purposes.
Knowledge of the characteristics of these devices is important to fully understand
time keeping in a VM: Some timer devices are interrupt driven (which is virtualized
and delayed) and some require privileged access to update the device counter.
The x86 architecture protected mode provides a protection mechanism to limit access
to certain segments or pages and prevent unprivileged access. The processor's
segment-protection mechanism recognizes 4 privilege levels, numbered from 0 to 3
(Figure 8). The greater the level number, the lesser the privileges provided.
The page-level protection mechanism restricts access to pages based on two privilege
levels: supervisor mode and user mode. If the processor is operating at a current
privilege level (CPL) 0, 1, or 2, it is in a supervisor mode and the processor can access all
pages. If the processor is operating at a CPL 3, it is in a user mode and the processor can
access only user level pages.
Level 0 - OS Kernel
Level 3 - Applications
Figure 8. Privilege levels in the x86 architecture.
When the processor detects a privilege level violation, it generates a general-protection
exception (#GP). The x86 has more than 20 privileged instructions. These instructions
can be executed only when the current privilege level (CPL) is 0 (most privileged).
In addition to the CPL, the x86 has an I/O privilege level (IOPL) field in the EFLAGS
register that indicates the I/O privilege level of the currently running program. Some
instructions, while allowed to execute when the CPL is not 0, might generate a #GP
exception if the CPL value is higher than the IOPL. These instructions include CLI (clear
interrupt flag), STI (set interrupt flag), IN/INS (input from port), and OUT/OUTS (output
to port).
In addition to the above instructions, there are many instructions that, while not
privileged, reference registers or memory locations that would allow a VM to access a
memory region not assigned to that VM. These sensitive instructions will not cause a
#GP exception. The trap-and-emulate method for virtualization of a GOS, as stated in
“VMM Requirements” on page 9, does not apply to these instructions. However, these
instructions may impact other VMs.
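The CPL and IOPL comparison described above can be written as a one-line predicate. This is a simplified sketch with a hypothetical function name; it models only the privilege comparison and ignores other processor mechanisms (such as the I/O permission bitmap) that can still allow a port access.

```python
def fails_iopl_check(cpl, iopl):
    """True if an IOPL-sensitive instruction (CLI, STI, IN, OUT, ...)
    executed at privilege level cpl fails the IOPL comparison.

    Lower numbers mean more privilege, so the check fails when the
    CPL value is numerically higher than the IOPL.
    """
    for level in (cpl, iopl):
        if level not in (0, 1, 2, 3):
            raise ValueError("privilege levels range from 0 to 3")
    return cpl > iopl


# A user program (CPL 3) with IOPL 0 fails the check; a GOS kernel
# at CPL 1 passes it once the IOPL has been raised to 1.
print(fails_iopl_check(3, 0), fails_iopl_check(1, 1))
```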
In protected mode, all memory accesses must go through a logical address → linear
address (LA) → physical address (PA) translation scheme. The logical address to LA
translation is managed by the x86 segmentation architecture which divides a process's
address space into multiple protected segments.
A logical address, which is used as the address of an operand or of an instruction,
consists of a 16-bit segment selector and a 32-bit offset. A segment selector points to a
segment descriptor that defines the segment (see Figure 11 on page 24). The segment
base address is contained in the segment descriptor. The sum of the offset in a logical
address and the segment base address gives the LA. The Solaris OS directly maps an LA
to a process's Virtual Address (VA) by setting the segment base address to 0.
Segmentation: VA + Segment Base Address (always 0 in Solaris) → Linear address
Paging: Linear address → Physical Address
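The segmentation step amounts to a single addition. The sketch below uses a hypothetical function name and assumes 32-bit protected mode, where the result wraps at 32 bits.

```python
def linear_address(segment_base, offset):
    """Linear address = segment base + offset, truncated to 32 bits."""
    return (segment_base + offset) & 0xFFFFFFFF


# With a segment base of 0, as in the Solaris OS, the linear address
# is identical to the virtual address.
print(hex(linear_address(0, 0x08048000)))
# With a non-zero base, the offset is relocated by the base.
print(hex(linear_address(0x00010000, 0x234)))
```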
For each memory reference, a VA and a segment selector are provided to the processor
(Figure 9). The segment selector, which is loaded to the segment register, is used to
identify a segment descriptor for the address.
Bits 15-3  Index: selects one of up to 8K descriptors
Bit 2      TI: Table Indicator; 0=GDT, 1=LDT
Bits 1-0   RPL: Requested Privilege Level
Figure 9. Segment Selector
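The selector fields in Figure 9 can be extracted with shifts and masks. The names below are hypothetical; the bit positions are those shown in the figure.

```python
from collections import namedtuple

Selector = namedtuple("Selector", "index table rpl")


def decode_selector(value):
    """Split a 16-bit segment selector into index, table, and RPL."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a segment selector is 16 bits wide")
    return Selector(
        index=value >> 3,                       # bits 15-3
        table="LDT" if value & 0x4 else "GDT",  # bit 2 (TI)
        rpl=value & 0x3,                        # bits 1-0
    )


print(decode_selector(0x2B))
```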
Every segment register has a visible part and a hidden part, as illustrated in Figure 10
(see also , Volume 3A Section 3.4.3). The visible part is the segment selector, an
index that points into either the global descriptor table (GDT) or the local descriptor
table (LDT) to identify from which descriptor the hidden part of the segment register is
to be loaded. The hidden part includes portions containing segment descriptor
information loaded from the descriptor table.
Visible part: Selector
Hidden part: Type, Base Address, Limit, CPL
Figure 10. Each segment register has a visible and a hidden part.
The hidden fields of a segment register are loaded to the processor from a descriptor
table and are stored in the descriptor cache registers. The descriptor cache registers,
like the TLB, allow the hardware processor to refer to the contents of the segment
register's hidden part without further reference to the descriptor table. Each time a
segment register is loaded, the descriptor cache register gets fully loaded from the
descriptor table. Since each VM has its own descriptor table (for example, the GDT), the
VMM has to maintain a shadow copy of each VM’s descriptor table. A context switch to
a VM will cause the VM's shadow descriptor table to be loaded to the hardware
descriptor table. If the content of the descriptor table is changed by the VMM because
of a context switch to another VM, the segment is non-reversible, which means the
segment cannot be restored if an event such as a trap causes the segment to be saved
The Current Privilege Level (CPL) is stored in the hidden portion of the segment register.
The CPL is initially equal to the privilege level of the code segment from which it is
being loaded. The processor changes the CPL when program control is transferred to a
code segment with a different privilege level.
The segment descriptor contains the size, location, access control, and status
information of the segment that is stored in either the LDT or GDT. The OS sets segment
descriptors in the descriptor table and controls which descriptor entry to use for a
segment (Figure 11). See “CPU Privilege Mode” on page 45 for a discussion of setting
the segment descriptor in the Solaris OS.
High doubleword:
  Bits 31-24  Base 31:24
  Bit 23      G
  Bit 22      D/B
  Bit 21      L
  Bit 20      AVL
  Bits 19-16  SL
  Bit 15      P
  Bits 14-13  DPL
  Bit 12      S
  Bits 11-8   Type
  Bits 7-0    Base 23:16
Low doubleword:
  Bits 31-16  Base 15:00
  Bits 15-0   Segment Limit 15:00
G: Granularity
D/B: Default operation size (0=16-bit segment, 1=32-bit segment)
L: 64-bit code segment
AVL: Available for use by system software
Base: Segment base address
DPL: Descriptor Privilege Level
SL: Segment Limit 19:16
P: Segment present
S: Descriptor type (0=system, 1=code or data)
Type: Segment type
Figure 11. Segment descriptor.
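To make the descriptor layout concrete, the following sketch decodes an 8-byte code or data segment descriptor given as a 64-bit integer. The function name and the sample value are illustrative; bit 23 of the high doubleword is decoded as the granularity flag (G) of the x86 descriptor format.

```python
def decode_segment_descriptor(desc):
    """Decode a 64-bit x86 segment descriptor into its fields."""
    low = desc & 0xFFFFFFFF           # Base 15:00 and Segment Limit 15:00
    high = (desc >> 32) & 0xFFFFFFFF  # remaining fields
    return {
        # Base is scattered over three fields.
        "base": (low >> 16) | ((high & 0xFF) << 16) | (high & 0xFF000000),
        # Limit is scattered over two fields.
        "limit": (low & 0xFFFF) | (high & 0x000F0000),
        "type": (high >> 8) & 0xF,
        "s": (high >> 12) & 1,
        "dpl": (high >> 13) & 3,
        "p": (high >> 15) & 1,
        "avl": (high >> 20) & 1,
        "l": (high >> 21) & 1,
        "db": (high >> 22) & 1,
        "g": (high >> 23) & 1,
    }


# Hypothetical flat 32-bit user code segment: base 0, limit 0xFFFFF, DPL 3.
fields = decode_segment_descriptor(0x00CFFA000000FFFF)
print(fields["base"], hex(fields["limit"]), fields["dpl"])
```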
The privilege check performed by the processor recognizes three types of privilege
levels: requested privilege level (RPL), current privilege level (CPL), and descriptor
privilege level (DPL). A segment can be loaded if the DPL of the segment is numerically
greater than or equal to both the CPL and the RPL. In other words, a segment can be
accessed only by code that has equal or higher privilege level. Otherwise, a general-
protection fault exception, #GP, is generated and the segment register is not loaded.
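The segment load rule can be expressed directly as a comparison. This is a minimal sketch of the check as stated above, with a hypothetical function name.

```python
def can_load_segment(cpl, rpl, dpl):
    """True if a segment with the given DPL can be loaded.

    The DPL must be numerically greater than or equal to both the
    CPL and the RPL; otherwise the processor raises #GP.
    """
    return dpl >= max(cpl, rpl)


# User code (CPL 3) cannot load a kernel segment (DPL 0).
print(can_load_segment(cpl=3, rpl=3, dpl=0))
# Kernel code (CPL 0) using a selector with RPL 3 is also refused,
# because the RPL weakens the effective privilege.
print(can_load_segment(cpl=0, rpl=3, dpl=0))
```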
On 64-bit systems, linear address space (flat memory model) is used to create a
continuous, unsegmented address space for both kernel and application programs.
Segmentation is disabled in the sense that no VA to LA translation takes place, so
segment privilege checking cannot be applied to it. The only protection left to prevent a user application
from accessing kernel memory is through the page protection mechanism. This is why
the kernel of a GOS has to run in ring 3 (user mode in page level protection) on a 64-bit
When operating in the protected mode, the LA → PA translation is performed by the
paging hardware of the x86 processor. To access data in memory, the processor requires
the presence of a VA → PA translation in the TLB (in Solaris, LA is equal to VA), the page
table backing up the TLB entry, and a page of physical memory. For the x86 processor,
loading the VA → PA page translation from the page table to TLB is performed
automatically by the processor. The OS is responsible for allocating physical memory
and loading the VA → PA translation to the page table.
When the processor cannot load a translation from the page table, it generates a page
fault exception, #PF. A #PF exception on x86 processors usually means a physical page
has not been allocated, because the loading of the translation from the page table to
the TLB is handled by the processor (Figure 12).
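[Figure 12 diagram: TLB entry, page table, and physical memory. Loading a TLB entry from
the page table is performed by the processor; loading the page table for a page of
physical memory is performed by the OS.]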
Figure 12. Translations through the TLB are accomplished in the processor itself, while translations
through page tables are performed by the OS.
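The division of work between the processor and the OS can be modeled as follows. This is a toy model with hypothetical names: the page table is a flat dictionary rather than the multi-level structure used by real x86 hardware, and a 4 KB page size is assumed.

```python
PAGE_SHIFT = 12  # assumed 4 KB pages


class PageFault(Exception):
    """Models the #PF exception."""


class ToyMmu:
    def __init__(self, page_table):
        self.page_table = page_table  # maintained by the OS: VPN -> PFN
        self.tlb = {}                 # filled by the "hardware": VPN -> PFN

    def translate(self, va):
        vpn = va >> PAGE_SHIFT
        offset = va & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.tlb:
            # TLB miss: the processor walks the page table by itself.
            if vpn not in self.page_table:
                raise PageFault(hex(va))  # no translation: the OS must act
            self.tlb[vpn] = self.page_table[vpn]
        return (self.tlb[vpn] << PAGE_SHIFT) | offset


mmu = ToyMmu({0x10: 0x16D37})
print(hex(mmu.translate(0x10ABC)))
try:
    mmu.translate(0x20004)
except PageFault:
    mmu.page_table[0x20] = 0x99       # the OS allocates a page and maps it
    print(hex(mmu.translate(0x20004)))
```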
The x86 processor uses a control register, %cr3, to manage the loading of address
translations from the page table to the TLB. The base address of a process's page table
is kept by the OS and loaded to %cr3 when the process is switched in to run. On the
Solaris OS, %cr3 is kept in the kernel hat structure. Each address space, as, has one
hat structure. The mdb(1) command can be used to find the value of the %cr3
register of a process:
% mdb -k
S PID PPID PGID SID UID FLAGS ADDR NAME
R 9352 9351 9352 9352 28155 0x4a014000 fffffffec2ae78c0 bash
> fffffffec2ae78c0::print -t 'struct proc' ! grep p_as
struct as *p_as = 0xfffffffed15ba7e0
> 0xfffffffed15ba7e0::print -t 'struct as' ! grep a_hat
struct hat *a_hat = 0xfffffffed1718e98
> 0xfffffffed1718e98::print -t 'struct hat' ! grep hat_htable
htable_t *hat_htable = 0xfffffffed0f67678
> 0xfffffffed0f67678::print -t 'struct htable' ! grep ht_pfn
pfn_t ht_pfn = 0x16d37 // %cr3
When multiple VMs are running, the automatic loading of page translations from the
page table to the TLB actually makes the virtualization more difficult because all page
tables have to be accessible by the processor. As a result, page table updates can only
be performed by the VMM to enforce consistent memory usage on the system. “Page
Translations Virtualization” on page 14 discusses two mechanisms for managing page
tables by the VMM.
Another issue of the x86 paging architecture is related to the flushing of TLB entries.
Unlike many RISC processors which support a tagged TLB, the x86 TLB is not tagged. A
TLB miss results in a walk of the page table by the processor to find and load the
translation to the TLB. Since the TLB is not tagged, a change in the %cr3 register due to
a virtual memory context switch will result in invalidating all TLB entries. This adversely
affects performance if the VMM and VM are not in the same address space.
A typical solution to address the performance impact of TLB flushing is to reserve a
region of the VM address space for the VMM. With this solution, the VMM and VM can
run from the same address space and thus avoid a TLB flush when a VM memory
operation traps to the VMM. The latest CPUs from Intel and AMD with hardware
virtualization support include tagged TLBs, and consequently the translation of
different address spaces can co-exist in the TLB.
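The effect of tagging can be seen in a small model. This is an illustrative sketch only: the class is hypothetical, and it uses the page table base loaded into %cr3 as the tag, whereas real hardware uses a separate address space identifier.

```python
class ToyTlb:
    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = {}    # (address space tag, VPN) -> PFN
        self.current = None  # tag of the address space now running

    def load_cr3(self, cr3):
        """Switch address space; an untagged TLB loses all entries."""
        if not self.tagged:
            self.entries.clear()
        self.current = cr3

    def insert(self, vpn, pfn):
        self.entries[(self.current, vpn)] = pfn

    def lookup(self, vpn):
        """Return the cached PFN, or None on a TLB miss."""
        return self.entries.get((self.current, vpn))


for tagged in (False, True):
    tlb = ToyTlb(tagged)
    tlb.load_cr3(0xA000)          # VM address space
    tlb.insert(vpn=1, pfn=100)
    tlb.load_cr3(0xB000)          # trap to a VMM in its own address space
    tlb.load_cr3(0xA000)          # return to the VM
    print(tagged, tlb.lookup(1))  # miss when untagged, hit when tagged
```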
I/O and Interrupts
In general, x86 support for exceptions and I/O interrupts does not impose any
particular challenge to the implementation of a VMM. The x86 processor uses the
interrupt descriptor table (IDT) to provide a handler for a particular interrupt or
exception. Access to the IDT functions is privileged and, therefore, can only be
performed by the VMM. The Sun xVM Hypervisor for x86 provides a mechanism to relay
hardware interrupts to a VM through its event channel hypervisor calls (see “Event
Channels” on page 43).
The x86 processor allows device memory and registers to be accessed through either an
I/O address space or memory-mapped I/O. An I/O address space access is performed
using special I/O instructions such as IN and OUT. These instructions, while allowed to
execute when the CPL is not 0, will result in a #GP exception if the processor's CPL
value is higher than the I/O privilege level (IOPL). The Sun xVM Hypervisor for x86
provides a hypervisor call to set the IOPL, enabling a GOS to directly access I/O ports by
setting the IOPL to its privilege level.
When using memory-mapped I/O, any of the processor’s instructions that reference
memory can be used to access an I/O location with protection provided through
segmentation and paging. PIO, whether it is using I/O address space or memory-
mapped I/O, is normally uncacheable as device registers are usually accessed with
precise programming order. PIO uses addresses in a VM's address space and doesn't
cause any security and isolation issues.
The x86 processor uses physical addresses for DMA. DMA in a virtualized x86 system has
the following issues:
• A 32-bit, non-dual-address-cycle (DAC) PCI device cannot address beyond 4 GB of
memory.
• It is possible for one domain’s DMA to intrude into another domain's physical
memory, thus creating a risk of security violations.
The solution to the above issues is to have an I/O memory management unit (IOMMU)
as a part of an I/O bridge or north bridge that performs a translation of I/O addresses
(for example, an address that appears on the PCI bus) to machine memory addresses.
The I/O address can be any address that is recognized by the IOMMU. An IOMMU can
also improve the performance of large chunk data transfers by mapping a contiguous
I/O address to multiple physical pages in one DMA transaction. However, the IOMMU
may hurt the I/O performance for small data transfers because the DMA setup cost is
higher than that of DMA without an IOMMU.
For more details on the IOMMU, also known as hardware address translation service
(hardware ATS), see “I/O Virtualization” on page 16.
An OS typically uses several timer devices for different purposes. Timer devices are
characterized by their frequency granularity, frequency reliability, and ability to
generate interrupts and receive counter input. Understanding the characteristics of
timer devices is important for the discussion of timekeeping in a virtualized
environment, as the VMM provides virtualized timekeeping of some timers to its
overlaying VMs. Virtualized timekeeping has significant impact on the accuracy of time
related functions in the GOS and, thus, on the performance and results of time sensitive
An x86 system typically includes the following timer devices:
• Programmable Interval Timer (PIT)
The PIT uses a 1.193182 MHz crystal oscillator and has a 16-bit counter and counter input
register. The PIT contains three timers. Timer 0 can generate interrupts and is used by
the Solaris OS as the system timer. Timer 1 was historically used for RAM refreshes
and timer 2 for the PC speaker.
• Time Stamp Counter (TSC)
The TSC is a feature of the x86 architecture that is accessed via the RDTSC
instruction. The TSC, a 64-bit counter, changes with the processor speed. The TSC
cannot generate interrupts and has no counter input register. The TSC is the finest
grained of all timers and is used in the Solaris OS as the high resolution timer. For
example, the gethrtime(3C) function uses the TSC to return the current high-
resolution real time.
• Real Time Clock (RTC)
The RTC is used as the time-of-day (TOD) clock in the Solaris OS. The RTC uses a battery
as an alternate power source, enabling it to continue to keep time while the primary
source of power is not available. The RTC can generate interrupts and has a counter
input register. It is the coarsest-grained timer on the system.
• Local Advanced Programmable Interrupt Controller (APIC) Timer
The local APIC timer, which is a part of the local APIC, has a 32-bit counter and
counter input register. It can generate interrupts and has the same frequency as the
front side bus. The Solaris OS supports the use of the local APIC timer as one of the
• High Precision Event Timer (HPET)
The HPET is a relatively new timer available in some new x86 systems. The HPET is
intended to replace the PIT and the RTC for generating periodic interrupts. The HPET
can generate interrupts, is 64 bits wide, and has a counter input register. The Solaris
OS currently does not use the HPET.
• Advanced Configuration and Power Interface (ACPI) Timer
The ACPI timer has a 24-bit counter, can generate interrupts, and has no counter
input register. The Solaris OS does not use the ACPI timer.
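The PIT figures quoted in the list above determine the range of interrupt rates it can produce. The arithmetic can be sketched as follows; the function names are hypothetical, and the sketch assumes that the 16-bit counter allows at most 65536 counts per period.

```python
PIT_HZ = 1_193_182  # PIT input frequency from the text (1.193182 MHz)


def pit_divisor(target_hz):
    """Counter input value that makes the PIT fire close to target_hz."""
    divisor = round(PIT_HZ / target_hz)
    if not 1 <= divisor <= 1 << 16:
        raise ValueError(f"{target_hz} Hz is outside the 16-bit counter range")
    return divisor


def counter_wrap_seconds(bits, hz):
    """Time for a counter of the given width to wrap at the given frequency."""
    return (1 << bits) / hz


print(pit_divisor(100))                                  # a 100 Hz system tick
print(round(counter_wrap_seconds(16, PIT_HZ) * 1000, 2), "ms")
```

A 16-bit counter at this frequency wraps roughly every 55 ms, so the PIT cannot generate periodic interrupts slower than about 18 Hz without software assistance.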
SPARC Processor Architecture
This chapter provides background information on the SPARC processor architecture that
is relevant to later discussions on Logical Domains (Chapter 7 on page 79).
The SPARC (Scalable Processor Architecture) processor, first introduced in 1987, is a big-
endian RISC processor ISA. SPARC International (SI), an industry organization, was
established in 1989 to promote the open SPARC architecture. In 1994, SI introduced a
64-bit version of the SPARC processor as SPARC v9. The UltraSPARC processor, which is a
Sun-specific implementation of SPARC v9, was introduced in 1996 and has been
incorporated into all Sun SPARC platforms shipping today.
In 2005, Sun's UltraSPARC architecture was open sourced as the UltraSPARC
Architecture 2005 Specification. Included in this enhanced UltraSPARC 2005
specification is support for Chip-level Multithreading (CMT) for a highly threaded
processor architecture and a hyperprivileged mode that allows the hypervisor to
virtualize the processor to run multiple domains. The design of the UltraSPARC T1
processor, which is the first implementation of the UltraSPARC Architecture 2005
Specification, is also open sourced. The UltraSPARC T1 processor includes 8 cores with 4
strands in each core, providing a total of 32 strands per processor.
In August 2007, Sun announced the UltraSPARC T2 processor, the follow-up CMT
processor to the UltraSPARC T1 processor, and the OpenSPARC T2 architecture,
which is the open source version of the UltraSPARC T2 processor. Sun also released the
UltraSPARC Architecture 2007 specification, which adds a section for error handling
and expands the discussion for memory management. The UltraSPARC T2 processor has
several enhancements over the UltraSPARC T1 processor. These enhancements include
64 strands, per-core floating-point and graphics units, and integrated PCIe and 10 Gb
Ethernet (for more details see “Processor Components” on page 31).
The remainder of this chapter discusses the following features of the UltraSPARC T1/T2
processor architecture, and describes their effect on virtualization implementations:
• Processor privilege mode — The UltraSPARC 2005 specification defines a
hyperprivileged mode for the hypervisor operations.
• Sun4v Chip Multithreaded architecture — This feature enables the creation of up to
32 domains, each with its own dedicated strands, on an UltraSPARC T1 processor, and
up to 64 domains on an UltraSPARC T2 processor.
• Address Space Identifier (ASI)— The ASI provides functionality to control access to a
range of address spaces, similar to the segmentation used by x86 processors.
• Memory Management Unit (MMU) — The software-controlled MMU allows an
efficient redirection of page faults to the intended domain for loading translations.
• Trap and interrupt handling — Each strand (virtual processor) has its own trap and
interrupt priority registers. This functionality allows the hypervisor to re-direct traps
to the target CPU and enables the trap to be taken by the GOS's trap handler.
Note – The terms strand, hardware thread, logical processor, virtual CPU and virtual
processor are used by various documents to refer to the same concept. For
consistency, the term strand is used in this chapter.
Processor Mode of Operation
The UltraSPARC 2005 specification defines three privilege modes: non-privileged,
privileged, and hyperprivileged. In hyperprivileged mode, the processor can access all
registers and address spaces, and can execute all instructions. Instructions, registers,
and address spaces for privileged and non-privileged modes are restricted.
The processor operates in privileged mode when PSTATE.priv is set to 1 and
HPSTATE.hpriv is set to 0. The processor operates in hyperprivileged mode when
HPSTATE.hpriv is set to 1 (PSTATE.priv is ignored).
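The mode selection rule can be summarized in a short sketch. The function name is hypothetical, and the non-privileged case (both bits clear) is inferred from the two cases stated above.

```python
def privilege_mode(pstate_priv, hpstate_hpriv):
    """Return the processor mode for the given PSTATE.priv and
    HPSTATE.hpriv bit values."""
    if hpstate_hpriv:
        return "hyperprivileged"  # PSTATE.priv is ignored
    if pstate_priv:
        return "privileged"
    return "non-privileged"       # inferred: neither bit is set


for priv in (0, 1):
    for hpriv in (0, 1):
        print(priv, hpriv, privilege_mode(priv, hpriv))
```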
Table 1 lists the availability of instructions, registers, and address spaces for each of the
privilege modes, and includes information on where further details can be found in the
UltraSPARC Architecture 2005 Specification.
Table 1. Documentation describing the availability of components in the UltraSPARC processor.
Component      Location (a)         Comments
Instruction    Table 7-2            All instructions except SIR, RDHPR, and WRHPR (which
                                    require hyperprivilege to execute) can be executed from
                                    the privileged mode.
Registers      Chapter 5            There are seven hyperprivileged registers: HPSTATE,
                                    HTSTATE, HINTP, HTBA, HVER, HSTICK_CMPR, and STRAND_STS.
                                    These registers are used by the hypervisor in the
                                    hyperprivileged mode.
Address Space  Tables 9-1 and 10-1  ASIs 0x30-0x7F are for hyperprivileged access only. These
                                    ASIs are mainly for CMT control, MMU, TLB, and
                                    hyperprivileged
a. Location in the UltraSPARC Architecture 2005 Specification.
Based on the availability of instructions, registers, and the ASI in hyperprivileged mode,
the following functions of the hypervisor can be deduced:
• Reset the processor: SIR instruction
• Control hyperprivileged traps and interrupts: HTSTATE, HTBA, HINTP registers
• Control strand operation: ASI 0x41, and HSTICK_CMPR and STRAND_STS registers
• Manage MMU: ASI 0x50-0x5F
The UltraSPARC T1 processor contains eight cores, and each core has hardware
support for four strands. One FPU and one L2 cache are shared among all cores in the
processor. Each core has its own Level 1 instruction and data cache (L1 Icache and
Dcache) and TLB that are shared among all strands in the core. In addition, each strand
contains the following:
• A full register file with eight register windows and four sets of global registers (a total
of 160 registers: 8 * 16 registers per window, + 4 * 8 global registers)
• Most of the ASIs
• Ancillary privileged registers
• Trap queue with up to 16 entries
This hardware support in each strand allows the hypervisor to partition the processor
into 32 domains, with one strand for each domain. Each strand can execute instructions
separately without requiring a software scheduler in the hypervisor to coordinate the
Table 2 summarizes the association of processor components to their location in the
processor, core and strand.
Table 2. Location of key processor components in the UltraSPARC T1 processor.
Processor                Core                              Strand
• Floating Point Unit    • 6-stage instruction pipeline    • Register file with 160 registers
• L2 cache crossbar      • L1 Icache and Dcache            • Most of ASI
• L2 cache               • TLB                             • Ancillary state register (ASR)
                                                           • Trap registers
                                                           • Privileged registers
The UltraSPARC T2 processor is built upon the UltraSPARC T1 architecture. It has the
following enhancements over the UltraSPARC T1 processor:
• Eight strands per core (for a total of 64 strands)
• Two integer pipelines per core, with each integer pipeline supporting 4 strands
• A floating-point and graphics unit (FGU) per core
• Integrated PCI-E and 10 Gb/Gb Ethernet (System-on-Chip)
• Eight banks of 4 MB L2 cache
The UltraSPARC T2 has a total of 64 strands in 8 cores, and each core has its own
floating-point and graphics unit (FGU). This allows up to 64 domains to be created on
the UltraSPARC T2 processor. This design also adds integrated support for industry
standard I/O interfaces such as PCI-Express and 10 Gb Ethernet.
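The register count per strand and the domain limits quoted for the two processors follow from simple arithmetic. The short Python sketch below only restates the figures given in the text; the function and constant names are illustrative and are not part of any Sun software.

```python
# Per-strand register file, using the counts quoted in the text.
WINDOWS = 8                # register windows per strand
REGS_PER_WINDOW = 16       # registers counted per window
GLOBAL_SETS = 4            # sets of global registers
REGS_PER_GLOBAL_SET = 8    # registers in each global set


def registers_per_strand():
    """Total registers in one strand's register file."""
    return WINDOWS * REGS_PER_WINDOW + GLOBAL_SETS * REGS_PER_GLOBAL_SET


def max_domains(cores, strands_per_core):
    """Upper bound on domains when each domain owns one hardware strand."""
    return cores * strands_per_core


t1_domains = max_domains(cores=8, strands_per_core=4)   # UltraSPARC T1
t2_domains = max_domains(cores=8, strands_per_core=8)   # UltraSPARC T2
print(registers_per_strand(), t1_domains, t2_domains)
```

The one-domain-per-strand bound is the limit described in the text; a real configuration can assign several strands to a single domain.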
Table 3 summarizes the association of processor components to physical processor, core,
and strand.
Table 3. Location of key processor components in the UltraSPARC T2 processor.
Processor                  Core                                   Strand
• 8 banks 4 MB L2 cache    • 2 instruction pipelines (8 stages)   • Full register file with 8 windows
• L2 cache crossbar        • L1 Icache and Dcache                 • Most of ASI
• Memory controller        • TLB                                  • Ancillary state register (ASR)
• PCI-E                    • FGU (12 stages)                      • Privileged registers
• 10 Gb/Gb Ethernet
Address Space Identifier
Unlike x86 processors in 32-bit mode, which use segmentation to divide a process's
address space into several segments of protected address spaces, the SPARC V9
processor has a flat 64-bit address space. An address in the SPARC V9 processor is a
tuple consisting of an 8-bit address space identifier (ASI) and a 64-bit byte-address offset
within the specified address space. The ASI provides attributes of an address space,
including the following:
• Privileged or non-privileged
• Register or memory
• Endianness (for example, little-endian or big-endian)
• Physical or virtual address
• Cacheable or non-cacheable
The SPARC processor's ASI allows different types of address spaces (user virtual address
space, kernel virtual address space, processor control and status registers, etc.) to co-
exist as separate and independent address spaces for a given context. Unlike x86
processors in which user processes and the kernel share the same address space, user
processes and the kernel have their own address space on SPARC processors.
Access to these address spaces is protected by the ASI associated with each address
space. ASIs in the range 0x00-0x2F may be accessed only by software running in
privileged or hyperprivileged mode; ASIs in the range 0x30-0x7F may be accessed
only by software running in hyperprivileged mode. An access to a restricted (privileged
or hyperprivileged) ASI (0x00-0x7F) by non-privileged software will result in a
privileged_action exception.
Table 9-1 and Table 10-1 of the UltraSPARC Architecture 2005 Specification provide a
summary and description for each ASI.
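The ASI access rules above can be expressed as a small range check. The following Python sketch is illustrative only: the Mode enumeration and the function names are hypothetical, and treating ASIs 0x80-0xFF as unrestricted is an assumption inferred from the text, which describes only 0x00-0x7F as restricted.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Privilege modes, ordered from least to most privileged."""
    NONPRIVILEGED = 0
    PRIVILEGED = 1
    HYPERPRIVILEGED = 2


def required_mode(asi):
    """Return the least privileged mode that may use an 8-bit ASI."""
    if not 0 <= asi <= 0xFF:
        raise ValueError("an ASI is an 8-bit value")
    if asi <= 0x2F:
        return Mode.PRIVILEGED        # privileged or hyperprivileged
    if asi <= 0x7F:
        return Mode.HYPERPRIVILEGED   # hyperprivileged only
    return Mode.NONPRIVILEGED         # assumption: 0x80-0xFF are unrestricted


def access_allowed(asi, mode):
    """True if software running in `mode` may access `asi`."""
    return mode >= required_mode(asi)


print(access_allowed(0x41, Mode.PRIVILEGED))       # strand-control ASI from a guest kernel
print(access_allowed(0x41, Mode.HYPERPRIVILEGED))  # same ASI from the hypervisor
```

A check of this shape is what keeps a guest kernel, which runs in privileged mode, away from the ASIs the hypervisor uses for strand and MMU control.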
Memory Management Unit
The traditional UltraSPARC architecture supports two types of memory addressing:
• Virtual address (VA): managed by the GOS and used by user programs
• Physical address (PA): passed by the processor to the system bus when accessing
physical memory
The Memory Management Unit (MMU) of the UltraSPARC processor provides the
translation of VAs to PAs. This translation enables user programs to use a VA to locate
data in physical memory.
The SpitFire Memory Management Unit (sfmmu) is Sun's implementation of the
UltraSPARC MMU. The sfmmu hardware consists of Translation Lookaside Buffers (TLBs)
and a number of MMU registers:
• Translation Lookaside Buffer (TLB)
The TLB provides virtual to physical address translations. Each entry of the TLB is a
Translation Table Entry (TTE) that holds information for a single page mapping of
virtual to physical addresses. The format of the TTE is shown in Figure 13. The TTE
consists of two 64-bit words, representing the tag and data of the translation. The
privileged field, P, controls whether or not the page can be accessed by non-privileged
software.
• MMU registers
A number of MMU registers are used for accessing TLB entries, removing TLB entries
(demap), context management, handling TLB misses, and support for Translation
Storage Buffer (TSB) access. The TSB, an array of TTE entries, is a cache of translation
tables used to quickly reload the TLB. The TSB resides in the system memory and is
managed entirely by the OS. The UltraSPARC processors include some MMU
hardware registers for speeding up TSB access. The TLB miss handler first searches
the TSB for the translation. If the translation is not found in the TSB, the TLB miss
handler calls a more sophisticated (and slower) TSB miss handler to load the
translation into the TSB.
TTE Tag:  context_id [63:48], 000000 [47:42], va [41:0]
TTE Data: v [63], nfo [62], soft2 [61:56], taddr [55:13], ie [12], e [11], cp [10], cv [9], p [8], ep [7], w [6], soft [5:4], sz [3:0]
Figure 13. The translation lookaside buffer (TLB) is an array of translation table entries containing tag
and data portions.
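As an illustration of the two-word TTE layout in Figure 13, the following Python sketch extracts a few fields from a tag word and a data word. The bit positions are read from the figure; the helper names and the sample values are hypothetical, and only a subset of the fields is decoded.

```python
def bits(word, hi, lo):
    """Extract the inclusive bit range [hi:lo] from a 64-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_tte(tag, data):
    """Decode selected fields of a TTE (bit positions taken from Figure 13)."""
    return {
        "context_id": bits(tag, 63, 48),
        "va": bits(tag, 41, 0),
        "v": bits(data, 63, 63),        # valid
        "taddr": bits(data, 55, 13),    # target address bits
        "p": bits(data, 8, 8),          # privileged page
        "w": bits(data, 6, 6),          # writable
        "sz": bits(data, 3, 0),         # page size code
    }


# Made-up sample words, built field by field so the layout is visible.
sample_tag = (0x12 << 48) | 0x3ABC
sample_data = (1 << 63) | (0x1234 << 13) | (1 << 8) | (1 << 6) | 0x1
print(decode_tte(sample_tag, sample_data))
```

With the P bit set, as in the sample, the page is reserved for privileged software, which is how the field described in the text protects kernel mappings.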
A TLB hit occurs if both the context and virtual address match an entry in the TLB.
Address aliasing (multiple TLB entries with the same physical address) is permitted.
Unlike the x86 processor, the loading of page translations into the TLB is managed by
software through traps. In the event of a TLB miss, a trap is generated, and the trap
handler first tries to get the translation from the Translation Storage Buffer (TSB) (Figure 14).
The TSB, an in-memory array of translations, acts like a direct-mapped cache for the
TLB. If the translation is not present in the TSB, a TSB miss trap is generated. The TSB
miss trap handler uses a software lookup mechanism based on the hash memory entry
block structure, hme_blk, to obtain the TTE. If a translation is still not found in
hme_blk, the kernel generic trap handler is invoked to call the kernel function
pagefault() to allocate physical memory for the virtual address and load the
translation into the hme_blk hash structure.
Figure 14 depicts the mechanism for handling TLB misses in an unvirtualized domain.
Stages (left to right): TLB [processor MMU], TSB [TTE cache in memory], hme_blk [OS data structure], pagefault() [OS function]
Forward arrows: TLB miss, TSB miss, allocate memory
Return arrows: TTE load to TLB, TTE load to TSB, hat_memload()
Figure 14. Handling a TLB miss in an unvirtualized domain, UltraSPARC T1/T2 processor architecture.
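The lookup chain in Figure 14 can be modeled in a few lines. The Python sketch below is a toy simulation, not Solaris code: the Mmu class, its TSB index function, and the page allocator are all hypothetical, and a dictionary stands in for the hme_blk hash structure.

```python
class Mmu:
    """Toy model of the TLB -> TSB -> hme_blk -> pagefault() chain."""

    def __init__(self, tsb_size=4):
        self.tlb = {}                    # (context, vpage) -> ppage
        self.tsb = [None] * tsb_size     # direct-mapped cache of translations
        self.hme_hash = {}               # stands in for the hme_blk hash
        self.next_free_page = 0x100      # toy physical page allocator
        self.log = []                    # which handlers ran, in order

    def _tsb_index(self, context, vpage):
        return (vpage ^ context) % len(self.tsb)   # hypothetical index function

    def pagefault(self, key):
        """Allocate a page and record the mapping (the hat_memload() role)."""
        self.log.append("pagefault")
        ppage = self.next_free_page
        self.next_free_page += 1
        self.hme_hash[key] = ppage
        return ppage

    def tsb_miss(self, key):
        """Slow path: search the hash structure, then refill the TSB."""
        self.log.append("tsb_miss")
        ppage = self.hme_hash.get(key)
        if ppage is None:
            ppage = self.pagefault(key)
        self.tsb[self._tsb_index(*key)] = (key, ppage)
        return ppage

    def tlb_miss(self, key):
        """Fast path: try the TSB first, then fall back to the slow path."""
        self.log.append("tlb_miss")
        slot = self.tsb[self._tsb_index(*key)]
        ppage = slot[1] if slot is not None and slot[0] == key else self.tsb_miss(key)
        self.tlb[key] = ppage
        return ppage

    def translate(self, context, vpage):
        key = (context, vpage)
        return self.tlb[key] if key in self.tlb else self.tlb_miss(key)


mmu = Mmu()
first = mmu.translate(context=1, vpage=0x40)    # cold: misses at every level
mmu.tlb.clear()                                 # pretend the TLB entry was evicted
second = mmu.translate(context=1, vpage=0x40)   # refilled from the TSB alone
print(mmu.log)
```

The second lookup logs only a TLB miss, which is the point of the TSB: most TLB misses are resolved from the in-memory cache without reaching the slower hash search.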
Similarly, Figure 15 depicts how TLB misses are handled in a virtualized domain. In a
virtualized environment, the UltraSPARC T1/T2 processor adds a third type of memory
address, the real address (RA), in addition to the VA and PA. Real addresses, which are
equivalent to the physical memory in Sun xVM Server (see
“Physical Memory Management” on page 52), are provided to the GOS as the
underlying physical memory allocated to it. The GOS-maintained TSBs are used to
translate VAs into RAs. The hypervisor manages the translation from RA to PA.
Stages (left to right): TLB [processor MMU], PA<-RA [managed by hypervisor], TSB [TTE cache in memory], hme_blk [OS data structure], pagefault() [OS function]
Forward arrows: TLB miss, TLB miss, TSB miss, allocate memory
Return arrows: TTE load to TLB, TTE load to TSB, hat_memload()
Figure 15. Handling a TLB miss in a virtualized domain, UltraSPARC T1/T2 processor architecture.
Applications, which are non-privileged software, use only VAs. The OS kernel, which is
privileged software, uses both VAs and RAs. The hypervisor, which is hyperprivileged
software, normally uses PAs. “Physical Memory Allocation” on page 88 discusses in
detail the types of memory addressing used in LDoms.
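The two-stage translation described above (VA to RA by the guest, RA to PA by the hypervisor) can be sketched as two table lookups. This Python example is illustrative only: the 8 KB page size, the table contents, and all names are assumptions chosen for the example.

```python
PAGE_SHIFT = 13                      # assumption: 8 KB pages, for illustration only
OFFSET_MASK = (1 << PAGE_SHIFT) - 1

guest_tsb = {0x10: 0x2}              # virtual page -> real page (owned by the GOS)
hypervisor_map = {0x2: 0x7F}         # real page -> physical page (owned by the hypervisor)


def _translate(table, addr):
    """Swap the page number using `table` and keep the page offset."""
    page, offset = addr >> PAGE_SHIFT, addr & OFFSET_MASK
    return (table[page] << PAGE_SHIFT) | offset


def va_to_ra(va):
    """Stage 1: the guest OS view, virtual address to real address."""
    return _translate(guest_tsb, va)


def ra_to_pa(ra):
    """Stage 2: the hypervisor view, real address to physical address."""
    return _translate(hypervisor_map, ra)


def va_to_pa(va):
    """What the TLB ultimately holds: a direct VA-to-PA translation."""
    return ra_to_pa(va_to_ra(va))


va = (0x10 << PAGE_SHIFT) | 0x123
print(hex(va_to_ra(va)), hex(va_to_pa(va)))
```

Because the guest only ever sees the first table, the hypervisor can change the real-to-physical mapping without the guest's cooperation.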
The UltraSPARC T2 processor adds a hardware table walk for loading TLB entries. The
hardware table walk accesses the TSBs to find TTEs that match the virtual address and
context ID of the request. Since a GOS cannot access or control physical memory, the
TTEs in the TSBs controlled by a GOS contain real page numbers, not physical page
numbers (see “Physical Memory Allocation” on page 88). TTEs in the TSBs controlled by
the hypervisor can contain real page numbers or physical page numbers. The
hypervisor performs the RA-to-PA translation within the hardware table walk to permit
the hardware table walk to load GOS TTEs into the TLB for VA-to-PA translation.
In the SPARC processor, a trap transfers software execution from one privilege mode to
another privilege mode at the same or a higher level. The only exception is that
non-privileged mode cannot trap to non-privileged mode. A trap can be generated
by the following methods:
• Internally by the processor (memory faults, privileged exceptions, etc.)
• Externally generated by I/O devices (interrupts)
• Externally generated by another processor (cross calls)
• Software generated (for example, the Tcc instruction)
A trap is associated with a Trap Type (TT), a 9-bit value. (TT values 0x180-0x1FF are
reserved for future use.) The transfer of software execution occurs through a trap table
that contains an array of TT handlers indexed by the TT value. Each trap table entry is
32 bytes in length and contains the first eight instructions of the TT handler. When a
trap occurs, the processor gets the TT from the TT register and the trap table base
address (TBA) from the TBA register. After saving the current execution state and
updating some registers, the processor starts to execute the instructions in the trap
table entry.
The SPARC processors support nested traps using a trap level (TL). The maximum TL
(MAXTL) value is typically in the range of 2-6, and depends on the processor; in
UltraSPARC T1/T2 processors, MAXTL is 6. Each trap level has one set of trap stack
control registers: trap type (TT), trap program counter (TPC), trap next program
counter (TNPC), and trap state (TSTATE). These registers provide trap software
execution state and control for the current TL. The ability to support nested traps in
SPARC processors makes the implementation of an OS trap handler easier and more
efficient, as the OS doesn't need to explicitly save the current trap stack information.
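The trap table indexing and the per-level trap registers described above can be sketched as follows. This Python model is a simplification under stated assumptions: it computes the handler address as TBA plus TT times the entry size, ignoring any additional address bits real hardware folds in (for example, to distinguish traps taken at TL > 0), and it raises an error instead of modeling what the processor does when MAXTL would be exceeded. The class and function names, and the TT values used in the example, are hypothetical.

```python
TRAP_ENTRY_BYTES = 32    # each entry holds eight 4-byte instructions
MAXTL = 6                # value the text gives for UltraSPARC T1/T2


def handler_address(tba, tt):
    """Address of the trap table entry for trap type `tt` (simplified)."""
    if not 0 <= tt <= 0x1FF:
        raise ValueError("TT is a 9-bit value")
    return tba + tt * TRAP_ENTRY_BYTES


class TrapStack:
    """One set of TT/TPC/TNPC/TSTATE registers per trap level."""

    def __init__(self):
        self.tl = 0
        self.frames = {}     # trap level -> saved register values

    def take_trap(self, tt, pc, npc, state):
        if self.tl >= MAXTL:
            raise RuntimeError("trap level would exceed MAXTL")
        self.tl += 1
        self.frames[self.tl] = {"TT": tt, "TPC": pc, "TNPC": npc, "TSTATE": state}

    def return_from_trap(self):
        if self.tl == 0:
            raise RuntimeError("no trap to return from")
        frame = self.frames.pop(self.tl)
        self.tl -= 1
        return frame["TPC"], frame["TNPC"]


stack = TrapStack()
stack.take_trap(tt=0x68, pc=0x1000, npc=0x1004, state=0)   # arbitrary first trap
stack.take_trap(tt=0x31, pc=0x2000, npc=0x2004, state=0)   # nested trap in the handler
print(stack.tl, hex(handler_address(0x100000, 0x60)))
```

Because each level has its own register set, the nested trap does not overwrite the first trap's saved state, which is why the handler does not need to save it explicitly.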
On UltraSPARC T1/T2 processors, each strand has a full set of trap control and stack
registers which include TT, TL, TPC, TNPC, TSTATE, HTSTATE (hyperprivileged trap
state), TBA, HTBA (hyperprivileged trap base address), and PIL (priority interrupt
level). This design feature allows each strand to receive traps independently of other
strands. This capability significantly helps trap handling and management by the
hypervisor, as traps are delivered to a strand without being queued up in the hypervisor.
On SPARC platforms, interrupt requests are delivered to the CPU as traps. Traps 0x041
through 0x04F are used for Priority Interrupt Level (PIL) interrupts, and trap 0x60 is
used for the vector interrupt. There are 15 interrupt levels for PIL interrupts. Interrupts
are serviced according to their PIL, with higher PILs having higher priority. The
vector interrupt is used to support the data-bearing vector interrupt, which allows a
device to include its private data in the interrupt packet (also known as the mondo
vector). With vector interrupt, device CSR access can be eliminated and the complexity
of device hardware can be reduced.
PIL interrupts are delivered to the processor through the ASR's SOFTINT_REG register.
The SOFTINT_REG register contains a 15-bit int_level field. When a bit in this field
is set, a trap is generated and the PIL of the trap corresponds to the location of the bit
in that field. There is one SOFTINT_REG for each strand.
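The relationship between the int_level field, the interrupt level, and the trap type can be sketched as below. This Python example models int_level as a standalone 15-bit value whose lowest bit corresponds to level 1; the position of the field inside the actual register is not given in the text and is not modeled. The function names are hypothetical.

```python
PIL_TT_BASE = 0x40    # PIL interrupt n is delivered as trap 0x40 + n (0x41-0x4F)


def pending_levels(int_level):
    """List the interrupt levels (1-15) whose bit is set in the field."""
    if not 0 <= int_level < (1 << 15):
        raise ValueError("int_level is a 15-bit field")
    return [level for level in range(1, 16) if int_level & (1 << (level - 1))]


def highest_pending(int_level):
    """Return (level, trap_type) for the highest pending level, or None."""
    levels = pending_levels(int_level)
    if not levels:
        return None
    level = levels[-1]                  # higher PILs have higher priority
    return level, PIL_TT_BASE + level


field = (1 << 2) | (1 << 9)             # levels 3 and 10 pending
print(pending_levels(field), highest_pending(field))
```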
In LDoms, the interrupt delivery from an I/O device to a GOS is a two-step process:
• An I/O device sends an interrupt request using the vector interrupt (trap 0x60) to the
hypervisor. The hypervisor inserts the interrupt request into the interrupt queue of
the target virtual processor.
• The target processor receives the interrupt request on its interrupt queue through
trap 0x7D (for device interrupts) or 0x7C (for cross calls), and schedules an interrupt to itself to
be processed at a later time by setting bits in the privileged SOFTINT register, which
causes a PIL interrupt (trap 0x41-0x4F). For more details on interrupt delivery, see
“Trap and Interrupt Handling” on page 85.
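The two-step delivery can be sketched as a small simulation. The Python model below is hypothetical throughout: the class names, the request dictionary, and the idea that each request carries the PIL at which the guest will service it are assumptions made for the example. Only the trap numbers come from the text.

```python
from collections import deque

VECTOR_INTERRUPT_TT = 0x60   # device to hypervisor
DEVICE_QUEUE_TT = 0x7D       # hypervisor to guest: device interrupt queued
PIL_TT_BASE = 0x40           # PIL interrupt n arrives as trap 0x40 + n


class VirtualCpu:
    def __init__(self):
        self.interrupt_queue = deque()
        self.softint = 0         # modeled 15-bit int_level field
        self.trap_log = []

    def queue_trap(self):
        """Guest handler for trap 0x7D: defer the work via a PIL interrupt."""
        self.trap_log.append(DEVICE_QUEUE_TT)
        while self.interrupt_queue:
            request = self.interrupt_queue.popleft()
            self.softint |= 1 << (request["pil"] - 1)

    def service_softint(self):
        """Take pending PIL interrupts, highest level first."""
        taken = []
        for level in range(15, 0, -1):
            bit = 1 << (level - 1)
            if self.softint & bit:
                self.softint &= ~bit
                taken.append(PIL_TT_BASE + level)
        self.trap_log.extend(taken)
        return taken


class Hypervisor:
    def __init__(self):
        self.trap_log = []

    def vector_interrupt(self, vcpu, request):
        """Step 1: receive trap 0x60 and queue the request for the target CPU."""
        self.trap_log.append(VECTOR_INTERRUPT_TT)
        vcpu.interrupt_queue.append(request)
        vcpu.queue_trap()        # step 2 begins: the guest sees trap 0x7D


hv = Hypervisor()
cpu = VirtualCpu()
hv.vector_interrupt(cpu, {"device": "nic0", "pil": 6})   # hypothetical device and PIL
serviced = cpu.service_softint()
print([hex(t) for t in hv.trap_log], [hex(t) for t in cpu.trap_log])
```

In this model the guest never handles trap 0x60 itself; it only sees the queue notification and the PIL interrupt it schedules for itself.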
Hardware Virtualization Implementations
• Chapter 5: Sun xVM Server (page 39)
• Chapter 6: Sun xVM Server with Hardware VM (HVM) (page 63)
• Chapter 7: Logical Domains (page 79)
• Chapter 8: VMware (page 97)
39 Sun xVM Server Sun Microsystems, Inc.
Sun xVM Server
Sun xVM Server is a paravirtualized Solaris OS that incorporates the Xen open source
community work. The open source VMM, Xen, was originally developed by the Systems
Research Group of the University of Cambridge Computer Laboratory, as part of the UK-
EPSRC funded XenoServers project. The first versions of Xen, targeted at the Linux
community for the x86 processor, required the Linux kernel to be specifically modified
to run on the Xen VMM. This OS paravirtualization made it impossible to run Windows
on early versions of Xen, because Microsoft did not permit the Windows software to be
modified.
In December 2005 the Xen development team released Xen 3.0, the first version of its
VMM that supported hardware-assisted virtual machines (HVM). With this new version,
an unmodified OS could be hosted on the Intel VT-x and AMD-V (Pacifica) processors.
Xen 3.0 eliminated the need for paravirtualization and enabled Microsoft Windows to
run in a Xen environment side-by-side with Linux and the Solaris OS.
Xen 3.0 supports the x86 CPU both with HVM and without HVM. Xen 3.0 also extends
support for symmetric multiprocessing, 64-bit operating systems, and up to 64 GB RAM
allowed by the x86 physical address extension (PAE) in 32-bit mode.
HVM technology affects the Xen implementation in many ways. This chapter discusses
the architecture and design of Sun xVM Server, which does not leverage the processor
HVM feature. Chapter 6 discusses Sun xVM Server for x86 processors with HVM support
(Sun xVM Server with HVM).
Note – Sun xVM Server includes support for the Xen open source community work on
the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform. In this
paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors,
Sun xVM Server is specifically used to refer to the Sun hardware virtualization
product for the x86 platform, and LDoms is used to refer to the Sun hardware
virtualization product for the UltraSPARC T1 and T2 platforms.
This chapter is organized as follows:
• “Sun xVM Server Architecture Overview” on page 40 provides an overview of the Sun
xVM Server architecture.
• “Sun xVM Server CPU Virtualization” on page 45 discusses the CPU virtualization
employed by Sun xVM Server.
• “Sun xVM Server Memory Virtualization” on page 52 describes memory management
in Sun xVM Server.
• “Sun xVM Server I/O Virtualization” on page 56 discusses the I/O virtualization used
in Sun xVM Server.
Sun xVM Server Architecture Overview
A Sun xVM Server virtualized system consists of an x86 system, a VMM, a control VM
running Sun xVM Server (Dom0), and zero or more VMs (DomU), as shown in Figure 16.
The Sun xVM Hypervisor for x86, the VMM of the Sun xVM Server system, manages
hardware resources and provides services to the VMs. Each VM, including Dom0, runs
an instance of a guest operating system (GOS) and is capable of communicating with
the VMM through a set of hypervisor calls.
[Figure 16 layers, top to bottom]
Dom0: applications and domain manager, on a guest OS
DomU (three shown): guest applications, each on a guest OS
Interfaces: Scheduler, Event Channel, Console IF, XenStore, Hypercalls, Grant Tables
Sun xVM Hypervisor for x86
Sun x64 Server
Figure 16. A Sun xVM Server virtualized system consists of a VMM, a control VM (Dom0), and zero or
more VMs (DomU).
The Dom0 VM has some unique characteristics not available in other VMs:
• First VM started by the VMM
• Able to directly access I/O devices
• Runs domain manager to create, start, stop, and configure other VMs
• Provides I/O access service to other VMs (DomU)
Each DomU VM runs an instance of a paravirtualized GOS, and gets VMM services
through a set of hypercalls. Access to I/O devices from each DomU VM is provided by
drivers in Dom0.