  • This module describes the internals of the two Xen flavours: paravirtualization and HVM. Architecture diagrams and some source code are also provided. Other VMM solutions, such as KVM, VMware and OpenVZ, are described at a high level.
  • This slide describes the most important characteristics of the Xen hypervisor.
  • This slide presents the major components of Xen.
  • This slide describes the problem a hypervisor has to solve: how to manage privileged instructions. Xen uses two mechanisms, paravirtualization and HVM; these are described in the following sections.
  • Domain 0 handles the direct hardware connection for both paravirtualized VMs and HVMs, using different mechanisms. The patched Linux kernel is provided in the Xen package inside the xen-sparse folder.
  • Xen provides several tools for managing virtual machines. These tools reside in Domain 0.
  • This slide provides a high-level description of the Xen-specific components in a paravirtualized environment. Details are provided in the paravirtualized OS section.
  • This slide describes all the possible states a VM can be in. OFF: the VM has not been executed yet. RUNNING: the VM is runnable (it could be actively running, or pending scheduling). PAUSED: Xen can pause a VM to RAM, that is, it does not run and is not considered for scheduling. SUSPENDED: Xen can also pause a VM and store it on disk, that is, all the state needed for further execution is written to disk, so the VM can be restarted later on the same hardware or even on different hardware. Different methods incur different overhead; the next slide describes an analysis performed with Linux and Windows OSes on an ordinary notebook with VT-x capabilities.
  • Booting from scratch: the measurements for this approach were taken on Windows XP and Linux virtual machines with 512 MB of RAM, each installed in a 4 GB disk partition. One important difference between the Linux and Windows virtual machines is that the Windows virtual machine runs in full virtualization mode whereas the Linux virtual machine is paravirtualized. Also, the Linux machine was booted with the minimum number of services and without a graphical window manager, whereas the Windows XP measurements were taken booting the default system. Restoring from a saved state: another approach is to use Xen's save operation to store a snapshot of a running VM in a file. This snapshot can be used to restart the VM at a later time with Xen's restore operation. Once saved, the virtual machine is no longer running on the system, so the memory allocated to it is freed for other virtual machines to use [6]. When the saved virtual machine is restored with Xen's restore operation, it is loaded into memory with the same state it had when the snapshot was taken. Basically, with this approach the entire memory assigned to the virtual machine is saved to a file and later restored from the file back to memory. As expected, the time needed to restore the virtual machine depends on the size of its virtual memory, so the main tests for this approach were based on varying the amount of memory assigned to the virtual machine. Another parameter that affects the resume time is the physical location of the saved file: if a RAM disk partition is used instead of a hard disk to store the snapshot file, the resume time decreases, but this reduces the total memory available to the whole system.
  • VM-Pool: the idea is to have a pool of virtual machines already booted and ready for execution, but in a "paused" state. These virtual machines keep their RAM but do not use processor time, interrupts or other resources. When a request arrives to fetch a clean virtual machine from the pool, one of these dormant virtual machines is awakened and handed to the requestor, so a virtual machine becomes available almost instantaneously. The VM-Pool applies the thread pool design pattern to the virtualization field. The difference with a pool of threads is that threads are generally homogeneous units of computing power, while virtual machines have more personality and are not necessarily homogeneous; each could have its own operating system, network configuration, disk, etc. Measurements: we took measurements to compare the performance gained using the VM-Pool for the dynamic spawning of virtual machines against the traditional approach. We measured two operations: the get operation, which retrieves a virtual machine from the pool, and the release operation, which gives a virtual machine back to the pool. We also measured the initialization time of the VM-Pool. The VM-Pool module can boot each virtual machine either from scratch or by resuming from a snapshot file of a previously running virtual machine. Booting "from scratch" corresponds to Xen's 'create' operation; the "resume" booting method corresponds to Xen's 'restore' operation. The resume method avoids booting the operating system in the virtual machine, which is in general very time-consuming.
  • Paravirtualization was created to work around the lack of virtualization capabilities in the x86 architecture.
  • Describes the overhead created by the additional layer executed by the hypervisor.
  • It was possible to make this decision (supporting hardware-assisted virtualization platforms only) because, at the start of the project, such (cheap) platforms were already widely available.
  • A KVM guest is seen as a normal Linux process from the point of view of the Linux kernel, not from the point of view of the guest OS.
  • Checkpointing and live migration (from the OpenVZ Wiki): CPT is an extension to the OpenVZ kernel which can save the full state of a running VE and restore it later on the same or on a different host, in a way transparent to running applications and network connections. This technique has several applications, the most important being live (zero-downtime) migration of VEs and taking an instant snapshot of a running VE for later resume, i.e. checkpointing. Before CPT, it was only possible to migrate a VE through a shutdown and subsequent reboot. That procedure not only introduces quite a long downtime for network services, it is also not transparent for clients using the VE, making migration impossible when clients run tasks that are not tolerant to shutdowns. Compared with this old scheme, CPT allows migration of a VE in a way that is essentially invisible both to users of the VE and to external clients using network services located inside it. It still introduces a short delay in service, required for the actual checkpoint/restore of the processes, but this delay is indistinguishable from a short interruption of network connectivity. Online migration: a special utility, vzmigrate, is included in the OpenVZ distribution to support VE migration. With it one can perform live (a.k.a. online) migration, i.e. during migration the VE "freezes" for some time, and after migration it continues to work as though nothing had happened. Online migration can be performed with: vzmigrate --online VEID. During online migration all VE private data is saved to an image file, which is transferred to the target host. For vzmigrate to work without asking for a password, the ssh public keys from the source host should be placed in the destination host's /root/.ssh/authorized_keys file; in other words, the command ssh root@host should not ask for a password (see ssh keys for more info). Manual checkpoint and restore functions: vzmigrate is not strictly required for online migration; the vzctl utility, combined with some file system backup tools, provides enough power to do all the tasks. A VE can be checkpointed with: vzctl chkpnt VEID --dumpfile. This command saves the complete state of a running VE to the dump file and stops the VE. If the --dumpfile option is not set, vzctl uses the default path /vz/dump/Dump.VEID. After this it is possible to restore the VE to the same state by executing: vzctl restore VEID --dumpfile. If the dump file and file system are transferred to another hardware node, the same command restores the VE there with equal success. It is a critical requirement that the file system at the moment of restore be identical to the file system at the moment of checkpointing; if this requirement is not met, depending on the severity of the changes, the restoration can be aborted or the processes inside the VE can see it as external corruption of open files. When a VE is restored on the same node where it was checkpointed, it is enough not to touch the file system accessible by the VE. When a VE is transferred to another node, the VE file system must be synchronized before restore; vzctl does not provide this functionality, and external tools (e.g. rsync) are required.

Presentation Transcript

  • VMMs / Hypervisors Intel Corporation 21 July 2008
  • Agenda
    • Xen internals
      • High level architecture
      • Paravirtualization
      • HVM
    • Others
      • KVM
      • VMware
      • OpenVZ
  • Xen Overview
  • Xen Project bio
    • Xen project was created in 2003 at the University of Cambridge Computer Laboratory in what's known as the Xen Hypervisor project
      • Led by Ian Pratt with team members Keir Fraser, Steven Hand, and Christian Limpach.
      • This team along with Silicon Valley technology entrepreneurs Nick Gault and Simon Crosby founded XenSource which was acquired by Citrix Systems in October 2007
    • The Xen® hypervisor is an open source technology, developed collaboratively by the Xen community and engineers (AMD, Cisco, Dell, HP, IBM, Intel, Mellanox, Network Appliance, Novell, Red Hat, SGI, Sun, Unisys, Veritas, Voltaire, and of course, Citrix)
    • Xen is licensed under the GNU General Public License
    • Xen supports Linux 2.4, 2.6, Windows and NetBSD 2.0
  • Xen Components
    • A Xen virtual environment consists of several modules that provide the virtualization environment:
    • Xen Hypervisor - VMM
    • Domain 0
    • Domain Management and Control
    • Domain User, can be one of:
      • Paravirtualized Guest: the kernel is aware of virtualization
      • Hardware Virtual Machine Guest: the kernel runs natively
    [Diagram: the Xen hypervisor (VMM) hosting Domain 0 (Domain Management and Control) and several Domain U guests, both paravirtual and HVM]
  • Xen Hypervisor - VMM
    • The hypervisor is Xen itself.
    • It goes between the hardware and the operating systems of the various domains.
    • The hypervisor is responsible for:
    • Checking page tables
    • Allocating resources for new domains
    • Scheduling domains.
    • Booting the machine enough that it can start dom0.
    • It presents the domains with a virtual machine that looks similar, but not identical, to the native architecture.
    • Just as applications can interact with an OS by giving it syscalls, domains interact with the hypervisor by giving it hypercalls. The hypervisor responds by sending the domain an event, which fulfills the same function as an IRQ on real hardware.
    • A hypercall is to a hypervisor what a syscall is to a kernel.
  • Restricting operations with Privilege Rings
    • The hypervisor executes privileged instructions, so it must be in the right place:
    • x86 architecture provides 4 privilege levels / rings
    • Most OSs were created before this implementation, so only 2 levels are used
    • Xen provides 2 modes:
      • In x86 the applications are run at ring 3, the kernel at ring 1 and Xen at ring 0
      • In x86 with VT-x, the applications run at ring 3, the guest at ring non-root-0 and Xen at ring root-0 (-1)
    [Diagram: ring usage. Native x86: applications at ring 3, kernel at ring 0. Paravirtual x86: applications at ring 3, the guest kernel (dom0 and domU) moved to ring 1, the hypervisor at ring 0. HVM x86: applications at ring 3, the guest at ring 0, the hypervisor moved to ring -1]
  • Domain 0
    • Domain 0 is a Xen required Virtual Machine running a modified Linux kernel with special rights to:
    • Access physical I/O devices
      • Two drivers are included in Domain 0 to handle requests from Domain U PV or HVM guests
    • Interact with the other Virtual Machines (Domain U)
    • Provides the command line interface for Xen daemons
    • Due to its importance, the minimum functionality should be provided and properly secured
    • Some Domain 0 responsibilities can be delegated to Domain U (isolated driver domain)
    Domain 0 components:
    • Network backend driver (PV): communicates directly with the local networking hardware to process all virtual machine requests
    • Block backend driver (PV): communicates with the local storage disk to read and write data based on Domain U requests
    • Qemu-DM (HVM): supports HVM guests for networking and disk access requests
  • Domain Management and Control - Daemons
    • The Domain Management and Control is composed of Linux daemons and tools:
    • Xm
      • Command line tool that passes user input to Xend through XML-RPC
    • Xend
      • Python application that is considered the system manager for the Xen environment
    • Libxenctrl
      • A C library that allows Xend to talk with the Xen hypervisor via Domain 0 (privcmd driver delivers the request to the hypervisor)
    • Xenstored
      • Maintains a registry of information including memory and event channel links between Domain 0 and all other Domains
    • Qemu-dm
      • Supports HVM Guests for networking and disk access requests
  • Domain U – Paravirtualized guests
    • The Domain U PV Guest is a modified Linux, Solaris, FreeBSD or other UNIX system that is aware of virtualization (no direct access to hardware)
    • No rights to directly access hardware resources, unless specially granted
    • Access to hardware through front-end drivers using the split device driver model
    • Usually contains XenStore, console, network and block device drivers
    • There can be multiple Domain U guests in a Xen configuration
    [Diagram: Domain U PV drivers. The network front-end driver communicates with the network backend driver in Domain 0; the block front-end driver communicates with the block backend driver in Domain 0; console driver; XenStore driver (similar to a registry)]
  • Domain U – HVM guests
    • The Domain U HVM Guest is a native OS with no notion of virtualization (of sharing CPU time or of other VMs running)
    • An unmodified OS does not support the Xen split device driver model, so Xen emulates devices by borrowing code from QEMU
    • HVM guests begin in real mode and get configuration information from an emulated BIOS
    • For an HVM guest to use Xen features it must use CPUID and then access the hypercall page (a sketch follows below)
    [Diagram: Domain U HVM with the Xen virtual firmware, which simulates the BIOS for the unmodified operating system to read during startup]
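  • As a rough sketch of the CPUID / hypercall page step above: an HVM guest could detect Xen and install the hypercall page as shown below. The leaf and MSR conventions (signature at leaf 0x40000000, MSR index returned by leaf 0x40000002) are assumptions to check against the Xen headers for your version; the helper names are illustrative.
      #include <stdint.h>

      /* Illustrative CPUID/WRMSR wrappers for this sketch */
      static inline void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                               uint32_t *c, uint32_t *d)
      {
          asm volatile("cpuid" : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d) : "0"(leaf));
      }

      static inline void wrmsr(uint32_t msr, uint64_t val)
      {
          asm volatile("wrmsr" :: "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
      }

      /* Sketch: detect Xen via CPUID, then tell Xen where to place the
       * hypercall trampolines (hypercall_page_gpa: guest-physical address). */
      int init_xen_hvm_hypercalls(uint64_t hypercall_page_gpa)
      {
          uint32_t eax, ebx, ecx, edx;

          cpuid(0x40000000, &eax, &ebx, &ecx, &edx);
          if (ebx != 0x566e6558)            /* "XenV" of the "XenVMMXenVMM" signature */
              return -1;                    /* not running on Xen */

          cpuid(0x40000002, &eax, &ebx, &ecx, &edx);  /* ebx = hypercall page MSR index */
          wrmsr(ebx, hypercall_page_gpa);   /* Xen fills the page with hypercall stubs */
          return 0;
      }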
    • In an operating system with protected memory, each application has its own address space. A hypervisor has to do something similar for guest operating systems.
    • The triple indirection model is not strictly required, but it is more convenient in terms of performance and of the modifications needed in the guest kernel.
    • If the guest kernel needs to know anything about the machine pages, it has to use the translation table provided by the shared info page (rare)
    [Diagram: pseudo-physical to memory model. Applications use virtual addresses, the kernel uses pseudo-physical addresses, and the hypervisor manages machine addresses]
  • Pseudo-Physical to Memory Model
    • There are variables at various places in the code identified as MFN, PFN, GMFN and GPFN
    MFN (Machine frame number): the number of a page in the (real) machine's address space
    GPFN (Guest page frame number): a page frame in the guest's address space; these page addresses are relative to the local page tables
    GMFN (Guest machine frame number): refers to either an MFN or a GPFN, depending on the architecture
    PFN (Page frame number): means "some kind of page frame number"; the exact meaning depends on the context
    (A minimal translation sketch follows below.)
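    As a rough illustration of the translation a PV guest performs between these spaces, the sketch below assumes the guest kept the mfn_list array handed over in its start info page; p2m_table, p2m_entries and the helper names are made up for the example, not part of the Xen API.
      /* Illustrative pseudo-physical -> machine translation for a PV guest */
      static unsigned long *p2m_table;    /* indexed by guest PFN, yields an MFN */
      static unsigned long  p2m_entries;

      void init_p2m(start_info_t *si)
      {
          p2m_table   = (unsigned long *)si->mfn_list;  /* VIRTUAL address of page-frame list */
          p2m_entries = si->nr_pages;
      }

      /* Guest pseudo-physical frame number -> machine frame number */
      unsigned long pfn_to_mfn(unsigned long pfn)
      {
          return (pfn < p2m_entries) ? p2m_table[pfn] : ~0UL;   /* ~0UL: invalid */
      }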
  • Virtual Ethernet interfaces
    • Xen creates, by default, seven pairs of "connected virtual ethernet interfaces" for use by dom0
    • For each new domU, it creates a new pair of "connected virtual ethernet interfaces", with one end in domU and the other in dom0
    • Virtualized network interfaces in domains are given Ethernet MAC addresses (by default xend selects a random address)
    • The default Xen configuration uses bridging (xenbr0) within domain 0 to allow all domains to appear on the network as individual hosts
  • The Virtual Machine lifecycle: [Diagram: states OFF, RUNNING, PAUSED and SUSPENDED, with transitions Turn on, Start (paused), Turn off, Stop, Pause, Resume, Sleep, Wake and Migrate]
    • Xen provides 3 mechanisms to boot a VM:
    • Booting from scratch (Turn on)
    • Restoring the VM from a previously saved state (Wake)
    • Clone a running VM (only in XenServer)
  • A project: provide VMs for instantaneous/isolated execution
    • Goal: determine a mechanism for instantaneous execution of applications in sandboxed VMs
    • Approach:
    • Analyze current capabilities in Xen
    • Implement a prototype that addresses the specified goal: VM-Pool
    • Technical specification of HW and SW used:
    • Intel® Core™ Duo T2400 @ 1.83GHz 1828 MHz
    • Motherboard Properties
      • Motherboard ID: <DMI>
      • Motherboard Name: LENOVO 1952D89
    • 2048 MB RAM
    • Software:
      • Linux Fedora Core 8 Kernel 2.6.3.18
      • Xen 3.1
      • For the Windows images Windows XP SP2
  • Analyzing Xen spawning mechanisms
    • Booting from scratch
      HVM WinXP, varying the # of CPUs:
        # of CPUs | Time
        1         | 93.5 sec
        2         | 79 sec
      PV Fedora 8, varying the # of CPUs:
        # of CPUs | Time
        1         | 19.5 sec
        2         | 22 sec
    • Restoring from a saved state
      HVM WinXP, 4 GB disk / 1 CPU:
        VM RAM size | Image in hard disk | Image in RAM disk
        256 MB      | 16 sec             | 13 sec
        512 MB      | 21 sec             | 15 sec
      PV Fedora 8 (varying the VM RAM size):
        VM RAM size | HDD    | RAM disk
        256 MB      | 15 sec | 9 sec
        512 MB      | 23 sec | 16 sec
        1024 MB     | 37 sec | 29 sec
    • Cloning a running VM
      HVM WinXP, 4 GB disk / 1 CPU:
        Image size | Time
        2 GB       | 145 sec
        4 GB       | 220 sec
        8 GB       | 300 sec
  • Dynamic Spawning with a VM-Pool
    • To have a pool of virtual machines already booted and ready for execution, but in a “paused” state
    • These virtual machines keep their RAM but they don’t use processor time, interrupts and other resources
    • Simple interface defined:
    • get: retrieves and unpauses a virtual machine from the pool
    • release: gives back a virtual machine to the pool and restarts the VM
    • High level description:
  • Results with the VM-Pool
    • The VM is ready to run in less than half a second (~350 milliseconds)
    • Preferred spawning method is resuming, although it uses additional disk storage
    [Chart: VM-Pool initialization time in seconds, by VM booting mode (from scratch vs. resume)]
    Initialization time - from scratch: 265±21 seconds
    Initialization time - resume: 52±1 seconds
    Get operation: 341 milliseconds
    Release operation - from scratch: 110±21 seconds
    Release operation - resume: 30±2 seconds
  • Virtual Machines Scheduling
    • The hypervisor is responsible for ensuring that every running guest receives some CPU time.
    • Most used scheduling mechanisms in Xen:
    • Simple Earliest Deadline First – SEDF (being deprecated):
      • Each domain runs for an n ms slice every m ms (n and m are configured per-domain)
    • Credit Scheduler:
      • Each domain has a couple of properties: a cap and a weight
      • Weight: determines the share of the physical CPU time that the domain gets, weights are relative to each other
      • Cap: represents the maximum, it’s an absolute value
      • Work-conserving by default; if no other VM needs to use the CPU, the running one is given more time to execute
      • Uses a fixed-size 30ms quantum, and ticks every 10 ms
    • Xen provides a simple abstract interface to schedulers:
          struct scheduler {
              char *name;             /* full name for this scheduler */
              char *opt_name;         /* option name for this scheduler */
              unsigned int sched_id;  /* ID for this scheduler */
              void (*init)            (void);
              int  (*init_domain)     (struct domain *);
              void (*destroy_domain)  (struct domain *);
              int  (*init_vcpu)       (struct vcpu *);
              void (*destroy_vcpu)    (struct vcpu *);
              void (*sleep)           (struct vcpu *);
              void (*wake)            (struct vcpu *);
              struct task_slice (*do_schedule) (s_time_t);
              int  (*pick_cpu)        (struct vcpu *);
              int  (*adjust)          (struct domain *, struct xen_domctl_scheduler_op *);
              void (*dump_settings)   (void);
              void (*dump_cpu_state)  (int);
          };
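    • As a purely illustrative sketch (not an actual Xen scheduler), a new scheduler is plugged in by filling this table; assuming the usual Xen time macros, the wiring looks roughly like this:
      /* Hypothetical "trivial" scheduler: only the wiring is shown, the
       * selection logic is stubbed out. */
      static struct task_slice trivial_do_schedule(s_time_t now)
      {
          struct task_slice ret;

          ret.task = NULL;            /* a real scheduler picks the next runnable vcpu here */
          ret.time = MILLISECS(30);   /* let it run for a fixed 30 ms quantum */
          return ret;
      }

      struct scheduler sched_trivial_def = {
          .name        = "Trivial example scheduler",
          .opt_name    = "trivial",   /* would be selected via the sched= boot option */
          .sched_id    = 0xffff,      /* placeholder ID */
          .do_schedule = trivial_do_schedule,
          /* remaining hooks left NULL for brevity */
      };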
  • Xen Para-Virtual functionality
  • Paravirtualized architecture
    • We’ll review the PV mechanisms that support this architecture:
    • Kernel Initialization
    • Hypercalls creation
    • Event channels
    • XenStore (some kind of registry)
    • Memory transfers between VMs
    • Split device drivers
    [Diagram: split driver model. The paravirtual guest's frontend device driver talks through shared ring buffers to the backend device driver in Domain 0, whose real device driver accesses the hardware block devices]
  • Initial information for booting a PV OS
    • First things the OS needs to know when it boots:
      • Available RAM, connected peripherals, access to the machine clock.
    • An OS running on a PV Xen environment does not have access to real firmware
      • The information required is provided by the SHARED INFO PAGES.
    • The “domain builder” is in charge of mapping the shared info pages into the guest's address space prior to its boot.
      • Example: launching dom0 in a i386 architecture:
        • Refer to function construct_dom0 in xen/arch/x86/domain_build.c
    • The shared info pages do not completely replace a BIOS
      • The console device is available via the start info page for debugging purposes; debugging output from the kernel should be available as early as possible.
      • Other devices must be found using the XenStore
  • The start info page
    • The start info page is loaded in the guest’s address space at boot time. The way this page is transferred is architecture-dependent; x86 uses the ESI register.
    • The content of this page is defined by the C structure start_info which is declared in xen/include/public/xen.h
    • A portion of the fields in the start info page are always available to the guest domain and are updated every time the virtual machine is resumed, because some of them contain machine addresses (subject to change).
  • start_info structure overview
      struct start_info {
          /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
          char magic[32];             /* "xen-<version>-<platform>". */
          unsigned long nr_pages;     /* Total pages allocated to this domain. */
          unsigned long shared_info;  /* MACHINE address of shared info struct. */
          uint32_t flags;             /* SIF_xxx flags. */
          xen_pfn_t store_mfn;        /* MACHINE page number of shared page. */
          uint32_t store_evtchn;      /* Event channel for store communication. */
          union {
              struct {
                  xen_pfn_t mfn;      /* MACHINE page number of console page. */
                  uint32_t evtchn;    /* Event channel for console page. */
              } domU;
              struct {
                  uint32_t info_off;  /* Offset of console_info struct. */
                  uint32_t info_size; /* Size of console_info struct from start. */
              } dom0;
          } console;
          /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
          unsigned long pt_base;      /* VIRTUAL address of page directory. */
          unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
          unsigned long mfn_list;     /* VIRTUAL address of page-frame list. */
          unsigned long mod_start;    /* VIRTUAL address of pre-loaded module. */
          unsigned long mod_len;      /* Size (bytes) of pre-loaded module. */
          int8_t cmd_line[MAX_GUEST_CMDLINE];
      };
      typedef struct start_info start_info_t;
  • start_info fields
    • char magic[32]; /* "xen-<version>-<platform>" */
    • The magic number is the first thing the guest domain must check from its start info page.
      • If the magic string does not start with “xen-” something is seriously wrong and the best thing to do is abort.
      • Also, minor and major versions must be checked in order to determine if the guest kernel had been tested in this Xen version.
    • unsigned long nr_pages; /*Total pages allocated to this domain.*/
    • The amount of available RAM is determined by this field. It contains the number of memory pages available to the domain.
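    • A minimal sketch of these first checks, assuming strncmp is available to the guest kernel (a freestanding kernel supplies its own) and 4 KB pages; check_start_info is an illustrative helper, not a Xen API:
      #include <string.h>

      int check_start_info(start_info_t *si)
      {
          /* If the magic string does not start with "xen-", abort the boot */
          if (strncmp(si->magic, "xen-", 4) != 0)
              return -1;

          /* nr_pages tells us how much RAM we were given (assuming 4 KB pages) */
          unsigned long ram_bytes = si->nr_pages * 4096UL;

          return ram_bytes > 0 ? 0 : -1;
      }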
  • start_info fields (2)
    • unsigned long shared_info; /*MACHINE address of shared info struct.*/
    • Contains the address of the machine page where the shared info structure is. The guest kernel should map it to retrieve useful information for its initialization process.
    • uint32_t flags; /* SIF_xxx flags.*/
    • Contains any optional settings for this domain. (defined in xen.h)
      • SIF_PRIVILEGED, SIF_INITDOMAIN
    • xen_pfn_t store_mfn; /* MACHINE page number of shared page.*/
    • Machine address of the shared memory page used for communication with the XenStore.
    • uint32_t store_evtchn; /* Event channel for store communication.*/
    • Event channel used for notifications.
  • start_info fields (3)
      union {
          struct {
              xen_pfn_t mfn;      /* MACHINE page number of console page. */
              uint32_t evtchn;    /* Event channel for console page. */
          } domU;
          struct {
              uint32_t info_off;  /* Offset of console_info struct. */
              uint32_t info_size; /* Size of console_info struct from start. */
          } dom0;
      } console;
    • Domain 0 guests use the dom0 part, which contains the memory offset and size of the structure used to define the Xen console.
    • For unprivileged domains the domU part of the union is used. The fields in it represent a shared memory page and an event channel used to identify the console device.
  • The shared Info Page
    • The shared info contains information that is dynamically updated as the system runs.
    • It is explicitly mapped by the guest.
    • The content of this page is defined by the C structure shared_info which is declared in xen/include/public/xen.h
  • shared_info fields
    • struct vcpu_info_t vcpu_info[MAX_VIRT_CPUS]
    • This array contains one entry per virtual CPU assigned to the domain. Each array element is a vcpu_info_t structure containing CPU specific information:
      • Each virtual CPU has 3 flags relating to virtual interrupts (asynchronously delivered events).
        • uint8_t evtchn_upcall_pending : it is used by Xen to notify the running system that there are upcalls currently waiting for delivery on this virtual CPU.
        • uint8_t evtchn_upcall_mask : This is the mask for the previous field. This mask prevents any upcalls being delivered to the running virtual CPU.
        • unsigned long evtchn_pending_sel : Indicates which event is waiting. The event bitmap is an array of machine words, and this value indicates which word in the evtchn_pending field of the parent structure indicates the raised event.
      • arch
        • Architecture-specific information.
          • On x86, this includes the virtual CR2 register, which contains the linear address of the last page fault but can only be read from ring 0. It is automatically copied by the hypervisor's page fault handler before the event is raised with the guest domain.
      • time
        • This field, along with a number of fields sharing the wc_ (wall clock) prefix, is used to implement time keeping in paravirtualized Xen guests.
  • shared_info fields (2)
    • unsigned long evtchn_pending[sizeof(unsigned long) * 8];
    • This is a bitmap that indicates which event channels have events waiting (256 and 512 event channels on 32- and 64-bit systems, respectively)
      • Bits are set by the hypervisor and cleared by the guest.
    • unsigned long evtchn_mask[sizeof(unsigned long) * 8];
    • This bitmap determines whether an event on a particular channel should be delivered asynchronously
      • Every time an event is generated, the corresponding bit in evtchn_pending is set to 1. If the corresponding bit in evtchn_mask is set to 0, the hypervisor issues an upcall and delivers the event asynchronously. This allows the guest kernel to switch between interrupt-driven and polling mechanisms on a per-channel basis.
    • struct arch_shared_info arch;
    • On x86, the arch_shared_info structure contains two fields, max_pfn and pfn_to_mfn_frame_list_list, related to the pseudo-physical to machine memory mapping.
  • An exercise: The simplest Xen kernel
  • The simplest Xen kernel
    • Bootstrap
      • Each Xen guest kernel must start with a section __xen_guest for the bootloader, with key-value pairs
        • GUEST_OS: name of the running kernel
        • XEN_VER: specifies the Xen version for which the guest was implemented
        • VIRT_BASE: the address in the guest's address space where this allocation is mapped (0 for kernels)
        • ELF_PADDR_OFFSET: value subtracted from addresses in ELF headers (0 for kernels)
        • HYPERCALL_PAGE: specifies the page number where the hypercall trampolines will be set
        • LOADER: special boot loaders (currently only generic is available)
      • After mapping everything into memory at the right places, Xen passes control to the guest kernel
        • A trampoline is defined at _start
          • Clears the direction flag, sets up the stack and calls the kernel start passing the start info page address in the ESI register as a parameter
      • A guest kernel is expected to setup handlers to receive events at boot time, otherwise the kernel is not able to respond to the outside world (it is ignored in the book’s example)
    • Kernel.c
      • The start_kernel routine takes the start info page as the parameter (passed through the ESI)
      • The stack is reserved in this file, although it was referenced in bootstrap as well for creating the trampoline routine
      • If the hypervisor was compiled with debugging, then HYPERVISOR_console_io will send the string to the console (otherwise the hypercall fails)
    • Debug.h
      • The hypercall takes three arguments: the command (write), the length of the string and the string pointer
      • The hypercall # is 18 (xen/include/public/xen.h)
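    • For reference, a hedged sketch of what the kernel.c described above boils down to; the prototype and CONSOLEIO_write value are taken from the public Xen headers, while the surrounding code is illustrative:
      /* The simplest guest kernel: print via the console I/O hypercall and spin. */
      extern int HYPERVISOR_console_io(int cmd, int count, char *str);
      #define CONSOLEIO_write 0   /* from xen/include/public/xen.h */

      void start_kernel(void *start_info_page)
      {
          char msg[] = "Hello from the simplest Xen guest kernel\n";

          /* Only produces output if the hypervisor was compiled with debugging */
          HYPERVISOR_console_io(CONSOLEIO_write, sizeof(msg) - 1, msg);

          /* No event handlers installed, so there is nothing else to do */
          for (;;)
              ;
      }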
  • Hypercalls
  • Executing privileged instructions from applications: because guest kernels don't run at ring 0, they are not allowed to execute privileged instructions, so a mechanism is needed to execute them in the right ring. Suppose exit(0):
      push dword 0
      mov eax, 1
      push eax
      int 80h
    [Diagram: native vs. paravirtualized. Rings 0-3 with Application, Kernel and Hypervisor; paths labelled System Call, Hypercall and Direct System Call (Xen specific); the hypervisor holds the interrupts table]
  • Replacing Privileged instructions with Hypercalls
    • Unmodified guests use privileged instructions which require transition to ring 0, causing performance penalty if resolved by the hypervisor
    • Paravirtual guests replace their privileged instructions with hypercalls
    • Xen uses 2 mechanisms for hypercalls:
    • An int 82h is used as the channel similar to system calls (deprecated after Xen 3.0)
    • Issued indirectly using the hypercall page provided when the guest is started
    • For the second mechanism, macros are provided to write hypercalls
      #define _hypercall2(type, name, a1, a2)                               \
      ({                                                                    \
          long __res, __ign1, __ign2;                                       \
          asm volatile (                                                    \
              "call hypercall_page + ("STR(__HYPERVISOR_##name)" * 32)"     \
              : "=a" (__res), "=b" (__ign1), "=c" (__ign2)                  \
              : "1" ((long)(a1)), "2" ((long)(a2))                          \
              : "memory" );                                                 \
          (type)__res;                                                      \
      })
    • A PV Xen guest uses the HYPERVISOR_sched_op function with SCHEDOP_yield argument instead of using the privileged instruction HLT, in order to relinquish CPU time to guests with running tasks
      static inline int HYPERVISOR_sched_op(int cmd, void *arg)
      {
          return _hypercall2(int, sched_op, cmd, arg);
      }
    • extras/mini-os/include/x86/x86_32/hypercall-x86_32.h, implemented at xen/common/schedule.c
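    • A small usage sketch: a PV guest's idle loop might yield instead of executing HLT (guest_idle is an illustrative name):
      /* Yield the CPU back to Xen instead of executing the privileged HLT
       * instruction; SCHEDOP_yield ignores its argument. */
      static void guest_idle(void)
      {
          for (;;)
              HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
      }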
  • Event Channels
  • Event Channels
    • Event channels are the basic primitive provided by Xen for event notifications, equivalent of a hardware interrupt valid for paravirtualized OSs
    • Events are one bit of information signaled by transitioning from 0 to 1
    • Physical IRQs: mapped from real IRQs used to communicate with hardware devices
    • Virtual IRQs: similar to PIRQs, but related to virtual devices such as the timer, debug console
    • Interdomain events: bidirectional interrupts that notify domains about certain event
    • Intradomain events: special case of interdomain events
    [Diagram: the event channel driver in the hypervisor connects Domain 0 (Domain Management and Control) and Domain U paravirtual guests]
  • Event Channel Interface
    • Guests configure the event channel and send interrupts by issuing a specific hypercall: HYPERVISOR_event_channel_op(...)
    • Guests are notified about pending events through callbacks installed during initialization; these events can be masked dynamically: HYPERVISOR_set_callbacks(…)
    [Diagram: Domain 0 and a Domain U paravirtual guest use HYPERVISOR_event_channel_op towards the hypervisor's event channel driver; notifications come back to the guests via callbacks]
  • HYPERVISOR_event_channel_op – 1/2
    • HYPERVISOR_event_channel_op(int cmd, void *arg) /* defined at xen-3.1.0-src/linux-2.6-xen-sparse/include/asm-i386/mach-xen/asm/hypercall.h */
    • EVTCHNOP_alloc_unbound: Allocate a new event channel port, ready to be connected to by a remote domain
      • Specified domain must exist
      • A free port must exist in that domain
    • EVTCHNOP_bind_interdomain: Bind an event channel for interdomain communications
      • Caller domain must have a free port to bind.
      • Remote domain must exist.
      • Remote port must be allocated and currently unbound.
      • Remote port must be expecting the caller domain as the remote.
    • EVTCHNOP_bind_virq: Allocate a port and bind a VIRQ to it
      • Caller domain must have a free port to bind.
      • VIRQ must be valid.
      • VCPU must exist.
      • VIRQ must not currently be bound to an event channel
    • EVTCHNOP_bind_ipi: Allocate and bind a port for notifying other virtual CPUs.
      • Caller domain must have a free port to bind.
      • VCPU must exist.
    • EVTCHNOP_bind_pirq: Allocate and bind a port to a real IRQ.
      • Caller domain must have a free port to bind.
      • PIRQ must be within the valid range.
      • Another binding for this PIRQ must not exist for this domain.
  • HYPERVISOR_event_channel_op – 2/2
    • HYPERVISOR_event_channel_op(int cmd, void *arg) /* defined at xen-3.1.0-src/linux-2.6-xen-sparse/include/asm-i386/mach-xen/asm/hypercall.h */
    • EVTCHNOP_close: Close an event channel (no more events will be received).
      • Port must be valid (currently allocated).
    • EVTCHNOP_send: Send a notification on an event channel attached to a port.
      • Port must be valid.
    • EVTCHNOP_status: Query the status of a port; what kind of port, whether it is bound, what remote domain is expected, what PIRQ or VIRQ it is bound to, what VCPU will be notified, etc.
      • Unprivileged domains may only query the state of their own ports.
      • Privileged domains may query any port.
  • Issuing event channel hypercalls
    • Structures defined at xen-3.1.0-src/xen/include/public/event_channel.h
    • Hypervisor handlers defined at xen-3.1.0-src/xen/common/event_channel.c
    • Allocating an unbound event channel
      evtchn_alloc_unbound_t op;
      op.dom = DOMID_SELF;
      op.remote_dom = remote_domain;  /* an integer representing the domain */
      if(HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound, &op) != 0)
      {
          /* Error */
      }
    • Binding an event channel for interdomain communication
      evtchn_bind_interdomain_t op;
      op.remote_dom = remote_domain;
      op.remote_port = remote_port;
      if(HYPERVISOR_event_channel_op(EVTCHNOP_bind_interdomain, &op) != 0)
      {
          /* Error */
      }
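    • Sending a notification on a bound channel (illustrative sketch; local_port is assumed to hold the port obtained from the bind call)
      evtchn_send_t op;
      op.port = local_port;
      if(HYPERVISOR_event_channel_op(EVTCHNOP_send, &op) != 0)
      {
          /* Error */
      }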
  • HYPERVISOR_set_callbacks
    • Hypercall to configure the notification handlers
      HYPERVISOR_set_callbacks(
          unsigned long event_selector, unsigned long event_address,
          unsigned long failsafe_selector, unsigned long failsafe_address)
      /* defined at xen-3.1.0-src/linux-2.6-xen-sparse/include/asm-i386/mach-xen/asm/hypercall.h */
    • event_selector + event_address: make up the callback address for notifications
    • failsafe_selector + failsafe_address: make up the callback invoked if anything goes wrong with the event
    • Notifications can be masked at the VCPU level or at the event level through flags contained in the shared info page:
      struct shared_info { ...
          struct vcpu_info vcpu_info[MAX_VIRT_CPUS] { ...
              uint8_t evtchn_upcall_mask; ... };
          unsigned long evtchn_mask[sizeof(unsigned long) * 8];
          ... };
  • Setting the notifications handler: handler and masks configuration
      /* Locations in the bootstrapping code */
      extern volatile shared_info_t shared_info;
      void hypervisor_callback(void);
      void failsafe_callback(void);

      static evtchn_handler_t handlers[NUM_CHANNELS];

      void EVT_IGN(evtchn_port_t port, struct pt_regs *regs) {};

      /* Initialise the event handlers */
      void init_events(void)
      {
          /* Set the event delivery callbacks */
          HYPERVISOR_set_callbacks(
              FLAT_KERNEL_CS, (unsigned long)hypervisor_callback,
              FLAT_KERNEL_CS, (unsigned long)failsafe_callback);
          /* Set all handlers to ignore, and mask them */
          for(unsigned int i = 0; i < NUM_CHANNELS; i++)
          {
              handlers[i] = EVT_IGN;
              SET_BIT(i, shared_info.evtchn_mask[0]);
          }
          /* Allow upcalls. */
          shared_info.vcpu_info[0].evtchn_upcall_mask = 0;
      }
  • Implementing the callback function (maps a bit to an index in the callback table)
      /* Dispatch events to the correct handlers */
      void do_hypervisor_callback(struct pt_regs *regs)
      {
          unsigned int pending_selector, next_event_offset;
          vcpu_info_t *vcpu = &shared_info.vcpu_info[0];
          /* Make sure we don't lose the edge on new events... */
          vcpu->evtchn_upcall_pending = 0;
          /* Set the pending selector to 0 and get the old value atomically */
          pending_selector = xchg(&vcpu->evtchn_pending_sel, 0);
          while(pending_selector != 0)
          {
              /* Get the first bit of the selector and clear it */
              next_event_offset = first_bit(pending_selector);
              pending_selector &= ~(1 << next_event_offset);
              unsigned int event;
              /* While there are events pending on unmasked channels */
              while((event = (shared_info.evtchn_pending[next_event_offset]
                              & ~shared_info.evtchn_mask[next_event_offset])) != 0)
              {
                  /* Find the first waiting event */
                  unsigned int event_offset = first_bit(event);
                  /* Combine the two offsets to get the port */
                  evtchn_port_t port = (next_event_offset << 5) + event_offset;  /* 5 -> 32 bits */
                  /* Handle the event */
                  handlers[port](port, regs);
                  /* Clear the pending flag */
                  CLEAR_BIT(shared_info.evtchn_pending[next_event_offset], event_offset);
              }
          }
      }
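  • A small illustrative helper (not from the listing above): install a handler for a port and unmask its channel so the dispatcher starts delivering it. The names are the ones introduced in the bootstrapping code; the bit manipulation assumes 32-bit words.
      void register_event_handler(evtchn_port_t port, evtchn_handler_t handler)
      {
          handlers[port] = handler;   /* dispatch table filled by init_events() */
          /* Unmask the channel: clear its bit in evtchn_mask */
          shared_info.evtchn_mask[port / 32] &= ~(1UL << (port % 32));
      }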
  • XenStore
  • Xen Store
    • XenStore is a hierarchical namespace (similar to sysfs or Open Firmware) which is shared between domains
    • The interdomain communication primitives exposed by Xen are very low-level (virtual IRQ and shared memory)
    • XenStore is implemented on top of these primitives and provides some higher level operations (read a key, write a key, enumerate a directory, notify when a key changes value)
    • General Format
    • There are three main paths in XenStore:
    • /vm - stores configuration information about domain
    • /local/domain - stores information about the domain on the local node (domid, etc.)
    • /tool - stores information for the various tools
    • Detailed information at http://wiki.xensource.com/xenwiki/XenStoreReference
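    • From userspace in Domain 0, XenStore is typically reached through the libxenstore C API (tools/xenstore/xs.h); the sketch below is a hedged example, with 0 passed as the "no transaction" handle and an example key path, and with exact constants and link flags varying across Xen versions:
      #include <stdio.h>
      #include <stdlib.h>
      #include <xs.h>

      int main(void)
      {
          struct xs_handle *xsh = xs_daemon_open();
          if (!xsh)
              return 1;

          unsigned int len;
          char *name = xs_read(xsh, 0, "/local/domain/0/name", &len);  /* read a key */
          if (name) {
              printf("dom0 name: %s\n", name);
              free(name);   /* xs_read returns malloc'd memory */
          }

          xs_daemon_close(xsh);
          return 0;
      }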
  • Ring buffers for split driver model
    • The ring buffer is a fairly standard lockless data structure for producer-consumer communications
    • Xen uses free-running counters
    • Each ring contains two kinds of data, a request and a response, updated by the two halves of the driver
    • Xen only allows responses to be written in a way that overwrites requests
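    • A hedged sketch of the frontend half of such a ring, using the macros from the public io/ring.h header (the header path differs between the Xen and Linux trees); the request/response types and the function are illustrative:
      #include <xen/io/ring.h>   /* xen/include/public/io/ring.h in the Xen tree */

      typedef struct { int op; unsigned long data; } my_request_t;
      typedef struct { int status; } my_response_t;

      /* Generates struct my_sring, my_front_ring and my_back_ring */
      DEFINE_RING_TYPES(my, my_request_t, my_response_t);

      void frontend_send(struct my_front_ring *front, int op, unsigned long data)
      {
          /* Fill a request at the private producer index... */
          my_request_t *req = RING_GET_REQUEST(front, front->req_prod_pvt);
          req->op   = op;
          req->data = data;
          front->req_prod_pvt++;

          /* ...then publish it; notify says whether the backend must be signalled */
          int notify;
          RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(front, notify);
          if (notify) {
              /* send an event on the channel shared with the backend */
          }
      }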
  • Xen Split Device Driver Model (for PV guests)
    • Xen delegates hardware support typically to Domain 0, and device drivers typically consist of four main components:
    • The real driver
    • The back end split driver
    • A shared ring buffer (shared memory pages and events notification)
    • The front end split driver
    [Diagram: split driver model. The paravirtual guest's frontend device driver talks through shared ring buffers to the backend device driver in Domain 0, whose real device driver accesses the hardware block devices]
  • Xen HVM functionality
  • Xen HVM
    • Hardware Virtual Machines allow unmodified Operating Systems to run on Virtual Environments
    • This approach brings 2 kinds of problems:
    • For the unmodified OS, the VM must appear as a real PC
    • Hardware access
      • To keep isolation, device emulation must be provided from Domain 0
      • Provide direct assignment from a VM to a specific HW
    [Diagram: Domain 0 runs Qemu-dm and Domain U HVM runs the Xen virtual firmware. Every HVM guest has a qemu-dm counterpart, which handles networking and disk access for the HVM and is based on the QEMU project]
    • Virtual BIOS to provide standard start-up
    • Composed of 3 payloads
    • Vmxassist: real mode emulator for VMX
    • VGA BIOS
    • ROM BIOS
  • Xen QEMU-dm / Virtual firmware interaction
    • Xen Virtual firmware works as the front end driver in the split driver model
    • Guest issues a BIOS interrupt requesting data to be loaded from disk
    • The virtual BIOS translates the call into a request to the block device
    • The vBIOS interrupt is caught by QEMU-dm
    • QEMU-dm emulates the hardware and translates that to the real hardware in Domain 0
    • The inverse process happens for the response
    [Diagram: Domain 0 (Qemu-dm) and Domain U HVM (Xen virtual firmware)]
  • HVM domain creation
    • Once the domain builder is specified as “hvm”:
    • Allocates and verifies memory for domain
    • Loads the hvmloader as a kernel (setup_guest at xc_hvm_build.c)
    • Initializes hypercalls table and verifies that Xen is active
    • Copies the BIOS image, built from Bochs (tools/firmware/rombios), to 0x000F0000
    • Discovers and sets up PCI devices
    • Loads a VGA BIOS
    • For Intel platforms, loads real-mode emulator for VMX (tools/firmware/vmxassist)
  • HVM support in Xen
    • Support for hardware virtualization is done through an abstract interface defined at xen/include/asm-x86/hvm
      struct hvm_function_table {
          char *name;
          void (*disable)(void);
          int  (*vcpu_initialise)(struct vcpu *v);
          void (*vcpu_destroy)(struct vcpu *v);
          void (*store_cpu_guest_regs)(struct vcpu *v, struct cpu_user_regs *r, unsigned long *crs);
          void (*load_cpu_guest_regs)(struct vcpu *v, struct cpu_user_regs *r);
          void (*save_cpu_ctxt)(struct vcpu *v, struct hvm_hw_cpu *ctxt);
          int  (*load_cpu_ctxt)(struct vcpu *v, struct hvm_hw_cpu *ctxt);
          int  (*paging_enabled)(struct vcpu *v);
          int  (*long_mode_enabled)(struct vcpu *v);
          int  (*pae_enabled)(struct vcpu *v);
          int  (*interrupts_enabled)(struct vcpu *v);
          int  (*guest_x86_mode)(struct vcpu *v);
          unsigned long (*get_guest_ctrl_reg)(struct vcpu *v, unsigned int num);
          unsigned long (*get_segment_base)(struct vcpu *v, enum x86_segment seg);
          void (*get_segment_register)(struct vcpu *v, enum x86_segment seg, struct segment_register *reg);
          void (*update_host_cr3)(struct vcpu *v);
          void (*update_guest_cr3)(struct vcpu *v);
          void (*update_vtpr)(struct vcpu *v, unsigned long value);
          void (*stts)(struct vcpu *v);
          void (*set_tsc_offset)(struct vcpu *v, u64 offset);
          void (*inject_exception)(unsigned int trapnr, int errcode, unsigned long cr2);
          void (*init_ap_context)(struct vcpu_guest_context *ctxt, int vcpuid, int trampoline_vector);
          void (*init_hypercall_page)(struct domain *d, void *hypercall_page);
          int  (*event_injection_faulted)(struct vcpu *v);
      };
  • Intel VT support in Xen
    • The hvm_function_table is initialized at xen/arch/x86/hvm/vmx/vmx.c
    • The following routines completely save and load the state of a CPU through the VMCS:
      .store_cpu_guest_regs = vmx_store_cpu_guest_regs
      .load_cpu_guest_regs  = vmx_load_cpu_guest_regs
    • This state copy is performed in a single instruction
      struct vmcs_struct {
          u32 vmcs_revision_id;
          unsigned char data[0];   /* vmcs size is read from MSR */
      };
  • KVM overview
  • What is KVM?
    • It’s a VMM built within the Linux kernel
      • The name stands for Kernel-based Virtual Machine
      • It is included in mainline Linux, as of 2.6.20
    • It offers full-virtualization
      • Para-virtualization support is in alpha state
    • It works *only* on platforms with hardware-assisted virtualization
      • Currently only Intel-VT and AMD-V
      • Recently also s390, PowerPC and IA64
    • Decision taken to achieve a simple design
      • No need to deal with ring aliasing problem,
      • Nor excessive faulting avoidance
      • Nor guest memory management complexity
      • Etc
  • Why KVM?
    • Today’s hardware is becoming increasingly complex
      • Multiple HW threads on a core
      • Multiple cores on a socket
      • Multiple sockets on a system
      • NUMA memory models (on-chip memory controllers)
    • Scheduling and memory management is becoming harder accordingly
    • Great effort is required to program all this complexity in hypervisors
      • But an operating system kernel already handles this complexity
      • So why not reuse it?
    • KVM makes use of all the fine-tuning work that has gone (and is going) into the Linux kernel, applying it to a virtualized environment
    • Minimal footprint
      • Less than 10K lines of kernel code
      • Implemented as a Linux kernel module
  • How does it work?
    • A normal Linux process has two modes of execution: kernel and user
      • KVM adds a third mode: guest mode
    • A virtual machine in KVM will be “seen” as a normal Linux process
      • A portion of code will run in user mode: performs I/O on behalf of the guest
      • A portion of code will run in guest mode: performs non-I/O guest code
    [Diagram: guest mode alongside kernel and user mode; guest mode has its own 4 rings]
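    • A hedged sketch of this model from userspace: creating a VM and a VCPU is just a handful of ioctls on /dev/kvm issued by an ordinary process (error handling mostly omitted; the KVM_RUN loop that switches between guest and user mode is left out):
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <linux/kvm.h>

      int main(void)
      {
          int kvm = open("/dev/kvm", O_RDWR);
          if (kvm < 0 || ioctl(kvm, KVM_GET_API_VERSION, 0) != KVM_API_VERSION)
              return 1;

          int vm   = ioctl(kvm, KVM_CREATE_VM, 0);    /* the VM lives inside this process */
          int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);   /* each VCPU is driven by a thread */

          /* Guest memory setup, register state and the KVM_RUN loop would follow */
          printf("vm fd = %d, vcpu fd = %d\n", vm, vcpu);
          return 0;
      }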
  • Key features
    • Simpler design: Kernel+Userspace (vs. Hypervisor+Kernel+Userspace)
      • Avoids many context switches
      • Code reuse (today and tomorrow)
      • Easy management of VMs (standard process tools)
    • Supports Qcow2 and Vmdk disk image formats
      • “Growable” formats (copy-on-write)
      • Saved state of a VM with X MB of RAM takes less than X MB of file space
        • KVM skips RAM sectors mapped by itself
        • KVM uses the on-the-fly compression capability of Qcow2 and VMDK formats
        • E.g. the saved state of a Windows VM with 384 MB of RAM occupies ~40 MB
      • Discard-on-write capability (reads are made from base image A, writes go to a new image B)
        • B will contain the differences from A performed by the VM
        • Later, B's diffs can be merged into A
    • Advanced guest memory management
      • Increased VM density with KSM (under development) [3]
        • KSM is a kernel module to save memory by searching and merging identical pages inside one or more memory areas
      • Balloon driver as in Xen
      • Guest’s page swapping allowed
  • Future trends
    • Para-virtualization support (Windows & Linux)
      • virtio devices already included in Linux’s mainline as of 2.6.25
    • Storage [4]
      • Many similar guests cause a lot of duplicate storage
      • Current solution : baseline + delta images
        • Delta degrades over time (needs planning)
        • Disk-in-file adds overhead
      • Future :
        • Block-level deduplication
          • Filesystem or block device looks for identical blocks ... and consolidates them
          • Btrfs being analyzed right now (has snapshots & reverse mappings)
        • Hostfs + file-based deduplication
          • No more virtual block device. Guest filesystem is a host directory
          • Host can carry out file dedup in the background
          • Requires changes in guest
        • Para-virtualized file systems (9P from IBM Research) [2]
          • Easy way to maintain consistency between two guests sharing a block device R/W
          • Provide a direct file system proxy mechanism built on top of the native host<->guest I/O transport, avoiding unnecessary network stack overhead
  • Future trends (2)
    • Containers & Isolation (reduce the impact of one guest on others)
      • Memory containers
        • Account each page to its container
        • Allows preferentially swapping some guests
      • I/O accounting (since I/O affects other guests)
        • Each I/O in flight is correctly accounted to initiating task
        • Important for I/O scheduling
    • Device passthrough methods
      • Several competing options
        • 1:1 mapping with Intel VT-d
        • Virtualization-capable devices with PCI SIG Single Root IOV
        • PVDMA
        • Userspace IRQ delivery
      • Still to see which will become mainline
    • VMs-AS-FILES
      • Cross-hypervisor virtualization containers to allow for transportability of VMs
      • OVF : Open Virtual Appliance Format [5]
    • Cross platform guest support (QuickTransit technology [6] )
      • E.g. Solaris for SPARC running on an Intel platform
  • VMware overview
  • VMware
    • In 1998, VMware created a solution to virtualize the x86 platform, creating the market for x86 virtualization
    • The solution was a combination of binary translation and direct execution on the processor
      • Non-virtualizable instructions are replaced with new sequences of instructions
      • User-level code is directly executed on the processor
    • Each VMM provides its VM with all the services of the physical system, including a virtual BIOS, virtual devices and virtualized memory management
  • VMware ESX architecture
    • Datacenter-class virtualization platform used by many enterprise customers for server consolidation
    • Runs directly on a physical server, with direct access to the physical hardware of the server
    • Virtualization layer (VMM/VMKernel): implements the idealized hardware environment and virtualizes the physical hardware devices
    • Resource Manager: partitions and controls the physical resources of the underlying machine
    • Hardware interface components: enable hardware-specific service delivery
    • Service Console: boots the system, initiates execution of the virtualization layer and resource manager, and relinquishes control to those layers
    • Add
      • Virtual Center / Lab manager
  • VMware default deployment [Diagram: deployment components, with captions:]
    • Primary method of interaction with the virtual infrastructure (console and GUI)
    • Virtualization layer that abstracts the processor, memory, storage, and networking resources of the physical host into multiple virtual machines
    • Centrally manages the VMware ESX Server hosts
    • Organizes all the configuration data for the virtual infrastructure environment
    • Authorizes VirtualCenter Servers and ESX Server hosts appropriately for the licensing agreement
    • VI Client connects from the VirtualCenter Server or ESX Server hosts
  • VMware for free
    • VMware provides freeware Server and Workstation virtualization solutions
    • VMware Server:
      • Is a free desktop application that lets you run virtual machines on your Windows or Linux PC
      • Lets you use host machine devices, such as CD and DVD drives, from the virtual machine
      • Datasheet or FAQ page is available
      • Different Virtual Appliances are provided for free
    • VMware Player :
      • Similar to VMware Server but limited to run pre-built virtual appliances
  • OpenVZ overview Operating System virtualization
  • OpenVZ
    • OpenVZ is an open source server virtualization solution that creates multiple isolated Virtual Private Servers (VPSs) or Virtual Environments (VEs) on a single physical server
    • A VPS performs and executes exactly like a stand-alone server for its users and applications, and can be rebooted independently
    • All VPSs have their own set of processes and can run different Linux distributions, but all VPSs operate under the same kernel
    • OpenVZ is the basis of Parallels/Virtuozzo Containers
    • Distinctive features:
      • Operating System Virtualization
      • Network Virtualization
      • Resource Management
      • Templates
    • Installation: http://wiki.openvz.org/Quick_installation
    • User documentation: http://download.openvz.org/doc/OpenVZ-Users-Guide.pdf
  • OpenVZ Kernel
    • The OpenVZ kernel is a modified Linux kernel which adds the following functionality:
    • Virtualization and isolation: enables many virtual environments within a single kernel
    • Resource management: subsystem limits (and in some cases guarantees) resources such as CPU, RAM, and disk space on a per-VE basis
    • Live Migration/Checkpointing: a process of “freezing” a VE, saving its complete state to a disk file, with the ability to “unfreeze” that state later
  • OpenVZ Kernel Virtualization and Isolation
    • Each Virtual Environment has its own set of virtualized/isolated resources, such as:
    • Files
      • System libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
    • Users and groups
      • Each VE has its own root user, as well as other users and groups.
    • Process tree
      • A VE sees only its own set of processes, starting from init. PIDs are virtualized, so that the init PID is 1 as it should be.
    • Network
      • Virtual network device, which allows the VE to have its own IP addresses, as well as a set of netfilter (iptables) and routing rules.
    • Devices
      • Devices are virtualized. In addition, any VE can be granted exclusive access to real devices like network interfaces, serial ports, disk partitions, etc.
    • IPC objects
      • Shared memory, semaphores, and messages.
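    • A minimal C sketch of the PID-virtualization idea. It does not use OpenVZ code: it relies on the mainline Linux PID-namespace flag CLONE_NEWPID (kernel 2.6.24 and later, run as root), which implements the same concept; inside the new namespace the first process sees itself as PID 1:

      /* sketch only: mainline PID namespaces, not the OpenVZ kernel patch */
      #define _GNU_SOURCE
      #include <sched.h>
      #include <signal.h>
      #include <stdio.h>
      #include <sys/wait.h>
      #include <unistd.h>

      static char child_stack[64 * 1024];

      /* first process inside the new PID namespace */
      static int child(void *arg)
      {
          (void)arg;
          printf("inside the namespace: getpid() = %d\n", (int)getpid()); /* prints 1 */
          return 0;
      }

      int main(void)
      {
          /* CLONE_NEWPID places the child in a fresh PID namespace */
          pid_t pid = clone(child, child_stack + sizeof child_stack,
                            CLONE_NEWPID | SIGCHLD, NULL);
          if (pid == -1) {
              perror("clone (root privileges required)");
              return 1;
          }
          printf("seen from the parent: the child is PID %d\n", (int)pid);
          waitpid(pid, NULL, 0);
          return 0;
      }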
  • OVZ Resource Management
    • The resource management subsystem consists of three components:
    • Two-level disk quota:
      • 1st level: the server administrator can set up per-VE disk quotas in terms of disk space and number of inodes
      • 2nd level: the VE administrator (VE root) uses standard UNIX quota tools to set up per-user and per-group disk quotas
    • “Fair” two-level CPU scheduler:
      • 1st level: decides which VE to give the time slice to, taking into account the VE’s CPU priority and limit settings
      • 2nd level: the standard Linux scheduler decides which process in the VE to give the time slice to, using standard process priorities
    • User Beancounters:
      • A set of per-VE counters, limits, and guarantees
      • About 20 parameters, carefully chosen to cover all aspects of VE operation, so that no single VE can abuse any resource that is limited for the whole computer and thereby harm other VEs
      • The resources accounted and controlled are mainly memory and various in-kernel objects such as IPC shared memory segments, network buffers, etc. (a small parsing sketch follows below)
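    • A small C sketch that reads /proc/user_beancounters and reports resources whose fail counter is non-zero, i.e. where a VE has run into one of its limits. It assumes the usual column layout (resource, held, maxheld, barrier, limit, failcnt, with the first line of each VE block prefixed by "uid:"):

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          FILE *f = fopen("/proc/user_beancounters", "r");
          char line[512];

          if (!f) {
              perror("/proc/user_beancounters");
              return 1;
          }
          while (fgets(line, sizeof line, f)) {
              char resource[64];
              unsigned long long held, maxheld, barrier, limit, failcnt;
              char *p = strchr(line, ':');      /* skip the "uid:" prefix if present */
              p = p ? p + 1 : line;
              if (sscanf(p, "%63s %llu %llu %llu %llu %llu",
                         resource, &held, &maxheld, &barrier, &limit, &failcnt) == 6
                  && failcnt != 0)
                  printf("%-16s failcnt=%llu (held=%llu, limit=%llu)\n",
                         resource, failcnt, held, limit);
          }
          fclose(f);
          return 0;
      }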
  • OpenVZ Checkpointing and live migration
    • Allows the “live” migration of a VE to another physical server
    • A “frozen” VE and its complete state are saved to a disk file, then transferred to another machine
    • The VE can then be “unfrozen” (restored) there; the whole process takes a few seconds, and from the client’s point of view it looks not like downtime but rather a delay in processing, since established network connections are also migrated
    • [Diagram: a VE is checkpointed to disk on one OpenVZ host and live-migrated to a second host]
  • Backup
  • Xen Terminology – 1/2
    • Basics
      • guest operating system: an operating system that can run within the Xen environment
      • hypervisor: code running at a higher privilege level than the supervisor code of its guest operating systems
      • virtual machine monitor ("vmm"): in this context, the hypervisor
      • domain: a running virtual machine within which a guest OS executes
      • domain0 ("dom0"): the first domain, automatically started at boot time. Dom0 has permission to control all hardware on the system, and is used to manage the hypervisor and the other domains
      • unprivileged domain ("domU"): a domain with no special hardware access
    • Approaches to Virtualization
      • full virtualization: an approach to virtualization which requires no modifications to the hosted operating system, providing the illusion of a complete system of real hardware devices
      • paravirtualization: an approach to virtualization which requires modifications to the operating system in order to run in a virtual machine. Xen uses paravirtualization but preserves binary compatibility for user space applications
    • Address Spaces (see the sketch after this glossary)
      • MFN (machine frame number): real host machine address; the addresses the processor understands
      • GPFN (guest pseudo-physical frame number): guests run in an illusory contiguous physical address space, which is probably not contiguous in the machine address space
      • GMFN (guest machine frame number): equivalent to GPFN for an auto-translated guest, and equivalent to MFN for normal paravirtualised guests. It represents what the guest thinks are MFNs
      • PFN (physical frame number): a catch-all for any kind of frame number. "Physical" here can mean guest-physical, machine-physical or guest-machine-physical
    • Page Tables
      • SPT (shadow page table): shadow version of a guest OS's page table. Useful for numerous things, for instance tracking dirty pages during live migration
      • PAE: Intel's Physical Addressing Extensions, which enable x86/32 machines to address up to 64 GB of physical memory
      • PSE (page size extension): used as a flag to indicate that a given page is a huge/super page (2/4 MB instead of 4 KB)
    • x86 Architecture
      • HVM: Hardware Virtual Machine, the full-virtualization mode supported by Xen. This mode requires hardware support, e.g. Intel's Virtualization Technology (VT) and AMD's Pacifica technology
      • VT-x: full-virtualization support on Intel's x86 VT-enabled processors
      • VT-i: full-virtualization support on Intel's IA-64 VT-enabled processors
    • Extracted from: http://wiki.xensource.com/xenwiki/XenTerminology
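    • A conceptual C sketch of the GPFN/MFN relationship: a paravirtualised guest sees a contiguous range of pseudo-physical frames, and a per-domain physical-to-machine (p2m) table maps each GPFN to the machine frame that actually backs it. The table contents below are invented for illustration; the real table is built by the domain builder and kept in sync with the hypervisor:

      #include <stdio.h>

      #define NR_GUEST_PAGES 4

      /* illustrative p2m table: which MFN backs each GPFN (values made up) */
      static const unsigned long p2m[NR_GUEST_PAGES] = {
          0x9a20, 0x14f3, 0x14f4, 0x77b1
      };

      static unsigned long gpfn_to_mfn(unsigned long gpfn)
      {
          return p2m[gpfn];
      }

      int main(void)
      {
          /* contiguous from the guest's point of view, scattered on the machine */
          for (unsigned long gpfn = 0; gpfn < NR_GUEST_PAGES; gpfn++)
              printf("GPFN %lu -> MFN 0x%lx\n", gpfn, gpfn_to_mfn(gpfn));
          return 0;
      }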
  • Xen Terminology – 2/2
    • Networking Infrastructure
      • backend: one half of a communication end point; interdomain communication is implemented using a frontend and backend device model interacting via event channels
      • frontend: the device as presented to the guest; the other half of the communication endpoint
      • vif: virtual interface; the name of the network backend device connected by an event channel to a network frontend on the guest
      • vethN: local networking frontend on dom0; renamed to ethN by the Xen network scripts in bridging mode
      • pethN: real physical device (after renaming)
    • Migration
      • Live migration: a technique for moving a running virtual machine to another physical host, without stopping it or the services running on it
    • Scheduling (a toy sketch follows below)
      • BVT: the Borrowed Virtual Time scheduler is used to give proportional fair shares of the CPU to domains
      • SEDF: the Simple Earliest Deadline First scheduler provides weighted CPU sharing in an intuitive way and uses realtime algorithms to ensure time guarantees
    • Extracted from: http://wiki.xensource.com/xenwiki/XenTerminology
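    • A toy C illustration of proportional-share scheduling, the idea behind BVT-style schedulers (this is not Xen's scheduler code; names and weights are invented): each domain accumulates virtual time inversely to its weight, and the domain with the smallest virtual time runs next, so CPU time is shared in proportion to the weights:

      #include <stdio.h>

      struct dom {
          const char *name;
          unsigned    weight;   /* relative CPU share */
          double      vtime;    /* accumulated virtual time */
      };

      static struct dom doms[] = {
          { "dom0",  512, 0.0 },
          { "domU1", 256, 0.0 },
          { "domU2", 256, 0.0 },
      };
      #define NDOMS (sizeof doms / sizeof doms[0])

      /* pick the domain with the smallest virtual time */
      static struct dom *pick_next(void)
      {
          struct dom *best = &doms[0];
          for (size_t i = 1; i < NDOMS; i++)
              if (doms[i].vtime < best->vtime)
                  best = &doms[i];
          return best;
      }

      int main(void)
      {
          const double slice_ms = 10.0;
          for (int tick = 0; tick < 8; tick++) {
              struct dom *d = pick_next();
              d->vtime += slice_ms / d->weight;   /* heavier weight => slower virtual time */
              printf("tick %d: run %-5s (vtime %.4f)\n", tick, d->name, d->vtime);
          }
          return 0;
      }

    • With these weights dom0 receives roughly half of the slices and each domU a quarter; the real BVT scheduler adds latency “warping” on top of this basic proportional-share mechanism, which the sketch ignores.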
  • Intel privileged instructions
    • Some of the system instructions (called “privileged instructions”) are protected from use by application programs. The privileged instructions control system functions, such as the loading of system registers. They can be executed only when the CPL is 0 (most privileged). If one of these instructions is executed when the CPL is not 0, a general-protection exception (#GP) is generated. The following 16 system instructions are privileged (a small user-space demonstration follows the list):
    • LGDT — Load GDT register.
    • LLDT — Load LDT register.
    • LTR — Load task register.
    • LIDT — Load IDT register.
    • MOV (control registers) — Load and store control registers.
    • LMSW — Load machine status word.
    • CLTS — Clear task-switched flag in register CR0.
    • MOV (debug registers) — Load and store debug registers.
    • INVD — Invalidate cache, without writeback.
    • WBINVD — Invalidate cache, with writeback.
    • INVLPG — Invalidate TLB entry.
    • HLT — Halt processor.
    • RDMSR — Read Model-Specific Registers.
    • WRMSR — Write Model-Specific Registers.
    • RDPMC — Read Performance-Monitoring Counter.
    • RDTSC — Read Time-Stamp Counter.
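    • A small user-space demonstration in C: executing one of these instructions (HLT here) at CPL 3 makes the CPU raise #GP, which Linux delivers to the offending process as a signal (SIGSEGV on typical x86 Linux), so the program simply catches it:

      #include <signal.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Linux reports the #GP raised in user mode as SIGSEGV
       * (printf/exit are not async-signal-safe, but fine for a demo) */
      static void gp_handler(int sig)
      {
          printf("caught signal %d: the CPU refused the privileged instruction\n", sig);
          exit(0);
      }

      int main(void)
      {
          signal(SIGSEGV, gp_handler);

          /* HLT may only be executed at CPL 0; at CPL 3 it raises #GP(0) */
          __asm__ volatile("hlt");

          return 1;   /* never reached */
      }

    • This trap-on-privileged-instruction behaviour is exactly what a VMM has to deal with: Xen paravirtualization replaces such operations in the guest kernel with hypercalls, while HVM hardware support lets them trap safely into the hypervisor.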
  • QEMU Description - http://bellard.org/qemu/
    • http://bellard.org/qemu/qemu-tech.html
    • A fast processor emulator using a portable dynamic translator
    • Two operating modes:
    • Full system emulation
    • User mode emulation
    • Generic features:
    • User space only or full system emulation
    • Using dynamic translation to native code for reasonable speed
    • Working on x86 and PowerPC hosts. Being tested on ARM, Sparc32, Alpha and S390
    • Self-modifying code support
    • Precise exceptions support
    • The virtual CPU is a library (libqemu) which can be used in other projects
    • QEMU full system emulation features:
    • QEMU can either use a full software MMU for maximum portability or use the host system call mmap() to simulate the target MMU
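    • A loose C sketch (not QEMU's actual code) of the second approach: the emulator asks the host for one big anonymous mmap() region and treats offsets into it as guest physical addresses, so the host MMU performs the real translations:

      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>

      #define GUEST_RAM_SIZE (16UL * 1024 * 1024)   /* pretend the guest has 16 MB of RAM */

      int main(void)
      {
          /* one anonymous host mapping backs the whole guest physical address space */
          unsigned char *guest_ram = mmap(NULL, GUEST_RAM_SIZE,
                                          PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (guest_ram == MAP_FAILED) {
              perror("mmap");
              return 1;
          }

          /* guest physical address X is simply guest_ram + X on the host */
          memset(guest_ram + 0x1000, 0x90, 16);   /* e.g. write NOPs at guest address 0x1000 */
          printf("guest RAM lives at host address %p\n", (void *)guest_ram);

          munmap(guest_ram, GUEST_RAM_SIZE);
          return 0;
      }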
  • QEMU x86 emulation
    • QEMU x86 target features:
    • Support for 16 bit and 32 bit addressing with segmentation. LDT/GDT and IDT are emulated. VM86 mode is also supported to run DOSEMU
    • Support of host page sizes bigger than 4KB in user mode emulation
    • QEMU can emulate itself on x86
    • Current QEMU limitations:
    • No SSE/MMX support
    • No x86-64 support
    • IPC syscalls are missing
    • The x86 segment limits and access rights are not tested at every memory access
    • On non-x86 host CPUs, doubles are used instead of the non-standard 10-byte long doubles of x86 for floating-point emulation, to get maximum performance
  • References
    • Intel® 64 and IA-32 Architectures - Software Developer’s Manual
    • http://wiki.xensource.com/xenwiki/XenArchitecture?action=AttachFile&do=get&target=Xen+Architecture_Q1+2008.pdf
    • http://wiki.xensource.com/xenwiki/XenArchitecture
    • http://www.xen.org/files/xensummit_4/Liguori_XenSummit_Spring_2007.pdf
    • http://wiki.xensource.com/xenwiki/XenTerminology
    • http://www.xen.org/xen/faqs.html
    • http://www.vmware.com/pdf/esx2_performance_implications.pdf
    • http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf
    • http://download.openvz.org/doc/OpenVZ-Users-Guide.pdf
    • http://download.openvz.org/doc/openvz-intro.pdf
    • KVM project @ Sourceforge.net
    • Paravirtualized file systems, KVM Forum 2008
    • Increasing Virtual Machine density with KSM, KVM Forum 2008
    • Beyond kvm.ko, KVM Forum 2008
    • Open-OVF: an OSS project around the Open Virtual Appliance format, KVM Forum 2008
    • Cross platform guest support, KVM Forum 2008