KVM: Kernel-based Virtual Machine
• Uses hardware virtualization support (Intel VT-x)
  – No ring aliasing as in Xen
• CPU virtualization is handled by the KVM kernel module; devices are emulated by QEMU
  – QEMU also provides the guest BIOS
[Figure: QEMU runs as a Ring 3 process in VMX root mode and performs device emulation and VM management; the KVM module in the Linux kernel (Ring 0) switches into the guest with VM Entry and back with VM Exit through the VMCS; the guest OS kernel runs in VMX non-root mode.]
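A minimal sketch of how a guest is started under this architecture (guest.img and the option values are placeholders): launching QEMU with KVM enabled creates a single process whose VCPUs are ordinary threads.

$ qemu-system-x86_64 -enable-kvm -smp 2 -m 2048 -hda guest.img
$ ps -eLf | grep qemu     # each VCPU appears as a thread of the QEMU process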
CPU virtualization: Xen vs. KVM
[Figure: in Xen, the hypervisor's domain scheduler assigns the VCPUs of each VM (Dom0 and DomU guests) to physical CPUs; in KVM, each VM is a QEMU process whose VCPUs are threads, and the ordinary Linux process scheduler places them on physical CPUs.]
Memory virtualization
[Figure: on bare metal, the OS translates virtual addresses (VA) to physical addresses (PA) with a single page table walked by the MMU via CR3. With a guest OS, two mappings are involved: the guest maps guest-virtual addresses (GVA) to guest-physical addresses (GPA), and the VMM maps GPA to host-physical addresses (HPA), while the hardware MMU still walks only one page table.]
Address translation: PVM, HVM with shadow page tables, HVM with EPT
[Figure: (a) PVM: the paravirtualized guest OS keeps page tables that map GVA directly to HPA, and the hardware MMU (CR3) walks them. (b) HVM with shadow page tables (SPT): the guest keeps GVA-to-GPA tables, the VMM maintains shadow tables mapping GVA to HPA, and CR3 points at the shadow tables. (c) HVM with EPT: the guest's own page tables (CR3) map GVA to GPA and the VMM's EPT maps GPA to HPA; the hardware walks both.]
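Whether EPT is actually in use can be checked on a KVM host; a small sketch assuming an Intel CPU with the kvm_intel module loaded:

$ grep -c ept /proc/cpuinfo                   # non-zero if the CPU advertises EPT
$ cat /sys/module/kvm_intel/parameters/ept    # Y if KVM uses EPT for GPA-to-HPA translation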
VM I/O virtualization
• Ways to provide I/O devices to a VM
[Figure: overview strip comparing paravirtualized I/O (virtio, vhost) through the VMM and Open vSwitch, PCI passthrough with VT-d, and SR-IOV]
• Full emulation
  – QEMU emulates real devices such as ne2000, rtl8139, and e1000
• Paravirtualization
  – Xen split driver model
  – virtio / vhost
  – VMware VMXNET3
• Direct assignment (VMM-bypass I/O)
  – PCI passthrough
  – SR-IOV
VM I/O configurations
[Figure: (a) I/O emulation: each VM's guest driver talks to an emulated device, and the VMM's vSwitch plus a shared physical driver multiplex one NIC. (b) PCI passthrough: each VM runs the physical driver for its own dedicated NIC. (c) SR-IOV: each VM's driver owns a virtual function of a single shared NIC. Traffic between NICs is bridged by a switch (VEB).]
Edge Virtual Bridging (IEEE 802.1Qbg)
• Standards for switching traffic between VMs at the network edge
[Figure: (a) Software VEB: the VMM's vSwitch bridges the VNICs in software. (b) Hardware VEB: a NIC with an embedded switch bridges the VNICs. (c) VEPA / VN-Tag: VM traffic is forwarded to the external switch and hairpinned back.]
VEB: Virtual Ethernet Bridging   VEPA: Virtual Ethernet Port Aggregator
I/O emulation
• The guest OS uses its unmodified driver (e.g. e1000) for a device emulated by QEMU
  – No changes to the guest are required
• Device accesses trap to QEMU, causing frequent VM Exits and extra data copies
[Figure: the guest's e1000 driver (VMX non-root mode) traps into QEMU (Ring 3, VMX root mode), which emulates the e1000 and copies packets through a tap device into the Linux kernel/KVM vSwitch and out via the physical driver.]
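As a sketch, this is how a fully emulated NIC is typically requested from QEMU (legacy -net syntax; guest.img, tap0 and the ifup script path are placeholders):

$ qemu-system-x86_64 -enable-kvm -m 1024 -hda guest.img \
    -net nic,model=e1000 \
    -net tap,ifname=tap0,script=/etc/qemu-ifup

Every register access by the guest's e1000 driver then exits to QEMU for emulation.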
virtio
• Far fewer VM Exits than full emulation
• Requests are exchanged through shared virtio_ring buffers
  – Multiple I/O requests can be passed per notification
[Figure: the guest's virtio_net driver (VMX non-root mode) shares ring buffers with QEMU (Ring 3, VMX root mode), which copies packets through a tap device into the Linux kernel/KVM vSwitch and out via the physical driver.]
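A comparable invocation with the paravirtualized NIC only changes the device model (same placeholders as above; the guest needs the virtio_net driver):

$ qemu-system-x86_64 -enable-kvm -m 1024 -hda guest.img \
    -net nic,model=virtio \
    -net tap,ifname=tap0,script=/etc/qemu-ifup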
vhost
• vhost_net moves the virtio data path out of QEMU and the tap device into the host kernel
• Often combined with macvlan/macvtap to attach guests directly to the physical NIC
[Figure: the guest's virtio_net driver (VMX non-root mode) exchanges ring buffers with the in-kernel vhost_net, which forwards packets through macvtap/macvlan to the physical driver; QEMU (Ring 3) is no longer on the data path.]
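With the -netdev syntax, vhost is enabled per backend; a minimal sketch with a plain tap device (placeholders as before). Using macvtap as in the figure means passing QEMU the macvtap device's file descriptor instead of an ifname, which management tools such as libvirt automate.

$ qemu-system-x86_64 -enable-kvm -m 1024 -hda guest.img \
    -netdev tap,id=net0,ifname=tap0,script=/etc/qemu-ifup,vhost=on \
    -device virtio-net-pci,netdev=net0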
Direct assignment: PCI passthrough / SR-IOV
• The guest's driver controls the device directly, bypassing the VMM's I/O path (virtio/vhost, Open vSwitch)
  – Device DMA goes straight to guest memory; the VMM only sets up VT-d to remap the DMA addresses
[Figure: the guest OS kernel (VMX non-root mode) runs the physical driver and owns the device buffers; DMA is remapped by VT-d hardware; interrupt completion (EOI) still involves the host side (QEMU and the Linux kernel/KVM in VMX root mode).]
VT-d: DMA remapping
[Figure: a NIC assigned to VM1 issues DMA; the IOMMU remaps the addresses using per-device tables set up by the VMM, so the data lands directly in the guest's memory. Interrupts still cause a VM Exit to the VMM/host OS and a VM Entry back into the guest via the VMCS. VMCS: Virtual Machine Control Structure]
VT-d: device-to-domain mapping (Intel VT-d specification)
• Each DMA request carries a source-id (requester ID)
  – For PCI Express devices this is the requester identifier of the transaction layer: the Bus/Device/Function (BDF) number assigned by configuration software (Figure 3-6: Bus# in bits 15:8, Device# in 7:3, Function# in 2:0)
• The source-id selects the translation structures used for the DMA remapping page walk
  – Root-entry table: 4 KB, 256 entries indexed by bus number (0-255), located through the Root-entry Table Address register; if the present field of the root entry used for a DMA request is clear, the request is blocked (translation fault)
  – Context-entry table: referenced by the root entry, 256 entries per bus, indexed by device and function; a context-entry maps a specific I/O device to the domain it is assigned to and to that domain's address translation structures (Figure 3-7)
  – Address translation structures: multi-level page tables per domain, e.g. a 3-level table with 4 KB pages or a 2-level table with 2 MB super pages (Figures 3-8 and 3-9)
• Translations are cached in the IOTLB; a request that hits a page-table entry without the required Read/Write permission is blocked
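Before assigning a device, one can check that the platform exposes these remapping structures; a small sketch (exact kernel log strings vary by version):

$ dmesg | grep -i -e DMAR -e IOMMU     # DMAR ACPI table and IOMMU initialization messages
$ cat /proc/cmdline                    # should include intel_iommu=on (see the assignment steps later)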
VT-d: Interrupt remapping
• Delivers MSI/MSI-X interrupts from an assigned device to the appropriate VM
• An interrupt arrives as an MSI write request and is looked up in the Interrupt Remapping Table (IRT)
  – VT-d redirects it to the destination CPU
• On the bus, interrupts appear as DMA write requests carrying a destination ID
  – The VMM programs the IRT through VT-d
ELI: Exit-Less Interrupts
• "ELI: Bare-Metal Performance for I/O Virtualization", A. Gordon et al., ASPLOS 2012
  – With device assignment, a physical interrupt still forces an exit to the host, which handles it and injects a virtual interrupt into the guest through the guest IDT; ELI (Exit-Less Interrupt) instead delivers interrupts of assigned devices directly to the guest via a shadow IDT, removing these VM Exits
  – Interrupt completion (LAPIC EOI) is likewise handled without exits by exposing the x2APIC EOI register to the guest through the MSR bitmap
  – With ELI, netperf, Apache, and memcached reach 97-100% of bare-metal (BMM) performance
[Figure 1 of the paper: guest/host context switches during interrupt handling for (a) baseline device assignment, (b) ELI delivery, (c) ELI delivery & completion, (d) bare metal. Figure 2: ELI interrupt delivery flow, in which assigned interrupts go through the shadow IDT to the guest handler while non-assigned interrupts still exit to the host IDT.]
PCI-SIG I/O Virtualization
• I/O virtualization specifications for PCIe (PCIe Gen2 era)
  – SR-IOV (Single Root I/O Virtualization)
    • One device exposes multiple virtual functions that can be assigned to VMs
    • Mainly implemented by NICs
  – MR-IOV (Multi Root I/O Virtualization)
    • Shares one I/O device among multiple hosts
    • Few implementations (related technology: NEC ExpEther)
• VMM support for SR-IOV is widespread
  – KVM, Xen, VMware, Hyper-V
  – On Linux, device assignment is handled through VFIO
SR-IOV NIC
• A single physical NIC presents multiple virtual NICs (vNICs) that can be assigned to VMs
  – vNIC = VF (Virtual Function)
[Figure: VM1-VM3 each own a vNIC backed by a Virtual Function; the NIC's internal L2 classifier/sorter distributes RX/TX traffic between the VFs and the shared MAC/PHY without involving the VMM.]
SR-IOV NIC: PF and VF
• Physical Function (PF)
  – Managed by the VMM/host through the PF driver
• Virtual Function (VF)
  – Assigned to a VM; the guest OS drives it with a VF driver
  – VFs are created and configured through the PF
  – e.g. the Intel 82576 provides up to 8 VFs per port (the SR-IOV specification allows up to 256 functions)
[Figure: the guest OS sees the VF's configuration space (VFn0) as its VM device config space and drives it with the VF driver; the VMM's PF driver owns the system device config space (PFn0, VFn0-VFn2) of the physical NIC.]
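VFs are instantiated from the PF on the host; a hedged sketch of the two common ways (eth5 is a placeholder; the sysfs interface needs a reasonably new kernel, and max_vfs is a parameter of Intel drivers such as igb/ixgbe):

# echo 4 > /sys/class/net/eth5/device/sriov_numvfs    # create 4 VFs via sysfs
# modprobe igb max_vfs=7                               # older method: driver module parameter
$ lspci | grep "Virtual Function"                      # the VFs appear as new PCI functions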
Attaching paravirtualized NICs to the network
1. Bridge + tap: each VM's tap device is attached to a software bridge together with the physical NIC
   – The bridge can be the standard Linux bridge or Open vSwitch
2. MAC-address-based tap: macvlan/macvtap attaches each VM to the physical NIC with its own MAC address (example below)
   – No separate bridge device is needed
[Figure: option 1: VM1/VM2 eth0 <-> tap0/tap1 <-> bridge <-> host eth0. Option 2: VM1/VM2 eth0 <-> macvlan0/macvlan1 <-> host eth0.]
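Option 2 can be set up by hand; a minimal sketch (interface names are placeholders):

# ip link add link eth0 name macvtap0 type macvtap mode bridge
# ip link set macvtap0 up
$ ls /dev/tap$(cat /sys/class/net/macvtap0/ifindex)    # character device that QEMU opens for the VM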
Open vSwitch
• A software switch that can replace the standard Linux bridge
  – Richer switching functions for virtualized environments
• OvS features
  – OpenFlow support
  – Flow-based switching, VLANs, tunneling
• Merged into Linux kernel 3.3
• Also used as the switching software of hardware switches (e.g. Pica8/Pronto)
http://openvswitch.org/
Example: a VLAN per VM
• No VLAN configuration inside the guest OS
• Each VM's port is tagged with its own VLAN ID
# ovs-vsctl add-br br0
# ovs-vsctl add-port br0 tap0 tag=101
# ovs-vsctl add-port br0 tap1 tag=102
# ovs-vsctl add-port br0 eth0
[Figure: VM1 (tap0, VLAN ID 101) and VM2 (tap1, VLAN ID 102) attached to vSwitch br0, which uplinks through eth0; internally the tagged path corresponds to tap0 <-> br0_101 <-> eth0.101.]
Using Open vSwitch with QEMU/KVM
$ cat /etc/ovs-ifup
#!/bin/sh
switch='br0'
/sbin/ip link set mtu 9000 dev $1 up
/opt/bin/ovs-vsctl add-port ${switch} $1

$ cat /etc/ovs-ifdown
#!/bin/sh
switch='br0'
/sbin/ip link set $1 down
/opt/bin/ovs-vsctl del-port ${switch} $1

When QEMU/KVM brings up the VM's tap interface, these scripts add it to and remove it from the OvS bridge with ovs-vsctl (instead of brctl as for the Linux bridge).
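The scripts are passed to QEMU when the tap NIC is created; a sketch of the corresponding invocation (guest.img is a placeholder):

$ qemu-system-x86_64 -enable-kvm -m 1024 -hda guest.img \
    -net nic,model=virtio \
    -net tap,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown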
Steps for PCI device assignment
1. Enable Intel VT and VT-d in the BIOS
2. Enable VT-d in the Linux kernel
   – Boot with intel_iommu=on on the kernel command line
3. Unbind the PCI device from its host driver (see the sketch below)
4. Start the guest with the device assigned
5. Use the device from the guest OS with its native driver
"How to assign devices with VT-d in KVM,"
http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM
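A hedged sketch of steps 3 and 4 using VFIO (the PCI address 0000:01:00.0 and the IDs 8086 10fb are examples; the wiki page above describes the older pci-stub/pci-assign method):

# modprobe vfio-pci
# echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind   # step 3: detach from the host driver
# echo 8086 10fb > /sys/bus/pci/drivers/vfio-pci/new_id                 # bind the device to vfio-pci
$ qemu-system-x86_64 -enable-kvm -m 1024 -hda guest.img \
    -device vfio-pci,host=01:00.0                                       # step 4: assign it to the guest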
SR-IOV: the VF as seen from the guest OS
• The guest OS sees the VF as an ordinary PCI device and receives its MSI interrupts

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

$ cat /proc/interrupts
           CPU0       CPU1
...snip...
 29:     114941     114133   PCI-MSI-edge   eth1-rx-0
 30:      77616      78385   PCI-MSI-edge   eth1-tx-0
 31:          5          5   PCI-MSI-edge   eth1:mbx
SR-IOV: per-VF rate limiting
• The NIC can rate-limit each VF
• Configured from the host OS with ip link
# ip link set dev eth5 vf 0 rate 200
# ip link set dev eth5 vf 1 rate 400
# ip link show dev eth5
42: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether 00:1b:21:81:55:3e brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:16:3e:1d:ee:01, tx rate 200 (Mbps), spoof checking on
    vf 1 MAC 00:16:3e:1d:ee:02, tx rate 400 (Mbps), spoof checking on
SR-IOV TIPS
• Setting a VF's MAC address
# ip link set dev eth5 vf 0 mac 00:16:3e:1d:ee:01
• Setting a VF's VLAN ID
# ip link set dev eth5 vf 0 vlan 101
• Supported by NICs such as the Intel 82576 GbE and 82599/X540 10GbE controllers
  – See Intel's list of Ethernet controllers:
    http://www.intel.com/content/www/us/en/ethernet-controllers/ethernet-controllers.html
Live migration of a VM with an assigned device
• A VM with a directly assigned device cannot be live-migrated as-is
• Workaround: PCI hot-unplug plus NIC bonding
  – Hot-remove the assigned PCI NIC before migration
  – In the guest, bond the SR-IOV VF with a virtio (paravirtual) NIC in active-backup mode so they act as one logical NIC (sketch below)
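Inside the guest, the bond shown on the next slides can be created roughly as follows (a sketch assuming eth0 is the virtio NIC and eth1 the VF; distributions usually wrap this in their network scripts):

# modprobe bonding mode=active-backup miimon=100 primary=eth1
# ifenslave bond0 eth0 eth1    # eth1 (VF) is the primary fast path; eth0 (virtio) takes over when the VF is removed
# ip link set bond0 up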
SR-IOV: before migration
[Figure: on the source host, the guest OS bonds eth0 (virtio, connected via tap0 and br0 to the host's eth0/igb) and eth1 (igbvf, the SR-IOV VF) into bond0; the destination host has the same br0, eth0 (igb) and SR-IOV NIC ready.]
SR-IOV: detaching the VF
(qemu) device_del vf0
[Figure: the VF (eth1/igbvf) is hot-removed from the guest, and traffic fails over to eth0 (virtio) within bond0.]
SR-IOV: live migration
(qemu) migrate -d tcp:x.x.x.x:y        (on the source host)
$ qemu -incoming tcp:0:y ...           (on the destination host)
[Figure: the guest, now using only eth0 (virtio) through tap0 and br0, is migrated from the source host to the destination host.]
Evaluation environment: AIST Green Cloud (AGC)
• HPC benchmarks run on 16 nodes, one VM per host

Compute node: Dell PowerEdge M610
  CPU          Intel quad-core Xeon E5540/2.53GHz x2
  Chipset      Intel 5520
  Memory       48 GB DDR3
  InfiniBand   Mellanox ConnectX (MT26428)
Blade switch
  InfiniBand   Mellanox M3601Q (QDR, 16 ports)
Host machine environment
  OS           Debian 6.0.1
  Linux kernel 2.6.32-5-amd64
  KVM          0.12.50
  Compiler     gcc/gfortran 4.4.5
  MPI          Open MPI 1.4.2
VM environment
  VCPU         8
  Memory       45 GB
MPI point-to-point bandwidth (higher is better)
[Figure: bandwidth in MB/sec (log scale) measured with qperf versus message size (1 byte to 1 GB), comparing Bare Metal with KVM using PCI passthrough of the InfiniBand HCA; peak bandwidth is about 3.2 GB/s on bare metal versus about 2.4 GB/s under KVM.]
NPB BT-MZ (higher is better)
[Figure: total performance in Gop/s (up to about 300) and parallel efficiency in % for 1 to 16 nodes, comparing Bare Metal, KVM, and Amazon EC2 Cluster Compute Instances (CCI); the degradation of parallel efficiency is about 2% for KVM and 14% for EC2 CCI.]
Bloss (higher is better)
• Hybrid MPI + OpenMP application with coarse-grained MPI communication: rank 0 broadcasts about 760 MB, the linear solver (requiring about 10 GB of memory) exchanges about 1 GB via Reduce/Bcast, and the eigenvector calculation gathers about 350 MB
[Figure: parallel efficiency in % for 1 to 16 nodes, comparing Bare Metal, KVM, Amazon EC2, and the ideal case; the degradation of parallel efficiency is about 8% for KVM and 22% for EC2 CCI.]
Evaluation environment: VMware ESXi
• Host: Dell PowerEdge T410
  – CPU: Intel hexa-core Xeon X5650, single socket
  – Memory: 6 GB DDR3-1333
  – HBA: QLogic QLE2460 (single-port 4 Gbps Fibre Channel)
• Storage: IBM DS3400 FC SAN
• VMM: VMware ESXi 5.0
• Guest OS: Windows Server 2008 R2
  – 8 vCPUs, 3840 MB of memory
• Benchmark: IOMeter 2006.07.27 (http://www.iometer.org/)
[Figure: the T410 is connected to the DS3400 over Fibre Channel; Ethernet is used for out-of-band management.]

Three storage configurations are compared:
  Bare Metal Machine (BMM): Windows / NTFS / volume manager / disk class driver / Storport and FC HBA driver directly on the LUN
  Raw Device Mapping (RDM): the guest's Storport/SCSI driver goes through the VMkernel's FC HBA driver to the LUN
  VMDirectPath I/O (FPT): the guest runs the Storport/FC HBA driver and accesses the HBA directly, bypassing the VMkernel storage stack