I/O Virtualization Frontline: Focusing on Network I/O




August 24, 2012
VM
     VM   OS




– 

– 
               2
I/O
•  I/O
     – 
          •  DB    HPC
     – 
          • 
• 
     –  I/O                    PCI
        SR-IOV …
• 
     – 
     –                   I/O
                                     3
[Figure: The three VM network I/O paths covered in this talk — paravirtual I/O (virtio, vhost) through the VMM's Open vSwitch, PCI pass-through via VT-d, and SR-IOV.]
                         VM: Virtual Machine
                         VMM: Virtual Machine Monitor
                         SR-IOV: Single Root I/O Virtualization
                                                              4
•  Network I/O virtualization methods
   –  Paravirtualization: virtio, vhost
   –  PCI pass-through
   –  SR-IOV
•  Hands-on with QEMU/KVM
•  Research topics
   –  HPC clouds
                              5
6
•                 CPU        I/O

             OS
     –  OS
        • 




                        OS




                                   7
•  A virtual machine (VM) presents the same interface (I/F) as real hardware, so an existing OS can run on it
•  VM technology dates back to the 1960s
     –  1972: IBM VM/370
     –  1973: ACM Workshop on Virtual Computer Systems
[Figure: multiple OSes, each running in its own VM, multiplexed on a single VMM.]
                                                        8
History of virtualization on Intel x86
•  Software approaches
     –  Full virtualization: VMWare (1999)
          •  Works around the x86 instructions that violate the Popek & Goldberg virtualization requirements
     –  Paravirtualization: Xen (2003) exposes an explicit VMM interface
          •  Requires a modified guest OS
•  Hardware support
     –  Intel VT / AMD-V (2006)
     –  Building a VMM became much simpler!
     –  New VMMs appeared
          •  KVM (2006), BitVisor (2009), BHyVe (2011)
                                                            9
Intel VT (Virtualization Technology)
•  CPU virtualization
   –  VT-x for IA32 / Intel 64
   –  VT-i for Itanium
•  I/O virtualization
   –  VT-d (Virtualization Technology for Directed I/O)
   –  VT-c (Virtualization Technology for Connectivity)
         •  VMDq, IOAT, SR-IOV
•  AMD offers equivalent extensions (AMD-V)
                                     VMDq: Virtual Machine Device Queues
                                     IOAT: I/O Acceleration Technology
                                                                      10
KVM: Kernel-based Virtual Machine
  •  Built on hardware virtualization support
         –  No software tricks such as the ring aliasing used by pre-VT Xen
  •  KVM virtualizes the CPU and memory; QEMU provides device emulation
         –  including the BIOS
[Figure: The host OS and QEMU run in VMX root mode, the guest OS in VMX non-root mode. QEMU (Ring 3) handles device emulation and memory management; the KVM module in the Linux kernel (Ring 0) drives VM Entry / VM Exit transitions through the VMCS; the guest OS kernel runs in Ring 0 of non-root mode.]
                                                                                 11
CPU scheduling
[Figure: Xen vs. KVM. In Xen, the VCPUs of each VM (Dom0, DomU guests) are scheduled onto physical CPUs by the Xen hypervisor's domain scheduler. In KVM, each VM is a QEMU process; its VCPU threads are scheduled onto physical CPUs by the ordinary Linux process scheduler.]
                                                                 12
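Because KVM VCPUs are ordinary Linux threads, they can be inspected and pinned with standard tools; a minimal sketch (the thread ID 12345 and the CPU number are placeholders):

$ ps -eLo pid,tid,comm | grep qemu     # list the QEMU process and its VCPU threads
$ sudo taskset -p -c 2 12345           # pin one VCPU thread (TID 12345) to physical CPU 2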
Memory virtualization
[Figure: On bare metal, the OS translates virtual addresses (VA) to physical addresses (PA) through the MMU (CR3) and its page tables. Under virtualization, the guest OS translates guest virtual addresses (GVA) to guest physical addresses (GPA), and the VMM translates GPA to host physical addresses (HPA).]
                                                  13
Address translation: PVM, HVM with shadow page tables, HVM with EPT
[Figure: PVM — the guest OS maintains page tables that map GVA directly to HPA, with the VMM validating updates. HVM with shadow page tables (SPT) — the guest OS keeps GVA→GPA tables while the VMM maintains shadow GVA→HPA tables that the MMU (CR3) actually walks. HVM with EPT — the MMU walks both the guest's GVA→GPA tables (via CR3) and the VMM's GPA→HPA tables in hardware.]
                                                                                14
Intel Extended Page Table (EPT)
[Figure: CR3 points to the guest page tables (GVA→GPA) managed by the guest OS; EPTP points to the extended page tables (GPA→HPA) managed by the VMM. The TLB caches the combined GVA→HPA translation; on a TLB miss the hardware page walk traverses both structures (3 levels in the figure; 4 levels on Intel x64).]
                                                TLB: Translation Look-aside Buffer
                                                                                 15
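A rough cost model, assuming 4-level guest paging and a 4-level EPT as on Intel x64: a worst-case two-dimensional walk over an n-level guest table and an m-level EPT touches up to n*m + n + m entries, i.e. 4*4 + 4 + 4 = 24 memory references, which is why the TLB and the paging-structure caches are critical to EPT performance.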
I/O


      16
I/O
•  Port-mapped I/O (PIO)
• 
•  DMA (Direct Memory Access)

[Figure: Left, programmed I/O — the CPU issues IN/OUT instructions directly to the I/O device. Right, DMA — the CPU (1) programs the DMA transfer, the device (2) performs the DMA and (3) raises an interrupt, and the CPU (4) acknowledges with EOI.]
                                                 EOI: End Of Interrupt
                                                                   17
PCI
• 
     –            INTx
           •  4
     –  MSI/MSI-x (Message Signaled Interrupt)
           •                    DMA write
•  IDT (Interrupt Descriptor Table): maps interrupt vectors to the OS's handlers
• 

[Figure: A PCI device can signal a legacy INTx line (INT A) through the IOAPIC, or issue an MSI directly as a DMA write; the CPU's Local APIC receives the interrupt and the handler finishes with EOI. In a virtualized system the VMM intercepts and injects these interrupts.]
                                                                  18
PCI
   •  A PCI device is identified by its BDF (Bus# / Device# / Function#) number.

        –  PCI
             •                                   1
             •  NIC                    1
             •  SR-IOV                             VF
$ lspci -tv
... snip ...                                      (the two functions below belong to a dual-port GbE NIC)
 -[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port	
             +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet	
             |            -00.1  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet	
             +-03.0-[05]--	
             +-07.0-[06]----00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connect	
             +-09.0-[03]--	
... snip ...	

                                                                                              19
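To find the BDF (and the vendor:device ID needed later) of a specific NIC, the sysfs link of its network interface can be followed; a small sketch, assuming the interface is called eth0:

$ readlink /sys/class/net/eth0/device
../../../0000:06:00.0
$ lspci -n -s 06:00.0          # prints the numeric vendor:device ID for this BDF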
How a VM performs network I/O
•  Device emulation
     –  The VMM emulates a well-known NIC in software
          •  QEMU emulates ne2000, rtl8139, e1000, etc.
•  Paravirtualized drivers
     –  Xen split driver model
     –  virtio, vhost
     –  VMWare VMXNET3
•  Direct assignment (VMM-bypass I/O)
     –  PCI pass-through
     –  SR-IOV
[Figure: the three paths — (virtio, vhost) via Open vSwitch in the VMM, PCI pass-through via VT-d, and SR-IOV.]
                                                                              20
Comparison of VM network I/O paths
[Figure: (1) I/O emulation — each VM's guest driver talks to an emulated NIC; traffic goes through the VMM's vSwitch and physical driver to the NIC. (2) PCI passthrough — the VM's physical driver drives the NIC directly. (3) SR-IOV — each VM's driver drives its own Virtual Function; the switch embedded in the NIC (VEB) forwards frames between VFs and the physical port.]
                                                                                       21
Edge Virtual Bridging (IEEE 802.1Qbg)
•  Where VM-to-VM traffic is switched
[Figure: (a) Software VEB — the VMM's vSwitch bridges the VNICs. (b) Hardware VEB — the switch embedded in the NIC bridges the VNICs (as in SR-IOV). (c) VEPA / VN-Tag — all traffic is forwarded to the external switch and hairpinned back.]
                    VEB: Virtual Ethernet Bridging     VEPA: Virtual Ethernet Port Aggregator
                                                                                                22
Full I/O emulation
  •  The guest OS uses its existing driver (e.g. e1000) unmodified
  •  Every register access traps to QEMU, causing frequent VM Exits
[Figure: In VMX root mode, QEMU's e1000 device model (Ring 3) copies packets to and from a tap device; the Linux kernel/KVM forwards them through the vSwitch and the physical driver. In VMX non-root mode, the guest's e1000 driver works against the emulated device's buffers.]
                                                                                        23
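For reference, the emulated NIC model is selected with QEMU's -net nic option; a minimal sketch (MAC address arbitrary, other options elided):

$ sudo kvm -net nic,model=e1000,macaddr=00:16:3e:1d:ff:02 -net tap,ifname=tap1 ...

model=rtl8139 and model=ne2k_pci select the other emulated NICs mentioned above.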
virtio
  •  Paravirtualized I/O that reduces the number of VM Exits
  •  Guest and host exchange requests through shared virtio_ring buffers
         –  I/O requests are batched, so fewer exits per packet
[Figure: QEMU's virtio_net backend (VMX root mode, Ring 3) copies packets between the guest's virtio_net driver (VMX non-root mode) and a tap device connected to the vSwitch and physical driver.]
                                                                                       24
vhost
  •  Moves the virtio backend out of QEMU into the host kernel (vhost_net), eliminating the user-space copy through the tap device
  •  Can be combined with macvlan/macvtap to bypass the software bridge
[Figure: vhost_net in the Linux kernel/KVM exchanges buffers directly with the guest's virtio_net driver; macvtap/macvlan connect it to the physical driver.]
                                                                             25
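A minimal sketch of enabling the in-kernel vhost-net backend with QEMU's -netdev syntax (interface name and MAC are examples; the vhost_net module must be loaded on the host):

$ sudo modprobe vhost_net
$ sudo kvm ... \
    -netdev tap,id=net0,ifname=tap0,vhost=on \
    -device virtio-net-pci,netdev=net0,mac=00:16:3e:1d:ff:01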
Device pass-through to a VM
  •  The guest's own driver controls the physical device
         –  DMA bypasses the VMM; VT-d remaps the addresses
         –  Interrupts are still intercepted by the VMM and injected into the guest
[Figure: The guest OS kernel (VMX non-root mode) runs the physical driver and its buffers; DMA goes through VT-d directly to guest memory; the host handles the physical interrupt and EOI.]
                                                                                         26
Interrupt and DMA flow with an assigned device
[Figure: The NIC DMAs through the IOMMU directly into VM1's memory. The physical interrupt causes a VM Exit to the VMM, which records state in the VMCS and injects a virtual interrupt into the guest OS on VM Entry.]
                               VMCS: Virtual Machine Control Structure
                                                                    27
Intel VT-d: I/O virtualization support
•  Lets the VMM safely assign I/O devices directly to a guest OS
     –  Without it, a guest in control of a device could DMA into memory belonging to the VMM or to other guests
•  VT-d provides
     –  DMA remapping (IOMMU)
     –  Interrupt remapping
                                                              28
VT-d: DMA remapping
•  Problem
     –  A device assigned to a guest OS issues DMA with guest physical addresses
     –  Letting such DMA through unchecked is not acceptable (NG)
•  Solution: remap each VM's DMA through an IOMMU
[Figure: DMA from the I/O side is translated by the IOMMU, just as CPU accesses are translated by the MMU+EPT.]
                                         29
DMA remapping data structures (Intel VT-d specification)
•  The source-id of a DMA request is the PCI Express requester-id, i.e. the device's Bus# / Device# / Function# (BDF) assigned by configuration software (Figure 3-6, Requester Identifier Format)
•  The bus number indexes a root-entry table (256 entries, one per bus); each root entry points to a context-entry table (256 entries, one per device/function) that maps the device to its domain and the domain's translation structures (Figure 3-7, Device to Domain Mapping Structures)
•  Translation then walks a multi-level page table, e.g. a 3-level structure with 4 KB pages or a 2-level structure with 2 MB super pages (Figure 3-8, Example Multi-level Page Table)
                                                                                   30
VT-d: Interrupt remapping
•  MSIs from an assigned device must be delivered to the right VM
•  MSI/MSI-X interrupts are issued as DMA write requests
•  The Interrupt Remapping Table (IRT) translates each MSI write request
  –  VT-d decides the target CPU from the IRT
       •  rather than trusting the destination ID carried in the DMA write request
  –  The VMM programs the IRT
                                                           31
ELI: Exit-Less Interrupts
•  "ELI: Bare-Metal Performance for I/O Virtualization", A. Gordon, et al., ASPLOS 2012
     –  Even with device assignment, interrupt delivery and completion force VM Exits; the host handles the physical interrupt and injects a virtual interrupt through the guest IDT
     –  ELI delivers interrupts of assigned devices directly to the guest via a shadow IDT, and lets the guest complete them through the x2APIC (controlled with the MSR bitmap), eliminating the exits
     –  netperf, Apache, and memcached reach 97-100% of bare-metal performance
[Figure 1: Exits during interrupt handling — baseline, ELI delivery, ELI delivery & completion, bare-metal. Figure 2: ELI interrupt delivery flow.]
                                                                    32
PCI-SIG I/O Virtualization
•  I/O virtualization specifications for PCIe Gen2
     –  SR-IOV (Single Root I/O Virtualization)
          •  Shares one device among multiple VMs on a single host
          •  Available in commodity NICs
     –  MR-IOV (Multi Root I/O Virtualization)
          •  Shares one device among multiple hosts
          •  Few implementations so far
          •  Related: NEC ExpEther
•  SR-IOV is supported by the major VMMs
     –  KVM, Xen, VMWare, Hyper-V
     –  Linux: VFIO (see the sketch after this slide)
                                                   33
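A rough sketch of the VFIO route, reusing the BDF and vendor/device ID examples from the later pci_stub slides (a kernel with the vfio-pci driver is assumed):

# modprobe vfio-pci
# echo 0000:06:00.0 > /sys/bus/pci/devices/0000:06:00.0/driver/unbind
# echo "8086 10fb" > /sys/bus/pci/drivers/vfio-pci/new_id
$ sudo qemu ... -device vfio-pci,host=06:00.0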
SR-IOV NIC
 •  One physical NIC presents multiple virtual NICs (vNICs), each usable by a different VM
     –  vNIC = VF (Virtual Function)
[Figure: Each VM's vNIC is backed by a Virtual Function with its own RX/TX queues; an L2 classifier/sorter inside the NIC switches frames between the VFs and the shared MAC/PHY, without involving the VMM.]
                                                               34
SR-IOV NIC: PF and VF
•  Physical Function (PF)
   –  Full-function device, managed by the VMM's PF driver
•  Virtual Function (VF)
   –  Assigned to a VM; the guest OS uses a VF driver
   –  Created and configured through the PF
   –  The number of VFs is device dependent (e.g. 8 per port on the Intel 82576; up to 256 in the SR-IOV spec)
[Figure: The guest's VF driver sees a virtual NIC whose config space (VFn0) is exposed by the physical NIC alongside the PF's config space (PFn0, VFn0, VFn1, VFn2, ...); the VMM runs the PF driver.]
                                                                                    35
How to connect the VM's tap device to the physical network
1.  Software bridge + tap
      –  Attach each VM's tap device to a bridge on the physical NIC
                •  the standard Linux bridge, or
                •  Open vSwitch
2.  macvlan/macvtap: MAC-address-based tap devices stacked directly on the physical NIC
      –  Avoids the software bridge (see the sketch after this slide)
[Figure: 1. tap0/tap1 attached to a bridge on eth0; 2. macvlan0/macvlan1 on eth0, each backing a VM's interface.]
                                                                                                              36
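A minimal macvtap sketch, assuming the physical NIC is eth0; the kernel creates a /dev/tapN node that QEMU is handed as a file descriptor:

# ip link add link eth0 name macvtap0 type macvtap mode bridge
# ip link set macvtap0 up
# qemu ... -netdev tap,id=net0,fd=3 3<>/dev/tap$(cat /sys/class/net/macvtap0/ifindex) \
        -device virtio-net-pci,netdev=net0,mac=$(cat /sys/class/net/macvtap0/address)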
Open vSwitch
•  A software switch designed for virtualized environments
     –  Can replace the standard Linux bridge
          •  Adds per-port VLANs, QoS, and traffic visibility
          •  Managed with the OvS command-line tools
     –  Supports OpenFlow
     –  Actively developed and widely used
          •  Merged into Linux kernel 3.3
          •  Used in commercial switches such as Pica8 / Pronto
                                     http://openvswitch.org/
                                                               37
Separating VM traffic with VLANs
[Figure: VM1 and VM2 (each with eth0) attach to vSwitch br0 on the host via tap0 and tap1.]
•  VLANs are assigned per tap port, with no VLAN configuration inside the guest OS
•  One VLAN ID per VM

  #   ovs-vsctl   add-br br0
  #   ovs-vsctl   add-port br0 tap0 tag=101
  #   ovs-vsctl   add-port br0 tap1 tag=102
  #   ovs-vsctl   add-port br0 eth0

Frames from tap0 leave eth0 tagged with VLAN ID 101, those from tap1 with VLAN ID 102. (With the plain Linux bridge, the equivalent needs a bridge and a VLAN sub-interface per VLAN: tap0 <-> br0_101 <-> eth0.101.)
                                                                              38
QoS control (1): ingress policing
•  Implemented on top of the Linux kernel's Qdisc machinery
•  Supports ingress policing and egress shaping
[Figure: tap0/tap1 attach to vSwitch br0; ingress policing is applied at the tap ports, egress shaping at eth0.]

  # ovs-vsctl set Interface tap0 ingress_policing_rate=10000
  # ovs-vsctl set Interface tap0 ingress_policing_burst=1000

Limits traffic received from tap0 (i.e. sent by VM1) to 10 Mbps, with a burst allowance of 1000.
                                                                         39
QoS control (2): egress shaping
•  Implemented on top of the Linux kernel's Qdisc machinery
•  Supports ingress policing and egress shaping
[Figure: same topology; egress shaping is applied at eth0.]

  # ovs-vsctl -- set port eth0 qos=@newqos \
      -- --id=@newqos create qos type=linux-htb other-config:max-rate=40000000 queues=0=@q0,1=@q1 \
      -- --id=@q0 create queue other-config:min-rate=10000000 other-config:max-rate=10000000 \
      -- --id=@q1 create queue other-config:min-rate=20000000 other-config:max-rate=20000000

  # ovs-ofctl add-flow br0 "in_port=3 idle_timeout=0 actions=enqueue:1:1"

Queueing disciplines: HTB and HFSC are available.
                                                                            40
QEMU/KVM


           41
•  Linux host
•  QEMU/KVM
  –  QEMU launched directly; PCI devices assigned and hot-plugged from the monitor
  –  libvirt / Virt-manager
•  Open vSwitch 1.6.1
•  PCI pass-through & SR-IOV devices
  –  Intel Gigabit ET dual-port server adapter [SR-IOV capable]
  –  Intel Ethernet Converged Network Adapter X520-LR1 [SR-IOV capable]
  –  Mellanox ConnectX-2 QDR InfiniBand HCA
  –  Broadcom on-board GbE NIC (BCM5709)
  –  Brocade BR1741M-k 10 Gigabit Converged HCA
                                                                42
QEMU/KVM VM startup script

#!/bin/sh
sudo /usr/bin/kvm \
    -cpu host \
    -smp 2 \
    -m 2000 \
    -net nic,model=virtio,macaddr=00:16:3e:1d:ff:01 \
    -net tap,ifname=tap0,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown \
    -monitor telnet::5963,server,nowait \
    -serial telnet::5964,server,nowait \
    -daemonize \
    -nographic \
    -drive file=/work/kvm/vm01.img,if=virtio \
    $@

VM configuration: CPU 2 cores (CPU model: host), Memory 2 GB, Network virtio_net, Storage virtio_blk
                                                                     43
QEMU/KVM
$ cat /etc/ovs-ifup	
#!/bin/sh	
switch='br0'	
/sbin/ip link set mtu 9000 dev $1 up	
/opt/bin/ovs-vsctl add-port ${switch} $1	
	
$ cat /etc/ovs-ifdown	
#!/bin/sh	
switch='br0'	
/sbin/ip link set $1 down	
/opt/bin/ovs-vsctl del-port ${switch} $1	



    When QEMU/KVM creates the tap device, these scripts attach it to / remove it from the Open vSwitch bridge.
    Note that ovs-vsctl is used instead of brctl.
                                                        44
PCI pass-through: setup steps
1.  Enable Intel VT and VT-d in the BIOS
2.  Enable VT-d in Linux
   –  Add intel_iommu=on to the kernel boot parameters (a quick check is sketched below)
3.  Unbind the PCI device from its host driver
4.  Assign the device to the guest OS
5.  Use it from the guest OS with the device's native driver

                        Reference: "How to assign devices with VT-d in KVM,"
http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM
                                                                    45
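A quick sanity check that VT-d is active after step 2, assuming a distribution that uses GRUB (exact log lines vary by kernel):

$ dmesg | grep -i -e DMAR -e IOMMU      # DMAR table / IOMMU initialization messages should appear
# add intel_iommu=on to GRUB_CMDLINE_LINUX in /etc/default/grub, then run update-grub and reboot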
Assigning a PCI device to a guest
•  Identify the target device's BDF and its vendor/device ID
•  Detach it from the host OS driver and bind it to pci_stub
     # echo "8086 10fb" > /sys/bus/pci/drivers/pci-stub/new_id	
     # echo "0000:06:00.0" > /sys/bus/pci/devices/0000:06:00.0/driver/unbind
     # echo "0000:06:00.0" > /sys/bus/pci/drivers/pci-stub/bind	




•  Assign at VM startup (QEMU command-line option)
     –  -device pci-assign,host=06:00.0	
•  Hot-plug / unplug at runtime (QEMU monitor commands)
     –  device_add pci-assign,host=06:00.0,id=vf0	
     –  device_del vf0


                                                                               46
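Whether the rebinding took effect can be confirmed with lspci -k (BDF from the example above; output abbreviated and illustrative):

$ lspci -k -s 06:00.0
        Kernel driver in use: pci-stub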
Creating SR-IOV VFs
     # modprobe -r ixgbe
     # modprobe ixgbe max_vfs=8
•  The ixgbe driver's max_vfs parameter sets the number of VFs to create
•  The host OS then sees each VF as an additional PCI function:
$ lspci -tv
... snip ...	
 -[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port	
             +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet	
                                        Physical Function (PF)
             |            -00.1  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet	
             +-03.0-[05]--	
             +-07.0-[06]----00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connect	
             |            +-10.0  Intel Corporation 82599 Ethernet Controller Virtual Function	
             |            +-10.2  Intel Corporation 82599 Ethernet Controller Virtual Function	
             |            +-10.4  Intel Corporation 82599 Ethernet Controller Virtual Function	
                                          Virtual Function (VF)
             |            +-10.6  Intel Corporation 82599 Ethernet Controller Virtual Function	
             |            +-11.0  Intel Corporation 82599 Ethernet Controller Virtual Function	
             |            +-11.2  Intel Corporation 82599 Ethernet Controller Virtual Function	
             |            +-11.4  Intel Corporation 82599 Ethernet Controller Virtual Function	
             |            -11.6  Intel Corporation 82599 Ethernet Controller Virtual Function	
             +-09.0-[03]--	
... snip ...

                                                                                              47
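On newer kernels the VF count can also be set per device through sysfs instead of a driver module parameter; a sketch using the 82599 PF from the listing above:

# echo 8 > /sys/bus/pci/devices/0000:06:00.0/sriov_numvfs
# cat /sys/bus/pci/devices/0000:06:00.0/sriov_totalvfs     # upper limit advertised by the device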
Assigning an SR-IOV VF to a guest
•  A VF is assigned in exactly the same way as any other PCI device
•  Detach it from the host OS driver and bind it to pci_stub
     # echo "8086 10fb" > /sys/bus/pci/drivers/pci-stub/new_id	
     # echo "0000:06:10.0" > /sys/bus/pci/devices/0000:06:10.0/driver/unbind
     # echo "0000:06:10.0" > /sys/bus/pci/drivers/pci-stub/bind	



•  Assign at VM startup (QEMU command-line option)
     –  -device pci-assign,host=06:10.0	
•  Hot-plug / unplug at runtime (QEMU monitor commands)
     –  device_add pci-assign,host=06:10.0,id=vf0	
     –  device_del vf0
                                                                               48
SR-IOV VF as seen from the guest OS
•  The guest OS sees the VF as an ordinary PCI device and drives it with the VF driver (MSI-X interrupts included):

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

$ cat /proc/interrupts
            CPU0       CPU1
...snip...
 29:     114941     114133   PCI-MSI-edge      eth1-rx-0
 30:      77616      78385   PCI-MSI-edge      eth1-tx-0
 31:          5          5   PCI-MSI-edge      eth1:mbx
                                                                             49
SR-IOV: limiting VF bandwidth
•  The NIC can cap each VF's transmit rate
•  Configured from the host OS with the ip command:
 # ip link set dev eth5 vf 0 rate 200	
 # ip link set dev eth5 vf 1 rate 400	
 # ip link show dev eth5	
 42: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq
 state UP mode DEFAULT qlen 1000	
     link/ether 00:1b:21:81:55:3e brd ff:ff:ff:ff:ff:ff	
     vf 0 MAC 00:16:3e:1d:ee:01, tx rate 200 (Mbps), spoof
 checking on	
     vf 1 MAC 00:16:3e:1d:ee:02, tx rate 400 (Mbps), spoof
 checking on	
 	
                               Reference: IPSJ SIG Technical Report, Vol. 2010-OS-117, No. 13
                                                                 50
SR-IOV configuration TIPS
•  Set a VF's MAC address
   # ip link set dev eth5 vf 0 mac 00:16:3e:1d:ee:01

•  Set a VF's VLAN ID
   # ip link set dev eth5 vf 0 vlan 101

•  Intel 82576 (GbE) and 82599 / X540 (10GbE) controllers support these NIC-side functions
  –  See the datasheets for details
  –  http://www.intel.com/content/www/us/en/ethernet-controllers/ethernet-controllers.html
                                                        51
Live migration of a VM that uses SR-IOV
•  A VM with a directly assigned device cannot be live-migrated as-is (NG)
•  Workaround: PCI hot-plug + bonding
  –  Hot-remove the assigned PCI device (VF) from the VM before migration
  –  Keep network connectivity over a second NIC during migration
       –  inside the guest, bond a virtio NIC and the VF NIC in active-standby mode (sketched below)
  –  Hot-add a VF again on the destination host
•  The guest effectively sees one logical NIC: the SR-IOV VF as the fast path, the virtio PV NIC as the fallback
                                                               52
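A minimal in-guest bonding sketch, assuming eth0 is the virtio NIC and eth1 the VF (interface names are examples):

# modprobe bonding mode=active-backup miimon=100 primary=eth1
# ip link set bond0 up
# ifenslave bond0 eth0 eth1      # eth1 (the VF) is the active slave, eth0 the fallback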
SR-IOV + live migration: initial state
[Figure: In the guest OS, eth0 (virtio) and eth1 (igbvf, the VF) form bond0. On the source host, tap0 attaches to bridge br0 on eth0 (igb) of the SR-IOV NIC; the destination host is configured the same way.]
                                              53
SR-IOV + live migration: detach the VF
 (qemu) device_del vf0
[Figure: The VF (eth1/igbvf) is hot-removed from the guest; traffic fails over to the virtio NIC through tap0 and br0.]
                                                            54
SR-IOV + live migration: migrate
 (qemu) migrate -d tcp:x.x.x.x:y          (on the source)
 $ qemu -incoming tcp:0:y ...             (started on the destination)
[Figure: The guest, now running only on eth0 (virtio) under bond0, is live-migrated to the destination host.]
                                                            55
SR-IOV + live migration: re-attach a VF
 (qemu) device_add pci-assign,host=05:10.0,id=vf0
[Figure: On the destination host, a VF is hot-added to the guest; eth1 (igbvf) rejoins bond0 and traffic returns to the SR-IOV path.]
                                                           56
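Putting slides 53-56 together, the command sequence is roughly as follows (address x.x.x.x, port y, and the VF BDF are the placeholders used in the figures):

(qemu) device_del vf0                                 # source: hot-remove the VF
(qemu) migrate -d tcp:x.x.x.x:y                       # source: start live migration
$ qemu ... -incoming tcp:0:y                          # destination: started beforehand, waiting
(qemu) device_add pci-assign,host=05:10.0,id=vf0      # destination: hot-add a VF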
Example: migrating an MPI process
[Figure: The guest OS running MPI rank 1 (bond0 over eth0/virtio and eth1/igbvf) is migrated from the host at 192.168.0.1 to the host at 192.168.0.2, while rank 0 runs on a third machine at 192.168.0.3; all hosts share the 192.168.0.0/24 network.]
                                                                         57
SymVirt
•  Enables migration and checkpointing of VMs that use VMM-bypass I/O
     –  e.g. InfiniBand
•  Cooperation between the guest OS and the VMM: SymVirt (Symbiotic Virtualization)
     –  PCI devices are detached and re-attached around the operation
     –  The VMs of a parallel job are coordinated
•  Applications
     –  SymCR: checkpoint/restart
     –  SymPFT: proactive fault tolerance
[Figure: A cloud scheduler allocates VMs; on a failure prediction, the VMs are migrated away from the failing node and re-allocated; VM images reside in global storage.]
                                                                                                 58
SymVirt mechanism
•  SymVirt coordinator
   –  Runs in the guest OS, integrated with the MPI runtime
        •  Brings the parallel application to a globally consistent state, then hands control to the VMM (all VMs pause at a safe point)
•  SymVirt controller/agent
   –  Runs on the host side and performs the detach / migration / re-attach
[Figure: The application reaches SymVirt wait/signal; the coordinator confirms the state (guest OS mode), the controller/agent detaches the device, migrates, and re-attaches (VMM mode), and the coordinator confirms link-up before the application resumes.]
                 R. Takano, et al., "Cooperative VM Migration for a Virtualized HPC Cluster
                 with VMM-Bypass I/O devices", 8th IEEE e-Science 2012.
                                                                                              59
HPC


      60
•  AIST Super Cluster 2004    TOP500 #19


•  AIST Green Cloud 2010       AIST Super Cloud 2011

                       1/10
                 1~2
   –  HPCI EC2
      !
•  IT



                                                       61
• 

• 
           ←
• 
                          DB     HPC




               TOP3 IDC   2011
     1. 
     2. 
     3. 
                                       62
e.g., ASC


            63
AIST Green Cloud (AGC) evaluation environment: 16 nodes, 1 VM per host
Compute node: Dell PowerEdge M610
  CPU:        Intel quad-core Xeon E5540 / 2.53 GHz x2
  Chipset:    Intel 5520
  Memory:     48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)
Blade switch
  InfiniBand: Mellanox M3601Q (QDR, 16 ports)
Host machine environment
  OS:           Debian 6.0.1
  Linux kernel: 2.6.32-5-amd64
  KVM:          0.12.50
  Compiler:     gcc/gfortran 4.4.5
  MPI:          Open MPI 1.4.2
VM environment
  VCPU:   8
  Memory: 45 GB
                                                                                       64
MPI point-to-point performance
[Plot: Bandwidth (MB/sec, log scale) vs. message size (1 B to 1 GB), higher is better, for Bare Metal and KVM with PCI pass-through. qperf bandwidth: 2.4 GB/s (KVM) vs. 3.2 GB/s (Bare Metal).]
                                                                                                        65
NPB BT-MZ
[Plot: Performance (Gop/s total) and parallel efficiency (PE, %) vs. number of nodes (1, 2, 4, 8, 16), higher is better, for Bare Metal, KVM, and Amazon EC2 Cluster Compute Instances (CCI). Degradation of PE: KVM 2%, EC2 CCI 14%.]
                                                                                                                                 66
Bloss (hybrid MPI + OpenMP application)
[Plot: Parallel efficiency (%) vs. number of nodes (1-16), higher is better, for Bare Metal, KVM, Amazon EC2, and the ideal case. Degradation of PE: KVM 8%, EC2 CCI 22%. Communication pattern per iteration: rank 0 broadcasts 760 MB, reduces 1 GB, broadcasts 1 GB and gathers 350 MB; the linear solver needs about 10 GB of memory; coarse-grained MPI communication connects the solver and the eigenvector calculation.]
                                                                           67
Storage I/O evaluation on VMWare ESXi
•  Server: Dell PowerEdge T410
     –  CPU: Intel hexa-core Xeon X5650, single socket
     –  Memory: 6 GB DDR3-1333
     –  HBA: QLogic QLE2460 (single-port 4 Gbps Fibre Channel)
•  Storage: IBM DS3400 FC SAN
•  VMM: VMWare ESXi 5.0
•  Guest OS: Windows Server 2008 R2
•  VM configuration
     –  8 vCPUs
     –  3840 MB memory
•  Benchmark
     –  IOMeter 2006.07.27 (http://www.iometer.org/)
[Figure: The T410 connects to the DS3400 over Fibre Channel; Ethernet is used out-of-band for management.]
[Figure: Three configurations compared. Bare Metal Machine (BMM): Windows runs directly on the hardware (NTFS, volume manager, disk class driver, Storport/FC HBA driver, LUN). Raw Device Mapping (RDM): the VM's Storport/SCSI driver reaches the LUN through the VMkernel's FC HBA driver. VMDirectPath I/O (FPT): the FC HBA is passed through to the VM, whose Storport/FC HBA driver accesses the LUN directly.]
ESXi
•  FC SAN                PCI
        RDM
  –  VMM     SCSI                  FC

  –  RDM            PCI                         HBA

       •    OS   Linux         Windows   Linux ESXi


•  BMM
  – 
•  PCI
   HPC
     –         "InfiniBand PCI                    HPC
          ", SACSIS2011, pp.109-116, 2011 5   .
     –         “HPC                                     ”,
                   ACS37 , 2012 5 .

•                           PCI
•              VM          SR-IOV
     – 
• 

     –  VM
          •                             VM
                                                             72
•  HPC
  – 




         73
Yabusame
•  Postcopy live migration for QEMU/KVM
  –  The VM resumes on the destination immediately; memory pages are fetched on demand afterwards
  –  http://grivon.apgrid.org/quick-kvm-migration
                                                    74
•  Surveyed network I/O virtualization techniques
•  Three classes of I/O paths
   –  Full I/O emulation
   –  Paravirtualization: virtio, vhost
   –  Direct assignment: PCI pass-through, SR-IOV
•  VMM-bypass I/O delivers near-native performance, but complicates VM migration
   –  Approaches such as SymVirt and BitVisor address this!
                                              75

I/O仮想化最前線〜ネットワークI/Oを中心に〜

  • 1.
    I/O 2012 8 24 @
  • 2.
    VM VM OS –  –  2
  • 3.
    I/O •  I/O –  •  DB HPC –  •  •  –  I/O PCI SR-IOV … •  –  –  I/O 3
  • 4.
    PCI pass- VM virtio, vhost SR-IOV through VMM Open vSwitch VT-d VM: Virtual Machine VMM: Virtual Machine Monitor SR-IOV: Single Root-I/O Virtualization 4
  • 5.
    •  I/O –  virtio vhost –  PCI –  SR-IOV •  QEMU/KVM •  –  5
  • 6.
  • 7.
    •  CPU I/O OS –  OS •  OS 7
  • 8.
    •  VM –  VM I/F OS •  VM 1960 –  1972 IBM VM/370 –  1973 ACM workshop on virtual computer systems OS OS OS VM VM VM VMM 8
  • 9.
    Intel •  –  VMWare 1999 •  Popek Goldberg –  Xen 2003 VMM •  OS •  –  Intel VT AMD-V (2006) –  ! –  •  KVM (2006) BitVisor (2009) BHyVe (2011) 9
  • 10.
    Intel VT (VirtualizationTechnology) •  CPU –  IA32 Intel 64 VT-x –  Itanium VT-i •  I/O –  VT-d (Virtualization Technology for Directed I/O) –  VT-c (Virtualization Technology for Connectivity) •  VMDq IOAT SR-IOV •  AMD VMDq: Virtual Machine Device Queues IOAT: IO Acceleration Technology 10
  • 11.
    KVM: Kernel-based VirtualMachine •  –  Xen ring aliasing •  CPU QEMU –  BIOS VMX root mode OS VMX non-root mode OS proc. QEMU Ring 3 device memory VM Entry emulation management VMCS VM Exit KVM Ring 0 Guest OS Kernel Linux Kernel 11
  • 12.
    CPU Xen KVM VM VM (Xen DomU) VM (QEMU process) (Dom0) Guest OS Guest OS Process Process VCPU VCPU threads Xen Hypervisor Linux KVM Domain Process scheduler scheduler Physical Physical CPU CPU 12
  • 13.
    OS Guest OS VA PA GVA GPA GVA VMM GPA HPA MMU# MMU# (CR3) (CR3) page page H/W 13
  • 14.
    PVM HVM EPT# HVM Guest Guest Guest OS OS OS GVA HPA GVA GPA GVA GPA OS OS SPT VMM VMM VMM GVA HPA GPA HPA MMU# MMU# MMU# (CR3) (CR3) (CR3) page page page H/W 14
  • 15.
    Intel Extended PageTable GVA TLB OS page walk CR3 GVA GPA TLB GVA HPA VMM EPTP GPA HPA 3 Intel x64 4 HPA TLB: Translation Look-aside Buffer 15
  • 16.
    I/O 16
  • 17.
    I/O •  IO (PIO) •  •  DMA (Direct Memory Access) I/O DMA CPU CPU 4.EOI 1.DMA IN/OUT 3. 2.DMA I/O EOI: End Of Interrupt 17
  • 18.
    PCI •  –  INTx •  4 –  MSI/MSI-x (Message Signaled Interrupt) •  DMA write •  IDT (Interrupt Description Table) OS •  VMM MSI PCI INT A INTx CPU IOAPIC (Local APIC) EOI 18
  • 19.
    PCI •  PCI BDF . –  PCI •  1 •  NIC 1 •  SR-IOV VF $ lspci –tv ... snip ... 2 GbE  -[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port              +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet              |            -00.1  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet              +-03.0-[05]--              +-07.0-[06]----00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connect              +-09.0-[03]-- ... snip ... . 19
  • 20.
    VM I/O •  I/O VM (virtio, vhost) PCI pass- through SR-IOV –  VMM VMM Open vSwitch •  QEMU ne2000 rtl8139 e1000 VT-d •  –  Xen split driver model –  virtio vhost –  VMWare VMXNET3 •  Direct assignment VMM bypass I/O –  PCI –  SR-IOV 20
  • 21.
    VM I/O I/O PCI SR-IOV VM1 VM2 VM1 VM2 VM1 VM2 Guest OS Guest OS Guest OS … … … Guest Physical Physical driver driver driver VMM VMM VMM vSwitch Physical driver NIC NIC NIC Switch (VEB) I/O emulation PCI passthrough SR-IOV VM 21
  • 22.
    Edge Virtual Bridging (IEEE 802.1Qbg) •  VM •  (a) Software VEB (b) Hardware VEB (c) VEPA, VN-Tag VM1 VM2 VM1 VM2 VM1 VM2 VNIC VNIC VNIC VNIC VNIC VNIC VMM vSwitch VMM VMM NIC NIC switch NIC switch VEB: Virtual Ethernet Bridging VEPA: Virtual Ethernet Port Aggregator 22
  • 23.
    I/O •  OS –  •  VM Exits VMX root mode VMX non-root mode QEMU Ring 3 e1000 copy Linux Kernel/ tap Guest OS Kernel KVM vSwitch buffer Ring 0 Physical driver e1000 23
  • 24.
    virtio •  VM Exits •  virtio_ring –  I/O VMX root mode VMX non-root mode QEMU Ring 3 virtio_net copy Linux Kernel/ tap Guest OS Kernel KVM vSwitch buffer Ring 0 Physical driver virtio_net 24
  • 25.
    vhost • tap QEMU •  macvlan/macvtap VMX root mode VMX non-root mode QEMU Ring 3 Linux Kernel/ vhost_net KVM Guest OS Kernel macvtap buffer Ring 0 physical driver macvlan virtio_net 25
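A rough sketch of the vhost_net + macvtap path described above (not taken verbatim from the slides; the interface names and the fd-passing convention are assumptions, with eth0 as the host NIC):

# modprobe vhost_net
# ip link add link eth0 name macvtap0 type macvtap mode bridge
# ip link set macvtap0 up
# kvm -m 2000 -drive file=/work/kvm/vm01.img,if=virtio \
      -netdev tap,id=net0,fd=3,vhost=on \
      -device virtio-net-pci,netdev=net0,mac=00:16:3e:1d:ff:02 \
      3<>/dev/tap$(cat /sys/class/net/macvtap0/ifindex)

With vhost=on the virtio data path is handled by the in-kernel vhost_net thread, so packets no longer have to pass through the QEMU process.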
  • 26.
    VM PCI pass- SR-IOV •  (virtio, vhost) through –  VMM DMA VMM Open vSwitch –  VMM VT-d VMX root mode VMX non-root mode QEMU Ring 3 Linux Kernel/ Guest OS Kernel KVM Ring 0 buffer physical driver EOI H/W VT-d DMA 26
  • 27.
    VM1 VM2 : Guest OS VMM … VM Exit VMCS VMM OS VM Entry DMA IOMMU NIC VMCS: Virtual Machine Control Structure 27
  • 28.
    Intel VT-d: I/O •  VMM OS –  I/O –  OS •  VMM •  •  VT-d –  DMA remapping (IOMMU) –  Interrupt remapping VT-d Interrupt remapping 28
  • 29.
    VT-d: DMA remapping •  –  OS –  DMA NG •  VM DMA –  IOMMU MMU+EPT DMA I/O CPU 29
  • 30.
    VT-d: DMA remapping (excerpts from the "DMA Remapping" chapter of the Intel Virtualization Technology for Directed I/O specification)
•  The source-id of a DMA request identifies the requesting device; for PCI Express devices it is the requester-id, composed of the PCI Bus/Device/Function numbers assigned by configuration software (Figure 3-6. Requester Identifier Format: Bus # / Device # / Function #)
•  The bus number indexes the root-entry table, and the device/function numbers index the per-bus context-entry table; each context entry maps the device to its domain and to that domain's address-translation structures (Figure 3-7. Device to Domain Mapping Structures)
•  DMA addresses are translated by a page walk through multi-level page tables, e.g. a 3-level table with 4KB pages or a 2-level table with 2MB super pages (Figure 3-8. Example Multi-level Page Table)
•  Translations are cached in the IOTLB
30
  • 31.
    VT-d: Interrupt remapping • MSI VM •  MSI/MSI-x •  Interrupt remapping table (IRT) MSI write request –  VT-d CPU •  DMA write request destination ID –  VT-d VMM IRT 31
  • 32.
    ELI: Exit-Less Interrupt
•  "ELI: Bare-Metal Performance for I/O Virtualization", A. Gordon, et al. (Technion - Israel Institute of Technology), ASPLOS 2012
   –  Removes the VM exits involved in delivering device interrupts to the guest: using a shadow IDT, interrupts of assigned devices are delivered directly to the guest, while non-assigned interrupts still exit to the host (paper Figure 1: exits during interrupt handling; Figure 2: ELI interrupt delivery flow)
   –  With ELI, netperf, Apache and memcached reach 97-100% of bare-metal (BMM) performance
32
  • 33.
    PCI-SIG IO Virtualization • I/O PCIe Gen2 –  SR-IOV (Single Root-I/O Virtualization) •  VM •  NIC –  MR-IOV (Multi Root-I/O Virtualization) •  •  •  NEC ExpEther •  VMM SR-IOV –  KVM Xen VMWare Hyper-V –  Linux VFIO 33
  • 34.
    SR-IOV NIC • 1 NIC NIC vNIC VM –  vNIC = VF (Virtual Function) VM1 VM2 VM3 vNIC vNIC vNIC VMM RX TX Virtual Function L2 Classified Sorter MAC/PHY 34
  • 35.
    SR-IOV NIC •  Physical Function (PF) –  VMM •  Virtual Function (VF) –  VM OS VF –  PF PF –  82576 8 256 VM Guest OS VM Device System Device Config Space Config Space VF driver VFn0 PFn0 Virtual NIC VFn0 VMM PF driver VFn1 Physical NIC VFn2 : 35
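As a supplementary check (not in the original), the SR-IOV capability of a PF can be inspected from the host; the sriov_totalvfs attribute only exists on kernels newer than the one used in this deck, and the BDF below is the 82599 example from the surrounding slides:

$ lspci -s 06:00.0 -vvv | grep -i -A2 'SR-IOV'            # raw SR-IOV capability of the PF
$ cat /sys/bus/pci/devices/0000:06:00.0/sriov_totalvfs    # maximum number of VFs (newer kernels only)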
  • 36.
    1.  + tap VM (virtio, vhost) PCI pass- through SR-IOV –  VMM Open vSwitch –  •  VT-d •  Open vSwitch 2.  MAC tap : macvlan/macvtap –  VM1 VM2 VM1 VM2 1. 2. eth0 eth0 eth0 eth0 VMM VMM tap0 tap1 tap0 tap1 macvlan0 macvlan1 eth0 eth0 36
  • 37.
    Open vSwitch •  –  Linux •  •  OvS –  OpenFlow –  •  Linux kernel 3.3 •  Pica8 Pronto http://openvswitch.org/ 37
  • 38.
    VM VLAN •  OS VLAN •  1 VM VLAN ID
[Figure: VM1, VM2 (eth0 each) - VMM - tap0 (VLAN ID 101), tap1 (VLAN ID 102) - vSwitch (br0) - eth0]
# ovs-vsctl add-br br0
# ovs-vsctl add-port br0 tap0 tag=101
# ovs-vsctl add-port br0 tap1 tag=102
# ovs-vsctl add-port br0 eth0
VLAN tap0 <-> br0_101 <-> eth0.101
38
  • 39.
    QoS •  Linux Qdisc •  ingress policing egress shaping
[Figure: VM1, VM2 (eth0 each) - VMM - tap0, tap1 - ingress policing - vSwitch (br0) - egress shaping - eth0]
ingress policing
# ovs-vsctl set Interface tap0 ingress_policing_rate=10000
# ovs-vsctl set Interface tap0 ingress_policing_burst=1000
: 10Mbps  : 10MB
39
  • 40.
    QoS •  Linux Qdisc •  ingress policing egress shaping
[Figure: VM1, VM2 (eth0 each) - VMM - tap0, tap1 - ingress policing - vSwitch (br0) - egress shaping - eth0]
egress shaping
# ovs-vsctl -- set port eth0 qos=@newqos \
    -- --id=@newqos create qos type=linux-htb other-config:max-rate=40000000 queues=0=@q0,1=@q1 \
    -- --id=@q0 create queue other-config:min-rate=10000000 other-config:max-rate=10000000 \
    -- --id=@q1 create queue other-config:min-rate=20000000 other-config:max-rate=20000000
# ovs-ofctl add-flow br0 "in_port=3 idle_timeout=0 actions=enqueue:1:1"
HTB HFSC 40
  • 41.
  • 42.
    •  Linux •  QEMU/KVM –  QEMU PCI –  libvirt Virt-manager → •  Open vSwitch 1.6.1 •  PCI & SR-IOV
   -  Intel Gigabit ET dual port server adapter [SR-IOV]
   -  Intel Ethernet Converged Network Adapter X520-LR1 [SR-IOV]
   -  Mellanox ConnectX-2 QDR InfiniBand HCA
   -  Broadcom on-board GbE NIC (BCM5709)
   -  Brocade BR1741M-k 10 Gigabit Converged HCA
42
  • 43.
    QEMU/KVM
VM: CPU 2 (CPU model host) / Memory 2GB / Network virtio_net / Storage virtio_blk

#!/bin/sh
sudo /usr/bin/kvm \
  -cpu host -smp 2 \
  -m 2000 \
  -net nic,model=virtio,macaddr=00:16:3e:1d:ff:01 \
  -net tap,ifname=tap0,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown \
  -monitor telnet::5963,server,nowait \
  -serial telnet::5964,server,nowait \
  -daemonize -nographic \
  -drive file=/work/kvm/vm01.img,if=virtio \
  $@
43
  • 44.
    QEMU/KVM
$ cat /etc/ovs-ifup
#!/bin/sh
switch='br0'
/sbin/ip link set mtu 9000 dev $1 up
/opt/bin/ovs-vsctl add-port ${switch} $1

$ cat /etc/ovs-ifdown
#!/bin/sh
switch='br0'
/sbin/ip link set $1 down
/opt/bin/ovs-vsctl del-port ${switch} $1

QEMU/KVM tap ovs-vsctl brctl 44
  • 45.
    PCI
1.  BIOS Intel VT VT-d
2.  Linux VT-d –  intel_iommu=on
3.  PCI
4.  OS
5.  OS
"How to assign devices with VT-d in KVM," http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM
45
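A quick way to confirm steps 1 and 2 took effect (an added sketch; the exact dmesg wording differs between kernel versions):

$ grep -o intel_iommu=on /proc/cmdline    # the kernel was booted with VT-d translation enabled
$ dmesg | grep -i -e DMAR -e IOMMU        # DMAR ACPI table found / IOMMU initialized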
  • 46.
    PCI •  PCI BDF ID ID •  OS pci_stub
# echo "8086 10fb" > /sys/bus/pci/drivers/pci-stub/new_id
# echo "0000:06:00.0" > /sys/bus/pci/devices/0000:06:00.0/driver/unbind
# echo "0000:06:00.0" > /sys/bus/pci/drivers/pci-stub/bind
•  QEMU
   –  -device pci-assign,host=06:00.0
•  QEMU
   –  device_add pci-assign,host=06:00.0,id=vf0
   –  device_del vf0
46
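Whether the unbind/bind sequence succeeded can be checked by asking lspci which driver now owns the device; the output below is illustrative only (the vendor:device ID 8086:10fb matches the new_id written above):

$ lspci -nnk -s 06:00.0
06:00.0 Ethernet controller [0200]: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection [8086:10fb]
        Kernel driver in use: pci-stub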
  • 47.
    SR-IOV VF •  VF
# modprobe -r ixgbe
max_vfs VF
# modprobe ixgbe max_vfs=8
•  OS VF PCI
$ lspci -tv
... snip ...
-[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
           +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
           |            \-00.1  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
           +-03.0-[05]--
           +-07.0-[06]----00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connect   <= Physical Function (PF)
           |            +-10.0  Intel Corporation 82599 Ethernet Controller Virtual Function
           |            +-10.2  Intel Corporation 82599 Ethernet Controller Virtual Function
           |            +-10.4  Intel Corporation 82599 Ethernet Controller Virtual Function
           |            +-10.6  Intel Corporation 82599 Ethernet Controller Virtual Function   <= Virtual Function (VF)
           |            +-11.0  Intel Corporation 82599 Ethernet Controller Virtual Function
           |            +-11.2  Intel Corporation 82599 Ethernet Controller Virtual Function
           |            +-11.4  Intel Corporation 82599 Ethernet Controller Virtual Function
           |            \-11.6  Intel Corporation 82599 Ethernet Controller Virtual Function
           +-09.0-[03]--
... snip ...
47
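Note (an addition, not in the original slides): max_vfs is an ixgbe-specific module parameter; kernels newer than the one used here expose a generic sysfs interface that works for any SR-IOV capable device, roughly:

# echo 8 > /sys/class/net/eth4/device/sriov_numvfs    # create 8 VFs (eth4 = the PF's netdev name, an example)
# echo 0 > /sys/class/net/eth4/device/sriov_numvfs    # destroy the VFs again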
  • 48.
    SR-IOV •  PCI •  OS pci_stub
# echo "8086 10fb" > /sys/bus/pci/drivers/pci-stub/new_id
# echo "0000:06:10.0" > /sys/bus/pci/devices/0000:06:10.0/driver/unbind
# echo "0000:06:10.0" > /sys/bus/pci/drivers/pci-stub/bind
•  QEMU
   –  -device pci-assign,host=06:10.0
•  QEMU
   –  device_add pci-assign,host=06:10.0,id=vf0
   –  device_del vf0
48
  • 49.
    SR-IOV OS •  OS VF PCI
$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

$ cat /proc/interrupts
           CPU0       CPU1
...snip...
 29:     114941     114133   PCI-MSI-edge   eth1-rx-0
 30:      77616      78385   PCI-MSI-edge   eth1-tx-0
 31:          5          5   PCI-MSI-edge   eth1:mbx
49
  • 50.
    SR-IOV •  VF NIC •  VF OS
# ip link set dev eth5 vf 0 rate 200
# ip link set dev eth5 vf 1 rate 400
# ip link show dev eth5
42: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether 00:1b:21:81:55:3e brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:16:3e:1d:ee:01, tx rate 200 (Mbps), spoof checking on
    vf 1 MAC 00:16:3e:1d:ee:02, tx rate 400 (Mbps), spoof checking on
OS 2010-OS-117 13 OS 50
  • 51.
    SR-IOV TIPS
•  VF MAC
# ip link set dev eth5 vf 0 mac 00:16:3e:1d:ee:01
•  VF VLAN ID
# ip link set dev eth5 vf 0 vlan 101
•  Intel 82576 GbE 82599 X540 10GbE NIC
   – 
   –  http://www.intel.com/content/www/us/en/ethernet-controllers/ethernet-controllers.html
51
  • 52.
    VM •  VM NG •  PCI Bonding –  PCI NIC –  NIC virtio NIC active-standby bonding –  S •  SR-IOV NIC VF virtio PV 1 NIC 52
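A minimal sketch of the active-backup bond inside the guest that the following slides assume; the interface names follow the figures (eth0 = virtio, eth1 = VF), the address matches the later MPI example, and the exact configuration method depends on the guest distribution:

# modprobe bonding mode=active-backup miimon=100 primary=eth1
# ip link set bond0 up
# ifenslave bond0 eth1 eth0               # eth1 (VF) is the primary slave, eth0 (virtio) the backup
# ip addr add 192.168.0.2/24 dev bond0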
  • 53.
    SR-IOV: Guest OS bond0 eth0 eth1 (virtio) (igbvf) tap0 Host OS Host OS tap0 br0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC 53
  • 54.
    SR-IOV: Guest OS (qemu) device_del vf0 bond0 eth0 eth1 (virtio) (igbvf) tap0 Host OS Host OS tap0 br0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC 54
  • 55.
    SR-IOV: (qemu) migrate -d tcp:x.x.x.x:y Guest OS Guest OS bond0 eth0 (virtio) $ qemu -incoming tcp:0:y ... tap0 Host OS Host OS tap0 br0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC 55
  • 56.
    SR-IOV (qemu) device_add pci-assign,host=05:10.0,id=vf0 Guest OS bond0 eth0 eth1 (virtio) (igbvf) tap0 Host OS Host OS tap0 br0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC 56
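Putting slides 53-56 together, the whole sequence across the source and destination QEMU monitors looks roughly like this (a sketch; x.x.x.x:y and the VF address 05:10.0 are the placeholders used in the slides):

(qemu) device_del vf0                               # source: detach the VF; guest traffic fails over to virtio via bond0
(qemu) migrate -d tcp:x.x.x.x:y                     # source: live migration (destination was started with: qemu -incoming tcp:0:y ...)
(qemu) device_add pci-assign,host=05:10.0,id=vf0    # destination: after migration, attach a local VF again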
  • 57.
    MPI Guest OS rank 1 → bond0 eth0 eth1 (virtio) (igbvf) tap0 192.168.0.1 tap0 192.168.0.2 192.168.0.3 br0 rank 0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC NIC 192.168.0.0/24 57
  • 58.
    SymVirt •  VM –  Infiniband •  OS VMM SymVirt (Symbiotic Virtualization) –  PCI Cloud scheduler Cloud scheduler –  VM allocation re-allocation •  Failure!! Failure prediction –  SymCR: VM migration –  SymPFT: global storage global storage (VM images) (VM images) 58
  • 59.
    SymVirt •  SymVirt coordinator –  OS MPI •  global consistency !VM •  SymVirt controller/agent –  Application confirm confirm linkup SymVirt coordinator SymVirt SymVirt wait signal Guest OS mode VMM mode detach migration re-attach SymVirt controller/agent R. Takano, et al., “Cooperative VM Migration for a Virtualized HPC Cluster with VMM-Bypass I/O devices”, 8th IEEE e-Science 2012 ( ) 59
  • 60.
    HPC 60
  • 61.
    •  AIST SuperCluster 2004 TOP500 #19 •  AIST Green Cloud 2010 AIST Super Cloud 2011 1/10 1~2 –  HPCI EC2 ! •  IT 61
  • 62.
    •  •  ← •  DB HPC TOP3 IDC 2011 1.  2.  3.  62
  • 63.
  • 64.
    AIST Green Cloud AGC 1 16 HPC
Compute node: Dell PowerEdge M610
  CPU: Intel quad-core Xeon E5540/2.53GHz x2
  Chipset: Intel 5520
  Memory: 48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)
Blade switch:
  InfiniBand: Mellanox M3601Q (QDR 16 ports)
Host machine environment:
  OS: Debian 6.0.1
  Linux kernel: 2.6.32-5-amd64
  KVM: 0.12.50
  Compiler: gcc/gfortran 4.4.5
  MPI: Open MPI 1.4.2
VM environment:
  VCPU: 8
  Memory: 45 GB
1 1 VM
64
  • 65.
    MPI Point-to-Point (higher is better)
[Graph: Bandwidth [MB/sec] (log scale, 1-10000) vs. Message size [byte] (1 B to 1 GB), Bare Metal vs. KVM with PCI passthrough; peak bandwidths of about 2.4 GB/s and 3.2 GB/s measured with qperf]
Bare Metal: 65
  • 66.
    NPB BT-MZ: (higher is better)
[Graph: Performance [Gop/s total] (0-300) and Parallel efficiency [%] (0-100) vs. Number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, and Amazon EC2 Cluster Compute Instances (CCI)]
Degradation of PE: KVM: 2%, EC2 CCI: 14%
66
  • 67.
    Bloss:
[Figure: Rank 0 / Rank N, hybrid MPI + OpenMP; coarse-grained MPI communication: Bcast 760 MB, Reduce 1 GB, Bcast 1 GB, Gather 350 MB; Linear Solver (requires 10 GB mem.), Eigenvector calc.]
[Graph: Parallel Efficiency [%] (0-120, higher is better) vs. Number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, Amazon EC2, Ideal]
Degradation of PE: KVM: 8%, EC2 CCI: 22%
67
  • 68.
    VMWare ESXi •  Dell PowerEdge T410 –  CPU Intel Hexa-core Xeon X5650, single socket –  6GB DDR3-1333 –  HBA: QLogic QLE2460 (single-port 4Gbps Fibre Channel) •  IBM DS3400 FC SAN •  VMM: VMWare ESXi 5.0 T410 Fibre DS3400 Channel •  OS Windows server 2008 R2 •  Ethernet –  8 vCPU (out-of-band ) –  3840 MB •  –  IOMeter 2006.07.27 (http://www.iometer.org/)
  • 69.
    Bare Metal Machine Raw Device Mapping VMDirectPath I/O (BMM) (RDM) (FPT) VM VM Windows Windows Windows NTFS NTFS NTFS Volume manager Volume manager Volume manager Disk class driver Disk class driver Disk class driver Storport/FC HBA driver Storport/SCSI driver Storport/FC HBA driver VMKernel VMKernel FC HBA driver LUN LUN LUN
  • 70.
    12 OS
  • 71.
    ESXi •  FC SAN PCI RDM –  VMM SCSI FC –  RDM PCI HBA •  OS Linux Windows Linux ESXi •  BMM – 
  • 72.
    •  PCI HPC –  "InfiniBand PCI HPC ", SACSIS2011, pp.109-116, 2011 5 . –  “HPC ”, ACS37 , 2012 5 . •  PCI •  VM SR-IOV –  •  –  VM •  VM 72
  • 73.
    •  HPC –  73
  • 74.
    Yabusame •  QEMU/KVM –  –  http://grivon.apgrid.org/quick-kvm-migration 74
  • 75.
    •  I/O •  I/O –  I/O –  virtio vhost –  : PCI SR-IOV •  VMM ! –  SymVirt BitVisor 75