Frontiers of I/O Virtualization: Focusing on Network I/O (I/O仮想化最前線〜ネットワークI/Oを中心に〜)

Tutorial "Frontiers of Virtualization" at the annual conference of the Japan Society for Software Science and Technology (JSSST)

Transcript

  • 1. Frontiers of I/O Virtualization (August 24, 2012)
  • 2. VM VM OS– –  2
  • 3. I/O•  I/O –  •  DB HPC –  • •  –  I/O PCI SR-IOV …•  –  –  I/O 3
  • 4. PCI pass-VM virtio, vhost SR-IOV throughVMM Open vSwitch VT-d VM: Virtual Machine VMM: Virtual Machine Monitor SR-IOV: Single Root-I/O Virtualization 4
  • 5. •  I/O –  virtio vhost –  PCI –  SR-IOV•  QEMU/KVM•  –  5
  • 6. 6
  • 7. •  CPU I/O OS –  OS •  OS 7
  • 8. •  VM –  VM I/F OS•  VM 1960 –  1972 IBM VM/370 –  1973 ACM workshop on virtual computer systems OS OS OS VM VM VM VMM 8
  • 9. Intel•  –  VMWare 1999 •  Popek Goldberg –  Xen 2003 VMM •  OS•  –  Intel VT AMD-V (2006) –  ! –  •  KVM (2006) BitVisor (2009) BHyVe (2011) 9
  • 10. Intel VT (Virtualization Technology)•  CPU –  IA32 Intel 64 VT-x –  Itanium VT-i•  I/O –  VT-d (Virtualization Technology for Directed I/O) –  VT-c (Virtualization Technology for Connectivity) •  VMDq IOAT SR-IOV•  AMD VMDq: Virtual Machine Device Queues IOAT: IO Acceleration Technology 10
  • 11. KVM: Kernel-based Virtual Machine •  –  Xen ring aliasing •  CPU QEMU –  BIOS VMX root mode OS VMX non-root mode OS proc. QEMURing 3 device memory VM Entry emulation management VMCS VM Exit KVMRing 0 Guest OS Kernel Linux Kernel 11
  • 12. CPU Xen KVMVM VM (Xen DomU) VM (QEMU process)(Dom0) Guest OS Guest OS Process Process VCPU VCPU threadsXen Hypervisor Linux KVM Domain Process scheduler scheduler Physical Physical CPU CPU 12
  • 13. OS Guest OS VA PA GVA GPA GVA VMM GPA HPA MMU# MMU# (CR3) (CR3) page pageH/W 13
  • 14. PVM HVM EPT# HVM Guest Guest Guest OS OS OS GVA HPA GVA GPA GVA GPA OS OS SPT VMM VMM VMM GVA HPA GPA HPA MMU# MMU# MMU# (CR3) (CR3) (CR3) page page pageH/W 14
  • 15. Intel Extended Page Table GVA TLB OS page walkCR3 GVA GPA TLB GVA HPA VMMEPTP GPA HPA 3 Intel x64 4 HPA TLB: Translation Look-aside Buffer 15
  • 16. I/O 16
  • 17. I/O•  IO (PIO)• •  DMA (Direct Memory Access) I/O DMA CPU CPU 4.EOI 1.DMAIN/OUT 3. 2.DMA I/O EOI: End Of Interrupt 17
  • 18. PCI•  –  INTx •  4 –  MSI/MSI-x (Message Signaled Interrupt) •  DMA write•  IDT (Interrupt Descriptor Table) OS•  VMM MSI PCI INT A INTx CPU IOAPIC (Local APIC) EOI 18
  • 19. Identifying PCI devices: a PCI device is addressed by its BDF (Bus/Device/Function) number. One card may expose multiple functions: a dual-port NIC has one function per port, and with SR-IOV each VF is a separate function.
    $ lspci -tv
    ... snip ...
    -[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
               +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet   <- dual-port GbE
               |            \-00.1  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
               +-03.0-[05]--
               +-07.0-[06]----00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connect
               +-09.0-[03]--
    ... snip ... 19
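The BDF notation in the tree above can be unpacked mechanically; the following small POSIX-shell helper (bdf_decode is a hypothetical name, not a tool from the slides) splits an lspci-style domain:bus:device.function address into its decimal parts:

```shell
# Hypothetical helper: decode an lspci-style PCI address
# (DDDD:BB:DD.F, all fields hexadecimal) into bus/device/function.
bdf_decode() {
    addr=${1#*:}               # strip the PCI domain prefix ("0000:")
    bus=$(( 0x${addr%%:*} ))   # bus number
    devfn=${addr#*:}
    dev=$(( 0x${devfn%%.*} ))  # device number (up to 32 per bus)
    func=$(( 0x${devfn#*.} ))  # function number (up to 8 per device)
    echo "bus=$bus dev=$dev func=$func"
}

bdf_decode 0000:06:00.0   # the 82599EB port in the tree above -> bus=6 dev=0 func=0
```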
  • 20. VM I/O•  I/O VM (virtio, vhost) PCI pass- through SR-IOV –  VMM VMM Open vSwitch •  QEMU ne2000 rtl8139 e1000 VT-d•  –  Xen split driver model –  virtio vhost –  VMWare VMXNET3•  Direct assignment VMM bypass I/O –  PCI –  SR-IOV 20
  • 21. VM I/OI/O PCI SR-IOVVM1 VM2 VM1 VM2 VM1 VM2 Guest OS Guest OS Guest OS … … … Guest Physical Physical driver driver driverVMM VMM VMM vSwitch Physical driverNIC NIC NIC Switch (VEB) I/O emulation PCI passthrough SR-IOV VM 21
  • 22. Edge Virtual Bridging (IEEE 802.1Qbg)•  VM•  (a) Software VEB (b) Hardware VEB (c) VEPA, VN-TagVM1 VM2 VM1 VM2 VM1 VM2 VNIC VNIC VNIC VNIC VNIC VNIC VMM vSwitch VMM VMM NIC NIC switch NIC switch VEB: Virtual Ethernet Bridging VEPA: Virtual Ethernet Port Aggregator 22
  • 23. I/O •  OS –  •  VM Exits VMX root mode VMX non-root mode QEMURing 3 e1000 copy Linux Kernel/ tap Guest OS Kernel KVM vSwitch bufferRing 0 Physical driver e1000 23
  • 24. virtio •  VM Exits •  virtio_ring –  I/O VMX root mode VMX non-root mode QEMURing 3 virtio_net copy Linux Kernel/ tap Guest OS Kernel KVM vSwitch bufferRing 0 Physical driver virtio_net 24
  • 25. vhost •  tap QEMU •  macvlan/macvtap VMX root mode VMX non-root mode QEMURing 3 Linux Kernel/ vhost_net KVM Guest OS Kernel macvtap bufferRing 0 physical driver macvlan virtio_net 25
  • 26. VM PCI pass- SR-IOV •  (virtio, vhost) through –  VMM DMA VMM Open vSwitch –  VMM VT-d VMX root mode VMX non-root mode QEMURing 3 Linux Kernel/ Guest OS Kernel KVMRing 0 buffer physical driver EOIH/W VT-d DMA 26
  • 27. VM1 VM2 : Guest OS VMM …VM Exit VMCS VMM OS VM Entry DMA IOMMU NIC VMCS: Virtual Machine Control Structure 27
  • 28. Intel VT-d: I/O•  VMM OS –  I/O –  OS •  VMM • •  VT-d –  DMA remapping (IOMMU) –  Interrupt remapping VT-d Interrupt remapping 28
  • 29. VT-d: DMA remapping•  –  OS –  DMA NG•  VM DMA –  IOMMU MMU+EPT DMA I/O CPU 29
  • 30. (Condensed from the excerpted specification pages "DMA Remapping—Intel® Virtualization Technology for Directed I/O" shown on this slide.) The source-id of a DMA request from a PCI Express device is the requester identifier carried in the transaction-layer header: the device's PCI Bus/Device/Function number, with bits 15:8 holding the bus number, bits 7:3 the device number, and bits 2:0 the function number (Figure 3-6). Devices are mapped to domains through root-entry and context-entry tables: the 4KB root-entry table covers the bus-number space 0-255 and is indexed by the upper 8 bits of the source-id; each present root entry points to a context-entry table of 256 entries, one per device function on that bus, and each context entry points to the address-translation structures of the device's domain (Figure 3-7). Translation itself is a multi-level page walk, e.g. a 3-level table with 4KB pages or a 2-level table with 2MB super pages (Figures 3-8, 3-9). An IOTLB caches translations. 30
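The requester-id bit layout can be checked numerically; this shell sketch (rid_encode is a hypothetical name) packs a bus/device/function triple into the 16-bit identifier:

```shell
# Pack bus (bits 15:8), device (bits 7:3) and function (bits 2:0)
# into the 16-bit PCI Express requester identifier that VT-d DMA
# remapping uses as the source-id.
rid_encode() {
    printf '0x%04x\n' $(( ($1 << 8) | ($2 << 3) | $3 ))
}

rid_encode 6 16 0     # bus 0x06, device 0x10, function 0 -> 0x0680
rid_encode 255 31 7   # maximum bus/device/function -> 0xffff
```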
  • 31. VT-d: Interrupt remapping•  MSI VM•  MSI/MSI-x•  Interrupt remapping table (IRT) MSI write request –  VT-d CPU •  DMA write request destination ID –  VT-d VMM IRT 31
  • 32. ELI (Exit-Less Interrupt): "ELI: Bare-Metal Performance for I/O Virtualization", A. Gordon, et al. (Technion-Israel Institute of Technology), ASPLOS 2012. In the baseline, interrupt handling forces guest/host context switches: a physical interrupt exits to the host, which handles it and injects a virtual interrupt through the guest IDT, and the interrupt-completion (LAPIC EOI) access exits again. ELI instead delivers physical interrupts of assigned devices directly to the guest through a shadow IDT, while non-assigned interrupts still trap to the host; together with direct completion via x2APIC, this removes the exits on both the delivery and the completion path (the paper's Figure 1 shows the baseline exits, Figure 2 the ELI delivery flow). Result: netperf, Apache, and memcached reach 97-100% of bare-metal performance. 32
  • 33. PCI-SIG IO Virtualization•  I/O PCIe Gen2 –  SR-IOV (Single Root-I/O Virtualization) •  VM •  NIC –  MR-IOV (Multi Root-I/O Virtualization) •  •  •  NEC ExpEther•  VMM SR-IOV –  KVM Xen VMWare Hyper-V –  Linux VFIO 33
  • 34. SR-IOV NIC •  1 NIC NIC vNIC VM –  vNIC = VF (Virtual Function) VM1 VM2 VM3 vNIC vNIC vNIC VMM RX TXVirtualFunction L2 Classified Sorter MAC/PHY 34
  • 35. SR-IOV NIC•  Physical Function (PF) –  VMM•  Virtual Function (VF) –  VM OS VF –  PF PF –  82576 8 256 VM Guest OS VM Device System Device Config Space Config Space VF driver VFn0 PFn0 Virtual NIC VFn0 VMM PF driver VFn1 Physical NIC VFn2 : 35
  • 36. 1.  + tap VM (virtio, vhost) PCI pass- through SR-IOV –  VMM Open vSwitch –  •  VT-d •  Open vSwitch2.  MAC tap : macvlan/macvtap –  VM1 VM2 VM1 VM2 1. 2. eth0 eth0 eth0 eth0 VMM VMM tap0 tap1 tap0 tap1 macvlan0 macvlan1 eth0 eth0 36
  • 37. Open vSwitch•  –  Linux •  •  OvS –  OpenFlow –  •  Linux kernel 3.3 •  Pica8 Pronto http://openvswitch.org/ 37
  • 38. Attaching VMs to Open vSwitch with VLAN separation: the guest OS needs no VLAN configuration; each VM's tap port is tagged with one VLAN ID.
    # ovs-vsctl add-br br0
    # ovs-vsctl add-port br0 tap0 tag=101
    # ovs-vsctl add-port br0 tap1 tag=102
    # ovs-vsctl add-port br0 eth0
 VM1 is on VLAN ID 101 and VM2 on VLAN ID 102; tagged traffic flows as tap0 <-> br0_101 <-> eth0.101. 38
  • 39. QoS control with Open vSwitch (1): OvS builds on the Linux Qdisc mechanism and supports ingress policing and egress shaping. Ingress policing of tap0 to 10 Mbps (ingress_policing_rate is given in kbps, ingress_policing_burst in kb):
    # ovs-vsctl set Interface tap0 ingress_policing_rate=10000
    # ovs-vsctl set Interface tap0 ingress_policing_burst=1000
 39
  • 40. QoS control with Open vSwitch (2): egress shaping attaches HTB (or HFSC) queues to the output port and steers flows into a queue:
    # ovs-vsctl -- set port eth0 qos=@newqos \
        -- --id=@newqos create qos type=linux-htb other-config:max-rate=40000000 queues=0=@q0,1=@q1 \
        -- --id=@q0 create queue other-config:min-rate=10000000 other-config:max-rate=10000000 \
        -- --id=@q1 create queue other-config:min-rate=20000000 other-config:max-rate=20000000
    # ovs-ofctl add-flow br0 "in_port=3 idle_timeout=0 actions=enqueue:1:1"
 40
  • 41. QEMU/KVM 41
  • 42. Evaluation environment: Linux host; QEMU/KVM (QEMU with PCI passthrough; managed via libvirt / Virt-manager); Open vSwitch 1.6.1. PCI passthrough & SR-IOV devices tested:
    - Intel Gigabit ET dual port server adapter [SR-IOV capable]
    - Intel Ethernet Converged Network Adapter X520-LR1 [SR-IOV capable]
    - Mellanox ConnectX-2 QDR InfiniBand HCA
    - Broadcom on-board GbE NIC (BCM5709)
    - Brocade BR1741M-k 10 Gigabit Converged HCA
 42
  • 43. Launching a VM with QEMU/KVM (VM configuration: 2 VCPUs with CPU model "host", 2 GB memory, virtio_net network, virtio_blk storage):
    #!/bin/sh
    sudo /usr/bin/kvm \
      -cpu host -smp 2 \
      -m 2000 \
      -net nic,model=virtio,macaddr=00:16:3e:1d:ff:01 \
      -net tap,ifname=tap0,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown \
      -monitor telnet::5963,server,nowait \
      -serial telnet::5964,server,nowait \
      -daemonize -nographic \
      -drive file=/work/kvm/vm01.img,if=virtio $@
 43
  • 44. Connecting QEMU/KVM tap devices to Open vSwitch: attach the tap with ovs-vsctl (not brctl) in the ifup/ifdown scripts.
    $ cat /etc/ovs-ifup
    #!/bin/sh
    switch=br0
    /sbin/ip link set mtu 9000 dev $1 up
    /opt/bin/ovs-vsctl add-port ${switch} $1

    $ cat /etc/ovs-ifdown
    #!/bin/sh
    switch=br0
    /sbin/ip link set $1 down
    /opt/bin/ovs-vsctl del-port ${switch} $1
 44
  • 45. PCI passthrough setup:
    1. Enable Intel VT and VT-d in the BIOS.
    2. Enable VT-d in the Linux kernel: boot with intel_iommu=on.
    3. Detach the PCI device from the host driver.
    4. Assign the device to the guest OS.
    5. Use the device from the guest OS.
 See "How to assign devices with VT-d in KVM," http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM 45
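Enabling VT-d in the host kernel (the intel_iommu=on step above) might look like the following on a Debian/GRUB host; the file path and the sed edit are assumptions, adapt them to your distribution:

```shell
# Hedged sketch (Debian-style GRUB paths are an assumption):
# append intel_iommu=on to the kernel command line.
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&intel_iommu=on /' /etc/default/grub
sudo update-grub
sudo reboot
# After the reboot, DMAR/IOMMU messages confirm VT-d is active:
dmesg | grep -i -e dmar -e iommu
```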
  • 46. PCI passthrough: look up the device's BDF and vendor/device ID, then bind it to the pci_stub driver to detach it from the host OS:
    # echo "8086 10fb" > /sys/bus/pci/drivers/pci-stub/new_id
    # echo "0000:06:00.0" > /sys/bus/pci/devices/0000:06:00.0/driver/unbind
    # echo "0000:06:00.0" > /sys/bus/pci/drivers/pci-stub/bind
 Assign at QEMU startup:
    -device pci-assign,host=06:00.0
 Or hot-plug/unplug from the QEMU monitor:
    device_add pci-assign,host=06:00.0,id=vf0
    device_del vf0
 46
  • 47. Creating SR-IOV VFs: reload the PF driver with the max_vfs parameter; the VFs then appear to the host OS as PCI devices:
    # modprobe -r ixgbe
    # modprobe ixgbe max_vfs=8
    $ lspci -tv
    ... snip ...
    -[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
               +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
               |            \-00.1  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
               +-03.0-[05]--
               +-07.0-[06]----00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connect   <- Physical Function (PF)
               |            +-10.0  Intel Corporation 82599 Ethernet Controller Virtual Function
               |            +-10.2  Intel Corporation 82599 Ethernet Controller Virtual Function
               |            +-10.4  Intel Corporation 82599 Ethernet Controller Virtual Function
               |            +-10.6  Intel Corporation 82599 Ethernet Controller Virtual Function
               |            +-11.0  Intel Corporation 82599 Ethernet Controller Virtual Function
               |            +-11.2  Intel Corporation 82599 Ethernet Controller Virtual Function
               |            +-11.4  Intel Corporation 82599 Ethernet Controller Virtual Function
               |            \-11.6  Intel Corporation 82599 Ethernet Controller Virtual Function   <- Virtual Functions (VFs)
               +-09.0-[03]--
    ... snip ... 47
  • 48. Assigning an SR-IOV VF works the same way as plain PCI passthrough: bind the VF to pci_stub, then hand it to QEMU:
    # echo "8086 10fb" > /sys/bus/pci/drivers/pci-stub/new_id
    # echo "0000:06:10.0" > /sys/bus/pci/devices/0000:06:10.0/driver/unbind
    # echo "0000:06:10.0" > /sys/bus/pci/drivers/pci-stub/bind
 Assign at QEMU startup:
    -device pci-assign,host=06:10.0
 Or hot-plug/unplug from the QEMU monitor:
    device_add pci-assign,host=06:10.0,id=vf0
    device_del vf0
 48
  • 49. SR-IOV from the guest OS: the VF appears as an ordinary PCI device with its own MSI interrupts.
    $ lspci
    00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
    00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
    00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
    00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
    00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
    00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
    00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

    $ cat /proc/interrupts
             CPU0     CPU1
    ...snip...
    29:    114941   114133   PCI-MSI-edge  eth1-rx-0
    30:     77616    78385   PCI-MSI-edge  eth1-tx-0
    31:         5        5   PCI-MSI-edge  eth1:mbx
 49
  • 50. SR-IOV VF bandwidth control: set per-VF TX rate limits on the host through the PF netdev; the guest OS cannot override them.
    # ip link set dev eth5 vf 0 rate 200
    # ip link set dev eth5 vf 1 rate 400
    # ip link show dev eth5
    42: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
        link/ether 00:1b:21:81:55:3e brd ff:ff:ff:ff:ff:ff
        vf 0 MAC 00:16:3e:1d:ee:01, tx rate 200 (Mbps), spoof checking on
        vf 1 MAC 00:16:3e:1d:ee:02, tx rate 400 (Mbps), spoof checking on
 (2010-OS-117, 13) 50
  • 51. SR-IOV TIPS: a VF's MAC address and port VLAN ID can also be set from the host:
    # ip link set dev eth5 vf 0 mac 00:16:3e:1d:ee:01
    # ip link set dev eth5 vf 0 vlan 101
 Supported on Intel 82576 GbE and 82599/X540 10GbE NICs; see http://www.intel.com/content/www/us/en/ethernet-controllers/ethernet-controllers.html 51
  • 52. VM•  VM NG•  PCI Bonding –  PCI NIC –  NIC virtio NIC active-standby bonding –  S•  SR-IOV NIC VF virtio PV 1 NIC 52
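The active-standby bonding of a virtio NIC with an SR-IOV VF described above can be sketched as follows (a hedged sketch for a 2012-era Linux; the interface names, module options, and the legacy ifenslave tool are assumptions):

```shell
# Enslave the paravirtual NIC (eth0, virtio) and the SR-IOV VF
# (eth1, igbvf) into an active-backup bond. The VF is the primary
# slave for performance; when it is hot-removed before migration,
# traffic falls back to the virtio path.
modprobe bonding mode=active-backup miimon=100 primary=eth1
ip link set dev bond0 up
ifenslave bond0 eth0 eth1
ip addr add 192.168.0.2/24 dev bond0   # address style follows slide 57's example network
```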
  • 53. SR-IOV: Guest OS bond0 eth0 eth1 (virtio) (igbvf) tap0 Host OS Host OS tap0 br0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC 53
  • 54. SR-IOV: Guest OS (qemu) device_del vf0 bond0 eth0 eth1 (virtio) (igbvf) tap0 Host OS Host OS tap0 br0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC 54
  • 55. SR-IOV: (qemu) migrate -d tcp:x.x.x.x:y Guest OS Guest OS bond0 eth0 (virtio) $ qemu -incoming tcp:0:y ... tap0 Host OS Host OS tap0 br0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC 55
  • 56. SR-IOV (qemu) device_add pci-assign,host=05:10.0,id=vf0 Guest OS bond0 eth0 eth1 (virtio) (igbvf) tap0 Host OS Host OS tap0 br0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC 56
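Taken together, slides 53-56 amount to the following monitor-command sequence (x.x.x.x:y and 05:10.0 are the slides' own placeholders; this is a condensed restatement, not additional material):

```shell
# VF-aware live migration of a guest whose bond spans virtio + VF:
# 1. Source host: hot-remove the VF so the bond falls back to virtio
#      (qemu) device_del vf0
# 2. Destination host: start the receiving QEMU
#      $ qemu -incoming tcp:0:y ...
# 3. Source host: start the live migration
#      (qemu) migrate -d tcp:x.x.x.x:y
# 4. Destination host: hot-add a local VF; bonding switches back to it
#      (qemu) device_add pci-assign,host=05:10.0,id=vf0
```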
  • 57. MPIGuest OS rank 1 → bond0 eth0 eth1 (virtio) (igbvf) tap0 192.168.0.1 tap0 192.168.0.2 192.168.0.3 br0 rank 0 br0 eth0 eth0 (igb) (igb) SR-IOV NIC SR-IOV NIC NIC 192.168.0.0/24 57
  • 58. SymVirt•  VM –  InfiniBand•  OS VMM SymVirt (Symbiotic Virtualization) –  PCI Cloud scheduler Cloud scheduler –  VM allocation re-allocation•  Failure!! Failure prediction –  SymCR: VM migration –  SymPFT: global storage global storage (VM images) (VM images) 58
  • 59. SymVirt•  SymVirt coordinator –  OS MPI •  global consistency !VM•  SymVirt controller/agent –  Application confirm confirm linkup SymVirt coordinator SymVirt SymVirt wait signal Guest OS mode VMM mode detach migration re-attach SymVirt controller/agent R. Takano, et al., “Cooperative VM Migration for a Virtualized HPC Cluster with VMM-Bypass I/O devices”, 8th IEEE e-Science 2012 ( ) 59
  • 60. HPC 60
  • 61. •  AIST Super Cluster 2004 TOP500 #19•  AIST Green Cloud 2010 AIST Super Cloud 2011 1/10 1~2 –  HPCI EC2 !•  IT 61
  • 62. • •  ←•  DB HPC TOP3 IDC 2011 1.  2.  3.  62
  • 63. e.g., ASC 63
  • 64. AIST Green Cloud (AGC): a 16-node HPC cluster. Compute node: Dell PowerEdge M610; CPU: Intel quad-core Xeon E5540/2.53GHz x2; Chipset: Intel 5520; Memory: 48 GB DDR3; InfiniBand: Mellanox ConnectX (MT26428). Blade switch: Mellanox M3601Q (QDR 16 ports). Host environment: OS Debian 6.0.1, Linux kernel 2.6.32-5-amd64, KVM 0.12.50, compiler gcc/gfortran 4.4.5, MPI Open MPI 1.4.2. VM environment: 8 VCPUs, 45 GB memory, one VM per node. 64
  • 65. MPI Point-to-Point 10000 (higher is better) 2.4 GB/s qperf 3.2 GB/s 1000Bandwidth [MB/sec] 100 PCI KVM 10 Bare Metal Bare Metal KVM 1 1 10 100 1k 10k 100k 1M 10M 100M 1G Message size [byte] Bare Metal: 65
  • 66. NPB BT-MZ: (higher is better) 300 100 Performance [Gop/s total] 250 Degradation of PE: 80 Parallel efficiency [%] KVM: 2%, EC2 CCI: 14% 200 Bare Metal 60 150 KVM Amazon EC2 40 100 Bare Metal (PE) KVM (PE) 20 50 Amazon EC2 (PE) 0 0 1 2 4 8 16 EC2 Cluster compute Number of nodesinstances (CCI) 66
  • 67. Bloss: Rank 0 Rank 0 N MPI OpenMP Bcast 760 MB Linear Solver (require 10GB mem. Reduce 1 GB coarse-grained MPI comm. Parallel Efficiency 1 GB 120 Bcast Eigenvector calc. (higher is better) Gather 100 350 MBParallel Efficiency [%] 80 60 Degradation of PE: 40 KVM: 8%, EC2 CCI: 22% 20 Bare Metal KVM Amazon EC2 Ideal 0 1 2 4 8 16 Number of nodes 67
  • 68. VMWare ESXi•  Dell PowerEdge T410 –  CPU Intel Hexa-core Xeon X5650, single socket –  6GB DDR3-1333 –  HBA: QLogic QLE2460 (single-port 4Gbps Fibre Channel)•  IBM DS3400 FC SAN•  VMM: VMWare ESXi 5.0 T410 Fibre DS3400 Channel•  OS Windows server 2008 R2•  Ethernet –  8 vCPU (out-of-band ) –  3840 MB•  –  IOMeter 2006.07.27 (http://www.iometer.org/)
  • 69. Bare Metal Machine Raw Device Mapping VMDirectPath I/O (BMM) (RDM) (FPT) VM VM Windows Windows Windows NTFS NTFS NTFS Volume manager Volume manager Volume manager Disk class driver Disk class driver Disk class driver Storport/FC HBA driver Storport/SCSI driver Storport/FC HBA driver VMKernel VMKernel FC HBA driver LUN LUN LUN
  • 70. 12 OS
  • 71. ESXi•  FC SAN PCI RDM –  VMM SCSI FC –  RDM PCI HBA •  OS Linux Windows Linux ESXi•  BMM – 
  • 72. •  PCI HPC –  "InfiniBand PCI HPC ", SACSIS2011, pp.109-116, 2011 5 . –  “HPC ”, ACS37 , 2012 5 .•  PCI•  VM SR-IOV – •  –  VM •  VM 72
  • 73. •  HPC –  73
  • 74. Yabusame•  QEMU/KVM –  –  http://grivon.apgrid.org/quick-kvm-migration 74
  • 75. •  I/O•  I/O –  I/O –  virtio vhost –  : PCI SR-IOV•  VMM ! –  SymVirt BitVisor 75