I/O仮想化最前線〜ネットワークI/Oを中心に〜 (Frontiers of I/O Virtualization: Focusing on Network I/O)

Ryousei Takano, Group Leader at National Institute of Advanced Industrial Science and Technology (AIST)
Aug. 24, 2012

  1. Title slide: I/O仮想化最前線〜ネットワークI/Oを中心に〜 (Frontiers of I/O Virtualization: Focusing on Network I/O), Aug. 24, 2012.
  2. A virtual machine (VM) runs its own guest OS on a virtualized view of the hardware provided by the VMM.
  3. Why network I/O virtualization matters: I/O-intensive workloads such as databases and HPC are moving onto virtualized infrastructure, while hardware support (PCI passthrough, SR-IOV, and so on) keeps narrowing the gap to bare-metal I/O performance.
  4. The landscape covered in this deck: paravirtual I/O (virtio, vhost), PCI passthrough with VT-d, SR-IOV, and the software switch (Open vSwitch) inside the VMM. VM: Virtual Machine; VMM: Virtual Machine Monitor; SR-IOV: Single Root I/O Virtualization.
  5. Agenda: network I/O virtualization techniques (device emulation, virtio and vhost, PCI passthrough, SR-IOV), how to configure them with QEMU/KVM, and how they perform in practice.
  6. (Section: virtualization basics)
  7. Virtualization in general: a layer that multiplexes the physical CPU, memory, and I/O and presents each OS with what looks like its own machine.
  8. Virtual machines are not new: the idea dates to the 1960s, with IBM VM/370 (1972) and the ACM Workshop on Virtual Computer Systems (1973); a VMM multiplexes the hardware so that several guest OSes each run in their own VM.
  9. How x86 virtualization became practical: VMware (1999) used binary translation because the x86 ISA did not meet the Popek & Goldberg virtualization requirements; Xen (2003) used paravirtualization, modifying the guest OS to cooperate with the VMM; Intel VT and AMD-V (2006) added hardware support, enabling KVM (2006), BitVisor (2009), and BHyVe (2011).
  10. Intel VT (Virtualization Technology): CPU virtualization is VT-x on IA-32/Intel 64 and VT-i on Itanium; I/O virtualization is covered by VT-d (Virtualization Technology for Directed I/O) and VT-c (Virtualization Technology for Connectivity), the latter including VMDq, IOAT, and SR-IOV. AMD offers equivalent features. VMDq: Virtual Machine Device Queues; IOAT: I/O Acceleration Technology.
  11. KVM (Kernel-based Virtual Machine): the kvm kernel module (Ring 0, VMX root mode) runs the guest OS in VMX non-root mode and handles VM Entry/VM Exit through the VMCS, while QEMU (Ring 3) provides the BIOS, device emulation, and memory management; unlike classic Xen, no ring aliasing of the guest kernel is needed.
  12. CPU scheduling in Xen vs. KVM: Xen schedules the VCPUs of domains (Dom0 and DomU) onto physical CPUs with its own domain scheduler inside the hypervisor; in KVM a VM is just a QEMU process whose VCPUs are threads, scheduled by the standard Linux process scheduler.
  13. Memory virtualization: on bare metal the OS maps virtual addresses (VA) to physical addresses (PA) through the MMU page tables rooted at CR3; with a VMM, the guest maps GVA to GPA and the VMM maps GPA to HPA. (GVA: guest virtual address, GPA: guest physical address, HPA: host physical address.)
  14. Memory virtualization schemes compared: with PVM (paravirtualization) the guest's page tables map GVA directly to HPA under VMM supervision; with HVM plus shadow page tables (SPT) the guest keeps GVA-to-GPA tables while the VMM maintains synchronized GVA-to-HPA shadows that the MMU actually walks; with HVM plus EPT the MMU itself walks both the guest's GVA-to-GPA tables and the VMM's GPA-to-HPA EPT.
  15. Intel Extended Page Tables (EPT): the TLB caches GVA-to-HPA translations directly; on a miss, the hardware walks the guest's page tables (rooted at the guest CR3) for GVA-to-GPA and the EPT (rooted at the EPTP) for GPA-to-HPA, so with Intel x64's 4-level tables a miss is considerably more expensive than on bare metal (a worked count follows). TLB: Translation Look-aside Buffer.
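      A worked count of the nested-walk cost (standard arithmetic for 4-level guest tables plus 4-level EPT, not taken from the slide):
        guest page-table reads:   4                    (GVA -> GPA)
        EPT walks:                (4 + 1) x 4 = 20     (one EPT walk per guest page-table read, plus one for the final data GPA)
        worst case per TLB miss:  4 + 20 = 24 memory reads, vs. 4 on bare metal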
  16. (Section: I/O virtualization techniques)
  17. How device I/O works: the CPU accesses devices with programmed I/O (PIO, IN/OUT instructions) or memory-mapped registers, and bulk data moves by DMA (Direct Memory Access); a typical transfer goes (1) the driver sets up the DMA, (2) the device performs the DMA, (3) the device raises an interrupt, (4) the handler finishes with an EOI (End Of Interrupt).
  18. PCI interrupts: legacy INTx uses four shared interrupt lines routed through the IOAPIC to a CPU's Local APIC, while MSI/MSI-X (Message Signaled Interrupts) deliver interrupts as DMA write requests; the OS registers handlers in the IDT (Interrupt Descriptor Table), and handling ends with an EOI.
  19. Identifying PCI devices: every PCI function is addressed by its BDF (Bus/Device/Function) number; a dual-port GbE NIC shows up as two functions (00.0 and 00.1) of one device, and SR-IOV VFs appear as additional functions:
      $ lspci -tv
      ... snip ...
       -[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
                  +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
                  |            \-00.1  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
                  +-03.0-[05]--
                  +-07.0-[06]----00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connect
                  +-09.0-[03]--
      ... snip ...
  20. Three approaches to VM network I/O: (1) device emulation in the VMM (QEMU's ne2000, rtl8139, and e1000 models), (2) paravirtualized drivers (Xen's split driver model, virtio and vhost, VMware's VMXNET3), and (3) direct assignment / VMM-bypass I/O (PCI passthrough with VT-d, SR-IOV).
  21. The three architectures side by side (figure): with I/O emulation the guest driver talks to an emulated device and the VMM's vSwitch plus the physical driver carry the traffic; with PCI passthrough the guest's physical driver owns the whole NIC; with SR-IOV each VM drives its own VF and the NIC's embedded switch (a VEB) forwards frames between VFs and the wire.
  22. Edge Virtual Bridging (IEEE 802.1Qbg): where VM-to-VM traffic gets switched: (a) a software VEB (the vSwitch in the VMM), (b) a hardware VEB inside the NIC, or (c) VEPA / VN-Tag, which hairpins traffic through the external switch. VEB: Virtual Ethernet Bridging; VEPA: Virtual Ethernet Port Aggregator.
  23. Full device emulation (e1000): each register access by the guest's e1000 driver triggers a VM Exit, and QEMU (Ring 3) emulates the device, copies the packet, and hands it to a tap device and the host vSwitch, which forwards it to the physical driver; the many exits and copies make this the slowest option.
  24. virtio: a paravirtual NIC (virtio_net in the guest) exchanges packets with QEMU through shared virtio_ring buffers, so far fewer VM Exits are needed than with full emulation; packets still pass through QEMU, a tap device, and the host vSwitch.
  25. vhost: vhost_net moves the virtio data path out of QEMU and into the host kernel, removing the user-space hop; it can feed a tap device or attach directly to a physical NIC via macvlan/macvtap (a command-line sketch follows).
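      One way to turn on vhost_net from the QEMU command line, using the -netdev syntax (a sketch; option spellings vary between QEMU/qemu-kvm versions, and the tap/script names are the examples used later in this deck):
      kvm ... \
        -netdev tap,id=net0,ifname=tap0,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown,vhost=on \
        -device virtio-net-pci,netdev=net0,mac=00:16:3e:1d:ff:01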
  26. PCI passthrough: the guest runs the device's native driver and the device DMAs straight into guest memory via VT-d, so the VMM (and any vSwitch) is bypassed on the data path; interrupts and their EOIs are still routed through the host.
  27. Passthrough data and control paths (figure): the assigned NIC DMAs directly into the guest's memory through the IOMMU, so the VMM is off the data path; configuration accesses and interrupts still cause VM Exit/VM Entry transitions managed through the VMCS (Virtual Machine Control Structure).
  28. Intel VT-d (Virtualization Technology for Directed I/O): lets the VMM hand an I/O device to a guest OS safely, so that data-path accesses no longer need VMM intervention. It consists of two mechanisms, DMA remapping (an IOMMU) and interrupt remapping; the following slides look at each.
  29. VT-d DMA remapping: without it, a directly assigned device could DMA into arbitrary host memory, so assignment would be unsafe; the IOMMU translates the addresses of each DMA request issued on behalf of a VM and blocks anything outside that VM's memory, playing the same role for device DMA that the MMU+EPT plays for CPU accesses.
  30. VT-d DMA remapping structures (figures reproduced from the Intel VT-d specification): the requester ID of a DMA transaction is the source device's BDF (bits 15:8 bus, 7:3 device, 2:0 function); the bus number indexes the 256-entry root-entry table, whose entry points to a 256-entry context-entry table indexed by device/function; the context entry maps the device to its domain's multi-level I/O page tables (e.g., 3-level with 4KB pages, or 2-level with 2MB super pages), and translations are cached in the IOTLB.
  31. VT-d interrupt remapping: because MSI/MSI-X interrupts are just DMA write requests, a device controlled by a guest could otherwise target any CPU; VT-d intercepts each MSI write, looks it up in an Interrupt Remapping Table (IRT) programmed by the VMM, and rewrites the destination ID before delivery.
  32. ELI (Exit-Less Interrupts): "ELI: Bare-Metal Performance for I/O Virtualization", A. Gordon et al. (Technion), ASPLOS 2012. Even with device assignment, interrupt delivery and completion still cause VM Exits; ELI delivers physical interrupts directly to the guest through a shadow IDT and reports 97-100% of bare-metal performance for netperf, Apache, and memcached. (The slide reproduces the paper's Figure 1, "Exits during interrupt handling", and Figure 2, "ELI interrupt delivery flow".)
  33. PCI-SIG I/O Virtualization (PCIe Gen2-era standards): SR-IOV (Single Root I/O Virtualization) lets one device expose multiple Virtual Functions that can be assigned directly to VMs, NICs being the most common case; MR-IOV (Multi Root I/O Virtualization) shares a device among multiple hosts (related commercial technology includes NEC ExpEther). SR-IOV is supported by the major VMMs (KVM, Xen, VMware, Hyper-V); on Linux, VFIO is the emerging assignment framework.
  34. Inside an SR-IOV NIC: one physical NIC presents a virtual NIC (vNIC) to each VM, where vNIC = VF (Virtual Function); each VF has its own RX/TX queues, and an on-chip L2 sorter/classifier steers received frames to the right VF, all sharing a single MAC/PHY.
  35. PF and VF: the Physical Function (PF) is managed by the host/VMM, owns the full device configuration space, and controls the VFs; a Virtual Function (VF) is a lightweight function assigned to a guest OS, with its own resources but depending on the PF for global device settings (the Intel 82576 provides up to 8 VFs per port; with ARI the PCIe addressing allows up to 256 functions per device).
  36. Connecting a paravirtual NIC (virtio, vhost) to the physical network, two common methods:
      1. A software bridge plus tap devices: the tap backing the VM's virtio NIC is attached to a Linux bridge or to Open vSwitch.
      2. macvlan/macvtap: a tap-like interface stacked directly on the physical NIC and demultiplexed by MAC address, with no bridge in between.
      A sketch of method 2 follows; the rest of this section uses method 1 with Open vSwitch.
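      A minimal sketch of method 2 with iproute2 (interface names are examples; requires macvtap support in the kernel):
      # ip link add link eth0 name macvtap0 type macvtap mode bridge
      # ip link set macvtap0 up
      # ls -l /dev/tap$(cat /sys/class/net/macvtap0/ifindex)
      (the last command shows the character device that QEMU opens for this macvtap)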
  37. Open vSwitch: a production-quality, multilayer virtual switch for Linux that supports OpenFlow; it was merged into Linux kernel 3.3 and is also used in hardware switches such as Pica8/Pronto. http://openvswitch.org/
  38. Isolating VM traffic with VLANs on Open vSwitch: the guest OS needs no VLAN configuration; each VM's tap port is simply tagged on the switch (here VM1 on VLAN 101 and VM2 on VLAN 102, with eth0 carrying the tagged trunk):
      # ovs-vsctl add-br br0
      # ovs-vsctl add-port br0 tap0 tag=101
      # ovs-vsctl add-port br0 tap1 tag=102
      # ovs-vsctl add-port br0 eth0
      (With the legacy Linux bridge, the equivalent wiring would be per-VLAN devices such as tap0 <-> br0_101 <-> eth0.101.)
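      To sanity-check the result (output format varies with the Open vSwitch version):
      # ovs-vsctl show
      # ovs-vsctl list port tap0 | grep tag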
  39. QoS on Open vSwitch (1): ingress policing limits how fast a VM may send into the switch; it is implemented with the Linux Qdisc. Rates are given in kbps and burst sizes in kb, so the example below polices tap0 to 10 Mbps:
      # ovs-vsctl set Interface tap0 ingress_policing_rate=10000
      # ovs-vsctl set Interface tap0 ingress_policing_burst=1000
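      Setting both values back to zero removes the limit again:
      # ovs-vsctl set Interface tap0 ingress_policing_rate=0
      # ovs-vsctl set Interface tap0 ingress_policing_burst=0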
  40. QoS on Open vSwitch (2): egress shaping creates HTB queues on the physical port and directs flows to them with an OpenFlow enqueue action (rates below are in bit/s); both HTB and HFSC (type=linux-htb, linux-hfsc) are supported:
      # ovs-vsctl -- set port eth0 qos=@newqos \
          -- --id=@newqos create qos type=linux-htb other-config:max-rate=40000000 queues=0=@q0,1=@q1 \
          -- --id=@q0 create queue other-config:min-rate=10000000 other-config:max-rate=10000000 \
          -- --id=@q1 create queue other-config:min-rate=20000000 other-config:max-rate=20000000
      # ovs-ofctl add-flow br0 "in_port=3 idle_timeout=0 actions=enqueue:1:1"
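      To inspect and later clean up the QoS configuration (a sketch using standard ovs-vsctl/ovs-ofctl subcommands):
      # ovs-vsctl list qos
      # ovs-vsctl list queue
      # ovs-ofctl queue-stats br0
      # ovs-vsctl clear port eth0 qos
      # ovs-vsctl -- --all destroy qos -- --all destroy queue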
  41. (Section: configuring network I/O virtualization with QEMU/KVM)
  42. Software and hardware used for the following examples: Linux with QEMU/KVM (device assignment configured directly through QEMU rather than via libvirt/Virt-manager) and Open vSwitch 1.6.1. NICs tried for PCI passthrough and SR-IOV: Intel Gigabit ET dual-port server adapter [SR-IOV capable], Intel Ethernet Converged Network Adapter X520-LR1 [SR-IOV capable], Mellanox ConnectX-2 QDR InfiniBand HCA, Broadcom on-board GbE NIC (BCM5709), and Brocade BR1741M-k 10 Gigabit Converged HCA.
  43. Launching a VM with QEMU/KVM (2 VCPUs with the host CPU model, 2 GB memory, virtio_net networking attached to Open vSwitch through a tap device, virtio_blk storage):
      #!/bin/sh
      sudo /usr/bin/kvm \
        -cpu host -smp 2 \
        -m 2000 \
        -net nic,model=virtio,macaddr=00:16:3e:1d:ff:01 \
        -net tap,ifname=tap0,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown \
        -monitor telnet::5963,server,nowait \
        -serial telnet::5964,server,nowait \
        -daemonize -nographic \
        -drive file=/work/kvm/vm01.img,if=virtio $@
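      The monitor and serial console exported above can be reached with telnet:
      $ telnet localhost 5963    # QEMU monitor, (qemu) prompt
      $ telnet localhost 5964    # guest serial console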
  44. Hooking the tap device into Open vSwitch: QEMU runs these scripts when it creates and tears down the tap interface; note that ovs-vsctl is used where brctl would be used with the Linux bridge.
      $ cat /etc/ovs-ifup
      #!/bin/sh
      switch='br0'
      /sbin/ip link set mtu 9000 dev $1 up
      /opt/bin/ovs-vsctl add-port ${switch} $1

      $ cat /etc/ovs-ifdown
      #!/bin/sh
      switch='br0'
      /sbin/ip link set $1 down
      /opt/bin/ovs-vsctl del-port ${switch} $1
  45. PCI passthrough, step by step:
      1. Enable Intel VT and VT-d in the BIOS.
      2. Enable VT-d in Linux by booting with intel_iommu=on (a sketch follows).
      3. Detach the PCI device from its host driver.
      4. Assign the device to the VM.
      5. Use the device with its native driver in the guest OS.
      See "How to assign devices with VT-d in KVM," http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM
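      A minimal sketch for step 2 on a GRUB2/Debian-style system (the file path and update command are distro-dependent assumptions):
      # add intel_iommu=on to the kernel command line in /etc/default/grub, e.g.
      #   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
      # then regenerate the boot configuration, reboot, and confirm VT-d (DMAR) is active:
      $ sudo update-grub && sudo reboot
      $ dmesg | grep -i -e dmar -e iommu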
  46. Detaching the device from the host and assigning it to the VM: bind the device (identified by its vendor:device ID and BDF) to the pci-stub driver so that no host driver claims it:
      # echo "8086 10fb" > /sys/bus/pci/drivers/pci-stub/new_id
      # echo "0000:06:00.0" > /sys/bus/pci/devices/0000:06:00.0/driver/unbind
      # echo "0000:06:00.0" > /sys/bus/pci/drivers/pci-stub/bind
      At VM startup, pass it with the QEMU option:
        -device pci-assign,host=06:00.0
      At runtime, hot-plug and hot-unplug it from the QEMU monitor:
        device_add pci-assign,host=06:00.0,id=vf0
        device_del vf0
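      The vendor:device ID and BDF used above can be read off with lspci:
      $ lspci -nn | grep -i ethernet
      (the bracketed pair such as [8086:10fb] is what goes into pci-stub/new_id, and the leading BDF such as 06:00.0 is what pci-assign's host= option expects)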
  47. Creating SR-IOV VFs: reload the PF driver with the max_vfs parameter (here ixgbe for the Intel 82599):
      # modprobe -r ixgbe
      # modprobe ixgbe max_vfs=8
      The host OS then sees the VFs as additional PCI functions:
      $ lspci -tv
      ... snip ...
       -[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
                  +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
                  |            \-00.1  Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
                  +-03.0-[05]--
                  +-07.0-[06]----00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connect   <- Physical Function (PF)
                  |            +-10.0  Intel Corporation 82599 Ethernet Controller Virtual Function
                  |            +-10.2  Intel Corporation 82599 Ethernet Controller Virtual Function
                  |            +-10.4  Intel Corporation 82599 Ethernet Controller Virtual Function
                  |            +-10.6  Intel Corporation 82599 Ethernet Controller Virtual Function
                  |            +-11.0  Intel Corporation 82599 Ethernet Controller Virtual Function
                  |            +-11.2  Intel Corporation 82599 Ethernet Controller Virtual Function
                  |            +-11.4  Intel Corporation 82599 Ethernet Controller Virtual Function
                  |            \-11.6  Intel Corporation 82599 Ethernet Controller Virtual Function   <- Virtual Functions (VFs)
                  +-09.0-[03]--
      ... snip ...
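      To make the VF count persist across reboots, one option is a modprobe configuration file (the file name is just a convention):
      # echo "options ixgbe max_vfs=8" > /etc/modprobe.d/ixgbe.conf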
  48. Assigning a VF to a VM works exactly like ordinary PCI passthrough, using the VF's BDF (use the VF's own vendor:device ID in new_id if it differs from the PF's):
      # echo "8086 10fb" > /sys/bus/pci/drivers/pci-stub/new_id
      # echo "0000:06:10.0" > /sys/bus/pci/devices/0000:06:10.0/driver/unbind
      # echo "0000:06:10.0" > /sys/bus/pci/drivers/pci-stub/bind
      At VM startup:
        -device pci-assign,host=06:10.0
      From the QEMU monitor:
        device_add pci-assign,host=06:10.0,id=vf0
        device_del vf0
  49. Inside the guest OS, the VF appears as an ordinary PCI device with its own MSI interrupt vectors:
      $ cat /proc/interrupts
               CPU0     CPU1
      ...snip...
       29:   114941   114133   PCI-MSI-edge   eth1-rx-0
       30:    77616    78385   PCI-MSI-edge   eth1-tx-0
       31:        5        5   PCI-MSI-edge   eth1:mbx
      $ lspci
      00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
      00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
      00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
      00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
      00:02.0 VGA compatible controller: Cirrus Logic GD 5446
      00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
      00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
      00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
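      To confirm inside the guest which driver has bound the VF (a quick check; eth1 and 00:05.0 follow the example output above):
      $ ethtool -i eth1
      $ lspci -nn -s 00:05.0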
  50. Rate-limiting SR-IOV VFs from the host: the PF driver can cap each VF's transmit rate (in Mbps), and the guest OS cannot override it:
      # ip link set dev eth5 vf 0 rate 200
      # ip link set dev eth5 vf 1 rate 400
      # ip link show dev eth5
      42: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
          link/ether 00:1b:21:81:55:3e brd ff:ff:ff:ff:ff:ff
          vf 0 MAC 00:16:3e:1d:ee:01, tx rate 200 (Mbps), spoof checking on
          vf 1 MAC 00:16:3e:1d:ee:02, tx rate 400 (Mbps), spoof checking on
      (cf. 2010-OS-117)
  51. SR-IOV tips: the host can also fix a VF's MAC address and VLAN ID:
      # ip link set dev eth5 vf 0 mac 00:16:3e:1d:ee:01
      # ip link set dev eth5 vf 0 vlan 101
      SR-IOV-capable Intel NICs include the 82576 (GbE) and the 82599 and X540 (10GbE); see http://www.intel.com/content/www/us/en/ethernet-controllers/ethernet-controllers.html
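      A related per-VF knob, on kernels and iproute2 versions recent enough to support it, is the MAC/VLAN anti-spoofing check shown as "spoof checking" in the output on slide 50:
      # ip link set dev eth5 vf 0 spoofchk off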
  52. Live migration and device assignment: a VM with a directly assigned device cannot be live-migrated as is. A practical workaround is bonding inside the guest: pair the passthrough/SR-IOV NIC (VF) with a paravirtual virtio NIC in active-backup bonding, detach the VF before migration, and re-attach one on the destination; the sketch below and the following figures walk through this.
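      A minimal sketch of the guest-side bonding setup, assuming eth0 is the virtio NIC, eth1 is the VF, and the address matches the later figures:
      # modprobe bonding mode=active-backup miimon=100 primary=eth1
      # ifconfig bond0 192.168.0.2 netmask 255.255.255.0 up
      # ifenslave bond0 eth0 eth1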
  53. SR-IOV migration, initial state (figure): inside the guest, bond0 bonds eth0 (virtio) and eth1 (igbvf, the SR-IOV VF); on the host, the virtio side connects through tap0 and bridge br0 to eth0 (igb) on the SR-IOV NIC.
  54. Step 1 (figure): detach the VF from the guest with the QEMU monitor command "device_del vf0"; bond0 fails over to the virtio interface.
  55. Step 2 (figure): live-migrate the VM with "(qemu) migrate -d tcp:x.x.x.x:y" to a destination QEMU started with "-incoming tcp:0:y"; during migration the guest communicates only through virtio.
  56. Step 3 (figure): on the destination host, re-attach a local VF with "(qemu) device_add pci-assign,host=05:10.0,id=vf0"; bond0 switches back to the VF.
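      The whole sequence, consolidated (a sketch; DEST_IP and PORT are placeholders, the VF BDF is the example value from these figures):
      # destination host: start a receiving QEMU with the same options plus
      #   -incoming tcp:0:PORT
      # source host, in the QEMU monitor (e.g. telnet localhost 5963):
      #   (qemu) device_del vf0               <- bond0 fails over to virtio
      #   (qemu) migrate -d tcp:DEST_IP:PORT  <- start live migration
      #   (qemu) info migrate                 <- poll until completed
      # destination host, once migration completes:
      #   (qemu) device_add pci-assign,host=05:10.0,id=vf0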
  57. Putting it together (figure): an MPI job with ranks running in VMs on different hosts (addresses in 192.168.0.0/24) keeps communicating over bond0 while rank 1's VM is live-migrated to another host.
  58. SymVirt (Symbiotic Virtualization): migrating or checkpointing VMs that use VMM-bypass devices (InfiniBand, PCI passthrough) needs cooperation between the guest OS and the VMM; a cloud scheduler reacts to failure prediction by re-allocating VMs, using the SymCR and SymPFT mechanisms, with VM images kept on global storage.
  59. SymVirt coordinator and controller/agent: the coordinator in the guest waits at an application-level, globally consistent point (e.g., between MPI communications) and signals the VMM; the controller/agent then detaches the device, migrates the VM, re-attaches the device, and confirms link-up before the application resumes. R. Takano, et al., "Cooperative VM Migration for a Virtualized HPC Cluster with VMM-Bypass I/O devices", 8th IEEE e-Science 2012.
  60. (Section: I/O virtualization and HPC)
  61. Background at AIST: from the AIST Super Cluster (2004, #19 on the TOP500) to the AIST Green Cloud (2010) and the AIST Super Cloud (2011); HPC infrastructure is increasingly operated as an IT cloud service (cf. HPCI, Amazon EC2).
  62. Motivation: an IDC survey (2011) on the top three reasons for adopting, or hesitating to adopt, clouds for database and HPC workloads.
  63. Example: ASC.
  64. Evaluation platform: AIST Green Cloud (AGC), a 16-node HPC cluster, one VM per host.
      Compute node: Dell PowerEdge M610; CPU: Intel quad-core Xeon E5540/2.53GHz x2; Chipset: Intel 5520; Memory: 48 GB DDR3; InfiniBand: Mellanox ConnectX (MT26428); Blade switch: Mellanox M3601Q (InfiniBand QDR, 16 ports).
      Host environment: OS Debian 6.0.1, Linux kernel 2.6.32-5-amd64, KVM 0.12.50, compiler gcc/gfortran 4.4.5, MPI Open MPI 1.4.2.
      VM environment: 8 VCPUs, 45 GB memory.
  65. MPI point-to-point bandwidth (higher is better, measured with qperf over message sizes from 1 byte to 1 GB): with the InfiniBand HCA assigned by PCI passthrough, KVM reaches about 2.4 GB/s versus about 3.2 GB/s on bare metal. Bare Metal: no virtualization.
  66. NAS Parallel Benchmarks BT-MZ (higher is better): performance [Gop/s total] and parallel efficiency (PE) from 1 to 16 nodes for Bare Metal, KVM, and Amazon EC2 Cluster Compute Instances (CCI); degradation of PE: KVM 2%, EC2 CCI 14%.
  67. Bloss, a real hybrid MPI+OpenMP application with coarse-grained MPI communication (Bcast 760 MB, Reduce 1 GB and Bcast 1 GB around the linear solver, which needs about 10 GB of memory, and Gather 350 MB after the eigenvector calculation): degradation of parallel efficiency (higher is better): KVM 8%, EC2 CCI 22%.
  68. Storage I/O evaluation with VMware ESXi:
      Server: Dell PowerEdge T410 (Intel hexa-core Xeon X5650, single socket; 6 GB DDR3-1333; QLogic QLE2460 HBA, single-port 4 Gbps Fibre Channel).
      Storage: IBM DS3400 FC SAN connected over Fibre Channel (management over Ethernet, out-of-band).
      VMM: VMware ESXi 5.0.
      Guest OS: Windows Server 2008 R2 (8 vCPUs, 3840 MB memory).
      Benchmark: IOMeter 2006.07.27 (http://www.iometer.org/).
  69. Three storage I/O paths compared (figure): Bare Metal Machine (BMM), where Windows' Storport/FC HBA driver accesses the LUN directly; Raw Device Mapping (RDM), where the guest uses a Storport/SCSI driver and the VMkernel's FC HBA driver reaches the LUN; and VMDirectPath I/O (fixed passthrough, FPT), where the guest's own Storport/FC HBA driver accesses the LUN, bypassing the VMkernel.
  70. (Benchmark results figure.)
  71. Notes on ESXi: for FC SAN storage, Raw Device Mapping passes SCSI commands through to the LUN while the VMkernel keeps control of the FC HBA, whereas VMDirectPath I/O assigns the HBA (a PCI device) to the guest; Linux and Windows guests are supported, and both approaches were compared against bare metal (BMM).
  72. Our related work on PCI passthrough for HPC: an InfiniBand PCI-passthrough HPC cluster study (SACSIS2011, pp. 109-116, May 2011) and an HPC cloud study (ACS37, May 2012); PCI passthrough gives near-native performance, but reconciling VM migration with passthrough and SR-IOV devices remains the open issue.
  73. Outlook: toward HPC clouds.
  74. Yabusame: quick (postcopy) live migration for QEMU/KVM. http://grivon.apgrid.org/quick-kvm-migration
  75. Summary: network I/O virtualization has advanced from full device emulation through paravirtualization (virtio, vhost) to hardware-assisted approaches (PCI passthrough, SR-IOV) that come close to native performance; even so, the VMM can still add value, as work such as SymVirt and BitVisor shows.