Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
AIST booth presentation slides at SC11.


Published in: Technology

  • 1. Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster. Ryousei Takano, Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan. SC2011@Seattle, Nov. 15, 2011
  • 2. Outline
    –  What is HPC Cloud?
    –  Performance tuning method for HPC Cloud: PCI passthrough, NUMA affinity, VMM noise reduction
    –  Performance evaluation
  • 3. HPC Cloud: An HPC Cloud utilizes cloud resources for High Performance Computing (HPC) applications. Users request resources according to their needs, and the provider allocates each user a dedicated virtual cluster on demand on top of the physical cluster. [diagram: virtualized clusters carved out of a physical cluster]
  • 4. HPC Cloud (cont’d)
    –  Pros: for the user, easy deployment; for the provider, high resource utilization
    –  Cons: performance degradation? No established method exists for performance tuning in a virtualized environment.
  • 5. Toward a practical HPC Cloud: The current KVM-based HPC Cloud performs poorly and unstably. Three tuning steps move it toward a “true” HPC Cloud whose performance approaches that of bare metal machines: (1) use PCI passthrough, (2) set NUMA affinity, and (3) reduce VMM noise by cutting the overhead of interrupt virtualization and disabling unnecessary services on the host OS (e.g., ksmd); noise reduction is not yet complete. [diagram: VM as a QEMU process with guest OS threads and VCPU threads over the Linux kernel, physical CPUs, and NIC]
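One concrete piece of the “reduce VMM noise” step above: ksmd, the kernel same-page merging daemon, is controlled through the sysfs file /sys/kernel/mm/ksm/run (0 = off, 1 = on, 2 = unmerge all pages). A minimal sketch, assuming a Linux host; writing the file requires root:

```python
from pathlib import Path

KSM_RUN = Path("/sys/kernel/mm/ksm/run")

def ksm_state(path=KSM_RUN):
    """Return the KSM run state (0 = off, 1 = on, 2 = unmerge),
    or None if this kernel has no KSM support."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())

def disable_ksm(path=KSM_RUN):
    """Write 0 to the run file to stop ksmd (requires root)."""
    Path(path).write_text("0\n")
```

On a compute node dedicated to a single HPC VM, memory deduplication buys little and its periodic page scans show up as OS noise, which is why the talk singles ksmd out.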
  • 6. PCI passthrough: Three I/O virtualization models. With IO emulation, the guest’s I/O goes through the VMM’s virtual switch and physical driver. With PCI passthrough, the guest’s physical driver accesses the NIC directly, bypassing the VMM. With SR-IOV, the NIC exposes virtual functions and switches between VMs in hardware (VEB). Across the three, VM sharing and performance trade off: IO emulation favors sharing among VMs, PCI passthrough favors performance, and SR-IOV aims at both. [diagram: VM1/VM2 device paths under each model]
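These slides predate the VFIO framework (KVM device assignment used pci-stub at the time); on current Linux kernels a device handed to a guest is bound to the vfio-pci driver, and the bindings appear under /sys/bus/pci/drivers/vfio-pci. A hedged sketch for listing passthrough candidates, assuming a modern Linux host:

```python
from pathlib import Path

def vfio_bound_devices(driver_dir="/sys/bus/pci/drivers/vfio-pci"):
    """List PCI addresses (e.g. '0000:41:00.0') currently bound to
    the vfio-pci driver, i.e. devices prepared for passthrough to a
    guest. Returns an empty list when the driver is not loaded."""
    p = Path(driver_dir)
    if not p.is_dir():
        return []
    # A driver directory contains symlinks named after PCI addresses,
    # plus control files like 'bind'/'unbind' that we filter out here.
    return sorted(e.name for e in p.iterdir() if ":" in e.name)

print(vfio_bound_devices())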
  • 7. Virtual CPU scheduling: On bare metal, the Linux process scheduler places threads directly on physical CPUs (P0–P3). Under Xen, a VM is a domain (DomU) whose virtual CPUs (V0–V3) are mapped to physical CPUs by the Xen hypervisor’s domain scheduler, and a guest OS cannot use numactl to control physical placement. Under KVM, a VM is a QEMU process: guest threads run on VCPU threads, which the host Linux process scheduler places like any other threads. [diagram: Bare Metal vs Xen vs KVM scheduling stacks over CPU sockets]
  • 8. NUMA affinity: On bare metal, numactl binds threads to CPU sockets directly. Under KVM, binding happens at two levels: inside the guest, numactl binds threads to virtual sockets (vSockets); on the host, taskset pins each VCPU thread to a physical CPU (Vn = Pn), so the guest’s NUMA placement actually maps onto physical sockets and their memory. [diagram: two-level binding from guest threads through VCPUs to physical CPUs and memory]
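The taskset/numactl pinning described above boils down to the Linux sched_setaffinity(2) syscall, which Python exposes directly. A minimal sketch, assuming a Linux host (the call is Linux-specific); pinning a real VCPU thread would target the QEMU thread’s TID rather than the current process:

```python
import os

def pin_to_cpus(cpus, pid=0):
    """Restrict process/thread `pid` (0 = the caller) to the given
    CPU set. This is the same syscall taskset uses when pinning a
    VCPU thread to a physical CPU."""
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)

# Pin this process to CPU 0 (e.g., one core of socket 0).
print(pin_to_cpus({0}))
```

For the Vn = Pn mapping on the slide, each of the eight VCPU threads of the QEMU process would be pinned to its same-numbered physical core this way.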
  • 9. Evaluation: HPC applications on a 16-node cluster (part of the AIST Green Cloud Cluster).
    Compute node: Dell PowerEdge M610
      CPU: Intel quad-core Xeon E5540/2.53GHz x2
      Chipset: Intel 5520
      Memory: 48 GB DDR3
      InfiniBand: Mellanox ConnectX (MT26428)
    Blade switch: Mellanox M3601Q InfiniBand (QDR, 16 ports)
    Host machine environment: OS Debian 6.0.1, Linux kernel 2.6.32-5-amd64, KVM 0.12.50, compiler gcc/gfortran 4.4.5, MPI Open MPI 1.4.2
    VM environment: VCPU 8, memory 45 GB
  • 10. MPI point-to-point communication performance: PCI passthrough improves MPI communication throughput to close to that of bare metal machines. [log-log plot: bandwidth (MB/sec, higher is better) vs message size (1 byte to 1 GB), Bare Metal vs KVM; Bare Metal = non-virtualized cluster]
  • 11. NUMA affinity: Execution time on a single node for NPB multi-zone (computational fluid dynamics) and Bloss (non-linear eigensolver).

                     SP-MZ [sec]      BT-MZ [sec]      Bloss [min]
    Bare Metal       94.41 (1.00)     138.01 (1.00)    21.02 (1.00)
    KVM              104.57 (1.11)    141.69 (1.03)    22.12 (1.05)
    KVM (w/ bind)    96.14 (1.02)     139.32 (1.01)    21.28 (1.01)

    NUMA affinity is an important performance factor not only on bare metal machines but also on virtual machines.
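The parenthesized factors in the table are simply each configuration’s time divided by the bare metal time; they can be reproduced as follows (numbers taken from the table above):

```python
# Single-node execution times from the slide.
bare = {"SP-MZ": 94.41, "BT-MZ": 138.01, "Bloss": 21.02}
kvm = {"SP-MZ": 104.57, "BT-MZ": 141.69, "Bloss": 22.12}
kvm_bind = {"SP-MZ": 96.14, "BT-MZ": 139.32, "Bloss": 21.28}

def slowdown(times, ref):
    """Per-benchmark slowdown relative to the reference times."""
    return {k: round(times[k] / ref[k], 2) for k in ref}

print(slowdown(kvm, bare))       # {'SP-MZ': 1.11, 'BT-MZ': 1.03, 'Bloss': 1.05}
print(slowdown(kvm_bind, bare))  # {'SP-MZ': 1.02, 'BT-MZ': 1.01, 'Bloss': 1.01}
```

The second line is the slide’s point: with binding, the KVM overhead collapses from 3–11% to 1–2%.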
  • 12. NPB BT-MZ: Parallel efficiency. At 16 nodes, parallel efficiency (PE) degrades by 2% on KVM and 14% on Amazon EC2 relative to bare metal. [plot: performance (Gop/s total) and parallel efficiency (%, higher is better) vs number of nodes (1–16) for Bare Metal, KVM, and Amazon EC2]
  • 13. Bloss: Parallel efficiency. Bloss is a non-linear internal eigensolver, a hierarchical parallel program using MPI and OpenMP. At 16 nodes, parallel efficiency degrades by 8% on KVM and 22% on Amazon EC2, reflecting the overhead of communication and virtualization. [plot: parallel efficiency (%) vs number of nodes (1–16) for Bare Metal, KVM, Amazon EC2, and the ideal]
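The parallel efficiency plotted in these slides is the usual definition, speedup divided by node count: PE(n) = T(1) / (n · T(n)). A small sketch with hypothetical timings, since the slides report only the resulting percentages:

```python
def parallel_efficiency(t1, tn, n):
    """PE(n) = T(1) / (n * T(n)); 1.0 means perfect scaling."""
    return t1 / (n * tn)

# Hypothetical example: a job taking 1600 s on one node and
# 125 s on 16 nodes scales at 80% efficiency.
print(round(parallel_efficiency(1600.0, 125.0, 16), 2))  # 0.8
```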
  • 14. Summary: HPC Cloud is promising!
    –  The performance of coarse-grained parallel applications is comparable to bare metal machines
    –  We plan to operate a private cloud service “AIST Cloud” for HPC users
    –  Open issues: VMM noise reduction; VMM-bypass device-aware VM scheduling; live migration with VMM-bypass devices
  • 15. LINPACK efficiency (TOP500, June 2011): Efficiency is the maximum LINPACK performance Rmax divided by the theoretical peak performance Rpeak. InfiniBand systems reach about 79%, 10 Gigabit Ethernet about 74%, and Gigabit Ethernet about 54%. The Amazon EC2 cluster compute instances rank #451: virtualization causes the performance degradation. [scatter plot: efficiency (%) vs TOP500 rank, by interconnect (InfiniBand, Gigabit Ethernet, 10 Gigabit Ethernet), with GPGPU machines marked]
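Efficiency here is Rmax/Rpeak, as defined at the bottom of the slide. As a worked example, Rpeak for the 16-node evaluation cluster would be nodes × sockets × cores × clock × FLOPs/cycle; the factor of 4 double-precision FLOPs per cycle for these Nehalem-era Xeons is my assumption, not stated in the slides:

```python
def rpeak_gflops(nodes, sockets, cores, ghz, flops_per_cycle=4):
    """Theoretical peak performance in Gflop/s."""
    return nodes * sockets * cores * ghz * flops_per_cycle

def linpack_efficiency(rmax, rpeak):
    """TOP500 efficiency: measured Rmax over theoretical Rpeak."""
    return rmax / rpeak

# The 16-node, dual-socket, quad-core Xeon E5540 (2.53 GHz) cluster.
peak = rpeak_gflops(16, 2, 4, 2.53)
print(round(peak, 2))  # 1295.36
# A hypothetical Rmax of 1023 Gflop/s would be ~79% efficiency,
# the InfiniBand figure from the slide.
print(round(linpack_efficiency(1023.0, peak), 2))  # 0.79
```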
  • 16. Bloss: Parallel efficiency (with binding). Bloss is a non-linear internal eigensolver, a hierarchical parallel program using MPI and OpenMP. Binding threads to physical CPUs can be sensitive to VMM noise and degrade the performance. [plot: parallel efficiency (%) vs number of nodes (1–16) for Bare Metal, KVM, KVM (w/ bind), Amazon EC2, and the ideal]