Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster

AIST booth presentation slides at SC11.


Transcript of "Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster"

  1. Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
     Ryousei Takano, Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan
     SC2011@Seattle, Nov. 15, 2011
  2. Outline
     •  What is HPC Cloud?
     •  Performance tuning method for HPC Cloud
        –  PCI passthrough
        –  NUMA affinity
        –  VMM noise reduction
     •  Performance evaluation
  3. HPC Cloud
     HPC Cloud uses cloud resources for High Performance Computing (HPC) applications.
     [Figure: users request resources according to their needs, and the provider allocates each user a dedicated virtualized cluster on demand on top of the physical cluster.]
  4. HPC Cloud (cont’d)
     •  Pros:
        –  User side: easy to deploy
        –  Provider side: high resource utilization
     •  Cons:
        –  Performance degradation?
     A method for performance tuning in a virtualized environment has not yet been established.
  5. Toward a practical HPC Cloud
     •  Current KVM HPC Cloud: its performance is poor and unstable.
     •  “True” HPC Cloud: its performance is close to that of bare metal machines.
     •  Tuning steps:
        –  Use PCI passthrough
        –  Set NUMA affinity
        –  Reduce VMM noise (not completed): reduce the overhead of interrupt virtualization and disable unnecessary services on the host OS (e.g., ksmd)
     [Figure: KVM architecture showing the VM as a QEMU process with guest OS threads and VCPU threads on the Linux kernel (VMM), physical CPUs grouped into CPU sockets, and the NIC driven by a physical driver inside the guest.]
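     (The VMM-noise item above mentions disabling unnecessary host services such as ksmd. As a minimal sketch, not taken from the slides, the KSM daemon on a Linux/KVM host can be stopped through its standard sysfs control file; this must run as root on the host.)

```c
/* Minimal sketch (assumption, not from the slides): stop the KSM daemon
 * (ksmd) on the KVM host by writing 0 to its sysfs control file.
 * Must be run as root on the host. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");
    if (!f) {
        perror("open /sys/kernel/mm/ksm/run");
        return 1;
    }
    fputs("0\n", f);   /* 0 = stop ksmd and do not merge further pages */
    fclose(f);
    return 0;
}
```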
  6. PCI passthrough
     [Figure: three I/O models compared. IO emulation: the guest uses a virtual driver and the VMM forwards traffic through a vSwitch and its physical driver to the NIC. PCI passthrough: the guest’s physical driver accesses the NIC directly, bypassing the VMM. SR-IOV: each guest drives its own function of the NIC, and switching is done in the NIC itself (VEB). Moving from IO emulation to PCI passthrough and SR-IOV trades VM sharing for performance.]
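     (The slides enable PCI passthrough for the InfiniBand HCA but do not show the configuration. The following is a hedged sketch, assuming the host is managed by libvirt: a device can be assigned to a guest with virDomainAttachDevice and a hostdev XML fragment. The domain name "vm1" and the PCI address 0000:06:00.0 are placeholders, not values from the presentation; compile with -lvirt.)

```c
/* Hedged sketch: assign a host PCI device (e.g., an InfiniBand HCA) to a
 * KVM guest via the libvirt C API. "vm1" and the PCI address below are
 * placeholders. With managed='yes', libvirt detaches the device from its
 * host driver and rebinds it for guest use. */
#include <stdio.h>
#include <libvirt/libvirt.h>

static const char *hostdev_xml =
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "  <source>"
    "    <address domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>"
    "  </source>"
    "</hostdev>";

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) { fprintf(stderr, "cannot connect to libvirtd\n"); return 1; }

    virDomainPtr dom = virDomainLookupByName(conn, "vm1");
    if (!dom) { fprintf(stderr, "domain not found\n"); virConnectClose(conn); return 1; }

    if (virDomainAttachDevice(dom, hostdev_xml) < 0)
        fprintf(stderr, "attaching the PCI device failed\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```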
  7. Virtual CPU scheduling
     [Figure: Bare Metal, Xen, and KVM compared. Under Xen, the VM (DomU) runs guest OS threads on VCPUs V0–V3, which the Xen hypervisor’s domain scheduler maps onto physical CPUs P0–P3; a guest OS cannot run numactl. Under KVM, the VM is a QEMU process whose VCPU threads are scheduled onto the physical CPUs by the ordinary Linux process scheduler.]
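     (Because a KVM vCPU is an ordinary thread of the QEMU process, the host itself can decide where the vCPUs run. The sketch below is illustrative and not the authors' tool: it walks /proc/<pid>/task and pins the process's threads one-to-one onto physical CPUs with sched_setaffinity, roughly what the taskset step on the next slide does. For simplicity it pins every QEMU thread, not only the vCPU threads.)

```c
/* Illustrative sketch: pin each thread of a QEMU process (thread i -> CPU i).
 * The QEMU PID is passed on the command line; in practice a tool such as
 * taskset or libvirt's vcpupin would be used, and only the vCPU threads
 * would be selected. */
#define _GNU_SOURCE
#include <dirent.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <qemu-pid>\n", argv[0]);
        return 1;
    }
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);

    DIR *dir = opendir(path);
    if (!dir) { perror("opendir"); return 1; }

    struct dirent *ent;
    int cpu = 0;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                       /* skip "." and ".." */
        pid_t tid = (pid_t)atoi(ent->d_name);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);                 /* thread i is bound to CPU i */
        if (sched_setaffinity(tid, sizeof(set), &set) == 0)
            printf("pinned tid %d to CPU %d\n", (int)tid, cpu);
        cpu++;
    }
    closedir(dir);
    return 0;
}
```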
  8. NUMA affinity
     [Figure: on bare metal, numactl binds application threads to CPU sockets and their local memory. On KVM, NUMA affinity is set in two steps: inside the guest, numactl binds threads to a virtual socket (vSocket); on the host, taskset pins each VCPU thread of the QEMU process to a physical CPU (Vn = Pn).]
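     (The guest-side half of this slide, binding threads to a vSocket with numactl, can also be expressed inside the application. The sketch below is an illustration under assumptions, not the benchmark code: it uses libnuma and OpenMP to tie each thread's CPU and memory to one NUMA node of the visible topology; the round-robin placement policy is only an example. Build with -fopenmp -lnuma.)

```c
/* Illustrative sketch: bind each OpenMP thread's CPU and memory to a NUMA
 * node, as numactl would do from the command line. Requires libnuma. */
#include <numa.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support visible in this (virtual) machine\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();

    #pragma omp parallel
    {
        int node = omp_get_thread_num() % nodes;   /* example placement */
        numa_run_on_node(node);        /* restrict this thread to the node's CPUs */
        numa_set_preferred(node);      /* allocate its memory from the same node  */
        /* ... application kernel would run here ... */
    }
    return 0;
}
```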
  9. Evaluation
     Evaluation of HPC applications on a 16-node cluster (part of the AIST Green Cloud Cluster).
     Compute node: Dell PowerEdge M610
        CPU: Intel quad-core Xeon E5540 / 2.53 GHz x2
        Chipset: Intel 5520
        Memory: 48 GB DDR3
        InfiniBand: Mellanox ConnectX (MT26428)
     Blade switch:
        InfiniBand: Mellanox M3601Q (QDR, 16 ports)
     Host machine environment:
        OS: Debian 6.0.1
        Linux kernel: 2.6.32-5-amd64
        KVM: 0.12.50
        Compiler: gcc/gfortran 4.4.5
        MPI: Open MPI 1.4.2
     VM environment:
        VCPU: 8
        Memory: 45 GB
  10. MPI Point-to-Point communication performance
      [Figure: bandwidth (MB/sec, log scale) versus message size (1 byte to 1 GB) for Bare Metal and KVM; higher is better.]
      PCI passthrough brings MPI communication throughput close to that of bare metal machines.
      Bare Metal: non-virtualized cluster.
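      (For context, a bandwidth curve like the one above is produced by a point-to-point MPI test between two ranks. The following ping-pong sketch shows the idea; the 1 MiB message size and iteration count are illustrative and not the settings used in the presentation.)

```c
/* Illustrative MPI ping-pong between rank 0 and rank 1; reports the
 * point-to-point bandwidth for a fixed message size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size  = 1 << 20;   /* 1 MiB message (illustrative) */
    const int iters = 100;
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* 2x because each iteration moves the message both ways */
        printf("bandwidth: %.1f MB/s\n", 2.0 * size * iters / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```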
  11. NUMA affinity
      Execution time on a single node: NPB multi-zone (Computational Fluid Dynamics) and Bloss (non-linear eigensolver).
                         SP-MZ [sec]     BT-MZ [sec]     Bloss [min]
      Bare Metal         94.41 (1.00)    138.01 (1.00)   21.02 (1.00)
      KVM                104.57 (1.11)   141.69 (1.03)   22.12 (1.05)
      KVM (w/ bind)      96.14 (1.02)    139.32 (1.01)   21.28 (1.01)
      NUMA affinity is an important performance factor not only on bare metal machines but also on virtual machines.
  12. NPB BT-MZ: Parallel efficiency
      [Figure: performance (Gop/s total) and parallel efficiency (%) versus number of nodes (1–16) for Bare Metal, KVM, and Amazon EC2; higher is better.]
      Degradation of parallel efficiency: KVM 2%, EC2 14%.
  13. Bloss: Parallel efficiency
      Bloss: non-linear internal eigensolver
         –  Hierarchical parallel program using MPI and OpenMP
      [Figure: parallel efficiency (%) versus number of nodes (1–16) for Bare Metal, KVM, Amazon EC2, and the ideal case; the gap reflects the overhead of communication and virtualization.]
      Degradation of parallel efficiency: KVM 8%, EC2 22%.
  14. Summary
      HPC Cloud is promising!
      •  The performance of coarse-grained parallel applications is comparable to that of bare metal machines.
      •  We plan to operate a private cloud service, “AIST Cloud”, for HPC users.
      •  Open issues:
         –  VMM noise reduction
         –  VMM-bypass device-aware VM scheduling
         –  Live migration with VMM-bypass devices
  15. LINPACK Efficiency (TOP500, June 2011)
      [Figure: efficiency (%) versus TOP500 rank, marking InfiniBand, 10 Gigabit Ethernet, and Gigabit Ethernet systems, GPGPU machines, and the Amazon EC2 cluster compute instances at #451.]
      InfiniBand: 79%, 10 Gigabit Ethernet: 74%, Gigabit Ethernet: 54%.
      Virtualization causes performance degradation!
      Efficiency = Rmax (maximum LINPACK performance) / Rpeak (theoretical peak performance).
  16. Bloss: Parallel efficiency
      Bloss: non-linear internal eigensolver
         –  Hierarchical parallel program using MPI and OpenMP
      [Figure: parallel efficiency (%) versus number of nodes (1–16) for Bare Metal, KVM, KVM (w/ bind), Amazon EC2, and the ideal case.]
      Binding threads to physical CPUs can be sensitive to VMM noise and degrade performance.