Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Presentation slides at CUTE 2011 (Korea-Japan e-Science and Cloud Symposium)

Presentation Transcript

  • Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
      Ryousei Takano, Tsutomu Ikegami, Takahiro Hirofuchi, Yoshio Tanaka
      Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan
      CUTE 2011, Seoul, Dec. 15, 2011
  • Background
      – Cloud computing is getting increased attention from the High Performance Computing community, e.g., Amazon EC2 Cluster Compute Instances.
      – Virtualization is a key technology: providers rely on it to consolidate computing resources.
      – Virtualization provides not only opportunities but also challenges for HPC systems and applications. The main concern is performance degradation due to the overhead of virtualization.
  • Contribution
      – Goal: to realize a practical HPC Cloud whose performance is close to that of bare metal (i.e., non-virtualized) machines.
      – Contributions:
          – A feasibility study evaluating the HPC Challenge benchmark on a 16-node InfiniBand cluster.
          – An analysis of the effect of three performance tuning techniques: PCI passthrough, NUMA affinity, and VMM noise reduction.
  • Outline
      – Background
      – Performance tuning techniques for HPC Cloud
          – PCI passthrough
          – NUMA affinity
          – VMM noise reduction
      – Performance evaluation
          – HPC Challenge benchmark suite
          – Results
      – Summary
  • Toward a practical HPC Cloud
      – Current KVM-based HPC Cloud: its performance is not good and is unstable.
      – “True” HPC Cloud: its performance is close to that of bare metal machines.
      – [Figure: the path from the current setup to a “true” HPC Cloud: use PCI passthrough, set NUMA affinity for the guest threads and the VCPU threads of the VM (a QEMU process), and reduce VMM noise by lowering the overhead of interrupt virtualization and disabling unnecessary services on the host OS (e.g., ksmd).]
  • IO architectures of VMs
      – IO emulation: the guest driver talks to a virtual device (vSwitch) in the VMM, and the VMM's physical driver accesses the NIC. IO emulation degrades the performance due to the overhead of VMM processing.
      – PCI passthrough: the guest OS runs the physical driver and accesses the NIC directly, bypassing the VMM. PCI passthrough achieves performance comparable to bare metal machines.
      – VMM: Virtual Machine Monitor
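    To make the PCI passthrough path concrete, a hedged host-side sketch for a KVM setup of this vintage follows; the PCI address 05:00.0, the vendor/device ID, the guest image name, and the qemu-kvm flags are placeholders and assumptions, not the authors' exact configuration.

        # Detach the InfiniBand HCA from its host driver and hand it to pci-stub
        # (vendor/device ID "15b3 673c" and address 0000:05:00.0 are placeholders).
        modprobe pci_stub
        echo "15b3 673c" > /sys/bus/pci/drivers/pci-stub/new_id
        echo 0000:05:00.0 > /sys/bus/pci/devices/0000:05:00.0/driver/unbind
        echo 0000:05:00.0 > /sys/bus/pci/drivers/pci-stub/bind

        # Start the guest with the device assigned (legacy pci-assign device of qemu-kvm).
        qemu-system-x86_64 -enable-kvm -smp 8 -m 46080 \
            -device pci-assign,host=05:00.0 \
            -drive file=guest.img,if=virtio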
  • NUMA affinity (bare metal)
      – On NUMA systems, memory affinity is an important performance factor: local memory accesses are faster than remote memory accesses.
      – In order to avoid inter-socket memory transfers, binding a thread to a CPU socket (e.g., with numactl on bare-metal Linux) can be effective.
      – NUMA: Non-Uniform Memory Access
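    A minimal bare-metal sketch of such binding, assuming socket 0 corresponds to NUMA node 0; ./app is a placeholder for a benchmark or MPI process:

        # Restrict the process's CPUs and memory to NUMA node 0,
        # so its allocations stay in socket-local memory.
        numactl --cpunodebind=0 --membind=0 ./app

        # Inspect the NUMA topology (nodes, CPUs per node, free memory per node).
        numactl --hardware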
  • NUMA affinity: KVM
      – Inside the guest, numactl binds application threads to the virtual sockets (vSockets), just as on bare metal.
      – On the host, taskset pins each VCPU thread of the VM (a QEMU process) to the corresponding physical CPU (Vn = Pn), so that the guest's NUMA layout maps onto the physical CPU sockets.
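    A hedged sketch of this two-level binding, assuming an 8-VCPU guest on an 8-core host; the PID lookup and the thread ordering are simplifications for illustration:

        # On the host: pin the threads of the guest's QEMU process so that virtual
        # CPU n runs on physical CPU n.  This sketch simply walks the QEMU threads
        # in order; a real setup would identify the VCPU threads explicitly
        # (e.g., via "info cpus" in the QEMU monitor).
        QEMU_PID=$(pgrep -f qemu-system-x86_64 | head -n 1)
        i=0
        for TID in $(ls /proc/"$QEMU_PID"/task); do
            taskset -pc $((i % 8)) "$TID"
            i=$((i + 1))
        done

        # Inside the guest: bind application threads and memory to the matching vSocket.
        numactl --cpunodebind=0 --membind=0 ./app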
  • NUMA affinity: Xen
      – numactl cannot run on the guest OS (DomU), because Xen does not disclose the physical NUMA topology to the guest.
      – Instead, the Xen hypervisor's domain scheduler pins each VCPU of the DomU to a physical CPU (Vn = Pn).
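    With the xm toolstack of Xen 4.0, the pinning described above could look like the following sketch; the domain name hpcvm is a placeholder:

        # Pin VCPUs 0..7 of domain "hpcvm" to physical CPUs 0..7 (Vn = Pn).
        for v in $(seq 0 7); do
            xm vcpu-pin hpcvm $v $v
        done

        # Verify the resulting placement.
        xm vcpu-list hpcvm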
  • VMM noise
      – OS noise is a well-known problem for large-scale system scalability: OS activities and daemon programs take up CPU time, consume cache and TLB entries, and delay the synchronization of parallel processes.
      – VMM-level noise, called VMM noise, can cause the same problem for a guest OS. Sources include the overhead of interrupt virtualization, which results in VM exits (i.e., VM-to-VMM switching), and unnecessary services on the host OS (e.g., ksmd).
      – In this work we do not yet address VMM noise; it is left as an open issue.
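    Although VMM noise reduction is left open here, the host-side service named as an example (ksmd) can be switched off via sysfs on a KVM host; a minimal sketch, assuming KSM is controlled through /sys/kernel/mm/ksm:

        # On the KVM host: stop kernel samepage merging so ksmd no longer scans guest memory.
        echo 0 > /sys/kernel/mm/ksm/run

        # Optionally confirm that no pages are being merged.
        cat /sys/kernel/mm/ksm/pages_shared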
  • Outline
      – Background
      – Performance tuning techniques for HPC Cloud
          – PCI passthrough
          – NUMA affinity
          – VMM noise reduction
      – Performance evaluation
          – HPC Challenge benchmark suite
          – Results
      – Summary
  • Experimental setting
      Evaluation of the HPC Challenge benchmark on a 16-node InfiniBand cluster. Only 1 VM runs on each host.

      Blade server: Dell PowerEdge M610
          CPU:          Intel quad-core Xeon E5540/2.53 GHz x2
          Chipset:      Intel 5520
          Memory:       48 GB DDR3
          InfiniBand:   Mellanox ConnectX (MT26428)
      Blade switch:
          InfiniBand:   Mellanox M3601Q (QDR, 16 ports)
      Host machine environment:
          OS:           Debian 6.0.1
          Linux kernel: 2.6.32-5-amd64
          KVM:          0.12.50
          Xen:          4.0.1
          Compiler:     gcc/gfortran 4.4.5
          MPI:          Open MPI 1.4.2
      VM environment:
          VCPU:         8
          Memory:       45 GB
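    A benchmark run on this configuration might be launched as follows; this is a hypothetical invocation (the hostfile name, the hpcc binary path, and the layout of 8 MPI processes per node are assumptions), not the authors' exact command line:

        # Launch HPCC with 128 MPI processes: 8 per node across the 16 nodes listed in "hosts".
        # Open MPI 1.4.x places the ranks according to -npernode.
        mpirun -np 128 -npernode 8 --hostfile hosts ./hpcc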
  • HPC Challenge Benchmark Suite
      – We measure spatial and temporal locality boundaries by evaluating the HPC Challenge benchmark suite.
      – [Figure: “Motivation of the HPCC design” — the HPCC kernels (HPL, DGEMM, FFT, PTRANS, STREAM, RandomAccess) arranged along spatial and temporal locality axes, with compute-, memory-, and communication-intensive regions marked, spanning the space of mission partner applications.]
      – From: Piotr Luszczek, et al., “The HPC Challenge (HPCC) Benchmark Suite,” SC2006 Tutorial.
  • HPC Challenge: Result
      – [Radar chart: results for HPL(G), PTRANS(G), STREAM(EP), FFT(G), RandomAccess(G), Random Ring Bandwidth, and Random Ring Latency, normalized to bare metal (BMM), for BMM, BMM+pin, KVM, KVM+pin+bind, Xen, and Xen+pin. Higher is better, except for Random Ring Latency. G: Global, EP: Embarrassingly parallel.]
      – Comparing Xen and KVM, the performance is almost the same.
  • HPC Challenge: Result (cont.)
      – [Radar charts: the same HPCC metrics, comparing BMM with Xen and Xen+pin, and BMM with KVM and KVM+pin+bind. Higher is better, except for Random Ring Latency. G: Global, EP: Embarrassingly parallel.]
      – NUMA affinity is important even on a VM, but the effect of VCPU pinning is uncertain.
  • HPL: High Performance LINPACK
      – BMM: the LINPACK efficiency is 57.7% on 16 nodes (63.1% on a single node).
      – BMM, KVM: setting NUMA affinity is effective.
      – Virtualization overhead is 6 to 8%.

      HPL performance in GFLOPS (relative to BMM in parentheses):
      Configuration       1 node          16 nodes
      BMM                 50.24 (1.00)    706.21 (1.00)
      BMM + bind          51.07 (1.02)    747.88 (1.06)
      Xen                 49.44 (0.98)    700.23 (0.99)
      Xen + pin           49.37 (0.98)    698.93 (0.99)
      KVM                 48.03 (0.96)    671.97 (0.95)
      KVM + pin + bind    49.33 (0.98)    684.96 (0.97)
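    The quoted efficiencies are consistent with the table if one assumes a theoretical peak of 4 double-precision FLOPs per cycle per core for the Xeon E5540 and reads them from the BMM + bind rows:

        Rpeak per node = 2 sockets x 4 cores x 2.53 GHz x 4 FLOP/cycle = 80.96 GFLOPS
        1-node efficiency:  51.07 / 80.96           ≈ 63.1%
        16-node efficiency: 747.88 / (16 x 80.96)   ≈ 57.7%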
  • Discussion
      – The performance of the global benchmarks, except for FFT(G), is almost comparable to that of bare metal machines.
      – FFT(G) loses 11% to 20% of its performance due to virtualization overhead related to inter-node communication and/or VMM noise.
      – PCI passthrough brings MPI communication throughput close to that of bare metal machines, but interrupt injection, which results in VM exits, can still disturb the application execution.
  • Discussion (cont.)
      – The performance of Xen is marginally better than that of KVM, except for Random Ring Bandwidth: the bandwidth decreases by 4% on KVM but by 20% on Xen.
      – KVM: the performance of STREAM(EP) decreases by 27%. Heavy memory contention (TLB misses) among processes may occur, which is the worst case for EPT (Extended Page Tables), because an EPT page walk takes more time than a shadow page table walk. This suggests that a virtual machine is more sensitive to memory contention than a bare metal machine.
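    To probe the EPT hypothesis on a KVM host, one could check, and for comparison disable, EPT through the kvm_intel module parameter; this is an illustrative sketch, not a step reported in the talk:

        # Check whether KVM is using EPT (Extended Page Tables) on this host.
        cat /sys/module/kvm_intel/parameters/ept

        # For an A/B comparison (no VMs running), reload kvm_intel with EPT
        # disabled, falling back to shadow page tables.
        modprobe -r kvm_intel && modprobe kvm_intel ept=0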
  • Outline
      – Background
      – Performance tuning techniques for HPC Cloud
          – PCI passthrough
          – NUMA affinity
          – VMM noise reduction
      – Performance evaluation
          – HPC Challenge benchmark suite
          – Results
      – Summary
  • Summary
      HPC Cloud is promising!
      – The performance of coarse-grained parallel applications is comparable to that of bare metal machines.
      – We plan to adopt these performance tuning techniques in our private cloud service called “AIST Cloud.”
      – Open issues: VMM noise reduction, and live migration with VMM-bypass devices.
  • HPC Cloud
      – HPC Cloud utilizes cloud resources for High Performance Computing (HPC) applications.
      – Users request resources according to their needs, and the provider allocates each user a dedicated virtualized cluster on demand on top of the physical cluster.
  • Amazon EC2 CCI in TOP500
      – [Scatter plot, TOP500 Nov. 2011: LINPACK efficiency (%) versus TOP500 rank, with systems grouped by interconnect. InfiniBand: 76%, 10 Gigabit Ethernet: 72%, Gigabit Ethernet: 52%. GPGPU machines and the Amazon EC2 cluster compute instances system (#42) are annotated.]
      – Efficiency = maximum LINPACK performance (Rmax) / theoretical peak performance (Rpeak)
  • LINPACK Efficiency
      – [Scatter plot, TOP500 June 2011: LINPACK efficiency (%) versus TOP500 rank, with systems grouped by interconnect. InfiniBand: 79%, 10 Gigabit Ethernet: 74%, Gigabit Ethernet: 54%. GPGPU machines and the Amazon EC2 cluster compute instances system (#451) are annotated.]
      – Virtualization causes the performance degradation!
      – Efficiency = maximum LINPACK performance (Rmax) / theoretical peak performance (Rpeak)