SR-IOV: The Key Enabling Technology for Fully Virtualized HPC Clusters
 


How well does InfiniBand virtualized with SR-IOV really perform? SDSC carried out initial application benchmarking studies and compared the results against the best available commercial alternative to determine whether SR-IOV is a viable technology for closing the performance gap of virtualized HPC. The results were promising, and this technology will be used in Comet, SDSC's two-petaflop supercomputer being deployed in 2015.

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • SR-IOV: The Key Enabling Technology for Fully Virtualized HPC Clusters
      • Glenn K. Lockwood, Christopher Irving, Philip M. Papadopoulos, Mahidhar Tatineni, Rick Wagner
      • SAN DIEGO SUPERCOMPUTER CENTER
    • Single Root I/O Virtualization in HPC
      • Problem: complex workflows demand increasing flexibility from HPC platforms
        • Virtualization = flexibility
        • Virtualization = I/O performance loss (e.g., excessive DMA interrupts)
      • Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand HCAs
        • One physical function (PF) → multiple virtual functions (VFs), each with its own DMA streams, memory space, and interrupts
        • Allows DMA to bypass the hypervisor and go directly to VMs
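The PF-to-VF relationship described on this slide is visible from the host through the generic Linux PCI sysfs interface. The following is a minimal sketch, assuming a kernel that exposes the `sriov_totalvfs` and `sriov_numvfs` attributes and `virtfnN` symlinks under the PF's device directory. The slides do not say how VFs were enabled or inspected on the test cluster, so the function name `list_virtual_functions` and the example PCI address are hypothetical.

```python
import os
from pathlib import Path


def list_virtual_functions(pf_dir):
    """Return (total_vfs, enabled_vfs, vf_addresses) for one physical function.

    pf_dir is the PF's sysfs directory, e.g. the hypothetical
    /sys/bus/pci/devices/0000:03:00.0 (the address is an assumption).
    """
    pf = Path(pf_dir)
    if not pf.is_dir():
        return 0, 0, []

    def read_int(name):
        attr = pf / name
        return int(attr.read_text().strip()) if attr.exists() else 0

    total = read_int("sriov_totalvfs")   # VFs the device can expose
    enabled = read_int("sriov_numvfs")   # VFs currently enabled

    # Each enabled VF shows up as a symlink virtfn0, virtfn1, ... that
    # points at the VF's own PCI device directory.
    links = sorted(pf.glob("virtfn*"),
                   key=lambda p: int(p.name[len("virtfn"):]))
    vf_addresses = [os.path.basename(os.readlink(p)) for p in links]
    return total, enabled, vf_addresses
```

Each address returned is a separate PCI function that a hypervisor can pass through to a VM, which is what lets DMA reach the guest without going through the hypervisor.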
    • High-Performance Virtualization on Comet
      • Mellanox FDR InfiniBand HCAs with SR-IOV
      • Rocks and OpenStack Nova to manage VMs
      • Flexibility to support complex science gateways and web-based workflow engines
      • Custom compute appliances and virtual clusters developed with FutureGrid and their existing expertise
      • Backed by virtualized Lustre running over virtualized InfiniBand
    • Hardware/Software Configuration of Test Cluster
      • Platform
        • Native, SR-IOV: Rocks 6.1 (EL6); virtualization via KVM
        • Amazon EC2: Amazon Linux 2013.03 (EL6); cc2.8xlarge instances
      • CPUs
        • Native, SR-IOV: 2x Xeon E5-2660 (2.2 GHz); 16 cores per node
        • Amazon EC2: 2x Xeon E5-2670 (2.6 GHz); 16 cores per node
      • RAM
        • Native, SR-IOV: 64 GB DDR3 DRAM
        • Amazon EC2: 60.5 GB DDR3 DRAM
      • Interconnect
        • Native, SR-IOV: QDR 4X InfiniBand; Mellanox ConnectX-3 (MT27500); Intel VT-d and SR-IOV enabled in firmware, kernel, and drivers; mlx4_core 1.1; Mellanox OFED 2.0; HCA firmware 2.11.1192
        • Amazon EC2: 10 GbE; common placement group
    • 50x less latency than Amazon EC2
      • Benchmark: OSU Microbenchmarks (3.9, osu_latency)
      • SR-IOV
        • < 30% overhead for message sizes M < 128 bytes
        • < 10% overhead for eager send/recv
        • Overhead → 0% in the bandwidth-limited regime
      • Amazon EC2
        • > 5000% worse latency
        • Time dependent (noisy)
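The per-message-size overhead figures on this slide can be derived from two `osu_latency` runs, one native and one under SR-IOV. The sketch below assumes the benchmark's usual text output of `#`-prefixed header lines followed by "size value" rows. The helper names `parse_osu` and `overhead_percent` are hypothetical, and the sample latencies are invented for illustration; they are not measurements from the slides.

```python
def parse_osu(text):
    """Map message size (bytes) -> reported value from OSU benchmark output."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        size, value = line.split()[:2]
        results[int(size)] = float(value)
    return results


def overhead_percent(native, virtualized):
    """Relative latency overhead (%) of the virtualized run per message size."""
    common = sorted(native.keys() & virtualized.keys())
    return {s: 100.0 * (virtualized[s] - native[s]) / native[s] for s in common}


# Hypothetical sample output; NOT data from the presentation.
NATIVE = """\
# OSU MPI Latency Test v3.9
# Size          Latency (us)
1                       1.00
128                     2.00
4096                    4.00
"""
SRIOV = """\
# OSU MPI Latency Test v3.9
# Size          Latency (us)
1                       1.25
128                     2.50
4096                    4.25
"""

for size, pct in overhead_percent(parse_osu(NATIVE), parse_osu(SRIOV)).items():
    print(f"{size:>6} bytes: {pct:5.1f}% overhead")
```

The same two functions apply to `osu_bw` output if the sign is flipped, since for bandwidth a lower virtualized value is the loss.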
    • 10x more bandwidth than Amazon EC2
      • Benchmark: OSU Microbenchmarks (3.9, osu_bw)
      • SR-IOV
        • < 2% bandwidth loss over entire range
        • > 95% peak bandwidth
      • Amazon EC2
        • < 35% peak bandwidth
        • 900% to 2500% worse bandwidth than virtualized InfiniBand
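The slides state the same gaps both as fold factors ("10x", "50x") and as percentages ("900%", "> 5000%"). A small sketch of the conversion follows, assuming "X% worse" means the relative difference measured against the better system; the slides do not define the term, so this reading and the function names are assumptions.

```python
def percent_worse_from_fold(fold):
    """A system that is `fold` times worse is (fold - 1) * 100 percent worse."""
    return (fold - 1.0) * 100.0


def fold_from_percent_worse(percent):
    """Inverse conversion: percent worse -> fold factor."""
    return 1.0 + percent / 100.0


print(percent_worse_from_fold(10))    # the "10x" bandwidth headline
print(fold_from_percent_worse(2500))  # upper end of the stated bandwidth range
```

Under this reading, the 900% lower bound matches the "10x" headline exactly, and the 2500% upper bound corresponds to a 26-fold gap.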
    • Weather Modeling: 15% Overhead
      • Benchmark: WRF 3.4.1, 3-hour forecast
      • 96-core (6-node) calculation
      • Nearest-neighbor communication
      • Scalable algorithms
      • SR-IOV incurs a modest (15%) performance hit
      • ...but is still 20% faster*** than Amazon
      • *** 20% faster despite the SR-IOV cluster having 20% slower CPUs
    • Quantum ESPRESSO: 5x Faster than EC2
      • Benchmark: Quantum ESPRESSO 5.0.2, DEISA AUSURF112
      • 48-core (3-node) calculation
      • CG matrix inversion (irregular communication)
      • 3D FFT matrix transposes (all-to-all communication)
      • 28% slower with SR-IOV
      • SR-IOV still > 500% faster*** than EC2
      • *** Despite the SR-IOV cluster having 20% slower CPUs
    • Conclusions
      • SR-IOV: a huge step forward in high-performance virtualization
        • Shows substantial improvement in latency over Amazon EC2 and provides nearly zero bandwidth overhead
        • Benchmark application performance confirms this: significant improvement over EC2
      • SR-IOV lowers the performance barrier to virtualizing the interconnect and makes fully virtualized HPC clusters viable
      • Comet will deliver virtualized HPC to new/non-traditional communities that need flexibility without major loss of performance