Programming Models for Exascale Systems

In this video from the 2016 Stanford HPC Conference, DK Panda from Ohio State University presents: Programming Models for Exascale Systems.

This talk will focus on programming models and their designs for upcoming exascale systems with millions of processors and accelerators. Current status and future trends of the MPI and PGAS (UPC and OpenSHMEM) programming models will be presented.

Programming Models for Exascale Systems

  1. Programming Models for Exascale Systems
     Keynote Talk at HPCAC-Stanford (Feb 2016) by Dhabaleswar K. (DK) Panda, The Ohio State University
     E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
  2. High-End Computing (HEC): ExaFlop & ExaByte
     • ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2020-2024?
     • ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?
     [Figure: excerpt from IDC's Digital Universe Study, sponsored by EMC, December 2012]
  3. Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
     [Chart: number and percentage of clusters vs. timeline; 85% highlighted]
  4. Drivers of Modern HPC Cluster Architectures (examples: Tianhe-2, Titan, Stampede, Tianhe-1A)
     • Multi-core/many-core technologies
     • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
     • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
     • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
     [Diagram labels: Multi-core Processors; High Performance Interconnects - InfiniBand (<1 usec latency, 100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM]
  5. Large-scale InfiniBand Installations
     • 235 IB clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)
     • Installations in the Top 50 (21 systems):
       462,462 cores (Stampede) at TACC (10th)
       185,344 cores (Pleiades) at NASA/Ames (13th)
       72,800 cores Cray CS-Storm in US (15th)
       72,800 cores Cray CS-Storm in US (16th)
       265,440 cores SGI ICE at Tulip Trading Australia (17th)
       124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
       72,000 cores (HPC2) in Italy (19th)
       152,692 cores (Thunder) at AFRL/USA (21st)
       147,456 cores (SuperMUC) in Germany (22nd)
       86,016 cores (SuperMUC Phase 2) in Germany (24th)
       76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
       194,616 cores (Cascade) at PNNL (27th)
       76,032 cores (Makman-2) at Saudi Aramco (32nd)
       110,400 cores (Pangea) in France (33rd)
       37,120 cores (Lomonosov-2) at Russia/MSU (35th)
       57,600 cores (SwifLucy) in US (37th)
       55,728 cores (Prometheus) at Poland/Cyfronet (38th)
       50,544 cores (Occigen) at France/GENCI-CINES (43rd)
       76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
       and many more!
  6. Towards Exascale System (Today and Target) - Courtesy: Prof. Jack Dongarra
     (2016 Tianhe-2 vs. 2020-2024 exascale target, with the difference between today and exascale)
     • System peak: 55 PFlop/s vs. 1 EFlop/s (~20x)
     • Power: 18 MW (3 Gflops/W) vs. ~20 MW (50 Gflops/W), O(1) (~15x)
     • System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) vs. 32-64 PB (~50x)
     • Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) vs. 1.2 or 15 TF, O(1)
     • Node concurrency: 24-core CPU + 171 cores CoP vs. O(1k) or O(10k) (~5x - ~50x)
     • Total node interconnect BW: 6.36 GB/s vs. 200-400 GB/s (~40x - ~60x)
     • System size (nodes): 16,000 vs. O(100,000) or O(1M) (~6x - ~60x)
     • Total concurrency: 3.12M (12.48M threads, 4/core) vs. O(billion) for latency hiding (~100x)
     • MTTI: few/day vs. many/day, O(?)
  7. Two Major Categories of Applications
     • Scientific Computing
       – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
       – Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
       – Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
     • Big Data/Enterprise/Commercial Computing
       – Focuses on large data and data analysis
       – Hadoop (HDFS, HBase, MapReduce)
       – Spark is emerging for in-memory computing
       – Memcached is also used for Web 2.0
  8. Parallel Programming Models Overview
     • Shared Memory Model (SHMEM, DSM): P1, P2, P3 share one memory
     • Distributed Memory Model (MPI - Message Passing Interface): P1, P2, P3 each with private memory
     • Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...): private memories plus a logical shared memory
     • Programming models provide abstract machine models
     • Models can be mapped onto different types of systems (e.g. Distributed Shared Memory (DSM), MPI within a node, etc.)
     • PGAS models and hybrid MPI+PGAS models are gradually gaining importance
  9. Partitioned Global Address Space (PGAS) Models
     • Key features
       - Simple shared-memory abstractions
       - Lightweight one-sided communication
       - Easier to express irregular communication
     • Different approaches to PGAS
       - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
       - Libraries: OpenSHMEM, Global Arrays
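     As an illustration of the light-weight one-sided communication mentioned above (a sketch added for clarity, not code from the slides; it assumes an OpenSHMEM 1.2-style library providing shmem_init/shmem_malloc), each PE below writes its rank directly into its right neighbor's symmetric variable with a put, with no matching receive posted on the target:

       /* Minimal OpenSHMEM one-sided put (illustrative sketch) */
       #include <stdio.h>
       #include <shmem.h>

       int main(void)
       {
           shmem_init();
           int me   = shmem_my_pe();
           int npes = shmem_n_pes();

           /* Symmetric allocation: valid at the same address on every PE */
           int *dest = (int *) shmem_malloc(sizeof(int));
           *dest = -1;
           shmem_barrier_all();

           /* One-sided put into the right neighbor; no receive is posted there */
           shmem_int_p(dest, me, (me + 1) % npes);

           shmem_barrier_all();   /* the barrier also completes outstanding puts */
           printf("PE %d received %d from its left neighbor\n", me, *dest);

           shmem_free(dest);
           shmem_finalize();
           return 0;
       }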
  10. Hybrid (MPI+PGAS) Programming
     • Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (e.g. within one HPC application, Kernel 1 in MPI, Kernel 2 in PGAS, ..., Kernel N in MPI or PGAS)
     • Benefits:
       – Best of the distributed computing model
       – Best of the shared memory computing model
     • Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"
       * The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
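     The hybrid style above can be sketched as follows (illustrative only, not from the slides; it assumes a unified runtime such as MVAPICH2-X that allows MPI and OpenSHMEM calls in one program, that MPI ranks and OpenSHMEM PEs coincide, and an OpenSHMEM 1.2-style API; initialization/finalization ordering requirements vary by runtime). A regular, bulk-synchronous kernel uses an MPI collective, while an irregular kernel uses a one-sided OpenSHMEM operation:

       /* Hybrid MPI + OpenSHMEM sketch (illustrative) */
       #include <stdio.h>
       #include <mpi.h>
       #include <shmem.h>

       int main(int argc, char **argv)
       {
           MPI_Init(&argc, &argv);
           shmem_init();                      /* assumes the runtime permits both models */

           int rank;
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);

           /* Kernel 1 (regular communication): MPI collective */
           long local = rank, sum = 0;
           MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

           /* Kernel 2 (irregular communication): one-sided OpenSHMEM atomic
              into a neighbor's symmetric counter */
           long *counter = (long *) shmem_malloc(sizeof(long));
           *counter = 0;
           shmem_barrier_all();
           shmem_long_add(counter, 1, (shmem_my_pe() + 1) % shmem_n_pes());
           shmem_barrier_all();

           if (rank == 0)
               printf("allreduce sum = %ld, counter = %ld\n", sum, *counter);

           shmem_free(counter);
           shmem_finalize();
           MPI_Finalize();
           return 0;
       }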
  11. Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
     [Layered diagram: Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.); Communication Library or Runtime for Programming Models (point-to-point communication, collective communication, synchronization and locks, I/O and file systems, fault tolerance, energy-awareness); Networking Technologies (InfiniBand, 40/100GigE, Aries, and OmniPath); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC)]
     • Co-design opportunities and challenges across the various layers: performance, scalability, fault-resilience
  12. Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale
     • Scalability for million to billion processors
       – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
       – Scalable job start-up
     • Scalable collective communication
       – Offload
       – Non-blocking
       – Topology-aware
     • Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
       – Multiple end-points per node
     • Support for efficient multi-threading
     • Integrated support for GPGPUs and accelerators
     • Fault-tolerance/resiliency
     • QoS support for communication and I/O
     • Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, ...)
     • Virtualization
     • Energy-awareness
  13. Additional Challenges for Designing Exascale Software Libraries
     • Extreme low memory footprint
       – Memory per core continues to decrease
     • D-L-A Framework
       – Discover
         • Overall network topology (fat-tree, 3D, ...), network topology of the processes for a given job
         • Node architecture, health of network and node
       – Learn
         • Impact on performance and scalability
         • Potential for failure
       – Adapt
         • Internal protocols and algorithms
         • Process mapping
         • Fault-tolerance solutions
       – Low-overhead techniques while delivering performance, scalability and fault-tolerance
  14. Overview of the MVAPICH2 Project
     • High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
       – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
       – MVAPICH2-X (MPI + PGAS), available since 2011
       – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
       – Support for Virtualization (MVAPICH2-Virt), available since 2015
       – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
       – Used by more than 2,525 organizations in 77 countries
       – More than 351,000 (> 0.35 million) downloads from the OSU site directly
       – Empowering many TOP500 clusters (Nov '15 ranking):
         • 10th ranked 519,640-core cluster (Stampede) at TACC
         • 13th ranked 185,344-core cluster (Pleiades) at NASA
         • 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
       – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
       – http://mvapich.cse.ohio-state.edu
     • Empowering Top500 systems for over a decade
       – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
       – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
  15. MVAPICH2 Architecture
     • High-performance parallel programming models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++*); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
     • High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
     • Support for modern networking technology (InfiniBand, iWARP, RoCE, OmniPath) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)
       – Transport protocols: RC, XRC, UD, DC; modern features: UMR, ODP*, SR-IOV, multi-rail
       – Transport mechanisms: shared memory, CMA, IVSHMEM; modern features: MCDRAM*, NVLink*, CAPI*
     (* - upcoming)
  16. Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
     • Scalability for million to billion processors
       – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
       – Support for advanced IB mechanisms (UMR and ODP)
       – Extremely minimal memory footprint
       – Scalable job start-up
     • Collective communication
     • Integrated support for GPGPUs
     • Integrated support for MICs
     • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
     • Virtualization
     • Energy-awareness
     • InfiniBand Network Analysis and Monitoring (INAM)
  17. One-way Latency: MPI over IB with MVAPICH2
     [Charts: small- and large-message latency (us) vs. message size (bytes); highlighted small-message latencies between 0.95 and 1.26 us across the four adapters (values shown: 1.26, 1.19, 0.95, 1.15)]
     • Test beds: TrueScale-QDR, ConnectX-3-FDR and ConnectIB-Dual FDR on 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3 with IB switch; ConnectX-4-EDR on 2.8 GHz deca-core (Haswell) Intel, PCI Gen3, back-to-back
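     Latency figures like the ones above come from OSU micro-benchmark-style ping-pong tests. A minimal sketch of such a measurement (illustrative; the real osu_latency adds warm-up iterations, a sweep of message sizes, and more careful reporting) is:

       /* Ping-pong one-way latency sketch; run with exactly 2 ranks */
       #include <stdio.h>
       #include <stdlib.h>
       #include <mpi.h>

       #define ITERS 1000

       int main(int argc, char **argv)
       {
           int rank, bytes = 8;                 /* message size under test */
           MPI_Init(&argc, &argv);
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
           char *buf = (char *) malloc(bytes);

           MPI_Barrier(MPI_COMM_WORLD);
           double t0 = MPI_Wtime();
           for (int i = 0; i < ITERS; i++) {
               if (rank == 0) {
                   MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                   MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
               } else if (rank == 1) {
                   MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                   MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
               }
           }
           double t1 = MPI_Wtime();

           if (rank == 0)   /* one-way latency = half of the round-trip time */
               printf("%d bytes: %.2f us\n", bytes, (t1 - t0) * 1e6 / (2.0 * ITERS));

           free(buf);
           MPI_Finalize();
           return 0;
       }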
  18. Bandwidth: MPI over IB with MVAPICH2
     [Charts: unidirectional bandwidth (MBytes/sec) vs. message size, peaking between 3387 and 12465 MBytes/sec depending on the adapter (values shown: 3387, 6356, 12104, 12465); bidirectional bandwidth peaking between 6308 and 24353 MBytes/sec (values shown: 6308, 12161, 21425, 24353)]
     • Test beds: same TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR (IvyBridge) and ConnectX-4-EDR (Haswell) PCI Gen3 platforms as the latency results
  19. MVAPICH2 Two-Sided Intra-Node Performance (shared memory and kernel-based zero-copy support (LiMIC and CMA))
     [Charts: latency (us) and bandwidth (MB/s) vs. message size on Intel Ivy Bridge with the latest MVAPICH2 2.2b; intra-socket latency 0.18 us, inter-socket latency 0.45 us; intra- and inter-socket bandwidth curves compare CMA, shared memory and LiMIC channels, with peaks of 14,250 MB/s and 13,749 MB/s]
  20. User-mode Memory Registration (UMR)
     • Introduced by Mellanox to support direct local and remote non-contiguous memory access
       – Avoids packing at the sender and unpacking at the receiver
     • Available with MVAPICH2-X 2.2b
     [Charts: small/medium and large message latency (us) vs. message size, UMR vs. default]
     • Test bed: Connect-IB (54 Gbps), 2.8 GHz dual ten-core (IvyBridge) Intel, PCI Gen3 with Mellanox IB FDR switch
     • M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015
  21. On-Demand Paging (ODP)
     • Introduced by Mellanox to support direct remote memory access without pinning
     • Memory regions paged in/out dynamically by the HCA/OS
     • Size of registered buffers can be larger than physical memory
     • Will be available in the upcoming MVAPICH2-X 2.2 RC1
     [Charts: Graph500 pin-down buffer sizes (MB) and BFS kernel execution time (s) at 16, 32 and 64 processes, pin-down vs. ODP]
     • Test bed: Connect-IB (54 Gbps), 2.6 GHz dual octa-core (SandyBridge) Intel, PCI Gen3 with Mellanox IB FDR switch
  22. Minimizing Memory Footprint by Direct Connect (DC) Transport
     • Constant connection cost (one QP for any peer)
     • Full feature set (RDMA, atomics, etc.)
     • Separate objects for send (DC Initiator) and receive (DC Target)
       – DC Target identified by "DCT Number"
       – Messages routed with (DCT Number, LID)
       – Requires the same "DC Key" to enable communication
     • Available since MVAPICH2-X 2.2a
     [Diagram: four nodes with two processes each over an IB network. Charts: connection memory (KB) for Alltoall and normalized NAMD Apoa1 (large data set) execution time vs. number of processes, comparing RC, DC-Pool, UD and XRC]
     • H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
  23. Towards High Performance and Scalable Startup at Exascale
     • Near-constant MPI and OpenSHMEM initialization time at any process count
     • 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
     • Memory consumption for remote endpoint information reduced by O(processes per node)
     • 1 GB memory saved per node with 1M processes and 16 processes per node
     [Diagram: job startup performance and memory required to store endpoint information, contrasting state-of-the-art PGAS and MPI startup with the optimized design (PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, shmem-based PMI, on-demand connection)]
     • On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
     • PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
     • Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
     • SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
  24. Process Management Interface over Shared Memory (SHMEMPMI)
     • SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
     • Only a single copy per node - O(processes per node) reduction in memory usage
     • Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
     • Up to 1,000 times faster PMI Gets compared to the default design; will be available in MVAPICH2 2.2RC1
     [Charts: time taken by one PMI_Get vs. processes per node (default vs. SHMEMPMI) and memory usage per node for remote EP information vs. processes per job (Fence/Allgather, default vs. shmem); annotations: estimated 1000x, actual 16x]
     • Test bed: TACC Stampede - Connect-IB (54 Gbps), 2.6 GHz quad octa-core (SandyBridge) Intel, PCI Gen3 with Mellanox IB FDR
     • SHMEMPMI - Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
  25. Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
     • Scalability for million to billion processors
     • Collective communication
       – Offload and non-blocking
       – Topology-aware
     • Integrated support for GPGPUs
     • Integrated support for MICs
     • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
     • Virtualization
     • Energy-awareness
     • InfiniBand Network Analysis and Monitoring (INAM)
  26. Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload (CORE-Direct hardware; available since MVAPICH2-X 2.2a)
     • Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
     • Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
     • Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version
     [Charts: application run-time and normalized performance vs. data size, number of processes and HPL problem size (N) as % of total memory, comparing default and offload-based designs]
     • K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
     • K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
     • K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
     • Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
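     The usage pattern that collective offload is meant to accelerate is the MPI-3 non-blocking collective: start the collective, compute independent work, then complete it. A minimal sketch (illustrative, standard MPI-3 calls only; buffer sizes are arbitrary) is:

       /* Overlapping computation with a non-blocking all-to-all */
       #include <stdio.h>
       #include <stdlib.h>
       #include <mpi.h>

       int main(int argc, char **argv)
       {
           int rank, nprocs, count = 1024;
           MPI_Init(&argc, &argv);
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
           MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

           double *sendbuf = malloc((size_t) count * nprocs * sizeof(double));
           double *recvbuf = malloc((size_t) count * nprocs * sizeof(double));
           for (int i = 0; i < count * nprocs; i++) sendbuf[i] = rank;

           MPI_Request req;
           /* Start the all-to-all without blocking the caller */
           MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                         recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

           /* Overlap: computation that does not depend on recvbuf */
           double local = 0.0;
           for (int i = 0; i < count * nprocs; i++) local += sendbuf[i] * 0.5;

           /* Complete the collective before touching recvbuf */
           MPI_Wait(&req, MPI_STATUS_IGNORE);

           if (rank == 0) printf("overlapped work done, local = %f\n", local);
           free(sendbuf); free(recvbuf);
           MPI_Finalize();
           return 0;
       }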
  27. Network-Topology-Aware Placement of Processes
     • Can we design a highly scalable network topology detection service for IB?
     • How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
     • What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
     • Reduced network topology discovery time from O(N_hosts^2) to O(N_hosts)
     • 15% improvement in MILC execution time @ 2048 cores
     • 15% improvement in Hypre execution time @ 1024 cores
     [Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; default vs. topology-aware placement for a 2048-core run]
     • H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. Best Paper and Best Student Paper Finalist
  28. Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
     • Scalability for million to billion processors
     • Collective communication
     • Integrated support for GPGPUs
       – CUDA-aware MPI
       – GPUDirect RDMA (GDR) support
       – CUDA-aware non-blocking collectives
       – Support for managed memory
       – Efficient datatype processing
     • Integrated support for MICs
     • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
     • Virtualization
     • Energy-awareness
     • InfiniBand Network Analysis and Monitoring (INAM)
  29. MPI + CUDA - Naive
     • Data movement in applications with standard MPI and CUDA interfaces: data is staged through host memory on the GPU-CPU-NIC path over PCIe
     • High productivity and low performance
       At Sender:
         cudaMemcpy(s_hostbuf, s_devbuf, . . .);
         MPI_Send(s_hostbuf, size, . . .);
       At Receiver:
         MPI_Recv(r_hostbuf, size, . . .);
         cudaMemcpy(r_devbuf, r_hostbuf, . . .);
  30. MPI + CUDA - Advanced
     • Pipelining at user level with non-blocking MPI and CUDA interfaces
     • Low productivity and high performance
       At Sender:
         for (j = 0; j < pipeline_len; j++)
           cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);
         for (j = 0; j < pipeline_len; j++) {
           while (result != cudaSuccess) {
             result = cudaStreamQuery(…);
             if (j > 0) MPI_Test(…);
           }
           MPI_Isend(s_hostbuf + j * blksz, blksz, . . .);
         }
         MPI_Waitall();
       <<Similar at receiver>>
  31. GPU-Aware MPI Library: MVAPICH2-GPU
     • Standard MPI interfaces used for unified data movement
     • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
     • Overlaps data movement from the GPU with RDMA transfers; the pipelining is handled inside MVAPICH2
     • High performance and high productivity
       At Sender:   MPI_Send(s_devbuf, size, …);
       At Receiver: MPI_Recv(r_devbuf, size, …);
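     Putting the two calls above into a complete program (an illustrative sketch, not from the slides) shows the productivity gain: with a CUDA-aware build such as MVAPICH2-GPU/MVAPICH2-GDR, device pointers are passed straight to MPI and the staging/pipelining happens inside the library. The buffer size here is arbitrary, and enabling the GPU path at run time (e.g. the MV2_USE_CUDA=1 parameter listed later in this deck) is assumed:

       /* CUDA-aware MPI sketch: send/receive directly from GPU memory */
       #include <stdio.h>
       #include <mpi.h>
       #include <cuda_runtime.h>

       int main(int argc, char **argv)
       {
           int rank;
           const int n = 1 << 20;                       /* 1M floats */
           float *devbuf;

           MPI_Init(&argc, &argv);
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
           cudaMalloc((void **) &devbuf, n * sizeof(float));

           if (rank == 0) {
               cudaMemset(devbuf, 0, n * sizeof(float));
               MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);    /* device pointer */
           } else if (rank == 1) {
               MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
               printf("rank 1 received %d floats directly into GPU memory\n", n);
           }

           cudaFree(devbuf);
           MPI_Finalize();
           return 0;
       }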
  32. GPU-Direct RDMA (GDR) with CUDA
     • OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
     • OSU has a design of MVAPICH2 using GPUDirect RDMA
       – Hybrid design using GPUDirect RDMA and host-based pipelining
         • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
       – Support for communication using multi-rail
       – Support for Mellanox Connect-IB and ConnectX VPI adapters
       – Support for RoCE with Mellanox ConnectX VPI adapters
     [Diagram: GPU memory, GPU, CPU, chipset, IB adapter and system memory; P2P bandwidths: SNB E5-2670 - 5.2 GB/s write, < 1.0 GB/s read; IVB E5-2680V2 - 6.4 GB/s write, 3.5 GB/s read]
  33. CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
     • Support for MPI communication from NVIDIA GPU device memory
     • High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
     • High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
     • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
     • Optimized and tuned collectives for GPU device buffers
     • MPI datatype support for point-to-point and collective communication from GPU device buffers
  34. Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
     [Charts: GPU-GPU inter-node latency, bandwidth and bi-bandwidth vs. message size, comparing MV2-GDR 2.2b, MV2-GDR 2.0b and MV2 without GDR; highlights include a 2.18 us latency and roughly 2x-11x improvements over MV2 without GDR]
     • Test bed: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct RDMA
  35. Application-Level Evaluation (HOOMD-blue)
     • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
     • HoomdBlue version 1.0.5
     • GDRCOPY enabled with the runtime parameters: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
     [Charts: average time steps per second (TPS) vs. number of processes for 64K and 256K particles; MV2+GDR achieves roughly 2x the TPS of MV2]
  36. CUDA-Aware Non-Blocking Collectives
     • Available since MVAPICH2-GDR 2.2a
     [Charts: medium/large message overlap (%) vs. message size on 64 GPU nodes for Igather and Ialltoall, with 1 process/node and 2 processes/node (1 process/GPU)]
     • Platform: Wilkes - Intel Ivy Bridge, NVIDIA Tesla K20c, Mellanox Connect-IB
     • A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC, 2015
  37. Communication Runtime with GPU Managed Memory
     • CUDA 6.0: NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
     • Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
     • Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b)
     • OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers; available in OMB 5.2
     [Chart: 2D stencil halo-exchange time (ms) vs. total dimension size for halo width = 1, device vs. managed buffers]
     • D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, to be held in conjunction with PPoPP '16
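     A sketch of what "communication directly from managed buffers" looks like at the application level (illustrative, not from the slides; it assumes a managed-memory-aware MPI such as the MVAPICH2-GDR 2.2b release described above, and the buffer size is arbitrary):

       /* MPI send/receive from a CUDA managed (unified) allocation */
       #include <stdio.h>
       #include <mpi.h>
       #include <cuda_runtime.h>

       int main(int argc, char **argv)
       {
           int rank;
           const int n = 4096;
           float *buf;

           MPI_Init(&argc, &argv);
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);

           /* One allocation usable from both CPU and GPU */
           cudaMallocManaged((void **) &buf, n * sizeof(float), cudaMemAttachGlobal);

           if (rank == 0) {
               for (int i = 0; i < n; i++) buf[i] = (float) i;    /* written on the CPU */
               MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); /* managed pointer to MPI */
           } else if (rank == 1) {
               MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
               printf("rank 1: buf[42] = %.1f\n", buf[42]);
           }

           cudaFree(buf);
           MPI_Finalize();
           return 0;
       }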
  38. MPI Datatype Processing (Communication Optimization)
     • Common scenario: a sequence of MPI_Isend calls on buffers A, B, C, D, ... that contain non-contiguous MPI datatypes, followed by MPI_Waitall
       MPI_Isend(A, ..., Datatype, ...);
       MPI_Isend(B, ..., Datatype, ...);
       MPI_Isend(C, ..., Datatype, ...);
       MPI_Isend(D, ..., Datatype, ...);
       ...
       MPI_Waitall(...);
     • Existing design: each Isend initiates a packing kernel on the GPU and then waits for the kernel (WFK) before the send can start, wasting computing resources on both CPU and GPU
     • Proposed design: kernels are launched on streams and the sends make progress without blocking on kernel completion, so the expected benefit is an earlier finish than the existing design
     [Timeline diagram: CPU progress and GPU kernels on streams for the existing vs. proposed design]
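     For readers unfamiliar with non-contiguous MPI datatypes, the following host-memory sketch (illustrative, plain MPI; not from the slides) shows the kind of strided layout whose packing and unpacking the optimization above moves onto GPU streams when the buffers live in device memory:

       /* Send one column of a row-major N x N matrix with MPI_Type_vector */
       #include <stdio.h>
       #include <mpi.h>

       #define N 8

       int main(int argc, char **argv)
       {
           int rank;
           double matrix[N][N], column[N];
           MPI_Datatype col_t;

           MPI_Init(&argc, &argv);
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);

           /* N blocks of 1 double, separated by a stride of N doubles = one column */
           MPI_Type_vector(N, 1, N, MPI_DOUBLE, &col_t);
           MPI_Type_commit(&col_t);

           if (rank == 0) {
               for (int i = 0; i < N; i++)
                   for (int j = 0; j < N; j++)
                       matrix[i][j] = i * N + j;
               MPI_Send(&matrix[0][2], 1, col_t, 1, 0, MPI_COMM_WORLD);   /* column 2 */
           } else if (rank == 1) {
               MPI_Recv(column, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
               printf("rank 1: column[3] = %.0f (expected 26)\n", column[3]);
           }

           MPI_Type_free(&col_t);
           MPI_Finalize();
           return 0;
       }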
  39. Application-Level Evaluation (HaloExchange - Cosmo)
     • 2x improvement on 32 GPU nodes
     • 30% improvement on 96 GPU nodes (8 GPUs/node)
     [Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster and the Wilkes GPU cluster, comparing default, callback-based and event-based designs]
     • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
  40. Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
     • Scalability for million to billion processors
     • Collective communication
     • Integrated support for GPGPUs
     • Integrated support for MICs
     • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
     • Virtualization
     • Energy-awareness
     • InfiniBand Network Analysis and Monitoring (INAM)
  41. MPI Applications on MIC Clusters
     • Flexibility in launching MPI jobs on clusters with Xeon Phi
     [Diagram: a spectrum from Xeon (multi-core centric) to Xeon Phi (many-core centric) with four modes - Host-only, Offload (/reverse offload), Symmetric and Coprocessor-only - showing where the MPI program and offloaded computation run]
  42. MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
     • Offload mode
     • Intra-node communication: coprocessor-only and symmetric modes
     • Inter-node communication: coprocessor-only and symmetric modes
     • Multi-MIC node configurations
     • Running on three major systems: Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
  43. MIC-Remote-MIC P2P Communication with Proxy-based Communication
     [Charts: intra-socket and inter-socket P2P latency (large messages, usec) and bandwidth (MB/sec) vs. message size; peak bandwidths of 5236 and 5594 MB/sec]
  44. Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
     [Charts: 32-node Allgather small- and large-message latency (16H + 16M and 8H + 8M), 32-node Alltoall large-message latency (8H + 8M), and P3DFFT communication/computation time on 32 nodes (8H + 8M, size 2K*2K*1K), comparing MV2-MIC and MV2-MIC-Opt; improvements of 76%, 58% and 55% highlighted]
     • A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
  45. Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
     • Scalability for million to billion processors
     • Collective communication
     • Integrated support for GPGPUs
     • Integrated support for MICs
     • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
     • Virtualization
     • Energy-awareness
     • InfiniBand Network Analysis and Monitoring (INAM)
  46. MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications
     • MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC and CAF calls over the unified MVAPICH2-X runtime on InfiniBand, RoCE and iWARP
     • Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 (2012) onwards!
     • UPC++ support will be available in the upcoming MVAPICH2-X 2.2RC1
     • Feature highlights
       – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC + CAF
       – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
       – Scalable inter-node and intra-node communication - point-to-point and collectives
  47. Application-Level Performance with Graph500 and Sort
     • Performance of hybrid (MPI+OpenSHMEM) Graph500 design
       – 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
       – 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple
     • Performance of hybrid (MPI+OpenSHMEM) Sort application
       – 4,096 processes, 4 TB input size: MPI - 2408 sec (0.16 TB/min); Hybrid - 1172 sec (0.36 TB/min); 51% improvement over the MPI design
     [Charts: Graph500 execution time vs. number of processes (MPI-Simple, MPI-CSC, MPI-CSR, Hybrid) and Sort execution time vs. input data and number of processes (MPI vs. Hybrid)]
     • J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
     • J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
     • J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
  48. MiniMD - Total Execution Time
     • Hybrid design performs better than the MPI implementation
     • 1,024 processes: 17% improvement over the MPI version
     • Strong scaling, input size 128 * 128 * 128
     [Charts: execution time (ms) vs. number of cores for Hybrid-Barrier, MPI-Original and Hybrid-Advanced]
     • M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)
  49. Hybrid MPI+UPC NAS-FT
     • Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
     • Truly hybrid program
     • For FT (Class C, 128 processes): 34% improvement over UPC-GASNet, 30% improvement over UPC-OSU
     • Hybrid MPI + UPC support available since MVAPICH2-X 1.9 (2012)
     [Chart: time (s) for NAS problem size and system size B-64, C-64, B-128, C-128, comparing UPC-GASNet, UPC-OSU and Hybrid-OSU]
     • J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010
  50. Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
     • Scalability for million to billion processors
     • Collective communication
     • Integrated support for GPGPUs
     • Integrated support for MICs
     • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
     • Virtualization
     • Energy-awareness
     • InfiniBand Network Analysis and Monitoring (INAM)
  51. Can HPC and Virtualization be Combined?
     • Virtualization has many benefits
       – Fault-tolerance
       – Job migration
       – Compaction
     • It has not been very popular in HPC due to the overhead associated with virtualization
     • New SR-IOV (Single Root - I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
     • Enhanced MVAPICH2 support for SR-IOV
     • MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available
     • J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14
     • J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
     • J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15
  52. Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM
     • Redesign MVAPICH2 to make it virtual-machine aware
       – SR-IOV shows near-to-native performance for inter-node point-to-point communication
       – IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
       – Locality Detector: maintains the locality information of co-resident virtual machines
       – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively
     [Diagram: host environment with hypervisor, PF driver and InfiniBand adapter (physical function); two guests, each with an MPI process and VF driver, connected through the SR-IOV channel (virtual functions) and the IV-Shmem channel (/dev/shm IV-SHM)]
     • J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
     • J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC, 2014
  53. MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack
     • OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
     • Deployment with OpenStack
       – Supporting SR-IOV configuration
       – Supporting IVSHMEM configuration
       – Virtual-machine-aware design of MVAPICH2 with SR-IOV
     • An efficient approach to build HPC clouds with MVAPICH2-Virt and OpenStack
     [Diagram: OpenStack services around a VM - Nova (provisions), Glance (provides images), Swift (stores images, backs up volumes), Neutron (provides network), Keystone (provides auth), Cinder (provides volumes), Heat (orchestrates cloud), Ceilometer (monitors), Horizon (provides UI)]
     • J. Zhang, X. Lu, M. Arnold, D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid, 2015
  54. Application-Level Performance on Chameleon
     • 32 VMs, 6 cores/VM
     • Compared to native, 2-5% overhead for Graph500 with 128 processes
     • Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 processes
     [Charts: Graph500 execution time (ms) for varying problem sizes (scale, edgefactor) and SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2 and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native]
  55. NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument
     • Large-scale instrument
       – Targeting Big Data, Big Compute, Big Instrument research
       – ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with a 100G network
     • Reconfigurable instrument
       – Bare-metal reconfiguration, operated as a single instrument, graduated approach for ease of use
     • Connected instrument
       – Workload and Trace Archive
       – Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
       – Partnerships with users
     • Complementary instrument
       – Complementing GENI, Grid'5000, and other testbeds
     • Sustainable instrument
       – Industry connections
     • http://www.chameleoncloud.org/
  56. Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
     • Scalability for million to billion processors
     • Collective communication
     • Integrated support for GPGPUs
     • Integrated support for MICs
     • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
     • Virtualization
     • Energy-awareness
     • InfiniBand Network Analysis and Monitoring (INAM)
  57. Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)
     • MVAPICH2-EA 2.1 (Energy-Aware)
       – A white-box approach
       – New energy-efficient communication protocols for point-to-point and collective operations
       – Intelligently applies the appropriate energy-saving techniques
       – Application-oblivious energy saving
     • OEMT
       – A library utility to measure energy consumption for MPI applications
       – Works with all MPI runtimes
       – PRELOAD option for precompiled applications
       – Does not require ROOT permission: a safe kernel module to read only a subset of MSRs
  58. MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)
     • An energy-efficient runtime that provides energy savings without application knowledge
     • Uses the best energy lever automatically and transparently
     • Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
     • Pessimistic MPI applies an energy-reduction lever to each MPI call
     • A Case for Application-Oblivious Energy-Efficient MPI Runtime. A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
  59. Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
     • Scalability for million to billion processors
     • Collective communication
     • Integrated support for GPGPUs
     • Integrated support for MICs
     • Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
     • Virtualization
     • Energy-awareness
     • InfiniBand Network Analysis and Monitoring (INAM)
  60. Overview of OSU INAM
     • OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
     • Major features of the OSU INAM tool include:
       – Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
       – Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (point-to-point, collectives and RMA)
       – Remotely monitor CPU utilization of MPI processes at user-specified granularity
       – Visualize the data transfer happening in a "live" fashion - Live View for:
         • Entire network - live network-level view
         • Particular job - live job-level view
         • One or multiple nodes - live node-level view
       – Capability to visualize data transfer that happened in the network over a time duration in the past:
         • Entire network - historical network-level view
         • Particular job - historical job-level view
         • One or multiple nodes - historical node-level view
  61. OSU INAM - Network Level View
     • Show network topology of large clusters
     • Visualize traffic pattern on different links
     • Quickly identify congested links/links in error state
     • See the history unfold - play back historical state of the network
     [Screenshots: full network (152 nodes) and zoomed-in view of the network]
  62. OSU INAM - Job and Node Level Views
     • Job-level view
       – Show different network metrics (load, error, etc.) for any live job
       – Play back historical data for completed jobs to identify bottlenecks
     • Node-level view provides details per process or per node
       – CPU utilization for each rank/node
       – Bytes sent/received for MPI operations (point-to-point, collective, RMA)
       – Network metrics (e.g. XmitDiscard, RcvError) per rank/node
     [Screenshots: visualizing a job (5 nodes) and finding routes between nodes]
  63. MVAPICH2 - Plans for Exascale
     • Performance and memory scalability toward 1M cores
     • Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
       – Support for task-based parallelism (UPC++)
     • Enhanced optimization for GPU support and accelerators
     • Taking advantage of advanced features
       – User-mode Memory Registration (UMR)
       – On-Demand Paging (ODP)
     • Enhanced inter-node and intra-node communication schemes for upcoming OmniPath and Knights Landing architectures
     • Extended RMA support (as in MPI 3.0)
     • Extended topology-aware collectives
     • Energy-aware point-to-point (one-sided and two-sided) and collectives
     • Extended support for MPI Tools Interface (as in MPI 3.0)
     • Extended checkpoint-restart and migration support with SCR
  64. Looking into the Future ...
     • Exascale systems will be constrained by
       – Power
       – Memory per core
       – Data movement cost
       – Faults
     • Programming models and runtimes for HPC need to be designed for
       – Scalability
       – Performance
       – Fault-resilience
       – Energy-awareness
       – Programmability
       – Productivity
     • Highlighted some of the issues and challenges
     • Need continuous innovation on all these fronts
  65. Funding Acknowledgments
     [Logos: funding support by and equipment support by various sponsors]
  66. Personnel Acknowledgments
     • Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborty (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), K. Kulkarni (M.S.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
     • Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
     • Current Research Scientists / Current Senior Research Associate: K. Hamidouche, H. Subramoni, X. Lu
     • Past Research Scientist: S. Sur
     • Current Post-Docs: J. Lin, D. Banerjee
     • Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
     • Current Programmer: J. Perkins
     • Past Programmers: D. Bureddy
     • Current Research Specialist: M. Arnold
  67. International Workshop on Communication Architectures at Extreme Scale (ExaComm)
     • ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15) at Frankfurt, Germany, on Thursday, July 16th, 2015
       – One keynote talk: John M. Shalf, CTO, LBL/NERSC
       – Four invited talks: Dror Goldenberg (Mellanox), Martin Schulz (LLNL), Cyriel Minkenberg (IBM-Zurich), Arthur (Barney) Maccabe (ORNL)
       – Panel: Ron Brightwell (Sandia)
       – Two research papers
     • ExaComm 2016 will be held in conjunction with ISC '16
       – http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
       – Technical paper submission deadline: Friday, April 15, 2016
  68. Thank You!
     panda@cse.ohio-state.edu
     Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
     The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
     The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
