In this deck from the 2016 HPC Advisory Council Switzerland Conference, DK Panda from Ohio State University presents: High-Performance and Scalable Designs of Programming Models for Exascale Systems.
"This talk will focus on challenges in designing runtime environments for Exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPUs and Intel MIC) and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video presentation: http://wp.me/p3RLHQ-f7c
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Welcome to the 2016 HPC Advisory Council Switzerland Conference
This document contains the agenda for the HPC Advisory Council Swiss Conference 2016, which will take place in March. It provides details on keynote speakers, tutorials, and best practices sessions covering topics like deep learning, programming models, containers, and more. Sponsor and exhibitor information is also included.
How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
"This talk will start with an overview of challenges being faced by the AI community to achieve high-performance, scalable and distributed DNN training on Modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions will include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning Stacks with High-Performance MPI, 3) Out-of- core DNN training, and 4) Hybrid (Data and Model) parallelism. Case studies to accelerate DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented."
Watch the video: https://youtu.be/LeUNoKZVuwQ
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
CUDA-Python and RAPIDS for blazing fast scientific computing
In this deck from the ECSS Symposium, Abe Stern from NVIDIA presents: CUDA-Python and RAPIDS for blazing fast scientific computing.
"We will introduce Numba and RAPIDS for GPU programming in Python. Numba allows us to write just-in-time compiled CUDA code in Python, giving us easy access to the power of GPUs from a powerful high-level language. RAPIDS is a suite of tools with a Python interface for machine learning and dataframe operations. Together, Numba and RAPIDS represent a potent set of tools for rapid prototyping, development, and analysis for scientific computing. We will cover the basics of each library and go over simple examples to get users started. Finally, we will briefly highlight several other relevant libraries for GPU programming."
Watch the video: https://wp.me/p3RLHQ-lvu
Learn more: https://developer.nvidia.com/rapids
and
https://www.xsede.org/for-users/ecss/ecss-symposium
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
TAU Performance System and the Extreme-scale Scientific Software Stack (E4S) aim to improve productivity for HPC and AI workloads. TAU provides a portable performance evaluation toolkit, while E4S delivers modular and interoperable software stacks. Together, they lower barriers to using software tools from the Exascale Computing Project and enable performance analysis of complex, multi-component applications.
Axel Koehler from Nvidia presented this deck at the 2016 HPC Advisory Council Switzerland Conference.
“Accelerated computing is transforming the data center that delivers unprecedented throughput, enabling new discoveries and services for end users. This talk will give an overview about the NVIDIA Tesla accelerated computing platform including the latest developments in hardware and software. In addition it will be shown how deep learning on GPUs is changing how we use computers to understand data.”
In related news, the GPU Technology Conference takes place April 4-7 in Silicon Valley.
Watch the video presentation: http://insidehpc.com/2016/03/tesla-accelerated-computing/
See more talks in the Swiss Conference Video Gallery:
http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter:
http://insidehpc.com/newsletter
In this deck, Paul Isaacs from Linaro presents: State of ARM-based HPC. This talk provides an overview of applications and infrastructure services successfully ported to AArch64 and benefiting from scale.
"With its debut on the TOP500, the 125,000-core Astra supercomputer at New Mexico’s Sandia Labs uses Cavium ThunderX2 chips to mark Arm’s entry into the petascale world. In Japan, the Fujitsu A64FX Arm-based CPU in the pending Fugaku supercomputer has been optimized to achieve high-level, real-world application performance, anticipating up to one hundred times the application execution performance of the K computer. K was the first computer to top 10 petaflops in 2011."
Watch the video: https://wp.me/p3RLHQ-lIT
Learn more: https://www.linaro.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses how HPC infrastructure is being transformed with AI. It summarizes that cognitive systems use distributed deep learning across HPC clusters to speed up training times. It also outlines IBM's hardware portfolio expansion for AI training, inference, and storage capabilities. The document discusses software stacks for AI like Watson Machine Learning Community Edition that use containers and universal base images to simplify deployment.
High-Performance and Scalable Designs of Programming Models for Exascale Systems
- The document discusses programming models and challenges for exascale systems. It focuses on MPI and PGAS models like OpenSHMEM.
- Key challenges include supporting hybrid MPI+PGAS programming, efficient communication for multi-core and accelerator nodes, fault tolerance, and extreme low memory usage.
- The MVAPICH2 project aims to address these challenges through its high performance MPI and PGAS implementation and optimization of communication for technologies like InfiniBand.
Preparing to program Aurora at Exascale - Early experiences and future directions
In this deck from IWOCL / SYCLcon 2020, Hal Finkel from Argonne National Laboratory presents: Preparing to program Aurora at Exascale - Early experiences and future directions.
"Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms. Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long-established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively-new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather, will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.
This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success. Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems."
Watch the video: https://wp.me/p3RLHQ-lPT
Learn more: https://www.iwocl.org/iwocl-2020/conference-program/
and
https://www.anl.gov/topic/aurora
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses strategies for improving application performance on POWER9 processors using IBM XL and open source compilers. It reviews key POWER9 features and outlines common bottlenecks like branches, register spills, and memory issues. It provides guidelines on using compiler options and coding practices to address these bottlenecks, such as unrolling loops, inlining functions, and prefetching data. Tools like perf are also described for analyzing performance bottlenecks.
IBM provides infrastructure to accelerate medical research tasks like genomics, molecular simulation, diagnostics, and quality inspection. This infrastructure delivers faster insights through high-performance data and AI deployed at massive scale on IBM Power Systems and Storage. Case studies show the infrastructure reduces time to results for tasks like processing millions of cryogenic electron microscope images from days to hours.
Increasing Cluster Performance by Combining rCUDA with Slurm
Federico Silla from the Technical University of Valencia presented this deck at the Switzerland HPC Conference.
"Graphics Processing Units (GPUs) are currently used in data centers to reduce the execution time of compute-intensive applications. However, the use of GPUs presents several side ef- fects, such as increased acquisition costs as well as larger space requirements. Furthermore, GPUs require a non-negligible amount of energy even while idle. Additionally, GPU utilization is usually low for most applications. The use of virtual GPUs may address these concerns. In this regard, the remote GPU virtualization mechanism could be leveraged to share the GPUs present in the computing facility among the nodes of the cluster. This would increase overall GPU utilization, thus reducing the negative impact of the increased costs mentioned before. Reducing the amount of GPUs installed in the cluster could also be possible. In this talk
we present the remote GPU virtualization mechanism using as case study the performance attained by a cluster using the rCUDA middleware and a modified version of the Slurm sched- uler, which is able to map remote virtual GPUs to jobs. By leveraging rCUDA+Slurm, cluster throughput, measured as jobs completed per time unit, is doubled at the same time that total energy consumption is reduced up to 40%. GPU utilization is also increased."
Watch the video presentation: https://www.youtube.com/watch?v=yQWiQiyFpAg
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the 2018 Swiss HPC Conference, DK Panda from Ohio State University presents: Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark, and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project will be shown. Benefits of these stacks to accelerate deep learning frameworks (such as CaffeOnSpark and TensorFlowOnSpark) will be presented."
Watch the video: https://wp.me/p3RLHQ-iko
Learn more: http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from FOSDEM 2020, Colin Sauze from Aberystwyth University describes the development of a Raspberry Pi cluster for teaching an introduction to HPC.
"The motivation for this was to overcome four key problems faced by new HPC users:
* The availability of a real HPC system and the effect running training courses can have on the real system, conversely the availability of spare resources on the real system can cause problems for the training course.
* A fear of using a large and expensive HPC system for the first time and worries that doing something wrong might damage the system.
* That HPC systems are very abstract systems sitting in data centres that users never see, it is difficult for them to understand exactly what it is they are using.
* That new users fail to understand resource limitations, in part because of the vast resources in modern HPC systems a lot of mistakes can be made before running out of resources. A more resource constrained system makes it easier to understand this.
The talk will also discuss some of the technical challenges in deploying an HPC environment to a Raspberry Pi and attempts to keep that environment as close to a "real" HPC as possible. The issues of trying to automate the installation process will also be covered."
Learn more: https://github.com/colinsauze/pi_cluster
and
https://fosdem.org/2020/schedule/events/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
DK Panda from Ohio State University presented this deck at the OpenFabrics Workshop.
"Modern HPC clusters are having many advanced features, such as multi-/many-core architectures, highperformance RDMA-enabled interconnects, SSD-based storage devices, burst-buffers and parallel file systems. However, current generation Big Data processing middleware (such as Hadoop, Spark, and Memcached) have not fully exploited the benefits of the advanced features on modern HPC clusters. This talk will present RDMA-based designs using OpenFabrics Verbs and heterogeneous storage architectures to accelerate multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark and Memcached. An overview of the associated RDMA-enabled software libraries (being designed and publicly distributed as a part of the HiBD project for Apache Hadoop (integrated and plug-ins for Apache, HDP, and Cloudera distributions), Apache Spark and Memcached will be presented. The talk will also address the need for designing benchmarks using a multi-layered and systematic approach, which can be used to evaluate the performance of these Big Data processing middleware."
Watch the video presentation: http://wp.me/p3RLHQ-gzg
Learn more: http://hibd.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Gilad Shainer from Mellanox announces the world’s first HDR 200Gb/s data center interconnect solutions. "These 200Gb/s HDR InfiniBand solutions maintain Mellanox’s generation-ahead leadership while enabling customers and users to leverage an open, standards-based technology that maximizes application performance and scalability while minimizing overall data center total cost of ownership. Mellanox 200Gb/s HDR solutions will become generally available in 2017."
Watch the video presentation: http://insidehpc.com/2016/11/hdr-infiniband/
SCFE 2020 OpenCAPI presentation as part of the OpenPOWER Tutorial, by Ganesan Narayanasamy
This document introduces hardware acceleration using FPGAs with OpenCAPI. It discusses how classic FPGA acceleration has issues like slow CPU-managed memory access and lack of data coherency. OpenCAPI allows FPGAs to directly access host memory, providing faster memory access and data coherency. It also introduces the OC-Accel framework that allows programming FPGAs using C/C++ instead of HDL languages, addressing issues like long development times. Example applications demonstrated significant performance improvements using this approach over CPU-only or classic FPGA acceleration methods.
In this deck from the 2017 MVAPICH User Group, DK Panda from Ohio State University presents: Overview of the MVAPICH Project and Future Roadmap.
"This talk will provide an overview of the MVAPICH project (past, present and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, MVAPICH2-EA and MVAPICH2-MIC) will be presented. Current status and future plans for OSU INAM, OEMT and OMB will also be presented."
Watch the video: https://www.youtube.com/watch?v=wF7t-oH7wi4
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Scalable and Distributed DNN Training on Modern HPC Systems
This document discusses scaling deep learning training on HPC systems. It begins by providing background on deep learning and how interest in it has grown significantly. It then discusses how HPC systems can be leveraged for deep learning by supporting distributed training across multiple nodes. Several challenges of designing deep learning frameworks for HPC are outlined, including memory and communication overhead. The document proposes a co-design approach between deep learning frameworks and communication runtimes to better support distributed training and exploit HPC resources. MVAPICH2 software is discussed as an example that provides optimized MPI support for CPU- and GPU-based deep learning on HPC clusters.
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses temporal shift modules (TSM) for efficient video recognition. TSM enables temporal modeling in 2D CNNs at no additional computation cost, and TSM models achieve better performance than 3D CNNs and previous methods while using less computation. Applications include online video understanding, low-latency deployment on edge devices, and large-scale distributed training on supercomputers.
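For readers unfamiliar with the mechanism, here is a minimal PyTorch sketch of the core shift operation, an illustration based on the published TSM idea rather than code from this document; the function name, segment count, and fold_div value are illustrative.

    # temporal_shift.py -- illustrative sketch of the TSM shift operation
    import torch

    def temporal_shift(x, n_segments, fold_div=8):
        # x has shape [N*T, C, H, W], with the T frames of each clip contiguous.
        nt, c, h, w = x.size()
        n = nt // n_segments
        x = x.view(n, n_segments, c, h, w)
        fold = c // fold_div

        out = torch.zeros_like(x)
        out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift some channels back in time
        out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift some channels forward in time
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave the remaining channels untouched
        return out.view(nt, c, h, w)

    # Example: 2 clips of 4 frames, 16 channels, 8x8 feature maps.
    feats = torch.randn(2 * 4, 16, 8, 8)
    shifted = temporal_shift(feats, n_segments=4)

Because the shift is just a memory copy between neighboring frames, it gives an otherwise 2D convolutional backbone temporal reach without extra multiply-accumulate operations, which is the "no additional computation cost" claim above.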
Xilinx provides adaptable acceleration platforms for data centers. Their Alveo product lineup includes the U280, U250, U200, and low-profile U50 accelerator cards. The cards feature FPGAs with up to 1.3 million logic cells and high-speed memory. Xilinx also offers the U25 SmartNIC which combines an FPGA, ARM CPU, and dual 25GbE ports. These platforms accelerate workloads such as AI, databases, storage, and networking using reconfigurable and adaptable hardware. Xilinx supports deployment from their devices to cloud platforms using a unified software stack.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique has so far been limited by the performance of scientific instruments, computing performance is now becoming a key limitation. In my presentation I will describe the computing challenge of handling an 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experience in applying conventional hardware to the task and why this attempt failed. I will then present how an IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advances in hardware development will enable better science for users of the Swiss Light Source.
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
"Algorithmic processing performed in High Performance Computing environments impacts the lives of billions of people, and planning for exascale computing presents significant power challenges to the industry. ARM delivers the enabling technology behind HPC. The 64-bit design of the ARMv8-A architecture combined with Advanced SIMD vectorization are ideal to enable large scientific computing calculations to be executed efficiently on ARM HPC machines. In addition ARM and its partners are working to ensure that all the software tools and libraries, needed by both users and systems administrators, are provided in readily available, optimized packages."
Learn more: https://developer.arm.com/hpc
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Gilad Shainer from the HPC AI Advisory Council describes how this organization fosters innovation in the high performance computing community.
"The HPC-AI Advisory Council’s mission is to bridge the gap between high-performance computing (HPC) and Artificial Intelligence (AI) use and its potential, bring the beneficial capabilities of HPC and AI to new users for better research, education, innovation and product manufacturing, bring users the expertise needed to operate HPC and AI systems, provide application designers with the tools needed to enable parallel computing, and to strengthen the qualification and integration of HPC and AI system products."
Watch the video: https://wp.me/p3RLHQ-lNz
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Challenges and Opportunities for HPC Interconnects and MPI
In this video from the 2017 MVAPICH User Group, Ron Brightwell from Sandia presents: Challenges and Opportunities for HPC Interconnects and MPI.
"This talk will reflect on prior analysis of the challenges facing high-performance interconnect technologies intended to support extreme-scale scientific computing systems, how some of these challenges have been addressed, and what new challenges lay ahead. Many of these challenges can be attributed to the complexity created by hardware diversity, which has a direct impact on interconnect technology, but new challenges are also arising indirectly as reactions to other aspects of high-performance computing, such as alternative parallel programming models and more complex system usage models. We will describe some near-term research on proposed extensions to MPI to better support massive multithreading and implementation optimizations aimed at reducing the overhead of MPI tag matching. We will also describe a new portable programming model to offload simple packet processing functions to a network interface that is based on the current Portals data movement layer. We believe this capability will offer significant performance improvements to applications and services relevant to high-performance computing as well as data analytics."
Watch the video: https://wp.me/p3RLHQ-hhK
Learn more: http://mug.mvapich.cse.ohio-state.edu/program/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
In this deck from the 2018 Swiss HPC Conference, DK Panda from Ohio State University presents: Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss about the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models by taking into account support for multi-core systems (KNL and OpenPower), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features, sample performance numbers and best practices of using MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu)will be presented.
For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://wp.me/p3RLHQ-iyc
Learn more: http://www.cse.ohio-state.edu/~panda
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC Advisory Council Spain Conference, DK Panda from Ohio State University presents: Communication Frameworks for HPC and Big Data.
Watch the video presentation: http://insidehpc.com/2015/09/video-communication-frameworks-for-hpc-and-big-data/
Learn more: http://www.hpcadvisorycouncil.com/events/2015/spain-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss about the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models taking into account support for multi-core systems (Xeon, OpenPower, and ARM), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-Enabled Big Data stacks. Finally, we will outline the challenges in moving middleware to the Cloud environments."
Watch the video: https://youtu.be/hR8cnFVF8Zg
Learn more: http://www.cse.ohio-state.edu/~panda
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systemsinside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS-OpenSHMEM/UPC/CAF/UPC++, OpenMP and Cuda) programming models by taking into account support for multi-core systems (KNL and OpenPower), high networks, GPGPUs (including GPUDirect RDMA) and energy awareness. Features and sample performance numbers from MVAPICH2 libraries will be presented. For the Deep Learning domain, we will focus on popular Deep Learning framewords (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-enabled Big Data stacks. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://youtu.be/i2I6XqOAh_I
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing Software Libraries and Middleware for Exascale Systems: Opportuniti...inside-BigData.com
This talk will focus on challenges in designing software libraries and middleware for upcoming exascale systems with millions of processors and accelerators. Two application domains will be considered: Scientific Computing and Big Data. For the scientific computing domain, we will discuss the challenges in designing runtime environments for MPI and PGAS (UPC and OpenSHMEM) programming models by taking into account support for multi-core, high-performance networks, GPGPUs and Intel MIC. Features and sample performance numbers from MVAPICH2 and MVAPICH2-X (supporting Hybrid MPI and PGAS (UPC and OpenSHMEM)) libraries will be presented. For the Big Data domain, we will focus on high-performance and scalable designs of Hadoop (including HBase) and Memcached using native RDMA support of InfiniBand and RoCE.
A Library for Emerging High-Performance Computing ClustersIntel® Software
This document discusses the challenges of developing communication libraries for exascale systems using hybrid MPI+X programming models. It describes how current MPI+PGAS approaches use separate runtimes, which can lead to issues like deadlock. The document advocates for a unified runtime that can support multiple programming models simultaneously to avoid such issues and enable better performance. It also outlines MVAPICH2's work on designs like multi-endpoint that integrate MPI and OpenMP to efficiently support emerging highly threaded systems.
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
The document discusses performance optimization of Apache Spark on scale-up servers through near-data processing. It finds that Spark workloads have poor multi-core scalability and high I/O wait times on scale-up servers. It proposes exploiting near-data processing through in-storage processing and 2D-integrated processing-in-memory to reduce data movements and latency. The author evaluates these techniques through modeling and a programmable FPGA accelerator to improve the performance of Spark MLlib workloads by up to 9x. Challenges in hybrid CPU-FPGA design and attaining peak performance are also discussed.
MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plansinside-BigData.com
In this video from the 2014 HPC Advisory Council Europe Conference, DK Panda from Ohio State University presents: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans.
"This talk will focus on the latest developments and future plans for the MVAPICH2 and MVAPICH2-X projects. For the MVAPICH2 project, we will focus on scalable and highly-optimized designs for pt-to-pt communication (two-sided and one-sided MPI-3 RMA), collective communication (blocking and MPI-3 non-blocking), support for GPGPUs and Intel MIC, support for the MPI-T interface and schemes for fault-tolerance/fault-resilience. For the MVAPICH2-X project, we will focus on efficient support for the hybrid MPI and PGAS (UPC and OpenSHMEM) programming model with a unified runtime."
Watch the video presentation: http://wp.me/p3RLHQ-coF
Designing HPC & Deep Learning Middleware for Exascale Systemsinside-BigData.com
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Real time machine learning proposers day v3mustafa sarac
This document discusses DARPA's Real Time Machine Learning (RTML) program. The objective is to develop hardware generators and compilers that can automatically create application-specific integrated circuits for machine learning from high-level code. This would allow no-human-in-the-loop creation of efficient neural network hardware. The program has two phases: phase 1 develops an ML hardware compiler, and phase 2 demonstrates RTML systems for applications like wireless communication and image processing. Key goals are high performance, low power consumption, and support for a variety of neural network architectures and machine learning techniques.
Addressing Emerging Challenges in Designing HPC Runtimesinside-BigData.com
The document discusses several challenges in designing HPC runtimes for exascale systems, including energy awareness, accelerators, and virtualization. It focuses on the MVAPICH2 project which addresses these challenges. MVAPICH2 provides integrated support for GPUs and MICs, virtualization using SR-IOV and containers, and energy awareness. It also achieves high performance for GPU-aware MPI using features like GPUDirect RDMA. Application tests with HOOMD-blue and COSMO show improvements from MVAPICH2's GPU support.
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
HPC in Barcelona is centered around the MareNostrum supercomputer and BSC's 425-person team from 40 countries. MareNostrum allows simulation and analysis in fields like life sciences, earth sciences, and engineering. To meet new demands of big data analytics, BSC developed the Spark4MN module to run Spark workloads on MareNostrum. Benchmarking showed Spark4MN achieved good speed-up and scale-out. Further work profiles Spark using BSC tools and benchmarks workloads like image analysis on different hardware. BSC's vision is to advance understanding through technologies like cognitive computing and deep learning.
In this deck from the HPC User Forum in Austin, Yutaka Ishikawa from Riken AICS presents: Japan's post K Computer.
Watch the video presentation: http://wp.me/p3RLHQ-fJ6
Learn more: http://hpcuserforum.com
The convergence of HPC and BigData: What does it mean for HPC sysadmins?inside-BigData.com
In this deck from FOSDEM'19, Damien Francois from the Université catholique de Louvain presents: The convergence of HPC and BigData: What does it mean for HPC sysadmins?
"There are mainly two types of people in the scientific computing world: those who produce data and those who consume it. Those who have models and generate data from those models, a process known as 'simulation', and those who have data and infer models from the data ('analytics'). The former often originate from disciplines such as Engineering, Physics, or Climatology, while the latter are most often active in Remote sensing, Bioinformatics, Sociology, or Management.
Simulations often require large amounts of computation, so they are often run on generic High-Performance Computing (HPC) infrastructures built on a cluster of powerful high-end machines linked together with high-bandwidth low-latency networks. The cluster is often augmented with hardware accelerators (co-processors such as GPUs or FPGAs) and a large and fast parallel filesystem, all set up and tuned by systems administrators. By contrast, in analytics, the focus is on the storage and access of the data, so analytics is often performed on a BigData infrastructure suited for the problem at hand. Those infrastructures offer specific data stores and are often installed in a more or less self-service way on a public or private 'Cloud' typically built on top of 'commodity' hardware.
Those two worlds, the world of HPC and the world of BigData, are slowly, but surely, converging. The HPC world realizes that there is more to data storage than just files and that 'self-service' ideas are tempting. In the meantime, the BigData world realizes that co-processors and fast networks can really speed up analytics. And indeed, all major public Cloud services now have an HPC offering. And many academic HPC centres are starting to offer Cloud infrastructures and BigData-related tools.
This talk will focus on the latter point of view and review the tools originating from the BigData world and the ideas from the Cloud that can be implemented in an HPC context to enlarge the offer for scientific computing in universities and research centres."
This document provides a history and market overview of Apache Spark. It discusses the motivation for distributed data processing due to increasing data volumes, velocities and varieties. It then covers brief histories of Google File System, MapReduce, BigTable, and other technologies. Hadoop and MapReduce are explained. Apache Spark is introduced as a faster alternative to MapReduce that keeps data in memory. Competitors like Flink, Tez and Storm are also mentioned.
Accelerate Big Data Processing with High-Performance Computing TechnologiesIntel® Software
Learn about opportunities and challenges for accelerating big data middleware on modern high-performance computing (HPC) clusters by exploiting HPC technologies.
In this video from the 2016 Stanford HPC Conference, DK Panda from Ohio State University presents: Programming Models for Exascale Systems.
"This talk will focus on programming models and their designs for upcoming exascale systems with millions of processors and accelerators. Current status and future trends of MPI and PGAS (UPC and OpenSHMEM) programming models will be presented."
Percy Tzelnic from Dell Technologies presented this deck at the HPC User Forum in Austin.
Watch the video presentation: http://insidehpc.com/2016/09/emc-in-hpc-the-journey-so-far-and-the-road-ahead/
Learn more: http://emc.com/
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) Wim Vanderbauwhede
This document provides a historical overview of the evolution of FPGA technology and programming approaches over several decades. It discusses early theoretical foundations in the 1930s-40s and the development of integrated circuits, hardware description languages, and high-level synthesis tools from the 1950s onwards. More recently, it describes the rise of heterogeneous computing using GPUs, FPGAs and other accelerators, and the ongoing challenges around programming such systems at a suitable level of abstraction.
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
The document discusses opportunities for improving Apache Spark performance using near data computing architectures. It proposes exploiting in-storage processing and 2D integrated processing-in-memory to reduce data movement between CPUs and memory. Certain Spark workloads like joins and aggregations that are I/O bound would benefit from in-storage processing, while iterative workloads are more suited for 2D integrated processing-in-memory. The document outlines a system design using FPGAs to emulate these architectures for evaluating Spark machine learning workloads like k-means clustering.
Similar to Programming Models for Exascale Systems (20)
The document discusses the top 5 technologies that all organizations must understand: digital transformation, quantum computing, IoT, 5G, and AI/HPC. It provides an overview of each technology including opportunities and threats to organizations. The document emphasizes that understanding these emerging technologies is mandatory as the information revolution changes many aspects of life and business.
In this deck, Greg Wahl from Advantech presents: Transforming Private 5G Networks.
Advantech Networks & Communications Group is driving innovation in next-generation network solutions with their High Performance Servers. We provide business critical hardware to the world's leading telecom and networking equipment manufacturers with both standard and customized products. Our High Performance Servers are highly configurable platforms designed to balance the best in x86 server-class processing performance with maximum I/O and offload density. The systems are cost effective, highly available and optimized to meet next generation networking and media processing needs.
“Advantech’s Networks and Communication Group has been both an innovator and trusted enabling partner in the telecommunications and network security markets for over a decade, designing and manufacturing products for OEMs that accelerate their network platform evolution and time to market.” Said Advantech Vice President of Networks & Communications Group, Ween Niu. “In the new IP Infrastructure era, we will be expanding our expertise in Software Defined Networking (SDN) and Network Function Virtualization (NFV), two of the essential conduits to 5G infrastructure agility making networks easier to install, secure, automate and manage in a cloud-based infrastructure.”
In addition to innovation in air interface technologies and architecture extensions, 5G will also need a new generation of network computing platforms to run the emerging software defined infrastructure, one that provides greater topology flexibility, essential to deliver on the promises of high availability, high coverage, low latency and high bandwidth connections. This will open up new parallel industry opportunities through dedicated 5G network slices reserved for specific industries dedicated to video traffic, augmented reality, IoT, connected cars etc. 5G unlocks many new doors and one of the keys to its enablement lies in the elasticity and flexibility of the underlying infrastructure.
Advantech’s corporate vision is to enable an intelligent planet. The company is a global leader in the fields of IoT intelligent systems and embedded platforms. To embrace the trends of IoT, big data, and artificial intelligence, Advantech promotes IoT hardware and software solutions with the Edge Intelligence WISE-PaaS core to assist business partners and clients in connecting their industrial chains. Advantech is also working with business partners to co-create business ecosystems that accelerate the goal of industrial intelligence."
Watch the video: https://wp.me/p3RLHQ-lPQ
* Company website: https://www.advantech.com/
* Solution page: https://www2.advantech.com/nc/newsletter/NCG/SKY/benefits.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
In this deck from the Stanford HPC Conference, Katie Lewis from Lawrence Livermore National Laboratory presents: The Incorporation of Machine Learning into Scientific Simulations at Lawrence Livermore National Laboratory.
"Scientific simulations have driven computing at Lawrence Livermore National Laboratory (LLNL) for decades. During that time, we have seen significant changes in hardware, tools, and algorithms. Today, data science, including machine learning, is one of the fastest growing areas of computing, and LLNL is investing in hardware, applications, and algorithms in this space. While the use of simulations to focus and understand experiments is well accepted in our community, machine learning brings new challenges that need to be addressed. I will explore applications for machine learning in scientific simulations that are showing promising results and further investigation that is needed to better understand its usefulness."
Watch the video: https://youtu.be/NVwmvCWpZ6Y
Learn more: https://computing.llnl.gov/research-area/machine-learning
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
In this deck from the Stanford HPC Conference, Nick Nystrom and Paola Buitrago provide an update from the Pittsburgh Supercomputing Center.
Nick Nystrom is Chief Scientist at the Pittsburgh Supercomputing Center (PSC). Nick is architect and PI for Bridges, PSC's flagship system that successfully pioneered the convergence of HPC, AI, and Big Data. He is also PI for the NIH Human Biomolecular Atlas Program’s HIVE Infrastructure Component and co-PI for projects that bring emerging AI technologies to research (Open Compass), apply machine learning to biomedical data for breast and lung cancer (Big Data for Better Health), and identify causal relationships in biomedical big data (the Center for Causal Discovery, an NIH Big Data to Knowledge Center of Excellence). His current research interests include hardware and software architecture, applications of machine learning to multimodal data (particularly for the life sciences) and to enhance simulation, and graph analytics.
Watch the video: https://youtu.be/LWEU1L1o7yY
Learn more: https://www.psc.edu/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses using systems intelligence and artificial intelligence/neural networks to enhance semiconductor electronic design automation (EDA) workflows. Telemetry data collected from EDA jobs and infrastructure is analyzed with complex event processing, machine learning models, and messaging substrates to provide insights that can optimize EDA pipelines and infrastructure. The approach aims to allow both internal and external augmentation of EDA processes and environments through unsupervised and incremental learning.
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
In this deck from the Stanford HPC Conference, Nicole Xu from Stanford University describes how she transformed a common jellyfish into a bionic creature that is part animal and part machine.
"Animal locomotion and bioinspiration have the potential to expand the performance capabilities of robots, but current implementations are limited. Mechanical soft robots leverage engineered materials and are highly controllable, but these biomimetic robots consume more power than corresponding animal counterparts. Biological soft robots from a bottom-up approach offer advantages such as speed and controllability but are limited to survival in cell media. Instead, biohybrid robots that comprise live animals and self- contained microelectronic systems leverage the animals’ own metabolism to reduce power constraints and body as an natural scaffold with damage tolerance. We demonstrate that by integrating onboard microelectronics into live jellyfish, we can enhance propulsion up to threefold, using only 10 mW of external power input to the microelectronics and at only a twofold increase in cost of transport to the animal. This robotic system uses 10 to 1000 times less external power per mass than existing swimming robots in literature and can be used in future applications for ocean monitoring to track environmental changes."
Watch the video: https://youtu.be/HrmJFyvInj8
Learn more: https://sanfrancisco.cbslocal.com/2020/02/05/stanford-research-project-common-jellyfish-bionic-sea-creatures/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Stanford HPC Conference, Peter Dueben from the European Centre for Medium-Range Weather Forecasts (ECMWF) presents: Machine Learning for Weather Forecasts.
"I will present recent studies that use deep learning to learn the equations of motion of the atmosphere, to emulate model components of weather forecast models and to enhance usability of weather forecasts. I will than talk about the main challenges for the application of deep learning in cutting-edge weather forecasts and suggest approaches to improve usability in the future."
Peter is contributing to the development and optimization of weather and climate models for modern supercomputers. He is focusing on a better understanding of model error and model uncertainty, on the use of reduced numerical precision that is optimised for a given level of model error, on global cloud-resolving simulations with ECMWF's forecast model, and the use of machine learning, and in particular deep learning, to improve the workflow and predictions. Peter graduated in Physics and wrote his PhD thesis at the Max Planck Institute for Meteorology in Germany. He worked as a Postdoc with Tim Palmer at the University of Oxford and took up a position as University Research Fellow of the Royal Society at the European Centre for Medium-Range Weather Forecasts (ECMWF) in 2017.
Watch the video: https://youtu.be/ks3fkRj8Iqc
Learn more: https://www.ecmwf.int/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Today RIKEN in Japan announced that the Fugaku supercomputer will be made available for research projects aimed to combat COVID-19.
"Fugaku is currently being installed and is scheduled to be available to the public in 2021. However, faced with the devastating disaster unfolding before our eyes, RIKEN and MEXT decided to make a portion of the computational resources of Fugaku available for COVID-19-related projects ahead of schedule while continuing the installation process.
Fugaku is being developed not only for the progress in science, but also to help build the society dubbed as the “Society 5.0” by the Japanese government, where all people will live safe and comfortable lives. The current initiative to fight against the novel coronavirus is driven by the philosophy behind the development of Fugaku."
Initial Projects
Exploring new drug candidates for COVID-19 by "Fugaku"
Yasushi Okuno, RIKEN / Kyoto University
Prediction of conformational dynamics of proteins on the surface of SARS-CoV-2 using Fugaku
Yuji Sugita, RIKEN
Simulation analysis of pandemic phenomena
Nobuyasu Ito, RIKEN
Fragment molecular orbital calculations for COVID-19 proteins
Yuji Mochizuki, Rikkyo University
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, which implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic instrumentation, allows the code developer to annotate regions of particular interest."
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses how DDN A3I storage solutions and Nvidia's SuperPOD platform can enable HPC at scale. It provides details on DDN's A3I appliances that are optimized for AI and deep learning workloads and validated for Nvidia's DGX-2 SuperPOD reference architecture. The solutions are said to deliver the fastest performance, effortless scaling, reliability and flexibility for data-intensive workloads.
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
Today Xilinx announced Versal Premium, the third series in the Versal ACAP portfolio. The Versal Premium series features highly integrated, networked and power-optimized cores and the industry’s highest bandwidth and compute density on an adaptable platform. Versal Premium is designed for the highest bandwidth networks operating in thermally and spatially constrained environments, as well as for cloud providers who need scalable, adaptable application acceleration.
Versal is the industry’s first adaptive compute acceleration platform (ACAP), a revolutionary new category of heterogeneous compute devices with capabilities that far exceed those of conventional silicon architectures. Developed on TSMC’s 7-nanometer process technology, Versal Premium combines software programmability with dynamically configurable hardware acceleration and pre-engineered connectivity and security features to enable a faster time-to-market. The Versal Premium series delivers up to 3X higher throughput compared to current generation FPGAs, with built-in Ethernet, Interlaken, and cryptographic engines that enable fast and secure networks. The series doubles the compute density of currently deployed mainstream FPGAs and provides the adaptability to keep pace with increasingly diverse and evolving cloud and networking workloads.
Learn more: https://insidehpc.com/2020/03/xilinx-announces-versal-premium-acap-for-network-and-cloud-acceleration/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
In this video from the Rice Oil & Gas Conference, Chin Fang from Zettar presents: Moving Massive Amounts of Data across Any Distance Efficiently.
The objective of this talk is to present two ongoing projects aiming at improving and ensuring highly efficient bulk transferring or streaming of massive amounts of data over digital connections across any distance. It examines the current state of the art, a few very common misconceptions, the differences among the three major types of data movement solutions, a current initiative attempting to improve the data movement efficiency from the ground up, and another multi-stage project that shows how to conduct long-distance, large-scale data movement at speed and scale internationally. Both projects have real-world motivations, e.g. the ambitious data transfer requirements of Linac Coherent Light Source II (LCLS-II) [1], a premier preparation project of the U.S. DOE Exascale Computing Initiative (ECI) [2]. Their immediate goals are described and explained, together with the solution used for each. Findings and early results are reported. Possible future work is outlined.
Watch the video: https://wp.me/p3RLHQ-lBX
Learn more: https://www.zettar.com/
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Rice Oil & Gas Conference, Bradley McCredie from AMD presents: Scaling TCO in a Post Moore's Law Era.
"While foundries bravely drive forward to overcome the technical and economic challenges posed by scaling to 5nm and beyond, Moore’s law alone can provide only a fraction of the performance / watt and performance / dollar gains needed to satisfy the demands of today’s high performance computing and artificial intelligence applications. To close the gap, multiple strategies are required. First, new levels of innovation and design efficiency will supplement technology gains to continue to deliver meaningful improvements in SoC performance. Second, heterogenous compute architectures will create x-factor increases of performance efficiency for the most critical applications. Finally, open software frameworks, APIs, and toolsets will enable broad ecosystems of application level innovation."
Watch the video:
Learn more: http://amd.com
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
In this deck from FOSDEM 2020, Frank McQuillan from Pivotal presents: Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases.
"In this session we will present an efficient way to train many deep learning model configurations at the same time with Greenplum, a free and open source massively parallel database based on PostgreSQL. The implementation involves distributing data to the workers that have GPUs available and hopping model state between those workers, without sacrificing reproducibility or accuracy. Then we apply optimization algorithms to generate and prune the set of model configurations to try.
Deep neural networks are revolutionizing many machine learning applications, but hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. This is the challenge of model selection. It is time consuming and expensive, especially if you are only training one model at a time.
Massively parallel processing databases can have hundreds of workers, so can you use this parallel compute architecture to address the challenge of model selection for deep nets, in order to make it faster and cheaper?
It’s possible!
We will demonstrate results from this project using a version of Hyperband, which is a well known hyperparameter optimization algorithm, and the deep learning frameworks Keras and TensorFlow, all running on Greenplum database using Apache MADlib. Other topics will include architecture, scalability results and bright opportunities for the future."
Watch the video: https://wp.me/p3RLHQ-lsQ
Learn more: https://fosdem.org/2020/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Huihuo Zheng from Argonne National Laboratory presents: Data Parallel Deep Learning.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two weeks of training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-lsl
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from DOE CSGF 2019, Chelsea Harris from the University of Michigan presents: Making Supernovae with Jets.
"Supernovae are the explosions of stars. One reason they are fundamentally important is they create and disperse elements heavier than carbon throughout the universe. Different stars explode in different ways, but the most common supernova type is from massive stars (greater than 10 times the mass of the sun) whose cores collapse to form a neutron star or black hole – "core collapse supernovae" or CC SNe. Even among massive stars, though, there are differences that can affect the outcome of core collapse. I am specifically interested in progenitors whose cores are rotating and magnetic (magnetorotational). Such cores may experience instabilities after collapse that launch a fast jet, which could rescue r-process elements formed near the proto-neutron star from destruction. The instabilities can also add power to the explosion and relieve tension between observations and theory. In addition to running simulations with existing FLASH code modules, I am developing a FLASH hydrodynamics module, SparkJoy, to perform these simulations at high order. These projects are part of a DOE INCITE project to explore progenitor effects on CC SNe and of the DOE SciDAC program "Towards Exascale Astrophysics of Mergers and Supernovae," a nationwide collaboration of supernova theorists unprecedented in its collaborative scale. Research like mine is made much easier at Michigan State University through the Department of Computational Mathematics, Science, and Engineering which, like the DOE CSGF, brings together members from different areas to share knowledge and strengthen each other's research."
Watch the video: https://wp.me/p3RLHQ-lr0
Learn more: https://www.krellinst.org/csgf/conf/2019/video/charris
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses dense linear algebra solvers and algorithms. It provides an overview of existing software for dense linear algebra including LINPACK, EISPACK, LAPACK, ScaLAPACK, PLASMA, and MAGMA. It then discusses challenges with dense linear algebra on modern hardware including distributed memory, heterogeneity, and the high cost of communication. It introduces tile algorithms as an approach to address these challenges compared to traditional LAPACK algorithms.
Scientific Applications and Heterogeneous Architecturesinside-BigData.com
This document discusses extending high-performance computing (HPC) to integrate data analytics and connect to edge computing. It presents two use cases: 1) augmenting molecular dynamics workflows with in situ and in transit analytics to capture protein structural information, and 2) connecting HPC to sensors at the edge for precision farming applications involving soil moisture data prediction. The document outlines approaches for building closed-loop workflows that integrate simulation, data generation, analytics, and data feedback between HPC and edge resources to enable real-time decision making.
In this deck from ATPESC 2019, Yunong Shi from the University of Chicago presents: SW/HW co-design for near-term quantum computing.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two weeks of training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-lpv
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from ATPESC 2019, James Moawad and Greg Nash from Intel present: FPGAs and Machine Learning.
"Neural networks are inspired by biological systems, in particular the human brain. Through the combination of powerful computing resources and novel architectures for neurons, neural networks have achieved state-of-the-art results in many domains such as computer vision and machine translation. FPGAs are a natural choice for implementing neural networks as they can handle different algorithms in computing, logic, and memory resources in the same device. Faster performance comparing to competitive implementations as the user can hardcore operations into the hardware. Software developers can use the OpenCL device C level programming standard to target FPGAs as accelerators to standard CPUs without having to deal with hardware level design."
Watch the video: https://wp.me/p3RLHQ-lnc
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
and
https://www.intel.com/content/www/us/en/products/programmable/fpga.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
2. High-End Computing (HEC): ExaFlop & ExaByte
• ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2020-2024?
• ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?
[Figure 1. Source: IDC's Digital Universe Study, sponsored by EMC, December 2012: "Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital ..."]
4. Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Figure labels: Multi-core Processors; High-Performance Interconnects - InfiniBand (<1 usec latency, 100 Gbps bandwidth); Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Example systems pictured: Tianhe-2, Titan, Stampede, Tianhe-1A]
5. Large-scale InfiniBand Installations
• 235 IB Clusters (47%) in the Nov' 2015 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
  – 462,462 cores (Stampede) at TACC (10th)
  – 185,344 cores (Pleiades) at NASA/Ames (13th)
  – 72,800 cores Cray CS-Storm in US (15th)
  – 72,800 cores Cray CS-Storm in US (16th)
  – 265,440 cores SGI ICE at Tulip Trading Australia (17th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
  – 72,000 cores (HPC2) in Italy (19th)
  – 152,692 cores (Thunder) at AFRL/USA (21st)
  – 147,456 cores (SuperMUC) in Germany (22nd)
  – 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  – 194,616 cores (Cascade) at PNNL (27th)
  – 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  – 110,400 cores (Pangea) in France (33rd)
  – 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  – 57,600 cores (SwiftLucy) in US (37th)
  – 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  – 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  – 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  – and many more!
6. Two Major Categories of Applications
• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, etc.
  – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
7. Towards Exascale System (Today and Target)
Systems | 2016 (Tianhe-2) | 2020-2024 | Difference (Today & Exascale)
System peak | 55 PFlop/s | 1 EFlop/s | ~20x
Power | 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1), ~15x
System memory | 1.4 PB (1.024 PB CPU + 0.384 PB CoP) | 32-64 PB | ~50x
Node performance | 3.43 TF/s (0.4 CPU + 3 CoP) | 1.2 or 15 TF | O(1)
Node concurrency | 24-core CPU + 171-core CoP | O(1k) or O(10k) | ~5x - ~50x
Total node interconnect BW | 6.36 GB/s | 200-400 GB/s | ~40x - ~60x
System size (nodes) | 16,000 | O(100,000) or O(1M) | ~6x - ~60x
Total concurrency | 3.12M (12.48M threads, 4/core) | O(billion) for latency hiding | ~100x
MTTI | Few/day | Many/day | O(?)
Courtesy: Prof. Jack Dongarra
8. Basic Design Challenges for Exascale Systems
• Energy and Power Challenge
  – Hard to solve power requirements for data movement
• Memory and Storage Challenge
  – Hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge
  – Management of a very large amount of concurrency (billion threads)
• Resiliency Challenge
  – Low voltage devices (for low power) introduce more faults
9. Parallel Programming Models Overview
[Diagram: Shared Memory Model (SHMEM, DSM): P1, P2, P3 over a single shared memory. Distributed Memory Model (MPI - Message Passing Interface): P1, P2, P3 each with its own memory. Partitioned Global Address Space (PGAS: Global Arrays, UPC, Chapel, X10, CAF, ...): P1, P2, P3 with per-process memories presented as a logical shared memory.]
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually receiving importance (a small OpenSHMEM sketch follows this slide)
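To make the PGAS column of the diagram concrete, here is a minimal sketch in C using the OpenSHMEM API, one of the PGAS models named on this slide. It assumes an OpenSHMEM 1.2-or-later library (for example the one bundled with MVAPICH2-X); the buffer name and the values are illustrative only and not taken from the deck.

```c
/* Minimal OpenSHMEM sketch: every PE allocates a slot in the symmetric heap,
 * and PE 0 writes directly into the logically shared address space of PE 1.
 * Illustrative only; assumes an OpenSHMEM 1.2+ library. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();    /* my processing element (PE) id */
    int npes = shmem_n_pes();    /* total number of PEs           */

    /* Symmetric allocation: the same remotely accessible buffer on every PE */
    long *slot = (long *)shmem_malloc(sizeof(long));
    *slot = -1;
    shmem_barrier_all();

    if (me == 0 && npes > 1) {
        long value = 42;
        /* One-sided put: no matching receive is needed on PE 1 */
        shmem_long_put(slot, &value, 1, 1);
    }
    shmem_barrier_all();         /* ensure the put is complete and visible */

    printf("PE %d of %d sees slot = %ld\n", me, npes, *slot);

    shmem_free(slot);
    shmem_finalize();
    return 0;
}
```

The key point is the one-sided put: PE 0 deposits data into PE 1's slot without a matching receive call, which is the "logical shared memory" behaviour sketched in the diagram.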
10. MPI Overview and History
• Message Passing Library standardized by the MPI Forum
  – C and Fortran
• Goal: portable, efficient and flexible standard for writing parallel applications (a minimal C example follows this slide)
• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
• Evolution of MPI
  – MPI-1: 1994
  – MPI-2: 1996
  – MPI-3.0: 2008 - 2012, standardized on September 21, 2012
  – MPI-3.1: 2012 - 2015, standardized on June 4, 2015
  – Next plan is for MPI 4.0
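For readers who have not seen the C binding, a minimal MPI program is sketched below. It is generic MPI usage, not specific to MVAPICH2 or any other library; the payload value and tag are arbitrary.

```c
/* Minimal two-sided MPI example in C: rank 0 sends one integer to rank 1. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        int payload = 123;
        MPI_Send(&payload, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```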
16. MPI-3 Non-blocking Collective (NBC) Operations
• Enables overlap of computation with communication
• Non-blocking calls do not match blocking collective calls
  – MPI may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• A process calling an NBC operation
  – Schedules the collective operation and immediately returns
  – Executes application computation code
  – Waits for the end of the collective
• Communication progress is driven by
  – Application code through MPI_Test
  – Network adapter (HCA) with hardware support
  – Dedicated processes / threads in the MPI library
• There is a non-blocking equivalent for each blocking operation
  – Has an "I" in the name (MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce); a usage sketch follows this slide
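A hedged sketch of the pattern described above, using MPI_Ibcast to overlap a broadcast with independent computation. The do_independent_work routine and the buffer sizes are placeholders, and how much overlap is actually achieved depends on the library's progress mechanisms listed on this slide.

```c
/* Non-blocking collective sketch: start a broadcast, compute while it is in
 * flight, and complete it with MPI_Test/MPI_Wait. Buffer size and the local
 * work are placeholders for illustration. */
#include <mpi.h>

#define N 1024

static void do_independent_work(double *local, int n)
{
    for (int i = 0; i < n; i++)        /* placeholder computation that does */
        local[i] = local[i] * 2.0 + 1;  /* not touch the broadcast buffer    */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double bcast_buf[N], local_buf[N];
    for (int i = 0; i < N; i++) { bcast_buf[i] = i; local_buf[i] = i; }

    MPI_Request req;
    /* Non-blocking equivalent of MPI_Bcast: schedules the collective and returns */
    MPI_Ibcast(bcast_buf, N, MPI_DOUBLE, /*root=*/0, MPI_COMM_WORLD, &req);

    do_independent_work(local_buf, N);  /* overlap: compute while the collective
                                           progresses in the background        */

    /* Optionally poke the progress engine, then wait for completion */
    int done = 0;
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```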
18. HPCAC-Switzerland (Mar ‘16) 18 Network Based CompuNng Laboratory
• MPI 3.1 was approved on June 4, 2015
– The specification is available from: http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
• Major features and enhancements:
– Corrections to the Fortran bindings introduced in MPI-3.0
– New functions added, including routines to manipulate MPI_Aint values in a portable manner
– Nonblocking collective I/O routines
– Routines to get the index value by name for MPI_T performance and control variables
MPI-3.1 Enhancements
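A hedged sketch of the nonblocking collective I/O routines added in MPI-3.1 (illustrative C; the file name and buffer are made up):

/* Nonblocking collective write (MPI-3.1), overlapping I/O with computation. */
#include <mpi.h>

void write_async(MPI_Comm comm, const double *data, int count)
{
    MPI_File fh;
    MPI_Request req;

    MPI_File_open(comm, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* All ranks participate, but the call returns immediately */
    MPI_File_iwrite_all(fh, data, count, MPI_DOUBLE, &req);

    /* ... independent computation could be overlapped here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}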
21. HPCAC-Switzerland (Mar ‘16) 21 Network Based Computing Laboratory
• UPC: Unified Parallel C - PGAS based language extension to C
– An ISO C99-based language providing a uniform programming model for both shared- and distributed-memory hardware to support HPC
– UPC = UPC translator + C compiler + UPC runtime
• Coarray Fortran (CAF): Language-level PGAS support in Fortran
– An extension to Fortran to support global shared arrays (coarrays) in parallel Fortran applications
– CAF = CAF compiler + CAF runtime (libcaf)
– Basic support in Fortran 2008 and extended support for collectives in Fortran 2015
• UPC++: An Object-Oriented PGAS Programming Model
– A compiler-free PGAS programming model in the context of C++
– Built on top of C++ standard templates and runtime libraries
– Extends UPC’s programming idioms
– Registers tasks for asynchronous execution
UPC, CAF and UPC++
22. HPCAC-Switzerland (Mar ‘16) 22 Network Based Computing Laboratory
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
– MPI across address spaces
– PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming
models
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
MPI+PGAS for Exascale Architectures and Applications
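A hedged sketch of the (MPI + PGAS) style described above, using OpenSHMEM for the PGAS part (illustrative only; initialization order and interoperability details are runtime-specific):

/* Hybrid MPI + OpenSHMEM sketch (illustrative; not an application kernel). */
#include <mpi.h>
#include <shmem.h>

static long counter = 0;   /* symmetric (PGAS) variable, one copy per PE */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();

    int me  = shmem_my_pe();
    int npe = shmem_n_pes();

    /* PGAS-style one-sided put into the logically shared address space */
    shmem_long_p(&counter, (long)me, (me + 1) % npe);
    shmem_barrier_all();

    /* MPI collective for the message-passing part of the kernel */
    long total = 0;
    MPI_Allreduce(&counter, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}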
25. HPCAC-Switzerland (Mar ‘16) 25 Network Based Computing Laboratory
• Scalability from millions to billions of processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Scalable job start-up
• Scalable Collective communication
– Offload
– Non-blocking
– Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …)
• Virtualization
• Energy-Awareness
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale
26. HPCAC-Switzerland (Mar ‘16) 26 Network Based Computing Laboratory
• Extreme Low Memory Footprint
– Memory per core continues to decrease
• D-L-A Framework
– Discover
• Overall network topology (fat-tree, 3D, …) and the network topology of the processes for a given job
• Node architecture, health of network and node
– Learn
• Impact on performance and scalability
• Potential for failure
– Adapt
• Internal protocols and algorithms
• Process mapping
• Fault-tolerance solutions
– Low-overhead techniques while delivering performance, scalability and fault-tolerance
Additional Challenges for Designing Exascale Software Libraries
27. HPCAC-Switzerland (Mar ‘16) 27 Network Based Computing Laboratory
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Used by more than 2,525 organizations in 77 countries
– More than 356,000 (> 0.36 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘15 ranking)
• 10th ranked 519,640-core cluster (Stampede) at TACC
• 13th ranked 185,344-core cluster (Pleiades) at NASA
• 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Stampede at TACC (10th in Nov ’15, 519,640 cores, 5.168 PFlops)
32. HPCAC-Switzerland (Mar ‘16) 32 Network Based Computing Laboratory
• Scalability from millions to billions of processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Support for advanced IB mechanisms (UMR and ODP)
– Extremely minimal memory footprint
– Scalable job start-up
• Collective communication
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• InfiniBand Network Analysis and Monitoring (INAM)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale
36. HPCAC-Switzerland (Mar ‘16) 36 Network Based Computing Laboratory
• Introduced by Mellanox to support direct local and remote noncontiguous memory access
– Avoid packing at sender and unpacking at receiver
• Available with MVAPICH2-X 2.2b
User-mode Memory Registration (UMR)
[Charts: UMR vs. Default latency - Small & Medium Message Latency (4K-1M bytes) and Large Message Latency (2M-16M bytes); Latency (us) vs. Message Size (Bytes)]
Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015
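For context, a hedged sketch of the kind of noncontiguous MPI datatype transfer that UMR can accelerate by avoiding sender-side packing and receiver-side unpacking (illustrative C only; the matrix size N and the peer rank are made up):

/* Send one column of a row-major NxN matrix as a noncontiguous datatype.
 * With hardware support such as UMR, the HCA can gather/scatter such data
 * directly instead of the library packing/unpacking it (illustrative sketch). */
#include <mpi.h>

#define N 1024

void send_column(double (*a)[N], int col, int peer, MPI_Comm comm)
{
    MPI_Datatype column;

    /* N blocks of 1 double each, separated by a stride of N doubles */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&a[0][col], 1, column, peer, /*tag=*/0, comm);

    MPI_Type_free(&column);
}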
37. HPCAC-Switzerland (Mar ‘16) 37 Network Based Computing Laboratory
• Introduced by Mellanox to support direct remote memory access without pinning
• Memory regions paged in/out dynamically by the HCA/OS
• Size of registered buffers can be larger than physical memory
• Will be available in future MVAPICH2 release
On-Demand Paging (ODP)
Connect-IB (54 Gbps): 2.6 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
[Charts: Graph500 Pin-down Buffer Sizes (Pin-down Buffer Size in MB vs. Number of Processes: 16, 32, 64) and Graph500 BFS Kernel (Execution Time in s vs. Number of Processes: 16, 32, 64), comparing Pin-down and ODP]
38. HPCAC-Switzerland (Mar ‘16) 38 Network Based Computing Laboratory
Minimizing Memory Footprint by Dynamic Connected (DC) Transport
[Diagram: four nodes connected through the IB network (Node 0: P0, P1; Node 1: P2, P3; Node 2: P4, P5; Node 3: P6, P7)]
• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
– DC Target identified by “DCT Number”
– Messages routed with (DCT Number, LID)
– Requires the same “DC Key” to enable communication
• Available since MVAPICH2-X 2.2a
[Charts: NAMD Apoa1 (large data set) normalized execution time vs. Number of Processes (160, 320, 620), and Memory Footprint for Alltoall, Connection Memory (KB) vs. Number of Processes (80, 160, 320, 640), comparing RC, DC-Pool, UD and XRC]
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences. IEEE International Supercomputing Conference (ISC ’14)
39. HPCAC-Switzerland (Mar ‘16) 39 Network Based Computing Laboratory
• Near-constant MPI and OpenSHMEM initialization time at any process count
• 10x and 30x improvements in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
• Memory consumption for remote endpoint information reduced by O(processes per node)
• 1 GB of memory saved per node with 1M processes and 16 processes per node
Towards High Performance and Scalable Startup at Exascale
[Diagram: job startup performance vs. memory required to store endpoint information; P = PGAS (state of the art), M = MPI (state of the art), O = PGAS/MPI optimized; techniques: PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, shmem-based PMI, on-demand connection]
On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS ’15)
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users’ Group Meeting (EuroMPI/Asia ’14)
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’15)
SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’16), accepted for publication
40. HPCAC-Switzerland (Mar ‘16) 40 Network Based Computing Laboratory
• SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
• Only a single copy per node: an O(processes per node) reduction in memory usage
• Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
• Up to 1,000 times faster PMI Gets compared to the default design. Will be available in MVAPICH2 2.2RC1.
Process Management Interface over Shared Memory (SHMEMPMI)
TACC Stampede - Connect-IB (54 Gbps): 2.6 GHz Quad Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR
SHMEMPMI – Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’16), accepted for publication
[Charts: Time Taken by one PMI_Get (milliseconds) vs. Number of Processes per Node (1-32), Default vs. SHMEMPMI (estimated 1000x, actual 16x); Memory Usage per Node (MB) for Remote EP Information vs. Number of Processes per Job (16-1M), Fence/Allgather with Default vs. Shmem]
42. HPCAC-Switzerland (Mar ‘16) 42 Network Based Computing Laboratory
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified Preconditioned Conjugate Gradient (PCG) solver with Offload-Allreduce does up to 21.8% better than the default version
[Charts: application run-time (s) vs. data size (512, 600, 720, 800); run-time (s) vs. number of processes (64-512) for PCG-Default vs. Modified-PCG-Offload (21.8%); HPL normalized performance vs. HPL problem size (N) as % of total memory (10-70) for HPL-Offload, HPL-1ring and HPL-Host (4.5%); 17%]
K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12
K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS ’12
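A hedged sketch of the overlap pattern that Offload-Allreduce exploits in a CG-type solver (illustrative C; dot_local() and local_spmv() are hypothetical helpers): the reduction is issued non-blockingly so the library or offload hardware can progress it while independent local work proceeds.

/* CG-style step overlapping an allreduce with independent work
 * (illustrative sketch; dot_local() and local_spmv() are hypothetical). */
#include <mpi.h>

double dot_local(const double *x, const double *y, int n);
void   local_spmv(const double *x, double *y, int n);

double overlapped_dot(const double *x, const double *y, double *z, int n,
                      MPI_Comm comm)
{
    double local = dot_local(x, y, n), global = 0.0;
    MPI_Request req;

    /* Issue the reduction; with collective offload the HCA can progress it */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* Independent work that does not need the reduced value */
    local_spmv(x, z, n);

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return global;
}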
43. HPCAC-Switzerland (Mar ‘16) 43 Network Based Computing Laboratory
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; Default vs. Topo-Aware for a 2048-core run (15%)]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC ’12. Best Paper and Best Student Paper Finalist
• Reduce network topology discovery time from O(Nhosts²) to O(Nhosts)
• 15% improvement in MILC execution time at 2048 cores
• 15% improvement in Hypre execution time at 1024 cores
47. HPCAC-Switzerland (Mar ‘16) 47 Network Based Computing Laboratory
MiniMD – Total Execution Time
• Hybrid design performs better than the MPI implementation
• 1,024 processes
- 17% improvement over MPI version
• Strong Scaling
Input size: 128 * 128 * 128
[Charts: Performance and Strong Scaling; Time (ms) vs. # of Cores (256, 512, 1,024) for Hybrid-Barrier, MPI-Original and Hybrid-Advanced; 17% improvement at 1,024 cores]
M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG ’14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS ’14)
48. HPCAC-Switzerland (Mar ‘16) 48 Network Based Computing Laboratory
Hybrid MPI+UPC NAS-FT
• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes)
• 34% improvement over UPC-GASNet
• 30% improvement over UPC-OSU
[Chart: Time (s) vs. NAS Problem Size – System Size (B-64, C-64, B-128, C-128) for UPC-GASNet, UPC-OSU and Hybrid-OSU (34%)]
J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS ’10), October 2010
Hybrid MPI + UPC support available since MVAPICH2-X 1.9 (2012)
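The all-to-all exchange that the modified NAS-FT performs with MPI_Alltoall might look like this hedged sketch in C (illustrative only; the per-rank block size and the surrounding UPC data layout are made up):

/* All-to-all exchange of the kind used by the hybrid NAS-FT via MPI_Alltoall
 * (illustrative sketch; block_per_rank is made up). */
#include <mpi.h>

void exchange_blocks(const double *sendbuf, double *recvbuf,
                     int block_per_rank, MPI_Comm comm)
{
    /* Each rank sends block_per_rank doubles to, and receives the same
       amount from, every other rank in a single collective call. */
    MPI_Alltoall(sendbuf, block_per_rank, MPI_DOUBLE,
                 recvbuf, block_per_rank, MPI_DOUBLE, comm);
}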
50. HPCAC-Switzerland (Mar ‘16) 50 Network Based Computing Laboratory
Overview of OSU INAM
• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
– http://mvapich.cse.ohio-state.edu/tools/osu-inam/
– http://mvapich.cse.ohio-state.edu/userguide/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (point-to-point, collectives and RMA)
• Ability to filter data based on type of counters using a drop-down list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• “Job Page” to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a “live” or “historical” fashion for the entire network, a job, or a set of nodes
55. HPCAC-Switzerland (Mar ‘16) 55 Network Based Computing Laboratory
List of Supported Switch Counters
• The following counters are queried from the InfiniBand Switches
• Xmit Data
– Total number of data octets, divided by 4, transmitted on all VLs from the port
– This includes all octets between (and not including) the start of packet delimiter and the VCRC, and
may include packets containing errors
– Excludes all link packets.
• Rcv Data
– Total number of data octets, divided by 4, received on all VLs from the port
– This includes all octets between (and not including) the start of packet delimiter and the VCRC, and
may include packets containing errors
– Excludes all link packets.
• Max [Xmit Data/Rcv Data]: Maximum of the two values above
57. HPCAC-Switzerland (Mar ‘16) 57 Network Based Computing Laboratory
List of Supported MPI Process Level Counters (Cont.)
• Max [Coll Bytes Sent/Rcvd]
– Maximum of the two values above
• RMA Bytes Sent
– Total number of bytes transmitted as part of MPI RMA operations
– Note that due to the nature of the RMA operations, bytes received for RMA operations cannot be counted
• RC VBUF
– The number of internal communication buffers used for reliable connection (RC)
• UD VBUF
– The number of internal communication buffers used for unreliable datagram (UD)
• VM Size
– Total number of bytes used by the program for its virtual memory
• VM Peak
– Maximum number of virtual memory bytes for the program
• VM RSS
– The number of bytes resident in the memory (Resident set size)
• VM HWM
– The maximum number of bytes that have been resident in memory (peak resident set size, or high-water mark)
58. HPCAC-Switzerland (Mar ‘16) 58 Network Based Computing Laboratory
List of Supported Network Error Counters (Cont.)
• XmtDiscards
– Total number of outbound packets discarded by the port because the port is down or congested. Reasons for this include:
• Output port is not in the active state
• Packet length exceeded NeighborMTU
• Switch Lifetime Limit exceeded
• Switch HOQ Lifetime Limit exceeded. This may also include packets discarded while in the VLStalled state.
• XmtConstraintErrors
– Total number of packets not transmitted from the switch physical port for the following reasons:
• FilterRawOutbound is true and packet is raw
• PartitionEnforcementOutbound is true and packet fails partition key check or IP version check
• RcvConstraintErrors
– Total number of packets not received from the switch physical port for the following reasons:
• FilterRawInbound is true and packet is raw
• PartitionEnforcementInbound is true and packet fails partition key check or IP version check
• LinkIntegrityErrors
– The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors
• ExcBufOverrunErrors
– The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error
• VL15Dropped: Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port
59. HPCAC-Switzerland (Mar ‘16) 59 Network Based Computing Laboratory
List of Supported Network Error Counters
• The following error counters are available both at switch and process level:
• SymbolErrors
– Total number of minor link errors detected on one or more physical lanes
• LinkRecovers
– Total number of times the Port Training state machine has successfully completed the link error recovery process
• LinkDowned
– Total number of times the Port Training state machine has failed the link error recovery process and downed the link
• RcvErrors
– Total number of packets containing an error that were received on the port. These errors include:
• Local physical errors
• Malformed data packet errors
• Malformed link packet errors
• Packets discarded due to buffer overrun
• RcvRemotePhysErrors
– Total number of packets marked with the EBP delimiter received on the port.
• RcvSwitchRelayErrors
– Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay