Programming Models for Exascale Systems

Keynote Talk at HPCAC-Stanford (Feb 2016)

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
High-End Computing (HEC): ExaFlop & ExaByte

[Figure: projected growth of HEC systems - compute: 100-200 PFlops in 2016-2018, 1 EFlops in 2020-2024?; data: 10K-20K EBytes in 2016-2018, 40K EBytes in 2020? Source: IDC's Digital Universe Study, sponsored by EMC, December 2012]

• ExaFlop & HPC
• ExaByte & BigData
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Figure: number and percentage of commodity clusters in the Top500 over time; commodity clusters now make up 85% of the list]
Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Figure: building blocks - Multi-core Processors; High Performance Interconnects - InfiniBand (<1 usec latency, 100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
Large-scale InfiniBand Installations

• 235 IB clusters (47%) in the Nov '15 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
  – 462,462 cores (Stampede) at TACC (10th)
  – 185,344 cores (Pleiades) at NASA/Ames (13th)
  – 72,800 cores Cray CS-Storm in US (15th)
  – 72,800 cores Cray CS-Storm in US (16th)
  – 265,440 cores SGI ICE at Tulip Trading Australia (17th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
  – 72,000 cores (HPC2) in Italy (19th)
  – 152,692 cores (Thunder) at AFRL/USA (21st)
  – 147,456 cores (SuperMUC) in Germany (22nd)
  – 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  – 194,616 cores (Cascade) at PNNL (27th)
  – 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  – 110,400 cores (Pangea) in France (33rd)
  – 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  – 57,600 cores (SwiftLucy) in US (37th)
  – 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  – 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  – 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  – and many more!
Towards Exascale System (Today and Target)

Systems: Tianhe-2 (2016) vs. Exascale (2020-2024), with the difference between today and exascale:
• System peak: 55 PFlop/s -> 1 EFlop/s (~20x)
• Power: 18 MW (3 Gflops/W) -> ~20 MW (50 Gflops/W) (O(1) in power, ~15x in efficiency)
• System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) -> 32-64 PB (~50x)
• Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) -> 1.2 or 15 TF (O(1))
• Node concurrency: 24-core CPU + 171-core CoP -> O(1k) or O(10k) (~5x - ~50x)
• Total node interconnect BW: 6.36 GB/s -> 200-400 GB/s (~40x - ~60x)
• System size (nodes): 16,000 -> O(100,000) or O(1M) (~6x - ~60x)
• Total concurrency: 3.12M (12.48M threads, 4/core) -> O(billion) for latency hiding (~100x)
• MTTI: Few/day -> Many/day (O(?))

Courtesy: Prof. Jack Dongarra
Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, etc.
  – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
Parallel Programming Models Overview

• Shared Memory Model (e.g., SHMEM, DSM): processes P1, P2, P3 operate on one shared memory
• Distributed Memory Model (e.g., MPI - Message Passing Interface): each process has its own private memory
• Partitioned Global Address Space (PGAS) (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...): private memories combined into a logical shared memory

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance (a minimal message-passing sketch follows below)
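As a concrete illustration of the distributed-memory (message-passing) model contrasted above, here is a minimal, hypothetical MPI sketch that is not taken from the talk; it simply shows one rank sending a value to another across separate address spaces.

/* Minimal distributed-memory example: rank 0 sends a value to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* explicit message */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* data arrives as a copy */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}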
Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM (see the one-sided sketch below)
    • Global Arrays
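To make the "lightweight one-sided communication" point concrete, here is a small, hypothetical OpenSHMEM sketch (using the OpenSHMEM 1.0-style start_pes interface); it is illustrative only, not code from the talk. Each PE writes its value directly into a symmetric variable on its right neighbor without that neighbor posting a receive.

/* One-sided put into a neighbor's symmetric variable. */
#include <shmem.h>
#include <stdio.h>

long src;   /* symmetric: one instance per PE */
long dst;

int main(void)
{
    start_pes(0);                 /* OpenSHMEM 1.0-style initialization */
    int me   = _my_pe();
    int npes = _num_pes();

    src = me;
    shmem_barrier_all();

    /* write my value into dst on the right neighbor; no matching receive needed */
    shmem_long_put(&dst, &src, 1, (me + 1) % npes);

    shmem_barrier_all();
    printf("PE %d received %ld\n", me, dst);
    return 0;
}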
Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (a hedged sketch follows below)
• Benefits:
  – Best of the distributed computing model
  – Best of the shared memory computing model
• Exascale Roadmap*:
  – "Hybrid Programming is a practical way to program exascale systems"

[Figure: an HPC application with kernels 1..N, each running over MPI; selected kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
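Below is a hedged sketch of what such a hybrid kernel mix might look like when every MPI rank is also an OpenSHMEM PE under a unified runtime (as MVAPICH2-X provides). The initialization ordering and the kernel bodies are assumptions made for illustration; consult the runtime's documentation for the supported hybrid usage.

/* Hybrid MPI + OpenSHMEM sketch: a regular kernel uses an MPI collective,
 * an irregular kernel uses a one-sided OpenSHMEM operation. Illustrative only. */
#include <mpi.h>
#include <shmem.h>

long remote_counter = 0;   /* symmetric variable, one copy per PE */

int main(int argc, char **argv)
{
    int rank, npes;

    MPI_Init(&argc, &argv);
    start_pes(0);                       /* assumed: attaches to the same set of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    npes = _num_pes();

    /* Kernel 1: regular, bulk-synchronous communication -> MPI collective */
    long local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    /* Kernel 2: irregular, fine-grained updates -> one-sided OpenSHMEM atomic */
    shmem_long_inc(&remote_counter, (rank + 1) % npes);
    shmem_barrier_all();

    MPI_Finalize();
    return 0;
}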
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: software stack with co-design opportunities and challenges across the layers - performance, scalability, fault-resilience]

• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Middleware: Communication Library or Runtime for Programming Models
  – Point-to-point Communication, Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, and OmniPath); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC)
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, ...)
• Virtualization
• Energy-awareness
Additional Challenges for Designing Exascale Software Libraries

• Extreme low memory footprint
  – Memory per core continues to decrease
• D-L-A Framework
  – Discover
    • Overall network topology (fat-tree, 3D, ...), network topology of the processes for a given job
    • Node architecture, health of network and node
  – Learn
    • Impact on performance and scalability
    • Potential for failure
  – Adapt
    • Internal protocols and algorithms
    • Process mapping
    • Fault-tolerance solutions
  – Low-overhead techniques while delivering performance, scalability and fault-tolerance
Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,525 organizations in 77 countries
  – More than 351,000 (> 0.35 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th ranked 519,640-core cluster (Stampede) at TACC
    • 13th ranked 185,344-core cluster (Pleiades) at NASA
    • 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
    Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
MVAPICH2 Architecture

High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++*); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime - Diverse APIs and Mechanisms: Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, OmniPath):
• Transport Protocols: RC, XRC, UD, DC
• Modern Features: UMR, ODP*, SR-IOV, Multi-Rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU):
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM
• Modern Features: MCDRAM*, NVLink*, CAPI*

* - Upcoming
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  – Support for advanced IB mechanisms (UMR and ODP)
  – Extremely minimal memory footprint
  – Scalable job start-up
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
One-way Latency: MPI over IB with MVAPICH2

[Figure: small-message and large-message one-way latency (us) vs. message size; small-message latencies of 0.95-1.26 us across the four adapters below (how such latency is typically measured is sketched after this slide)]

TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (Haswell) Intel PCI Gen3 Back-to-back
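For readers unfamiliar with how such numbers are obtained, the sketch below shows a ping-pong loop in the spirit of osu_latency: one-way latency is reported as half of the average round-trip time. The message size and iteration count are arbitrary assumptions, and this is not the benchmark's actual source.

/* Ping-pong one-way latency measurement between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000;          /* assumed iteration count */
    const int size  = 8;              /* assumed message size (bytes) */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half of the average round-trip time */
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}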
Bandwidth: MPI over IB with MVAPICH2

[Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size; peak unidirectional bandwidths of 3387, 6356, 12104, and 12465 MBytes/sec and peak bidirectional bandwidths of 6308, 12161, 21425, and 24353 MBytes/sec across the four adapters below]

TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (Haswell) Intel PCI Gen3 Back-to-back
MVAPICH2 Two-Sided Intra-Node Performance
(Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.2b, Intel Ivy-bridge

[Figure: intra-node latency (0-1K bytes) and intra-socket/inter-socket bandwidth vs. message size for the shared-memory (Shmem), CMA, and LiMIC channels; small-message latencies of 0.18 us (intra-socket) and 0.45 us (inter-socket), peak bandwidths of 14,250 MB/s and 13,749 MB/s]
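One of the kernel-assisted zero-copy mechanisms named above, CMA (Cross Memory Attach), is exposed by Linux through the process_vm_readv/process_vm_writev system calls. The sketch below is a hypothetical illustration of the single-copy idea, not MVAPICH2 source code.

/* CMA illustration: copy bytes directly out of a peer process's address space. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Copy 'len' bytes from 'remote_addr' in process 'pid' into 'local_buf'.
 * The kernel performs a single copy; no shared-memory staging buffer is used. */
ssize_t cma_read(pid_t pid, void *local_buf, void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}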
User-mode Memory Registration (UMR)

• Introduced by Mellanox to support direct local and remote non-contiguous memory access
  – Avoids packing at the sender and unpacking at the receiver
• Available with MVAPICH2-X 2.2b

[Figure: small & medium message latency (4K-1M bytes) and large message latency (2M-16M bytes) for UMR vs. the default datatype processing]

Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER, 2015
On-Demand Paging (ODP)

• Introduced by Mellanox to support direct remote memory access without pinning
• Memory regions paged in/out dynamically by the HCA/OS
• Size of registered buffers can be larger than physical memory
• Will be available in the upcoming MVAPICH2-X 2.2 RC1
• (a sketch of what ODP registration looks like at the verbs level follows below)

[Figure: Graph500 pin-down buffer sizes (MB) and Graph500 BFS kernel execution time (s) at 16, 32, and 64 processes, comparing pin-down and ODP]

Connect-IB (54 Gbps): 2.6 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
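As a rough illustration of the mechanism (not MVAPICH2 code), registering a memory region with the IBV_ACCESS_ON_DEMAND flag asks the HCA to fault pages in on demand rather than pinning the whole buffer up front; capability checks and error handling are omitted here.

/* Hedged sketch: ODP registration with libibverbs. */
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_odp(struct ibv_pd *pd, void *buf, size_t len)
{
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_READ |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_ON_DEMAND;     /* no up-front pinning of 'buf' */

    /* With ODP, the registered region may even exceed physical memory. */
    return ibv_reg_mr(pd, buf, len, access);
}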
Minimizing Memory Footprint by Direct Connect (DC) Transport

[Figure: processes P0-P7 on nodes 0-3 communicating through the IB network over DC]

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available since MVAPICH2-X 2.2a

[Figure: NAMD Apoa1 (large data set) normalized execution time at 160, 320, and 620 processes, and connection memory (KB) for Alltoall at 80-640 processes, comparing RC, DC-Pool, UD, and XRC]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
Towards High Performance and Scalable Startup at Exascale

• Near-constant MPI and OpenSHMEM initialization time at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM respectively at 16,384 processes
• Memory consumption reduced for remote endpoint information by O(processes per node)
• 1 GB memory saved per node with 1M processes and 16 processes per node

[Figure: job startup performance (P = PGAS - state of the art, M = MPI - state of the art, O = PGAS/MPI - optimized) and memory required to store endpoint information; optimizations include PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, Shmem-based PMI, and on-demand connection management]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), Accepted for Publication
Process Management Interface over Shared Memory (SHMEMPMI)

• SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
• Only a single copy per node - O(processes per node) reduction in memory usage
• Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
• Up to 1,000 times faster PMI Gets compared to the default design. Will be available in MVAPICH2 2.2RC1.

[Figure: time taken by one PMI_Get (ms) vs. processes per node (Default vs. SHMEMPMI, about 1000x estimated improvement) and memory usage per node (MB) for remote EP information vs. processes per job (Fence/Allgather, Default vs. Shmem; 16x actual reduction)]

TACC Stampede - Connect-IB (54 Gbps): 2.6 GHz Quad Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR

SHMEMPMI - Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), Accepted for publication
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
  – Offload and non-blocking
  – Topology-aware
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version
• (a minimal overlap sketch with a non-blocking collective follows below)

[Figure: HPL normalized performance (HPL-Offload, HPL-1ring, HPL-Host) vs. problem size (N) as % of total memory; P3DFFT application run time (s) vs. data size; PCG run time (s) for 64-512 processes (PCG-Default vs. Modified-PCG-Offload)]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
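The applications above overlap computation with a collective that progresses in the network. A minimal sketch of that pattern with an MPI-3 non-blocking collective is shown below; do_independent_work() is a hypothetical placeholder for application compute, and the sketch is not taken from the modified HPL/P3DFFT/PCG codes.

/* Overlap pattern: start a non-blocking broadcast, compute, then complete it. */
#include <mpi.h>

void do_independent_work(void);   /* assumed application routine */

void overlapped_bcast(double *buf, int count, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Ibcast(buf, count, MPI_DOUBLE, 0, comm, &req);   /* collective starts */
    do_independent_work();                               /* compute while the network
                                                            (or offload hardware) progresses it */
    MPI_Wait(&req, MPI_STATUS_IGNORE);                   /* complete before using buf */
}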
Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figure: overall performance and split-up of physical communication for MILC on Ranger - performance for varying system sizes, and default vs. topology-aware placement for a 2048-core run (15% improvement)]

• Reduce network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
  – CUDA-aware MPI
  – GPUDirect RDMA (GDR) support
  – CUDA-aware non-blocking collectives
  – Support for managed memory
  – Efficient datatype processing
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
MPI + CUDA - Naive

• Data movement in applications with standard MPI and CUDA interfaces
• High Productivity and Low Performance

[Figure: data staged through host memory on the path GPU -> PCIe -> CPU -> NIC -> Switch]

At Sender:
  cudaMemcpy(s_hostbuf, s_devbuf, . . .);
  MPI_Send(s_hostbuf, size, . . .);

At Receiver:
  MPI_Recv(r_hostbuf, size, . . .);
  cudaMemcpy(r_devbuf, r_hostbuf, . . .);
MPI + CUDA - Advanced

• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Low Productivity and High Performance

[Figure: pipelined staging through host memory on the path GPU -> PCIe -> CPU -> NIC -> Switch]

At Sender:
  for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, . . .);
  for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
      result = cudaStreamQuery(. . .);
      if (j > 0) MPI_Test(. . .);
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, . . .);
  }
  MPI_Waitall(. . .);

<<Similar at receiver>>
GPU-Aware MPI Library: MVAPICH2-GPU

At Sender:
  MPI_Send(s_devbuf, size, . . .);

At Receiver:
  MPI_Recv(r_devbuf, size, . . .);

(staging and pipelining are handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High Performance and High Productivity
GPU-Direct RDMA (GDR) with CUDA

• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
  – Hybrid design using GPU-Direct RDMA
    • GPUDirect RDMA and host-based pipelining
    • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  – Support for communication using multi-rail
  – Support for Mellanox Connect-IB and ConnectX VPI adapters
  – Support for RoCE with Mellanox ConnectX VPI adapters

[Figure: GPU memory <-> IB adapter P2P path through the CPU chipset and system memory. P2P bandwidth: SNB E5-2670 - write 5.2 GB/s, read < 1.0 GB/s; IVB E5-2680V2 - write 6.4 GB/s, read 3.5 GB/s]
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR; 2.18 us small-message latency, with improvements of roughly 10-11x over MV2 without GDR and about 2x over MV2-GDR 2.0b]

MVAPICH2-GDR-2.2b
Intel Ivy Bridge (E5-2680 v2) node - 20 cores
NVIDIA Tesla K40c GPU
Mellanox Connect-IB Dual-FDR HCA
CUDA 7
Mellanox OFED 2.4 with GPU-Direct-RDMA
Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HoomdBlue Version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: average time steps per second (TPS) for 4-32 processes with 64K and 256K particles, MV2 vs. MV2+GDR; about 2x improvement in both cases]
CUDA-Aware Non-Blocking Collectives

[Figure: medium/large-message overlap (%) on 64 GPU nodes for Ialltoall and Igather, with 1 process/node and with 2 processes/node (1 process/GPU)]

Platform: Wilkes - Intel Ivy Bridge, NVIDIA Tesla K20c + Mellanox Connect-IB
Available since MVAPICH2-GDR 2.2a

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC, 2015
Communication Runtime with GPU Managed Memory

• With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
• Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
• Extended MVAPICH2 to perform communications directly from managed buffers (available in MVAPICH2-GDR 2.2b); a hedged sketch follows below
• OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communications using managed buffers
• Available in OMB 5.2

[Figure: 2D stencil halo exchange time (ms) for halowidth=1 vs. total dimension size (1 byte-16K bytes), comparing device and managed buffers]

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, to be held in conjunction with PPoPP '16
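The sketch below illustrates how an application might pass a managed buffer straight to MPI when the library supports it (as described above for MVAPICH2-GDR 2.2b). It is a hypothetical, simplified example; the kernel launch is elided and only the allocation and communication calls are shown.

/* Sending directly from CUDA managed memory with a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_managed(int rank, size_t n)
{
    double *buf;
    cudaMallocManaged(&buf, n * sizeof(double));   /* one pointer usable by CPU and GPU */

    /* ... launch kernels that read/write buf ... */
    cudaDeviceSynchronize();                       /* make device updates visible */

    if (rank == 0)
        MPI_Send(buf, (int)n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* managed buffer passed directly */
    else if (rank == 1)
        MPI_Recv(buf, (int)n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(buf);
}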
MPI Datatype Processing (Communication Optimization)

Common scenario (a waste of computing resources on the CPU and GPU); a minimal datatype sketch follows below:

  MPI_Isend(A, ..., Datatype, ...);
  MPI_Isend(B, ..., Datatype, ...);
  MPI_Isend(C, ..., Datatype, ...);
  MPI_Isend(D, ..., Datatype, ...);
  ...
  MPI_Waitall(...);

  *Buf1, Buf2, ... contain non-contiguous MPI datatypes

[Figure: timeline of the existing design vs. the proposed design. In the existing design, for each Isend the CPU initiates the packing kernel on a stream, waits for the kernel (WFK), and only then starts the send. In the proposed design, the kernels for Isend(1), Isend(2), and Isend(3) are initiated back-to-back and the sends are progressed as each kernel finishes, so the proposed design finishes earlier than the existing one (expected benefits)]
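For context, the kind of non-contiguous layout that needs this packing is described with an MPI derived datatype. Below is a small, hypothetical sketch (not from the talk) that sends one column of a row-major N x N matrix; the MPI library packs and unpacks it internally, on the CPU or, with the designs above, on the GPU.

/* Describe a strided column with MPI_Type_vector and send it as one message. */
#include <mpi.h>

void send_matrix_column(double *matrix, int N, int dest, MPI_Comm comm)
{
    MPI_Datatype column;

    /* N blocks of 1 double, each separated by a stride of N elements (one row) */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* The library performs the pack/unpack of the non-contiguous data */
    MPI_Send(matrix, 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}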
Application-Level Evaluation (HaloExchange - Cosmo)

[Figure: normalized execution time on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs) for the Default, Callback-based, and Event-based designs]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
MPI Applications on MIC Clusters

• Flexibility in launching MPI jobs on clusters with Xeon Phi

[Figure: execution modes spanning multi-core-centric (Xeon) to many-core-centric (Xeon Phi): Host-only (MPI program on the Xeon only); Offload/reverse offload (MPI program on the Xeon with offloaded computation on the Xeon Phi); Symmetric (MPI programs on both Xeon and Xeon Phi); Coprocessor-only (MPI program on the Xeon Phi only)]
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

• Offload mode
• Intranode communication
  – Coprocessor-only and symmetric modes
• Internode communication
  – Coprocessor-only and symmetric modes
• Multi-MIC node configurations
• Running on three major systems
  – Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Figure: intra-socket and inter-socket MIC-remote-MIC point-to-point performance - large-message latency (8K-2M bytes) and bandwidth (1 byte-1M bytes); peak bandwidths of 5236 and 5594 MB/sec (lower is better for latency, higher is better for bandwidth)]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Figure: 32-node Allgather (16H + 16M) small-message latency, 32-node Allgather (8H + 8M) large-message latency, and 32-node Alltoall (8H + 8M) large-message latency for MV2-MIC vs. MV2-MIC-Opt; P3DFFT performance (communication and computation time) on 32 nodes (8H + 8M), size 2K*2K*1K. MV2-MIC-Opt improves latency and communication time by up to 55-76%]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC, and CAF calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, and iWARP]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 (2012) onwards!
• UPC++ support will be available in the upcoming MVAPICH2-X 2.2RC1
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC + CAF
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
  – Scalable inter-node and intra-node communication - point-to-point and collectives
Application-Level Performance with Graph500 and Sort

Graph500 Execution Time
[Figure: Graph500 execution time at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); the hybrid design is up to 7.6X and 13X faster than MPI-Simple]

• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple

Sort Execution Time
[Figure: Sort execution time (seconds) for input data/process counts from 500GB-512 to 4TB-4K, MPI vs. Hybrid; up to 51% improvement]

• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI - 2408 sec (0.16 TB/min); Hybrid - 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
MiniMD - Total Execution Time

• The hybrid design performs better than the MPI implementation
• 1,024 processes: 17% improvement over the MPI version
• Strong scaling, input size: 128 * 128 * 128

[Figure: total execution time (ms) vs. number of cores for Hybrid-Barrier, MPI-Original, and Hybrid-Advanced - Performance (512 and 1,024 cores) and Strong Scaling (256, 512, and 1,024 cores); 17% improvement at 1,024 cores]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)
Hybrid MPI+UPC NAS-FT

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes):
  – 34% improvement over UPC-GASNet
  – 30% improvement over UPC-OSU

[Figure: time (s) for NAS problem size - system size combinations B-64, C-64, B-128, and C-128, comparing UPC-GASNet, UPC-OSU, and Hybrid-OSU; 34% improvement for C-128]

Hybrid MPI + UPC support available since MVAPICH2-X 1.9 (2012)

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Can HPC and Virtualization be Combined?

• Virtualization has many benefits
  – Fault-tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root - IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

• Redesign MVAPICH2 to make it virtual-machine aware
  – SR-IOV shows near-native performance for inter-node point-to-point communication
  – IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
  – Locality Detector: maintains the locality information of co-resident virtual machines
  – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Figure: host environment with two guest VMs; each MPI process in a guest uses a VF driver over an SR-IOV virtual function of the InfiniBand adapter (SR-IOV channel) and a /dev/shm-backed IV-SHM region (IV-Shmem channel); the hypervisor holds the PF driver for the physical function]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC, 2014
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

[Figure: OpenStack services around a VM - Nova (provisioning), Glance (images), Neutron (networking), Swift (object/backup storage), Cinder (volumes), Keystone (authentication), Ceilometer (monitoring), Horizon (UI), Heat (cloud orchestration)]

• OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
• Deployment with OpenStack
  – Supporting SR-IOV configuration
  – Supporting IVSHMEM configuration
  – Virtual-machine-aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack

J. Zhang, X. Lu, M. Arnold, D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid, 2015
Application-Level Performance on Chameleon

[Figure: Graph500 execution time (ms) for problem sizes (scale, edgefactor) from (22,20) to (26,16) and SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]

• 32 VMs, 6 cores/VM
• Compared to native, 2-5% overhead for Graph500 with 128 processes
• Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 processes
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

• Large-scale instrument
  – Targeting Big Data, Big Compute, Big Instrument research
  – ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with a 100G network
• Reconfigurable instrument
  – Bare-metal reconfiguration, operated as a single instrument, graduated approach for ease of use
• Connected instrument
  – Workload and Trace Archive
  – Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
  – Partnerships with users
• Complementary instrument
  – Complementing GENI, Grid'5000, and other testbeds
• Sustainable instrument
  – Industry connections

http://www.chameleoncloud.org/
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

• MVAPICH2-EA 2.1 (Energy-Aware)
  – A white-box approach
  – New energy-efficient communication protocols for pt-pt and collective operations
  – Intelligently apply the appropriate energy-saving techniques
  – Application-oblivious energy saving
• OEMT
  – A library utility to measure energy consumption for MPI applications
  – Works with all MPI runtimes
  – PRELOAD option for precompiled applications
  – Does not require ROOT permission:
    • A safe kernel module to read only a subset of MSRs
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies the energy reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of OSU INAM

• OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
• Major features of the OSU INAM tool include:
  – Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
  – Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (pt-to-pt, collectives and RMA)
  – Remotely monitor CPU utilization of MPI processes at user-specified granularity
  – Visualize the data transfer happening in a "live" fashion - Live View for
    • Entire Network - Live Network Level View
    • Particular Job - Live Job Level View
    • One or multiple Nodes - Live Node Level View
  – Capability to visualize data transfer that happened in the network over a time duration in the past
    • Entire Network - Historical Network Level View
    • Particular Job - Historical Job Level View
    • One or multiple Nodes - Historical Node Level View
OSU INAM - Network Level View

• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links/links in error state
• See the history unfold - play back historical state of the network

[Figure: full network (152 nodes) and a zoomed-in view of the network]
OSU INAM - Job and Node Level Views

[Figure: visualizing a job (5 nodes) and finding routes between nodes]

• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g., XmitDiscard, RcvError) per rank/node
MVAPICH2 - Plans for Exascale

• Performance and memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
  – Support for task-based parallelism (UPC++)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features
  – User-mode Memory Registration (UMR)
  – On-Demand Paging (ODP)
• Enhanced inter-node and intra-node communication schemes for the upcoming OmniPath and Knights Landing architectures
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Energy-aware point-to-point (one-sided and two-sided) and collectives
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended checkpoint-restart and migration support with SCR
Looking into the Future ....

• Exascale systems will be constrained by
  – Power
  – Memory per core
  – Data movement cost
  – Faults
• Programming models and runtimes for HPC need to be designed for
  – Scalability
  – Performance
  – Fault-resilience
  – Energy-awareness
  – Programmability
  – Productivity
• Highlighted some of the issues and challenges
• Need continuous innovation on all these fronts
Funding	Acknowledgments	
Funding	Support	by	
Equipment	Support	by
Personnel	Acknowledgments	
Current	Students		
–  A. Augustine (M.S.)
–  A.	Awan	(Ph.D.)	
–  S.	Chakraborthy		(Ph.D.)	
–  C.-H.	Chu	(Ph.D.)	
–  N.	Islam	(Ph.D.)	
–  M.	Li	(Ph.D.)	
Past	Students		
–  P.	Balaji	(Ph.D.)	
–  S.	Bhagvat	(M.S.)	
–  A.	Bhat	(M.S.)		
–  D. Buntinas (Ph.D.)
–  L.	Chai	(Ph.D.)	
–  B.	Chandrasekharan	(M.S.)	
–  N.	Dandapanthula	(M.S.)	
–  V.	Dhanraj	(M.S.)	
–  T.	Gangadharappa	(M.S.)	
–  K.	Gopalakrishnan	(M.S.)	
–  G.	Santhanaraman	(Ph.D.)	
–  A.	Singh	(Ph.D.)	
–  J.	Sridhar	(M.S.)	
–  S.	Sur	(Ph.D.)	
–  H.	Subramoni	(Ph.D.)	
–  K.	Vaidyanathan	(Ph.D.)	
–  A.	Vishnu	(Ph.D.)	
–  J.	Wu	(Ph.D.)	
–  W.	Yu	(Ph.D.)	
Past Research Scientist
–  S.	Sur	
Current	Post-Doc	
–  J.	Lin	
–  D.	Banerjee	
Current	Programmer	
–  J.	Perkins	
Past	Post-Docs	
–  H.	Wang	
–  X.	Besseron	
–  H.-W.	Jin	
–  M.	Luo	
–  W.	Huang	(Ph.D.)	
–  W.	Jiang	(M.S.)	
–  J.	Jose	(Ph.D.)	
–  S.	Kini	(M.S.)	
–  M.	Koop	(Ph.D.)	
–  R.	Kumar	(M.S.)	
–  S.	Krishnamoorthy	(M.S.)	
–  K.	Kandalla	(Ph.D.)	
–  P.	Lai	(M.S.)	
–  J.	Liu	(Ph.D.)	
–  M.	Luo	(Ph.D.)	
–  A.	Mamidala	(Ph.D.)	
–  G.	Marsh	(M.S.)	
–  V.	Meshram	(M.S.)	
–  A.	Moody	(M.S.)	
–  S.	Naravula	(Ph.D.)	
–  R.	Noronha	(Ph.D.)	
–  X.	Ouyang	(Ph.D.)	
–  S.	Pai	(M.S.)	
–  S.	Potluri	(Ph.D.)
–  R.	Rajachandrasekar	(Ph.D.)	
	
–  K.	Kulkarni	(M.S.)	
–  M.	Rahman	(Ph.D.)	
–  D.	Shankar	(Ph.D.)	
–  A.	Venkatesh	(Ph.D.)	
–  J.	Zhang	(Ph.D.)	
–  E.	Mancini	
–  S.	Marcarelli	
–  J.	Vienne	
Current Research Scientists    Current Senior Research Associate
–  H.	Subramoni	
–  X.	Lu	
Past	Programmers	
–  D.	Bureddy	
						-		K.	Hamidouche	
Current	Research	Specialist	
–  M.	Arnold
International Workshop on Communication Architectures at Extreme Scale (ExaComm)

ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), at Frankfurt, Germany, on Thursday, July 16th, 2015

One Keynote Talk: John M. Shalf, CTO, LBL/NERSC
Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
Panel: Ron Brightwell (Sandia)
Two Research Papers

ExaComm 2016 will be held in conjunction with ISC '16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
Technical Paper Submission Deadline: Friday, April 15, 2016
Thank You!
panda@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/