Programming Models for Exascale Systems

Keynote Talk at HPCAC-Stanford (Feb 2016)

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
High-End Computing (HEC): ExaFlop & ExaByte

[Figure: projected growth of HEC systems - compute: 100-200 PFlops in 2016-2018, 1 EFlops in 2020-2024?; data: 10K-20K EBytes in 2016-2018, 40K EBytes in 2020? Source: IDC's Digital Universe Study, sponsored by EMC, December 2012]

• ExaFlop & HPC
• ExaByte & BigData
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Figure: number and percentage of commodity clusters in the Top500 over time; commodity clusters now make up 85% of the list]
Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Figure: building blocks - Multi-core Processors; High Performance Interconnects - InfiniBand (<1 usec latency, 100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
Large-scale InfiniBand Installations

• 235 IB clusters (47%) in the Nov '15 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
  – 462,462 cores (Stampede) at TACC (10th)
  – 185,344 cores (Pleiades) at NASA/Ames (13th)
  – 72,800 cores Cray CS-Storm in US (15th)
  – 72,800 cores Cray CS-Storm in US (16th)
  – 265,440 cores SGI ICE at Tulip Trading Australia (17th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
  – 72,000 cores (HPC2) in Italy (19th)
  – 152,692 cores (Thunder) at AFRL/USA (21st)
  – 147,456 cores (SuperMUC) in Germany (22nd)
  – 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  – 194,616 cores (Cascade) at PNNL (27th)
  – 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  – 110,400 cores (Pangea) in France (33rd)
  – 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  – 57,600 cores (SwiftLucy) in US (37th)
  – 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  – 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  – 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  – and many more!
Towards Exascale System (Today and Target)

Systems: Tianhe-2 (2016) vs. Exascale (2020-2024), with the difference between today and exascale:
• System peak: 55 PFlop/s -> 1 EFlop/s (~20x)
• Power: 18 MW (3 Gflops/W) -> ~20 MW (50 Gflops/W) (O(1) in power, ~15x in efficiency)
• System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) -> 32-64 PB (~50x)
• Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) -> 1.2 or 15 TF (O(1))
• Node concurrency: 24-core CPU + 171-core CoP -> O(1k) or O(10k) (~5x - ~50x)
• Total node interconnect BW: 6.36 GB/s -> 200-400 GB/s (~40x - ~60x)
• System size (nodes): 16,000 -> O(100,000) or O(1M) (~6x - ~60x)
• Total concurrency: 3.12M (12.48M threads, 4/core) -> O(billion) for latency hiding (~100x)
• MTTI: Few/day -> Many/day (O(?))

Courtesy: Prof. Jack Dongarra
Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, etc.
  – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
Parallel Programming Models Overview

• Shared Memory Model (e.g., SHMEM, DSM): processes P1, P2, P3 operate on one shared memory
• Distributed Memory Model (e.g., MPI - Message Passing Interface): each process has its own private memory
• Partitioned Global Address Space (PGAS) (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...): private memories combined into a logical shared memory

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance (a minimal message-passing sketch follows below)
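As a concrete illustration of the distributed-memory (message-passing) model contrasted above, here is a minimal, hypothetical MPI sketch that is not taken from the talk; it simply shows one rank sending a value to another across separate address spaces.

/* Minimal distributed-memory example: rank 0 sends a value to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* explicit message */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* data arrives as a copy */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}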
Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM (see the one-sided sketch below)
    • Global Arrays
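To make the "lightweight one-sided communication" point concrete, here is a small, hypothetical OpenSHMEM sketch (using the OpenSHMEM 1.0-style start_pes interface); it is illustrative only, not code from the talk. Each PE writes its value directly into a symmetric variable on its right neighbor without that neighbor posting a receive.

/* One-sided put into a neighbor's symmetric variable. */
#include <shmem.h>
#include <stdio.h>

long src;   /* symmetric: one instance per PE */
long dst;

int main(void)
{
    start_pes(0);                 /* OpenSHMEM 1.0-style initialization */
    int me   = _my_pe();
    int npes = _num_pes();

    src = me;
    shmem_barrier_all();

    /* write my value into dst on the right neighbor; no matching receive needed */
    shmem_long_put(&dst, &src, 1, (me + 1) % npes);

    shmem_barrier_all();
    printf("PE %d received %ld\n", me, dst);
    return 0;
}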
Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (a hedged sketch follows below)
• Benefits:
  – Best of the distributed computing model
  – Best of the shared memory computing model
• Exascale Roadmap*:
  – "Hybrid Programming is a practical way to program exascale systems"

[Figure: an HPC application with kernels 1..N, each running over MPI; selected kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
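Below is a hedged sketch of what such a hybrid kernel mix might look like when every MPI rank is also an OpenSHMEM PE under a unified runtime (as MVAPICH2-X provides). The initialization ordering and the kernel bodies are assumptions made for illustration; consult the runtime's documentation for the supported hybrid usage.

/* Hybrid MPI + OpenSHMEM sketch: a regular kernel uses an MPI collective,
 * an irregular kernel uses a one-sided OpenSHMEM operation. Illustrative only. */
#include <mpi.h>
#include <shmem.h>

long remote_counter = 0;   /* symmetric variable, one copy per PE */

int main(int argc, char **argv)
{
    int rank, npes;

    MPI_Init(&argc, &argv);
    start_pes(0);                       /* assumed: attaches to the same set of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    npes = _num_pes();

    /* Kernel 1: regular, bulk-synchronous communication -> MPI collective */
    long local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    /* Kernel 2: irregular, fine-grained updates -> one-sided OpenSHMEM atomic */
    shmem_long_inc(&remote_counter, (rank + 1) % npes);
    shmem_barrier_all();

    MPI_Finalize();
    return 0;
}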
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: software stack with co-design opportunities and challenges across the layers - performance, scalability, fault-resilience]

• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Middleware: Communication Library or Runtime for Programming Models
  – Point-to-point Communication, Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, and OmniPath); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC)
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, ...)
• Virtualization
• Energy-awareness
Additional Challenges for Designing Exascale Software Libraries

• Extreme low memory footprint
  – Memory per core continues to decrease
• D-L-A Framework
  – Discover
    • Overall network topology (fat-tree, 3D, ...), network topology of the processes for a given job
    • Node architecture, health of network and node
  – Learn
    • Impact on performance and scalability
    • Potential for failure
  – Adapt
    • Internal protocols and algorithms
    • Process mapping
    • Fault-tolerance solutions
  – Low-overhead techniques while delivering performance, scalability and fault-tolerance
Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,525 organizations in 77 countries
  – More than 351,000 (> 0.35 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th ranked 519,640-core cluster (Stampede) at TACC
    • 13th ranked 185,344-core cluster (Pleiades) at NASA
    • 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
    Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
MVAPICH2 Architecture

High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++*); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime - Diverse APIs and Mechanisms: Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, OmniPath):
• Transport Protocols: RC, XRC, UD, DC
• Modern Features: UMR, ODP*, SR-IOV, Multi-Rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU):
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM
• Modern Features: MCDRAM*, NVLink*, CAPI*

* - Upcoming
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  – Support for advanced IB mechanisms (UMR and ODP)
  – Extremely minimal memory footprint
  – Scalable job start-up
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
One-way Latency: MPI over IB with MVAPICH2

[Figure: small-message and large-message one-way latency (us) vs. message size; small-message latencies of 0.95-1.26 us across the four adapters below (how such latency is typically measured is sketched after this slide)]

TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (Haswell) Intel PCI Gen3 Back-to-back
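For readers unfamiliar with how such numbers are obtained, the sketch below shows a ping-pong loop in the spirit of osu_latency: one-way latency is reported as half of the average round-trip time. The message size and iteration count are arbitrary assumptions, and this is not the benchmark's actual source.

/* Ping-pong one-way latency measurement between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000;          /* assumed iteration count */
    const int size  = 8;              /* assumed message size (bytes) */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half of the average round-trip time */
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}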
Bandwidth: MPI over IB with MVAPICH2

[Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size; peak unidirectional bandwidths of 3387, 6356, 12104, and 12465 MBytes/sec and peak bidirectional bandwidths of 6308, 12161, 21425, and 24353 MBytes/sec across the four adapters below]

TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (Haswell) Intel PCI Gen3 Back-to-back
MVAPICH2 Two-Sided Intra-Node Performance
(Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.2b, Intel Ivy-bridge

[Figure: intra-node latency (0-1K bytes) and intra-socket/inter-socket bandwidth vs. message size for the shared-memory (Shmem), CMA, and LiMIC channels; small-message latencies of 0.18 us (intra-socket) and 0.45 us (inter-socket), peak bandwidths of 14,250 MB/s and 13,749 MB/s]
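One of the kernel-assisted zero-copy mechanisms named above, CMA (Cross Memory Attach), is exposed by Linux through the process_vm_readv/process_vm_writev system calls. The sketch below is a hypothetical illustration of the single-copy idea, not MVAPICH2 source code.

/* CMA illustration: copy bytes directly out of a peer process's address space. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Copy 'len' bytes from 'remote_addr' in process 'pid' into 'local_buf'.
 * The kernel performs a single copy; no shared-memory staging buffer is used. */
ssize_t cma_read(pid_t pid, void *local_buf, void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}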
User-mode Memory Registration (UMR)

• Introduced by Mellanox to support direct local and remote non-contiguous memory access
  – Avoids packing at the sender and unpacking at the receiver
• Available with MVAPICH2-X 2.2b

[Figure: small & medium message latency (4K-1M bytes) and large message latency (2M-16M bytes) for UMR vs. the default datatype processing]

Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER, 2015
On-Demand Paging (ODP)

• Introduced by Mellanox to support direct remote memory access without pinning
• Memory regions paged in/out dynamically by the HCA/OS
• Size of registered buffers can be larger than physical memory
• Will be available in the upcoming MVAPICH2-X 2.2 RC1
• (a sketch of what ODP registration looks like at the verbs level follows below)

[Figure: Graph500 pin-down buffer sizes (MB) and Graph500 BFS kernel execution time (s) at 16, 32, and 64 processes, comparing pin-down and ODP]

Connect-IB (54 Gbps): 2.6 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
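As a rough illustration of the mechanism (not MVAPICH2 code), registering a memory region with the IBV_ACCESS_ON_DEMAND flag asks the HCA to fault pages in on demand rather than pinning the whole buffer up front; capability checks and error handling are omitted here.

/* Hedged sketch: ODP registration with libibverbs. */
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_odp(struct ibv_pd *pd, void *buf, size_t len)
{
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_READ |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_ON_DEMAND;     /* no up-front pinning of 'buf' */

    /* With ODP, the registered region may even exceed physical memory. */
    return ibv_reg_mr(pd, buf, len, access);
}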
Minimizing Memory Footprint by Direct Connect (DC) Transport

[Figure: processes P0-P7 on nodes 0-3 communicating through the IB network over DC]

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available since MVAPICH2-X 2.2a

[Figure: NAMD Apoa1 (large data set) normalized execution time at 160, 320, and 620 processes, and connection memory (KB) for Alltoall at 80-640 processes, comparing RC, DC-Pool, UD, and XRC]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
Towards High Performance and Scalable Startup at Exascale

• Near-constant MPI and OpenSHMEM initialization time at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM respectively at 16,384 processes
• Memory consumption reduced for remote endpoint information by O(processes per node)
• 1 GB memory saved per node with 1M processes and 16 processes per node

[Figure: job startup performance (P = PGAS - state of the art, M = MPI - state of the art, O = PGAS/MPI - optimized) and memory required to store endpoint information; optimizations include PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, Shmem-based PMI, and on-demand connection management]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), Accepted for Publication
Process Management Interface over Shared Memory (SHMEMPMI)

• SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
• Only a single copy per node - O(processes per node) reduction in memory usage
• Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
• Up to 1,000 times faster PMI Gets compared to the default design. Will be available in MVAPICH2 2.2RC1.

[Figure: time taken by one PMI_Get (ms) vs. processes per node (Default vs. SHMEMPMI, about 1000x estimated improvement) and memory usage per node (MB) for remote EP information vs. processes per job (Fence/Allgather, Default vs. Shmem; 16x actual reduction)]

TACC Stampede - Connect-IB (54 Gbps): 2.6 GHz Quad Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR

SHMEMPMI - Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), Accepted for publication
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
  – Offload and non-blocking
  – Topology-aware
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version
• (a minimal overlap sketch with a non-blocking collective follows below)

[Figure: HPL normalized performance (HPL-Offload, HPL-1ring, HPL-Host) vs. problem size (N) as % of total memory; P3DFFT application run time (s) vs. data size; PCG run time (s) for 64-512 processes (PCG-Default vs. Modified-PCG-Offload)]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
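The applications above overlap computation with a collective that progresses in the network. A minimal sketch of that pattern with an MPI-3 non-blocking collective is shown below; do_independent_work() is a hypothetical placeholder for application compute, and the sketch is not taken from the modified HPL/P3DFFT/PCG codes.

/* Overlap pattern: start a non-blocking broadcast, compute, then complete it. */
#include <mpi.h>

void do_independent_work(void);   /* assumed application routine */

void overlapped_bcast(double *buf, int count, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Ibcast(buf, count, MPI_DOUBLE, 0, comm, &req);   /* collective starts */
    do_independent_work();                               /* compute while the network
                                                            (or offload hardware) progresses it */
    MPI_Wait(&req, MPI_STATUS_IGNORE);                   /* complete before using buf */
}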
Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figure: overall performance and split-up of physical communication for MILC on Ranger - performance for varying system sizes, and default vs. topology-aware placement for a 2048-core run (15% improvement)]

• Reduce network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
  – CUDA-aware MPI
  – GPUDirect RDMA (GDR) support
  – CUDA-aware non-blocking collectives
  – Support for managed memory
  – Efficient datatype processing
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
MPI + CUDA - Naive

• Data movement in applications with standard MPI and CUDA interfaces
• High Productivity and Low Performance

[Figure: data staged through host memory on the path GPU -> PCIe -> CPU -> NIC -> Switch]

At Sender:
  cudaMemcpy(s_hostbuf, s_devbuf, . . .);
  MPI_Send(s_hostbuf, size, . . .);

At Receiver:
  MPI_Recv(r_hostbuf, size, . . .);
  cudaMemcpy(r_devbuf, r_hostbuf, . . .);
MPI + CUDA - Advanced

• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Low Productivity and High Performance

[Figure: pipelined staging through host memory on the path GPU -> PCIe -> CPU -> NIC -> Switch]

At Sender:
  for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, . . .);
  for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
      result = cudaStreamQuery(. . .);
      if (j > 0) MPI_Test(. . .);
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, . . .);
  }
  MPI_Waitall(. . .);

<<Similar at receiver>>
GPU-Aware MPI Library: MVAPICH2-GPU

At Sender:
  MPI_Send(s_devbuf, size, . . .);

At Receiver:
  MPI_Recv(r_devbuf, size, . . .);

(staging and pipelining are handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High Performance and High Productivity
GPU-Direct RDMA (GDR) with CUDA

• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
  – Hybrid design using GPU-Direct RDMA
    • GPUDirect RDMA and host-based pipelining
    • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  – Support for communication using multi-rail
  – Support for Mellanox Connect-IB and ConnectX VPI adapters
  – Support for RoCE with Mellanox ConnectX VPI adapters

[Figure: GPU memory <-> IB adapter P2P path through the CPU chipset and system memory. P2P bandwidth: SNB E5-2670 - write 5.2 GB/s, read < 1.0 GB/s; IVB E5-2680V2 - write 6.4 GB/s, read 3.5 GB/s]
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR; 2.18 us small-message latency, with improvements of roughly 10-11x over MV2 without GDR and about 2x over MV2-GDR 2.0b]

MVAPICH2-GDR-2.2b
Intel Ivy Bridge (E5-2680 v2) node - 20 cores
NVIDIA Tesla K40c GPU
Mellanox Connect-IB Dual-FDR HCA
CUDA 7
Mellanox OFED 2.4 with GPU-Direct-RDMA
Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HoomdBlue Version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: average time steps per second (TPS) for 4-32 processes with 64K and 256K particles, MV2 vs. MV2+GDR; about 2x improvement in both cases]
CUDA-Aware Non-Blocking Collectives

[Figure: medium/large-message overlap (%) on 64 GPU nodes for Ialltoall and Igather, with 1 process/node and with 2 processes/node (1 process/GPU)]

Platform: Wilkes - Intel Ivy Bridge, NVIDIA Tesla K20c + Mellanox Connect-IB
Available since MVAPICH2-GDR 2.2a

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC, 2015
Communication Runtime with GPU Managed Memory

• With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
• Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
• Extended MVAPICH2 to perform communications directly from managed buffers (available in MVAPICH2-GDR 2.2b); a hedged sketch follows below
• OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communications using managed buffers
• Available in OMB 5.2

[Figure: 2D stencil halo exchange time (ms) for halowidth=1 vs. total dimension size (1 byte-16K bytes), comparing device and managed buffers]

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, to be held in conjunction with PPoPP '16
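The sketch below illustrates how an application might pass a managed buffer straight to MPI when the library supports it (as described above for MVAPICH2-GDR 2.2b). It is a hypothetical, simplified example; the kernel launch is elided and only the allocation and communication calls are shown.

/* Sending directly from CUDA managed memory with a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_managed(int rank, size_t n)
{
    double *buf;
    cudaMallocManaged(&buf, n * sizeof(double));   /* one pointer usable by CPU and GPU */

    /* ... launch kernels that read/write buf ... */
    cudaDeviceSynchronize();                       /* make device updates visible */

    if (rank == 0)
        MPI_Send(buf, (int)n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* managed buffer passed directly */
    else if (rank == 1)
        MPI_Recv(buf, (int)n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(buf);
}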
MPI Datatype Processing (Communication Optimization)

Common scenario (a waste of computing resources on the CPU and GPU); a minimal datatype sketch follows below:

  MPI_Isend(A, ..., Datatype, ...);
  MPI_Isend(B, ..., Datatype, ...);
  MPI_Isend(C, ..., Datatype, ...);
  MPI_Isend(D, ..., Datatype, ...);
  ...
  MPI_Waitall(...);

  *Buf1, Buf2, ... contain non-contiguous MPI datatypes

[Figure: timeline of the existing design vs. the proposed design. In the existing design, for each Isend the CPU initiates the packing kernel on a stream, waits for the kernel (WFK), and only then starts the send. In the proposed design, the kernels for Isend(1), Isend(2), and Isend(3) are initiated back-to-back and the sends are progressed as each kernel finishes, so the proposed design finishes earlier than the existing one (expected benefits)]
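For context, the kind of non-contiguous layout that needs this packing is described with an MPI derived datatype. Below is a small, hypothetical sketch (not from the talk) that sends one column of a row-major N x N matrix; the MPI library packs and unpacks it internally, on the CPU or, with the designs above, on the GPU.

/* Describe a strided column with MPI_Type_vector and send it as one message. */
#include <mpi.h>

void send_matrix_column(double *matrix, int N, int dest, MPI_Comm comm)
{
    MPI_Datatype column;

    /* N blocks of 1 double, each separated by a stride of N elements (one row) */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* The library performs the pack/unpack of the non-contiguous data */
    MPI_Send(matrix, 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}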
Application-Level Evaluation (HaloExchange - Cosmo)

[Figure: normalized execution time on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs) for the Default, Callback-based, and Event-based designs]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
MPI Applications on MIC Clusters

• Flexibility in launching MPI jobs on clusters with Xeon Phi

[Figure: execution modes spanning multi-core-centric (Xeon) to many-core-centric (Xeon Phi): Host-only (MPI program on the Xeon only); Offload/reverse offload (MPI program on the Xeon with offloaded computation on the Xeon Phi); Symmetric (MPI programs on both Xeon and Xeon Phi); Coprocessor-only (MPI program on the Xeon Phi only)]
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

• Offload mode
• Intranode communication
  – Coprocessor-only and symmetric modes
• Internode communication
  – Coprocessor-only and symmetric modes
• Multi-MIC node configurations
• Running on three major systems
  – Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Figure: intra-socket and inter-socket MIC-remote-MIC point-to-point performance - large-message latency (8K-2M bytes) and bandwidth (1 byte-1M bytes); peak bandwidths of 5236 and 5594 MB/sec (lower is better for latency, higher is better for bandwidth)]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Figure: 32-node Allgather (16H + 16M) small-message latency, 32-node Allgather (8H + 8M) large-message latency, and 32-node Alltoall (8H + 8M) large-message latency for MV2-MIC vs. MV2-MIC-Opt; P3DFFT performance (communication and computation time) on 32 nodes (8H + 8M), size 2K*2K*1K. MV2-MIC-Opt improves latency and communication time by up to 55-76%]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC, and CAF calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, and iWARP]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 (2012) onwards!
• UPC++ support will be available in the upcoming MVAPICH2-X 2.2RC1
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC + CAF
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
  – Scalable inter-node and intra-node communication - point-to-point and collectives
Application-Level Performance with Graph500 and Sort

Graph500 Execution Time
[Figure: Graph500 execution time at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); the hybrid design is up to 7.6X and 13X faster than MPI-Simple]

• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple

Sort Execution Time
[Figure: Sort execution time (seconds) for input data/process counts from 500GB-512 to 4TB-4K, MPI vs. Hybrid; up to 51% improvement]

• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI - 2408 sec (0.16 TB/min); Hybrid - 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
MiniMD - Total Execution Time

• The hybrid design performs better than the MPI implementation
• 1,024 processes: 17% improvement over the MPI version
• Strong scaling, input size: 128 * 128 * 128

[Figure: total execution time (ms) vs. number of cores for Hybrid-Barrier, MPI-Original, and Hybrid-Advanced - Performance (512 and 1,024 cores) and Strong Scaling (256, 512, and 1,024 cores); 17% improvement at 1,024 cores]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)
Hybrid MPI+UPC NAS-FT

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes):
  – 34% improvement over UPC-GASNet
  – 30% improvement over UPC-OSU

[Figure: time (s) for NAS problem size - system size combinations B-64, C-64, B-128, and C-128, comparing UPC-GASNet, UPC-OSU, and Hybrid-OSU; 34% improvement for C-128]

Hybrid MPI + UPC support available since MVAPICH2-X 1.9 (2012)

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Can HPC and Virtualization be Combined?

• Virtualization has many benefits
  – Fault-tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root - IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

• Redesign MVAPICH2 to make it virtual-machine aware
  – SR-IOV shows near-native performance for inter-node point-to-point communication
  – IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
  – Locality Detector: maintains the locality information of co-resident virtual machines
  – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Figure: host environment with two guest VMs; each MPI process in a guest uses a VF driver over an SR-IOV virtual function of the InfiniBand adapter (SR-IOV channel) and a /dev/shm-backed IV-SHM region (IV-Shmem channel); the hypervisor holds the PF driver for the physical function]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC, 2014
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

[Figure: OpenStack services around a VM - Nova (provisioning), Glance (images), Neutron (networking), Swift (object/backup storage), Cinder (volumes), Keystone (authentication), Ceilometer (monitoring), Horizon (UI), Heat (cloud orchestration)]

• OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
• Deployment with OpenStack
  – Supporting SR-IOV configuration
  – Supporting IVSHMEM configuration
  – Virtual-machine-aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack

J. Zhang, X. Lu, M. Arnold, D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid, 2015
Application-Level Performance on Chameleon

[Figure: Graph500 execution time (ms) for problem sizes (scale, edgefactor) from (22,20) to (26,16) and SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]

• 32 VMs, 6 cores/VM
• Compared to native, 2-5% overhead for Graph500 with 128 processes
• Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 processes
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

• Large-scale instrument
  – Targeting Big Data, Big Compute, Big Instrument research
  – ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with a 100G network
• Reconfigurable instrument
  – Bare-metal reconfiguration, operated as a single instrument, graduated approach for ease of use
• Connected instrument
  – Workload and Trace Archive
  – Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
  – Partnerships with users
• Complementary instrument
  – Complementing GENI, Grid'5000, and other testbeds
• Sustainable instrument
  – Industry connections

http://www.chameleoncloud.org/
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

• MVAPICH2-EA 2.1 (Energy-Aware)
  – A white-box approach
  – New energy-efficient communication protocols for pt-pt and collective operations
  – Intelligently apply the appropriate energy-saving techniques
  – Application-oblivious energy saving
• OEMT
  – A library utility to measure energy consumption for MPI applications
  – Works with all MPI runtimes
  – PRELOAD option for precompiled applications
  – Does not require ROOT permission:
    • A safe kernel module to read only a subset of MSRs
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies the energy reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Integrated support for GPGPUs
• Integrated support for MICs
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Virtualization
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of OSU INAM

• OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
• Major features of the OSU INAM tool include:
  – Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
  – Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (pt-to-pt, collectives and RMA)
  – Remotely monitor CPU utilization of MPI processes at user-specified granularity
  – Visualize the data transfer happening in a "live" fashion - Live View for
    • Entire Network - Live Network Level View
    • Particular Job - Live Job Level View
    • One or multiple Nodes - Live Node Level View
  – Capability to visualize data transfer that happened in the network over a time duration in the past
    • Entire Network - Historical Network Level View
    • Particular Job - Historical Job Level View
    • One or multiple Nodes - Historical Node Level View
OSU INAM - Network Level View

• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links/links in error state
• See the history unfold - play back historical state of the network

[Figure: full network (152 nodes) and a zoomed-in view of the network]
OSU INAM - Job and Node Level Views

[Figure: visualizing a job (5 nodes) and finding routes between nodes]

• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g., XmitDiscard, RcvError) per rank/node
MVAPICH2 - Plans for Exascale

• Performance and memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
  – Support for task-based parallelism (UPC++)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features
  – User-mode Memory Registration (UMR)
  – On-Demand Paging (ODP)
• Enhanced inter-node and intra-node communication schemes for the upcoming OmniPath and Knights Landing architectures
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Energy-aware point-to-point (one-sided and two-sided) and collectives
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended checkpoint-restart and migration support with SCR
Looking into the Future ....

• Exascale systems will be constrained by
  – Power
  – Memory per core
  – Data movement cost
  – Faults
• Programming models and runtimes for HPC need to be designed for
  – Scalability
  – Performance
  – Fault-resilience
  – Energy-awareness
  – Programmability
  – Productivity
• Highlighted some of the issues and challenges
• Need continuous innovation on all these fronts
Funding	Acknowledgments	
Funding	Support	by	
Equipment	Support	by
Personnel	Acknowledgments	
Current	Students		
–  A. Augustine (M.S.)
–  A.	Awan	(Ph.D.)	
–  S.	Chakraborthy		(Ph.D.)	
–  C.-H.	Chu	(Ph.D.)	
–  N.	Islam	(Ph.D.)	
–  M.	Li	(Ph.D.)	
Past	Students		
–  P.	Balaji	(Ph.D.)	
–  S.	Bhagvat	(M.S.)	
–  A.	Bhat	(M.S.)		
–  D. Buntinas (Ph.D.)
–  L.	Chai	(Ph.D.)	
–  B.	Chandrasekharan	(M.S.)	
–  N.	Dandapanthula	(M.S.)	
–  V.	Dhanraj	(M.S.)	
–  T.	Gangadharappa	(M.S.)	
–  K.	Gopalakrishnan	(M.S.)	
–  G.	Santhanaraman	(Ph.D.)	
–  A.	Singh	(Ph.D.)	
–  J.	Sridhar	(M.S.)	
–  S.	Sur	(Ph.D.)	
–  H.	Subramoni	(Ph.D.)	
–  K.	Vaidyanathan	(Ph.D.)	
–  A.	Vishnu	(Ph.D.)	
–  J.	Wu	(Ph.D.)	
–  W.	Yu	(Ph.D.)	
Past Research Scientist
–  S.	Sur	
Current	Post-Doc	
–  J.	Lin	
–  D.	Banerjee	
Current	Programmer	
–  J.	Perkins	
Past	Post-Docs	
–  H.	Wang	
–  X.	Besseron	
–  H.-W.	Jin	
–  M.	Luo	
–  W.	Huang	(Ph.D.)	
–  W.	Jiang	(M.S.)	
–  J.	Jose	(Ph.D.)	
–  S.	Kini	(M.S.)	
–  M.	Koop	(Ph.D.)	
–  R.	Kumar	(M.S.)	
–  S.	Krishnamoorthy	(M.S.)	
–  K.	Kandalla	(Ph.D.)	
–  P.	Lai	(M.S.)	
–  J.	Liu	(Ph.D.)	
–  M.	Luo	(Ph.D.)	
–  A.	Mamidala	(Ph.D.)	
–  G.	Marsh	(M.S.)	
–  V.	Meshram	(M.S.)	
–  A.	Moody	(M.S.)	
–  S.	Naravula	(Ph.D.)	
–  R.	Noronha	(Ph.D.)	
–  X.	Ouyang	(Ph.D.)	
–  S.	Pai	(M.S.)	
–  S.	Potluri	(Ph.D.)
–  R.	Rajachandrasekar	(Ph.D.)	
	
–  K.	Kulkarni	(M.S.)	
–  M.	Rahman	(Ph.D.)	
–  D.	Shankar	(Ph.D.)	
–  A.	Venkatesh	(Ph.D.)	
–  J.	Zhang	(Ph.D.)	
–  E.	Mancini	
–  S.	Marcarelli	
–  J.	Vienne	
Current Research Scientists    Current Senior Research Associate
–  H.	Subramoni	
–  X.	Lu	
Past	Programmers	
–  D.	Bureddy	
						-		K.	Hamidouche	
Current	Research	Specialist	
–  M.	Arnold
International Workshop on Communication Architectures at Extreme Scale (ExaComm)

ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), at Frankfurt, Germany, on Thursday, July 16th, 2015

One Keynote Talk: John M. Shalf, CTO, LBL/NERSC
Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
Panel: Ron Brightwell (Sandia)
Two Research Papers

ExaComm 2016 will be held in conjunction with ISC '16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
Technical Paper Submission Deadline: Friday, April 15, 2016
Thank You!
panda@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/