Broadcom Proprietary and Confidential. © 2013 Broadcom Corporation. All rights reserved.
POWER AND MICROARCHITECTURE TRADEOFFS IN NEXT-GENERATION MANYCORES
FROM HOMOGENOUS TO HETEROGENEOUS CORES AND EVERYTHING IN BETWEEN
Partha Kundu
Technical Director
Infrastructure & Networking Group
2015 International Conference On Computer Aided Design (ICCAD)
The Premier Conference Devoted to Technical Innovations in Electronic Design Automation
November 2 - 6, 2015
Doubletree Hotel
Austin, TX
Special session : Dennard Scaling is History and Moore's Law is Aging: How to Break the Inevitable Power Wall?
Austin, TX, USA
Nov 2, 2015
STATE OF SERVER CPUS
So linear scalability is not the answer long-term. For Intel to make a pro
stacking cores, it has had to change the architecture with which cores a
that change runs deeper than you might have expected.
Although we call Xeon E5 v3 an “18
truthful, there’s just one model (“SK
the 2.3 GHz, 145W TDP E5-2699 v
die configurations in the v3 series is
design that may be scaled down for
The 18-core, 2.3 GHz Xeon E5-269
differently from the 12-core, 2.7 GHz Xeon E5-2697 v2 (no, that’s no ty
slower clock speed). Without rethinking the microarchitecture of the co
END OF DENNARD SCALING
[Figure: operating voltage (V), axis from 0 to 2, versus year (1997 to 2009), with Vmin and Vmax marked for IBM PowerPC 405LP, Intel XScale 80200, Transmeta Crusoe TM5800, Intel Itanium Montecito, and Atom Silverthorne]
DVFS less useful to achieve energy proportionality
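The point about DVFS can be illustrated with the standard CMOS dynamic-power relation, P ≈ C·V²·f, under which energy per operation scales roughly with V². The sketch below is only an illustration: the function name `dynamic_energy_ratio` and both voltage windows are hypothetical values chosen for the example, not data read from the figure.

```python
def dynamic_energy_ratio(v_min: float, v_max: float) -> float:
    """Energy per operation at v_min relative to v_max, assuming E ~ C * V^2."""
    if not 0 < v_min <= v_max:
        raise ValueError("expected 0 < v_min <= v_max")
    return (v_min / v_max) ** 2


# Hypothetical voltage windows, NOT values taken from the slide's figure.
wide_window = dynamic_energy_ratio(v_min=0.9, v_max=1.8)
narrow_window = dynamic_energy_ratio(v_min=0.9, v_max=1.1)

print(f"wide Vmin..Vmax window:   energy/op falls to {wide_window:.2f} of peak")
print(f"narrow Vmin..Vmax window: energy/op falls to {narrow_window:.2f} of peak")
```

As Vmax approaches Vmin, the ratio approaches 1, so scaling the voltage down saves progressively less energy per operation.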
END OF DENNARD SCALING : WHAT CAN WE DO IN ARCHITECTURE?
•  Heterogeneous cores, Same ISA
•  Homogenous cores + specific accelerators
OUTLINE OF TALK
•  Heterogeneous cores, Same ISA
•  Homogenous cores + specific accelerators
HETEROGENEOUS CORES, SAME ISA
Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance, Kumar et al., ISCA '04
SAME ISA, HETEROGENEOUS CORES
Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance, Kumar et al., ISCA '04
ARM’S BIG.LITTLE
http://www.eetimes.com/document.asp?doc_id=1279167&page_number=1
ARM’S BIG.LITTLE
FLEXIBLE HETEROGENEITY : SAME ISA
WiDGET: Wisconsin Decoupled Grid Execution Tiles, Watanabe et al., ISCA '10
FLEXIBLE HETEROGENEITY : WIDGET UARCH
WIDGET: STEERING HEURISTIC
§  Based on dependence-based steering [Palacharla97]
§  Expose independent instr chains
§  Consumer directly behind the producer
§  Stall steering when no empty buffer is found
§  WiDGET: Power-performance goal
§  Emphasize locality & scalability
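[Figure: steering flowchart across Cluster 0 and Cluster 1, branching on the number of outstanding ops]
•  0 outstanding ops: steer to any empty buf
•  1 outstanding op: if a slot is avail behind the producer, steer to the producer buf; otherwise steer to an empty buf within the cluster
•  2 outstanding ops: if a slot is avail behind either of the producers, steer to that producer buf; otherwise steer to an empty buf in either of the clusters
•  Consumer-push operand transfers
–  Send steered EU ID to the producer EU
–  Multi-cast result to all consumers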
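The steering decision described on this slide can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the WiDGET implementation: the `Buffer` class, the `steer` function, the buffer depth of 4, and the rule that a consumer fits "behind" a producer only when the producer is the buffer's tail are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Buffer:
    """In-order instruction buffer in front of one execution unit (EU)."""
    cluster: int
    capacity: int = 4  # hypothetical depth
    entries: List[str] = field(default_factory=list)

    def room_behind(self, producer: str) -> bool:
        # Assumption: a consumer can sit directly behind its producer only
        # when the producer is the current tail and a slot is still free.
        return (bool(self.entries) and self.entries[-1] == producer
                and len(self.entries) < self.capacity)


def steer(instr: str, producers: List[str],
          buffers: List[Buffer]) -> Optional[int]:
    """Return the index of the chosen buffer, or None for a steering stall."""
    # Buffers that currently hold one of the outstanding producers.
    homes = [i for i, b in enumerate(buffers)
             if any(p in b.entries for p in producers)]
    # First choice: a slot directly behind one of the producers.
    for i in homes:
        if any(buffers[i].room_behind(p) for p in producers):
            buffers[i].entries.append(instr)
            return i
    # Otherwise an empty buffer. With outstanding producers, stay within
    # their cluster(s); with none, any empty buffer will do.
    clusters = {buffers[i].cluster for i in homes}
    for i, b in enumerate(buffers):
        if not b.entries and (not clusters or b.cluster in clusters):
            b.entries.append(instr)
            return i
    return None  # no empty buffer found: stall steering


# Two clusters with two buffers each (hypothetical configuration).
bufs = [Buffer(0), Buffer(0), Buffer(1), Buffer(1)]
placements = [
    steer("a", [], bufs),          # no producers: any empty buffer
    steer("b", ["a"], bufs),       # queues directly behind a
    steer("c", ["a"], bufs),       # a is no longer the tail: empty buffer in a's cluster
    steer("d", [], bufs),          # independent chain: next empty buffer
    steer("e", ["c", "d"], bufs),  # behind either of its two producers
    steer("f", ["a"], bufs),       # a's cluster has no empty buffer: stall
]
print(placements)
```

Keeping a dependence chain in one buffer is what lets simple in-order EUs run independent chains in parallel, which is the locality goal the slide emphasizes.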
WIDGET
[Figure: Normalized Chip Power and Normalized Performance (tick ranges 0.2 to 1 and 0.3 to 1.3) for Neon, Mite, and WiDGET configurations with 1 EU to 8 EUs]
•  Best-case: 2x of Neon, 21x of Mite
•  1.5x the efficiency of Xeon for the same performance
OUTLINE OF TALK
•  Heterogeneous cores, Same ISA
•  Homogenous cores + specific accelerators
STATE OF HPC NETWORKING
Proprietary Interconnects still used in the highest performance systems
Low overhead API required to improve interconnect systems going forward
MPI ON INFINIBAND
[Figure: software stack, top to bottom]
•  MPI apps, PGAS apps
•  MPICH, OpenMPI, GASNet, OpenSHMEM
•  ADI - Ch, ADI - Ch
•  InfiniBand (transport) layer
MPI PROFILE ACROSS AMBER : A MOLECULAR DYNAMICS PACKAGE
Majority of time spent in AllReduce
Accelerating High Performance Computing Applications Through MPI Offloading, Shainer et al, HPC Council whitepaper, 2011
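MPI_REDUCE (MPI_SUM, result at rank 0)
•  Ranks 0, 1, 2, 3 contribute the pairs (5, 1), (7, 8), (4, 2), (2, 3)
•  Rank 0 receives the element-wise sum: 18 14
•  Example: perform average @ node0
MPI_ALLREDUCE (MPI_SUM, result at all ranks)
•  Same inputs; ranks 0, 1, 2, 3 each receive 18 14
•  Example: perform std dev. @ all nodes
MPI_ALLREDUCE = MPI_REDUCE + MPI_BROADCAST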
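The semantics of the two collectives can be sketched in plain Python, without an MPI library. The helper names `reduce_sum` and `allreduce_sum` are hypothetical, and the assignment of each pair to a particular rank is an assumption (the sums do not depend on it).

```python
from typing import List


def reduce_sum(contributions: List[List[int]]) -> List[int]:
    """Element-wise sum over all ranks, as MPI_Reduce with MPI_SUM
    would deliver it to the root rank."""
    return [sum(column) for column in zip(*contributions)]


def allreduce_sum(contributions: List[List[int]]) -> List[List[int]]:
    """Reduce followed by broadcast: every rank gets its own copy."""
    total = reduce_sum(contributions)
    return [list(total) for _ in contributions]


# Pairs from the slide; which rank holds which pair is an assumption.
per_rank = [[5, 1], [7, 8], [4, 2], [2, 3]]

at_root = reduce_sum(per_rank)          # only the root holds this
everywhere = allreduce_sum(per_rank)    # one copy per rank
mean_at_root = [s / len(per_rank) for s in at_root]

print("reduce at root:", at_root)
print("allreduce:", everywhere)
print("average at root:", mean_at_root)
```

The average needs the sum at one node only, so a reduce is enough; a standard deviation computed on every node needs the global sums everywhere, which is why it maps to allreduce.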
ALLREDUCE ALGORITHMS
•  Recursive doubling: short messages
•  Tree algorithm: large messages
[Figure: reduction tree with leaf nodes 0, 1, ..., (k-1) under node n, and node m above n]
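Recursive doubling can be sketched as a simulation in which each list slot stands for one rank. This is an illustrative sketch that assumes a power-of-two rank count and a scalar sum; the function name `recursive_doubling_allreduce` is hypothetical.

```python
from typing import List, Tuple


def recursive_doubling_allreduce(values: List[int]) -> Tuple[List[int], int]:
    """Simulate a sum allreduce; returns (per-rank results, rounds used)."""
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two rank count")
    current = list(values)
    distance = 1
    rounds = 0
    while distance < p:
        # Each rank swaps its partial sum with the partner whose rank
        # differs in exactly one bit, then both add what they received.
        current = [current[r] + current[r ^ distance] for r in range(p)]
        distance *= 2
        rounds += 1
    return current, rounds


result, rounds = recursive_doubling_allreduce([5, 7, 4, 2])
print(result, rounds)
```

Every rank finishes with the full sum after log2(P) exchange rounds, but each round moves a whole message between every pair of partners, which is one reason the approach is usually paired with short messages.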
Application & Software stack
Libfabric ⇔ Portals*
[Figure: libfabric architecture, top to bottom]
•  Libfabric Enabled Applications: MPI, SHMEM, PGAS, . . .
•  libfabric
–  Control Services: Discovery
–  Communication Services: Connection Management, Address Vectors
–  Completion Srvcs: Event Queues, Counters
–  Data Transfer Services: Message Queues, Tag Matching, RMA, Atomics, Triggered Operations
•  OFI Provider: Discovery, Connection Management, Address Vectors, Event Queues, Counters, Message Queues, Tag Matching, RMA, Atomics, Triggered Operations
•  NIC: TX Command Queues, RX Command Queues
*Portals Network Programming Interface: www.cs.sandia.gov/Portals/
IBVERBS VS OFI VS. PORTALS
BARRIER OFFLOADED USING PORTALS
[Figure: processes P0, P1, P2, ..., PN, each with a CPU and a NIC]
•  P0 (CPU): CtAlloc(.,&Ct1), then CtWait(&Ct1,N,*event)
•  P1 ... PN (at the Barrier): Put(..,P0,..)
•  P0 (NIC): Ct1++ for each arriving Put
•  When (Ct1==N): EventPost(*event)
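The counting-event barrier on this slide can be sketched as a deterministic simulation. This is not the Portals API: the class `CountingEvent` and the function `run_barrier` are hypothetical stand-ins for the counter Ct1 and for the sequence of Put arrivals at P0's NIC.

```python
from typing import Iterable, List, Optional, Tuple


class CountingEvent:
    """Stand-in for a NIC-resident counter (Ct1 on the slide)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold   # N in CtWait(&Ct1, N, *event)
        self.count = 0
        self.event_posted = False

    def on_put_arrival(self) -> None:
        # In the offloaded design this step runs on the NIC, so the
        # host CPU at P0 does no work per arriving message.
        self.count += 1
        if self.count == self.threshold:
            self.event_posted = True   # EventPost(*event)


def run_barrier(n_peers: int,
                arrival_order: Optional[Iterable[int]] = None
                ) -> List[Tuple[int, int, bool]]:
    """Return (peer, counter value, event posted?) after each Put."""
    ct1 = CountingEvent(threshold=n_peers)
    order = arrival_order if arrival_order is not None else range(1, n_peers + 1)
    trace = []
    for peer in order:               # each peer issues Put(.., P0, ..)
        ct1.on_put_arrival()
        trace.append((peer, ct1.count, ct1.event_posted))
    return trace


trace = run_barrier(3, arrival_order=[2, 3, 1])
for peer, count, posted in trace:
    print(f"Put from P{peer}: Ct1={count}, event posted={posted}")
```

The event is posted only on the last arrival, regardless of the order in which peers reach the barrier.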
SYNCHRONIZATION : OFFLOADED WITH PORTALS (SIMPLE)
MPI_ALLREDUCE (TREE ALGORITHM) WITH PORTALS ATOMICS
[Figure: tree with leaf nodes 0, 1, ..., (k-1) under node n, and node m above n]
•  Leaf nodes 0 ... (k-1): Atomic(…,UpNode_n, ...Sum, int_8……)
•  Node n, upward: TriggeredAtomic(…,UpNode_m,….Sum,int_8,TrigCtHandle, k)
•  Node m, downward: TriggeredPut(…,DownNode_n,..TrigCtHandle,(m-n-1))
•  Node n, downward:
–  TriggeredPut(..,DownNode_0,…,TrigCt,1)
–  TriggeredPut(..,DownNode_1,…,TrigCt,1)
–  …
–  TriggeredPut(..,DownNode_k-1,…,TrigCt,1)
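The cascade of triggered operations can be sketched as a simulation in which each operation is armed with a counter threshold and fires by itself once enough messages have arrived. This is a hedged sketch, not Portals code: `Slot` and `tree_allreduce` are hypothetical names, the tree has a single level of intermediate nodes, each group is assumed non-empty, and the root's threshold is taken to be the number of intermediate nodes rather than the slide's (m-n-1).

```python
from typing import Callable, List, Optional, Tuple


class Slot:
    """A receive buffer with a counting event and armed triggered operations."""

    def __init__(self) -> None:
        self.value = 0
        self.count = 0
        self.armed: List[Tuple[int, Callable[[], None]]] = []

    def arm(self, threshold: int, op: Callable[[], None]) -> None:
        """Queue op to fire when the counter reaches threshold."""
        self.armed.append((threshold, op))

    def _arrived(self) -> None:
        self.count += 1
        for threshold, op in list(self.armed):
            if threshold == self.count:
                op()   # fires without host CPU involvement when offloaded

    def atomic_sum(self, operand: int) -> None:   # target side of Atomic(Sum)
        self.value += operand
        self._arrived()

    def put(self, payload: int) -> None:          # target side of Put
        self.value = payload
        self._arrived()


def tree_allreduce(groups: List[List[int]]) -> List[List[Optional[int]]]:
    """groups[i] lists the leaf contributions under intermediate node i."""
    root = Slot()
    ups = [Slot() for _ in groups]     # leaf -> intermediate accumulation
    downs = [Slot() for _ in groups]   # root -> intermediate result
    results: List[List[Optional[int]]] = [[None] * len(g) for g in groups]

    def deliver(gi: int, li: int) -> None:
        results[gi][li] = downs[gi].value

    # Phase 1: arm every triggered operation before any data moves.
    for gi, leaves in enumerate(groups):
        # Intermediate node: forward the partial sum after k leaf atomics.
        ups[gi].arm(len(leaves), lambda gi=gi: root.atomic_sum(ups[gi].value))
        # Root: send the total down once every intermediate node reported.
        root.arm(len(groups), lambda gi=gi: downs[gi].put(root.value))
        # Intermediate node: fan out to each leaf after 1 arrival from the root.
        for li in range(len(leaves)):
            downs[gi].arm(1, lambda gi=gi, li=li: deliver(gi, li))

    # Phase 2: leaves issue plain atomics; everything else cascades.
    for gi, leaves in enumerate(groups):
        for operand in leaves:
            ups[gi].atomic_sum(operand)
    return results


print(tree_allreduce([[5, 7], [4, 2]]))
```

Once the operations are armed, the only host-initiated actions are the leaf atomics; the reduction up the tree and the broadcast back down happen as counters reach their thresholds.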
PERFORMANCE : HOST VS. TREE VS. RECURSIVE
Short Messages (500ns message)
•  At higher node counts Recursive Doubling does better than Triggered Tree
Long Messages (1500ns message)
•  Triggered Tree does best consistently
Enabling Flexible Collective Communication Offload with Triggered Operations, Underwood et al., IEEE High Performance Interconnects, 2011
CONCLUSIONS
§  Applications mix low-ILP and high-ILP behavior, both across phases and across different programs
§  Heterogeneity (same ISA) reduces power
§  Accelerators that offload the main CPU effectively allow communication and computation to overlap, improving both performance and power
§  Accelerator API needs to close the semantic gap between application libraries and hardware
