SlideShare a Scribd company logo
Op#mizing	
  for	
  the	
  Inner	
  Kernel:	
  
A	
  Low	
  Power,	
  High	
  Performance	
  Core	
  
for	
  Matrix	
  Mul#plica#on	
  Kernels	
  
	
  
Ardavan	
  Pedram	
  
The	
  GEMM	
  Micro-­‐kernel	
  
•  BLIS	
  exposes	
  outer	
  two	
  loops	
  of	
  the	
  inner	
  kernel	
  
•  Key	
  observa+on:	
  Virtually	
  all	
  of	
  the	
  differences	
  between	
  
level-­‐3	
  inner	
  kernels	
  reside	
  in	
  the	
  outer	
  two	
  loops	
  
9/8/13	
   2	
  
for ( 0 to NC: NR )
for ( 0 to MC: MR )
for ( 0 to KC: 1 )
// block-dot product
endfor
endfor
endfor
inner	
  kernel	
  /	
  
macro-­‐kernel	
  
micro-­‐kernel	
  
©	
  A.	
  Pedram	
  
9/8/13
Era of Heterogeneous Computing
•  Physical limits of technology scaling
– Power wall and dark silicon
•  Only a fraction of a chip may be active
•  Efficiency/optimality vs. flexibility/generality
•  GFLOPS/W (energy per operation)
Ø Opportunity for specialization
Ø Heterogeneous multi-core /
Asynchronous CMP
Ø On-chip accelerators
Nvidia	
  Tegra	
  2	
  System	
  on	
  Chip	
  
© A. Pedram 3
Implementa#on	
  Spectrum	
  
9/8/13	
   4	
  
E F F I CI E N C Y
FLEXIBILITY
Source: T. Noll, RWTH Aachen, via R. Leupers, “From ASIP to MPSoC”,
Computer Engineering Colloquium, TU Delft, 2006
Linear	
  Algebra	
  
Processor	
  
©	
  A.	
  Pedram	
  
The	
  GEMM	
  Micro-­‐kernel	
  
9/8/13	
   5	
  
+=	
  
KC	
  NR	
  
MR	
  
NR	
  
C	
   A	
  
B	
  
for ( 0 to NC: NR )
for ( 0 to MC: MR )
for ( 0 to KC: 1 )
// block-dot product
endfor
endfor
endfor
©	
  A.	
  Pedram	
  
C	
  
The	
  GEMM	
  Micro-­‐kernel	
  
9/8/13	
   6	
  
+=	
  
KC	
  NR	
  
MR	
  
NR	
  
α1
α2
α3
α0
β1
β0
 β2
 β3
γ00
γ10
γ20
γ30
γ01
γ11
γ21
γ31
γ02
γ12
γ22
γ32
γ03
γ13
γ23
γ33
+=	
  
A	
  
B	
  
for ( 0 to NC: NR )
for ( 0 to MC: MR )
for ( 0 to KC: 1 )
// block-dot product
endfor
endfor
endfor
•  Typical	
  micro-­‐kernel	
  loop	
  itera#on	
  
(“block-­‐dot	
  product”)	
  
–  Load	
  column	
  of	
  packed	
  A	
  
–  Load	
  row	
  of	
  packed	
  B	
  
–  Compute	
  outer	
  product	
  
–  Update	
  C	
  (kept	
  in	
  registers)	
  
©	
  A.	
  Pedram	
  
9/8/13
Linear Algebra Core (LAC)
•  Scalable 2-D array of nr×nr processing elements (PEs)
–  Specialized floating-point units w/ 1 MAC/cycle throughput
–  Broadcast busses (no need to pipeline up to nr=16)
–  Distributed memory architecture
–  Distributed, PE-local control
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
`
MEM B
Addr1
Row Bus Write (RBW)
Column Bus Write (CBW)
A B
Controller
Column Bus
Read (CBR)
Row Bus
Read (RBR)
MAC
ACC_in
Accumulator
Cin
Memory Interface
Addr2
RF
MEM A
© A. Pedram 7
9/8/13
The “Rank-1” Machine
BpCi
Ai,p
+= x
+=
+
+
nr
nr
+ …
x
x
x
BpCi
Ai,p
+= x
+=
+
+
nr
nr
+ …
x
x
x
C += A x B
© A. Pedram 8
9/8/13
Core-Level Implementation
•  4x4 LAC
•  256 kB of local SRAM
•  @ 1GHz
•  Running GEMM @45nm technology
W
mm2
GFLOPS
mm2
GFLOPS
W
Utilization
Cell SPE (SP) 0.4 6.4 16 83%
NVidia GTX480 SM (SP) 0.5 4.5 8.4 70%
NVidia GTX480 SM (DP) 0.5 2.05 4.1 70%
Intel Core (DP) 0.5 0.4 0.9 95%
ClearSpeed (DP) 0.02 0.3 13 80%
LAC (SP) 0.2 20 104 95+%
LAC (DP) 0.3 16 47 95+%
© A. Pedram 9
Design	
  Trade-­‐off	
  Challenge	
  
•  Some	
  Linear	
  algebra	
  data	
  handling	
  requirements	
  
–  Matrix	
  Transpose	
  (SYRK)	
  
–  Vector	
  Transpose	
  (Matrix-­‐vector	
  mul#ply)	
  (LU)	
  	
  
–  Broadcast	
  (GEMM)	
  
•  Common	
  Sources	
  of	
  Inefficiencies	
  
	
  in	
  CPUs	
  &	
  GPUs	
  
–  Instruc#on	
  handling	
  
–  Mul#-­‐ported	
  register	
  file	
  
–  Cache	
  overheads:	
  tags	
  and	
  coherency	
  
•  What	
  is	
  the	
  best	
  solu#on	
  for	
  those	
  data	
  handling	
  
requirements?	
  
9/8/13	
   10	
  ©	
  A.	
  Pedram	
  
Studied	
  Register	
  File	
  Organiza#ons	
  
9/8/13	
   11	
  
16 Wide SIMD
SIMD Register File
/ Bypass network
Memory Interface
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU FPU FPU FPU
Load/
Store
Unit
Load/
Store
Unit
RF/Bypass
4 Wide SIMD
RF/Bypass
4 Wide SIMD
RF/Bypass
4 Wide SIMD
RF/Bypass
4 Wide SIMD
FPU FPU FPU .. FPU
Multi-Port Flat Register File
/ Bypass network Load/
Store
Unit
..
..
..
(c)(b)(a)
•  Organiza#ons	
  for	
  16	
  FPUs	
  
–  (a)	
  Flat	
  mul#ple-­‐ported	
  register	
  file	
  
–  (b)	
  Wide	
  SIMD	
  with	
  wide	
  ports	
  
–  (c)	
  Array	
  of	
  short	
  SIMD	
  units	
  
•  No	
  limits	
  	
  
–  load-­‐store	
  unit	
  data	
  movement	
  and	
  bandwidth	
  
–  Scalar	
  pipeline	
  (#	
  instruc#ons	
  that	
  are	
  executed)	
  
©	
  A.	
  Pedram	
  
Data	
  Handling	
  Trade-­‐off	
  in	
  Core	
  
•  GEMM[ASAP2011]	
  
– Regular	
  data	
  movement	
  	
  
– Regular	
  access	
  paoern	
  
– Easily	
  mapped	
  on	
  most	
  SIMD	
  Cores	
  
•  SYRK[ASAP2012],[SBAC-­‐PAD2012]	
  
– Matrix	
  Transpose	
  needed	
  
– Data	
  manipula#on	
  for	
  Transpose	
  is	
  a	
  challenge	
  	
  
9/8/13	
   12	
  ©	
  A.	
  Pedram	
  
Memory	
  Hierarchy	
  of	
  GEMM	
  
•  Blocking	
  in	
  	
  
higher	
  levels	
  
of	
  memory	
  is	
  
almost	
  the	
  	
  
same	
  
	
  
9/8/13	
   13	
  
+=
xAi,p
C A B
C
x
x
+=
x+=
Accumulator
(Register Level)
Local Store Level
(LAC)
On-chip Memory
n
Bp
Ap
Ci
Bp,j
Main Memory
LAP
1 x w SIMD
h x w SIMD
L2 Cache
(CPUs) L1 Cache
(CPUs)
L3 Cache
(CPUs)
©	
  A.	
  Pedram	
  
GEMM	
  Register-­‐level	
  Opera#ons	
  
•  Accumulator	
  
delay	
  
mul#plies	
  the	
  
number	
  of	
  
registers	
  
needed	
  in	
  core	
  
9/8/13	
   14	
  
Ai+dM-1,k Bk,j
cij
Ci+1,j
Blocks of Ci,j
1-Stream Out
2-Current
3-Prefetch
+=
mc
nr
Ci+1,j
Ci+2,j
Ci+d-1,j
dA
dM
dA
©	
  A.	
  Pedram	
  
C A Bx+=
+=
+=
+=
x
x
+= xnr
nr nr
w
dA
w x dA
w
h
w x dA
LAP
1 x w SIMD
h x w SIMD
Ai,p in Local Store of PEs Bp,j Bp,j+1
Blocks of Ci,j
1-Stream Out
2-Current
3-Prefetch
+=
x
kc
mc kc
9/8/13
Non-Blocked nr x nr SYRK on LAC
•  LAC takes advantage of the
broadcast buses
•  Overlap computation with
communication with transposition
•  Least operation count in part of
bigger SYRK problem
•  Simple control
–  3-stage state machine in PEs
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Broadcast a0(S1)
© A. Pedram 15
9/8/13
Non-Blocked nr x nr SYRK on LAC
•  LAC takes advantage of the
broadcast buses
•  Overlap computation with
communication with transposition
•  Least operation count in part of
bigger SYRK problem
•  Simple control
–  3-stage state machine in PEs
(0,0) PE(0,1) PE(0,2) PE(0,3)
(1,0) PE(1,1) PE(1,2) PE(1,3)
(2,0) PE(2,1) PE(2,2) PE(2,3)
(3,0) PE(3,1) PE(3,2) PE(3,3)
Broadcast a0(S1)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Broadcast a1
Broadcast a0
T
Rank-1 update C+=a0a0
T
(S1)
(S1)
(S1)
© A. Pedram 16
Power	
  and	
  Performance	
  Analysis	
  
•  Analy#cal	
  formulae	
  
–  U#liza#on	
  
–  Bandwidth	
  
–  Size	
  of	
  local	
  stores	
  
–  Size	
  of	
  register	
  file	
  
–  #ports	
  of	
  memory	
  levels	
  
•  Component	
  selec#ons	
  
–  Storage	
  model	
  [CACTI	
  6.5]	
  
•  Pure	
  SRAM	
  Model	
  for	
  
Register	
  file	
  
•  Cache	
  Models	
  for	
  L1	
  &	
  L2	
  
	
  
•  Configura#on	
  
–  1	
  GHz	
  	
  
–  16	
  FPUs	
  
–  LAC	
  
•  256+32	
  KB	
  aggregate	
  Local	
  
store	
  
–  Register	
  file	
  organiza#ons	
  
•  256	
  	
  KB	
  L2,	
  64B	
  line,	
  8	
  way	
  
•  32	
  	
  	
  	
  KB	
  L1,	
  64B	
  line,	
  8way	
  
•  Register	
  file	
  Size	
  variable	
  
–  Aimed	
  for	
  peak	
  performance	
  
9/8/13	
   17	
  ©	
  A.	
  Pedram	
  
GEMM	
  Power	
  and	
  Area	
  
•  SIMD	
  architectures	
  
–  Effec#vely	
  reduce	
  the	
  power	
  
consump#on	
  
•  LAC	
  is	
  the	
  best	
  not	
  using	
  
caches	
  
•  The	
  flexibility	
  of	
  flat	
  register	
  
files	
  is	
  of	
  no	
  use	
  for	
  GEMM.	
  
•  4×4-­‐wide	
  array	
  of	
  SIMD	
  
requires	
  less	
  area.	
  
•  Most	
  of	
  the	
  area	
  is	
  used	
  by	
  
Caches	
  
9/8/13	
   18	
  
0

2

4

6

8

10

12

14

FLAT
 SIMD
 A
SIMD
X
h
 LAC

mm^2

SRAM

L2
Cache

L1
Cache

RF

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

FLAT
 SIMD
 A
SIMD
X
h
 LAC

W/GFLOP

SRAM

L2
Cache

L1
Cache

RF

©	
  A.	
  Pedram	
  
SYRK	
  Power	
  and	
  Area	
  
•  Wide	
  SIMD	
  takes	
  many	
  
cycles	
  to	
  transpose	
  
•  4×4-­‐wide	
  SIMD	
  has	
  best	
  
efficiency	
  
–  Less	
  power	
  than	
  FLAT	
  
–  Higher	
  u#liza#on	
  then	
  Wide	
  
SIMD	
  
•  4×4-­‐wide	
  array	
  of	
  SIMD	
  
requires	
  less	
  area.	
  
•  Most	
  of	
  the	
  area	
  is	
  used	
  by	
  
Caches	
  
9/8/13	
   19	
  
0

2

4

6

8

10

12

14

16

18

FLAT
 SIMD
 A
SIMD
X
h
 LAC

mm^2

SRAM

L2
Cache

L1
Cache

RF

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

FLAT
 SIMD
 A
SIMD
X
h
 LAC

W/GFLOP

SRAM

L2
Cache

L1
Cache

RF

©	
  A.	
  Pedram	
  
Conclusion	
  
•  Architectures	
  Studied	
  
•  Linear	
  Algebra	
  Core	
  (LAC)	
  
•  FLAT	
  register	
  file	
  
•  1D	
  wide	
  SIMD	
  
•  4×4-­‐wide	
  array	
  of	
  SIMD	
  	
  
•  Algorithm/Architecture	
  co-­‐design	
  
•  GEMM	
  (emphasize	
  on	
  data	
  bandwidth	
  to-­‐from	
  core)	
  
•  SYRK 	
  (	
  emphasize	
  on	
  data	
  manipula#on	
  in	
  core)	
  
•  Analy#cal	
  formulae	
  for	
  different	
  layers	
  of	
  
memory	
  hierarchy	
  
•  Power	
  and	
  efficiency	
  es#ma#on	
  
9/8/13	
   20	
  ©	
  A.	
  Pedram	
  
Linear	
  Algebra	
  Core	
  
•  Results @ 1GHz
–  DP: 32 GFLOPS, 47 GFLOPS/W
–  0.6 Watts
–  2.8 mm2 in 45nm
–  4 GB/s external BW
–  Orders of magnitude improvement
Ø On-going and future work
–  Generalization: full Level-3 BLAS in hardware
–  Implementation: synthesis and tape-out
9/8/13	
   21	
  ©	
  A.	
  Pedram	
  
Ques#ons	
  ?	
  
9/8/13	
   22	
  ©	
  A.	
  Pedram	
  

More Related Content

What's hot

ARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsARM AAE - Intrustion Sets
ARM AAE - Intrustion Sets
Anh Dung NGUYEN
 
Chapter 04 the processor
Chapter 04   the processorChapter 04   the processor
Chapter 04 the processor
Bảo Hoang
 
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC 02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
IEEE SSCS AlexSC
 
OMAP
OMAPOMAP
CArcMOOC 05.03 - Pipeline hazards
CArcMOOC 05.03 - Pipeline hazardsCArcMOOC 05.03 - Pipeline hazards
CArcMOOC 05.03 - Pipeline hazards
Alessandro Bogliolo
 
Arm cortex R(real time)processor series
Arm cortex R(real time)processor series Arm cortex R(real time)processor series
Arm cortex R(real time)processor series
Ronak047
 
arm-cortex-a8
arm-cortex-a8arm-cortex-a8
arm-cortex-a8
Andrew Daws
 
Gv2512441247
Gv2512441247Gv2512441247
Gv2512441247
IJERA Editor
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_arm
Prashant Ahire
 
ARM cortex A15
ARM cortex A15ARM cortex A15
ARM cortex A15
KOMAL YAMGAR
 
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's modelAAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
Anh Dung NGUYEN
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
Shien-Chun Luo
 
ASTROSAT SSR - 2015-05-15
ASTROSAT SSR - 2015-05-15ASTROSAT SSR - 2015-05-15
ASTROSAT SSR - 2015-05-15
Aritra Sarkar
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
Kshitij Gorde
 
Archi arm2
Archi arm2Archi arm2
Archi arm2
Ajit Saraf
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
Dwight Sabio
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
Sunil Thorat
 
Why a zynq should power your next project
Why a zynq should power your next projectWhy a zynq should power your next project
Why a zynq should power your next project
Mark Smith
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
Anh Dung NGUYEN
 
F9 microkernel code reading part 4 memory management
F9 microkernel code reading part 4 memory managementF9 microkernel code reading part 4 memory management
F9 microkernel code reading part 4 memory management
Benux Wei
 

What's hot (20)

ARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsARM AAE - Intrustion Sets
ARM AAE - Intrustion Sets
 
Chapter 04 the processor
Chapter 04   the processorChapter 04   the processor
Chapter 04 the processor
 
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC 02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
 
OMAP
OMAPOMAP
OMAP
 
CArcMOOC 05.03 - Pipeline hazards
CArcMOOC 05.03 - Pipeline hazardsCArcMOOC 05.03 - Pipeline hazards
CArcMOOC 05.03 - Pipeline hazards
 
Arm cortex R(real time)processor series
Arm cortex R(real time)processor series Arm cortex R(real time)processor series
Arm cortex R(real time)processor series
 
arm-cortex-a8
arm-cortex-a8arm-cortex-a8
arm-cortex-a8
 
Gv2512441247
Gv2512441247Gv2512441247
Gv2512441247
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_arm
 
ARM cortex A15
ARM cortex A15ARM cortex A15
ARM cortex A15
 
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's modelAAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
ASTROSAT SSR - 2015-05-15
ASTROSAT SSR - 2015-05-15ASTROSAT SSR - 2015-05-15
ASTROSAT SSR - 2015-05-15
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
 
Archi arm2
Archi arm2Archi arm2
Archi arm2
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
Why a zynq should power your next project
Why a zynq should power your next projectWhy a zynq should power your next project
Why a zynq should power your next project
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
 
F9 microkernel code reading part 4 memory management
F9 microkernel code reading part 4 memory managementF9 microkernel code reading part 4 memory management
F9 microkernel code reading part 4 memory management
 

Similar to Custom Computer Engine for Optimizing for the Inner kernel of Matrix Multiplication

Microelectronics U4.pptx.ppt
Microelectronics U4.pptx.pptMicroelectronics U4.pptx.ppt
Microelectronics U4.pptx.ppt
PavikaSharma3
 
TRACK D: A breakthrough in logic design drastically improving performances fr...
TRACK D: A breakthrough in logic design drastically improving performances fr...TRACK D: A breakthrough in logic design drastically improving performances fr...
TRACK D: A breakthrough in logic design drastically improving performances fr...
chiportal
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
sunil kumar
 
DF_Captronics06
DF_Captronics06DF_Captronics06
DF_Captronics06
David Fraboulet
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
rinnocente
 
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCETEC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
SeshaVidhyaS
 
Prezentacja_Profil_Portfolio - EN
Prezentacja_Profil_Portfolio - ENPrezentacja_Profil_Portfolio - EN
Prezentacja_Profil_Portfolio - EN
Tomasz Janicki
 
SRAM
SRAMSRAM
Unit2 arm
Unit2 armUnit2 arm
Unit2 arm
Karthik Vivek
 
Memory elements 1
Memory elements  1Memory elements  1
Memory elements 1
Mahantesh Gutti
 
DB2 for z/OS - Starter's guide to memory monitoring and control
DB2 for z/OS - Starter's guide to memory monitoring and controlDB2 for z/OS - Starter's guide to memory monitoring and control
DB2 for z/OS - Starter's guide to memory monitoring and control
Florence Dubois
 
Decoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMIDecoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMI
Allan Cantle
 
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
Ics21 workshop   decoupling compute from memory, storage & io with omi - ...Ics21 workshop   decoupling compute from memory, storage & io with omi - ...
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
Vaibhav R
 
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Bruno Teixeira
 
LTE-Advanced Physical Layer
LTE-Advanced Physical LayerLTE-Advanced Physical Layer
LTE-Advanced Physical Layer
Praveen Kumar
 
Osmocom
OsmocomOsmocom
Osmocom
0xDEADC0DE
 
Van jaconson netchannels
Van jaconson netchannelsVan jaconson netchannels
Van jaconson netchannels
Susant Sahani
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC Architecture
Allan Cantle
 
Optimized CAM Design
Optimized CAM DesignOptimized CAM Design
Optimized CAM Design
IJMER
 
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdfCisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Varghese Martin
 

Similar to Custom Computer Engine for Optimizing for the Inner kernel of Matrix Multiplication (20)

Microelectronics U4.pptx.ppt
Microelectronics U4.pptx.pptMicroelectronics U4.pptx.ppt
Microelectronics U4.pptx.ppt
 
TRACK D: A breakthrough in logic design drastically improving performances fr...
TRACK D: A breakthrough in logic design drastically improving performances fr...TRACK D: A breakthrough in logic design drastically improving performances fr...
TRACK D: A breakthrough in logic design drastically improving performances fr...
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
DF_Captronics06
DF_Captronics06DF_Captronics06
DF_Captronics06
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
 
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCETEC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
 
Prezentacja_Profil_Portfolio - EN
Prezentacja_Profil_Portfolio - ENPrezentacja_Profil_Portfolio - EN
Prezentacja_Profil_Portfolio - EN
 
SRAM
SRAMSRAM
SRAM
 
Unit2 arm
Unit2 armUnit2 arm
Unit2 arm
 
Memory elements 1
Memory elements  1Memory elements  1
Memory elements 1
 
DB2 for z/OS - Starter's guide to memory monitoring and control
DB2 for z/OS - Starter's guide to memory monitoring and controlDB2 for z/OS - Starter's guide to memory monitoring and control
DB2 for z/OS - Starter's guide to memory monitoring and control
 
Decoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMIDecoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMI
 
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
Ics21 workshop   decoupling compute from memory, storage & io with omi - ...Ics21 workshop   decoupling compute from memory, storage & io with omi - ...
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
 
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
 
LTE-Advanced Physical Layer
LTE-Advanced Physical LayerLTE-Advanced Physical Layer
LTE-Advanced Physical Layer
 
Osmocom
OsmocomOsmocom
Osmocom
 
Van jaconson netchannels
Van jaconson netchannelsVan jaconson netchannels
Van jaconson netchannels
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC Architecture
 
Optimized CAM Design
Optimized CAM DesignOptimized CAM Design
Optimized CAM Design
 
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdfCisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
 

Recently uploaded

The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
Ortus Solutions, Corp
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Ukraine
 

Recently uploaded (20)

The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...
 

Custom Computer Engine for Optimizing for the Inner kernel of Matrix Multiplication

  • 1. Op#mizing  for  the  Inner  Kernel:   A  Low  Power,  High  Performance  Core   for  Matrix  Mul#plica#on  Kernels     Ardavan  Pedram  
  • 2. The  GEMM  Micro-­‐kernel   •  BLIS  exposes  outer  two  loops  of  the  inner  kernel   •  Key  observa+on:  Virtually  all  of  the  differences  between   level-­‐3  inner  kernels  reside  in  the  outer  two  loops   9/8/13   2   for ( 0 to NC: NR ) for ( 0 to MC: MR ) for ( 0 to KC: 1 ) // block-dot product endfor endfor endfor inner  kernel  /   macro-­‐kernel   micro-­‐kernel   ©  A.  Pedram  
  • 3. 9/8/13 Era of Heterogeneous Computing •  Physical limits of technology scaling – Power wall and dark silicon •  Only a fraction of a chip may be active •  Efficiency/optimality vs. flexibility/generality •  GFLOPS/W (energy per operation) Ø Opportunity for specialization Ø Heterogeneous multi-core / Asynchronous CMP Ø On-chip accelerators Nvidia  Tegra  2  System  on  Chip   © A. Pedram 3
  • 4. Implementa#on  Spectrum   9/8/13   4   E F F I CI E N C Y FLEXIBILITY Source: T. Noll, RWTH Aachen, via R. Leupers, “From ASIP to MPSoC”, Computer Engineering Colloquium, TU Delft, 2006 Linear  Algebra   Processor   ©  A.  Pedram  
  • 5. The  GEMM  Micro-­‐kernel   9/8/13   5   +=   KC  NR   MR   NR   C   A   B   for ( 0 to NC: NR ) for ( 0 to MC: MR ) for ( 0 to KC: 1 ) // block-dot product endfor endfor endfor ©  A.  Pedram  
  • 6. C   The  GEMM  Micro-­‐kernel   9/8/13   6   +=   KC  NR   MR   NR   α1 α2 α3 α0 β1 β0 β2 β3 γ00 γ10 γ20 γ30 γ01 γ11 γ21 γ31 γ02 γ12 γ22 γ32 γ03 γ13 γ23 γ33 +=   A   B   for ( 0 to NC: NR ) for ( 0 to MC: MR ) for ( 0 to KC: 1 ) // block-dot product endfor endfor endfor •  Typical  micro-­‐kernel  loop  itera#on   (“block-­‐dot  product”)   –  Load  column  of  packed  A   –  Load  row  of  packed  B   –  Compute  outer  product   –  Update  C  (kept  in  registers)   ©  A.  Pedram  
  • 7. 9/8/13 Linear Algebra Core (LAC) •  Scalable 2-D array of nr×nr processing elements (PEs) –  Specialized floating-point units w/ 1 MAC/cycle throughput –  Broadcast busses (no need to pipeline up to nr=16) –  Distributed memory architecture –  Distributed, PE-local control PE(0,0) PE(0,1) PE(0,2) PE(0,3) PE(1,0) PE(1,1) PE(1,2) PE(1,3) PE(2,0) PE(2,1) PE(2,2) PE(2,3) PE(3,0) PE(3,1) PE(3,2) PE(3,3) ` MEM B Addr1 Row Bus Write (RBW) Column Bus Write (CBW) A B Controller Column Bus Read (CBR) Row Bus Read (RBR) MAC ACC_in Accumulator Cin Memory Interface Addr2 RF MEM A © A. Pedram 7
  • 8. 9/8/13 The “Rank-1” Machine BpCi Ai,p += x += + + nr nr + … x x x BpCi Ai,p += x += + + nr nr + … x x x C += A x B © A. Pedram 8
  • 9. 9/8/13 Core-Level Implementation •  4x4 LAC •  256 kB of local SRAM •  @ 1GHz •  Running GEMM @45nm technology W mm2 GFLOPS mm2 GFLOPS W Utilization Cell SPE (SP) 0.4 6.4 16 83% NVidia GTX480 SM (SP) 0.5 4.5 8.4 70% NVidia GTX480 SM (DP) 0.5 2.05 4.1 70% Intel Core (DP) 0.5 0.4 0.9 95% ClearSpeed (DP) 0.02 0.3 13 80% LAC (SP) 0.2 20 104 95+% LAC (DP) 0.3 16 47 95+% © A. Pedram 9
  • 10. Design  Trade-­‐off  Challenge   •  Some  Linear  algebra  data  handling  requirements   –  Matrix  Transpose  (SYRK)   –  Vector  Transpose  (Matrix-­‐vector  mul#ply)  (LU)     –  Broadcast  (GEMM)   •  Common  Sources  of  Inefficiencies    in  CPUs  &  GPUs   –  Instruc#on  handling   –  Mul#-­‐ported  register  file   –  Cache  overheads:  tags  and  coherency   •  What  is  the  best  solu#on  for  those  data  handling   requirements?   9/8/13   10  ©  A.  Pedram  
  • 11. Studied  Register  File  Organiza#ons   9/8/13   11   16 Wide SIMD SIMD Register File / Bypass network Memory Interface FPU +LS FPU +LS FPU +LS FPU +LS FPU +LS FPU +LS FPU +LS FPU +LS FPU FPU FPU FPU Load/ Store Unit Load/ Store Unit RF/Bypass 4 Wide SIMD RF/Bypass 4 Wide SIMD RF/Bypass 4 Wide SIMD RF/Bypass 4 Wide SIMD FPU FPU FPU .. FPU Multi-Port Flat Register File / Bypass network Load/ Store Unit .. .. .. (c)(b)(a) •  Organiza#ons  for  16  FPUs   –  (a)  Flat  mul#ple-­‐ported  register  file   –  (b)  Wide  SIMD  with  wide  ports   –  (c)  Array  of  short  SIMD  units   •  No  limits     –  load-­‐store  unit  data  movement  and  bandwidth   –  Scalar  pipeline  (#  instruc#ons  that  are  executed)   ©  A.  Pedram  
  • 12. Data  Handling  Trade-­‐off  in  Core   •  GEMM[ASAP2011]   – Regular  data  movement     – Regular  access  paoern   – Easily  mapped  on  most  SIMD  Cores   •  SYRK[ASAP2012],[SBAC-­‐PAD2012]   – Matrix  Transpose  needed   – Data  manipula#on  for  Transpose  is  a  challenge     9/8/13   12  ©  A.  Pedram  
  • 13. Memory  Hierarchy  of  GEMM   •  Blocking  in     higher  levels   of  memory  is   almost  the     same     9/8/13   13   += xAi,p C A B C x x += x+= Accumulator (Register Level) Local Store Level (LAC) On-chip Memory n Bp Ap Ci Bp,j Main Memory LAP 1 x w SIMD h x w SIMD L2 Cache (CPUs) L1 Cache (CPUs) L3 Cache (CPUs) ©  A.  Pedram  
  • 14. GEMM  Register-­‐level  Opera#ons   •  Accumulator   delay   mul#plies  the   number  of   registers   needed  in  core   9/8/13   14   Ai+dM-1,k Bk,j cij Ci+1,j Blocks of Ci,j 1-Stream Out 2-Current 3-Prefetch += mc nr Ci+1,j Ci+2,j Ci+d-1,j dA dM dA ©  A.  Pedram   C A Bx+= += += += x x += xnr nr nr w dA w x dA w h w x dA LAP 1 x w SIMD h x w SIMD Ai,p in Local Store of PEs Bp,j Bp,j+1 Blocks of Ci,j 1-Stream Out 2-Current 3-Prefetch += x kc mc kc
  • 15. 9/8/13 Non-Blocked nr x nr SYRK on LAC •  LAC takes advantage of the broadcast buses •  Overlap computation with communication with transposition •  Least operation count in part of bigger SYRK problem •  Simple control –  3-stage state machine in PEs PE(0,0) PE(0,1) PE(0,2) PE(0,3) PE(1,0) PE(1,1) PE(1,2) PE(1,3) PE(2,0) PE(2,1) PE(2,2) PE(2,3) PE(3,0) PE(3,1) PE(3,2) PE(3,3) Broadcast a0(S1) © A. Pedram 15
  • 16. 9/8/13 Non-Blocked nr x nr SYRK on LAC •  LAC takes advantage of the broadcast buses •  Overlap computation with communication with transposition •  Least operation count in part of bigger SYRK problem •  Simple control –  3-stage state machine in PEs (0,0) PE(0,1) PE(0,2) PE(0,3) (1,0) PE(1,1) PE(1,2) PE(1,3) (2,0) PE(2,1) PE(2,2) PE(2,3) (3,0) PE(3,1) PE(3,2) PE(3,3) Broadcast a0(S1) PE(0,0) PE(0,1) PE(0,2) PE(0,3) PE(1,0) PE(1,1) PE(1,2) PE(1,3) PE(2,0) PE(2,1) PE(2,2) PE(2,3) PE(3,0) PE(3,1) PE(3,2) PE(3,3) Broadcast a1 Broadcast a0 T Rank-1 update C+=a0a0 T (S1) (S1) (S1) © A. Pedram 16
  • 17. Power  and  Performance  Analysis   •  Analy#cal  formulae   –  U#liza#on   –  Bandwidth   –  Size  of  local  stores   –  Size  of  register  file   –  #ports  of  memory  levels   •  Component  selec#ons   –  Storage  model  [CACTI  6.5]   •  Pure  SRAM  Model  for   Register  file   •  Cache  Models  for  L1  &  L2     •  Configura#on   –  1  GHz     –  16  FPUs   –  LAC   •  256+32  KB  aggregate  Local   store   –  Register  file  organiza#ons   •  256    KB  L2,  64B  line,  8  way   •  32        KB  L1,  64B  line,  8way   •  Register  file  Size  variable   –  Aimed  for  peak  performance   9/8/13   17  ©  A.  Pedram  
  • 18. GEMM  Power  and  Area   •  SIMD  architectures   –  Effec#vely  reduce  the  power   consump#on   •  LAC  is  the  best  not  using   caches   •  The  flexibility  of  flat  register   files  is  of  no  use  for  GEMM.   •  4×4-­‐wide  array  of  SIMD   requires  less  area.   •  Most  of  the  area  is  used  by   Caches   9/8/13   18   0
 2
 4
 6
 8
 10
 12
 14
 FLAT
 SIMD
 A
SIMD
X
h
 LAC
 mm^2
 SRAM
 L2
Cache
 L1
Cache
 RF
 0
 0.01
 0.02
 0.03
 0.04
 0.05
 0.06
 0.07
 FLAT
 SIMD
 A
SIMD
X
h
 LAC
 W/GFLOP
 SRAM
 L2
Cache
 L1
Cache
 RF
 ©  A.  Pedram  
  • 19. SYRK  Power  and  Area   •  Wide  SIMD  takes  many   cycles  to  transpose   •  4×4-­‐wide  SIMD  has  best   efficiency   –  Less  power  than  FLAT   –  Higher  u#liza#on  then  Wide   SIMD   •  4×4-­‐wide  array  of  SIMD   requires  less  area.   •  Most  of  the  area  is  used  by   Caches   9/8/13   19   0
 2
 4
 6
 8
 10
 12
 14
 16
 18
 FLAT
 SIMD
 A
SIMD
X
h
 LAC
 mm^2
 SRAM
 L2
Cache
 L1
Cache
 RF
 0
 0.02
 0.04
 0.06
 0.08
 0.1
 0.12
 0.14
 0.16
 0.18
 0.2
 FLAT
 SIMD
 A
SIMD
X
h
 LAC
 W/GFLOP
 SRAM
 L2
Cache
 L1
Cache
 RF
 ©  A.  Pedram  
  • 20. Conclusion   •  Architectures  Studied   •  Linear  Algebra  Core  (LAC)   •  FLAT  register  file   •  1D  wide  SIMD   •  4×4-­‐wide  array  of  SIMD     •  Algorithm/Architecture  co-­‐design   •  GEMM  (emphasize  on  data  bandwidth  to-­‐from  core)   •  SYRK  (  emphasize  on  data  manipula#on  in  core)   •  Analy#cal  formulae  for  different  layers  of   memory  hierarchy   •  Power  and  efficiency  es#ma#on   9/8/13   20  ©  A.  Pedram  
  • 21. Linear  Algebra  Core   •  Results @ 1GHz –  DP: 32 GFLOPS, 47 GFLOPS/W –  0.6 Watts –  2.8 mm2 in 45nm –  4 GB/s external BW –  Orders of magnitude improvement Ø On-going and future work –  Generalization: full Level-3 BLAS in hardware –  Implementation: synthesis and tape-out 9/8/13   21  ©  A.  Pedram  
  • 22. Ques#ons  ?   9/8/13   22  ©  A.  Pedram