Optimizing for the Inner Kernel:
A Low Power, High Performance Core for Matrix Multiplication Kernels

Ardavan Pedram
The GEMM Micro-kernel

•  BLIS exposes the outer two loops of the inner kernel
•  Key observation: virtually all of the differences between level-3 inner kernels reside in the outer two loops
for ( 0 to NC: NR )          // inner kernel / macro-kernel: the outer two loops
  for ( 0 to MC: MR )
    for ( 0 to KC: 1 )       // micro-kernel
      // block-dot product
    endfor
  endfor
endfor
  
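To make the loop structure concrete, the following is a minimal C sketch of the macro-kernel loops above. It is an illustration rather than the BLIS source: the function names (macro_kernel, micro_kernel), the MR/NR values, the row-major layout of C, and the packed-panel offsets are assumptions made for the example.

#include <stddef.h>

#define MR 4   /* register-block rows of C (illustrative)    */
#define NR 4   /* register-block columns of C (illustrative) */

/* Micro-kernel: a KC-long block-dot product updating an MR x NR block of C.
   Its body is sketched after the micro-kernel slide further below. */
void micro_kernel(int kc, const double *a_panel, const double *b_panel,
                  double *c, int ldc);

/* Macro-kernel (inner kernel): the outer two loops that BLIS exposes.
   A_packed holds an MC x KC block of A in MR-row panels; B_packed holds a
   KC x NC block of B in NR-column panels; C is row-major with leading
   dimension ldc. */
void macro_kernel(int nc, int mc, int kc,
                  const double *A_packed, const double *B_packed,
                  double *C, int ldc)
{
    for (int jr = 0; jr < nc; jr += NR)        /* loop "0 to NC: NR" */
        for (int ir = 0; ir < mc; ir += MR)    /* loop "0 to MC: MR" */
            micro_kernel(kc,
                         A_packed + (size_t)ir * kc,   /* MR x KC sliver of A */
                         B_packed + (size_t)jr * kc,   /* KC x NR sliver of B */
                         &C[(size_t)ir * ldc + jr], ldc);
}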
Era of Heterogeneous Computing

•  Physical limits of technology scaling
   –  Power wall and dark silicon
•  Only a fraction of a chip may be active
•  Efficiency/optimality vs. flexibility/generality
•  GFLOPS/W (energy per operation)
Ø  Opportunity for specialization
Ø  Heterogeneous multi-core / asynchronous CMP
Ø  On-chip accelerators

[Figure: Nvidia Tegra 2 System on Chip]
Implementation Spectrum

[Figure: implementation spectrum trading off efficiency against flexibility, with the Linear Algebra Processor positioned on it. Source: T. Noll, RWTH Aachen, via R. Leupers, "From ASIP to MPSoC", Computer Engineering Colloquium, TU Delft, 2006]
The GEMM Micro-kernel

[Figure: the micro-kernel updates an MR x NR block of C with the product of an MR x KC sliver of A and a KC x NR sliver of B: C += A B]

for ( 0 to NC: NR )
  for ( 0 to MC: MR )
    for ( 0 to KC: 1 )
      // block-dot product
    endfor
  endfor
endfor
The GEMM Micro-kernel

[Figure: one micro-kernel iteration as an outer-product update: a column of A (α_0…α_3) times a row of B (β_0…β_3) updates the 4×4 register-resident block of C (γ_00…γ_33); loop nest as above]

•  Typical micro-kernel loop iteration ("block-dot product")
   –  Load column of packed A
   –  Load row of packed B
   –  Compute outer product
   –  Update C (kept in registers)
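A scalar C sketch of one such micro-kernel makes the four steps explicit. Again, this is illustrative only: MR = NR = 4, the packing layout, and the row-major C are assumed, and a real micro-kernel would keep the accumulator block in vector registers rather than a local array.

#define MR 4
#define NR 4

/* Micro-kernel sketch: C(MR x NR) += A_panel * B_panel, where A_panel holds
   kc packed columns of length MR and B_panel holds kc packed rows of length NR. */
void micro_kernel(int kc, const double *a_panel, const double *b_panel,
                  double *c, int ldc)
{
    double acc[MR][NR] = {{0.0}};               /* block of C "kept in registers" */

    for (int p = 0; p < kc; ++p) {
        const double *a = &a_panel[p * MR];     /* load column of packed A  */
        const double *b = &b_panel[p * NR];     /* load row of packed B     */
        for (int i = 0; i < MR; ++i)            /* compute outer product    */
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[i] * b[j];
    }
    for (int i = 0; i < MR; ++i)                /* update C                 */
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}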
  
Linear Algebra Core (LAC)

•  Scalable 2-D array of nr×nr processing elements (PEs)
   –  Specialized floating-point units w/ 1 MAC/cycle throughput
   –  Broadcast busses (no need to pipeline up to nr=16)
   –  Distributed memory architecture
   –  Distributed, PE-local control

[Figure: 4×4 grid of PEs PE(0,0)…PE(3,3) connected by row and column broadcast buses; each PE contains a MAC unit with accumulator, a register file, local memories MEM A and MEM B with address generators, row/column bus read and write ports (RBR/RBW, CBR/CBW), a controller, and a memory interface]
The "Rank-1" Machine

[Figure: the update C_i += A_i,p x B_p is computed as a sequence of rank-1 updates, each adding the outer product of an nr-long column of A with an nr-long row of B]

C += A x B
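As a toy functional model of how the PE array executes one such rank-1 update (a sketch, not the LAC hardware description; nr = 4 and the function name lac_rank1_step are assumed): the column of A is broadcast along the row buses, the row of B along the column buses, and every PE performs one multiply-accumulate into the element of C it holds locally.

#define NR 4   /* illustrative PE array dimension */

/* One rank-1 update step on an NR x NR PE array: c[i][j] is held locally by PE(i,j). */
void lac_rank1_step(double c[NR][NR], const double a[NR], const double b[NR])
{
    for (int i = 0; i < NR; ++i)          /* a[i] arrives on PE row i's broadcast bus    */
        for (int j = 0; j < NR; ++j)      /* b[j] arrives on PE column j's broadcast bus */
            c[i][j] += a[i] * b[j];       /* one MAC per PE per cycle                    */
}

Repeating this step kc times over the packed panels reproduces the block-dot product of the earlier micro-kernel sketch.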
Core-Level Implementation

•  4x4 LAC
•  256 kB of local SRAM
•  @ 1 GHz
•  Running GEMM @ 45nm technology

                         W/mm^2   GFLOPS/mm^2   GFLOPS/W   Utilization
Cell SPE (SP)             0.4       6.4           16         83%
NVidia GTX480 SM (SP)     0.5       4.5           8.4        70%
NVidia GTX480 SM (DP)     0.5       2.05          4.1        70%
Intel Core (DP)           0.5       0.4           0.9        95%
ClearSpeed (DP)           0.02      0.3           13         80%
LAC (SP)                  0.2       20            104        95+%
LAC (DP)                  0.3       16            47         95+%
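A quick consistency check on the table (simple arithmetic, not additional data): the efficiency column is, up to rounding, the ratio of the two density columns, since GFLOPS/W = (GFLOPS/mm^2) / (W/mm^2). For the Cell SPE, 6.4 / 0.4 = 16 GFLOPS/W, exactly as tabulated; rows such as LAC (DP) (16 / 0.3 ≈ 53 vs. 47) differ slightly, presumably because the W/mm^2 entries are rounded.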
Design Trade-off Challenge

•  Some linear algebra data handling requirements
   –  Matrix transpose (SYRK)
   –  Vector transpose (matrix-vector multiply) (LU)
   –  Broadcast (GEMM)
•  Common sources of inefficiency in CPUs & GPUs
   –  Instruction handling
   –  Multi-ported register file
   –  Cache overheads: tags and coherency
•  What is the best solution for those data handling requirements?
Studied Register File Organizations

[Figure: block diagrams of the three register file organizations for 16 FPUs described below: (a) a flat multi-ported register file / bypass network, (b) a single 16-wide SIMD register file, (c) an array of four 4-wide SIMD units, each with its own RF/bypass network, plus load/store units and a memory interface]

•  Organizations for 16 FPUs
   –  (a) Flat multiple-ported register file
   –  (b) Wide SIMD with wide ports
   –  (c) Array of short SIMD units
•  No limits assumed on
   –  Load-store unit data movement and bandwidth
   –  Scalar pipeline (# of instructions that are executed)
Data Handling Trade-off in Core

•  GEMM [ASAP2011]
   –  Regular data movement
   –  Regular access pattern
   –  Easily mapped on most SIMD cores
•  SYRK [ASAP2012], [SBAC-PAD2012]
   –  Matrix transpose needed
   –  Data manipulation for transpose is a challenge
Memory Hierarchy of GEMM

•  Blocking in higher levels of memory is almost the same

[Figure: GEMM blocking mapped onto the memory hierarchy: an accumulator block of C at the register level, panels A_p and B_p,j at the local store level (LAC), blocks A_p, B_p, C_i in on-chip memory, and the full matrices in main memory; the same picture applies to 1 x w and h x w SIMD organizations, with L1/L2/L3 caches (CPUs) in place of the local store]
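For orientation, the blocking in this figure corresponds to three more loops wrapped around the macro-kernel sketched earlier, in the usual GotoBLAS/BLIS style. The sketch below is illustrative only: the block sizes NC/KC/MC, the packing helpers pack_A and pack_B, and the assumption that all dimensions divide evenly are choices made for the example, not values from the slides.

#include <stddef.h>

#define NC 4096   /* illustrative block sizes */
#define KC 256
#define MC 96

/* Assumed helpers: pack_B copies a KC x NC block of B into NR-column panels,
   pack_A copies an MC x KC block of A into MR-row panels, and macro_kernel is
   the two-loop inner kernel sketched earlier. All matrices are row-major. */
void pack_A(int mc, int kc, const double *A, int lda, double *A_packed);
void pack_B(int kc, int nc, const double *B, int ldb, double *B_packed);
void macro_kernel(int nc, int mc, int kc, const double *A_packed,
                  const double *B_packed, double *C, int ldc);

void gemm_blocked(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc,
                  double *A_packed, double *B_packed)
{
    for (int jc = 0; jc < n; jc += NC)            /* block of B sized for on-chip memory */
        for (int pc = 0; pc < k; pc += KC) {
            pack_B(KC, NC, &B[(size_t)pc * ldb + jc], ldb, B_packed);
            for (int ic = 0; ic < m; ic += MC) {  /* block of A sized for the local store */
                pack_A(MC, KC, &A[(size_t)ic * lda + pc], lda, A_packed);
                macro_kernel(NC, MC, KC, A_packed, B_packed,
                             &C[(size_t)ic * ldc + jc], ldc);
            }
        }
}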
  
GEMM	
  Register-­‐level	
  Opera#ons	
  
•  Accumulator	
  
delay	
  
mul#plies	
  the	
  
number	
  of	
  
registers	
  
needed	
  in	
  core	
  
9/8/13	
   14	
  
Ai+dM-1,k Bk,j
cij
Ci+1,j
Blocks of Ci,j
1-Stream Out
2-Current
3-Prefetch
+=
mc
nr
Ci+1,j
Ci+2,j
Ci+d-1,j
dA
dM
dA
©	
  A.	
  Pedram	
  
C A Bx+=
+=
+=
+=
x
x
+= xnr
nr nr
w
dA
w x dA
w
h
w x dA
LAP
1 x w SIMD
h x w SIMD
Ai,p in Local Store of PEs Bp,j Bp,j+1
Blocks of Ci,j
1-Stream Out
2-Current
3-Prefetch
+=
x
kc
mc kc
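Why the accumulator delay matters for the register count (a back-of-the-envelope illustration, not a formula taken from the slides): if the floating-point accumulation latency is d cycles, a single running sum can accept a new addend only every d cycles, so roughly d independent accumulations must be interleaved per FPU to sustain one MAC per cycle. The register-resident part of C therefore grows from one nr x nr block to about d of them; with nr = 4 and d = 4, that is 4 x 16 = 64 elements of C held in registers instead of 16.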
Non-Blocked nr x nr SYRK on LAC

•  LAC takes advantage of the broadcast buses
•  Overlaps computation with the communication needed for transposition
•  Least operation count as part of a bigger SYRK problem
•  Simple control
   –  3-stage state machine in PEs

[Figure: on the 4×4 PE array, column a_0 is broadcast (state S1), then the rank-1 update C += a_0 a_0^T is performed while the next column a_1 is broadcast]
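Functionally, the step shown in the figure is the symmetric rank-1 update sketched below (illustrative only: nr = 4 is assumed, and only the lower triangle is updated, as SYRK requires). On the LAC, the same column a reaches PE(i,j) over both the row and the column broadcast buses, so each PE can form a[i] * a[j] locally.

#define NR 4   /* illustrative PE array dimension */

/* Symmetric rank-1 update C += a * a^T on an NR x NR block (lower triangle only). */
void syrk_rank1_step(double c[NR][NR], const double a[NR])
{
    for (int i = 0; i < NR; ++i)
        for (int j = 0; j <= i; ++j)      /* j <= i: lower triangle, as SYRK requires      */
            c[i][j] += a[i] * a[j];       /* PE(i,j) multiplies its row and column operands */
}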
Power and Performance Analysis

•  Analytical formulae
   –  Utilization
   –  Bandwidth
   –  Size of local stores
   –  Size of register file
   –  # of ports of memory levels
•  Component selections
   –  Storage model [CACTI 6.5]
      •  Pure SRAM model for register file
      •  Cache models for L1 & L2
•  Configuration
   –  1 GHz
   –  16 FPUs
   –  LAC
      •  256+32 KB aggregate local store
   –  Register file organizations
      •  256 KB L2, 64B line, 8-way
      •  32 KB L1, 64B line, 8-way
      •  Register file size variable
   –  Aimed for peak performance
GEMM Power and Area

•  SIMD architectures
   –  Effectively reduce the power consumption
•  LAC is the best; it uses no caches
•  The flexibility of flat register files is of no use for GEMM
•  A 4×4-wide array of SIMD requires less area
•  Most of the area is used by caches

[Charts: area (mm^2) and energy (W/GFLOP) for GEMM, broken down into SRAM, L2 cache, L1 cache, and RF, for the FLAT, wide SIMD, 4×4 SIMD array, and LAC organizations]
SYRK Power and Area

•  Wide SIMD takes many cycles to transpose
•  The 4×4-wide SIMD array has the best efficiency
   –  Less power than FLAT
   –  Higher utilization than wide SIMD
•  A 4×4-wide array of SIMD requires less area
•  Most of the area is used by caches

[Charts: area (mm^2) and energy (W/GFLOP) for SYRK, broken down into SRAM, L2 cache, L1 cache, and RF, for the FLAT, wide SIMD, 4×4 SIMD array, and LAC organizations]
Conclusion

•  Architectures studied
   •  Linear Algebra Core (LAC)
   •  FLAT register file
   •  1D wide SIMD
   •  4×4-wide array of SIMD
•  Algorithm/architecture co-design
   •  GEMM (emphasis on data bandwidth to/from the core)
   •  SYRK (emphasis on data manipulation in the core)
•  Analytical formulae for different layers of the memory hierarchy
•  Power and efficiency estimation
Linear Algebra Core

•  Results @ 1 GHz
   –  DP: 32 GFLOPS, 47 GFLOPS/W
   –  0.6 Watts
   –  2.8 mm^2 in 45nm
   –  4 GB/s external BW
   –  Orders of magnitude improvement
Ø  On-going and future work
   –  Generalization: full Level-3 BLAS in hardware
   –  Implementation: synthesis and tape-out
Questions?
