SlideShare a Scribd company logo
1 of 22
Download to read offline
Op#mizing	
  for	
  the	
  Inner	
  Kernel:	
  
A	
  Low	
  Power,	
  High	
  Performance	
  Core	
  
for	
  Matrix	
  Mul#plica#on	
  Kernels	
  
	
  
Ardavan	
  Pedram	
  
The	
  GEMM	
  Micro-­‐kernel	
  
•  BLIS	
  exposes	
  outer	
  two	
  loops	
  of	
  the	
  inner	
  kernel	
  
•  Key	
  observa+on:	
  Virtually	
  all	
  of	
  the	
  differences	
  between	
  
level-­‐3	
  inner	
  kernels	
  reside	
  in	
  the	
  outer	
  two	
  loops	
  
9/8/13	
   2	
  
for ( 0 to NC: NR )
for ( 0 to MC: MR )
for ( 0 to KC: 1 )
// block-dot product
endfor
endfor
endfor
inner	
  kernel	
  /	
  
macro-­‐kernel	
  
micro-­‐kernel	
  
©	
  A.	
  Pedram	
  
9/8/13
Era of Heterogeneous Computing
•  Physical limits of technology scaling
– Power wall and dark silicon
•  Only a fraction of a chip may be active
•  Efficiency/optimality vs. flexibility/generality
•  GFLOPS/W (energy per operation)
Ø Opportunity for specialization
Ø Heterogeneous multi-core /
Asynchronous CMP
Ø On-chip accelerators
Nvidia	
  Tegra	
  2	
  System	
  on	
  Chip	
  
© A. Pedram 3
Implementa#on	
  Spectrum	
  
9/8/13	
   4	
  
E F F I CI E N C Y
FLEXIBILITY
Source: T. Noll, RWTH Aachen, via R. Leupers, “From ASIP to MPSoC”,
Computer Engineering Colloquium, TU Delft, 2006
Linear	
  Algebra	
  
Processor	
  
©	
  A.	
  Pedram	
  
The	
  GEMM	
  Micro-­‐kernel	
  
9/8/13	
   5	
  
+=	
  
KC	
  NR	
  
MR	
  
NR	
  
C	
   A	
  
B	
  
for ( 0 to NC: NR )
for ( 0 to MC: MR )
for ( 0 to KC: 1 )
// block-dot product
endfor
endfor
endfor
©	
  A.	
  Pedram	
  
C	
  
The	
  GEMM	
  Micro-­‐kernel	
  
9/8/13	
   6	
  
+=	
  
KC	
  NR	
  
MR	
  
NR	
  
α1
α2
α3
α0
β1
β0
 β2
 β3
γ00
γ10
γ20
γ30
γ01
γ11
γ21
γ31
γ02
γ12
γ22
γ32
γ03
γ13
γ23
γ33
+=	
  
A	
  
B	
  
for ( 0 to NC: NR )
for ( 0 to MC: MR )
for ( 0 to KC: 1 )
// block-dot product
endfor
endfor
endfor
•  Typical	
  micro-­‐kernel	
  loop	
  itera#on	
  
(“block-­‐dot	
  product”)	
  
–  Load	
  column	
  of	
  packed	
  A	
  
–  Load	
  row	
  of	
  packed	
  B	
  
–  Compute	
  outer	
  product	
  
–  Update	
  C	
  (kept	
  in	
  registers)	
  
©	
  A.	
  Pedram	
  
9/8/13
Linear Algebra Core (LAC)
•  Scalable 2-D array of nr×nr processing elements (PEs)
–  Specialized floating-point units w/ 1 MAC/cycle throughput
–  Broadcast busses (no need to pipeline up to nr=16)
–  Distributed memory architecture
–  Distributed, PE-local control
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
`
MEM B
Addr1
Row Bus Write (RBW)
Column Bus Write (CBW)
A B
Controller
Column Bus
Read (CBR)
Row Bus
Read (RBR)
MAC
ACC_in
Accumulator
Cin
Memory Interface
Addr2
RF
MEM A
© A. Pedram 7
9/8/13
The “Rank-1” Machine
BpCi
Ai,p
+= x
+=
+
+
nr
nr
+ …
x
x
x
BpCi
Ai,p
+= x
+=
+
+
nr
nr
+ …
x
x
x
C += A x B
© A. Pedram 8
9/8/13
Core-Level Implementation
•  4x4 LAC
•  256 kB of local SRAM
•  @ 1GHz
•  Running GEMM @45nm technology
W
mm2
GFLOPS
mm2
GFLOPS
W
Utilization
Cell SPE (SP) 0.4 6.4 16 83%
NVidia GTX480 SM (SP) 0.5 4.5 8.4 70%
NVidia GTX480 SM (DP) 0.5 2.05 4.1 70%
Intel Core (DP) 0.5 0.4 0.9 95%
ClearSpeed (DP) 0.02 0.3 13 80%
LAC (SP) 0.2 20 104 95+%
LAC (DP) 0.3 16 47 95+%
© A. Pedram 9
Design	
  Trade-­‐off	
  Challenge	
  
•  Some	
  Linear	
  algebra	
  data	
  handling	
  requirements	
  
–  Matrix	
  Transpose	
  (SYRK)	
  
–  Vector	
  Transpose	
  (Matrix-­‐vector	
  mul#ply)	
  (LU)	
  	
  
–  Broadcast	
  (GEMM)	
  
•  Common	
  Sources	
  of	
  Inefficiencies	
  
	
  in	
  CPUs	
  &	
  GPUs	
  
–  Instruc#on	
  handling	
  
–  Mul#-­‐ported	
  register	
  file	
  
–  Cache	
  overheads:	
  tags	
  and	
  coherency	
  
•  What	
  is	
  the	
  best	
  solu#on	
  for	
  those	
  data	
  handling	
  
requirements?	
  
9/8/13	
   10	
  ©	
  A.	
  Pedram	
  
Studied	
  Register	
  File	
  Organiza#ons	
  
9/8/13	
   11	
  
16 Wide SIMD
SIMD Register File
/ Bypass network
Memory Interface
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU
+LS
FPU FPU FPU FPU
Load/
Store
Unit
Load/
Store
Unit
RF/Bypass
4 Wide SIMD
RF/Bypass
4 Wide SIMD
RF/Bypass
4 Wide SIMD
RF/Bypass
4 Wide SIMD
FPU FPU FPU .. FPU
Multi-Port Flat Register File
/ Bypass network Load/
Store
Unit
..
..
..
(c)(b)(a)
•  Organiza#ons	
  for	
  16	
  FPUs	
  
–  (a)	
  Flat	
  mul#ple-­‐ported	
  register	
  file	
  
–  (b)	
  Wide	
  SIMD	
  with	
  wide	
  ports	
  
–  (c)	
  Array	
  of	
  short	
  SIMD	
  units	
  
•  No	
  limits	
  	
  
–  load-­‐store	
  unit	
  data	
  movement	
  and	
  bandwidth	
  
–  Scalar	
  pipeline	
  (#	
  instruc#ons	
  that	
  are	
  executed)	
  
©	
  A.	
  Pedram	
  
Data	
  Handling	
  Trade-­‐off	
  in	
  Core	
  
•  GEMM[ASAP2011]	
  
– Regular	
  data	
  movement	
  	
  
– Regular	
  access	
  paoern	
  
– Easily	
  mapped	
  on	
  most	
  SIMD	
  Cores	
  
•  SYRK[ASAP2012],[SBAC-­‐PAD2012]	
  
– Matrix	
  Transpose	
  needed	
  
– Data	
  manipula#on	
  for	
  Transpose	
  is	
  a	
  challenge	
  	
  
9/8/13	
   12	
  ©	
  A.	
  Pedram	
  
Memory	
  Hierarchy	
  of	
  GEMM	
  
•  Blocking	
  in	
  	
  
higher	
  levels	
  
of	
  memory	
  is	
  
almost	
  the	
  	
  
same	
  
	
  
9/8/13	
   13	
  
+=
xAi,p
C A B
C
x
x
+=
x+=
Accumulator
(Register Level)
Local Store Level
(LAC)
On-chip Memory
n
Bp
Ap
Ci
Bp,j
Main Memory
LAP
1 x w SIMD
h x w SIMD
L2 Cache
(CPUs) L1 Cache
(CPUs)
L3 Cache
(CPUs)
©	
  A.	
  Pedram	
  
GEMM	
  Register-­‐level	
  Opera#ons	
  
•  Accumulator	
  
delay	
  
mul#plies	
  the	
  
number	
  of	
  
registers	
  
needed	
  in	
  core	
  
9/8/13	
   14	
  
Ai+dM-1,k Bk,j
cij
Ci+1,j
Blocks of Ci,j
1-Stream Out
2-Current
3-Prefetch
+=
mc
nr
Ci+1,j
Ci+2,j
Ci+d-1,j
dA
dM
dA
©	
  A.	
  Pedram	
  
C A Bx+=
+=
+=
+=
x
x
+= xnr
nr nr
w
dA
w x dA
w
h
w x dA
LAP
1 x w SIMD
h x w SIMD
Ai,p in Local Store of PEs Bp,j Bp,j+1
Blocks of Ci,j
1-Stream Out
2-Current
3-Prefetch
+=
x
kc
mc kc
9/8/13
Non-Blocked nr x nr SYRK on LAC
•  LAC takes advantage of the
broadcast buses
•  Overlap computation with
communication with transposition
•  Least operation count in part of
bigger SYRK problem
•  Simple control
–  3-stage state machine in PEs
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Broadcast a0(S1)
© A. Pedram 15
9/8/13
Non-Blocked nr x nr SYRK on LAC
•  LAC takes advantage of the
broadcast buses
•  Overlap computation with
communication with transposition
•  Least operation count in part of
bigger SYRK problem
•  Simple control
–  3-stage state machine in PEs
(0,0) PE(0,1) PE(0,2) PE(0,3)
(1,0) PE(1,1) PE(1,2) PE(1,3)
(2,0) PE(2,1) PE(2,2) PE(2,3)
(3,0) PE(3,1) PE(3,2) PE(3,3)
Broadcast a0(S1)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Broadcast a1
Broadcast a0
T
Rank-1 update C+=a0a0
T
(S1)
(S1)
(S1)
© A. Pedram 16
Power	
  and	
  Performance	
  Analysis	
  
•  Analy#cal	
  formulae	
  
–  U#liza#on	
  
–  Bandwidth	
  
–  Size	
  of	
  local	
  stores	
  
–  Size	
  of	
  register	
  file	
  
–  #ports	
  of	
  memory	
  levels	
  
•  Component	
  selec#ons	
  
–  Storage	
  model	
  [CACTI	
  6.5]	
  
•  Pure	
  SRAM	
  Model	
  for	
  
Register	
  file	
  
•  Cache	
  Models	
  for	
  L1	
  &	
  L2	
  
	
  
•  Configura#on	
  
–  1	
  GHz	
  	
  
–  16	
  FPUs	
  
–  LAC	
  
•  256+32	
  KB	
  aggregate	
  Local	
  
store	
  
–  Register	
  file	
  organiza#ons	
  
•  256	
  	
  KB	
  L2,	
  64B	
  line,	
  8	
  way	
  
•  32	
  	
  	
  	
  KB	
  L1,	
  64B	
  line,	
  8way	
  
•  Register	
  file	
  Size	
  variable	
  
–  Aimed	
  for	
  peak	
  performance	
  
9/8/13	
   17	
  ©	
  A.	
  Pedram	
  
GEMM	
  Power	
  and	
  Area	
  
•  SIMD	
  architectures	
  
–  Effec#vely	
  reduce	
  the	
  power	
  
consump#on	
  
•  LAC	
  is	
  the	
  best	
  not	
  using	
  
caches	
  
•  The	
  flexibility	
  of	
  flat	
  register	
  
files	
  is	
  of	
  no	
  use	
  for	
  GEMM.	
  
•  4×4-­‐wide	
  array	
  of	
  SIMD	
  
requires	
  less	
  area.	
  
•  Most	
  of	
  the	
  area	
  is	
  used	
  by	
  
Caches	
  
9/8/13	
   18	
  
0

2

4

6

8

10

12

14

FLAT
 SIMD
 A
SIMD
X
h
 LAC

mm^2

SRAM

L2
Cache

L1
Cache

RF

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

FLAT
 SIMD
 A
SIMD
X
h
 LAC

W/GFLOP

SRAM

L2
Cache

L1
Cache

RF

©	
  A.	
  Pedram	
  
SYRK	
  Power	
  and	
  Area	
  
•  Wide	
  SIMD	
  takes	
  many	
  
cycles	
  to	
  transpose	
  
•  4×4-­‐wide	
  SIMD	
  has	
  best	
  
efficiency	
  
–  Less	
  power	
  than	
  FLAT	
  
–  Higher	
  u#liza#on	
  then	
  Wide	
  
SIMD	
  
•  4×4-­‐wide	
  array	
  of	
  SIMD	
  
requires	
  less	
  area.	
  
•  Most	
  of	
  the	
  area	
  is	
  used	
  by	
  
Caches	
  
9/8/13	
   19	
  
0

2

4

6

8

10

12

14

16

18

FLAT
 SIMD
 A
SIMD
X
h
 LAC

mm^2

SRAM

L2
Cache

L1
Cache

RF

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

FLAT
 SIMD
 A
SIMD
X
h
 LAC

W/GFLOP

SRAM

L2
Cache

L1
Cache

RF

©	
  A.	
  Pedram	
  
Conclusion	
  
•  Architectures	
  Studied	
  
•  Linear	
  Algebra	
  Core	
  (LAC)	
  
•  FLAT	
  register	
  file	
  
•  1D	
  wide	
  SIMD	
  
•  4×4-­‐wide	
  array	
  of	
  SIMD	
  	
  
•  Algorithm/Architecture	
  co-­‐design	
  
•  GEMM	
  (emphasize	
  on	
  data	
  bandwidth	
  to-­‐from	
  core)	
  
•  SYRK 	
  (	
  emphasize	
  on	
  data	
  manipula#on	
  in	
  core)	
  
•  Analy#cal	
  formulae	
  for	
  different	
  layers	
  of	
  
memory	
  hierarchy	
  
•  Power	
  and	
  efficiency	
  es#ma#on	
  
9/8/13	
   20	
  ©	
  A.	
  Pedram	
  
Linear	
  Algebra	
  Core	
  
•  Results @ 1GHz
–  DP: 32 GFLOPS, 47 GFLOPS/W
–  0.6 Watts
–  2.8 mm2 in 45nm
–  4 GB/s external BW
–  Orders of magnitude improvement
Ø On-going and future work
–  Generalization: full Level-3 BLAS in hardware
–  Implementation: synthesis and tape-out
9/8/13	
   21	
  ©	
  A.	
  Pedram	
  
Ques#ons	
  ?	
  
9/8/13	
   22	
  ©	
  A.	
  Pedram	
  

More Related Content

What's hot

ARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsAnh Dung NGUYEN
 
Chapter 04 the processor
Chapter 04   the processorChapter 04   the processor
Chapter 04 the processorBảo Hoang
 
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC 02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC IEEE SSCS AlexSC
 
CArcMOOC 05.03 - Pipeline hazards
CArcMOOC 05.03 - Pipeline hazardsCArcMOOC 05.03 - Pipeline hazards
CArcMOOC 05.03 - Pipeline hazardsAlessandro Bogliolo
 
Arm cortex R(real time)processor series
Arm cortex R(real time)processor series Arm cortex R(real time)processor series
Arm cortex R(real time)processor series Ronak047
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armPrashant Ahire
 
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's modelAAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's modelAnh Dung NGUYEN
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 
ASTROSAT SSR - 2015-05-15
ASTROSAT SSR - 2015-05-15ASTROSAT SSR - 2015-05-15
ASTROSAT SSR - 2015-05-15Aritra Sarkar
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overviewSunil Thorat
 
Why a zynq should power your next project
Why a zynq should power your next projectWhy a zynq should power your next project
Why a zynq should power your next projectMark Smith
 
F9 microkernel code reading part 4 memory management
F9 microkernel code reading part 4 memory managementF9 microkernel code reading part 4 memory management
F9 microkernel code reading part 4 memory managementBenux Wei
 

What's hot (20)

ARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsARM AAE - Intrustion Sets
ARM AAE - Intrustion Sets
 
Chapter 04 the processor
Chapter 04   the processorChapter 04   the processor
Chapter 04 the processor
 
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC 02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
02 : ARM Cortex M4 Specs || IEEE SSCS AlexSC
 
OMAP
OMAPOMAP
OMAP
 
CArcMOOC 05.03 - Pipeline hazards
CArcMOOC 05.03 - Pipeline hazardsCArcMOOC 05.03 - Pipeline hazards
CArcMOOC 05.03 - Pipeline hazards
 
Arm cortex R(real time)processor series
Arm cortex R(real time)processor series Arm cortex R(real time)processor series
Arm cortex R(real time)processor series
 
arm-cortex-a8
arm-cortex-a8arm-cortex-a8
arm-cortex-a8
 
Gv2512441247
Gv2512441247Gv2512441247
Gv2512441247
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_arm
 
ARM cortex A15
ARM cortex A15ARM cortex A15
ARM cortex A15
 
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's modelAAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
ASTROSAT SSR - 2015-05-15
ASTROSAT SSR - 2015-05-15ASTROSAT SSR - 2015-05-15
ASTROSAT SSR - 2015-05-15
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
 
Archi arm2
Archi arm2Archi arm2
Archi arm2
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
Why a zynq should power your next project
Why a zynq should power your next projectWhy a zynq should power your next project
Why a zynq should power your next project
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
 
F9 microkernel code reading part 4 memory management
F9 microkernel code reading part 4 memory managementF9 microkernel code reading part 4 memory management
F9 microkernel code reading part 4 memory management
 

Similar to Custom Computer Engine for Optimizing for the Inner kernel of Matrix Multiplication

Microelectronics U4.pptx.ppt
Microelectronics U4.pptx.pptMicroelectronics U4.pptx.ppt
Microelectronics U4.pptx.pptPavikaSharma3
 
TRACK D: A breakthrough in logic design drastically improving performances fr...
TRACK D: A breakthrough in logic design drastically improving performances fr...TRACK D: A breakthrough in logic design drastically improving performances fr...
TRACK D: A breakthrough in logic design drastically improving performances fr...chiportal
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computingrinnocente
 
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCETEC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCETSeshaVidhyaS
 
Prezentacja_Profil_Portfolio - EN
Prezentacja_Profil_Portfolio - ENPrezentacja_Profil_Portfolio - EN
Prezentacja_Profil_Portfolio - ENTomasz Janicki
 
DB2 for z/OS - Starter's guide to memory monitoring and control
DB2 for z/OS - Starter's guide to memory monitoring and controlDB2 for z/OS - Starter's guide to memory monitoring and control
DB2 for z/OS - Starter's guide to memory monitoring and controlFlorence Dubois
 
Decoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMIDecoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMIAllan Cantle
 
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
Ics21 workshop   decoupling compute from memory, storage & io with omi - ...Ics21 workshop   decoupling compute from memory, storage & io with omi - ...
Ics21 workshop decoupling compute from memory, storage & io with omi - ...Vaibhav R
 
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017Bruno Teixeira
 
LTE-Advanced Physical Layer
LTE-Advanced Physical LayerLTE-Advanced Physical Layer
LTE-Advanced Physical LayerPraveen Kumar
 
Van jaconson netchannels
Van jaconson netchannelsVan jaconson netchannels
Van jaconson netchannelsSusant Sahani
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureAllan Cantle
 
Optimized CAM Design
Optimized CAM DesignOptimized CAM Design
Optimized CAM DesignIJMER
 
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdfCisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdfVarghese Martin
 

Similar to Custom Computer Engine for Optimizing for the Inner kernel of Matrix Multiplication (20)

Microelectronics U4.pptx.ppt
Microelectronics U4.pptx.pptMicroelectronics U4.pptx.ppt
Microelectronics U4.pptx.ppt
 
TRACK D: A breakthrough in logic design drastically improving performances fr...
TRACK D: A breakthrough in logic design drastically improving performances fr...TRACK D: A breakthrough in logic design drastically improving performances fr...
TRACK D: A breakthrough in logic design drastically improving performances fr...
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
DF_Captronics06
DF_Captronics06DF_Captronics06
DF_Captronics06
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
 
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCETEC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
EC8392 -DIGITAL ELECTRONICS -II YEAR ECE-by S.SESHA VIDHYA /ASP/ ECE/ RMKCET
 
Prezentacja_Profil_Portfolio - EN
Prezentacja_Profil_Portfolio - ENPrezentacja_Profil_Portfolio - EN
Prezentacja_Profil_Portfolio - EN
 
SRAM
SRAMSRAM
SRAM
 
Unit2 arm
Unit2 armUnit2 arm
Unit2 arm
 
Memory elements 1
Memory elements  1Memory elements  1
Memory elements 1
 
DB2 for z/OS - Starter's guide to memory monitoring and control
DB2 for z/OS - Starter's guide to memory monitoring and controlDB2 for z/OS - Starter's guide to memory monitoring and control
DB2 for z/OS - Starter's guide to memory monitoring and control
 
Decoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMIDecoupling Compute from Memory, Storage and IO with OMI
Decoupling Compute from Memory, Storage and IO with OMI
 
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
Ics21 workshop   decoupling compute from memory, storage & io with omi - ...Ics21 workshop   decoupling compute from memory, storage & io with omi - ...
Ics21 workshop decoupling compute from memory, storage & io with omi - ...
 
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
Cisco Live! :: Cisco ASR 9000 Architecture :: BRKARC-2003 | Las Vegas 2017
 
LTE-Advanced Physical Layer
LTE-Advanced Physical LayerLTE-Advanced Physical Layer
LTE-Advanced Physical Layer
 
Osmocom
OsmocomOsmocom
Osmocom
 
Van jaconson netchannels
Van jaconson netchannelsVan jaconson netchannels
Van jaconson netchannels
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC Architecture
 
Optimized CAM Design
Optimized CAM DesignOptimized CAM Design
Optimized CAM Design
 
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdfCisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
Cisco ASR 9000 Architecture - BRKARC-2003 3rd session.pdf
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

Custom Computer Engine for Optimizing for the Inner kernel of Matrix Multiplication

  • 1. Op#mizing  for  the  Inner  Kernel:   A  Low  Power,  High  Performance  Core   for  Matrix  Mul#plica#on  Kernels     Ardavan  Pedram  
  • 2. The  GEMM  Micro-­‐kernel   •  BLIS  exposes  outer  two  loops  of  the  inner  kernel   •  Key  observa+on:  Virtually  all  of  the  differences  between   level-­‐3  inner  kernels  reside  in  the  outer  two  loops   9/8/13   2   for ( 0 to NC: NR ) for ( 0 to MC: MR ) for ( 0 to KC: 1 ) // block-dot product endfor endfor endfor inner  kernel  /   macro-­‐kernel   micro-­‐kernel   ©  A.  Pedram  
  • 3. 9/8/13 Era of Heterogeneous Computing •  Physical limits of technology scaling – Power wall and dark silicon •  Only a fraction of a chip may be active •  Efficiency/optimality vs. flexibility/generality •  GFLOPS/W (energy per operation) Ø Opportunity for specialization Ø Heterogeneous multi-core / Asynchronous CMP Ø On-chip accelerators Nvidia  Tegra  2  System  on  Chip   © A. Pedram 3
  • 4. Implementa#on  Spectrum   9/8/13   4   E F F I CI E N C Y FLEXIBILITY Source: T. Noll, RWTH Aachen, via R. Leupers, “From ASIP to MPSoC”, Computer Engineering Colloquium, TU Delft, 2006 Linear  Algebra   Processor   ©  A.  Pedram  
  • 5. The  GEMM  Micro-­‐kernel   9/8/13   5   +=   KC  NR   MR   NR   C   A   B   for ( 0 to NC: NR ) for ( 0 to MC: MR ) for ( 0 to KC: 1 ) // block-dot product endfor endfor endfor ©  A.  Pedram  
  • 6. C   The  GEMM  Micro-­‐kernel   9/8/13   6   +=   KC  NR   MR   NR   α1 α2 α3 α0 β1 β0 β2 β3 γ00 γ10 γ20 γ30 γ01 γ11 γ21 γ31 γ02 γ12 γ22 γ32 γ03 γ13 γ23 γ33 +=   A   B   for ( 0 to NC: NR ) for ( 0 to MC: MR ) for ( 0 to KC: 1 ) // block-dot product endfor endfor endfor •  Typical  micro-­‐kernel  loop  itera#on   (“block-­‐dot  product”)   –  Load  column  of  packed  A   –  Load  row  of  packed  B   –  Compute  outer  product   –  Update  C  (kept  in  registers)   ©  A.  Pedram  
  • 7. 9/8/13 Linear Algebra Core (LAC) •  Scalable 2-D array of nr×nr processing elements (PEs) –  Specialized floating-point units w/ 1 MAC/cycle throughput –  Broadcast busses (no need to pipeline up to nr=16) –  Distributed memory architecture –  Distributed, PE-local control PE(0,0) PE(0,1) PE(0,2) PE(0,3) PE(1,0) PE(1,1) PE(1,2) PE(1,3) PE(2,0) PE(2,1) PE(2,2) PE(2,3) PE(3,0) PE(3,1) PE(3,2) PE(3,3) ` MEM B Addr1 Row Bus Write (RBW) Column Bus Write (CBW) A B Controller Column Bus Read (CBR) Row Bus Read (RBR) MAC ACC_in Accumulator Cin Memory Interface Addr2 RF MEM A © A. Pedram 7
  • 8. 9/8/13 The “Rank-1” Machine BpCi Ai,p += x += + + nr nr + … x x x BpCi Ai,p += x += + + nr nr + … x x x C += A x B © A. Pedram 8
  • 9. 9/8/13 Core-Level Implementation •  4x4 LAC •  256 kB of local SRAM •  @ 1GHz •  Running GEMM @45nm technology W mm2 GFLOPS mm2 GFLOPS W Utilization Cell SPE (SP) 0.4 6.4 16 83% NVidia GTX480 SM (SP) 0.5 4.5 8.4 70% NVidia GTX480 SM (DP) 0.5 2.05 4.1 70% Intel Core (DP) 0.5 0.4 0.9 95% ClearSpeed (DP) 0.02 0.3 13 80% LAC (SP) 0.2 20 104 95+% LAC (DP) 0.3 16 47 95+% © A. Pedram 9
  • 10. Design  Trade-­‐off  Challenge   •  Some  Linear  algebra  data  handling  requirements   –  Matrix  Transpose  (SYRK)   –  Vector  Transpose  (Matrix-­‐vector  mul#ply)  (LU)     –  Broadcast  (GEMM)   •  Common  Sources  of  Inefficiencies    in  CPUs  &  GPUs   –  Instruc#on  handling   –  Mul#-­‐ported  register  file   –  Cache  overheads:  tags  and  coherency   •  What  is  the  best  solu#on  for  those  data  handling   requirements?   9/8/13   10  ©  A.  Pedram  
  • 11. Studied  Register  File  Organiza#ons   9/8/13   11   16 Wide SIMD SIMD Register File / Bypass network Memory Interface FPU +LS FPU +LS FPU +LS FPU +LS FPU +LS FPU +LS FPU +LS FPU +LS FPU FPU FPU FPU Load/ Store Unit Load/ Store Unit RF/Bypass 4 Wide SIMD RF/Bypass 4 Wide SIMD RF/Bypass 4 Wide SIMD RF/Bypass 4 Wide SIMD FPU FPU FPU .. FPU Multi-Port Flat Register File / Bypass network Load/ Store Unit .. .. .. (c)(b)(a) •  Organiza#ons  for  16  FPUs   –  (a)  Flat  mul#ple-­‐ported  register  file   –  (b)  Wide  SIMD  with  wide  ports   –  (c)  Array  of  short  SIMD  units   •  No  limits     –  load-­‐store  unit  data  movement  and  bandwidth   –  Scalar  pipeline  (#  instruc#ons  that  are  executed)   ©  A.  Pedram  
  • 12. Data  Handling  Trade-­‐off  in  Core   •  GEMM[ASAP2011]   – Regular  data  movement     – Regular  access  paoern   – Easily  mapped  on  most  SIMD  Cores   •  SYRK[ASAP2012],[SBAC-­‐PAD2012]   – Matrix  Transpose  needed   – Data  manipula#on  for  Transpose  is  a  challenge     9/8/13   12  ©  A.  Pedram  
  • 13. Memory  Hierarchy  of  GEMM   •  Blocking  in     higher  levels   of  memory  is   almost  the     same     9/8/13   13   += xAi,p C A B C x x += x+= Accumulator (Register Level) Local Store Level (LAC) On-chip Memory n Bp Ap Ci Bp,j Main Memory LAP 1 x w SIMD h x w SIMD L2 Cache (CPUs) L1 Cache (CPUs) L3 Cache (CPUs) ©  A.  Pedram  
  • 14. GEMM  Register-­‐level  Opera#ons   •  Accumulator   delay   mul#plies  the   number  of   registers   needed  in  core   9/8/13   14   Ai+dM-1,k Bk,j cij Ci+1,j Blocks of Ci,j 1-Stream Out 2-Current 3-Prefetch += mc nr Ci+1,j Ci+2,j Ci+d-1,j dA dM dA ©  A.  Pedram   C A Bx+= += += += x x += xnr nr nr w dA w x dA w h w x dA LAP 1 x w SIMD h x w SIMD Ai,p in Local Store of PEs Bp,j Bp,j+1 Blocks of Ci,j 1-Stream Out 2-Current 3-Prefetch += x kc mc kc
  • 15. 9/8/13 Non-Blocked nr x nr SYRK on LAC •  LAC takes advantage of the broadcast buses •  Overlap computation with communication with transposition •  Least operation count in part of bigger SYRK problem •  Simple control –  3-stage state machine in PEs PE(0,0) PE(0,1) PE(0,2) PE(0,3) PE(1,0) PE(1,1) PE(1,2) PE(1,3) PE(2,0) PE(2,1) PE(2,2) PE(2,3) PE(3,0) PE(3,1) PE(3,2) PE(3,3) Broadcast a0(S1) © A. Pedram 15
  • 16. 9/8/13 Non-Blocked nr x nr SYRK on LAC •  LAC takes advantage of the broadcast buses •  Overlap computation with communication with transposition •  Least operation count in part of bigger SYRK problem •  Simple control –  3-stage state machine in PEs (0,0) PE(0,1) PE(0,2) PE(0,3) (1,0) PE(1,1) PE(1,2) PE(1,3) (2,0) PE(2,1) PE(2,2) PE(2,3) (3,0) PE(3,1) PE(3,2) PE(3,3) Broadcast a0(S1) PE(0,0) PE(0,1) PE(0,2) PE(0,3) PE(1,0) PE(1,1) PE(1,2) PE(1,3) PE(2,0) PE(2,1) PE(2,2) PE(2,3) PE(3,0) PE(3,1) PE(3,2) PE(3,3) Broadcast a1 Broadcast a0 T Rank-1 update C+=a0a0 T (S1) (S1) (S1) © A. Pedram 16
  • 17. Power  and  Performance  Analysis   •  Analy#cal  formulae   –  U#liza#on   –  Bandwidth   –  Size  of  local  stores   –  Size  of  register  file   –  #ports  of  memory  levels   •  Component  selec#ons   –  Storage  model  [CACTI  6.5]   •  Pure  SRAM  Model  for   Register  file   •  Cache  Models  for  L1  &  L2     •  Configura#on   –  1  GHz     –  16  FPUs   –  LAC   •  256+32  KB  aggregate  Local   store   –  Register  file  organiza#ons   •  256    KB  L2,  64B  line,  8  way   •  32        KB  L1,  64B  line,  8way   •  Register  file  Size  variable   –  Aimed  for  peak  performance   9/8/13   17  ©  A.  Pedram  
  • 18. GEMM  Power  and  Area   •  SIMD  architectures   –  Effec#vely  reduce  the  power   consump#on   •  LAC  is  the  best  not  using   caches   •  The  flexibility  of  flat  register   files  is  of  no  use  for  GEMM.   •  4×4-­‐wide  array  of  SIMD   requires  less  area.   •  Most  of  the  area  is  used  by   Caches   9/8/13   18   0
 2
 4
 6
 8
 10
 12
 14
 FLAT
 SIMD
 A
SIMD
X
h
 LAC
 mm^2
 SRAM
 L2
Cache
 L1
Cache
 RF
 0
 0.01
 0.02
 0.03
 0.04
 0.05
 0.06
 0.07
 FLAT
 SIMD
 A
SIMD
X
h
 LAC
 W/GFLOP
 SRAM
 L2
Cache
 L1
Cache
 RF
 ©  A.  Pedram  
  • 19. SYRK  Power  and  Area   •  Wide  SIMD  takes  many   cycles  to  transpose   •  4×4-­‐wide  SIMD  has  best   efficiency   –  Less  power  than  FLAT   –  Higher  u#liza#on  then  Wide   SIMD   •  4×4-­‐wide  array  of  SIMD   requires  less  area.   •  Most  of  the  area  is  used  by   Caches   9/8/13   19   0
 2
 4
 6
 8
 10
 12
 14
 16
 18
 FLAT
 SIMD
 A
SIMD
X
h
 LAC
 mm^2
 SRAM
 L2
Cache
 L1
Cache
 RF
 0
 0.02
 0.04
 0.06
 0.08
 0.1
 0.12
 0.14
 0.16
 0.18
 0.2
 FLAT
 SIMD
 A
SIMD
X
h
 LAC
 W/GFLOP
 SRAM
 L2
Cache
 L1
Cache
 RF
 ©  A.  Pedram  
  • 20. Conclusion   •  Architectures  Studied   •  Linear  Algebra  Core  (LAC)   •  FLAT  register  file   •  1D  wide  SIMD   •  4×4-­‐wide  array  of  SIMD     •  Algorithm/Architecture  co-­‐design   •  GEMM  (emphasize  on  data  bandwidth  to-­‐from  core)   •  SYRK  (  emphasize  on  data  manipula#on  in  core)   •  Analy#cal  formulae  for  different  layers  of   memory  hierarchy   •  Power  and  efficiency  es#ma#on   9/8/13   20  ©  A.  Pedram  
  • 21. Linear  Algebra  Core   •  Results @ 1GHz –  DP: 32 GFLOPS, 47 GFLOPS/W –  0.6 Watts –  2.8 mm2 in 45nm –  4 GB/s external BW –  Orders of magnitude improvement Ø On-going and future work –  Generalization: full Level-3 BLAS in hardware –  Implementation: synthesis and tape-out 9/8/13   21  ©  A.  Pedram  
  • 22. Ques#ons  ?   9/8/13   22  ©  A.  Pedram