Optimizing for the Inner Kernel:
A Low Power, High Performance Core for Matrix Multiplication Kernels

Ardavan Pedram
The GEMM Micro-kernel

•  BLIS exposes the outer two loops of the inner kernel
•  Key observation: virtually all of the differences between level-3 inner kernels reside in the outer two loops
for ( 0 to NC: NR )          // inner kernel / macro-kernel: the outer two loops
  for ( 0 to MC: MR )
    for ( 0 to KC: 1 )       // micro-kernel
      // block-dot product
    endfor
  endfor
endfor
  
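To make the loop structure concrete, the following is a minimal C sketch of the macro-kernel loops above. It is an illustration rather than the BLIS source: the function names (macro_kernel, micro_kernel), the MR/NR values, the row-major layout of C, and the packed-panel offsets are assumptions made for the example.

#include <stddef.h>

#define MR 4   /* register-block rows of C (illustrative)    */
#define NR 4   /* register-block columns of C (illustrative) */

/* Micro-kernel: a KC-long block-dot product updating an MR x NR block of C.
   Its body is sketched after the micro-kernel slide further below. */
void micro_kernel(int kc, const double *a_panel, const double *b_panel,
                  double *c, int ldc);

/* Macro-kernel (inner kernel): the outer two loops that BLIS exposes.
   A_packed holds an MC x KC block of A in MR-row panels; B_packed holds a
   KC x NC block of B in NR-column panels; C is row-major with leading
   dimension ldc. */
void macro_kernel(int nc, int mc, int kc,
                  const double *A_packed, const double *B_packed,
                  double *C, int ldc)
{
    for (int jr = 0; jr < nc; jr += NR)        /* loop "0 to NC: NR" */
        for (int ir = 0; ir < mc; ir += MR)    /* loop "0 to MC: MR" */
            micro_kernel(kc,
                         A_packed + (size_t)ir * kc,   /* MR x KC sliver of A */
                         B_packed + (size_t)jr * kc,   /* KC x NR sliver of B */
                         &C[(size_t)ir * ldc + jr], ldc);
}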
Era of Heterogeneous Computing

•  Physical limits of technology scaling
   –  Power wall and dark silicon
•  Only a fraction of a chip may be active
•  Efficiency/optimality vs. flexibility/generality
•  GFLOPS/W (energy per operation)
Ø  Opportunity for specialization
Ø  Heterogeneous multi-core / asynchronous CMP
Ø  On-chip accelerators

[Figure: Nvidia Tegra 2 System on Chip]
Implementation Spectrum

[Figure: implementation spectrum trading off efficiency against flexibility, with the Linear Algebra Processor positioned on it. Source: T. Noll, RWTH Aachen, via R. Leupers, "From ASIP to MPSoC", Computer Engineering Colloquium, TU Delft, 2006]
The GEMM Micro-kernel

[Figure: the micro-kernel updates an MR x NR block of C with the product of an MR x KC sliver of A and a KC x NR sliver of B: C += A B]

for ( 0 to NC: NR )
  for ( 0 to MC: MR )
    for ( 0 to KC: 1 )
      // block-dot product
    endfor
  endfor
endfor
The GEMM Micro-kernel

[Figure: one micro-kernel iteration as an outer-product update: a column of A (α_0…α_3) times a row of B (β_0…β_3) updates the 4×4 register-resident block of C (γ_00…γ_33); loop nest as above]

•  Typical micro-kernel loop iteration ("block-dot product")
   –  Load column of packed A
   –  Load row of packed B
   –  Compute outer product
   –  Update C (kept in registers)
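A scalar C sketch of one such micro-kernel makes the four steps explicit. Again, this is illustrative only: MR = NR = 4, the packing layout, and the row-major C are assumed, and a real micro-kernel would keep the accumulator block in vector registers rather than a local array.

#define MR 4
#define NR 4

/* Micro-kernel sketch: C(MR x NR) += A_panel * B_panel, where A_panel holds
   kc packed columns of length MR and B_panel holds kc packed rows of length NR. */
void micro_kernel(int kc, const double *a_panel, const double *b_panel,
                  double *c, int ldc)
{
    double acc[MR][NR] = {{0.0}};               /* block of C "kept in registers" */

    for (int p = 0; p < kc; ++p) {
        const double *a = &a_panel[p * MR];     /* load column of packed A  */
        const double *b = &b_panel[p * NR];     /* load row of packed B     */
        for (int i = 0; i < MR; ++i)            /* compute outer product    */
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[i] * b[j];
    }
    for (int i = 0; i < MR; ++i)                /* update C                 */
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}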
  
Linear Algebra Core (LAC)

•  Scalable 2-D array of nr×nr processing elements (PEs)
   –  Specialized floating-point units w/ 1 MAC/cycle throughput
   –  Broadcast busses (no need to pipeline up to nr=16)
   –  Distributed memory architecture
   –  Distributed, PE-local control

[Figure: 4×4 grid of PEs PE(0,0)…PE(3,3) connected by row and column broadcast buses; each PE contains a MAC unit with accumulator, a register file, local memories MEM A and MEM B with address generators, row/column bus read and write ports (RBR/RBW, CBR/CBW), a controller, and a memory interface]
The "Rank-1" Machine

[Figure: the update C_i += A_i,p x B_p is computed as a sequence of rank-1 updates, each adding the outer product of an nr-long column of A with an nr-long row of B]

C += A x B
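As a toy functional model of how the PE array executes one such rank-1 update (a sketch, not the LAC hardware description; nr = 4 and the function name lac_rank1_step are assumed): the column of A is broadcast along the row buses, the row of B along the column buses, and every PE performs one multiply-accumulate into the element of C it holds locally.

#define NR 4   /* illustrative PE array dimension */

/* One rank-1 update step on an NR x NR PE array: c[i][j] is held locally by PE(i,j). */
void lac_rank1_step(double c[NR][NR], const double a[NR], const double b[NR])
{
    for (int i = 0; i < NR; ++i)          /* a[i] arrives on PE row i's broadcast bus    */
        for (int j = 0; j < NR; ++j)      /* b[j] arrives on PE column j's broadcast bus */
            c[i][j] += a[i] * b[j];       /* one MAC per PE per cycle                    */
}

Repeating this step kc times over the packed panels reproduces the block-dot product of the earlier micro-kernel sketch.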
Core-Level Implementation

•  4x4 LAC
•  256 kB of local SRAM
•  @ 1 GHz
•  Running GEMM @ 45nm technology

                         W/mm^2   GFLOPS/mm^2   GFLOPS/W   Utilization
Cell SPE (SP)             0.4       6.4           16         83%
NVidia GTX480 SM (SP)     0.5       4.5           8.4        70%
NVidia GTX480 SM (DP)     0.5       2.05          4.1        70%
Intel Core (DP)           0.5       0.4           0.9        95%
ClearSpeed (DP)           0.02      0.3           13         80%
LAC (SP)                  0.2       20            104        95+%
LAC (DP)                  0.3       16            47         95+%
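A quick consistency check on the table (simple arithmetic, not additional data): the efficiency column is, up to rounding, the ratio of the two density columns, since GFLOPS/W = (GFLOPS/mm^2) / (W/mm^2). For the Cell SPE, 6.4 / 0.4 = 16 GFLOPS/W, exactly as tabulated; rows such as LAC (DP) (16 / 0.3 ≈ 53 vs. 47) differ slightly, presumably because the W/mm^2 entries are rounded.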
Design Trade-off Challenge

•  Some linear algebra data handling requirements
   –  Matrix transpose (SYRK)
   –  Vector transpose (matrix-vector multiply) (LU)
   –  Broadcast (GEMM)
•  Common sources of inefficiency in CPUs & GPUs
   –  Instruction handling
   –  Multi-ported register file
   –  Cache overheads: tags and coherency
•  What is the best solution for those data handling requirements?
Studied Register File Organizations

[Figure: block diagrams of the three register file organizations for 16 FPUs described below: (a) a flat multi-ported register file / bypass network, (b) a single 16-wide SIMD register file, (c) an array of four 4-wide SIMD units, each with its own RF/bypass network, plus load/store units and a memory interface]

•  Organizations for 16 FPUs
   –  (a) Flat multiple-ported register file
   –  (b) Wide SIMD with wide ports
   –  (c) Array of short SIMD units
•  No limits assumed on
   –  Load-store unit data movement and bandwidth
   –  Scalar pipeline (# of instructions that are executed)
Data Handling Trade-off in Core

•  GEMM [ASAP2011]
   –  Regular data movement
   –  Regular access pattern
   –  Easily mapped on most SIMD cores
•  SYRK [ASAP2012], [SBAC-PAD2012]
   –  Matrix transpose needed
   –  Data manipulation for transpose is a challenge
Memory Hierarchy of GEMM

•  Blocking in higher levels of memory is almost the same

[Figure: GEMM blocking mapped onto the memory hierarchy: an accumulator block of C at the register level, panels A_p and B_p,j at the local store level (LAC), blocks A_p, B_p, C_i in on-chip memory, and the full matrices in main memory; the same picture applies to 1 x w and h x w SIMD organizations, with L1/L2/L3 caches (CPUs) in place of the local store]
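For orientation, the blocking in this figure corresponds to three more loops wrapped around the macro-kernel sketched earlier, in the usual GotoBLAS/BLIS style. The sketch below is illustrative only: the block sizes NC/KC/MC, the packing helpers pack_A and pack_B, and the assumption that all dimensions divide evenly are choices made for the example, not values from the slides.

#include <stddef.h>

#define NC 4096   /* illustrative block sizes */
#define KC 256
#define MC 96

/* Assumed helpers: pack_B copies a KC x NC block of B into NR-column panels,
   pack_A copies an MC x KC block of A into MR-row panels, and macro_kernel is
   the two-loop inner kernel sketched earlier. All matrices are row-major. */
void pack_A(int mc, int kc, const double *A, int lda, double *A_packed);
void pack_B(int kc, int nc, const double *B, int ldb, double *B_packed);
void macro_kernel(int nc, int mc, int kc, const double *A_packed,
                  const double *B_packed, double *C, int ldc);

void gemm_blocked(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc,
                  double *A_packed, double *B_packed)
{
    for (int jc = 0; jc < n; jc += NC)            /* block of B sized for on-chip memory */
        for (int pc = 0; pc < k; pc += KC) {
            pack_B(KC, NC, &B[(size_t)pc * ldb + jc], ldb, B_packed);
            for (int ic = 0; ic < m; ic += MC) {  /* block of A sized for the local store */
                pack_A(MC, KC, &A[(size_t)ic * lda + pc], lda, A_packed);
                macro_kernel(NC, MC, KC, A_packed, B_packed,
                             &C[(size_t)ic * ldc + jc], ldc);
            }
        }
}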
  
GEMM	
  Register-­‐level	
  Opera#ons	
  
•  Accumulator	
  
delay	
  
mul#plies	
  the	
  
number	
  of	
  
registers	
  
needed	
  in	
  core	
  
9/8/13	
   14	
  
Ai+dM-1,k Bk,j
cij
Ci+1,j
Blocks of Ci,j
1-Stream Out
2-Current
3-Prefetch
+=
mc
nr
Ci+1,j
Ci+2,j
Ci+d-1,j
dA
dM
dA
©	
  A.	
  Pedram	
  
C A Bx+=
+=
+=
+=
x
x
+= xnr
nr nr
w
dA
w x dA
w
h
w x dA
LAP
1 x w SIMD
h x w SIMD
Ai,p in Local Store of PEs Bp,j Bp,j+1
Blocks of Ci,j
1-Stream Out
2-Current
3-Prefetch
+=
x
kc
mc kc
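Why the accumulator delay matters for the register count (a back-of-the-envelope illustration, not a formula taken from the slides): if the floating-point accumulation latency is d cycles, a single running sum can accept a new addend only every d cycles, so roughly d independent accumulations must be interleaved per FPU to sustain one MAC per cycle. The register-resident part of C therefore grows from one nr x nr block to about d of them; with nr = 4 and d = 4, that is 4 x 16 = 64 elements of C held in registers instead of 16.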
Non-Blocked nr x nr SYRK on LAC

•  LAC takes advantage of the broadcast buses
•  Overlaps computation with the communication needed for transposition
•  Least operation count as part of a bigger SYRK problem
•  Simple control
   –  3-stage state machine in PEs

[Figure: on the 4×4 PE array, column a_0 is broadcast (state S1), then the rank-1 update C += a_0 a_0^T is performed while the next column a_1 is broadcast]
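Functionally, the step shown in the figure is the symmetric rank-1 update sketched below (illustrative only: nr = 4 is assumed, and only the lower triangle is updated, as SYRK requires). On the LAC, the same column a reaches PE(i,j) over both the row and the column broadcast buses, so each PE can form a[i] * a[j] locally.

#define NR 4   /* illustrative PE array dimension */

/* Symmetric rank-1 update C += a * a^T on an NR x NR block (lower triangle only). */
void syrk_rank1_step(double c[NR][NR], const double a[NR])
{
    for (int i = 0; i < NR; ++i)
        for (int j = 0; j <= i; ++j)      /* j <= i: lower triangle, as SYRK requires      */
            c[i][j] += a[i] * a[j];       /* PE(i,j) multiplies its row and column operands */
}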
Power and Performance Analysis

•  Analytical formulae
   –  Utilization
   –  Bandwidth
   –  Size of local stores
   –  Size of register file
   –  # of ports of memory levels
•  Component selections
   –  Storage model [CACTI 6.5]
      •  Pure SRAM model for register file
      •  Cache models for L1 & L2
•  Configuration
   –  1 GHz
   –  16 FPUs
   –  LAC
      •  256+32 KB aggregate local store
   –  Register file organizations
      •  256 KB L2, 64B line, 8-way
      •  32 KB L1, 64B line, 8-way
      •  Register file size variable
   –  Aimed for peak performance
GEMM Power and Area

•  SIMD architectures
   –  Effectively reduce the power consumption
•  LAC is the best; it uses no caches
•  The flexibility of flat register files is of no use for GEMM
•  A 4×4-wide array of SIMD requires less area
•  Most of the area is used by caches

[Charts: area (mm^2) and energy (W/GFLOP) for GEMM, broken down into SRAM, L2 cache, L1 cache, and RF, for the FLAT, wide SIMD, 4×4 SIMD array, and LAC organizations]
SYRK Power and Area

•  Wide SIMD takes many cycles to transpose
•  The 4×4-wide SIMD array has the best efficiency
   –  Less power than FLAT
   –  Higher utilization than wide SIMD
•  A 4×4-wide array of SIMD requires less area
•  Most of the area is used by caches

[Charts: area (mm^2) and energy (W/GFLOP) for SYRK, broken down into SRAM, L2 cache, L1 cache, and RF, for the FLAT, wide SIMD, 4×4 SIMD array, and LAC organizations]
Conclusion

•  Architectures studied
   •  Linear Algebra Core (LAC)
   •  FLAT register file
   •  1D wide SIMD
   •  4×4-wide array of SIMD
•  Algorithm/architecture co-design
   •  GEMM (emphasis on data bandwidth to/from the core)
   •  SYRK (emphasis on data manipulation in the core)
•  Analytical formulae for different layers of the memory hierarchy
•  Power and efficiency estimation
Linear Algebra Core

•  Results @ 1 GHz
   –  DP: 32 GFLOPS, 47 GFLOPS/W
   –  0.6 Watts
   –  2.8 mm^2 in 45nm
   –  4 GB/s external BW
   –  Orders of magnitude improvement
Ø  On-going and future work
   –  Generalization: full Level-3 BLAS in hardware
   –  Implementation: synthesis and tape-out
Questions?
