Systolic Array
https://www.appliedimage.com/reference-info/using-nbs-1010a-resolution-test-target/
Dark Silicon
https://www.publicdomainpictures.net/en/view-image.php?image=44607&picture=portrait-of-the-dark-sides-man
Systolic Array
A "pulse array"?!
Systolic Array
• Easy to describe in a software language.

• Easy to program with some kind of domain-specific
language.

• Elegant

• Layout friendly
Memory Bandwidth
grows slowly
Spec | Year | MB/s
DDR  | 2000 | 2667
DDR2 | 2003 | 5333
DDR3 | 2007 | 12800
DDR4 | 2014 | 19200
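A quick way to see how slowly bandwidth grows is to fit a doubling time to the table above (a sketch using only the peak MB/s figures quoted on the slide):

```python
import math

# Peak transfer rates from the table above (spec year -> MB/s).
rates = {2000: 2667, 2003: 5333, 2007: 12800, 2014: 19200}

years = sorted(rates)
span = years[-1] - years[0]                      # 14 years
factor = rates[years[-1]] / rates[years[0]]      # ~7.2x overall growth
doubling_years = span * math.log(2) / math.log(factor)
print(f"{factor:.1f}x over {span} years "
      f"=> doubles roughly every {doubling_years:.1f} years")
```

Compute throughput has historically grown much faster than a doubling every ~5 years, which is why the next slides focus on increasing operations per I/O.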
Increasing Operations / IO
H. T. Kung 1982
Convolution Problem
t | broadcast | updates (ϵ marks don't-care cells)
1 | x1 | y1 = y1+w1x1; ϵ = ϵ+w2x1; ϵ = ϵ+w3x1; ϵ = ϵ+w4x1
2 | x2 | y2 = y2+w1x2; y1 = y1+w2x2; ϵ = ϵ+w3x2; ϵ = ϵ+w4x2
3 | x3 | y3 = y3+w1x3; y2 = y2+w2x3; y1 = y1+w3x3; ϵ = ϵ+w4x3
4 | x4 | y4 = y4+w1x4; y3 = y3+w2x4; y2 = y2+w3x4; y1 = y1+w4x4
5 | x5 | y5 = y5+w1x5; y4 = y4+w2x5; y3 = y3+w3x5; y2 = y2+w4x5
output: y1 = w1x1 + w2x2 + w3x3 + w4x4
time (broadcast)
space (move)
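The schedule above computes y_i = w1·x_i + w2·x_{i+1} + w3·x_{i+2} + w4·x_{i+3}: each broadcast x updates one y per weight cell. A minimal software sketch of that schedule (0-based indices; negative indices play the role of the ϵ don't-care cells):

```python
# Simulate the broadcast schedule: at time t, x[t] is broadcast and
# the cell holding weight w[j] accumulates into y[t - j]; cells whose
# index falls outside [0, n) are the don't-care ϵ cells on the slide.
def conv(w, x):
    n = len(x)
    y = [0] * n
    for t, xt in enumerate(x):          # time step: broadcast x[t]
        for j, wj in enumerate(w):      # cell j holds weight w[j]
            i = t - j                   # cell j is accumulating y[t - j]
            if 0 <= i < n:              # skip the ϵ (don't-care) cells
                y[i] += wj * xt
    return y

print(conv([1, 2, 3, 4], [1, 1, 1, 1, 1]))   # [10, 10, 6, 3, 1]
```

With all-ones input, y[0] = 1+2+3+4 = 10, matching a direct evaluation of y1 = w1x1 + w2x2 + w3x3 + w4x4.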
H. T. Kung 1982
Better precision of summation,
if the MAC has more digits than the bus.

Requires a separate bus for collecting
output from individual cells.
Could be a pipelined adder tree.
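An adder tree reduces the per-cell outputs in log depth; a minimal software sketch of the reduction order (pipelining itself is a hardware property, not shown):

```python
# Tree reduction: pairwise sums, ceil(log2(n)) levels deep. In hardware,
# each level is one pipeline stage, so a new set of cell outputs can
# enter the tree every cycle.
def adder_tree(values):
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:              # pad odd-sized levels with a zero
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

print(adder_tree([1, 2, 3, 4, 5]))      # 15, same result as sum()
```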
Without global data communication
Better precision of summation (same as B2)
Systolic output path (or use the next row in 2D)
Nodes are activated half of the time.
[Figure: x1–x4 and w1–w3 streaming through the array over cycles 0–6.]
Half of the nodes are activated at any given time.
Without global data communication
1 node / cycle
1 node / 2 cycles
Register to keep w
Better precision of summation 

[Figure: the w1–w3 and x1–x8 streams moving through the array in opposite directions; outputs y4 and y5 being accumulated.]
Sorting
Odd-Even Transposition Sort
Active: Compare & Swap
O((n/k) log(n/k)) + O(k · (n/k))
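Odd-even transposition sort maps directly onto a linear systolic array: each round, neighbouring cells compare and swap in parallel, alternating between even and odd pairings. A software sketch:

```python
# Odd-even transposition sort: n rounds of local compare-&-swap between
# neighbours, alternating the (even, even+1) and (odd, odd+1) pairings.
# All comparisons in one round are independent, so a linear array of
# cells sorts n keys in n rounds with only nearest-neighbour links.
def odd_even_sort(a):
    a = list(a)
    n = len(a)
    for rnd in range(n):
        start = rnd % 2                 # alternate even/odd phases
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([5, 1, 4, 2, 3]))   # [1, 2, 3, 4, 5]
```

With k cells each holding n/k keys, sorting locally costs O((n/k) log(n/k)) and the transposition phases cost O(k · (n/k)) = O(n), matching the bound on the slide.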
H. T. Kung 1979
Finite Impulse
Response Filtering
In Matrix Form
H. T. Kung 1979
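In matrix form, FIR filtering is multiplication by a banded Toeplitz matrix whose off-band entries are the zeros shown on the slide. A sketch (pure Python, taps and input chosen for illustration):

```python
# FIR filtering as y = W x, where row i of W carries the taps w[0..k-1]
# starting at column i and zeros elsewhere (a banded Toeplitz matrix).
def fir_matrix(w, n):
    k = len(w)
    return [[w[j - i] if 0 <= j - i < k else 0 for j in range(n)]
            for i in range(n)]

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

w, x = [1, 2, 3], [1, 0, 0, 1]
print(matvec(fir_matrix(w, len(x)), x))   # [1, 3, 2, 1]
```

This is the same computation as the earlier convolution schedule, y_i = Σ_j w_j · x_{i+j}, just written as a matrix-vector product.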
Priority Queue
• insert()

• delete()

• extract_min()
Priority Queue Operations
For n operations:
O(n log n) (heap) vs. O(n) (systolic array).
Key: one operation can be issued right after
another, i.e. O(1) time per operation.
Priority Queue Operations
insert(k):
Sink down the element with key k.

delete(k):
A) Sink down a fake element with key k
to find the target.
B) Remove the target.
C) Bubble up the elements below.

extract_min():
A) Take the first element.
B) Bubble up the elements below.
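The steps above can be sketched as a linear array of cells holding a sorted sequence, where every operation only triggers nearest-neighbour compare-&-swap steps (class and method names here are illustrative, not from the slides):

```python
# Linear-array priority queue sketch: cell 0 always holds the minimum,
# and each operation propagates cell-to-cell, so the host can issue one
# operation per cycle (O(1) issue rate) even while an earlier value is
# still sinking toward its final cell.
class SystolicPQ:
    def __init__(self):
        self.cells = []

    def insert(self, k):
        self.cells.insert(0, k)         # enter at the front...
        i = 0                           # ...then sink right, one swap per step
        while i + 1 < len(self.cells) and self.cells[i] > self.cells[i + 1]:
            self.cells[i], self.cells[i + 1] = self.cells[i + 1], self.cells[i]
            i += 1

    def extract_min(self):
        return self.cells.pop(0)        # neighbours bubble up one cell left

    def delete(self, k):
        self.cells.remove(k)            # sink a probe to find k, then close the gap

pq = SystolicPQ()
for k in [5, 3, 8, 1]:
    pq.insert(k)
print(pq.extract_min(), pq.extract_min())   # 1 3
```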
Recurrence Evaluation
x_i = R(x_{i-1}, …, x_{i-k})
x_i = a·x_{i-1} + b·x_{i-2} + c·x_{i-k} + d
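A direct (sequential) evaluation of this recurrence, with illustrative values k = 3 and a = b = c = 1, d = 0 (the slide leaves the coefficients and k unspecified):

```python
# Evaluate x_i = a*x_{i-1} + b*x_{i-2} + c*x_{i-k} + d, given the first
# k values. Each step depends on earlier outputs -- the loop-carried
# dependence that makes recurrences hard to parallelize naively.
def recurrence(x0, steps, a=1, b=1, c=1, d=0, k=3):
    xs = list(x0)                       # the k initial values
    for _ in range(steps):
        xs.append(a * xs[-1] + b * xs[-2] + c * xs[-k] + d)
    return xs

print(recurrence([1, 1, 1], steps=4))   # [1, 1, 1, 3, 5, 9, 17]
```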
Removing Loops
Alternatives
Cloud TPU
Google Cloud Platform Blog 

https://cloud.google.com/tpu/
TPU v2 / TPU v3
TPU V2 Pod
TPU Programming
• A Cloud TPU has 4 chips x 2
cores x 1 or 2 MXUs.

• MXU

• 128x128 systolic array

• 16K MACs / cycle

• bfloat16

• TPU memory prefers 8-byte
alignment.

• 8 or 16 GB HBM2 per core
https://cloud.google.com/tpu/docs/tpus

https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/
Titan X has 3.5K CUDA cores
So, each TPU v3 card has
4 chips x 2 cores x 2 MXUs x 16K MAC / cycle
= 256K MAC / cycle at most.
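The arithmetic checks out, taking 16K to mean the 128x128 = 16384 MACs of one MXU:

```python
# Peak MAC/cycle for one TPU v3 card, from the figures above.
chips, cores, mxus = 4, 2, 2
macs_per_mxu = 128 * 128                # 16384 = "16K"
peak = chips * cores * mxus * macs_per_mxu
print(peak, peak // 1024)               # 262144 MACs = 256K per cycle
```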
https://cloud.google.com/tpu/docs/system-architecture
TPU Programming
• XLA compiler for TensorFlow programs.

• Tiling => needs reshapes

• Shapes => no dynamic batch size

• Padding => underutilizes the TPU, uses more memory

• op_profile tool
TPU Programming
• Dense vector and matrix computations are fast:

• M x M, M x v, convolution

• Data movement over PCIe is slow.

• Only the dense parts of the model, the loss, and the gradient subgraphs run on the TPU.

• I/O, reading data, writing checkpoints, and preprocessing data run on the CPU:

• decoding compressed images, random sampling/cropping, assembling training minibatches

• Non-matrix operations (add, reshape, concatenate) are unlikely to achieve high MXU utilization.

• Feature dimension => multiple of 128

• Batch dimension => multiple of 8
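A rough estimate of the utilization lost to padding when a (batch, features) matrix is rounded up to these multiples (the exact tiling is an XLA implementation detail, so this is only an approximation):

```python
# Fraction of MXU work that is useful after padding batch up to a
# multiple of 8 and features up to a multiple of 128.
def pad_to(n, m):
    return -(-n // m) * m               # ceil(n / m) * m

def utilization(batch, features):
    padded = pad_to(batch, 8) * pad_to(features, 128)
    return batch * features / padded

print(f"{utilization(100, 100):.2f}")   # ~0.75: a quarter of the work wasted
print(f"{utilization(128, 128):.2f}")   # 1.00: well-shaped, no padding
```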
TPUEstimator
• TPUEstimator provides a graph operator to build and run a replicated computation.
https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimator
Module: tf.contrib.tpu
https://www.tensorflow.org/api_docs/python/tf/contrib/tpu
Affinity
https://en.wikipedia.org/wiki/The_Boss_Baby

Why Systolic Architectures?