Systolic Array
https://www.appliedimage.com/reference-info/using-nbs-1010a-resolution-test-target/
Dark Silicon
https://www.publicdomainpictures.net/en/view-image.php?image=44607&picture=portrait-of-the-dark-sides-man
Systolic Array
A "pulse array"?!
Systolic Array
• Easy to describe in a software language.

• Easy to program with some kind of domain-specific
language.

• Elegant

• Layout friendly
Memory Bandwidth
grows slowly
Spec | Year | MB/s
DDR  | 2000 | 2667
DDR2 | 2003 | 5333
DDR3 | 2007 | 12800
DDR4 | 2014 | 19200
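A quick way to see how slowly bandwidth grows is to fit a doubling time to the table above (a sketch using only the peak MB/s figures quoted on the slide):

```python
import math

# Peak transfer rates from the table above (spec year -> MB/s).
rates = {2000: 2667, 2003: 5333, 2007: 12800, 2014: 19200}

years = sorted(rates)
span = years[-1] - years[0]                      # 14 years
factor = rates[years[-1]] / rates[years[0]]      # ~7.2x overall growth
doubling_years = span * math.log(2) / math.log(factor)
print(f"{factor:.1f}x over {span} years "
      f"=> doubles roughly every {doubling_years:.1f} years")
```

Compute throughput has historically grown much faster than a doubling every ~5 years, which is why the next slides focus on increasing operations per I/O.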
Increasing Operations / IO
H. T. Kung 1982
Convolution Problem
t | broadcast | updates (ϵ marks don't-care cells)
1 | x1 | y1 = y1+w1x1; ϵ = ϵ+w2x1; ϵ = ϵ+w3x1; ϵ = ϵ+w4x1
2 | x2 | y2 = y2+w1x2; y1 = y1+w2x2; ϵ = ϵ+w3x2; ϵ = ϵ+w4x2
3 | x3 | y3 = y3+w1x3; y2 = y2+w2x3; y1 = y1+w3x3; ϵ = ϵ+w4x3
4 | x4 | y4 = y4+w1x4; y3 = y3+w2x4; y2 = y2+w3x4; y1 = y1+w4x4
5 | x5 | y5 = y5+w1x5; y4 = y4+w2x5; y3 = y3+w3x5; y2 = y2+w4x5
output: y1 = w1x1 + w2x2 + w3x3 + w4x4
time (broadcast)
space (move)
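The schedule above computes y_i = w1·x_i + w2·x_{i+1} + w3·x_{i+2} + w4·x_{i+3}: each broadcast x updates one y per weight cell. A minimal software sketch of that schedule (0-based indices; negative indices play the role of the ϵ don't-care cells):

```python
# Simulate the broadcast schedule: at time t, x[t] is broadcast and
# the cell holding weight w[j] accumulates into y[t - j]; cells whose
# index falls outside [0, n) are the don't-care ϵ cells on the slide.
def conv(w, x):
    n = len(x)
    y = [0] * n
    for t, xt in enumerate(x):          # time step: broadcast x[t]
        for j, wj in enumerate(w):      # cell j holds weight w[j]
            i = t - j                   # cell j is accumulating y[t - j]
            if 0 <= i < n:              # skip the ϵ (don't-care) cells
                y[i] += wj * xt
    return y

print(conv([1, 2, 3, 4], [1, 1, 1, 1, 1]))   # [10, 10, 6, 3, 1]
```

With all-ones input, y[0] = 1+2+3+4 = 10, matching a direct evaluation of y1 = w1x1 + w2x2 + w3x3 + w4x4.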
H. T. Kung 1982
Better precision of summation,
if the MAC has more digits than the bus.

Requires a separate bus for collecting
output from individual cells.
Could be a pipelined adder tree.
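An adder tree reduces the per-cell outputs in log depth; a minimal software sketch of the reduction order (pipelining itself is a hardware property, not shown):

```python
# Tree reduction: pairwise sums, ceil(log2(n)) levels deep. In hardware,
# each level is one pipeline stage, so a new set of cell outputs can
# enter the tree every cycle.
def adder_tree(values):
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:              # pad odd-sized levels with a zero
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

print(adder_tree([1, 2, 3, 4, 5]))      # 15, same result as sum()
```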
Without global data communication
Better precision of summation (same as B2)
Systolic output path (or use the next row in 2D)
Nodes are activated half of the time.
[Figure: x1–x4 and w1–w3 streaming through the array over cycles 0–6.]
Half of the nodes are activated at any given time.
Without global data communication
1 node / cycle
1 node / 2 cycles
Register to keep w
Better precision of summation 

[Figure: the w1–w3 and x1–x8 streams moving through the array in opposite directions; outputs y4 and y5 being accumulated.]
Sorting
Odd-Even Transposition Sort
Active: Compare & Swap
O((n/k) log(n/k)) + O(k · (n/k))
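Odd-even transposition sort maps directly onto a linear systolic array: each round, neighbouring cells compare and swap in parallel, alternating between even and odd pairings. A software sketch:

```python
# Odd-even transposition sort: n rounds of local compare-&-swap between
# neighbours, alternating the (even, even+1) and (odd, odd+1) pairings.
# All comparisons in one round are independent, so a linear array of
# cells sorts n keys in n rounds with only nearest-neighbour links.
def odd_even_sort(a):
    a = list(a)
    n = len(a)
    for rnd in range(n):
        start = rnd % 2                 # alternate even/odd phases
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([5, 1, 4, 2, 3]))   # [1, 2, 3, 4, 5]
```

With k cells each holding n/k keys, sorting locally costs O((n/k) log(n/k)) and the transposition phases cost O(k · (n/k)) = O(n), matching the bound on the slide.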
H. T. Kung 1979
Finite Impulse
Response Filtering
In Matrix Form
H. T. Kung 1979
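In matrix form, FIR filtering is multiplication by a banded Toeplitz matrix whose off-band entries are the zeros shown on the slide. A sketch (pure Python, taps and input chosen for illustration):

```python
# FIR filtering as y = W x, where row i of W carries the taps w[0..k-1]
# starting at column i and zeros elsewhere (a banded Toeplitz matrix).
def fir_matrix(w, n):
    k = len(w)
    return [[w[j - i] if 0 <= j - i < k else 0 for j in range(n)]
            for i in range(n)]

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

w, x = [1, 2, 3], [1, 0, 0, 1]
print(matvec(fir_matrix(w, len(x)), x))   # [1, 3, 2, 1]
```

This is the same computation as the earlier convolution schedule, y_i = Σ_j w_j · x_{i+j}, just written as a matrix-vector product.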
Priority Queue
• insert()

• delete()

• extract_min()
Priority Queue Operations
For n operations:
O(n log n) (heap) vs. O(n) (systolic array).
Key: one operation can be issued right after
another, i.e. O(1) time per operation.
Priority Queue Operations
insert(k):
Sink down the element with key k.

delete(k):
A) Sink down a fake element with key k
to find the target.
B) Remove the target.
C) Bubble up the elements below.

extract_min():
A) Take the first element.
B) Bubble up the elements below.
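The steps above can be sketched as a linear array of cells holding a sorted sequence, where every operation only triggers nearest-neighbour compare-&-swap steps (class and method names here are illustrative, not from the slides):

```python
# Linear-array priority queue sketch: cell 0 always holds the minimum,
# and each operation propagates cell-to-cell, so the host can issue one
# operation per cycle (O(1) issue rate) even while an earlier value is
# still sinking toward its final cell.
class SystolicPQ:
    def __init__(self):
        self.cells = []

    def insert(self, k):
        self.cells.insert(0, k)         # enter at the front...
        i = 0                           # ...then sink right, one swap per step
        while i + 1 < len(self.cells) and self.cells[i] > self.cells[i + 1]:
            self.cells[i], self.cells[i + 1] = self.cells[i + 1], self.cells[i]
            i += 1

    def extract_min(self):
        return self.cells.pop(0)        # neighbours bubble up one cell left

    def delete(self, k):
        self.cells.remove(k)            # sink a probe to find k, then close the gap

pq = SystolicPQ()
for k in [5, 3, 8, 1]:
    pq.insert(k)
print(pq.extract_min(), pq.extract_min())   # 1 3
```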
Recurrence Evaluation
x_i = R(x_{i-1}, …, x_{i-k})
x_i = a·x_{i-1} + b·x_{i-2} + c·x_{i-k} + d
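A direct (sequential) evaluation of this recurrence, with illustrative values k = 3 and a = b = c = 1, d = 0 (the slide leaves the coefficients and k unspecified):

```python
# Evaluate x_i = a*x_{i-1} + b*x_{i-2} + c*x_{i-k} + d, given the first
# k values. Each step depends on earlier outputs -- the loop-carried
# dependence that makes recurrences hard to parallelize naively.
def recurrence(x0, steps, a=1, b=1, c=1, d=0, k=3):
    xs = list(x0)                       # the k initial values
    for _ in range(steps):
        xs.append(a * xs[-1] + b * xs[-2] + c * xs[-k] + d)
    return xs

print(recurrence([1, 1, 1], steps=4))   # [1, 1, 1, 3, 5, 9, 17]
```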
Removing Loops
Alternatives
Cloud TPU
Google Cloud Platform Blog 

https://cloud.google.com/tpu/
TPU v2 / TPU v3
TPU V2 Pod
TPU Programming
• A Cloud TPU has 4 chips x 2
cores x 1 or 2 MXUs.

• MXU

• 128x128 systolic array

• 16K MACs / cycle

• bfloat16

• TPU memory prefers 8-byte
alignment.

• 8 or 16 GB HBM2 per core
https://cloud.google.com/tpu/docs/tpus

https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/
Titan X has 3.5K CUDA cores
So, each TPU v3 card has
4 chips x 2 cores x 2 MXUs x 16K MAC / cycle
= 256K MAC / cycle at most.
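The arithmetic checks out, taking 16K to mean the 128x128 = 16384 MACs of one MXU:

```python
# Peak MAC/cycle for one TPU v3 card, from the figures above.
chips, cores, mxus = 4, 2, 2
macs_per_mxu = 128 * 128                # 16384 = "16K"
peak = chips * cores * mxus * macs_per_mxu
print(peak, peak // 1024)               # 262144 MACs = 256K per cycle
```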
https://cloud.google.com/tpu/docs/system-architecture
TPU Programming
• XLA compiler for TensorFlow programs.

• Tiling => needs reshapes

• Shapes => no dynamic batch size

• Padding => underutilizes the TPU, uses more memory

• op_profile tool
TPU Programming
• Dense vector and matrix computations are fast:

• M x M, M x v, convolution

• Data movement over PCIe is slow.

• Only the dense parts of the model, the loss, and the gradient subgraphs run on the TPU.

• I/O, reading data, writing checkpoints, and preprocessing data run on the CPU:

• decoding compressed images, random sampling/cropping, assembling training minibatches

• Non-matrix operations (add, reshape, concatenate) are unlikely to achieve high MXU utilization.

• Feature dimension => multiple of 128

• Batch dimension => multiple of 8
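A rough estimate of the utilization lost to padding when a (batch, features) matrix is rounded up to these multiples (the exact tiling is an XLA implementation detail, so this is only an approximation):

```python
# Fraction of MXU work that is useful after padding batch up to a
# multiple of 8 and features up to a multiple of 128.
def pad_to(n, m):
    return -(-n // m) * m               # ceil(n / m) * m

def utilization(batch, features):
    padded = pad_to(batch, 8) * pad_to(features, 128)
    return batch * features / padded

print(f"{utilization(100, 100):.2f}")   # ~0.75: a quarter of the work wasted
print(f"{utilization(128, 128):.2f}")   # 1.00: well-shaped, no padding
```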
TPUEstimator
• TPUEstimator provides a graph operator to build and run a replicated computation.
https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimator
Module: tf.contrib.tpu
https://www.tensorflow.org/api_docs/python/tf/contrib/tpu
Affinity
https://en.wikipedia.org/wiki/The_Boss_Baby

Why Systolic Architectures?