Slide 1
Lecture: Eyeriss Dataflow
• Topics: Eyeriss architecture and dataflow (digital CNN accelerator)
We had previously seen basic ANN accelerators that used tiling/buffers/NFUs (DianNao), multiple
chips to distribute the work and eliminate memory accesses (DaDianNao), and a few examples that
exploited sparsity (Deep Compression, EIE, Cnvlutin). Now, we’ll examine a spatial architecture
that exploits various dataflows to reduce energy overheads.
Spatial architectures get inputs from their neighbors, thus helping computations that exhibit
producer-consumer relationships and/or data reuse. The next slide shows the overall Eyeriss
architecture. The chip has a global buffer that holds a few hundred kilobytes of data. It then has an
array of processing elements (PEs), connected by a mesh network.
Each PE has a single MAC unit and a register file (that can store a few hundred values). There may
also be FIFO input/output buffers within and between PEs, but we won’t discuss those today since
they aren’t essential.
Slide 2
Dataflow Optimizations
Data can be fetched from 4 different locations, each with varying energy cost: off-chip DRAM, global
buffer, a neighboring PE’s regfile, and local regfile. The key here is to find a dataflow that can reduce
the overall energy cost for a large CNN computation.
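This tradeoff can be made concrete with a toy cost model. The ~1x/2x/6x/200x ratios below are the normalized access energies commonly quoted from the Eyeriss paper (regfile : neighbor PE : global buffer : DRAM); treat them as illustrative assumptions, not measured numbers.

```python
# Toy energy model for comparing dataflows (ratios are the normalized
# access costs commonly quoted from the Eyeriss paper; illustrative only).
ENERGY = {"regfile": 1, "neighbor_pe": 2, "global_buffer": 6, "dram": 200}

def dataflow_energy(access_counts):
    """Total energy given a count of accesses at each storage level."""
    return sum(ENERGY[level] * n for level, n in access_counts.items())

# Fetching a value once from DRAM and then reusing it 100 times from the
# local regfile beats re-reading it 100 times from the global buffer:
with_reuse = dataflow_energy({"dram": 1, "regfile": 100})  # 200 + 100 = 300
no_reuse = dataflow_energy({"global_buffer": 100})         # 6 * 100 = 600
```

The rest of the lecture is essentially a search for loop orders that push accesses toward the cheap end of this table.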
Slide 4
One Primitive
Let’s first look at a single PE in Eyeriss, focusing on a single 1D convolution (or the
computation required by 1 row of a 2D convolution). This is defined as one primitive, and one PE is
responsible for one primitive. Before the computation starts, the PE loads its register file with 1 row
of kernel weights (size R) and 1 row of an input feature map (size H). In the example above, a 3-entry
kernel is applied on a 5-entry row. 9 computations are performed sequentially to produce 3 output
pixels. The resulting outputs are only partial sums (the pink boxes labeled 1, 2, 3). This computation
has a lot of reuse (kernel and ifmap), and all of that reuse is exploited through the regfile.
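The primitive is easy to express in code. This sketch is a functional model (not the PE hardware) that applies a 3-entry kernel row to a 5-entry ifmap row, matching the example above:

```python
def conv1d_primitive(ifmap_row, filter_row):
    """One Eyeriss primitive: a 1D convolution done sequentially on one MAC."""
    R = len(filter_row)                      # kernel row size (R = 3 here)
    psums, macs = [], 0
    for i in range(len(ifmap_row) - R + 1):  # one partial sum per position
        acc = 0
        for r in range(R):                   # R MACs per output pixel
            acc += ifmap_row[i + r] * filter_row[r]
            macs += 1
        psums.append(acc)
    return psums, macs

psums, macs = conv1d_primitive([1, 2, 3, 4, 5], [1, 0, -1])
# 3 partial sums produced by 9 sequential MACs, as in the 3-on-5 example
```

The same function applied to a 64-entry row and 3 weights yields 62 partial sums from 186 MACs, the counts used on the next slide.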
Slide 5
One Primitive
Consider a larger example. Take a row of the input feature map with 64 elements. Consider one row
of the kernel with 3 elements. Once these 67 elements are in the regfile, we perform computations
that produce 62 partial sum outputs. This is done with 186 MAC operations. The register file may or
may not have to store these partial sums (depending on how bypassing is done).
This is called row-stationary, i.e., a row sits in a PE and performs all required computations. The
computations are done sequentially on a single MAC unit.
Slide 6
Dataflow for a 2D Convolution
Now let’s carefully go through the 2D convolution. Let’s discuss “where” first; we’ll later discuss
“when”. See the figures on the next slide to follow along.
If the 2D convolution involves a 3x3 kernel, we’ll partition it into 3 “primitives”, with each primitive
responsible for 1 row of the input feature map and 1 row of the kernel (btw, remember that we use
the terms kernel/filter/weights interchangeably). Take the first column of PEs. PE1,1 receives filter
row 1 and input feature map row 1. They produce partial sums for the first row of the output feature
map. Similarly, PE2,1 and PE3,1 deal with rows 2 and 3 of the kernel/ifmap. The partial sums for all
three PEs must be added to produce the first row of the ofmap (see the red dataflow).
Next, to produce the second row of the ofmap, we’ll move to the second column of PEs. To produce
the second row, the following rows must be convolved: ifmap-row2 and filter-row1; ifmap-row3 and
filter-row2; ifmap-row4 and filter-row3. To make sure the right rows collide at the second column of
PEs, the ifmap and filter rows follow the blue and green arrows, as shown in the figure. And this
process continues.
To recap, inputs arrive at the first column. Each PE does a row’s worth of work (1 primitive). The
partial sums in the column are added to produce the first row of the ofmap. Then, the ifmap and
filter rows shift diagonally and laterally to the second column. The partial sums of the second column
are aggregated to produce the second row of the ofmap. If we are working with a 64x64 ifmap and a
3x3 filter, we’ll need a 62x3 array of PEs.
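The whole mapping can be checked with a small functional model (my sketch of the row-stationary idea, not the chip’s RTL): the PE holding filter row r in column j convolves it with ifmap row j + r, and summing down column j yields ofmap row j.

```python
def conv1d(row, krow):
    """One primitive: 1D convolution of an ifmap row with a kernel row."""
    R = len(krow)
    return [sum(row[i + r] * krow[r] for r in range(R))
            for i in range(len(row) - R + 1)]

def conv2d_row_stationary(ifmap, kernel):
    K = len(kernel)                     # 3 kernel rows -> 3 PEs per column
    n_cols = len(ifmap) - K + 1         # 62 columns for a 64-row ifmap
    ofmap = []
    for j in range(n_cols):             # column j produces ofmap row j
        psums = [conv1d(ifmap[j + r], kernel[r]) for r in range(K)]
        ofmap.append([sum(vals) for vals in zip(*psums)])  # add down the column
    return ofmap

out = conv2d_row_stationary([[1] * 64 for _ in range(64)],
                            [[1] * 3 for _ in range(3)])
# out is a 62x62 ofmap; with all-ones inputs every entry is 9
```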
Slide 7
Dataflow for a 2D Convolution
The next slide also shows the required computations for our 2D convolution example. We’ll refer to
the first column and the last row of the PE array as “edge PEs”. To perform their computation, these
PEs must first receive 64 ifmap elements and 3 filter elements. The 64 ifmap elements must be
fetched from the global buffer. We next perform 186 MAC operations, each requiring regfile reads
and regfile writes (for partial sums). When the final result is produced, it is sent to the neighboring
PE. In non-edge PEs, the computations are similar. The key difference is that the 64 ifmap elements
are read from the neighboring PE, not from the global buffer. We see here that most reads are from
the local regfile or from a neighboring regfile; a few accesses are from the global buffer. We’ve thus
created a dataflow that again exploits locality and reuse. This is in addition to the reuse we’ve already
exploited within 1 primitive.
Note that eventually, we want to do a 4D computation (lots of ifmaps, lots of ofmaps, multiple
images). We did this by first doing a 1D computation efficiently in one PE. We then set up a grid of
PEs on a 2D chip to process a 2D convolution efficiently. Unfortunately, we can’t have a 3D chip, so
we have to stop with this 2D convolution primitive. To perform the 4D computation, we have to make
repeated calls to a 2D convolution primitive with the following nested loops:
for (images 1..20)
  for (ofmaps 1..8)
    for (ifmaps 1..4)
      perform 2D convolution
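A runnable version of those nested loops, using the slide-9 sizes (20-image batch, 8 ofmaps, 4 ifmaps) as an assumed example; conv2d here is just a stand-in for the hardware’s 2D-convolution primitive. The point is that the partial sums from the ifmap loop accumulate into a single ofmap.

```python
def conv2d(ifmap, kernel):
    """Stand-in for the hardware 2D-convolution primitive."""
    K = len(kernel)
    n = len(ifmap) - K + 1
    return [[sum(ifmap[i + r][j + c] * kernel[r][c]
                 for r in range(K) for c in range(K))
             for j in range(n)] for i in range(n)]

def conv4d(images, kernels):
    """images[n][c] is a 2D ifmap; kernels[m][c] is a 2D filter."""
    outputs = []
    for n in range(len(images)):              # for (images 1..20)
        per_image = []
        for m in range(len(kernels)):         # for (ofmaps 1..8)
            ofmap = None
            for c in range(len(images[n])):   # for (ifmaps 1..4)
                psum = conv2d(images[n][c], kernels[m][c])
                ofmap = psum if ofmap is None else [
                    [a + b for a, b in zip(ra, rb)]
                    for ra, rb in zip(ofmap, psum)]   # aggregate partial sums
            per_image.append(ofmap)
        outputs.append(per_image)
    return outputs
```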
Slide 8
Dataflow for a 2D Convolution
Given these nested loops, we can compute the total number of DRAM/global buffer/PE/regfile
accesses and compute energy. Note that the outputs of each 2D-conv are now partial sums and have
to be aggregated as we walk through these nested loops. The compiler has to try different kinds of
nested loops to figure out the best way to reduce energy.
The entire 4D computation has to thus be carefully decomposed into multiple 2D convolutions (and
each 2D convolution is itself decomposed into multiple 1D convolutions, 1 per PE). Further, the 62x3
PE grid that we just constructed has to be mapped on to the actual PE grid that exists on the chip
(e.g., the fabricated Eyeriss chip has a 12x14 grid).
Having addressed the “where”, let’s now talk about “when”. To simplify the discussion, I said that one
column of PEs is fully processed before we move to the next column. In reality, once an element of
the ifmap arrives at a PE, we can also pass it to the next PE column in the next cycle. So the second
column can start one cycle after the first column. We are thus pipelining the computations in all the
columns. Once we reach steady state, all the columns will be busy, thus achieving very high
parallelism.
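A back-of-the-envelope model of that pipelining (my simplification: each column needs roughly one primitive’s worth of MAC cycles, and each column can start one cycle after its predecessor):

```python
def pipelined_cycles(n_cols, cycles_per_col):
    # Fill the pipeline (one new column starts per cycle), then let the
    # last column finish its own work.
    return (n_cols - 1) + cycles_per_col

serial = 62 * 186                        # columns processed one after another
pipelined = pipelined_cycles(62, 186)    # 61 + 186 = 247 cycles
# About a 47x speedup over the serial schedule in this toy model, since in
# steady state all 62 columns are busy at once.
```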
Slide 9
Row Stationary Dataflow for one 2D Convolution
Example: 4 64x64 inputs; 4x3x3 kernel wts; 8 62x62 outputs; 20 image batch
• Edge prim: (glb) 64 inp, 3 wts; (reg) 186 MACs psums (124 rg/by) (62 PE)
• Other prims: (PE) 64 inp, 3 wts; (reg) 186 MACs psums (124 rg/by) (62 PE)
• The first step is done ~64 times; the second step is done ~122 times
• Eventually: 4K outputs to global buffer or DRAM
Slide 10
Folding
• May have to fold a 2D convolution over a small physical
set of PEs
• Must eventually take the 4D convolution and fold it into
multiple 2D convolutions – the 2D convolution has to be done
C (input feature maps) x M (output feature maps) x N (image batch) times
• Can exploit global buffer reuse and register reuse depending
on which order you do this (note that you have to deal with
inputs, weights, and partial sums)
Slide 11
Weight Stationary
Example: 4 64x64 inputs; 4x3x3 kernel wts; 8 62x62 outputs; 20 image batch
• Weight Stationary: the weights stay in place
• Assume that we lay out 3x3 weights across 9 PEs
• Let the inputs stream over these – each weight has to be seen 62x62 times
-- there is no easy way to move the pixels around to promote this reuse
There are other reasonable dataflows as well. In weight stationary, the 9 weights in a filter may
occupy a 3x3 grid of PEs. The ifmap has to flow through these PEs. Note that each element in the
ifmap has to combine with each of the weights, so there’s plenty of required data movement.
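As a loop nest (my illustration of the idea, not Eyeriss’s actual mapping), weight stationary pins each weight to a PE in the outer loops and streams the entire ifmap past it:

```python
def conv2d_weight_stationary(ifmap, kernel):
    K = len(kernel)
    n = len(ifmap) - K + 1
    ofmap = [[0] * n for _ in range(n)]
    for r in range(K):                 # PE (r, c) holds kernel[r][c] for
        for c in range(K):             # the entire convolution...
            w = kernel[r][c]
            for i in range(n):         # ...while every input pixel streams
                for j in range(n):     # past it: each weight is used n*n times
                    ofmap[i][j] += w * ifmap[i + r][j + c]
    return ofmap
```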
Slide 12
Output Stationary
Example: 4 64x64 inputs; 4x3x3 kernel wts; 8 62x62 outputs; 20 image batch
• Output Stationary: the output neuron stays in place
• Need to use all PEs to compute a subset of a 4D space of neurons at a
time – can move inputs around to promote reuse
In output stationary, each PE is responsible for producing one output value. So all the necessary
ifmap and filter values have to find their way to all necessary PEs.
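The same convolution with the loops reordered for output stationary (again an illustration, not the paper’s mapping): the accumulator for one output pixel never moves, and all of its inputs and weights must travel to it.

```python
def conv2d_output_stationary(ifmap, kernel):
    K = len(kernel)
    n = len(ifmap) - K + 1
    ofmap = [[0] * n for _ in range(n)]
    for i in range(n):                 # PE (i, j) owns the accumulator for
        for j in range(n):             # ofmap[i][j]; it stays put...
            for r in range(K):         # ...but all K*K ifmap and filter
                for c in range(K):     # values must be routed to this PE
                    ofmap[i][j] += ifmap[i + r][j + c] * kernel[r][c]
    return ofmap
```

Note that weight stationary, output stationary, and row stationary all compute the same result; only the loop order (and hence which operand stays cheap to access) changes.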
Slide 14
Energy Estimates
• Most ops in convolutional layers
• Most energy in ALU and RF in convs; most energy in buffer in FC
• More storage = more delay
The first figure shows that the CONV layers account for 90%
of all energy. Most energy is in the RF and ALUs. The FC
layers have lower reuse and therefore see more energy in
the global buffer. The optimal design has a good balance of
storage and PEs.
Slide 15
Summary
• All about reduced energy and data movement; assume
PEs are busy most of the time (except edge effects)
• Reduced data movement → low energy and low area
(from fewer interconnects)
• While Row Stationary is best, a detailed design-space exploration
is needed to identify the best traversal through the 4D array
• It’s not always about reducing DRAM accesses; even
global buffer accesses must be reduced
• More PEs allow for better data reuse, so a larger array is not terrible
even if it means a smaller global buffer
• Convs are 90% of all ops and growing
• Their best designs are with 256 PEs, 0.5KB regfile/PE,
128KB global buffer; filter/psum/act = 224/24/12
Slide 16
WAX
Recent work makes the argument that the storage hierarchy in Eyeriss is sub-optimal.
Accessing a 0.5KB register file is expensive, as is accessing a 108KB global buffer. A more
efficient hierarchy (as shown here) is to have a handful of registers per PE and a local
buffer of size 8KB. This is one tile; if more data is required, it is fetched from nearby
tiles. This is also an example of near-data processing where an array of 32 MACs is placed
next to an 8KB buffer (subarray). A row of 32 values read from the buffer gets placed in
either a W register (weights) or A register (activations). These registers feed the 32 MACs
in parallel in one cycle. In the next cycle, the A register performs a shift operation. This
enables high reuse while performing a convolution operation. The WAX paper constructs
a number of dataflows to reduce the overall energy required by a deep network.
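A sketch of one WAX compute step as I read the description above (the register widths and shift behavior are my assumptions, not code from the paper):

```python
WIDTH = 32                             # one buffer row feeds 32 MACs

def wax_mac_cycle(W, A, acc):
    """All 32 MACs fire in parallel in one cycle: acc += W * A, lanewise."""
    for lane in range(WIDTH):
        acc[lane] += W[lane] * A[lane]
    return acc

def shift_A(A, incoming=0):
    """Next cycle: the activation register shifts by one position, so each
    weight meets the neighboring activation (the reuse in a convolution)."""
    return A[1:] + [incoming]
```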
Slide 17
References
• “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for
Convolutional Neural Networks,” Y-H. Chen et al., ISCA 2016
• “Wire-Aware Architecture and Dataflow for CNN Accelerators,”
S. Gudaparthi et al., MICRO 2019