1. Introduction to Eyeriss¹
Michael (Tao-Yi) Lee
tylee@mlpanda.rocks
NTU IoX Center
October 24, 2017
¹ Y. H. Chen et al. “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks”. In: IEEE Journal of Solid-State Circuits 52.1 (Jan. 2017),
pp. 127–138.
2. Outline
1 Introduction
2 Eyeriss Highlights
Memory Hierarchy
Row Stationary Data Flow
Network-on-a-chip (NoC)
Compression and Data Gating
3 Summary
4 Appendix
Michael Lee (NTU) Introduction to Eyeriss October 24, 2017 2 / 37
3. Introduction
Contributions of Eyeriss
A novel energy-efficient CNN dataflow that has been verified in
a fabricated chip
A taxonomy of CNN dataflows that classifies previous work into
three categories (WS, OS, NLR)
Figure: Eyeriss die photo, 4000 µm × 4000 µm, showing the 168-PE array and the global buffer (35 fps @ 278 mW running AlexNet [10])
4. Introduction
Features of Eyeriss
Use row stationary (RS) on spatial architecture with 168
processing elements to reduce energy cost of data flow
4-level memory hierarchy: maximally local data reuse
Network-on-a-chip (NoC)
Multicast
P2P single-cycle delivery
Compression and data gating
Run-length compression (RLC)
PE data gating
6. Introduction
Recap on CNN
Forward computation in CONV layers
Given ofmap O, ifmap I, bias B, weight W, stride size U
O[z][u][x][y] = ReLU( B[u] + \sum_{k=0}^{C-1} \sum_{i=0}^{R-1} \sum_{j=0}^{S-1} I[z][k][Ux+i][Uy+j] × W[u][k][i][j] )   (1)

(the running sum inside the ReLU is the partial sum)

where 0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F, and

E = (H − R + U)/U   (2)
F = (W − S + U)/U   (3)
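The forward computation above can be sketched directly as nested loops. This is a naive reference implementation of Eq. (1) with output extents per Eqs. (2)–(3), not how Eyeriss actually schedules the work; the array shapes are my own convention:

```python
import numpy as np

def conv_forward(I, W, B, U):
    """Naive CONV-layer forward pass following Eq. (1).
    I: ifmaps (N, C, Hx, Hy); W: filters (M, C, R, S); B: bias (M,); U: stride."""
    N, C, Hx, Hy = I.shape
    M, _, R, S = W.shape
    F = (Hx - R + U) // U  # number of x positions, cf. Eq. (3)
    E = (Hy - S + U) // U  # number of y positions, cf. Eq. (2)
    O = np.zeros((N, M, F, E))
    for z in range(N):
        for u in range(M):
            for x in range(F):
                for y in range(E):
                    psum = B[u]  # the running value here is the "partial sum"
                    for k in range(C):
                        for i in range(R):
                            for j in range(S):
                                psum += I[z, k, U * x + i, U * y + j] * W[u, k, i, j]
                    O[z, u, x, y] = max(psum, 0.0)  # ReLU
    return O
```

Every dataflow discussed later computes exactly these loops; they differ only in which operands stay put and which move.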
7. Eyeriss Highlights Memory Hierarchy
PE Matrix and Memory Hierarchy
1. Spatial Architecture: Allows data to flow in four directions
2. Each PE operates independently on a single core clock (CLKcore), i.e. the array is not systolic
Figure: spatial-architecture PE array backed by DRAM. Each PE performs a MAC on a pixel and a weight (W), consuming psum_in and producing psum_out; data can flow between PEs in four directions.
Challenge
How to optimize data flow in order to minimize energy consumption?
8. Eyeriss Highlights Memory Hierarchy
PE Matrix and Memory Hierarchy (Eyeriss)
4-level memory hierarchy
DRAM → Global Buffer (GLB) → Network-on-a-Chip (NoC) →
Register File (RF)
Figure: memory hierarchy with relative energy cost (EC)² per access — on-chip RF (1 kB per PE, EC = 1×) → NoC (EC = 2×) → GLB (EC = 6×) → DRAM FIFO (EC = 500×)
² Relative energy cost
9. Eyeriss Highlights Row Stationary Data Flow
CNN dataflows
Row Stationary (RS)
Weight Stationary (WS)
Output Stationary (OS)
No Local Reuse (NLR)
10. Eyeriss Highlights Row Stationary Data Flow
Comparison of Dataflows (I)
The next slides focus on the flows of psums, weights, and pixels
RS consumes 1.4×–2.5× less energy than the other dataflows
11. Eyeriss Highlights Row Stationary Data Flow
Row Stationary in 1D Convolution
Kernel [a b c]  ∗  Image [a b c d e]  =  PSum [a b c]
14. Eyeriss Highlights Row Stationary Data Flow
Row Stationary in 1D Convolution (PE)
Kernel [a b c]  ∗  Image [a b c d e]  =  PSum [a b c]

Figure: a single PE computes the 1D convolution. The kernel weights stay resident in the PE register file while image pixels shift through a sliding window, producing psums a, b, c over successive cycles.
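The sliding-window behavior of one PE can be sketched in a few lines. This is a behavioral model (the function name and the explicit window list are illustrative), with the kernel resident and pixels streaming through:

```python
def rs_1d_conv(kernel, image):
    """Row-stationary 1D convolution in a single PE (behavioral sketch).
    The kernel stays resident in the PE register file; image pixels stream
    through a sliding window, and one psum is produced per step."""
    R = len(kernel)
    window = list(image[:R])           # pixels currently held in the RF
    psums = []
    for step in range(len(image) - R + 1):
        # MAC over the resident weights and the current pixel window
        psums.append(sum(w * p for w, p in zip(kernel, window)))
        if step + R < len(image):      # shift in the next pixel
            window = window[1:] + [image[step + R]]
    return psums
```

A 3-tap kernel over a 5-pixel row yields exactly the three psums shown on the slide.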
17. Eyeriss Highlights Row Stationary Data Flow
Row Stationary in 2D Convolution
Kernel (3×3)      Image (5×5)             PSum (3×3)
1a 1b 1c          1a 1b 1c 1d 1e          1a 1b 1c
2a 2b 2c     ∗    2a 2b 2c 2d 2e     =    2a 2b 2c
3a 3b 3c          3a 3b 3c 3d 3e          3a 3b 3c
                  4a 4b 4c 4d 4e
                  5a 5b 5c 5d 5e
21. Eyeriss Highlights Row Stationary Data Flow
Row Stationary in 2D Convolution (PE)
Figure: a 3×3 set of PEs computes the 2D convolution. Each PE holds one filter row and one image row in its register file. Weights propagate horizontally across the PE set, pixels propagate diagonally, and psums propagate (and accumulate) vertically.
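The decomposition above can be sketched as a behavioral model (not the hardware schedule): each pairing of a filter row with an image row is one PE's 1D convolution, and psums accumulate down each output row as they would down a PE column:

```python
def rs_2d_conv(kernel, image):
    """2D convolution decomposed row-stationary style (sketch):
    each (filter row, image row) pair is a 1D convolution handled by one PE;
    psums accumulate vertically across the PEs of a column."""
    Rk = len(kernel)                           # number of filter rows
    E = len(image) - Rk + 1                    # number of output rows
    W_out = len(image[0]) - len(kernel[0]) + 1 # output row width
    out = [[0] * W_out for _ in range(E)]
    for r, krow in enumerate(kernel):          # filter row r stays in PE row r
        for e in range(E):                     # output row e uses image row e + r
            irow = image[e + r]
            for x in range(W_out):             # 1D conv of krow over irow
                out[e][x] += sum(krow[i] * irow[x + i] for i in range(len(krow)))
    return out
```

For the 3×3 kernel and 5×5 image above this produces the 3×3 psum grid from nine 1D row convolutions, matching the 3×3 PE set.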
24. Eyeriss Highlights Row Stationary Data Flow
Weight Stationary (WS)3
Minimize weight-read energy consumption
Maximize convolutional and filter reuse of weights
Examples:
Chakradhar et al. 2010
Gokhale et al. 2014
Park et al. 2015
Cavigelli et al. 2015
³ Image adapted from Yu-Hsin Chen, Joel Emer, and Vivienne Sze. ISCA 2016 Slides of
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.
25. Eyeriss Highlights Row Stationary Data Flow
Output Stationary (OS)4
Minimize partial-sum read/write energy consumption
Maximize local accumulation
Examples:
Gupta et al. 2015
Du et al. 2015
Peemen et al. 2013
⁴ Image adapted from Yu-Hsin Chen, Joel Emer, and Vivienne Sze. ISCA 2016 Slides of
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.
26. Eyeriss Highlights Row Stationary Data Flow
No Local Reuse (NLR)5
Use a large global buffer as shared storage
Reduce DRAM access energy consumption
Examples:
Chen et al. 2014 (DianNao)
Chen et al. 2014 (DaDianNao)
Zhang et al. 2015
⁵ Image adapted from Yu-Hsin Chen, Joel Emer, and Vivienne Sze. ISCA 2016 Slides of
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.
27. Eyeriss Highlights Row Stationary Data Flow
Comparison of Dataflows (II)
RS reuses data heavily in the local register files (RF) ⇒ saves the energy of moving data
28. Eyeriss Highlights Row Stationary Data Flow
Beyond 2D Convolution - Multiple Images
Processing in PE
Concatenate image rows
29. Eyeriss Highlights Row Stationary Data Flow
Beyond 2D Convolution - Multiple Filters
Processing in PE
Interleave filter rows
30. Eyeriss Highlights Row Stationary Data Flow
Beyond 2D Convolution - Multiple Channels
Processing in PE
Interleave channels
32. Eyeriss Highlights Row Stationary Data Flow
AlexNet PE Mapping
33. Eyeriss Highlights Row Stationary Data Flow
AlexNet Inter-Pass Data Caching
34. Eyeriss Highlights Row Stationary Data Flow
AlexNet Shape Parameters
How do we map AlexNet onto Eyeriss?
L   H⁶    R    E    C     M     U
1   227   11   55   3     96    4
2   31    5    27   48    256   1
3   15    3    13   256   384   1
4   15    3    13   192   384   1
5   15    3    13   192   256   1

m⁷   n   e    p    q   r   t
96   1   7    16   1   1   2
64   1   27   16   2   1   1
64   4   13   16   4   1   4
64   4   13   16   3   2   2
64   4   13   16   3   2   2

⁶ H: ifmap width, R: kernel width, E: ofmap width, C: # channels, M: # kernels, U: stride
⁷ m: # ofmap channels stored in GLB, n: # ifmaps, e: width of PE set, p: # filters processed, q: # channels processed, r: # PEs processing different channels, t: # PEs processing different filters
35. Eyeriss Highlights Row Stationary Data Flow
AlexNet Shape Mapping Illustrated
36. Eyeriss Highlights Network-on-a-chip (NoC)
NoC Optimized for RS
Global input/output networks (GI/ON): a Multicast Controller (MC) broadcasts GLB data to the assigned PEs; each data word is tagged with a (row, col) ID in the GLB
filter GI/ON
ifmap GI/ON
psum GI/ON
Local network: a dedicated 64-bit data bus passes psums from the bottom PE directly to the top PE
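The ID-matched multicast can be sketched as follows. This is a simplified flat-match model: in my understanding the hardware matches row and column IDs hierarchically across its buses, and the function below is only illustrative:

```python
def multicast(word, tag, pe_ids):
    """GI/ON multicast sketch: the MC forwards a GLB word, tagged with a
    (row, col) ID, and every PE whose configured ID matches the tag
    latches it. Returns {pe_index: word} for the PEs that accepted it."""
    delivered = {}
    for pe, pe_id in enumerate(pe_ids):
        if pe_id == tag:          # ID match -> this PE latches the word
            delivered[pe] = word
    return delivered
```

Reconfiguring the per-PE IDs is what lets the same network support the different PE-set mappings of each layer.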
38. Eyeriss Highlights Network-on-a-chip (NoC)
Populate Data with Global Input / Output Network
39. Eyeriss Highlights Compression and Data Gating
Run-Length Compression (RLC)
ReLU produces many zeros in the activated ofmap; RLC exploits this sparsity to save power on DRAM reads and writes
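A minimal RLC sketch: each nonzero value is stored with a count of the zeros preceding it, the run capped as it would be by a fixed-width run field. The list-of-pairs representation is a simplification of the chip's packed word format:

```python
def rlc_encode(values, max_run=31):
    """Run-length compression sketch: emit (zero_run, value) pairs, where
    zero_run counts the zeros before value, capped at max_run. Trailing
    zeros are flushed as a final pair so decoding is lossless."""
    pairs, run = [], 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run - 1, 0))   # run-1 zeros plus a zero value
    return pairs

def rlc_decode(pairs):
    """Inverse of rlc_encode: expand each (zero_run, value) pair."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    return out
```

The sparser the ofmap, the fewer pairs are written back, which is where the DRAM power savings come from.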
40. Eyeriss Highlights Compression and Data Gating
Data Gating / Zero Skipping
Simply skip the MAC operation when either the pixel or the weight is zero
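The gating step can be sketched behaviorally as below; in hardware the zero check gates register reads and the datapath rather than taking a branch, and the stats counter is only for illustration:

```python
def gated_mac(psum, pixel, weight, stats):
    """Data-gating sketch: when an operand is zero the MAC is skipped
    entirely -- no multiply, psum passes through unchanged -- which in
    hardware saves the datapath's switching power."""
    if pixel == 0 or weight == 0:
        stats["skipped"] += 1
        return psum                  # gated: nothing toggles
    stats["performed"] += 1
    return psum + pixel * weight
```

On ReLU-sparse ifmaps a large fraction of MACs gate off this way.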
42. Summary
Performance Summary and Comparison
                              Eyeriss [5]     NVIDIA TK1
Technology                    65 nm 1P9M      28 nm
Chip Size (mm)                4.0 × 4.0       N/A
Core Area (mm)                3.5 × 3.5       N/A
Gate Count                    1176k           N/A
Word Bit-Width                16b fixed       32b float
Core Clock (MHz)              200             852
On-Chip Buffer Size (kB)      108             64
Total Register Size (kB)      75.3            256
# MACs                        168             192
Throughput (fps)              34.7            68
Measured Power, Idle (mW)     N/A             3700
Measured Power, Active (mW)   278             10002
43. Summary
Summary
RS optimizes for the best overall energy efficiency, while existing
CNN dataflows focus only on certain data types.
RS has higher energy efficiency than existing dataflows:
1.4× – 2.5× higher in CONV layers
at least 1.3× higher in FC layers (batch size ≥ 16)
44. Appendix
Bibliography I
Lukas Cavigelli et al. “Origami: A Convolutional Network Accelerator”. In: Proceedings of the 25th Edition on Great
Lakes Symposium on VLSI. GLSVLSI ’15. Pittsburgh, Pennsylvania, USA: ACM, 2015, pp. 199–204. isbn:
978-1-4503-3474-7. doi: 10.1145/2742060.2743766. url: http://doi.acm.org/10.1145/2742060.2743766.
Srimat Chakradhar et al. “A Dynamically Configurable Coprocessor for Convolutional Neural Networks”. In:
Proceedings of the 37th Annual International Symposium on Computer Architecture. ISCA ’10. Saint-Malo, France:
ACM, 2010, pp. 247–257. isbn: 978-1-4503-0053-7.
Yu-Hsin Chen, Joel Emer, and Vivienne Sze. ISCA 2016 Slides of Eyeriss: A Spatial Architecture for Energy-Efficient
Dataflow for Convolutional Neural Networks.
Tianshi Chen et al. “DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning”. In:
Proceedings of the 19th International Conference on Architectural Support for Programming Languages and
Operating Systems. ASPLOS ’14. Salt Lake City, Utah, USA: ACM, 2014, pp. 269–284. isbn: 978-1-4503-2305-5.
doi: 10.1145/2541940.2541967. url: http://doi.acm.org/10.1145/2541940.2541967.
Y. H. Chen et al. “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks”. In: IEEE Journal of Solid-State Circuits 52.1 (Jan. 2017), pp. 127–138.
Y. Chen et al. “DaDianNao: A Machine-Learning Supercomputer”. In: 2014 47th Annual IEEE/ACM International
Symposium on Microarchitecture. Dec. 2014, pp. 609–622. doi: 10.1109/MICRO.2014.58.
45. Appendix
Bibliography II
Z. Du et al. “ShiDianNao: Shifting vision processing closer to the sensor”. In: 2015 ACM/IEEE 42nd Annual
International Symposium on Computer Architecture (ISCA). June 2015, pp. 92–104. doi:
10.1145/2749469.2750389.
V. Gokhale et al. “A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks”. In: 2014 IEEE Conference on
Computer Vision and Pattern Recognition Workshops. June 2014, pp. 696–701. doi: 10.1109/CVPRW.2014.106.
Suyog Gupta et al. “Deep Learning with Limited Numerical Precision”. In: Proceedings of the 32nd International
Conference on International Conference on Machine Learning - Volume 37. ICML’15. Lille, France: JMLR.org, 2015,
pp. 1737–1746. url: http://dl.acm.org/citation.cfm?id=3045118.3045303.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural
Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. 2012, pp. 1097–1105.
S. Park et al. “4.6 A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture
for big-data applications”. In: 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical
Papers. Feb. 2015, pp. 1–3. doi: 10.1109/ISSCC.2015.7062935.
M. Peemen et al. “Memory-centric accelerator design for Convolutional Neural Networks”. In: 2013 IEEE 31st
International Conference on Computer Design (ICCD). Oct. 2013, pp. 13–19. doi: 10.1109/ICCD.2013.6657019.
46. Appendix
Bibliography III
Chen Zhang et al. “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks”. In:
Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA ’15.
Monterey, California, USA: ACM, 2015, pp. 161–170. isbn: 978-1-4503-3315-3. doi: 10.1145/2684746.2689060.
url: http://doi.acm.org/10.1145/2684746.2689060.