Flow Mapping and Data Distribution on Mesh-based Deep Learning Accelerator
1. Flow Mapping and Data Distribution on Mesh-based Deep Learning Accelerator
Presented by Hesam Shabani
Seyedeh Yasaman Hosseini Mirmahaleh1, Midia Reshadi1, Hesam Shabani2, Xiaochen Guo2, Nader Bagherzadeh3
1Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran,
2Lehigh University, Bethlehem, PA, USA
3Department of Electrical Engineering and Computer Science, University of California Irvine, Irvine, CA, USA
yasaman.hosseini@srbiau.ac.ir
NOCS2019
2. Outline
Introduction
Investigating some related works
The purposes of our proposed deep learning accelerator
Evaluated parameters
Flow mapping method on a mesh topology
Influence of dataflow on energy consumption
Row-node stationary-based dataflow approach
Traffic distribution based on distributer nodes
Experimental results
Conclusion
Acknowledgment
3. Deploying machine learning algorithm-based applications
Internet of Things (IoT)
Web search engines
Image processing and data mining-based applications
Increasing depth and complexity of neural networks
Challenges arising from the increasing depth and complexity of convolutional and deep neural networks (CNNs and DNNs):
Increasing energy consumption
Memory capacity
Bandwidth requirement
Memory access
Delay
Deep learning accelerators proposed to address CNN and DNN challenges:
Supercomputer
Communication networks
Memory logics
Our proposed method for improving delay, energy consumption, bandwidth, and memory requirements:
Flow mapping
Distributer nodes
New traffic distribution mechanism on a mesh topology
A simple router structure with tiny switches
4. Advantages and disadvantages of proposed deep learning accelerators (DLAs)

Accelerator     | Advantages                                                                                | Disadvantages
TPU [6]         | Speeds up processing                                                                      | Dataflow dependency
DaDianNao [1]   | Speeds up processing compared with GPUs; improves memory capacity and energy consumption | Inflexible; complex neuron mapping; implementing both training and inference phases; integrating optical and electrical interconnections; computation dependency
Eyeriss [5]     | Improves memory access; reduces bandwidth requirement and delay                          | No flexibility or scalability; no support for sparse DNNs (SDNNs); computation dependency
Eyeriss v2 [16] | Scalability; supports SDNNs                                                               | Increased MAC complexity
MAERI [8]       | Speeds up processing; improves memory access; flexible; independent of dataflow          | Traffic distribution restricted to a single direction; higher power consumption than other accelerators
GPU-based systems [38]: Advantage: flexibility; Disadvantage: high energy consumption
5. A new traffic distribution mechanism on a mesh topology using distributer nodes
Providing a flexible structure for our proposed DLA based on the filter, kernel, and channel sizes of trained CNN and DNN models
Focusing on a mesh topology as the communication network for acceleration
Flexible placement of distributer nodes on the mesh topology based on filter, kernel, and channel sizes
Row-node stationary dataflow for flow mapping
Improving online implementation of trained models by reducing:
Delay
Energy consumption
Memory access
Bandwidth requirement
Analyzing and distributing the traffic of AlexNet, VGG-16, and GoogleNet as examples of CNN and DNN models
6. Evaluated parameters
Area consumption
Energy consumption
Delay
Average utilization
Bandwidth requirement
Memory access
7. AlexNet traffic distribution as an example of a CNN on a mesh topology
Partitioning the mesh based on the kernel, filter, and channel sizes of AlexNet, as a running example of the partitioning method
Our proposed mesh-based DLA architecture
Architecture of proposed DLA
Router
Switches
Switch selector
9. Partitioning the mesh based on kernel, filter, and channel sizes of AlexNet for CONV1
[Figure: AlexNet architecture [19]]
10-11. Partitioning the mesh based on kernel, filter, and channel sizes of AlexNet for CONV1 (continued)
[Figure: the mesh is partitioned into two 11×7 regions for CONV1]
12-14. Partitioning the mesh based on kernel, filter, and channel sizes of AlexNet for CONV2
[Figure: the mesh is partitioned into 5×13 and 5×14 regions for CONV2]
15-16. Partitioning the mesh based on kernel, filter, and channel sizes of AlexNet for CONV3-5
[Figure: the mesh is partitioned into four 3×13 regions for CONV3-5]
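The partition shapes on these slides track the layer geometry. The sketch below is a rough illustration of that relationship, based on our reading of the figures rather than the authors' published algorithm: partitions are kernel-height bands stacked down the 12-row compute array, and a kernel too tall to stack twice (CONV1's 11×11) is split across the columns instead. The exact widths, e.g. CONV2's 5×13 and 5×14 regions, presumably also depend on the filter and channel counts.

```python
# A minimal sketch (our assumption, not the authors' exact method) of
# kernel-size-driven mesh partitioning, reproducing the CONV1 and
# CONV3-5 partition shapes shown on the previous slides.

def partition_mesh(mesh_rows: int, mesh_cols: int, kernel_rows: int):
    """Return a list of (height, width) regions tiling the compute array."""
    regions = []
    if kernel_rows > mesh_rows // 2:
        # Only one band of rows fits vertically: split the columns instead
        # (e.g. AlexNet CONV1, 11x11 kernel -> two 11x7 regions).
        half = mesh_cols // 2
        regions.append((kernel_rows, half))
        regions.append((kernel_rows, mesh_cols - half))
    else:
        # Stack bands of kernel_rows rows (e.g. CONV3-5, 3x3 kernel
        # -> four 3x13 regions on a 12-row array).
        for _ in range(mesh_rows // kernel_rows):
            regions.append((kernel_rows, mesh_cols))
    return regions

print(partition_mesh(12, 14, 11))  # CONV1: [(11, 7), (11, 7)]
print(partition_mesh(12, 13, 3))   # CONV3-5: four (3, 13) regions
```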
17-19. Architecture of proposed DLA
[Figure: a 12×15 2D mesh (12×14 compute array) connected to a global buffer that supplies ifmap and filter data and collects Psums, with a switch selector]
20. Router
[Figure: router microarchitecture with buffered North, South, East, and West ports, a multicast buffer, a local buffer, and the internal switch]
The router utilizes a multicast buffer, an on/off buffer backpressure mechanism, and a two-stage pipeline (a sketch of the on/off mechanism follows).
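As a minimal sketch of the on/off backpressure idea, and only our interpretation of the mechanism named on the slide: the downstream buffer signals "off" when it crosses a high-water mark and "on" again once it drains below a low-water mark, so the upstream stage pauses instead of dropping flits.

```python
# A behavioral sketch (assumption) of on/off buffer backpressure.
from collections import deque

class OnOffBuffer:
    def __init__(self, depth: int, off_threshold: int, on_threshold: int):
        assert on_threshold < off_threshold <= depth
        self.fifo = deque()
        self.depth = depth
        self.off_threshold = off_threshold  # signal 'off' at/above this fill
        self.on_threshold = on_threshold    # signal 'on' at/below this fill
        self.sending_allowed = True         # the on/off wire seen upstream

    def can_accept(self) -> bool:
        return self.sending_allowed

    def push(self, flit) -> None:
        assert len(self.fifo) < self.depth, "overflow: upstream ignored 'off'"
        self.fifo.append(flit)
        if len(self.fifo) >= self.off_threshold:
            self.sending_allowed = False    # tell upstream to pause

    def pop(self):
        flit = self.fifo.popleft()
        if len(self.fifo) <= self.on_threshold:
            self.sending_allowed = True     # tell upstream to resume
        return flit
```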
21. Switch
[Figure: a tiny switch built from a 4:1 MUX (N/S/W/E inputs to the local port) and a 1:4 DeMUX (local port to N/S/W/E outputs), controlled by select lines s0-s3 and a clock enable (Clk EN)]
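A behavioral sketch of how we read the switch diagram, as an assumption rather than the authors' RTL: one 4-way MUX steers a directional input to the local port, one DeMUX steers the local output to a direction, and the clock enable gates both; sel_in and sel_out stand in for the select lines s0-s3.

```python
# A behavioral sketch (our reading of the diagram) of the tiny switch.
DIRS = ("N", "S", "W", "E")

def tiny_switch(inputs: dict, local_out, sel_in: int, sel_out: int, en: bool):
    """Return (value delivered to local port, dict of directional outputs)."""
    if not en:                            # clock enable gates the switch
        return None, {d: None for d in DIRS}
    to_local = inputs[DIRS[sel_in]]       # MUX: chosen input -> local port
    outs = {d: None for d in DIRS}
    outs[DIRS[sel_out]] = local_out       # DeMUX: local port -> chosen output
    return to_local, outs

# Example: pass the North input to the local port, send local data East.
print(tiny_switch({"N": 7, "S": 0, "W": 0, "E": 0}, 42, 0, 3, True))
```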
22. Switch selector
[Figure: a switch address feeds a row decoder (R-decoder) and a column decoder (C-decoder); their enable lines (S0-S4) drive N-to-1 multiplexers that activate the selected switches]
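Our reading of the selector, sketched below as an assumption: the switch address is split into row and column fields, each decoder produces a one-hot enable, and a switch is activated only where both lines assert, as in a 2D memory-style decoder.

```python
# A sketch (assumption) of row/column decoding for switch selection.
def one_hot(index: int, width: int):
    """One-hot enable vector, as a decoder output."""
    return [1 if i == index else 0 for i in range(width)]

def select_switch(address: int, rows: int, cols: int):
    row_en = one_hot(address // cols, rows)  # R-decoder output
    col_en = one_hot(address % cols, cols)   # C-decoder output
    # Enable matrix: switch (r, c) is active iff both decoders assert it.
    return [[r & c for c in col_en] for r in row_en]

# Example: address 5 on a 2x4 array enables only switch (1, 1).
print(select_switch(5, 2, 4))
```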
23. Dataflow approaches
Weight stationary (WS): weight elements are received from the global buffer (GB) and broadcast to the PEs; once a weight is fixed in each PE, the convolution is computed between that fixed weight and the ifmap elements broadcast from the GB to the PEs [3], [4]. Example: microswitch array [12].
Output stationary (OS): in an output-stationary DLA, outputs, or both weights and input activations, are mapped from the GB to the PEs, and the Psum results are sent back to the GB after local computation finishes [2], [4], [7]. Examples: TPU, systolic array.
Row stationary (RS): ifmap and filter rows are transferred from the GB to the PE units horizontally, Psums are accumulated vertically by the multiply-accumulate (MAC) operations of the PEs, and the accumulated Psums are transferred back to the GB [5]. Examples: Eyeriss [5], Eyeriss v2 [16], microswitch array [4].
Row-node stationary (RNS): we propose row-node stationary dataflow as a new approach for distributing the traffic of trained DNN models based on flow mapping and the memory access mechanism. Under RNS, an accelerator transfers data over sets of nodes in the vertical and horizontal directions in parallel, using distributer nodes.
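To make the distinctions concrete, here are illustrative loop nests for a 1-D convolution in the textbook forms of WS and OS; these are generic sketches of the dataflow concepts, not the cited designs' exact hardware loops.

```python
# Illustrative loop nests for a 1-D convolution
# out[x] = sum_k w[k] * ifmap[x + k], showing what stays "stationary".

def weight_stationary(w, ifmap):
    # WS: each weight w[k] is fixed in its PE while all ifmap elements
    # stream past it; partial sums move between PEs.
    out = [0] * (len(ifmap) - len(w) + 1)
    for k, wk in enumerate(w):               # one PE per fixed weight
        for x in range(len(out)):            # ifmap streams through
            out[x] += wk * ifmap[x + k]
    return out

def output_stationary(w, ifmap):
    # OS: each output stays in its PE until fully accumulated;
    # weights and inputs are streamed in.
    out = []
    for x in range(len(ifmap) - len(w) + 1):  # one PE per output
        acc = 0
        for k, wk in enumerate(w):
            acc += wk * ifmap[x + k]
        out.append(acc)
    return out

# RS and RNS lift the same idea from scalars to rows: a PE (RS), or a set
# of nodes fed by a distributer node (RNS), keeps a whole ifmap/filter row
# resident while Psum rows are accumulated vertically.
assert weight_stationary([1, 2], [1, 2, 3, 4]) == output_stationary([1, 2], [1, 2, 3, 4])
```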
24. Row-node stationary (RNS) dataflow
[Figure, panels (a)-(c):
(a) a row of ifmap values is reused and distributed in the vertical and horizontal directions based on the location of the distributer node;
(b) a row of filter weights is reused and distributed in the vertical and horizontal directions based on the location of the distributer node;
(c) a row of Psums is accumulated vertically]
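A toy calculation, our illustration rather than anything taken from the slides, of why the distributer node's location matters: data injected at a mid-row distributer reaches the nodes of its row in both directions in parallel, so the worst-case hop count drops from cols - 1 (edge injection) to about cols / 2.

```python
# Worst-case hops to reach every node in a row from injection column src_col.
def worst_case_hops(src_col: int, cols: int) -> int:
    return max(src_col, cols - 1 - src_col)

cols = 14                            # width of the 12x14 compute array
print(worst_case_hops(0, cols))      # edge injection: 13 hops
print(worst_case_hops(cols // 2, cols))  # mid-row distributer: 7 hops
```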
25-27. AlexNet traffic distribution for CONV1 on the 12×15 2D mesh using distributer nodes
[Figure (a): a 12×15 2D mesh (12×14 compute array) with distributer and destination nodes; ifmap, filter, and Psum flows travel over shared buses]
28-30. AlexNet traffic distribution for CONV2 on the 12×15 2D mesh using distributer nodes
[Figure (b): a 12×15 2D mesh (12×14 compute array) with distributer and destination nodes]
31-32. AlexNet traffic distribution for CONV1 on the 12×15 2D mesh without distributer nodes
[Figure (c): a 12×15 2D mesh (12×14 compute array) with destination nodes only; ifmap, filter, and Psum flows travel over shared buses]
33-34. AlexNet traffic distribution for CONV2 on the 12×15 2D mesh without distributer nodes
[Figure (d): a 12×15 2D mesh (12×14 compute array) with destination nodes only]
35. AlexNet traffic distribution with and without distributer nodes
[Figure: panels (a)-(d) side by side on the 12×15 2D mesh: CONV1 and CONV2 using distributer nodes (a, b) and without distributer nodes (c, d)]
36. Experimental results: energy and delay
[Chart: total energy (J), comparing the 12×15 2D mesh with distributer nodes, the 12×15 2D mesh without distributer nodes, and MAERI]
[Chart: total delay (cycles), comparing the 12×15 2D mesh with distributer nodes, the 12×15 2D mesh without distributer nodes, and MAERI]
37. Experimental results: area and memory access
[Chart: FPGA LUT count, comparing the switch area of the 12×15 2D mesh with distributer nodes, the 168 switches of Eyeriss, and the 64 multiplier switches of MAERI]
[Chart: memory access (cycles for memory reads and writes), comparing the 12×15 2D mesh with and without distributer nodes for AlexNet traffic distribution]
38. Table 1. Total runtime comparison between various dataflows with 168 PEs for CONV1 and CONV11 of VGG-16

CONV | Dataflow | Total runtime (cycles)
1    | RN       | 17034
1    | NLR      | 501258240
1    | WS       | 25961600
1    | Shi      | 249446400
1    | DLA      | 1157409792
1    | RS       | 164204544
11   | RN       | 17722
11   | NLR      | 360316928
11   | WS       | 217317376
11   | Shi      | 2020081664
11   | DLA      | 673876224
11   | RS       | 830472192
Table 2. Average utilization and runtime comparison between various topologies for AlexNet and GoogleNet traffic distribution

Trained model | Topology                | Array size | Compute runtime (cycles) | Average utilization (%)
AlexNet       | Proposed mesh-based DLA | 12×14      | 113352                   | 88.57
AlexNet       | TPU                     | 256×256    | 10026200                 | 96.25
AlexNet       | Systolic array          | 32×32      | 2504183                  | 99.12
AlexNet       | Eyeriss                 | 12×14      | 16377164                 | 98.05
GoogleNet     | Proposed mesh-based DLA | 12×14      | 180182                   | 84.52
GoogleNet     | TPU                     | 256×256    | 259827                   | 68.67
GoogleNet     | Systolic array          | 256×256    | 297163                   | 68.67
39. Table 3. Bandwidth requirement comparison between various topologies for AlexNet, GoogleNet, and VGG-16 traffic distributions

Trained model | Topology                | Array size | Bandwidth requirement (bytes/cycle)
GoogleNet     | Proposed mesh-based DLA | 12×14      | 0.08
GoogleNet     | TPU                     | 256×256    | 3.62
GoogleNet     | Systolic array          | 256×256    | 49.71
AlexNet       | Proposed mesh-based DLA | 12×14      | 0.08
AlexNet       | TPU                     | 256×256    | 3.14
AlexNet       | Systolic array          | 256×256    | 3.14
AlexNet       | Eyeriss                 | 12×14      | 1.02
VGG-16        | Proposed mesh-based DLA | 12×14      | 0.08
VGG-16        | TPU                     | 256×256    | 4.38
VGG-16        | Systolic array          | 256×256    | 12.108
VGG-16        | Eyeriss                 | 12×14      | 0.9
[Chart: total runtime (cycles) of the traffic distribution of AlexNet, VGG-16, and GoogleNet on the mesh]
40. Simulation tools used
A cycle-accurate SystemC-based simulation tool inspired by Noxim [10], [13], [15]
The Xilinx Vivado tool [11], [14]
SCALE-Sim, a Python-based cycle-accurate tool [17], [18]
MAESTRO, a SystemC-based tool [9], [12]

A summary of simulation results (a quick check of these figures follows)
Distributing traffic with distributer nodes reduces energy consumption by approximately 8% compared to distribution without them
The 12×15 2D mesh with distributer nodes decreases energy consumption and total delay by approximately 43.66% and 0.59% compared with MAERI, respectively
The 12×15 2D mesh with distributer nodes reduces area (LUTs) by approximately 93.56% compared with MAERI
Distributer nodes reduce memory access by approximately 62.5% compared to using no distributer nodes for AlexNet traffic on the 12×15 mesh
Row-node stationary (RN) dataflow decreases total runtime by approximately 99% compared with weight-stationary (WS) dataflow for CONV1 and CONV11 of VGG-16
Our proposed DLA improves compute runtime and average utilization by approximately 30.65% and 18.75% compared with the TPU for the first nine convolutions of GoogleNet, respectively
The mesh improves bandwidth requirement by approximately 98.17% and 91.1% compared with the TPU and Eyeriss for VGG-16 traffic distribution, respectively
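The quoted improvement percentages follow directly from the data in Tables 2 and 3; a quick arithmetic check:

```python
# Verify the improvement percentages from Tables 2 and 3.
def improvement(baseline: float, ours: float) -> float:
    """Percentage reduction (or improvement) relative to the baseline."""
    return 100 * (baseline - ours) / baseline

print(improvement(4.38, 0.08))      # bandwidth vs TPU, VGG-16     -> ~98.17%
print(improvement(0.90, 0.08))      # bandwidth vs Eyeriss, VGG-16 -> ~91.1%
print(improvement(259827, 180182))  # runtime vs TPU, GoogleNet    -> ~30.65%
```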
41. Conclusion
The flow mapping method reduced total energy and delay with distributer nodes compared with the pattern without them
Distributing CNN and DNN traffic on a mesh network with distributer nodes improves performance and meets throughput requirements
Row-node stationary dataflow has a marked effect on reducing delay and energy consumption
The proposed router, with its simpler structure and tiny switches, decreased area consumption and delay
Multi-directional multicast traffic distribution with distributer nodes decreases total energy and flow on the mesh
42. Acknowledgment
We thank the Synergy Lab team at the Georgia Institute of Technology for answering our questions, providing more information about the MAERI project, and for their kind help in compiling and using the MAESTRO and SCALE-Sim simulators.
43. References
[1] Tao Luo, Shaoli Liu, Ling Li, Yuqing Wang, Shijin Zhang, Tianshi Chen, Zhiwei Xu, Olivier Temam, and Yunji Chen. DaDianNao: A Machine-Learning Supercomputer. IEEE Transactions on Computers, 2016.
[2] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proc. ISCA, 2015.
[3] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In Proc. ISCA, 2016.
[4] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. Rethinking NoCs for Spatial Neural Network Accelerators. In Proc. NOCS, 2017.
[5] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits, 2016.
[6] Norman P. Jouppi et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proc. ISCA, 2017.
[7] Bert Moons and Marian Verhelst. A 0.3-2.6 TOPS/W Precision-Scalable Processor for Real-Time Large-Scale ConvNets. In Proc. Symposium on VLSI Circuits, 2016.
[8] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In Proc. ASPLOS, 2018.
[9] Hyoukjun Kwon, Michael Pellauer, and Tushar Krishna. MAESTRO: An Open-source Infrastructure for Modeling Dataflows within Deep Learning Accelerators. arXiv, 2018.
[10] https://github.com/davidepatti/noxim
[11] https://www.xilinx.com/products/design-tools/vivado.html
[12] http://synergy.ece.gatech.edu/tools/maestro/
[13] Vincenzo Catania, Andrea Mineo, Maurizio Palesi, Davide Patti, and Salvatore Monteleone. Cycle-Accurate Network on Chip Simulation with Noxim. ACM Transactions on Modeling and Computer Simulation (TOMACS), 2016.
[14] Hyoukjun Kwon and Tushar Krishna. OpenSMART: Single-Cycle Multi-hop NoC Generator in BSV and Chisel. In Proc. ISPASS, 2017.
[15] Kun-Chih (Jimmy) Chen and Ting-Yi Wang. NN-Noxim: High-Level Cycle-Accurate NoC-based Neural Network Simulator. In Proc. NoCArc, 2018.
[16] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks. arXiv, 2018.
[17] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. SCALE-Sim: Systolic CNN Accelerator Simulator. arXiv, 2018.
[18] https://github.com/ARM-software/SCALE-Sim
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proc. NIPS, 2012.