The document discusses ITRI's Deep Learning Accelerator (DLA) architecture, emphasizing its customizability and efficiency for deep neural network inference. It details the principal components, including the convolution processors and memory configurations needed to optimize performance across different AI models, and outlines key aspects such as parallelism, memory bandwidth, and energy efficiency that underpin its acceleration of computer vision workloads.
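The interplay of parallelism and memory bandwidth mentioned above can be illustrated with a tiled convolution loop. This is a minimal sketch, not ITRI's actual dataflow: the function name `conv2d_tiled` and the tiling scheme are assumptions chosen to show how computing the output plane tile by tile lets an accelerator reuse input windows and weights from on-chip buffers instead of refetching them from external memory.

```python
import numpy as np

def conv2d_tiled(x, w, tile=4):
    """Direct 2D convolution (valid padding) computed tile by tile.

    Hypothetical illustration: tiling the output plane mimics how a
    DLA-style accelerator keeps a small working set in local buffers,
    trading off-chip memory bandwidth for on-chip data reuse.
    """
    C, H, W = x.shape          # input: channels x height x width
    K, _, R, S = w.shape       # weights: out-ch x in-ch x kH x kW
    OH, OW = H - R + 1, W - S + 1
    y = np.zeros((K, OH, OW))
    for oh in range(0, OH, tile):          # iterate over output tiles;
        for ow in range(0, OW, tile):      # each tile is an independent
            th = min(tile, OH - oh)        # unit of work that hardware
            tw = min(tile, OW - ow)        # can assign to a PE array
            # one input patch covers the whole tile and stays resident
            patch = x[:, oh:oh + th + R - 1, ow:ow + tw + S - 1]
            for i in range(th):
                for j in range(tw):
                    win = patch[:, i:i + R, j:j + S]
                    # all K output channels reuse the same input window
                    y[:, oh + i, ow + j] = np.tensordot(w, win, axes=3)
    return y
```

In hardware, each output tile would map to an array of processing elements working in parallel, while the shared input patch is fetched from external memory only once per tile.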