Moving CNNs from Academic Theory to Embedded Reality

Copyright © 2017 Synopsys
Tom Michiels, System Architect
May 2017

Synopsys at a Glance
• Embedded vision processor leverages many silicon-proven IP products
• DesignWare®: ARC® HS processor, AXI, DMA, Memory Compiler …
• HAPS® FPGA-based rapid prototyping system
• >5,500 Masters/PhD degrees
• >2,400 IP designers
• >2,100 applications engineers
• >$2.2B FY15 revenue
• 33% of revenue on R&D
• >10,200 employees

Requirements for embedded CNN implementations

CNN for a Wide Range of Vision Applications
• Object detection, classification & localization, face recognition
• Visual attention, facial expression recognition
• Gesture recognition / hand tracking
• Scene recognition and labelling, semantic segmentation
  • Sky, mountain, road, tree, building …
• Resolution upscaling
[Figure: street scene with segments labelled car, sky, building, road]

Computation Requirements for CNNs
[Chart: axes "Accuracy" and "Computational complexity"; tick marks at 1 GOPs/frame and 10 GOPs/frame]
• LeNet (1994): 4 layers
• AlexNet (2012): 8 layers, 100 MByte
• VGG-19 (2014): 19 layers, 270 MByte
• GoogLeNet (2014): 22 layers, 20 MByte
• ResNet (2015): 152 layers!, 10 MByte

Typical Power, Performance and Area
Based on 28 nm process node
• Advanced CNN applications
  • Object classification, detection, localization
  • Scene segmentation
  • Super resolution
  • Recursive neural networks
• Implementation on GPP and GP-GPU
• Typical customer targets for 1080p @30 fps
[Figure values: <500 mW, 1-2 mm²; 100 – 1000 GMAC/s; 1-10 W, 50-100 mm²]

CNN Technical Challenges & Solutions

Bit Width Impact on Detection Accuracy
Functional simulation model with varying bit widths (ILSVRC graphs / Caffe-trained models)
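
The slide does not say how values were reduced to a given bit width. As a minimal sketch, assuming symmetric signed fixed-point with round-to-nearest and saturation (an assumption, not the simulation model used for the graphs), the effect of a chosen bit width on a single value can be explored like this. The function name `quantize` is hypothetical.

```c
#include <assert.h>

/* Quantize x (assumed to lie roughly in [-1, 1]) to a signed fixed-point
   code with `bits` total bits, then convert the code back to a double.
   Assumed scheme: round half away from zero, saturate at the code range. */
double quantize(double x, int bits)
{
    assert(bits >= 2 && bits <= 31);
    long scale = 1L << (bits - 1);            /* e.g. 128 for 8 bits */
    long q = (long)(x * (double)scale + (x >= 0.0 ? 0.5 : -0.5));
    if (q > scale - 1) q = scale - 1;         /* largest positive code */
    if (q < -scale)    q = -scale;            /* most negative code */
    return (double)q / (double)scale;
}
```

Calling `quantize(w, 8)` and `quantize(w, 12)` on the same weight `w` shows how the rounding error shrinks as bits are added; with this scheme the error for in-range values is at most half a step, that is 1 / 2^bits.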

Efficient Implementation of Convolution
It’s just 7 nested loops!
/* Loop over all layers */
for (layer = 0;…;layer++) {
  /* Loop over the three dimensions of the output blob */
  for (d_z = 0;…;…) {
    for (d_y = 0;…;…) {
      for (d_x = 0;…;…) {
        r = 0;
        /* Loop over the Z-dimension of the input */
        for (s_z = 0;…;…) {
          /* Loop over the X-Y dimension of the convolution stencil */
          for (c_y = 0;…;…) {
            for (c_x = 0;…;…) {
              r += kernel[layer][d_z][s_z][c_y][c_x]
                 * F[layer][s_z][d_y-c_y][d_x-c_x];
            }
          }
        }
        F[layer+1][d_z][d_y][d_x] = ReLu( r );
      }
    }
  }
}
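
The loop nest above leaves its bounds elided. A compilable sketch of one layer (the six inner loops; the layer loop is left to the caller) might look as follows. The layer shape in the `enum` is hypothetical and chosen only to keep the example small, and the `K - 1` offset is an assumption added so that the flipped-stencil indexing `d_y - c_y` never reads outside the input.

```c
#include <assert.h>

/* Hypothetical layer shape, for illustration only. */
enum { IN_Z = 2, IN_Y = 4, IN_X = 4, K = 3, OUT_Z = 2,
       OUT_Y = IN_Y - K + 1, OUT_X = IN_X - K + 1 };

static float relu(float r) { return r > 0.0f ? r : 0.0f; }

/* One convolution layer followed by ReLU, without padding or stride. */
void conv_layer(float in[IN_Z][IN_Y][IN_X],
                float kernel[OUT_Z][IN_Z][K][K],
                float out[OUT_Z][OUT_Y][OUT_X])
{
    assert(OUT_Y > 0 && OUT_X > 0);
    /* Three loops over the output blob. */
    for (int d_z = 0; d_z < OUT_Z; d_z++) {
        for (int d_y = 0; d_y < OUT_Y; d_y++) {
            for (int d_x = 0; d_x < OUT_X; d_x++) {
                float r = 0.0f;
                /* Loop over the Z-dimension of the input. */
                for (int s_z = 0; s_z < IN_Z; s_z++) {
                    /* Two loops over the X-Y stencil. */
                    for (int c_y = 0; c_y < K; c_y++) {
                        for (int c_x = 0; c_x < K; c_x++) {
                            r += kernel[d_z][s_z][c_y][c_x]
                               * in[s_z][d_y + K - 1 - c_y][d_x + K - 1 - c_x];
                        }
                    }
                }
                out[d_z][d_y][d_x] = relu(r);
            }
        }
    }
}
```

With an all-ones input and an all-ones stencil, every output of this shape is IN_Z * K * K = 18; a stencil of all minus-ones is clamped to 0 by the ReLU.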

Efficient Implementation of Convolution
for (layer = 0;…;…) {
  for (d_z = 0;…;…) {
    for (d_y = 0;…;…) {
      for (d_x = 0;…;…) {
        r = 0;
        for (s_z = 0;…;…) {
          for (c_y = 0;…;…) {
            for (c_x = 0;…;…) {
            }
          }
        }
      }
    }
  }
}
Design Choices
• Over which of these 7 loops do we vectorize?
• Do we split loops into fine-grain and coarse-grain parts?
• How do we nest these loops?
• What intermediate data can we cache?
Efficiency Impact
• Efficiency of vectorization
• Data-reuse of register and local memory
• External memory bandwidth
• Local memory size and bandwidth requirements
• Cost of mux logic
• Opportunity to exploit sparseness of kernels

Different Ways of Vectorizing Convolutions
• Vectorizing too much over one dimension is not efficient
• Vectorizing over both input-feature maps and convolution stencils increases computation without increasing accumulator memory access
• Challenge is efficient vectorization over the convolution stencil
• Vectorizing over Z-dimension of the input feature maps increases parallelism without increasing accumulator bandwidth
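
The last point can be illustrated in scalar C. The sketch below splits the `s_z` loop into a coarse-grain loop and a fine-grain loop of `VEC` lanes; on vector hardware the fine-grain loop would be one multiply-and-reduce operation, so the accumulator is touched once per `VEC` MACs. The vector width `VEC`, the function name `mac_over_z`, and the update counter are hypothetical and exist only to make the effect visible.

```c
#include <assert.h>

enum { VEC = 4 };   /* hypothetical vector width (lanes) */

/* Accumulate one output value for one stencil position across in_z input
   feature maps. *acc_updates reports how often the accumulator was written. */
float mac_over_z(const float *coef, const float *pixels, int in_z,
                 int *acc_updates)
{
    assert(in_z > 0 && in_z % VEC == 0);
    float r = 0.0f;
    *acc_updates = 0;
    for (int s_z = 0; s_z < in_z; s_z += VEC) {      /* coarse-grain loop */
        float partial = 0.0f;
        for (int lane = 0; lane < VEC; lane++)       /* fine-grain: one vector op */
            partial += coef[s_z + lane] * pixels[s_z + lane];
        r += partial;                                /* one accumulator access */
        (*acc_updates)++;
    }
    return r;
}
```

For eight input feature maps this performs eight MACs with only two accumulator updates; widening `VEC` adds parallelism without adding accumulator traffic.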

Vectorizing versus Loop Nesting
• The vectorized dimension and the loop dimension together determine bandwidth
• Orthogonal loop order -> lower bandwidth
[Figure: panels "Iterate Horizontally" and "Iterate Vertically"; block labels 3x3, 12x3, 8x1]

Cost of Memory Access
• Energy cost of local and external memory access
1. Reduce external memory access by optimizing local memory reuse
2. Reduce local memory access by optimizing register reuse

Different Ways of Nesting Convolution Loops
• Once vectorized, every one of the 6 nested loops can be tiled, and every level of the loop can be nested
• The choice of loop ordering will impact
  • Data reuse opportunities
  • Bandwidth
  • Local memory requirements
• To optimize for power and performance, different loop ordering is needed for different layers
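
How loop order shifts traffic between coefficients and accumulators can be seen on a deliberately tiny case. The sketch below is a 1-D, single-channel stand-in for the full loop nest (an assumption made for brevity); the sizes `P` and `C`, the `Traffic` counters, and both function names are hypothetical. Both orders compute the same result but touch memory differently.

```c
#include <assert.h>

enum { P = 6, C = 3 };   /* hypothetical: 6 outputs, 3-tap stencil */

typedef struct { long coef_loads; long acc_accesses; } Traffic;

/* Order A: output loop outside, stencil loop inside.
   The running sum stays in a register; each coefficient is re-read per output. */
Traffic conv_output_outer(const int in[P + C - 1], const int coef[C], int out[P])
{
    Traffic t = {0, 0};
    for (int d = 0; d < P; d++) {
        int r = 0;
        for (int c = 0; c < C; c++) {
            r += coef[c] * in[d + c];
            t.coef_loads++;
        }
        out[d] = r;
        t.acc_accesses++;
    }
    return t;
}

/* Order B: stencil loop outside, output loop inside.
   Each coefficient is read once; every accumulator is revisited per coefficient. */
Traffic conv_stencil_outer(const int in[P + C - 1], const int coef[C], int out[P])
{
    Traffic t = {0, 0};
    for (int d = 0; d < P; d++)
        out[d] = 0;
    for (int c = 0; c < C; c++) {
        int k = coef[c];
        t.coef_loads++;
        for (int d = 0; d < P; d++) {
            out[d] += k * in[d + c];
            t.acc_accesses++;
        }
    }
    return t;
}
```

Order A costs P*C coefficient loads and P accumulator writes; order B costs C coefficient loads and P*C accumulator accesses. Which one is cheaper depends on the layer shape, which is the reason different layers want different orderings.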

External Memory Bandwidth reduction: Example
• CNN layers can have 10s of MBs of feature maps and coefficients
• Storing these intermediate feature maps in external memory may not be necessary if, for subsequent layers, the coefficients fit in the local memory
• Convolutions can be tiled between network layers to keep the intermediate feature maps in local memory
[Figure: two consecutive 3x3 convolution layers]
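
When layers are tiled together, the input tile has to be larger than the output tile, because each k x k convolution consumes a border of k - 1 pixels. A small helper, assuming stride 1 and no padding (assumptions not stated on the slide; the function name is hypothetical), gives the tile edge that must be held in local memory:

```c
#include <assert.h>

/* Edge length of the input tile needed to produce an out_tile x out_tile
   output after `layers` stacked k x k convolutions (stride 1, no padding). */
int fused_input_tile(int out_tile, int k, int layers)
{
    assert(out_tile > 0 && k > 0 && layers >= 0);
    return out_tile + layers * (k - 1);
}
```

For the two 3x3 layers in the figure, a 16x16 output tile needs a 20x20 input tile and an 18x18 intermediate tile, and only the input and output tiles cross the external memory interface.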

Automotive Example
• Scene segmentation on 5-channel 1920x1080 images
• Segmenting into 11 categories
• Weights: Over 100 K values
Network: 5x5, 20 fmap -> Max 2x2 -> 5x5, 40 fmap -> Max 2x2 -> 5x5, 80 fmap -> 1x1, 11 fmap
[Figure labels: 1920x1080x5, 473x263x5; segments labelled road, building, car, sky]
Frames per second                    18 FPS
Cycles per frame                     51M Cycles
MAX VM (Storage of Feature Maps)     151K Bytes
MAX WM (Storage of Weights)          155K Bytes
DMA BW Read                          503 MB/s
DMA BW Write                         102 MB/s
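
The shapes and the weight count of this network can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions without padding, non-overlapping 2x2 max pooling, and no bias terms; none of these is stated on the slide, and the type and function names are hypothetical.

```c
#include <assert.h>

typedef struct { int w, h, ch; } Shape;

/* k x k convolution, stride 1, no padding, producing out_ch feature maps. */
static Shape conv_valid(Shape in, int k, int out_ch)
{
    Shape o = { in.w - k + 1, in.h - k + 1, out_ch };
    return o;
}

/* Non-overlapping p x p max pooling. */
static Shape max_pool(Shape in, int p)
{
    Shape o = { in.w / p, in.h / p, in.ch };
    return o;
}

/* Number of coefficients in a k x k convolution (biases not counted). */
static long conv_weights(Shape in, int k, int out_ch)
{
    return (long)k * k * in.ch * out_ch;
}

/* Walk the four convolution layers of the example; returns total weights. */
long automotive_net(Shape *final_shape)
{
    assert(final_shape != 0);
    Shape s = { 1920, 1080, 5 };
    long weights = 0;
    weights += conv_weights(s, 5, 20); s = conv_valid(s, 5, 20); s = max_pool(s, 2);
    weights += conv_weights(s, 5, 40); s = conv_valid(s, 5, 40); s = max_pool(s, 2);
    weights += conv_weights(s, 5, 80); s = conv_valid(s, 5, 80);
    weights += conv_weights(s, 1, 11); s = conv_valid(s, 1, 11);
    *final_shape = s;
    return weights;
}
```

Under these assumptions the final feature maps are 473 x 263 pixels, which matches the spatial size in the figure label, and the total is 103,380 weights, consistent with "over 100 K values".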

External Memory Bandwidth reduction: Example
for (layer = 0;…;…) {
  for (d_z = 0;…;…) {
    for (d_y = 0;…;…) {
      for (d_x = 0;…;…) {
        r = 0;
        for (s_z = 0;…;…) {
          for (c_y = 0;…;…) {
            for (c_x = 0;…;…) {
            }
          }
        }
      }
    }
  }
}
[Annotation pointing at one loop in the original slide:] This loop should not be the outer loop if you want to minimize the external bandwidth

Convolutional Networks with LSTM
CNNs are used in conjunction with Recurrent Neural Networks (like LSTM)
[Figure: CNN feeding an LSTM]
• Image Caption Generation
• People Detection in Crowded Scenes

Convolutions vs Fully Connected Layers, LSTM, RNN
• Characteristics of convolutions
  • Data-reuse to keep the MACs busy efficiently
  • Complicated loop-order and vector trade-offs
  • Low bandwidth per MAC
• Challenge of fully connected layers, LSTM, RNN
  • High bandwidth per MAC
  • Data management, not raw compute power
  • Exploiting sparsity in computation
Very different characteristics
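
The difference in bandwidth per MAC follows from how often each coefficient is reused. As a rough sketch (the names are hypothetical, and the 4096 x 1000 fully connected layer in the usage note is an illustrative shape, not one from the deck), the number of MACs performed per coefficient fetched can be computed like this:

```c
#include <assert.h>

typedef struct { long long macs; long long weights; } LayerCost;

/* Convolution: every coefficient is reused at each output position. */
LayerCost conv_cost(int out_w, int out_h, int k, int in_ch, int out_ch)
{
    assert(out_w > 0 && out_h > 0 && k > 0 && in_ch > 0 && out_ch > 0);
    LayerCost c;
    c.weights = (long long)k * k * in_ch * out_ch;
    c.macs = c.weights * out_w * out_h;
    return c;
}

/* Fully connected: every coefficient is used exactly once per input vector. */
LayerCost fc_cost(int inputs, int outputs)
{
    assert(inputs > 0 && outputs > 0);
    LayerCost c;
    c.weights = (long long)inputs * outputs;
    c.macs = c.weights;
    return c;
}

/* MACs per coefficient fetched; higher means lower bandwidth per MAC. */
double macs_per_weight(LayerCost c)
{
    return (double)c.macs / (double)c.weights;
}
```

For a 5x5 convolution from 40 to 80 feature maps with a 473 x 263 output, each coefficient serves 124,399 MACs, whereas `fc_cost(4096, 1000)` yields exactly one MAC per coefficient. That is why fully connected and recurrent layers are limited by data movement rather than by raw compute.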

Conclusions
• Choose bit-widths wisely!
  • Lower bit-width saves power and area, but below 10 bits classification accuracy drops
• Reduce internal and external memory bandwidth
  • Choose loop nesting based on the shapes of the convolution layers

Resources
• Website: Synopsys DesignWare EV6 Embedded Vision Processors
• 2016 Embedded Vision Summit presentations:
  • Programming Embedded Vision Processors Using OpenVX
  • Using the OpenCL C Kernel Language for Embedded Vision Processors
• Embedded Vision Alliance article: Facial Analysis Delivers Diverse Vision Processing Capabilities
• Visit the Synopsys booth for demos on Deep Learning & CNN

Thank You
Tom Michiels, System Architect
May 2017
