HKG18-405 - Accelerating Neural Networks for Vision Systems via FPGAs

Accelerating Neural Networks
for Vision Systems via FPGAs
March 2018
Glenn Steiner, Sr. Manager, Xilinx, Inc.
Michaela Blott, Principal Engineer, Xilinx Labs
Giulio Gambardella, Research Scientist, Xilinx Labs
Andreas Schuler, Director, Missing Link Electronics

© Copyright 2018 Xilinx
11:00 - 11:25, 22 March 2018
The performance and accuracy of Convolutional Neural Networks for visual recognition have
reached the point where researchers generally consider them to be more accurate than
traditional algorithmic approaches. In this session we will examine implementation of a
Binary Neural Network (BNN) on an FPGA with embedded processing system demonstrating
four orders of magnitude greater performance than a software implementation on an
embedded processor. We will start with the basic concepts of Convolutional Neural Networks.
Next, we will examine why FPGAs with embedded processors provide the necessary flexibility
to accommodate network precision as well as varying number of neurons and layers. Finally,
we will demonstrate a BNN running on a 96Boards board doing real-time traffic sign recognition at over
8,000 images per second.
Abstract

Challenges in Implementing Neural Networks
Heterogeneous All Programmable Devices
for Neural Networks
Reduced Precision Neural Networks
Design Example - German Traffic Sign Recognition
Summary
Agenda

Challenges in Implementing Neural Networks

Neural Networks (NNs) are based on simple models of the human brain (neurons and synapses)
NNs have the theoretical property of being a “universal approximation function”
–Empirically outperforming other approximator functions
–Increasing adoption for new use cases
Requires less expertise/specialization in the target domain
NNs are rapidly becoming the predominant algorithm
–Outperforming humans and traditional Computer Vision algorithms for image recognition
What Are Neural Networks?
Neural Networks are replacing other approximation solutions

Machine Learning Challenges
Challenge 1:
Different use cases require different networks and
different figures of merit (speed, latency, energy,
accuracy)
Challenge 2:
Billions of multiply-accumulate operations and tens
of megabytes of parameter data
Challenge 3:
Continuous stream of new algorithms: flexibility
is key
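To give a feel for the scale named in Challenge 2, the following is a minimal sketch that counts the multiply-accumulate (MAC) operations of a single convolutional layer. The layer shape is an illustrative assumption that does not come from the slides (the first layer of AlexNet: 96 filters of 11x11x3 producing a 55x55 output), and the function name `conv_macs` is hypothetical.

```python
def conv_macs(out_h, out_w, out_channels, in_channels, kernel_h, kernel_w):
    """MACs for one convolutional layer: one MAC per weight per output value."""
    return out_h * out_w * out_channels * in_channels * kernel_h * kernel_w


# Assumed example shape (not from the slides): AlexNet's first convolution.
macs = conv_macs(out_h=55, out_w=55, out_channels=96,
                 in_channels=3, kernel_h=11, kernel_w=11)
print(f"{macs:,} MACs for one layer and one input image")
```

A single layer already needs on the order of 100 million MACs per image, so a full network evaluated on a video stream quickly reaches the billions mentioned above.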

Each trained neural network yields a different design trade-off in
error, cost, throughput, latency, or power
Customized Neural Networks
Neural Networks need design flexibility
[Figure: axes are error and hardware cost / performance / power]

Heterogeneous All Programmable Devices
for Neural Networks

Zynq UltraScale+ MPSoC
16nm Programmable Logic
• Any-to-Any Connectivity
• Processor Offloading
Graphics Processor
• ARM Mali-400/MP2
• 2D/3D Visualization
Real-Time Processor
• 32-bit Dual-core R5
• 128KB TCM w/ ECC
Application Processor
• 64-bit Multicore A53
• Up to 1.5GHz
Heterogeneous Multiprocessing SoC for ADAS

Functional Partitioning for Optimal Performance and Safety
Heterogeneous Multicore for ADAS Applications
[Figure: ADAS processing stages: Sensing domain (Vision/IR, Radar/Lidar, IR/US/other); Sensor processing and tracking; Sensor fusion / stitching and object classification; Environmental Characterization; Assessment and decision making; Feature Implementation]
ARM® Processing
ARM processor suited for complex decision-
making algorithms common in ADAS apps
ARM also enables feature bundling such as
camera sensors used for multiple applications
Programmable Logic
Programmable logic enables parallel
processing necessary for pixel level analysis
DSP blocks enable hardware acceleration of
real-time sensor inputs
Integrated processors and programmable logic enable flexible partitioning between HW & SW

[Block diagram: central ADAS module data flow]
Camera inputs: OVT10635 Front_Cam, Right_Cam, Left_Cam, Fwd_Cam; VLink MegaPixel Rear_Cam; VLink and Capture blocks
Other inputs: Radar/Laser Sensor(s)
Outputs and peripherals: Display (Display Connectivity, Video Output Control); QSPI FLASH (Image Storage, Image Retrieval); CAN (Vehicle Comms: Actuators & Vehicle Status); Functional Safety Elements
Accelerated functions:
–3D Surround View & Rear View Camera: Distortion Correction, Perspective Projection, Stitching, PiP
–Analytics Acceleration for Pedestrian Detection: Image Scaling, Gradient Extraction, HoG, SVM
–Analytics Acceleration for Lane Departure Warning: Gaussian Filter, Edge Detect & Thin, Lane Pattern Search
–Analytics Acceleration for Traffic Sign Recognition: Pattern Recognition, Optical Flow Motion Estimation
–Analytics Acceleration for Vehicle Detection (FCW): Image Scaling, Haar Descriptor, SVM
–Analytics Acceleration for Blind Spot Detection: Optical Flow Motion Estimation
–Analytics Acceleration for Headlamp Control: Headlamp/TailLamp Classification
–Radar/Laser Sensor(s): Sensor Processing, Sensor Fusion Acceleration
Central ADAS Module Mapping to Zynq
UltraScale+ MPSoC
[Block diagram: Zynq UltraScale+ MPSoC blocks: Peripherals, CSU, PMU, 2x R5, 4x A5x, GPU, DDRC, High Speed Connectivity, OCM, Cache]
External memory: LPDDR3/DDR4 (Frame buffers, ARM Code, Status Repository)
Processor task labels: Sensor Processing Algorithms & Environmental Characterization; Safety Critical Functions; Algo Config & Control; Ped Det Proc & Range Est.; LDW Proc & Warning; Collision Warning Proc.; Traffic Sign Recog Proc.; Headlamp Control Proc.
A5x’s perform sensor processing and environmental
characterization tasks in conjunction with HW accelerators.
A5x’s also implement processing control decisions by setting
parametric registers in PL accelerators (e.g. thresholds for edge
detection). ASIL A/B
Safety critical countermeasure decisions &
actuator commands on lockstep R5's. Data
sharing via OCM. CAN output commands & key
decision points initiated in lockstep R5's with
cross-monitoring and diagnostic-protected
voting in PL. ASIL C/D
Further diagram labels: System Control Decisions; Diagnostics / FuncSafety; Vehicle Comms; Warping and Main Processing
Accelerator partitioning between APU and PL is flexible based on actual loading/resource utilization

Reduced Precision Neural Networks

Logic cost per operation is greatly reduced
–Today’s FPGAs have a much higher peak performance for reduced precision
operations
Memory cost is greatly reduced
–Large networks can fit entirely into on-chip memory (OCM) (UltraRAM, BRAM)
Reducing Precision and Fixed Point saves Power
Reduced Precision Saves Logic, Memory & Power with Increased
Performance
Precision   Cost per Op (LUT)   Cost per Op (DSP)   MB needed (AlexNet)   TOps/s (VU9P)**   TOps/s (ZU19EG)*
1b          2.5                 0                   7.6                   ~100              ~66
8b          45                  0                   61                    ~6                ~4
32b         178                 2                   244                   ~1                ~0.3
(Slide annotation: 100x)
*Assumptions: Application can fill device to 70% (fully parallelizable), 250 MHz
**Assumptions: Application can fill device to 70% (fully parallelizable), 300 MHz
***Assumptions: HLS overhead included
Source: Bill Dally (Stanford), Cadence Embedded Neural
Network Summit, February 1, 2017
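The "MB needed (AlexNet)" figures on this slide follow from simple arithmetic on the parameter count. The sketch below assumes roughly 61 million parameters for AlexNet, a value implied by the 61 MB entry at 8 bits rather than stated on the slide; the names `ALEXNET_PARAMS` and `param_megabytes` are hypothetical.

```python
# Assumption: about 61 million parameters, implied by 61 MB at 8 bits per parameter.
ALEXNET_PARAMS = 61_000_000


def param_megabytes(num_params, bits_per_param):
    """Storage for all parameters in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


for bits in (1, 8, 32):
    print(f"{bits:>2}b: {param_megabytes(ALEXNET_PARAMS, bits):.1f} MB")
```

Under that assumption the 1b, 8b and 32b rows come out at about 7.6, 61 and 244 MB, which is why only the reduced-precision variants are candidates for on-chip memory.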

Custom-tailored hardware
–Customized dataflow architecture to
match network topology
–Customized data types
–Customized to meet design targets
Keep all parameters on-chip
Automatically generated from CNN
description
–Uses a synthesizable C++ NN description
–Supports portability, scalability & rapid
exploration
Design Principles
[Figure labels: 1 MOPS, 10 MOPS, 1 PE, 10 PE]
Customized Dataflow Architecture
Synthesizable CNN Description
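The labels on this slide (1 MOPS paired with 1 PE, 10 MOPS paired with 10 PE) suggest that each layer receives processing elements (PEs) in proportion to its compute load, so that no layer becomes the bottleneck of the pipeline. The following is a minimal sketch of that idea under stated assumptions: the per-layer loads and the PE budget are hypothetical, and `allocate_pes` is an illustrative helper, not the authors' tool flow.

```python
def allocate_pes(layer_mops, total_pes):
    """Give each layer a PE share proportional to its compute load (at least 1).

    Rounding means the shares may not add up exactly to total_pes.
    """
    total_mops = sum(layer_mops)
    return [max(1, round(total_pes * mops / total_mops)) for mops in layer_mops]


# Hypothetical per-layer compute loads in MOPS and a hypothetical PE budget.
layer_mops = [1, 10, 10, 1]
pes = allocate_pes(layer_mops, total_pes=22)

# Relative time per layer: work divided by the PEs assigned to it.
time_per_layer = [mops / pe for mops, pe in zip(layer_mops, pes)]
print(pes, time_per_layer)
```

When every layer takes the same relative time, the dataflow pipeline is balanced and throughput is limited by all stages equally rather than by one slow layer.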

Just reducing precision:
–Reduces hardware cost & increases
error
Recover accuracy
–By retraining & increasing network
size
1b, 2b and 4b provide
Pareto-optimal solutions
Maintaining Accuracy?
Compensating Quantization with Network
Complexity
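The first half of this trade-off, that fewer bits mean more error before any retraining, can be illustrated with a small sketch. It applies a plain uniform quantizer to randomly drawn weights in [-1, 1]; both the quantizer and the weight distribution are assumptions for illustration and are not the training scheme used by the authors. Retraining and network enlargement are not shown.

```python
import random


def quantize(weight, bits):
    """Uniform quantizer with 2**bits levels spread over [-1, 1]."""
    step = 2.0 / (2 ** bits - 1)
    return round((weight + 1.0) / step) * step - 1.0


def mean_squared_error(weights, bits):
    """Average squared difference between weights and their quantized values."""
    return sum((w - quantize(w, bits)) ** 2 for w in weights) / len(weights)


random.seed(0)  # fixed seed so the run is repeatable
weights = [random.uniform(-1.0, 1.0) for _ in range(10_000)]  # hypothetical weights

errors = {bits: mean_squared_error(weights, bits) for bits in (1, 2, 4, 8)}
for bits, err in errors.items():
    print(f"{bits}b quantization error (MSE): {err:.6f}")
```

The error grows steadily as precision drops from 8b to 1b, which is the loss that retraining and a larger network are meant to win back.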

Design Example
German Traffic Sign Recognition

German Road Sign Database
–50,000+ 32x32-pixel images for training
–44 classes (43 road signs, 1 background)
–Training via Amazon Web Services
• AWS: p2.xlarge Instance – 8 hours, $7.78 (about 6.5 EUR)
Binary Neural Network Characteristics
–6 convolutional layers
–2 max pool layers
–3 fully connected layers
Neural Network Example
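The slides do not spell out how a binarized layer is computed. A commonly used formulation for binary neural networks, shown here as an assumption rather than as the authors' implementation, encodes +1/-1 values as single bits and replaces each multiply-accumulate with an XNOR followed by a population count. The helper names `pack_bits` and `binary_dot` are hypothetical.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer: bit i is 1 where values[i] == +1."""
    word = 0
    for i, value in enumerate(values):
        if value == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, length):
    """Dot product of two packed +1/-1 vectors using XNOR and popcount."""
    mask = (1 << length) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    # Each agreeing position contributes +1, each disagreeing position -1.
    return 2 * agreements - length


activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]

fast = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
reference = sum(a * w for a, w in zip(activations, weights))
print(fast, reference)
```

Both results agree, and the bitwise form maps naturally onto FPGA lookup tables, which is consistent with the low LUT cost per 1b operation quoted earlier in the deck.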

Neural Network Performance Results
Up to 8,600 times faster when accelerated with programmable logic

Performance Metric     Software Only               Programmable Logic Accelerated
Tiles per second       2.2                         19,000
Scene rate (fps)       0.011 (92 sec per frame)    94
Overall Acceleration   -                           8,600X
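The scene rates and the overall acceleration follow from the tile rates once the number of tiles per scene is known. The sketch below uses the 202 tiles per scene given on the backup slide of this deck; the variable names are illustrative.

```python
TILES_PER_SCENE = 202  # from the backup slide: 6 + 36 + 160 tiles per scene

software_tiles_per_s = 2.2
accelerated_tiles_per_s = 19_000

software_fps = software_tiles_per_s / TILES_PER_SCENE
accelerated_fps = accelerated_tiles_per_s / TILES_PER_SCENE
speedup = accelerated_tiles_per_s / software_tiles_per_s

print(f"Software only: {software_fps:.3f} fps ({1 / software_fps:.0f} s per scene)")
print(f"Accelerated:   {accelerated_fps:.0f} fps")
print(f"Speedup:       {speedup:.0f}x")
```

The computed speedup is slightly above 8,600, which matches the rounded figure on the slide.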

Summary

Neural Networks enable accurate and efficient image classification
Programmable logic enables network and computation adaptation
based on system level requirements
Single-chip Zynq UltraScale+ MPSoCs enable:
–Increased performance
–Reduced power
–Reduced cost
–Flexible field upgrades
Zynq UltraScale+ MPSoCs Ideal for Neural
Networks
See our demonstration Friday

Submit your most creative, most out-of-the box AI or ML application
at the Xilinx or Avnet table during Demo Friday (12:00 – 14:00).
The best 30 get a FREE Ultra96 board plus software to
help you realize your vision.
The first twenty to submit a working design by May 25th, 2018 get a
$25 Amazon Gift Card.
ONE Winner announced through Xilinx social media channels. If
it’s you, you’re invited to present your design to your peers in
industry at Xilinx Developer Forum 2018.
The Future is Ultra96 Xilinx Contest

Accelerating Neural
Networks
for Vision Systems via
FPGAs
Questions?

German Road Sign Database
– ~1,000 images, rotated & scaled to various angles & sizes, yielding ~50,000 32x32-pixel images for training
– http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
– 44 Classes (43 Road Signs, 1 background)
– Amazon Web Services used for training (500 epochs)
• p2.xlarge Instance – 8 hours
• $7.78 (about 6.5 EUR)
BNN Characteristics
– Topology: 6 convolutional layers, 2 Max Pool layers and 3 Fully Connected layers
– Compute requirement: 112 MOPS/Frame
– Memory requirement: 1.54 MParams (fully binarized)
Scene Tiling
– For compute efficiency, the frame is resized rather than the tiles
• Three layers to cover different sign sizes: 54x32, 78x44, 110x64 (pixels)
– 202 32x32 Tiles processed per Scene
• 6 in first layer, 36 in second layer, 160 in third layer
• 13% step size
Neural Network Example
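Two of the figures on this slide can be checked with short arithmetic: the tile count per scene and the storage needed for a fully binarized parameter set. The sketch assumes one bit per parameter, as "fully binarized" implies, and includes a 32-bit figure only for comparison; the variable names are illustrative.

```python
# Tiles per scene: sum over the three resized layers listed on the slide.
tiles_per_layer = [6, 36, 160]
tiles_per_scene = sum(tiles_per_layer)

# Parameter storage: 1.54 million parameters at 1 bit each (fully binarized).
NUM_PARAMS = 1_540_000
binary_kilobytes = NUM_PARAMS / 8 / 1000
float32_megabytes = NUM_PARAMS * 32 / 8 / 1e6  # comparison only, not from the slide

print(f"Tiles per scene: {tiles_per_scene}")
print(f"Binarized parameters: {binary_kilobytes:.1f} KB")
print(f"Same parameters at 32 bits: {float32_megabytes:.2f} MB")
```

At under 200 KB the binarized parameters are small enough to be kept on-chip, in line with the design principle stated earlier in the deck.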
