Accelerating Neural Networks
for Vision Systems via FPGAs
March 2018
Glenn Steiner, Sr. Manager, Xilinx, Inc.
Michaela Blott, Principal Engineer, Xilinx Labs
Giulio Gambardella, Research Scientist, Xilinx Labs
Andreas Schuler, Director, Missing Link Electronics
© Copyright 2018 Xilinx
11:00 - 11:25, 22 March 2018
Abstract
The performance and accuracy of Convolutional Neural Networks for visual recognition have
reached the point where researchers generally consider them to be more accurate than
traditional algorithmic approaches. In this session we will examine the implementation of a
Binary Neural Network (BNN) on an FPGA with an embedded processing system, demonstrating
four orders of magnitude greater performance than a software implementation on an
embedded processor. We will start with the basic concepts of Convolutional Neural Networks.
Next, we will examine why FPGAs with embedded processors provide the necessary flexibility
to accommodate network precision as well as a varying number of neurons and layers. Finally,
we will demonstrate a BNN running on a 96Boards board doing real-time traffic sign
recognition at over 8,000 images per second.
Agenda
• Challenges in Implementing Neural Networks
• Heterogeneous All Programmable Devices for Neural Networks
• Reduced Precision Neural Networks
• Design Example - German Traffic Sign Recognition
• Summary
Challenges in Implementing Neural Networks
What Are Neural Networks?
Neural Networks are replacing other approximation solutions
• Neural Networks (NNs) are based on simple models of the human brain (neurons and synapses)
• NNs have the theoretical property of being a “universal approximation function”
  –Empirically outperforming other approximator functions
  –Increasing adoption for new use cases
• Requires less expertise/specialization in the target domain
• NNs are rapidly becoming the predominant algorithm
  –Outperforming humans and traditional Computer Vision algorithms for image recognition
Machine Learning Challenges
Challenge 1:
Different use cases require different networks and different figures of merit (speed, latency,
energy, accuracy)
Challenge 2:
Billions of multiply-accumulate operations and tens of megabytes of parameter data
Challenge 3:
Continuous stream of new algorithms; flexibility is key
Customized Neural Networks
Neural Networks need design flexibility
Each trained neural network yields a different design trade-off in error, cost, throughput,
latency, or power
[Figure: error plotted against hardware cost / performance / power]
Heterogeneous All Programmable Devices for Neural Networks
Zynq UltraScale+ MPSoC
Heterogeneous Multiprocessing SoC for ADAS
16nm Programmable Logic
• Any-to-Any Connectivity
• Processor Offloading
Graphics Processor
• ARM Mali-400 MP2
• 2D/3D Visualization
Real-Time Processor
• 32-bit Dual-core R5
• 128KB TCM w/ ECC
Application Processor
• 64-bit Multicore A53
• Up to 1.5GHz
Heterogeneous Multicore for ADAS Applications
Functional Partitioning for Optimal Performance and Safety
[Figure: ADAS processing flow with the blocks: Sensing domain (Vision/IR, Radar/Lidar, IR/US/other); Sensor processing and tracking; Sensor fusion / stitching and object classification; Environmental Characterization; Assessment and decision making; Feature Implementation]
ARM® Processing
• ARM processor suited for complex decision-making algorithms common in ADAS apps
• ARM also enables feature bundling, such as camera sensors used for multiple applications
Programmable Logic
• Programmable logic enables the parallel processing necessary for pixel-level analysis
• DSP blocks enable hardware acceleration of real-time sensor inputs
Integrated processors and programmable logic enable flexible partitioning between HW & SW
Central ADAS Module Mapping to Zynq UltraScale+ MPSoC
[Figure: block diagram of a central ADAS module mapped onto a Zynq UltraScale+ MPSoC. External blocks: OVT10635 cameras (Front_Cam, Right_Cam, Left_Cam, Fwd_Cam) and a MegaPixel Rear_Cam attached over VLink, Radar/Laser Sensor(s), Display, QSPI FLASH, CAN Vehicle Comms (Actuators & Vehicle Status), and LPDDR3/DDR4 (Frame buffers, ARM Code, Status Repository). On-chip blocks: Capture, Image Storage, Image Retrieval, Video Output, Display Connectivity, Control, Functional Safety Elements, Peripherals, CSU, PMU, 2x R5, 4x A5x, GPU, DDRC, High Speed Connectivity, OCM, Cache.]
Programmable logic accelerators (Warping and Main Processing)
• 3D Surround View & Rear View Camera: Distortion Correction, Perspective Projection, Stitching, PiP
• Analytics Acceleration for Pedestrian Detection: Image Scaling, Gradient Extraction, HoG, SVM
• Analytics Acceleration for Lane Departure Warning: Gaussian Filter, Edge Detect & Thin, Lane Pattern Search
• Analytics Acceleration for Traffic Sign Recognition: Pattern Recognition, Optical Flow Motion Estimation
• Analytics Acceleration for Vehicle Detection (FCW): Image Scaling, Haar Descriptor, SVM
• Analytics Acceleration for Blind Spot Detection: Optical Flow Motion Estimation
• Analytics Acceleration for Headlamp Control: Headlamp/TailLamp Classification
• Sensor Fusion Acceleration: Radar/Laser Sensor Processing
Sensor Processing Algorithms & Environmental Characterization (ASIL A/B)
• Algo Config & Control
• Ped Det Proc & Range Est.
• LDW Proc & Warning
• Collision Warning Proc.
• Traffic Sign Recog Proc.
• Headlamp Control Proc.
A5x’s perform sensor processing and environmental characterization tasks in conjunction with HW
accelerators. A5x’s also implement processing control decisions by setting parametric registers in
PL accelerators (e.g. thresholds for edge detection).
Safety Critical Functions (ASIL C/D)
• System Control Decisions
• Diagnostics / FuncSafety
• Vehicle Comms
Safety critical countermeasure decisions & actuator commands run on lockstep R5's. Data sharing is
via OCM. CAN output commands & key decision points are initiated in lockstep R5's with
cross-monitoring and diagnostic-protected voting in PL.
Accelerator partitioning between APU and PL is flexible based on actual loading/resource
utilization
Reduced Precision Neural Networks
Reduced Precision Saves Logic, Memory & Power with Increased Performance
• Logic cost per operation is greatly reduced
  –Today’s FPGAs have a much higher peak performance for reduced precision operations
• Memory cost is greatly reduced
  –Large networks can fit entirely into on-chip memory (OCM) (UltraRAM, BRAM)
• Reducing Precision and Fixed Point saves Power
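One reason a 1-bit operation needs so little logic is that, once weights and activations are restricted to -1 and +1, a multiply-accumulate can be replaced by an XNOR followed by a population count. The slides do not spell this out, so the Python sketch below is only a generic illustration of that well-known BNN technique; the helper names pack_bits and binary_dot are hypothetical and not taken from the talk.

```python
def pack_bits(values):
    """Pack a list of -1/+1 values into an int (bit i set means values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, w_word, n):
    """Dot product of two length-n {-1,+1} vectors given in packed form."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ w_word) & mask   # XNOR: 1 wherever the two signs match
    matches = bin(agree).count("1")     # population count
    return 2 * matches - n              # matches minus mismatches


# Illustrative vectors only; compare against the plain multiply-accumulate.
activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(a * w for a, w in zip(activations, weights))
result = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
print(result, reference)
```

In hardware the XNOR and popcount map onto a handful of LUTs, which is consistent with the low per-operation LUT cost quoted for 1b precision.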
Precision   Cost per Op (LUT)   Cost per Op (DSP)   MB needed (AlexNet)   TOps/s (VU9P)**   TOps/s (ZU19EG)*
1b          2.5                 0                   7.6                   ~100              ~66
8b          45                  0                   61                    ~6                ~4
32b         178                 2                   244                   ~1                ~0.3
100x (figure annotation)
*Assumptions: Application can fill device to 70% (fully parallelizable), 250 MHz
**Assumptions: Application can fill device to 70% (fully parallelizable), 300 MHz
***Assumptions: HLS overhead included
Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017
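The "MB needed (AlexNet)" figures can be reproduced with simple arithmetic. The sketch below assumes roughly 61 million parameters for AlexNet, a commonly quoted figure that is not stated on the slide, and a hypothetical helper named param_megabytes.

```python
ALEXNET_PARAMS = 61_000_000  # assumption: approximate AlexNet parameter count


def param_megabytes(num_params, bits_per_param):
    """Parameter storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


for bits in (1, 8, 32):
    print(f"{bits}b: {param_megabytes(ALEXNET_PARAMS, bits):.1f} MB")
```

Under that assumption, storage scales linearly with the bit width, which is why only the 1b variant comes close to fitting in on-chip memory.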
Design Principles
• Custom-tailored hardware
  –Customized dataflow architecture to match network topology
  –Customized data types
  –Customized to meet design targets
• Keep all parameters on-chip
• Automatically generated from CNN description
  –Uses a synthesizable C++ NN description
  –Supports portability, scalability & rapid exploration
[Figure: Customized Dataflow Architecture and Synthesizable CNN Description; labels: 1 MOPS, 10 MOPS, 1 PE, 10 PE]
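The figure labels (1 MOPS with 1 PE, 10 MOPS with 10 PE) suggest that processing elements are assigned to each layer in proportion to its workload, so that every stage of the dataflow pipeline takes about the same time. The slides do not describe the actual allocation method; the Python sketch below is one simple way to express that idea, and allocate_pes and pe_budget are hypothetical names.

```python
def allocate_pes(layer_mops, pe_budget):
    """Split a PE budget across layers in proportion to their workload."""
    total = sum(layer_mops)
    # Every layer gets at least one PE; because of rounding, the sum of the
    # result may differ slightly from pe_budget.
    return [max(1, round(pe_budget * mops / total)) for mops in layer_mops]


layer_mops = [1, 10]  # per-layer workloads, taken from the figure labels
pes = allocate_pes(layer_mops, pe_budget=11)
relative_time = [mops / pe for mops, pe in zip(layer_mops, pes)]
print(pes, relative_time)
```

When the relative times are equal, no layer stalls the pipeline, which is the point of matching the dataflow architecture to the network topology.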
Maintaining Accuracy?
Compensating Quantization with Network Complexity
• Just reducing precision:
  –Reduces hardware cost & increases error
• Recuperate accuracy
  –By retraining & increasing network size
• 1b, 2b and 4b provide Pareto-optimal solutions
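A design point is Pareto-optimal when no other point is at least as good in both hardware cost and error and strictly better in one of them. The Python sketch below shows how such a front can be extracted from a list of candidates. The candidate values are made-up placeholders, not measurements from the talk, and pareto_front is a hypothetical helper.

```python
def pareto_front(points):
    """Return names of points not dominated in (cost, error); lower is better."""
    front = []
    for name, cost, error in points:
        dominated = any(
            (c <= cost and e <= error) and (c < cost or e < error)
            for _, c, e in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder (name, hardware cost, error) tuples, NOT data from the slides.
candidates = [
    ("A", 1.0, 9.0),
    ("B", 2.0, 6.0),
    ("C", 3.0, 7.0),  # dominated by B: higher cost and higher error
    ("D", 5.0, 4.0),
]
print(pareto_front(candidates))
```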
Design Example
German Traffic Sign Recognition
Neural Network Example
German Road Sign Database
–50,000+ 32x32 pixel images for training
–44 classes (43 road signs, 1 background)
–Training via Amazon Web Services
  • AWS p2.xlarge instance – 8 hours → $7.78 (6,5 €)
Binary Neural Network Characteristics
–6 convolutional layers
–2 max pool layers
–3 fully connected layers
Neural Network Performance Results
Up to 8,600 times faster when accelerated with programmable logic
Performance Metric     Software Only              Programmable Logic Accelerated
Tiles per second       2.2                        19,000
Scene rate (fps)       0.011 (92 sec per frame)   94
Overall Acceleration   -                          8,600X
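The rows of this table are consistent with each other once the 202 tiles per scene listed under Scene Tiling at the end of the deck are taken into account. The short Python check below uses only those published numbers; the variable names are illustrative.

```python
TILES_PER_SCENE = 202      # from the Scene Tiling notes at the end of the deck
SW_TILES_PER_SEC = 2.2     # software only
PL_TILES_PER_SEC = 19_000  # programmable logic accelerated

speedup = PL_TILES_PER_SEC / SW_TILES_PER_SEC
sw_seconds_per_scene = TILES_PER_SCENE / SW_TILES_PER_SEC
pl_scenes_per_sec = PL_TILES_PER_SEC / TILES_PER_SCENE

print(f"speedup: {speedup:.0f}x")
print(f"software: {sw_seconds_per_scene:.0f} s per scene")
print(f"accelerated: {pl_scenes_per_sec:.0f} scenes per second")
```

The computed speedup is slightly above the rounded 8,600X quoted on the slide.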
Summary
Zynq UltraScale+ MPSoCs Ideal for Neural Networks
• Neural Networks enable accurate and efficient image classification
• Programmable logic enables network and computation adaptation based on system-level
  requirements
• Single-chip Zynq UltraScale+ MPSoCs enable:
  –Increased performance
  –Reduced power
  –Reduced cost
  –Flexible field upgrades
See our demonstration Friday
The Future is Ultra96 Xilinx Contest
Submit your most creative, most out-of-the-box AI or ML application at the Xilinx or Avnet
table during Demo Friday (12:00 – 14:00).
The best 30 get a FREE Ultra96 board plus software to help you realize your vision.
The first twenty to submit a working design by May 25th, 2018 get a $25 Amazon Gift Card.
ONE winner announced through Xilinx social media channels. If it’s you, you’re invited to
present your design to your peers in industry at Xilinx Developer Forum 2018.
Accelerating Neural Networks for Vision Systems via FPGAs
Questions?
Neural Network Example
German Road Sign Database
– ~1,000 images, rotated & scaled to various angles & sizes → ~50,000 32x32 pixel images for training
– http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
– 44 classes (43 road signs, 1 background)
– Amazon Web Services used for training (500 epochs)
  • p2.xlarge instance – 8 hours
  • $7.78 (6,5 €)
BNN Characteristics
– Topology: 6 convolutional layers, 2 max pool layers and 3 fully connected layers
– Compute requirement: 112 MOPS/Frame
– Memory requirement: 1.54 MParams (fully binarized)
Scene Tiling
– For compute efficiency, the frame is resized rather than the tiles
  • Three layers to cover different sign sizes: 54x32, 78x44, 110x64 (pixels)
– 202 32x32 tiles processed per scene
  • 6 in first layer, 36 in second layer, 160 in third layer
  • 13% step size
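The BNN figures above translate into a small memory footprint and a large sustained operation rate. The Python sketch below works through that arithmetic under two assumptions that the slides do not state explicitly: each binarized parameter occupies exactly one bit, and "Frame" in "112 MOPS/Frame" means one 32x32 tile, so the 19,000 tiles per second from the performance results applies.

```python
PARAMS = 1_540_000           # 1.54 MParams, fully binarized
OPS_PER_TILE = 112_000_000   # 112 MOPS/Frame, assuming one frame = one tile
TILES_PER_SEC = 19_000       # accelerated throughput from the results table

# Assumption: one bit per binarized parameter.
param_kilobytes = PARAMS / 8 / 1e3
sustained_tops = OPS_PER_TILE * TILES_PER_SEC / 1e12
tiles_per_scene = 6 + 36 + 160  # tile counts of the three layers

print(f"parameters: {param_kilobytes:.1f} KB")
print(f"sustained rate: {sustained_tops:.2f} TOps/s")
print(f"tiles per scene: {tiles_per_scene}")
```

Under these assumptions the parameters fit comfortably in on-chip memory, in line with the design principle of keeping all parameters on-chip.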

HKG18-405 - Accelerating Neural Networks for Vision Systems via FPGAs

  • 1. Accelerating Neural Networks for Vision Systems via FPGAs March 2018 Glenn Steiner, Sr. Manager, Xilinx, Inc. Michaela Blott, Principal Engineer, Xilinx Labs Giulio Gambardella, Research Scientist, Xilinx Labs Andreas Schuler, Director, Missing Link Electronics
  • 2. © Copyright 2018 Xilinx 11:00 - 11:25, 22 March 2018 The performance and accuracy of Convolutional Neural Networks for visual recognition has reached the point where researchers generally consider them to be more accurate than traditional algorithmic approaches. In this session we will examine implementation of a Binary Neural Network (BNN) on an FPGA with embedded processing system demonstrating four orders of magnitude greater performance than a software implementation on an embedded processor. We will start with the basic concepts of Convolutional Neural Networks. Next, we will examine why FPGAs with embedded processors provide the necessary flexibility to accommodate network precision as well as varying number of neurons and layers. Finally, we will demonstrate a BNN running on a 96 board doing real-time traffic recognition at over 8,000 images per second. Page 2 Abstract
  • 3. © Copyright 2018 Xilinx Challenges in Implementing Neural Networks Heterogeneous All Programmable Devices for Neural Networks Reduced Precision Neural Networks Design Example - German Traffic Sign Recognition Summary Page 3 Agenda
  • 4. Challenges in Implementing Neural Networks
  • 5. What Are Neural Networks? Neural Networks are replacing other approximation solutions. Neural Networks (NNs) are based on simple models of the human brain (neurons and synapses). NNs have the theoretical property of being a “universal approximation function”: they empirically outperform other approximator functions and see increasing adoption for new use cases. They require less expertise/specialization in the target domain. NNs are rapidly becoming the predominant algorithm, outperforming humans and traditional Computer Vision algorithms for image recognition.
  • 6. Machine Learning Challenges. Challenge 1: Different use cases require different networks and different figures of merit (speed, latency, energy, accuracy). Challenge 2: Billions of multiply-accumulate operations and tens of megabytes of parameter data. Challenge 3: Continuous stream of new algorithms. Flexibility is key.
  • 7. Customized Neural Networks: Neural Networks need design flexibility. Each trained neural network yields a different design trade-off in error, cost, throughput, latency, or power. [Chart: error versus hardware cost / performance / power]
  • 8. Heterogeneous All Programmable Devices for Neural Networks
  • 9. Zynq UltraScale+ MPSoC: Heterogeneous Multiprocessing SoC for ADAS
      – 16nm Programmable Logic: any-to-any connectivity; processor offloading
      – Graphics Processor: ARM Mali-400/MP2; 2D/3D visualization
      – Real-Time Processor: 32-bit dual-core R5; 128KB TCM w/ ECC
      – Application Processor: 64-bit multicore A53; up to 1.5GHz
  • 10. Heterogeneous Multicore for ADAS Applications: Functional Partitioning for Optimal Performance and Safety
      – Diagram stages: Sensing domain (Vision/IR, Radar/Lidar, IR/US/other) with sensor processing and tracking; Environmental Characterization with sensor fusion / stitching and object classification; Feature Implementation with assessment and decision making
      – ARM® Processing: the ARM processor is suited for the complex decision-making algorithms common in ADAS apps; ARM also enables feature bundling, such as camera sensors used for multiple applications
      – Programmable Logic: programmable logic enables the parallel processing necessary for pixel-level analysis; DSP blocks enable hardware acceleration of real-time sensor inputs
      – Integrated processors and programmable logic enable flexible partitioning between HW & SW
  • 11. Central ADAS Module Mapping to Zynq UltraScale+ MPSoC [block diagram]
      – Inputs and outputs: radar/laser sensor(s); OVT10635 front, right, left and forward cameras and a MegaPixel rear camera over VLink; QSPI flash; display; CAN vehicle comms (actuators & vehicle status)
      – Image pipeline blocks: capture, image storage, image retrieval, display connectivity, video output control, functional safety elements
      – 3D Surround View & Rear View Camera: distortion correction, perspective projection, stitching, PiP
      – Analytics Acceleration for Pedestrian Detection: image scaling, gradient extraction, HoG, SVM
      – Analytics Acceleration for Lane Departure Warning: Gaussian filter, edge detect & thin, lane pattern search
      – Analytics Acceleration for Traffic Sign Recognition: pattern recognition, optical flow motion estimation
      – Analytics Acceleration for Vehicle Detection (FCW): image scaling, Haar descriptor, SVM
      – Analytics Acceleration for Blind Spot Detection: optical flow motion estimation
      – Analytics Acceleration for Headlamp Control: headlamp/taillamp classification
      – Radar/laser path: sensor processing, sensor fusion acceleration, fusion
      – Processing system: peripherals, CSU, PMU, 2x R5, 4x A5x, GPU, DDRC, high speed connectivity, OCM, cache; LPDDR3/DDR4 (frame buffers, ARM code, status repository)
      – Software tasks: Sensor Processing Algorithms & Environmental Characterization (algo config & control, ped det proc & range est., LDW proc & warning, collision warning proc., traffic sign recog proc., headlamp control proc., warping and main processing); Safety Critical Functions (system control decisions, diagnostics / func safety, vehicle comms)
      – A5x’s perform sensor processing and environmental characterization tasks in conjunction with HW accelerators. A5x’s also implement processing control decisions by setting parametric registers in PL accelerators (e.g. thresholds for edge detection). ASIL A/B.
      – Safety critical countermeasure decisions & actuator commands run on lockstep R5's. Data sharing via OCM. CAN output commands & key decision points are initiated in lockstep R5's with cross-monitoring and diagnostic-protected voting in PL. ASIL C/D.
      – Accelerator partitioning between APU and PL is flexible based on actual loading/resource utilization
  • 12. Reduced Precision Neural Networks
  • 13. Reduced Precision Saves Logic, Memory & Power with Increased Performance. Logic cost per operation is greatly reduced: today’s FPGAs have a much higher peak performance for reduced precision operations. Memory cost is greatly reduced: large networks can fit entirely into on-chip memory (OCM) (UltraRAM, BRAM). Reducing precision and using fixed point saves power.

      Precision | Cost per Op (LUT) | Cost per Op (DSP) | MB needed (AlexNet) | TOps/s (VU9P)** | TOps/s (ZU19EG)*
      1b        | 2.5               | 0                 | 7.6                 | ~100            | ~66
      8b        | 45                | 0                 | 61                  | ~6              | ~4
      32b       | 178               | 2                 | 244                 | ~1              | ~0.3

      Annotation on the table: 100x
      *Assumptions: application can fill device to 70% (fully parallelizable), 250 MHz
      **Assumptions: application can fill device to 70% (fully parallelizable), 300 MHz
      ***Assumptions: HLS overhead included
      Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017
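The "MB needed (AlexNet)" figures on slide 13 follow from parameter count times bits per parameter. A minimal sketch of that arithmetic, assuming AlexNet has roughly 61 million parameters (a commonly quoted figure, not stated on the slide) and 1 MB = 10^6 bytes:

```python
# Rough parameter-memory estimate for a network at a given weight precision.
# ALEXNET_PARAMS is an assumption (about 61 million), not a figure from the slide.
ALEXNET_PARAMS = 61_000_000


def param_memory_mb(num_params, bits_per_param):
    """Return the storage needed for the parameters, in MB (10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1_000_000


if __name__ == "__main__":
    for bits in (1, 8, 32):
        print(f"{bits:>2}b: {param_memory_mb(ALEXNET_PARAMS, bits):.1f} MB")
```

Under that assumption the results land close to the slide's 7.6, 61 and 244 MB, which is why a fully binarized network can plausibly be held in on-chip memory while a 32-bit one cannot.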
  • 14. Design Principles
      – Custom-tailored hardware: customized dataflow architecture to match network topology; customized data types; customized to meet design targets
      – Keep all parameters on-chip
      – Automatically generated from CNN description: uses a synthesizable C++ NN description; supports portability, scalability & rapid exploration
      – [Figure: Customized Dataflow Architecture, with a 1 MOPS layer mapped to 1 PE and a 10 MOPS layer mapped to 10 PEs; Synthesizable CNN Description]
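The "1 MOPS, 1 PE" and "10 MOPS, 10 PE" labels on slide 14 suggest that processing elements (PEs) are allocated in proportion to each layer's workload, so that every stage of the dataflow pipeline takes about the same time. A sketch of that balancing idea follows; the function names, workloads and PE budget are hypothetical and not taken from the slides:

```python
def allocate_pes(layer_mops, pe_budget):
    """Split a PE budget across layers in proportion to their operation counts.

    Every layer gets at least one PE; because of rounding, the total may
    differ slightly from pe_budget.
    """
    total = sum(layer_mops)
    return [max(1, round(pe_budget * ops / total)) for ops in layer_mops]


def relative_stage_time(layer_mops, pes):
    """Relative time each pipeline stage needs (work divided by parallelism)."""
    return [ops / pe for ops, pe in zip(layer_mops, pes)]


if __name__ == "__main__":
    workloads = [1.0, 10.0]  # hypothetical per-layer workloads in MOPS
    pes = allocate_pes(workloads, pe_budget=11)
    print(pes)
    print(relative_stage_time(workloads, pes))
```

With balanced stage times no layer becomes the pipeline bottleneck, which is the point of matching the architecture to the network topology.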
  • 15. Maintaining Accuracy? Compensating Quantization with Network Complexity. Just reducing precision reduces hardware cost and increases error. Accuracy can be recovered by retraining and increasing network size. 1b, 2b and 4b provide Pareto-optimal solutions.
  • 16. Design Example: German Traffic Sign Recognition
  • 17. Neural Network Example
      – German Road Sign Database: 50,000+ 32x32-pixel images for training; 44 classes (43 road signs, 1 background); training via Amazon Web Services (AWS p2.xlarge instance, 8 hours → $7.78, about €6.5)
      – Binary Neural Network characteristics: 6 convolutional layers; 2 max pool layers; 3 fully connected layers
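One reason a binarized network maps so cheaply onto programmable logic is that a dot product over {-1, +1} values reduces to an XNOR followed by a popcount. The slides do not spell this out; the sketch below shows that commonly used formulation, with a hypothetical bit encoding (1 for +1, 0 for -1) and hypothetical function names:

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (bit i set means values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, w_bits, n):
    """Dot product of two length-n {-1, +1} vectors given as packed bit words."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches


if __name__ == "__main__":
    a = [1, -1, 1, 1]
    w = [1, 1, -1, 1]
    print(binary_dot(pack_bits(a), pack_bits(w), len(a)))
    print(sum(x * y for x, y in zip(a, w)))  # reference result, same value
```

In hardware the XNOR and popcount are plain LUT logic, which is consistent with the zero DSP cost listed for 1b operations earlier in the deck.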
  • 18. Neural Network Performance Results: up to 8,600 times faster when accelerated with programmable logic

      Performance Metric   | Software Only             | Programmable Logic Accelerated
      Tiles per second     | 2.2                       | 19,000
      Scene rate (fps)     | 0.011 (92 sec per frame)  | 94
      Overall Acceleration | -                         | 8,600X
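The figures on slide 18 can be cross-checked with a little arithmetic. The sketch below assumes the 202 tiles per scene (6 + 36 + 160) quoted on the backup slide at the end of the deck; the variable and function names are illustrative:

```python
TILES_PER_SCENE = 6 + 36 + 160   # tiles across the three image layers

SW_TILES_PER_SEC = 2.2           # software only
PL_TILES_PER_SEC = 19_000        # programmable logic accelerated


def scene_rate(tiles_per_sec, tiles_per_scene=TILES_PER_SCENE):
    """Scenes (frames) per second given a tile throughput."""
    return tiles_per_sec / tiles_per_scene


if __name__ == "__main__":
    print(f"software:    {scene_rate(SW_TILES_PER_SEC):.3f} fps")
    print(f"accelerated: {scene_rate(PL_TILES_PER_SEC):.0f} fps")
    print(f"speed-up:    {PL_TILES_PER_SEC / SW_TILES_PER_SEC:.0f}x")
```

The scene rates come out at about 0.011 fps and 94 fps, and the tile-throughput ratio is a little over 8,600, in line with the slide's rounded figure.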
  • 20. Summary
  • 21. Zynq UltraScale+ MPSoCs Ideal for Neural Networks. Neural Networks enable accurate and efficient image classification. Programmable logic enables network and computation adaptation based on system-level requirements. Single-chip Zynq UltraScale+ MPSoCs enable: increased performance; reduced power; reduced cost; flexible field upgrades. See our demonstration Friday.
  • 22. The Future is Ultra96: Xilinx Contest. Submit your most creative, most out-of-the-box AI or ML application at the Xilinx or Avnet table during Demo Friday (12:00 – 14:00). The best 30 get a FREE Ultra96 board plus software to help you realize your vision. The first twenty to submit a working design by May 25th, 2018 get a $25 Amazon Gift Card. ONE winner will be announced through Xilinx social media channels. If it’s you, you’re invited to present your design to your peers in industry at Xilinx Developer Forum 2018.
  • 23. Accelerating Neural Networks for Vision Systems via FPGAs Questions?
  • 24. Neural Network Example
      – German Road Sign Database: ~1,000 images, rotated & scaled to various angles & sizes → ~50,000 32x32-pixel images for training; http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset; 44 classes (43 road signs, 1 background); Amazon Web Services used for training (500 epochs): p2.xlarge instance, 8 hours, $7.78 (about €6.5)
      – BNN characteristics: topology of 6 convolutional layers, 2 max pool layers and 3 fully connected layers; compute requirement: 112 MOPS/frame; memory requirement: 1.54 MParams (fully binarized)
      – Scene tiling: for compute efficiency, the frame is resized rather than the tiles; three layers cover different sign sizes: 54x32, 78x44 and 110x64 pixels; 202 32x32 tiles are processed per scene (6 in the first layer, 36 in the second layer, 160 in the third layer); 13% step size