
"Using High-level Synthesis to Bridge the Gap Between Deep Learning Frameworks and Custom Hardware Accelerators," a Presentation from Mentor


For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/mentor/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit

For more information about embedded vision, please visit:
http://www.embedded-vision.com

Michael Fingeroff, HLS Technologist at Mentor, presents the "Using High-level Synthesis to Bridge the Gap Between Deep Learning Frameworks and Custom Hardware Accelerators" tutorial at the May 2019 Embedded Vision Summit.

Recent years have seen an explosion in machine learning/AI algorithms, with a corresponding need for custom hardware to achieve the best performance and power efficiency. However, there is still a wide gap between algorithm creation and experimentation (using deep learning frameworks such as TensorFlow and Caffe) and custom hardware implementations in FPGAs or ASICs. In this presentation, Fingeroff explains how high-level synthesis (HLS) using standard C++ as the design language can provide an automated path to custom hardware implementations by leveraging existing APIs available in deep learning frameworks (e.g., the TensorFlow Operator C++ API).

Using these APIs can enable designers to easily plug their synthesizable C++ hardware models into deep learning frameworks to validate a given implementation. Designing using C++ and HLS not only provides the ability to quickly create AI hardware accelerators with the best power, performance and area (PPA) for a target application, but helps bridge the gap between software algorithms developed in deep learning frameworks and their corresponding hardware implementations.


  1. © 2019 Mentor Graphics, A Siemens Business. Using High-level Synthesis to Bridge the Gap Between Deep Learning Frameworks and Custom Hardware Accelerators. Mike Fingeroff, High-level Synthesis Technologist.
  2. Agenda: Machine learning has massive design complexity requirements. Why Catapult High-level Synthesis (HLS) is crucial to getting designs to market on time. Verification of the quantized algorithm. Customer successes and future direction of machine learning and HLS.
  3. Machine Learning Hardware is Evolving Rapidly
  4. Machine Learning Algorithms Have Massive Computational Complexity. Training: • Very large datasets and memory; CPU/GPU farms; floating point required • Not real time; can take days or weeks. Inferencing (this is where Catapult HLS fits): • Uses weights from the trained network • Memory storage/bandwidth challenges • Often real-time • Can be reduced to fixed point, dramatically reducing power
  5. Numerous Possible Hardware/Memory NN Architectures for Inference Engines. Machine learning architectures are still evolving: • How to know which one is right for the application? • Not enough time to do them all in RTL. On-chip memory, memory bandwidth, power, performance and area are all important.
  6. Memory Architecture and Power Considerations. Keeping data local is key to minimizing power consumption • Very important for ASIC. Floating point is costly • Used in training of networks • Not needed in a network inference engine. Processor ML architectures are fixed bit-width • Not power efficient. *MIT/NVIDIA 2017
  7. Data in the Real World is Exploding. Data traffic is going to increase exponentially over the next decade • Frame rates and sensor/camera resolutions will keep doubling every few years. How can processing technology keep up? • General-purpose solutions won't work; too much power. (Tractica 2018; EB = 10^18 bytes)
  8. Machine Learning Design Flow. Algorithm engineers work in AI development platforms (pruning, quantization, compression, weights, retraining, compilation); they don't understand hardware. Hardware engineers work on the HW implementation and are already building NN HW using Catapult HLS; they don't understand the NN platforms.
  9. Why Catapult HLS is Crucial to Getting Designs to Market on Time
  10. Catapult HLS is the Best Solution for Rapid Algorithm to HW. [Diagram: C++ source (a conditional multiply-accumulate loop) synthesized to RTL.] Enable late functional changes without impacting schedule: • Algorithms can be easily modified and regenerated • New technology nodes are easy (or FPGA to ASIC). Quickly evaluate power and performance of algorithms: • Rapidly explore multiple options for optimal power, performance and area (PPA). Accelerate design time with a higher level of abstraction: • 1 year reduced to a few months • New features added in days, not weeks • 5X less code than RTL
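The code fragment pictured on this slide reads as a conditional multiply-accumulate loop. A minimal, self-contained completion might look like the following; `func`, `N`, and `cond` come from the slide fragment, while the signature and the rest of the body are filled in as an illustrative guess rather than an actual Catapult design:

```cpp
#include <cstddef>

// Illustrative completion of the slide's fragment: a fixed trip-count
// conditional multiply-accumulate, the kind of C++ HLS tools map to hardware.
const int N = 8;

// Accumulate a[i]*b[i] when the enable flag is set.
short func(const short a[N], const short b[N], bool cond) {
    short z = 0;
    for (int i = 0; i < N; i++) {
        if (cond)
            z += a[i] * b[i];
        // else branch is elided on the slide; here the MAC is simply skipped
    }
    return z;
}
```

Because the loop bound is a compile-time constant and the data types have fixed widths, an HLS tool can schedule the multiplies and adds onto a known amount of hardware.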
  11. Catapult Synthesizes C++ and SystemC to Optimal ASIC or FPGA Hardware. Catapult synthesizes C++/SystemC to optimized Register Transfer Level (RTL) for ASICs and FPGAs. C++ function: void simple_function(<function interface variables>){ <function body> }. C++ class: class simpleClass{ … public: void simple_function(<function interface variables>){ <function body> } };. SystemC module: SC_MODULE(simpleClass){ <module ports> SC_CTOR(simpleClass){ SC_THREAD(run); } void run(){ <function body> } };
  12. High-level Synthesis Models Bit-accuracy in the C++ Source. Arbitrary-precision integer, fixed-point, and floating-point types • New bfloat16, ac_std_float<E,M>, ac_ieee_float. HLS uses exact bit-widths to meet the specification and save power/area • Hardware bit-widths are not always powers of two (1, 8, 16, 32, 64 bits). Rapid simulation of true hardware behavior. Flow: model using floating point, refine/explore precision into bit-accurate C++/SystemC, verify, then produce bit-accurate RTL with Catapult Ultra and verify again. The Algorithmic C fixed-point data types are declared as: ac_fixed<W,I,S> x; where W is the total width in bits, I is the number of integer bits, and S selects signed or unsigned.
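As a rough sketch of what an `ac_fixed<W,I,S>` declaration means numerically, the plain C++ below (not the AC datatypes library itself) quantizes a double onto the grid a signed `ac_fixed<W,I,true>` can represent, using truncation. The saturation at the range limits is chosen here for illustration; ac_fixed's default overflow mode wraps instead:

```cpp
#include <cmath>

// Illustrative model of signed fixed-point quantization with W total bits and
// I integer bits: values are truncated to multiples of the LSB weight and
// clamped to the representable range (saturation chosen for this sketch).
double quantize_fixed(double v, int W, int I) {
    double lsb = std::pow(2.0, I - W);         // weight of the least significant bit
    double lo  = -std::pow(2.0, I - 1);        // most negative representable value
    double hi  =  std::pow(2.0, I - 1) - lsb;  // most positive representable value
    double q = std::floor(v / lsb) * lsb;      // truncate onto the fixed-point grid
    if (q < lo) return lo;                     // saturate below
    if (q > hi) return hi;                     // saturate above
    return q;
}
```

For example, an 8-bit type with 3 integer bits has an LSB of 2^-5 = 0.03125 and a range of [-4, 3.96875], so 1.3 quantizes to 1.28125. This is exactly the precision/range trade-off the "refine/explore precision" step navigates.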
  13. Constraint-driven Exploration of Parallelism/Timing. Exploration is done using loop transformations: loop unrolling drives parallelism, and timing closure is automatic. [Catapult Architectural Constraints view showing the loops in the design and the resulting multiplier/adder structure.]
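To illustrate the unrolling transformation described here (hypothetical code, not Catapult output): unrolling a dot-product loop by 4 exposes four independent multiplies per iteration, which in hardware can map to four parallel multipliers feeding an adder tree.

```cpp
const int LEN = 16;

// Rolled form: one multiply-add per loop iteration; minimal hardware,
// maximal latency.
int dot_rolled(const int a[LEN], const int b[LEN]) {
    int acc = 0;
    for (int i = 0; i < LEN; i++)
        acc += a[i] * b[i];
    return acc;
}

// Unrolled by 4: four independent multiplies per iteration become four
// parallel multipliers, cutting the iteration count to LEN/4.
int dot_unrolled(const int a[LEN], const int b[LEN]) {
    int acc = 0;
    for (int i = 0; i < LEN; i += 4)
        acc += a[i]     * b[i]
             + a[i + 1] * b[i + 1]
             + a[i + 2] * b[i + 2]
             + a[i + 3] * b[i + 3];
    return acc;
}
```

In an HLS flow the designer does not rewrite the loop by hand as above; an unroll constraint on `dot_rolled` directs the tool to perform the equivalent transformation, which is what makes architectural exploration fast.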
  14. Constraint-driven Creation of Memories/Memory Architecture. Simplifies designing the memory architecture: C++ arrays are automatically mapped to ASIC or FPGA memories/registers, with user control over memory mapping, banking, etc. Arrays on the design interface can be synthesized as memory interfaces or AXI4 master/slave interfaces. void simple_function(… ,int data[1024]){ int mem[1024]; <function body> } [Catapult constraint GUI: a 43,264-word, 17-bit-wide array banked into 64 RAMs of 676 words each.]
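As a hypothetical sketch of what banking an array means (the structure below is illustrative, not how Catapult represents it internally): with interleaved banking, word i lives in bank i % BANKS at address i / BANKS, so BANKS consecutive words can be accessed in the same cycle from separate physical RAMs.

```cpp
const int WORDS = 1024;           // logical array size, as in the slide's example
const int BANKS = 4;              // number of physical RAMs (illustrative choice)
const int DEPTH = WORDS / BANKS;  // depth of each bank

// Interleaved banking: each row of `bank` would become its own RAM in
// hardware, enabling parallel access to consecutive addresses.
struct BankedMem {
    int bank[BANKS][DEPTH];

    void write(int addr, int value) { bank[addr % BANKS][addr / BANKS] = value; }
    int  read (int addr) const      { return bank[addr % BANKS][addr / BANKS]; }
};
```

In the HLS flow this partitioning is applied through memory constraints on the original flat array rather than by restructuring the source, which is why the same C++ can be retargeted to different memory architectures.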
  15. Verification of the Quantized Algorithm in Hardware
  16. Automatic Verification of C++ vs. Hardware Implementation. The C++ algorithm is fully verified before synthesis, so no RTL debug is required. [Flow: a C++ testbench applies the same stimulus to the bit-accurate C++ reference model and the synthesizable model and compares their outputs; Catapult C++ synthesis then produces RTL, with an automated RTL sanity check against the same algorithm inputs and outputs.]
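The flow on this slide can be sketched as a self-checking C++ testbench that drives both models and counts mismatches. The two models below are placeholders (a saturating 16-bit add), invented for illustration; in a real flow the synthesizable model is the HLS source, and the same testbench is reused against the generated RTL:

```cpp
// Placeholder bit-accurate reference model: saturating 16-bit addition.
short reference_model(int a, int b) {
    long long s = (long long)a + b;
    if (s >  32767) return  32767;
    if (s < -32768) return -32768;
    return (short)s;
}

// Placeholder synthesizable model; in practice this is the C++ handed to HLS.
short synthesizable_model(int a, int b) {
    long long s = (long long)a + b;
    if (s >  32767) return  32767;
    if (s < -32768) return -32768;
    return (short)s;
}

// Testbench: sweep a stimulus grid through both models, compare outputs,
// and return the mismatch count (zero means the models agree).
int run_testbench() {
    int errors = 0;
    for (int a = -40000; a <= 40000; a += 1000)
        for (int b = -40000; b <= 40000; b += 1000)
            if (reference_model(a, b) != synthesizable_model(a, b))
                errors++;
    return errors;
}
```

Because the checking lives entirely in C++, the same comparison runs at software simulation speed during algorithm refinement and again as an automated sanity check after RTL generation.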
  17. Easily Test HW C++ Models Directly in TensorFlow. Swap any layer or the entire design. [Diagram: an HLS model in C++ (sliding-window convolution/max pooling stages connected by FIFOs, an in-place convolution/max pooling stage, and off-chip DRAM carrying weights and results over AXI4-Stream) is wrapped with the TensorFlow C++ API operator wrapper and called from a TensorFlow Python file as an operator, e.g. "catapult conv2d".]
  18. Customer Successes and Future Direction of Machine Learning and HLS
  19. Chips&Media Success: Deep Learning Object Detection IP. Successfully delivered inference-targeted deep learning IP with the move to HLS: • RTL designers now plan to use HLS on all future new computer vision/deep learning IP • HLS is key to finding a power-optimized architecture for a specific DNN. Cut the block/IP design and verification time in half: • New DNN architecture • Delivered a critical FPGA customer demonstrator early. HLS helped find an optimal power/performance architecture that RTL "would not have had time" for.
  20. NVIDIA Research with DARPA: a new methodology for 10x faster chip design. HLS to target 80% of future NVIDIA chips • Open-source MatchLib HLS IP. Two tapeouts: a 20M+ gate machine learning accelerator SoC. Foundation for NVDLA HW • NVIDIA Deep Learning Accelerators. (NVIDIA Research: new methodology with Catapult; machine learning accelerator SoC using an object-oriented HLS flow.)
  21. Vision: Enable a Fast Path to Custom AI/Neural Network Accelerators with Catapult HLS. • Build low-power HW from a trained network • Quickly produce a deployable proof-of-concept • Optimize performance, power and area when final requirements are set • Make the FPGA design flow a viable alternative to GPUs for neural networks.
  22. Conclusion. Machine learning hardware implementations are massively complex: • Implementing real-time HW solutions on time is very challenging • General-purpose solutions will not be power efficient. Catapult High-level Synthesis enables designers to rapidly deliver custom hardware solutions for machine learning algorithms: • Hardware is optimized for the ML network/algorithm • Most power-efficient result. Verification in C++ is the most flexible solution: • Easily verify the hardware model in the TensorFlow ML framework.
  23. Backup Material
  24. Catapult HLS Resources. Catapult customer white papers: Chips&Media, "Design and Verification of Deep Learning Object Detection IP"; NVIDIA, "Digital VLSI Flow for High-Productivity SoC Design", "Hardware Accelerator for Mobile Computer Vision Applications", and "Design and Verification of a Machine Learning Accelerator SoC Using an Object-Oriented HLS-Based Design Flow"; SeeCubic, "Catapult HLS Enables Ultra-D 3D without Glasses"; ST Imaging, "STMicroelectronics Quickly Brings Automotive Image Signal Processing to Market with HLS"; Google white paper and Google presentation.
  25. Chips&Media Success for Deep Learning Object Detection IP. Successfully delivered inference-targeted deep learning IP with the move to HLS: • RTL designers now plan to use HLS on all future new computer vision/deep learning IP • HLS is key to finding a power-optimized architecture for a specific DNN. Cut the block/IP design and verification time in half: • New DNN architecture • Delivered a critical FPGA customer demonstrator early. HLS helped find an optimal power/performance architecture that RTL "would not have had time" for. New detailed white paper: "Design and Verification of Deep Learning Object Detection IP".
  26. NVIDIA Research with DARPA: a new methodology for 10x faster chip design. • HLS to target 80% of future NVIDIA chips. Two tapeouts: a 20M+ gate machine learning accelerator SoC. Used for SoC performance verification • 30X faster than RTL, <2.6% error in cycle count. Foundation for NVDLA HW • NVIDIA Deep Learning Accelerators. Two DAC papers (2016, 2018) available now: • "Digital VLSI Flow for High-Productivity SoC Design" • "Hardware Accelerator for Mobile Computer Vision Applications" • "Design and Verification of a Machine Learning Accelerator SoC Using an Object-Oriented HLS-Based Design Flow"
  27. NVIDIA Achieves a Cost Reduction of ~80% for Functional Verification with Catapult. Used in production-level automotive-targeted SoCs. C++ functional verification uses ~500x fewer resources at runtime than RTL. Fast verification makes rapid product changes possible: • VP9/HEVC code from 8- to 10-bit color depth in 2 weeks • Change from 20nm/500MHz to 28nm/800MHz in 3 days with HLS. Traditional RTL functional regression: 3 months, 1000 CPUs. HLS C++ functional regression: 2 weeks, 14 CPUs. (NVIDIA Xavier 12nm FF SoC: the most complex SoC ever made, 9 billion transistors, ~8,000 man-years. NVIDIA case study available on mentor.com.)
  28. FotoNation Next-Gen Mobile Face Recognition with Catapult. DAC presentation: • "A Designer's Life with HLS - Faster Computer Vision/Neural Networks". "3 weeks from Caffe to FPGA": • Initial FPGA from a unique C algorithm - 10fps • HLS for the desired microarchitecture delivered a 30fps FPGA at 100MHz. Faster, easier reuse, testing and customization: • "4x faster than hand coding" • "Verification is easier - bit-exactness between HW and C is native" • Instant retargeting to optimal ASIC RTL. (3+ billion devices; high-performance, low-power computational imaging.)
  29. SeeCubic/StreamTV Networks Uses Catapult HLS to Deliver a Realistic 3D Experience without Glasses. New Ultra-D branded technology and algorithms for far more realistic 3D displays, targeting automotive, medical and consumer. "Catapult HLS came to the rescue": • First, must prove the image quality and algorithms and demonstrate on FPGA • Enables working with partners to embed in ASIC/SoC • Only the Catapult HLS methodology delivers the needed technology independence. Presented at DAC 2017; white paper: "Catapult HLS Enables Ultra-D 3D without Glasses".
  30. ST Imaging HLS Success for ISP (Automotive). To date, created 50+ image processing IPs using HLS and an imaging template. Why they use HLS and Catapult (their words): • Increase IP value • Improve IP performance versus power and area • Reduce project cost • Reduce IP development from 24 weeks to 4 weeks. Experience with HLS: • Less code to write and debug • Fast integration of new features • Algorithm and architecture exploration possible • Fast verification using C++. On-demand webinar and white paper: "STMicroelectronics Quickly Brings Automotive Image Signal Processing to Market with HLS".
  31. Google Continues Video CODEC Success with Catapult HLS. AV1 improves compression by 40-50% over VP9/HEVC. Goal: a high-bandwidth, free-of-charge CODEC releasing every 3-4 years (rather than every 10, as with HEVC). Catapult HLS on the VP9 CODEC: • Time to verified RTL: 2x faster • Simulation speed: 500x faster • >99% of bugs caught in C simulation. Catapult HLS on the AV1 CODEC: • Productivity: 90% less code, fewer bugs • Leverage the whole team: algorithm, architect, HW, DV • Flexibility: SW-like process; late-stage algorithm changes are easy • Empowering HW engineers: work on interesting/important problems • Rapid HW prototyping: rapidly evaluate new ideas and algorithms. (Google presentation and Google white paper.)
