
"Using High-level Synthesis to Bridge the Gap Between Deep Learning Frameworks and Custom Hardware Accelerators," a Presentation from Mentor


For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/mentor/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit

For more information about embedded vision, please visit:
http://www.embedded-vision.com

Michael Fingeroff, HLS Technologist at Mentor, presents the "Using High-level Synthesis to Bridge the Gap Between Deep Learning Frameworks and Custom Hardware Accelerators" tutorial at the May 2019 Embedded Vision Summit.

Recent years have seen an explosion in machine learning/AI algorithms, with a corresponding need for custom hardware to achieve the best performance and power efficiency. However, there is still a wide gap between algorithm creation and experimentation (using deep learning frameworks such as TensorFlow and Caffe) and custom hardware implementations in FPGAs or ASICs. In this presentation, Fingeroff explains how high-level synthesis (HLS) using standard C++ as the design language can provide an automated path to custom hardware implementations by leveraging existing APIs available in deep learning frameworks (e.g., the TensorFlow Operator C++ API).

Using these APIs can enable designers to easily plug their synthesizable C++ hardware models into deep learning frameworks to validate a given implementation. Designing using C++ and HLS not only provides the ability to quickly create AI hardware accelerators with the best power, performance and area (PPA) for a target application, but helps bridge the gap between software algorithms developed in deep learning frameworks and their corresponding hardware implementations.


  1. © 2019 Mentor Graphics, A Siemens Business. Using High-level Synthesis to Bridge the Gap Between Deep Learning Frameworks and Custom Hardware Accelerators. Mike Fingeroff, High-level Synthesis Technologist.
  2. Agenda: Machine learning has massive design complexity requirements. Why Catapult High-level Synthesis (HLS) is crucial to getting designs to market on time. Verification of the quantized algorithm. Customer successes and future direction of machine learning and HLS.
  3. Machine Learning Hardware is Evolving Rapidly
  4. Machine Learning Algorithms Have Massive Computational Complexity. Training: • Very large datasets and memory; CPU/GPU farms; floating point required • Not real time; can take days or weeks. Inferencing (this is where Catapult HLS fits): • Uses weights from the trained network • Memory storage/bandwidth challenges • Often real-time • Can be reduced to fixed point, dramatically reducing power
  5. Numerous Possible Hardware/Memory NN Architectures for Inference Engines. Machine learning architectures are still evolving: • How to know which one is right for the application? • Not enough time to do them all in RTL. On-chip memory, memory bandwidth, power, performance and area are all important.
  6. Memory Architecture and Power Considerations. Keeping data local is key to minimizing power consumption • Very important for ASIC. Floating point is costly • Used in training of networks • Not needed in a network inference engine. Processor ML architectures are fixed bit-width • Not power efficient. *MIT/NVIDIA 2017
  7. Data in the Real World is Exploding. Data traffic is going to increase exponentially over the next decade • Frame rates and sensor/camera resolutions will keep doubling every few years. How can processing technology keep up? • General-purpose solutions won't work; too much power. (Tractica 2018; EB = 10^18 bytes)
  8. Machine Learning Design Flow. Algorithm engineers work in AI development platforms (pruning, quantization, compression, weights, retraining, compilation); they don't understand hardware. Hardware engineers work on the HW implementation and are already building NN HW using Catapult HLS; they don't understand the NN platforms.
  9. Why Catapult HLS is Crucial to Getting Designs to Market on Time
  10. Catapult HLS is the Best Solution for Rapid Algorithm to HW. [Diagram: C++ source (a conditional multiply-accumulate loop) synthesized to RTL.] Enable late functional changes without impacting schedule: • Algorithms can be easily modified and regenerated • New technology nodes are easy (or FPGA to ASIC). Quickly evaluate power and performance of algorithms: • Rapidly explore multiple options for optimal power, performance and area (PPA). Accelerate design time with a higher level of abstraction: • 1 year reduced to a few months • New features added in days, not weeks • 5X less code than RTL
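The code fragment pictured on this slide reads as a conditional multiply-accumulate loop. A minimal, self-contained completion might look like the following; `func`, `N`, and `cond` come from the slide fragment, while the signature and the rest of the body are filled in as an illustrative guess rather than an actual Catapult design:

```cpp
#include <cstddef>

// Illustrative completion of the slide's fragment: a fixed trip-count
// conditional multiply-accumulate, the kind of C++ HLS tools map to hardware.
const int N = 8;

// Accumulate a[i]*b[i] when the enable flag is set.
short func(const short a[N], const short b[N], bool cond) {
    short z = 0;
    for (int i = 0; i < N; i++) {
        if (cond)
            z += a[i] * b[i];
        // else branch is elided on the slide; here the MAC is simply skipped
    }
    return z;
}
```

Because the loop bound is a compile-time constant and the data types have fixed widths, an HLS tool can schedule the multiplies and adds onto a known amount of hardware.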
  11. Catapult Synthesizes C++ and SystemC to Optimal ASIC or FPGA Hardware. Catapult synthesizes C++/SystemC to optimized Register Transfer Level (RTL) for ASICs and FPGAs. C++ function: void simple_function(<function interface variables>){ <function body> }. C++ class: class simpleClass{ … public: void simple_function(<function interface variables>){ <function body> } };. SystemC module: SC_MODULE(simpleClass){ <module ports> SC_CTOR(simpleClass){ SC_THREAD(run); } void run(){ <function body> } };
  12. High-level Synthesis Models Bit-accuracy in the C++ Source. Arbitrary-precision integer, fixed-point, and floating-point types • New bfloat16, ac_std_float<E,M>, ac_ieee_float. HLS uses exact bit-widths to meet the specification and save power/area • Hardware bit-widths are not always powers of two (1, 8, 16, 32, 64 bits). Rapid simulation of true hardware behavior. Flow: model using floating point, refine/explore precision into bit-accurate C++/SystemC, verify, then produce bit-accurate RTL with Catapult Ultra and verify again. The Algorithmic C fixed-point data types are declared as: ac_fixed<W,I,S> x; where W is the total width in bits, I is the number of integer bits, and S selects signed or unsigned.
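As a rough sketch of what an `ac_fixed<W,I,S>` declaration means numerically, the plain C++ below (not the AC datatypes library itself) quantizes a double onto the grid a signed `ac_fixed<W,I,true>` can represent, using truncation. The saturation at the range limits is chosen here for illustration; ac_fixed's default overflow mode wraps instead:

```cpp
#include <cmath>

// Illustrative model of signed fixed-point quantization with W total bits and
// I integer bits: values are truncated to multiples of the LSB weight and
// clamped to the representable range (saturation chosen for this sketch).
double quantize_fixed(double v, int W, int I) {
    double lsb = std::pow(2.0, I - W);         // weight of the least significant bit
    double lo  = -std::pow(2.0, I - 1);        // most negative representable value
    double hi  =  std::pow(2.0, I - 1) - lsb;  // most positive representable value
    double q = std::floor(v / lsb) * lsb;      // truncate onto the fixed-point grid
    if (q < lo) return lo;                     // saturate below
    if (q > hi) return hi;                     // saturate above
    return q;
}
```

For example, an 8-bit type with 3 integer bits has an LSB of 2^-5 = 0.03125 and a range of [-4, 3.96875], so 1.3 quantizes to 1.28125. This is exactly the precision/range trade-off the "refine/explore precision" step navigates.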
  13. Constraint-driven Exploration of Parallelism/Timing. Exploration is done using loop transformations: loop unrolling drives parallelism, and timing closure is automatic. [Catapult Architectural Constraints view showing the loops in the design and the resulting multiplier/adder structure.]
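To illustrate the unrolling transformation described here (hypothetical code, not Catapult output): unrolling a dot-product loop by 4 exposes four independent multiplies per iteration, which in hardware can map to four parallel multipliers feeding an adder tree.

```cpp
const int LEN = 16;

// Rolled form: one multiply-add per loop iteration; minimal hardware,
// maximal latency.
int dot_rolled(const int a[LEN], const int b[LEN]) {
    int acc = 0;
    for (int i = 0; i < LEN; i++)
        acc += a[i] * b[i];
    return acc;
}

// Unrolled by 4: four independent multiplies per iteration become four
// parallel multipliers, cutting the iteration count to LEN/4.
int dot_unrolled(const int a[LEN], const int b[LEN]) {
    int acc = 0;
    for (int i = 0; i < LEN; i += 4)
        acc += a[i]     * b[i]
             + a[i + 1] * b[i + 1]
             + a[i + 2] * b[i + 2]
             + a[i + 3] * b[i + 3];
    return acc;
}
```

In an HLS flow the designer does not rewrite the loop by hand as above; an unroll constraint on `dot_rolled` directs the tool to perform the equivalent transformation, which is what makes architectural exploration fast.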
  14. Constraint-driven Creation of Memories/Memory Architecture. Simplifies designing the memory architecture: C++ arrays are automatically mapped to ASIC or FPGA memories/registers, with user control over memory mapping, banking, etc. Arrays on the design interface can be synthesized as memory interfaces or AXI4 master/slave interfaces. void simple_function(… ,int data[1024]){ int mem[1024]; <function body> } [Catapult constraint GUI: a 43,264-word, 17-bit-wide array banked into 64 RAMs of 676 words each.]
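As a hypothetical sketch of what banking an array means (the structure below is illustrative, not how Catapult represents it internally): with interleaved banking, word i lives in bank i % BANKS at address i / BANKS, so BANKS consecutive words can be accessed in the same cycle from separate physical RAMs.

```cpp
const int WORDS = 1024;           // logical array size, as in the slide's example
const int BANKS = 4;              // number of physical RAMs (illustrative choice)
const int DEPTH = WORDS / BANKS;  // depth of each bank

// Interleaved banking: each row of `bank` would become its own RAM in
// hardware, enabling parallel access to consecutive addresses.
struct BankedMem {
    int bank[BANKS][DEPTH];

    void write(int addr, int value) { bank[addr % BANKS][addr / BANKS] = value; }
    int  read (int addr) const      { return bank[addr % BANKS][addr / BANKS]; }
};
```

In the HLS flow this partitioning is applied through memory constraints on the original flat array rather than by restructuring the source, which is why the same C++ can be retargeted to different memory architectures.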
  15. Verification of the Quantized Algorithm in Hardware
  16. Automatic Verification of C++ vs. Hardware Implementation. The C++ algorithm is fully verified before synthesis, so no RTL debug is required. [Flow: a C++ testbench applies the same stimulus to the bit-accurate C++ reference model and the synthesizable model and compares their outputs; Catapult C++ synthesis then produces RTL, with an automated RTL sanity check against the same algorithm inputs and outputs.]
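The flow on this slide can be sketched as a self-checking C++ testbench that drives both models and counts mismatches. The two models below are placeholders (a saturating 16-bit add), invented for illustration; in a real flow the synthesizable model is the HLS source, and the same testbench is reused against the generated RTL:

```cpp
// Placeholder bit-accurate reference model: saturating 16-bit addition.
short reference_model(int a, int b) {
    long long s = (long long)a + b;
    if (s >  32767) return  32767;
    if (s < -32768) return -32768;
    return (short)s;
}

// Placeholder synthesizable model; in practice this is the C++ handed to HLS.
short synthesizable_model(int a, int b) {
    long long s = (long long)a + b;
    if (s >  32767) return  32767;
    if (s < -32768) return -32768;
    return (short)s;
}

// Testbench: sweep a stimulus grid through both models, compare outputs,
// and return the mismatch count (zero means the models agree).
int run_testbench() {
    int errors = 0;
    for (int a = -40000; a <= 40000; a += 1000)
        for (int b = -40000; b <= 40000; b += 1000)
            if (reference_model(a, b) != synthesizable_model(a, b))
                errors++;
    return errors;
}
```

Because the checking lives entirely in C++, the same comparison runs at software simulation speed during algorithm refinement and again as an automated sanity check after RTL generation.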
  17. Easily Test HW C++ Models Directly in TensorFlow. Swap any layer or the entire design. [Diagram: an HLS model in C++ (sliding-window convolution/max pooling stages connected by FIFOs, an in-place convolution/max pooling stage, and off-chip DRAM carrying weights and results over AXI4-Stream) is wrapped with the TensorFlow C++ API operator wrapper and called from a TensorFlow Python file as an operator, e.g. "catapult conv2d".]
  18. Customer Successes and Future Direction of Machine Learning and HLS
  19. Chips&Media Success: Deep Learning Object Detection IP. Successfully delivered inference-targeted deep learning IP with the move to HLS: • RTL designers now plan to use HLS on all future new computer vision/deep learning IP • HLS is key to finding a power-optimized architecture for a specific DNN. Cut the block/IP design and verification time in half: • New DNN architecture • Delivered a critical FPGA customer demonstrator early. HLS helped find an optimal power/performance architecture that RTL "would not have had time" for.
  20. NVIDIA Research with DARPA: a new methodology for 10x faster chip design. HLS to target 80% of future NVIDIA chips • Open-source MatchLib HLS IP. Two tapeouts: a 20M+ gate machine learning accelerator SoC. Foundation for NVDLA HW • NVIDIA Deep Learning Accelerators. (NVIDIA Research: new methodology with Catapult; machine learning accelerator SoC using an object-oriented HLS flow.)
  21. Vision: Enable a Fast Path to Custom AI/Neural Network Accelerators with Catapult HLS. • Build low-power HW from a trained network • Quickly produce a deployable proof-of-concept • Optimize performance, power and area when final requirements are set • Make the FPGA design flow a viable alternative to GPUs for neural networks.
  22. Conclusion. Machine learning hardware implementations are massively complex: • Implementing real-time HW solutions on time is very challenging • General-purpose solutions will not be power efficient. Catapult High-level Synthesis enables designers to rapidly deliver custom hardware solutions for machine learning algorithms: • Hardware is optimized for the ML network/algorithm • Most power-efficient result. Verification in C++ is the most flexible solution: • Easily verify the hardware model in the TensorFlow ML framework.
  23. Backup Material
  24. Catapult HLS Resources. Catapult customer white papers: Chips&Media, "Design and Verification of Deep Learning Object Detection IP"; NVIDIA, "Digital VLSI Flow for High-Productivity SoC Design", "Hardware Accelerator for Mobile Computer Vision Applications", and "Design and Verification of a Machine Learning Accelerator SoC Using an Object-Oriented HLS-Based Design Flow"; SeeCubic, "Catapult HLS Enables Ultra-D 3D without Glasses"; ST Imaging, "STMicroelectronics Quickly Brings Automotive Image Signal Processing to Market with HLS"; Google white paper and Google presentation.
  25. Chips&Media Success for Deep Learning Object Detection IP. Successfully delivered inference-targeted deep learning IP with the move to HLS: • RTL designers now plan to use HLS on all future new computer vision/deep learning IP • HLS is key to finding a power-optimized architecture for a specific DNN. Cut the block/IP design and verification time in half: • New DNN architecture • Delivered a critical FPGA customer demonstrator early. HLS helped find an optimal power/performance architecture that RTL "would not have had time" for. New detailed white paper: "Design and Verification of Deep Learning Object Detection IP".
  26. NVIDIA Research with DARPA: a new methodology for 10x faster chip design. • HLS to target 80% of future NVIDIA chips. Two tapeouts: a 20M+ gate machine learning accelerator SoC. Used for SoC performance verification • 30X faster than RTL, <2.6% error in cycle count. Foundation for NVDLA HW • NVIDIA Deep Learning Accelerators. Two DAC papers (2016, 2018) available now: • "Digital VLSI Flow for High-Productivity SoC Design" • "Hardware Accelerator for Mobile Computer Vision Applications" • "Design and Verification of a Machine Learning Accelerator SoC Using an Object-Oriented HLS-Based Design Flow"
  27. NVIDIA Achieves a Cost Reduction of ~80% for Functional Verification with Catapult. Used in production-level automotive-targeted SoCs. C++ functional verification uses ~500x fewer resources at runtime than RTL. Fast verification makes rapid product changes possible: • VP9/HEVC code from 8- to 10-bit color depth in 2 weeks • Change from 20nm/500MHz to 28nm/800MHz in 3 days with HLS. Traditional RTL functional regression: 3 months, 1000 CPUs. HLS C++ functional regression: 2 weeks, 14 CPUs. (NVIDIA Xavier 12nm FF SoC: the most complex SoC ever made, 9 billion transistors, ~8,000 man-years. NVIDIA case study available on mentor.com.)
  28. FotoNation Next-Gen Mobile Face Recognition with Catapult. DAC presentation: • "A Designer's Life with HLS - Faster Computer Vision/Neural Networks". "3 weeks from Caffe to FPGA": • Initial FPGA from a unique C algorithm - 10fps • HLS for the desired microarchitecture delivered a 30fps FPGA at 100MHz. Faster, easier reuse, testing and customization: • "4x faster than hand coding" • "Verification is easier - bit-exactness between HW and C is native" • Instant retargeting to optimal ASIC RTL. (3+ billion devices; high-performance, low-power computational imaging.)
  29. SeeCubic/StreamTV Networks Uses Catapult HLS to Deliver a Realistic 3D Experience without Glasses. New Ultra-D branded technology and algorithms for far more realistic 3D displays, targeting automotive, medical and consumer. "Catapult HLS came to the rescue": • First, must prove the image quality and algorithms and demonstrate on FPGA • Enables working with partners to embed in ASIC/SoC • Only the Catapult HLS methodology delivers the needed technology independence. Presented at DAC 2017; white paper: "Catapult HLS Enables Ultra-D 3D without Glasses".
  30. ST Imaging HLS Success for ISP (Automotive). To date, created 50+ image processing IPs using HLS and an imaging template. Why they use HLS and Catapult (their words): • Increase IP value • Improve IP performance versus power and area • Reduce project cost • Reduce IP development from 24 weeks to 4 weeks. Experience with HLS: • Less code to write and debug • Fast integration of new features • Algorithm and architecture exploration possible • Fast verification using C++. On-demand webinar and white paper: "STMicroelectronics Quickly Brings Automotive Image Signal Processing to Market with HLS".
  31. Google Continues Video CODEC Success with Catapult HLS. AV1 improves compression by 40-50% over VP9/HEVC. Goal: a high-bandwidth, free-of-charge CODEC releasing every 3-4 years (rather than every 10, as with HEVC). Catapult HLS on the VP9 CODEC: • Time to verified RTL: 2x faster • Simulation speed: 500x faster • >99% of bugs caught in C simulation. Catapult HLS on the AV1 CODEC: • Productivity: 90% less code, fewer bugs • Leverage the whole team: algorithm, architect, HW, DV • Flexibility: SW-like process; late-stage algorithm changes are easy • Empowering HW engineers: work on interesting/important problems • Rapid HW prototyping: rapidly evaluate new ideas and algorithms. (Google presentation and Google white paper.)
