
"Tailoring Convolutional Neural Networks for Low-Cost, Low-Power Implementation," a Presentation From Synopsys


For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/synopsys/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit

For more information about embedded vision, please visit:
http://www.embedded-vision.com

Bruno Lavigueur, Project Leader for Embedded Vision at Synopsys, presents the "Tailoring Convolutional Neural Networks for Low-Cost, Low-Power Implementation" tutorial at the May 2015 Embedded Vision Summit.

Deep learning-based object detection using convolutional neural networks (CNNs) has recently emerged as one of the leading approaches for achieving state-of-the-art detection accuracy across a wide range of object classes. Most current CNN-based detection algorithms run on high-performance computing platforms built around high-end general-purpose processors and GP-GPUs, and these implementations have significant compute and memory requirements.

Bruno presents Synopsys' experience in reducing the complexity of the CNN graph to make the resulting algorithm amenable to low-cost, low-power computing platforms. This involves reducing the compute requirements and the memory needed to store convolution coefficients, and moving from floating-point to 8- and 16-bit fixed-point data widths. Bruno demonstrates results for a face detection application running on a dedicated low-cost, low-power multi-core platform optimized for CNN-based applications.


  1. Tailoring CNNs for Low-cost, Low-power Implementations. Bruno Lavigueur, 12 May 2015. Copyright © 2015 Synopsys Inc.
  2. Synopsys at a Glance: >9,300 employees, >5,300 Masters/PhD degrees, >2,300 IP designers, >1,500 applications engineers, >$2.2B FY14 revenue, 32% of revenue on R&D • Embedded vision subsystem, built from many silicon-proven IPs • DesignWare: ARC HS processor, AXI, DMA, Memory Compiler, … • HAPS FPGA-based rapid prototyping system
  3. CNN on Embedded Devices • Convolutional Neural Networks (CNNs) enable a wide range of detection and classification tasks • The majority of published CNN graphs are not tailored for embedded: memory requirements, number of floating-point operations (# of MACs) • Yet CNNs have nice properties for parallelization on embedded devices: regular processing, feed-forward dataflow, no data-dependent computation • Key questions: Can the size and complexity of the graph (number of layers, connectivity, size of convolutions) be reduced with minimal impact on detection rates? What is the impact of moving from floating to fixed point?
  4. How a CNN Works (Once Trained) • Multiple feature extraction layers in a progressive refinement process • Each successive layer extracts more complex (higher-level) features • Last layer performs classification • Same computation (neuron) replicated multiple times • Pipeline: input image → Layer 1 (low-level feature extraction, pooling & downsampling) → Layer 2 (mid-level features, partially connected) → Layer 3 (high-level features, fully connected classification)
  5. Visualising a CNN • Each layer of convolutions extracts progressively higher-level features • Subsampling / max pooling to “zoom out” and detect bigger objects with smaller convolutions • Non-linear activation function on each neuron • [Figure: sample outputs of layers 1 through 4]
  6. CNN Computation • Convolution of multiple inputs together with a fixed kernel size: M input maps (XI × YI), Z kernels (K × K) with associated weights, N output maps (XO × YO), where Oj = act(Bj + Σv (Iv ∗ Kv)) • Optional subsampling (1, 2, 4x) • Optional max pooling • Very regular, repetitive, deterministic computation dominated by MACs • Non-linear activation function (sigmoid, hyperbolic tangent, rectifier)
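The per-output-map computation on slide 6 can be sketched in plain Python. This is an illustrative sketch, not Synopsys code: the shapes, values, and function names are made up, it uses cross-correlation (as most CNN frameworks do) with a "valid" output size, and it picks ReLU as the activation from the slide's list.

```python
def conv2d_valid(image, kernel):
    """2D 'valid' convolution (cross-correlation, as in most CNNs)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            s = 0.0
            for dy in range(k):
                for dx in range(k):
                    s += image[y + dy][x + dx] * kernel[dy][dx]  # one MAC
            out[y][x] = s
    return out

def relu(fmap):
    """Rectifier activation, applied element-wise."""
    return [[max(0.0, v) for v in row] for row in fmap]

def cnn_output_map(inputs, kernels, bias):
    """One output map Oj = act(Bj + sum over input maps of Iv * Kv)."""
    acc = None
    for img, ker in zip(inputs, kernels):
        c = conv2d_valid(img, ker)
        if acc is None:
            acc = c
        else:
            acc = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(acc, c)]
    return relu([[v + bias for v in row] for row in acc])
```

In the fully connected case from the slide, each of the N output maps would repeat this over all M input maps with its own set of kernels and its own bias Bj, which is where the regular, MAC-dominated structure comes from.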
  7. Moving Towards Embedded CNN • Given the nature of the algorithm, there are many ways to accelerate CNNs, including vector/SIMD units, systolic arrays / streaming, and GPUs • Performance / power / area trade-offs will vary depending on the architecture • In all cases the main limitations will be: the amount of closely coupled memory available, the maximum number of Giga-MAC/s that can be sustained, the I/O bandwidth required & available, and optimized data movement / efficient streaming • [Block diagram: EV processor with a quad-core 32-bit RISC CPU, a CNN engine built from an array of PEs, DMA, shared memory, and interconnect]
  8. Moving CNN to Embedded Systems • Graph complexity: number of layers (depth), size of the convolution filters, number of connections between the layers • These drive the compute requirements, the memory size (# of coefficients), and the ALU width/cost (data precision) • [Diagram: a 4-layer graph, and a small image convolved with a filter to produce a feature map]
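The complexity factors on slide 8 can be made concrete with a back-of-the-envelope count. This is a hypothetical sketch assuming a fully connected convolution layer; partially connected layers (like the "Cv" layers later in the talk) scale by the actual number of input-output connections rather than N × M.

```python
def conv_layer_macs(xo, yo, n_out, m_in, k):
    """MACs per frame for one fully connected convolution layer:
    each of the XO*YO output pixels of each of the N output maps
    accumulates a KxK window from each of the M input maps."""
    return xo * yo * n_out * m_in * k * k

def conv_layer_coeffs(n_out, m_in, k):
    """Coefficients to store: one KxK kernel per (input map, output map) pair."""
    return n_out * m_in * k * k

# Illustrative layer: 28x28 outputs, 6 input maps, 16 output maps, 5x5 kernels.
macs = conv_layer_macs(28, 28, 16, 6, 5)   # 1,881,600 MACs per frame
coeffs = conv_layer_coeffs(16, 6, 5)       # 2,400 coefficients
```

Multiplying the per-frame MAC count by the frame rate gives the sustained Giga-MAC/s requirement, and the coefficient count times the data width gives the coefficient memory footprint, the two quantities the talk sets out to shrink.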
  9. Example of a Big & Small CNN Application • Starting point: Multicoreware generated ~10 million face/non-face samples from over 200 Hollywood and Bollywood full-length movies, and trained a CNN to detect faces in those movies
     Metric        AlexNet-like       Embedded version
     Weight space  400 MB             0.5 MB
     Layers        10 (7 Cv + 3 FC)   5 (3 Cv + 2 FC)
     Compute       200x               1x
     Bandwidth     400x               1x
     F1-Score      .963               .905
     Accuracy      .993               .981
     VGA 30 FPS    4800 GOPS          24 GOPS
     (Cv: convolution layers, partially connected; FC: fully connected layers)
  10. Reducing Complexity of the Graph • Used standard open-source projects (Cuda-convnet, Caffe, Theano) to train networks with floating point and GPU acceleration to explore the network topology • Didn’t worry initially about numerical precision, as the literature has shown CNNs are robust to reduced precision • From scratch: small networks can be trained very fast, enabling lots of shots on goal: using scripting and many GPUs to vary the number of network layers, convolutions, subsampling & pooling, they explored a huge space and quickly converged on a graph with good learning • From an existing graph: also worked backwards from a high-accuracy large graph, iteratively reducing it and retraining the best candidates • Ended up with similar networks in both cases
  11. Training Optimizations • Improve the F-1 score with classic techniques such as data normalization, hard negative mining (boosting), annealing the learning rate, and data augmentation (flips, random cropping, color space, …) • These moved the initial system from an F-1 of ~.74 to ~.90 • Once the graph topology and training are satisfactory, look at the impact of moving to fixed point • Tests below were done with 31,437 positive and 263,145 negative samples
     Metric           Initial   Optimized
     True positive    19706     27093
     False positive   1769      1335
     False negative   11731     4344
     F-1 Score        0.7449    0.9051
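The F-1 scores on slide 11 follow directly from the true/false positive and negative counts, since F-1 is the harmonic mean of precision and recall. A quick check against the slide's figures:

```python
def f1_score(tp, fp, fn):
    """F-1 score: harmonic mean of precision TP/(TP+FP) and recall
    TP/(TP+FN), which simplifies to 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts from the slide's table:
initial = f1_score(tp=19706, fp=1769, fn=11731)    # ~0.7449
optimized = f1_score(tp=27093, fp=1335, fn=4344)   # ~0.9051
```

Note that true negatives do not enter F-1 at all, which is why it is a more informative metric than raw accuracy here: with ~263k negative samples against ~31k positives, a detector could score high accuracy while missing many faces.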
  12. Moving to Fixed Point: Empirical Approach • Compare the output of every layer with the reference floating-point version; differences may grow after each layer • The detection threshold might need to be tweaked to achieve similar results • Pipeline: start from a greyscale image (8-bit pixels); convert to fixed point based on the value range, e.g. 16 bit (Q2S13); make sure the accumulator is wide enough, e.g. 32 bit (signed); after the non-linear function, shift right to avoid overflow, x = max(0, x) >> N, choosing N according to the dynamic range of the x values • [Diagram: image ∗ filter → accumulator (750, 255, 590, −20) → ReLU (750, 255, 590, 0) → shift + saturate (255, 127, 255, 0) → 8-bit feature map]
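The quantization and shift-and-saturate steps on slide 12 can be sketched in Python. The function names and the choice N = 1 are assumptions for illustration; with N = 1 the sketch reproduces the slide's example accumulator values (750, 255, 590, −20 becoming 255, 127, 255, 0).

```python
def float_to_q2s13(x):
    """Quantize a float to Q2.13 signed fixed point
    (1 sign bit, 2 integer bits, 13 fractional bits),
    saturating to the 16-bit signed range."""
    v = int(round(x * (1 << 13)))
    return max(-(1 << 15), min((1 << 15) - 1, v))

def relu_shift_saturate(acc, n, out_bits=8):
    """Post-convolution step from the slide: x = max(0, x) >> N,
    then saturate to the unsigned output width (e.g. 8 bits)."""
    v = max(0, acc) >> n
    return min((1 << out_bits) - 1, v)

# Accumulator values from the slide, with an assumed shift of N = 1:
outputs = [relu_shift_saturate(a, 1) for a in (750, 255, 590, -20)]
# outputs == [255, 127, 255, 0]
```

Because ReLU clamps negatives to zero before the shift, the saturating output can be unsigned, which is what makes the 8-bit feature-map representation on the slide work.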
  13. Results For Face Detection Application • FDDB: Face Detection Data Set and Benchmark • Results shown for the embedded small, 8-bit fixed-point graph • Localization can be improved with pre/post-processing, which impacts scores; not done here
     Type                 F-1
     Best (CascadeCNN)    0.91
     Middle 10 average    0.85
     Embedded – 40%       0.84
     Embedded – 50%       0.82
  14. Low-cost, Low-power, Flexible CNN Subsystem • Design-time configurable: number of CNN processing elements (2 to 8), streaming interconnection network configured for the number of cores • Runtime reconfigurable: flexible point-to-point connections between all cores • CNN-optimized instruction set: convolutions, MAC, LUT, … • Micro-DMA & stream interface for data movement • Programmable using the generated C compiler; each CNN PE has a local data & program memory • [Block diagram: quad-core 32-bit RISC MP with sync, DMA, shared data memory, and a CNN engine of up to 8 PEs on a reconfigurable streaming interconnect]
  15. Mapping Example and Performance • Mapping on 4 processing elements (4 PE, 5 FIFO configuration); smaller layers merged together • Input image read only once • 30 cycles on average to do 8 convolutions of 5x5 in parallel, including all data movement & contention • Over 85% MAC resource utilization (8 MACs / CNN PE) • ~15 mW per PE @ 28nm HPM, with memory & interconnect • [Diagram: layers L1&4, L2, L3a, L3b mapped onto PEs, connected by FIFOs over the subsystem interconnect]
  16. Demonstrator • HAPS 70-S12 prototyping system clocked at 50 MHz (10% of real-time) • Host application streams workstation webcam video frames to DDR over the UMR-bus and back • ARC EV52 processor: RISC multi-core, shared data memory, CNN engine (PE 1 … PE 8) running the CNN graph, DMA, AXI subsystem interconnect • ARC HS core: reads in the frame, pyramid (scaling), non-max suppression, softmax, displays the result • [Block diagram: AXI interconnect, DDR, AXI-to-UMRBus bridge]
  17. Lessons Learned • CNN compute requirements can be dramatically reduced with a small impact on detection rates; works well when the number of object classes to detect is kept small • Offline training is the critical step to obtain good performance • Specialized and programmable hardware (a CNN accelerator coupled with a quad-core RISC cluster) can efficiently implement many different CNN graphs at low power and area • Some pre- and post-processing is needed for a complete and useful application • Useful to couple the CNN with other processing steps to improve performance: shrinking the image when it doesn’t impact detection rates, sliding a detection window on an image, regions of interest
  18. Resources • Selected CNN papers: Embedded facial image processing with Convolutional Neural Networks (http://liris.cnrs.fr/Documents/Liris-6072.pdf); Memory-Centric Accelerator Design for Convolutional Neural Networks (http://parse.ele.tue.nl/system/attachments/58/original/iccdMP17.pdf?1381908921) • CNN tutorials & courses: Stanford CNN course (http://cs231n.github.io/); neural network intro and visualization (http://colah.github.io/) • Synopsys DesignWare Embedded Vision Processors: http://www.synopsys.com/ev • More information and demo available at the Technology Showcase (Mission City Ballroom, Tables 3 & 4)
