
"Five+ Techniques for Efficient Implementation of Neural Networks," a Presentation from Synopsys


For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/synopsys/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit-moons

For more information about embedded vision, please visit:
http://www.embedded-vision.com

Bert Moons, Hardware Design Architect at Synopsys, presents the "Five+ Techniques for Efficient Implementation of Neural Networks" tutorial at the May 2019 Embedded Vision Summit.

Embedding real-time, large-scale deep learning vision applications at the edge is challenging due to their huge computational, memory and bandwidth requirements. System architects can mitigate these demands by modifying deep neural networks (DNNs) to make them more energy-efficient and less demanding of embedded processing hardware.

In this talk, Moons provides an introduction to today’s established techniques for efficient implementation of DNNs: advanced quantization, network decomposition, weight pruning and sharing, and sparsity-based compression. He also previews up-and-coming techniques such as trained quantization and correlation-based compression.

"Five+ Techniques for Efficient Implementation of Neural Networks," a Presentation from Synopsys

  1. 5+ Techniques for Efficient Implementation of Neural Networks. Bert Moons, Synopsys, May 2019.
  2. Introduction – Challenges of Embedding Deep Learning
  3. 3 Major Challenges (Introduction – Challenges of embedded deep learning). Neural network accuracy comes at a high cost in terms of model storage and operations per input feature. Many embedded applications require real-time operation on high-dimensional, large input data from various input sources, and many require support for a variety of networks: CNNs for feature extraction, RNNs for sequence modeling.
  4. 3 Major Challenges (Introduction – Challenges of embedded deep learning). 1. Many operations per pixel. 2. A lot of pixels to process in real time. 3. A large variety of different algorithms.
  5. Classification accuracy comes at a cost (Introduction – Challenges of embedded deep learning). Neural network accuracy comes at the cost of a high workload per input pixel and huge model sizes and bandwidth requirements. [Chart: best reported top-5 accuracy on ImageNet-1000 (%) for conventional machine learning vs. deep learning vs. human performance.]
  6. Computing on large input data (Introduction – Challenges of embedded deep learning). Embedded applications require real-time operation on large input frames: relative pixel counts are roughly 1X for an ImageNet image, 40X for Full HD and 160X for 4K.
  7. Massive workload in real-time applications (Introduction – Challenges of embedded deep learning). A single ImageNet image costs 1 GOP to 10 GOP, depending on the network (MobileNet V2, GoogleNet, ResNet-50, VGG-16). At 30 fps on Full HD frames from 6 cameras, an ADAS system needs roughly 5 to 180 TOPS. [Charts: top-1 ImageNet accuracy (%) vs. operations per ImageNet image, and vs. operations per second for the 6-camera case.]
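
As a quick sanity check, the workload figure can be reproduced from the ratios on the previous two slides; the short Python sketch below assumes 224x224 ImageNet images and the 6-camera, 30 fps Full-HD scenario from the slide, everything else follows from arithmetic.

```python
# Back-of-the-envelope check of the workload numbers, reusing the ratios
# quoted on the previous slides (224x224 ImageNet images, ~40x more pixels
# in a Full-HD frame, 1-10 GOP per ImageNet-sized image, 6 cameras at 30 fps).
imagenet_pixels = 224 * 224
fhd_pixels = 1920 * 1080
scale = fhd_pixels / imagenet_pixels          # ~41x more pixels per frame
cameras, fps = 6, 30

for gop_per_image in (1, 10):
    tops = cameras * fps * scale * gop_per_image / 1000   # GOP/s -> TOP/s
    print(f"{gop_per_image} GOP/image -> ~{tops:.0f} TOP/s")
# Prints ~7 and ~74 TOP/s, the same order of magnitude as the 5-180 TOPS on the slide.
```
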
  8. 5+ Techniques to reduce the DNN workload (Introduction – Challenges of embedded deep learning). A. Neural networks are error-tolerant: (1) linear post-training 8/12/16-bit quantization, (2) linear trained 2/4/8-bit quantization, (3) non-linear trained 2/4/8-bit quantization through clustering. B. Neural networks have redundancies and are over-dimensioned: (4) network pruning and compression, (5) network decomposition: low-rank network approximations. C. Neural networks have sparse and correlated intermediate results: (6) sparsity- and correlation-based feature-map compression.
  9. A. Neural Networks Are Error-Tolerant
  10. The benefits of quantized number representations (5 Techniques – A. Neural Networks Are Error-Tolerant). 8-bit fixed point is 3-4x faster and 2-4x more efficient than 16-bit floating point. [Table comparing 16b float, 8b fixed and 4b fixed: energy consumption per unit, processing units per chip and classification time per chip (~16 / ~6-8 / ~2-4); relative throughput O(1) / O(16) / O(256) fps; relative accuracy 100% (no loss) / 99% / 50-95%.] [Choi,2019]
  11. Linear post-training quantization (5 Techniques – A. Neural Networks Are Error-Tolerant). Convert floating-point pretrained models to dynamic fixed point. [Figure: bit patterns contrasting plain fixed point, where all values share one system exponent, with dynamic fixed point, where each group of values shares its own exponent.] [Courbariaux,2014]
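
To make the idea concrete, here is a minimal Python sketch of dynamic fixed-point conversion: each group of values is stored as small integers plus one shared exponent. The group size, rounding and exponent choice below are illustrative assumptions, not the exact scheme of [Courbariaux,2014].

```python
import numpy as np

def to_dynamic_fixed_point(group, num_bits=8):
    """Quantize one group of floats to signed integers sharing a single exponent.

    A minimal sketch: the shared exponent is chosen so that the largest
    magnitude in the group just fits the signed num_bits integer range.
    """
    max_int = 2 ** (num_bits - 1) - 1                  # e.g. 127 for 8 bits
    max_abs = np.max(np.abs(group)) + 1e-12
    exp = int(np.ceil(np.log2(max_abs / max_int)))     # one exponent per group
    q = np.clip(np.round(group / 2.0 ** exp), -max_int - 1, max_int).astype(np.int8)
    return q, exp

weights = (np.random.randn(64) * 0.05).astype(np.float32)   # one weight group
q, exp = to_dynamic_fixed_point(weights)
reconstructed = q.astype(np.float32) * 2.0 ** exp
print("shared exponent:", exp, "max abs error:", np.max(np.abs(weights - reconstructed)))
```
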
  12. Linear post-training quantization (5 Techniques – A. Neural Networks Are Error-Tolerant). Dynamic fixed-point quantization allows running neural networks with 8-bit weights and activations across the board. [Chart: accuracy of the 32-bit float baseline vs. 8-bit fixed point across networks.] [Nvidia,2017]
  13. Linear post-training quantization (5 Techniques – A. Neural Networks Are Error-Tolerant). How to optimally choose dynamic exponent groups, saturation thresholds, and weight and activation exponents? Min-max scaling throws away small values; a saturation threshold represents small values better, but clips large ones. [Nvidia,2017]
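
One simple, illustrative way to pick such a saturation threshold from calibration data is percentile clipping, sketched below; the actual TensorRT calibration of [Nvidia,2017] minimizes a KL divergence instead, so treat the `percentile` parameter as a stand-in heuristic.

```python
import numpy as np

def calibration_scales(activations, num_bits=8, percentile=99.9):
    """Return (min-max scale, clipped scale) for a symmetric linear quantizer."""
    max_int = 2 ** (num_bits - 1) - 1
    minmax_threshold = np.max(np.abs(activations))                    # keeps every outlier
    clip_threshold = np.percentile(np.abs(activations), percentile)   # saturates rare large values
    return minmax_threshold / max_int, clip_threshold / max_int

# Heavy-tailed toy activations: mostly small values plus a couple of outliers.
acts = np.concatenate([np.random.randn(10_000) * 0.1, [8.0, -9.0]])
s_minmax, s_clip = calibration_scales(acts)
print(f"min-max scale: {s_minmax:.5f}   clipped scale: {s_clip:.5f}")
# The clipped scale is far smaller, so the small activations keep much more resolution.
```
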
  14. Linear trained quantization (5 Techniques – A. Neural Networks Are Error-Tolerant). Floating-point models are a bad initializer for low-precision fixed point; trained quantization from scratch automates heuristic-based optimization. Weights and activations are quantized with straight-through estimators, allowing back-propagation, and the saturation range for activations is trained as well. [Figure: forward and backward passes through the quantizer.] PACT, Parameterized Clipping Activation [Choi,2019]
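
A minimal PyTorch-flavored sketch of the idea follows: activations are clipped to a trainable range [0, alpha] and uniformly quantized, with a straight-through estimator passing gradients through the rounding step. Class and parameter names (`RoundSTE`, `PACTQuantReLU`, `alpha_init`) are illustrative, and this is a simplification of the PACT formulation in [Choi,2018], not a reproduction of it.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Rounding with a straight-through estimator: identity gradient in backward."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output            # pretend round() were the identity

class PACTQuantReLU(torch.nn.Module):
    """Clip activations to a trainable range [0, alpha], then quantize to num_bits."""
    def __init__(self, num_bits=4, alpha_init=6.0):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.tensor(alpha_init))
        self.levels = 2 ** num_bits - 1

    def forward(self, x):
        y = torch.clamp(x, min=0.0)
        y = torch.minimum(y, self.alpha)     # trainable saturation point, receives gradients
        return RoundSTE.apply(y / self.alpha * self.levels) / self.levels * self.alpha
```
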
  15. Linear trained quantization (5 Techniques – A. Neural Networks Are Error-Tolerant). Good accuracy down to 2 bits, with graceful performance degradation. [Chart: benchmark accuracy relative to the full-precision float baseline (0.85-1.05) for CIFAR10, SVHN, AlexNet, ResNet18 and ResNet50 at 5b, 4b, 3b and 2b.] [Choi,2018]
  16. Non-linear trained quantization – codebook clustering (5 Techniques – A. Neural Networks Are Error-Tolerant). Clustered, codebook quantization can be optimally trained. This only reduces bandwidth; computations are still in floating point. [Han,2015]
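
A hedged sketch of codebook quantization in the spirit of [Han,2015]: weights are clustered with k-means, and only per-weight cluster indices plus a small codebook are stored. The fine-tuning step of trained quantization, where gradients are accumulated per cluster to update the centroids, is omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(weights, num_bits=4):
    """Replace each weight by the nearest of 2**num_bits shared centroids."""
    k = 2 ** num_bits
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
    codebook = km.cluster_centers_.ravel()        # k floats to store once
    indices = km.labels_.astype(np.uint8)         # num_bits per weight to store
    return codebook, indices.reshape(weights.shape)

w = np.random.randn(256, 64).astype(np.float32)
codebook, idx = cluster_weights(w, num_bits=4)
w_decoded = codebook[idx]                         # decode by table lookup
print("distinct weight values:", np.unique(w_decoded).size,
      " mean abs error:", np.mean(np.abs(w - w_decoded)))
```
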
  17. B. Neural Networks Are Over-Dimensioned & Redundant
  18. Pruning Neural Networks (5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant). Pruning removes unnecessary connections in the neural network; accuracy is recovered by retraining the pruned network. [Han,2015]
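
A minimal magnitude-pruning sketch; the prune-and-retrain loop of [Han,2015] is only summarized in the comments, since the retraining itself is framework-specific.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights and return the mask."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

w = np.random.randn(512, 512).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.5)
print(f"kept {mask.mean():.0%} of the connections")
# Typical flow: train dense -> prune -> retrain with the mask re-applied after every
# optimizer step (so pruned connections stay at zero) -> optionally repeat.
```
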
  19. Low-Rank Singular Value Decomposition (SVD) in DNNs (5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant). Many singular values are small and can be discarded: A = U Σ Vᵀ, A ≅ U′ Σ′ V′ᵀ = U′N, where the truncated factors keep only the largest singular values. [Xue,2013]
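
For a fully-connected layer, the low-rank idea reduces to a few lines of numpy, roughly following [Xue,2013]: keep the largest singular values and split the weight matrix into two thinner matrices, so one layer becomes two smaller ones.

```python
import numpy as np

def svd_compress(W, rank):
    """Approximate W (m x n) as the product of an (m x r) and an (r x n) matrix."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]       # fold the retained singular values into U
    N_r = Vt[:rank, :]                 # so that W ~= U_r @ N_r
    return U_r, N_r

# Note: a random matrix has a flat spectrum, so the error below is pessimistic;
# trained weight matrices typically have fast-decaying spectra and compress better.
W = np.random.randn(1024, 1024).astype(np.float32)
U_r, N_r = svd_compress(W, rank=128)
print(W.size, "->", U_r.size + N_r.size, "parameters")   # 1,048,576 -> 262,144 (4x fewer)
print("relative error:", np.linalg.norm(W - U_r @ N_r) / np.linalg.norm(W))
```
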
  20. Low-Rank Canonical Polyadic (CP) decomposition in CNNs (5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant). Convert a large convolutional filter into a triplet of smaller filters. [Astrid,2017]
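
One common way to realize the resulting triplet structure is sketched below in PyTorch: a 1x1 channel-reducing convolution, a per-rank KxK spatial convolution, and a 1x1 expansion. Only the layer shapes are shown; in practice the factor weights come from actually CP-decomposing the trained filter, as in [Astrid,2017], which is not reproduced here.

```python
import torch.nn as nn

def cp_style_conv(in_ch, out_ch, kernel_size, rank, stride=1, padding=0):
    """Replace one KxK conv (in_ch -> out_ch) by a triplet of smaller convs."""
    return nn.Sequential(
        nn.Conv2d(in_ch, rank, kernel_size=1, bias=False),           # mix input channels into R components
        nn.Conv2d(rank, rank, kernel_size=kernel_size, stride=stride,
                  padding=padding, groups=rank, bias=False),         # cheap spatial filtering, one filter per component
        nn.Conv2d(rank, out_ch, kernel_size=1, bias=True),           # expand back to the output channels
    )

# Example: a 3x3, 256->256 conv (~590k weights) becomes ~34k weights at rank 64.
layer = cp_style_conv(256, 256, kernel_size=3, rank=64, padding=1)
print(sum(p.numel() for p in layer.parameters()))
```
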
  21. Basic example: combining SVD, pruning and clustering (5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant). 11x model compression in a phone-recognition LSTM. [Chart: LSTM compression rate (0-12) for Base, P, SVD, SVD+P, P+C and SVD+P+C, where P = pruning, SVD = singular value decomposition, C = clustering / codebook compression.] [Goetschalckx,2018]
  22. C. Neural Networks have Sparse & Correlated Intermediate Results
  23. Sparse feature-map compression (5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps). Feature-map bandwidth dominates in modern CNNs. [Chart: coefficient bandwidth vs. feature-map bandwidth in MobileNet-V1 (MB), for a block of 1x1 (32), 3x3 depthwise (32) and 1x1 (64) layers.]
  24. Sparse feature-map compression (5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps). ReLU activation introduces 50-90% zero-valued numbers in intermediate feature maps. Example on 8-bit features: ReLU maps [-5, 4, 12, -10, 0, 17, -1, 3, 2] to [0, 4, 12, 0, 0, 17, 0, 3, 2].
  25. Sparse feature-map compression (5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps). Hardware support for multi-bit Huffman encoding allows up to 2x bandwidth reduction in typical networks; zero-runlength encoding as in [Chen,2016], Huffman encoding as in [Moons,2017]. Example code: zero → 2'b00; nonzero value below 16 → 2'b01 followed by a 4-bit word; other nonzero values → 1'b1 followed by an 8-bit word. The nine 8-bit features [0, 4, 12, 0, 0, 17, 0, 3, 2] shrink from 72 bits to 41 bits.
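
The encoding on this slide can be reproduced directly; the small Python sketch below uses exactly the three codes listed ('00' for zero, '01' plus 4 bits for values below 16, '1' plus 8 bits otherwise) and recovers the 72-to-41-bit reduction for the example feature vector.

```python
def encode_features(values):
    """Variable-length encode 8-bit feature values using the scheme on the slide."""
    bits = []
    for v in values:
        if v == 0:
            bits.append("00")                     # zeros are very common after ReLU
        elif v < 16:
            bits.append("01" + format(v, "04b"))  # small nonzero value: 6 bits total
        else:
            bits.append("1" + format(v, "08b"))   # large value: 9 bits total
    return "".join(bits)

features = [0, 4, 12, 0, 0, 17, 0, 3, 2]
code = encode_features(features)
print(len(features) * 8, "->", len(code), "bits")   # 72 -> 41 bits, as on the slide
```
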
  26. Correlated feature-map compression (5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps). Intermediate features in the same channel plane are highly correlated. [Figure: example intermediate feature maps at two scales from a ReLU-less YOLOv2.]
  27. Correlated feature-map compression (5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps). Super-linear, correlation-based extended bit-plane compression allows feature-map compression even on non-sparse data. Example: the correlated values 16, 20, 20, 20, 28, 28, 28, 99 become the delta values 0, 4, 0, 0, 8, 0, 0, 71, splitting the data into many zero values and a few non-zero values. [Cavigelli,2018]
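
A sketch of the delta step only (the full extended bit-plane scheme of [Cavigelli,2018] is more elaborate): differencing neighbouring values turns a correlated, non-sparse row into a mostly-zero one that the sparsity-based encoders from the previous slides can then compress.

```python
import numpy as np

def delta_encode(row):
    """Replace each value by its difference with its left neighbour (first value kept)."""
    row = np.asarray(row, dtype=np.int32)
    deltas = np.empty_like(row)
    deltas[0] = row[0]
    deltas[1:] = row[1:] - row[:-1]
    return deltas

correlated = [16, 20, 20, 20, 28, 28, 28, 99]   # neighbouring values in one feature-map row
print(delta_encode(correlated))                 # [16  4  0  0  8  0  0 71]: mostly zero or small
```
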
  28. Correlated feature-map compression (5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps). Correlated compression outperforms sparsity-based compression. [Chart: compression rate (0-3) of sparsity-based vs. correlation-based compression on MobileNet, ResNet-50, YOLO V2 (VOC) and VGG-16.]
  29. Conclusion – Bringing it All Together
  30. A first-order analysis on ResNet-50 (5 Techniques – Conclusion: bringing it all together). A first-order energy model for neural network inference. Assume: quadratic energy scaling per MAC when going from 32 to 8 bit; linear energy saving per read/write in DDR/SRAM when going from 32 to 8 bit; 50% of coefficients zero when pruning; 50% compute reduction under decomposition; 50% of activations can be compressed. [Han,2015]
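
A sketch of what such a first-order model might look like is given below. The per-operation energy constants are placeholder assumptions (only the scaling rules from the slide's assumption list are taken from the talk), so the resulting ratio is illustrative and will not exactly reproduce the 10x and 12.5x figures on the next slides.

```python
# First-order model: E = MACs * e_mac + SRAM_bytes * e_sram + DDR_bytes * e_ddr.
# The energy constants are placeholders; only the scaling rules come from the slide.
E_MAC_32B = 4.0     # pJ per 32-bit MAC (assumed)
E_SRAM_B = 2.0      # pJ per byte of on-chip SRAM traffic (assumed)
E_DDR_B = 100.0     # pJ per byte of DDR traffic (assumed)

def frame_energy(macs, sram_bytes, ddr_bytes, bits=32,
                 pruned=False, decomposed=False, compressed=False):
    mac_term = macs * E_MAC_32B * (bits / 32) ** 2     # quadratic energy scaling per MAC
    byte_scale = bits / 32                             # linear scaling for memory traffic
    if pruned:
        mac_term *= 0.5                                # 50% of coefficients zero
    if decomposed:
        mac_term *= 0.5                                # 50% compute reduction
    if compressed:
        byte_scale *= 0.5                              # 50% of traffic compressed away
    return (mac_term + sram_bytes * byte_scale * E_SRAM_B
            + ddr_bytes * byte_scale * E_DDR_B)

# ResNet-50-style workload from the next slide: ~3.6 GMACs, ~1 GB SRAM, ~65 MB DDR per frame.
base = frame_energy(3.6e9, 1e9, 65e6)
optimized = frame_energy(3.6e9, 1e9, 65e6, bits=8, pruned=True, decomposed=True, compressed=True)
print(f"optimized energy relative to 32b float baseline: {optimized / base:.0%}")
```
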
  31. A first-order analysis on ResNet-50 (5 Techniques – Conclusion: bringing it all together). When all model data is stored in DRAM (on the order of 65 MB DDR traffic, 1 GB SRAM traffic and 3.6 G MACs per frame), the optimized ResNet-50 is 10x more efficient than its plain 32-bit counterpart. [Chart: relative energy consumption of 100% for 32b float, 22% after A (8b fixed), 16% after B (decomposition + pruning), and 11% after C (feature-map compression).]
  32. A first-order analysis on ResNet-50 (5 Techniques – Conclusion: bringing it all together). In a system with sufficient on-chip SRAM (no DDR traffic, on the order of 1 GB SRAM traffic and 3.6 G MACs per frame), the optimized ResNet-50 is 12.5x more efficient than its plain 32-bit counterpart. [Chart: relative energy consumption of 100% for 32b float, 15% after A (8b fixed), 9% after B (decomposition + pruning), and 8% after C (feature-map compression).]
  33. For More Information. Visit the Synopsys booth for demos on automotive ADAS, virtual reality and more, including the EV6x Embedded Vision Processor IP with Safety Enhancement Package (Best Processor award). Join Synopsys' EV seminar, "Navigating Embedded Vision at the Edge," on Thursday, May 23 at the Santa Clara Convention Center (doors open 8 AM), with sessions on EV6x Vision Processor IP, functional safety, security, OpenVX and more; register via the EV Alliance website or at the Synopsys booth.
  34. References
     [Han,2015,2016] https://arxiv.org/abs/1510.00149 and https://arxiv.org/abs/1602.01528
     [Xue,2013] https://www.microsoft.com/en-us/research/wp-content/uploads/2013/01/svd_v2.pdf
     [Nvidia,2017] http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
     [Choi,2018,2019] https://arxiv.org/abs/1805.06085 and https://www.ibm.com/blogs/research/2019/04/2-bit-precision/
     [Goetschalckx,2018] https://www.sigmobile.org/mobisys/2018/workshops/deepmobile18/papers/Efficiently_Combining_SVD_Pruning_Clustering_Retraining.pdf
     [Astrid,2017] https://arxiv.org/abs/1701.07148
     [Moons,2017] https://ieeexplore.ieee.org/abstract/document/7870353
     [Chen,2016] http://eyeriss.mit.edu/
     [Cavigelli,2018] https://arxiv.org/abs/1810.03979
     [Courbariaux,2014] https://arxiv.org/pdf/1412.7024.pdf
  35. THANK YOU