- 1. Squeezing Deep Learning into mobile phones - A Practitioners guide Anirudh Koul
- 2. Anirudh Koul , @anirudhkoul , http://koul.ai Project Lead, Seeing AI (SeeingAI.com) Applied Researcher, Microsoft AI & Research Akoul at Microsoft dot com Currently working on applying artificial intelligence for Hololens, autonomous robots and accessibility Along with Eugene Seleznev, Saqib Shaikh, Meher Kasam
- 3. Why Deep Learning On Mobile? Latency Privacy
- 4. Response Time Limits – Powers of 10 0.1 second : Reacting instantly 1.0 seconds : User’s flow of thought 10 seconds : Keeping the user’s attention [Miller 1968; Card et al. 1991; Jakob Nielsen 1993]:
- 5. Mobile Deep Learning Recipe Mobile Inference Engine + Pretrained Model = DL App (Efficient) (Efficient)
- 6. Building a DL App in _ time
- 7. Building a DL App in 1 hour
- 8. Use Cloud APIs Microsoft Cognitive Services Clarifai Google Cloud Vision IBM Watson Services Amazon Rekognition Tip : Resize to 224x224 at under 50% compression with bilinear interpolation before network transmission But don’t resize for Text / OCR projects
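The resize tip above can be illustrated with a minimal pure-Python sketch of bilinear interpolation. It works on a grayscale image stored as nested lists; the function name `resize_bilinear` is hypothetical, and a real app would more likely use an image library and also apply JPEG compression, which is not shown here.

```python
def resize_bilinear(image, out_h, out_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for y in range(out_h):
        # Map the output row back to a fractional source row.
        src_y = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(src_y)
        y1 = min(y0 + 1, in_h - 1)
        fy = src_y - y0
        row = []
        for x in range(out_w):
            # Map the output column back to a fractional source column.
            src_x = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(src_x)
            x1 = min(x0 + 1, in_w - 1)
            fx = src_x - x0
            # Blend the four neighbouring pixels, first along x, then along y.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Tiny 2x2 example upscaled to 3x3, and the 224x224 size used in the tip.
small = resize_bilinear([[0, 10], [20, 30]], 3, 3)
network_input = resize_bilinear([[0, 10], [20, 30]], 224, 224)
```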
- 9. Microsoft Cognitive Services Models won the 2015 ImageNet Large Scale Visual Recognition Challenge Vision, Face, Emotion, Video and 21 other topics
- 10. Custom Vision Service (customvision.ai) – Drag and drop training Tip : Upload 30 photos per class to make a prototype model Upload 200 photos per class for a more robust production model The more distinct the shape/type of object, the fewer images required.
- 11. Custom Vision Service (customvision.ai) – Drag and drop training Tip : Use Fatkun Browser Extension to download images from Search Engine, or use Bing Image Search API to programmatically download photos with proper rights
- 12. Building a DL App in 1 day
- 13. http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/ Energy to train Convolutional Neural Network Energy to use Convolutional Neural Network
- 14. Base PreTrained Model ImageNet – 1000 Object Categorizer Inception Resnet
- 15. Running pre-trained models on mobile Core ML Tensorflow Caffe2 Snapdragon Neural Processing Engine MXNet CNNDroid DeepLearningKit Torch
- 16. Core ML From Apple, for iOS 11 Convert Caffe/Tensorflow model to CoreML model in 3 lines:
  `import coremltools`
  `coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')`
  `coreml_model.save('my_model.mlmodel')`
  Add model to iOS project and call for prediction. Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM Builds on top of low-level primitives Accelerate, BNNS, Metal Performance Shaders (MPS) Noticeable speedup between MPS (iOS 10) vs CoreML implementation (iOS 11) (same model, same hardware) Automatically minimizes memory footprint and power consumption
- 17. CoreML Benchmark - Pick a DNN for your mobile architecture

  | Model | Top-1 Accuracy (%) | Size of Model (MB) | iPhone 6 (2014) Execution Time (ms) | iPhone 6S (2015) Execution Time (ms) | iPhone 7 (2016) Execution Time (ms) |
  |---|---|---|---|---|---|
  | VGG 16 | 71 | 553 | 4556 | 254 | 208 |
  | Inception v3 | 78 | 95 | 637 | 98 | 90 |
  | Resnet 50 | 75 | 103 | 557 | 72 | 64 |
  | MobileNet | 71 | 17 | 109 | 52 | 32 |
  | SqueezeNet | 57 | 5 | 78 | 29 | 24 |

  Huge improvement in hardware in 2015
- 18. Putting out more frames than an art gallery
- 19. Tensorflow Easy pipeline to bring Tensorflow models to mobile Excellent documentation Optimizations to bring model to mobile Upcoming : XLA (Accelerated Linear Algebra) compiler to optimize for hardware
- 20. Caffe2 From Facebook Under 1 MB of binary size Built for Speed : For ARM CPU : Uses NEON Kernels, NNPack For iPhone GPU : Uses Metal Performance Shaders and Metal For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup) ONNX format support to import models from CNTK/PyTorch
- 21. Snapdragon Neural Processing Engine (NPE) SDK Like CoreML for Qualcomm Snapdragon chips Published speedup of 4-5x On about half the Android phones Identifies best target core for inference - GPU, DSP or CPU Customizable to choose between battery power and performance Supports importing models from Caffe, Caffe2, Tensorflow
- 22. MXNET Amalgamation : Pack all the code in a single source file Pro: • Cross Platform (iOS, Android), Easy porting • Usable in any programming language Con: • CPU only, Slow https://github.com/Leliana/WhatsThis
- 23. CNNdroid GPU accelerated CNNs for Android Supports Caffe, Torch and Theano models ~30-40x Speedup using mobile GPU vs CPU (AlexNet) Internally, CNNdroid expresses data parallelism for different layers, instead of leaving to the GPU’s hardware scheduler
- 24. DeepLearningKit Platform : iOS, OS X and tvOS (Apple TV) DNN Type : CNNs models trained in Caffe Runs on mobile GPU, uses Metal Pro : Fast, directly ingests Caffe models Con : Unmaintained
- 25. Running pre-trained models on mobile

  | Mobile Library | Platform | GPU | DNN Architecture Supported | Trained Models Supported |
  |---|---|---|---|---|
  | CoreML | iOS | Yes | CNN, RNN, SciKit | Keras, Tensorflow, MXNet |
  | Tensorflow | iOS/Android | Yes | CNN, RNN, LSTM, etc | Tensorflow |
  | Caffe2 | iOS/Android | Yes | CNN | Caffe2, CNTK, PyTorch |
  | Snapdragon NPE | Android | Yes | CNN, RNN, LSTM | Caffe, Caffe2, Tensorflow |
  | CNNDroid | Android | Yes | CNN | Caffe, Torch, Theano |
  | DeepLearningKit | iOS | Yes | CNN | Caffe |
  | MXNet | iOS/Android | No | CNN, RNN, LSTM, etc | MXNet |
  | Torch | iOS/Android | No | CNN, RNN, LSTM, etc | Torch |
- 26. Possible Long Term Route for fastest speed on each phone
  - Train a model using your favorite DNN library
  - Import it to Tensorflow/Keras with Tensorflow
  - For iOS : Use Keras and CoreML
  - For Android :
    - For Qualcomm chips (~50% of Android phones) : Use Snapdragon NPE
    - For remaining phones : Use Tensorflow Mobile
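The routing on this slide can be written down as a small lookup. This is only a sketch of the decision itself; the function name `pick_inference_stack` and its arguments are hypothetical, and detecting the chipset on a real device is outside its scope.

```python
def pick_inference_stack(platform, has_qualcomm_chip=False):
    """Return the inference stack suggested on the slide for a given phone."""
    platform = platform.lower()
    if platform == "ios":
        return "Keras + CoreML"
    if platform == "android":
        # Roughly half of Android phones have a Qualcomm chip (per the slide).
        return "Snapdragon NPE" if has_qualcomm_chip else "Tensorflow Mobile"
    raise ValueError("unsupported platform: " + platform)
```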
- 27. Building a DL App in 1 week
- 28. Learn Playing an Accordion 3 months
- 29. Learn Playing an Accordion 3 months Knows Piano Fine Tune Skills 1 week
- 30. I got a dataset, Now What? Step 1 : Find a pre-trained model Step 2 : Fine tune a pre-trained model Step 3 : Run using existing frameworks “Don’t Be A Hero” - Andrej Karpathy
- 31. How to find pretrained models for my task? Search “Model Zoo” Microsoft Cognitive Toolkit (previously called CNTK) – 50 Models Caffe Model Zoo Keras Tensorflow MXNet
- 32. AlexNet, 2012 (simplified) [Krizhevsky, Sutskever,Hinton’12] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 11 n-dimension Feature representation
- 33. Deciding how to fine tune

  | Size of New Dataset | Similarity to Original Dataset | What to do? |
  |---|---|---|
  | Large | High | Fine tune. |
  | Small | High | Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features |
  | Small | Low | Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. |
  | Large | Low | Train CNN from scratch |

  http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
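The decision table can be encoded directly as a lookup. This is a minimal sketch; the function name `fine_tune_strategy` and the string labels are illustrative, and what counts as "large" or "high" is left to the reader's judgment.

```python
def fine_tune_strategy(dataset_size, similarity):
    """Map (size of new dataset, similarity to original dataset) to a strategy."""
    table = {
        ("large", "high"): "fine-tune the pre-trained network",
        ("small", "high"): "train a linear classifier on CNN features",
        ("small", "low"): "train a classifier on activations from lower layers",
        ("large", "low"): "train a CNN from scratch",
    }
    return table[(dataset_size.lower(), similarity.lower())]
```

For example, a small dataset that resembles the original one maps to training a linear classifier on top of the frozen CNN features rather than fine-tuning.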
- 37. CoreML exporter from customvision.ai – Drag and drop training 5 minute shortcut to finetuning and getting model ready in CoreML format
- 38. CoreML exporter from customvision.ai – Drag and drop training 5 minute shortcut to training, finetuning and getting model ready in CoreML format Drag and drop interface
- 39. Building a DL Website in 1 week
- 40. Less Data + Smaller Networks = Faster browser training
- 41. Several JavaScript Libraries Run large CNNs • Tensorfire • WebDNN • Keras-JS • MXNetJS • CaffeJS Train and Run CNNs • DeepLearn.js • ConvNetJS Train and Run LSTMs • Brain.js • Synaptic.js Train and Run NNs • Mind.js • DN2A
- 42. ConvNetJS – Train + Infer on CPU Both Train and Test NNs in browser Train CNNs in browser
- 43. DeepLearn.js – Train + Infer on GPU
- 44. DeepLearn.js – Train + Infer on GPU Uses WebGL to perform computation on GPU (including backprop) Immediate execution model for inference (like Numpy) Delayed execution model for training (like TensorFlow) Upcoming tools to export weights from Tensorflow checkpoints
- 45. Tensorfire – Infer on GPU Import models from Keras/Tensorflow Any GPU works (including AMD), runs faster than TensorFlow on Macbook Pro in browser Supports low-precision math Transforms NN weights into WebGL textures for speedup Similar library : WebDNN.js
- 46. Keras.js Run Keras models in browser, with GPU support.
- 47. Brain.JS Train and run NNs in browser Supports Feedforward, RNN, LSTM, GRU No CNNs Demo : http://brainjs.com/ Trained NN to recognize color contrast
- 48. MXNetJS On Firefox and Microsoft Edge, performance is 8x faster than Chrome. Optimization difference because of ASM.js.
- 49. Building a Crowdsourced Data Collector in 1 month
- 50. Barcode recognition from Seeing AI Aim : Help blind users identify products using barcode Issue : Blind users don’t know where the barcode is Live : Guide user in finding a barcode with audio cues With Server : Decode barcode to identify product Tech : MPSCNN running on mobile GPU + barcode library Metrics : 40 FPS (~25 ms) on iPhone 7
- 51. Currency recognition from Seeing AI Aim : Identify currency Live Identify denomination of paper currency instantly With Server - Tech Task specific CNN running on mobile GPU Metrics 40 FPS (~25 ms) on iPhone 7
- 52. Training Data Collection App Request volunteers to take photos of objects in non-obvious settings Sends photos to cloud, trains model nightly Newsletter shows the best photos from volunteers Let them compete for fame
- 53. Daily challenge - Collected by volunteers
- 55. Challenge: Can you fool a Deep Neural Network? Challenge users to find flaws in DNN Helps train a robust classifier with far fewer photos
- 56. Building a DL App in 6 months
- 57. What you want : $200,000 What you can afford : $2000 https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016
- 58. Revolution of Depth AlexNet, 8 layers (ILSVRC 2012) : 11x11 conv, 96, /4, pool/2 → 5x5 conv, 256, pool/2 → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256, pool/2 → fc, 4096 → fc, 4096 → fc, 1000 Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
- 59. Revolution of Depth [Figure: layer-by-layer diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
- 60. Revolution of Depth [Figure: layer-by-layer diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015), labeled “Ultra deep”] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
- 61. Revolution of Depth [Figure: layer-by-layer diagram of ResNet, 152 layers] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
- 62. Revolution of Depth vs Classification Accuracy

  | Entry | ImageNet Classification top-5 error (%) |
  |---|---|
  | ILSVRC'10 | 28.2 |
  | ILSVRC'11 | 25.8 |
  | ILSVRC'12 AlexNet | 16.4 |
  | ILSVRC'13 | 11.7 |
  | ILSVRC'14 VGG | 7.3 |
  | ILSVRC'14 GoogleNet | 6.7 |
  | ILSVRC'15 ResNet | 3.6 |
  | ILSVRC'16 Ensemble | 2.9 |

  Depth labels on chart : shallow, 8 layers, 19 layers, 22 layers, 152 layers. ILSVRC'16 entry : Ensemble of Resnet, Inception Resnet, Inception and Wide Residual Network Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
- 63. Your Budget - Smartphone Floating Point Operations Per Second (2015) http://pages.experts-exchange.com/processing-power-compared/
- 64. iPhone X is more powerful than a Macbook Pro https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/
- 65. Accuracy vs Operations Per Image Inference Size is proportional to num parameters Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016 552 MB 240 MB What we want
- 66. Accuracy Per Parameter Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016
- 67. Pick your DNN Architecture for your mobile architecture Resnet Family Under 64 ms on iPhone 7 using Metal GPU Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition”, 2015
- 68. CoreML Benchmark - Pick your DNN for your mobile architecture

  | Model | Top-1 Accuracy (%) | Size of Model (MB) | Million Mult-Adds | iPhone 6 Execution Time (ms) | iPhone 6S Execution Time (ms) | iPhone 7 Execution Time (ms) |
  |---|---|---|---|---|---|---|
  | VGG 16 | 71 | 553 | 15300 | 4556 | 254 | 208 |
  | Inception v3 | 78 | 95 | 5000 | 637 | 98 | 90 |
  | Resnet 50 | 75 | 103 | 3900 | 557 | 72 | 64 |
  | MobileNet | 71 | 17 | 569 | 109 | 52 | 32 |
  | SqueezeNet | 57 | 5 | 1700 | 78 | 29 | 24 |
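One way to use this benchmark is to pick the most accurate model that fits a latency budget on the target phone. The sketch below copies the numbers from the slide; the helper name `pick_model` and the tie-breaking rule (prefer the faster model at equal accuracy) are assumptions made for illustration.

```python
# Top-1 accuracy (%) and execution time (ms) per device, copied from the slide.
BENCHMARK = {
    "VGG 16": {"top1": 71, "ms": {"iPhone 6": 4556, "iPhone 6S": 254, "iPhone 7": 208}},
    "Inception v3": {"top1": 78, "ms": {"iPhone 6": 637, "iPhone 6S": 98, "iPhone 7": 90}},
    "Resnet 50": {"top1": 75, "ms": {"iPhone 6": 557, "iPhone 6S": 72, "iPhone 7": 64}},
    "MobileNet": {"top1": 71, "ms": {"iPhone 6": 109, "iPhone 6S": 52, "iPhone 7": 32}},
    "SqueezeNet": {"top1": 57, "ms": {"iPhone 6": 78, "iPhone 6S": 29, "iPhone 7": 24}},
}


def pick_model(device, budget_ms):
    """Return the most accurate model within the latency budget, or None."""
    fits = [
        (stats["top1"], -stats["ms"][device], name)  # faster model wins ties
        for name, stats in BENCHMARK.items()
        if stats["ms"][device] <= budget_ms
    ]
    return max(fits)[2] if fits else None
```

With the 0.1 second "reacting instantly" limit as the budget, this selects Inception v3 on an iPhone 7 but only SqueezeNet on an iPhone 6.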
- 69. MobileNet family Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv Tune with two parameters – Width Multiplier and resolution multiplier Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
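The saving from splitting a convolution into depthwise and pointwise parts can be checked with a short cost calculation. This sketch follows the multiply-add counts given in the MobileNets paper; the function names and the example layer shape (3x3 kernel, 512 input and output channels, 14x14 feature map) are chosen here for illustration.

```python
def standard_conv_mult_adds(k, c_in, c_out, feature_size):
    """Multiply-adds of a standard k x k convolution on a square feature map."""
    return k * k * c_in * c_out * feature_size * feature_size


def separable_conv_mult_adds(k, c_in, c_out, feature_size,
                             width_mult=1.0, resolution_mult=1.0):
    """Multiply-adds of a k x k depthwise conv followed by a 1x1 pointwise conv."""
    m = int(c_in * width_mult)                # width multiplier thins the channels
    n = int(c_out * width_mult)
    f = int(feature_size * resolution_mult)   # resolution multiplier shrinks the map
    depthwise = k * k * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise


standard = standard_conv_mult_adds(3, 512, 512, 14)
separable = separable_conv_mult_adds(3, 512, 512, 14)
ratio = separable / standard  # equals 1/c_out + 1/k^2, roughly 0.11 here
half_width = separable_conv_mult_adds(3, 512, 512, 14, width_mult=0.5)
```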
- 70. Comparison for DNN architectures for Object Detection Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
- 71. Strategies to make DNNs even more efficient Shallow networks Compressing pre-trained networks Designing compact layers Quantizing parameters Network binarization
- 72. Pruning Aim : Remove all connections with absolute weights below a threshold Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
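A minimal sketch of the thresholding step described above, on a flat list of weights. The function names are hypothetical; the cited paper also retrains the network after pruning to recover accuracy, which this sketch does not attempt.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


pruned = prune_by_magnitude([0.5, -0.05, 0.2, -0.8, 0.01], threshold=0.1)
```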
- 73. Observation : Most parameters are in the Fully Connected Layers AlexNet (240 MB) : 96% of all parameters VGG-16 (552 MB) : 90% of all parameters
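The VGG-16 figure can be reproduced by counting weights per layer. The layer shapes below come from the standard VGG-16 configuration rather than from the slides, and biases are ignored, so treat this as a rough check and not an exact accounting.

```python
# (input channels, output channels) of the 13 convolutional layers, all 3x3.
VGG16_CONV = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (inputs, outputs) of the 3 fully connected layers; the first sees a 7x7x512 map.
VGG16_FC = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_weights = sum(3 * 3 * c_in * c_out for c_in, c_out in VGG16_CONV)
fc_weights = sum(n_in * n_out for n_in, n_out in VGG16_FC)
total_weights = conv_weights + fc_weights

fc_fraction = fc_weights / total_weights   # close to 0.9
size_mb = total_weights * 4 / 1e6          # 32-bit floats, close to 553 MB
```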
- 74. Pruning gets quickest model compression without accuracy loss AlexNet 240 MB VGG-16 552 MB First layer which directly interacts with image is sensitive and cannot be pruned too much without hurting accuracy
- 75. Weight Sharing Idea : Cluster weights with similar values together, and store in a dictionary. Codebook Huffman coding HashedNets Simplest implementation: • Round all weights into 256 levels • Tensorflow export script reduces inception zip file from 87 MB to 26 MB with 1% drop in precision
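The "round all weights into 256 levels" idea can be sketched as a linear 8-bit quantizer. This is an illustrative version with hypothetical function names, not the Tensorflow export script mentioned on the slide: each weight is replaced by a one-byte code plus a shared offset and step.

```python
def quantize_256(weights):
    """Map each float weight to one of 256 evenly spaced levels (codes 0..255)."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / 255 if hi > lo else 1.0
    codes = [round((w - lo) / step) for w in weights]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate float weights from the 8-bit codes."""
    return [lo + c * step for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, step = quantize_256(weights)
restored = dequantize(codes, lo, step)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error of each weight is at most half a step, which is why accuracy typically drops only slightly while storage falls from 32 bits to 8 bits per weight.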
- 76. Selective training to keep networks shallow Idea : Augment data limited to how your network will be used Example : If making a selfie app, no benefit in rotating training images beyond ±45 degrees. Your phone will rotate anyway. Followed by WordLens / Google Translate Example : Add blur if analyzing mobile phone frames
- 77. Design consideration for custom architectures – Small Filters Three layers of 3x3 convolutions >> One layer of 7x7 convolution Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions Replace NxN convolutions with stack of 1xN and Nx1 Fewer parameters Less compute More non-linearity Better Faster Stronger Andrej Karpathy, CS-231n Notes, Lecture 11
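The "fewer parameters" claim can be verified with simple arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels (64 here, chosen arbitrarily) and ignores biases; the function names are illustrative.

```python
def conv_params(k_h, k_w, channels):
    """Weights in a k_h x k_w convolution with `channels` inputs and outputs."""
    return k_h * k_w * channels * channels


def receptive_field(num_layers, k):
    """Receptive field of a stack of stride-1 k x k convolutions."""
    return 1 + num_layers * (k - 1)


C = 64
one_7x7 = conv_params(7, 7, C)        # 49 * C^2
three_3x3 = 3 * conv_params(3, 3, C)  # 27 * C^2, same 7x7 receptive field
one_5x5 = conv_params(5, 5, C)        # 25 * C^2
split_5 = conv_params(1, 5, C) + conv_params(5, 1, C)  # 10 * C^2
```

Three stacked 3x3 layers see the same 7x7 window as a single 7x7 layer with about 45% fewer weights, and they apply three non-linearities instead of one.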
- 78. SqueezeNet - AlexNet-level accuracy in 0.5 MB SqueezeNet base 4.8 MB SqueezeNet compressed 0.5 MB 80.3% top-5 Accuracy on ImageNet 0.72 GFLOPS/image Fire Block Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
- 79. Reduced precision Reduce precision from 32 bits to 16 bits or fewer Use stochastic rounding for best results In Practice: • Ristretto + Caffe • Automatic Network quantization • Finds balance between compression rate and accuracy • Apple Metal Performance Shaders automatically quantize to 16 bits • Tensorflow has 8 bit quantization support • Gemmlowp – Low precision matrix multiplication library
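Stochastic rounding rounds a value up or down at random, with the probability of rounding up equal to how far the value sits above the lower grid point, so the result is correct on average. A minimal sketch, with a hypothetical function name and a coarse grid step of 1.0 chosen only to make the effect visible:

```python
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, up or down at random, unbiased on average."""
    lower = (x // step) * step
    frac = (x - lower) / step          # position between the two grid points
    return lower + step if rng.random() < frac else lower


rng = random.Random(0)                 # seeded so the run is repeatable
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(10000)]
mean = sum(samples) / len(samples)     # stays close to 0.3
```

Plain round-to-nearest would map every 0.3 to 0.0, which is the bias that stochastic rounding avoids.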
- 80. Binary weighted Networks Idea :Reduce the weights to -1,+1 Speedup : Convolution operation can be approximated by only summation and subtraction Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
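A minimal sketch of the idea on a single dot product. Following the cited paper, the weights are replaced by their signs and one scaling factor (the mean absolute weight), so the inner loop needs only additions and subtractions. The function names are hypothetical.

```python
def binarize(weights):
    """Approximate weights by alpha * sign(w), with alpha the mean absolute value."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def binary_dot(inputs, alpha, signs):
    """Dot product using only additions and subtractions, then one scaling."""
    total = 0.0
    for x, s in zip(inputs, signs):
        if s > 0:
            total += x
        else:
            total -= x
    return alpha * total


alpha, signs = binarize([0.5, -0.25, 0.75, -0.5])
approx = binary_dot([1.0, 2.0, 3.0, 4.0], alpha, signs)
```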
- 83. XNOR-Net Idea :Reduce both weights + inputs to -1,+1 Speedup : Convolution operation can be approximated by XNOR and Bitcount operations Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
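When both weights and inputs are -1/+1, a dot product reduces to counting the positions where the two bit patterns agree. The sketch below packs signs into Python integers (bit set means +1); the function names are illustrative, and a real implementation would use fixed-width machine words and a hardware popcount.

```python
def pack(signs):
    """Pack a list of -1/+1 values into an integer, one bit per value."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed -1/+1 vectors of length n via XNOR + bitcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # bitcount
    return 2 * matches - n              # matches minus mismatches


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = xnor_dot(pack(a), pack(b), len(a))
```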
- 86. XNOR-Net on Mobile
- 87. Challenges Off the shelf CNNs not robust for video Solutions: • Collective confidence over several frames • CortexNet
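One possible reading of "collective confidence over several frames" is to average per-class probabilities over a sliding window, so a single noisy frame cannot flip the prediction. The sketch below is an assumption about how this could be done, not the method used in the talk; the class name `FrameSmoother` is hypothetical.

```python
from collections import deque


class FrameSmoother:
    """Average class probabilities over the last `window` frames."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, frame_probs):
        """Add one frame's {label: probability} dict; return (label, confidence)."""
        self.history.append(frame_probs)
        labels = set()
        for probs in self.history:
            labels.update(probs)
        n = len(self.history)
        averaged = {
            label: sum(p.get(label, 0.0) for p in self.history) / n
            for label in labels
        }
        best = max(averaged, key=averaged.get)
        return best, averaged[best]


smoother = FrameSmoother(window=3)
frames = [
    {"cat": 0.9, "dog": 0.1},
    {"cat": 0.8, "dog": 0.2},
    {"cat": 0.2, "dog": 0.8},   # one noisy frame
]
for frame in frames:
    label, confidence = smoother.update(frame)
```

After the noisy third frame the smoothed label is still "cat", whereas a per-frame classifier would have flickered to "dog".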
- 88. Building a DL App and get $10 million in funding (or a PhD)
- 89. Minerva
- 91. DeepX Toolkit Nicholas D. Lane et al, “DXTK : Enabling Resource-efficient Deep Learning on Mobile and Embedded Devices with the DeepX Toolkit",2016
- 92. EIE : Efficient Inference Engine on Compressed DNNs Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network", 2016 189x faster than CPU 13x faster than GPU
- 93. One Last Question
- 94. How to access the slides in 1 second Link posted here -> @anirudhkoul