
Aran Khanna, Software Engineer, Amazon Web Services at MLconf ATL 2017


High Performance Deep Learning on Edge Devices With Apache MXNet:
Deep network based models are marked by an asymmetry between the large amount of compute power needed to train a model and the relatively small amount needed to deploy the trained model for inference. This is particularly true in computer vision tasks such as object detection or image classification, where millions of labeled images and large numbers of GPUs are needed to produce an accurate model, which can then be deployed for inference on low-powered devices with a single CPU. The challenge when deploying vision models on these low-powered devices is getting inference to run efficiently enough to allow near real-time processing of a video stream. Fortunately, Apache MXNet provides the tools to solve these issues: it lets users build highly performant models with techniques such as separable convolutions, quantized weights, and sparsity exploitation, and it provides custom hardware kernels to ensure inference is accelerated as far as the deployment hardware allows. This is demonstrated through a state-of-the-art MXNet-based vision network running in near real time on a low-powered Raspberry Pi. Finally, we discuss how running inference at the edge, combined with MXNet's efficient modeling tools, can massively drive down the compute cost of deploying deep networks in a production system at scale.


  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deep Learning at the Edge With Apache MXNet. Aran Khanna, AI Engineer, AWS Deep Learning (Amazon AI GRT Intern)
  2. Amazon AI
  3. What Do These Have in Common?
  4. Deep Neural Networks: Inputs → Outputs
  5. …At the Edge: Inputs → Outputs
  6. Deep Neural Networks at the Edge
  7. Overview: Motivating Problems in DL at the Edge; Why Apache MXNet; From the Metal to the Models with MXNet; DL at the Edge with AWS
  8–12. Why the Edge, When We Have the Cloud? Cloud vs. edge: Latency, Connectivity, Cost, Privacy/Security
  13. Motivating Examples • Real-Time Filtering (Neural Style Transfer)
  14. Motivating Examples • Industrial IoT (Out-of-Distribution/Anomaly Detection)
  15. Motivating Examples • Robotics (Object Detection and Recognition)
  16. Motivating Examples • Autonomous Driving Systems
  17. Amazon AI: Artificial Intelligence in the Hands of Every Developer. Services: Rekognition (vision), Polly (speech), Lex (chat). Platforms: Amazon ML, Spark & EMR, Kinesis, Batch, ECS. Engines: MXNet, TensorFlow, Caffe, Theano, PyTorch, CNTK. Infrastructure: GPU, CPU, IoT, Mobile.
  18. Amazon AI engines and infrastructure: MXNet, TensorFlow, Caffe, Theano, PyTorch, and CNTK, running on GPU, CPU, IoT, and Mobile.
  19. Overview: Motivating Problems in DL at the Edge; Why Apache MXNet; From the Metal to the Models with MXNet; DL at the Edge with AWS
  20. Deep Learning Frameworks
  21. Apache MXNet | Differentiators: Flexible (Mixed Programming API), Portable (Runs Everywhere), Performance (Near-Linear Scaling)
  22. Apache MXNet | Differentiators: Flexible (Mixed Programming API)
  23. Apache MXNet | Flexible Programming
      IMPERATIVE NDARRAY API:
      >>> import mxnet as mx
      >>> a = mx.nd.zeros((100, 50))
      >>> b = mx.nd.ones((100, 50))
      >>> c = a + b
      >>> c += 1
      >>> print(c)
      DECLARATIVE SYMBOLIC EXECUTOR:
      >>> net = mx.symbol.Variable('data')
      >>> net = mx.symbol.FullyConnected(data=net, num_hidden=128)
      >>> net = mx.symbol.SoftmaxOutput(data=net, name='softmax')
      >>> mod = mx.module.Module(net)
      >>> mod.bind(data_shapes=[('data', (100, 50))], label_shapes=[('softmax_label', (100,))])
      >>> mod.init_params()
      >>> mod.forward(mx.io.DataBatch(data=[c], label=[mx.nd.zeros((100,))]))
      >>> mod.backward()
  24. Apache MXNet | Differentiators: Performance (Near-Linear Scaling)
  25. Apache MXNet | Efficient Scaling (chart: speedup vs. number of GPUs, from 1 to 256, for Inception v3, ResNet, and AlexNet, tracking the ideal line at roughly 88% efficiency)
  26. Apache MXNet | Differentiators: Portable (Runs Everywhere)
  27. Apache MXNet | On Mobile Devices: https://mxnet.incubator.apache.org/how_to/smart_device.html
  28. Apache MXNet | On IoT Devices: https://mxnet.incubator.apache.org/get_started/install.html
  29. Apache MXNet | Community: Most Open (accepted into the Apache Incubator); Best on AWS (optimized for deep learning on AWS)
  30. Apache MXNet | Community: a diverse contributor base (Apple, Tesla, Microsoft, NYU, MIT, Stanford, and lots of others), outpacing Torch, Theano, and CNTK in contributions, with Amazon at ~35% of contributions (as of 3/30/17). Top contributors include Tianqi Chen (UW), Mu Li (AWS), Eric Xie (AWS), Sergey Kolychev (Whitehat), Sandeep K. (AWS), Yizhi Liu (Mediav), Jian Guo (TuSimple), Yao Wang (AWS), Chiyuan Zhang (MIT), Tianjun Xiao (Tesla), Xingjian Shi (HKUST), Liang Depeng (Sun Yat-sen U.), Nan Zhu (MSFT), Yutian Li (Stanford), and Bing Su (Apple).
  31. Apache MXNet | Apple CoreML: pip install mxnet-to-coreml
  32. Apache MXNet | Easy to Get Started: http://gluon.mxnet.io/
  33. Overview: Motivating Problems in DL at the Edge; Why Apache MXNet; From the Metal to the Models with MXNet; DL at the Edge with AWS
  34. What Are the Challenges at the Edge?
  35. The Metal: Heterogeneity. In the Cloud: x86_64, CUDA GPUs.
  36. The Metal: Heterogeneity. At the Edge: x86_64, x86_32, ARM, AArch64, Android, iOS; OpenCL, CUDA, and Metal GPUs; NEON and Hexagon DSPs; custom accelerators and FPGAs.
  37. The Metal: Performance Gap. Low end: Raspberry Pi 3 (32-bit ARMv7, ARM NEON, 1 GB RAM). High end: NVIDIA Jetson (64-bit AArch64, 128 CUDA cores, 8 GB RAM).
  38. The Metal: The Problem
  39. How Can We Adapt Our Models?
  40. The Models: Where Is Our Cost? Convolutions are expensive.
  41. The Models: Where Is Our Cost? Models are generally over-parameterized.
  42. Cheaper Convolutions: Winograd. Convolution in the time domain = pointwise multiplication in the frequency domain. Applied under the hood in MXNet via integrations with NNPACK, CUDA kernels, etc.
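
MXNet users do not invoke Winograd directly, but the arithmetic is easy to see in miniature. The sketch below is our own NumPy illustration, not MXNet's internal kernel, using the F(2,3) transform matrices from Lavin & Gray: it computes two outputs of a 3-tap filter with 4 elementwise multiplications where the direct method needs 6.

    import numpy as np

    # Winograd F(2,3) transform matrices (Lavin & Gray, 2015)
    BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
    G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
    AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

    g = np.array([1.0, 2.0, 3.0])        # 3-tap filter
    d = np.array([1.0, 2.0, 3.0, 4.0])   # 4-element input tile

    # Transform filter and input, multiply pointwise (4 multiplications), transform back
    m = (G @ g) * (BT @ d)
    y_winograd = AT @ m

    # Direct sliding-window correlation for comparison (6 multiplications)
    y_direct = np.array([g @ d[0:3], g @ d[1:4]])
    assert np.allclose(y_winograd, y_direct)   # both give [14., 20.]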
  43. Cheaper Convolutions: Separable Convolutions. Convolve separately over each depth channel of the input, then merge channels with 1x1 convolutions. Good for devices that can't run large numbers of multiplications in parallel.
  44. Depthwise-Separable Convolutions in MXNet
      >>> import mxnet as mx
      >>> # Illustrative shapes: a 3x3 depthwise convolution over 32 channels
      >>> num_filter, num_group = 32, 32
      >>> kernel, stride, pad = (3, 3), (1, 1), (1, 1)
      >>> x = mx.sym.Variable('x')
      >>> w = mx.sym.Variable('w')
      >>> b = mx.sym.Variable('b')
      >>> # Grouped convolution built by hand: slice channels, convolve each, concat
      >>> xslice = mx.sym.SliceChannel(data=x, num_outputs=num_group, axis=1)
      >>> wslice = mx.sym.SliceChannel(data=w, num_outputs=num_group, axis=0)
      >>> bslice = mx.sym.SliceChannel(data=b, num_outputs=num_group, axis=0)
      >>> y_sep = mx.sym.Concat(*[mx.sym.Convolution(data=xslice[i], weight=wslice[i],
      ...                         bias=bslice[i], num_filter=num_filter//num_group,
      ...                         kernel=kernel, stride=stride, pad=pad)
      ...                         for i in range(num_group)])
      >>> # Equivalent one-liner using Convolution's built-in num_group argument
      >>> y = mx.sym.Convolution(data=x, weight=w, bias=b, num_filter=num_filter,
      ...                        num_group=num_group, kernel=kernel, stride=stride, pad=pad)
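
A quick back-of-envelope count shows where the savings come from; the 3x3 kernel and 128-channel shapes below are illustrative assumptions, not numbers from the talk.

    # Multiply-accumulates per output position (illustrative shapes)
    k, c_in, c_out = 3, 128, 128

    standard = k * k * c_in * c_out          # dense 3x3 convolution: 147,456
    separable = k * k * c_in + c_in * c_out  # depthwise 3x3 + pointwise 1x1: 17,536

    print(standard / separable)              # roughly 8.4x fewer operations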
  45. Fewer Parameters: Quantization. Map activations into lower bit-width buckets and multiply them with quantized weights. Good for devices with hardware that accelerates low-precision operations.
  46. Quantization in MXNet
      >>> import mxnet as mx
      >>> a = mx.nd.array([[0.1392, 0.5928], [0.6027, 0.8579]])
      >>> min0 = mx.nd.array([0.0])
      >>> max0 = mx.nd.array([1.0])
      >>> quantized_a, min1, max1 = mx.nd.contrib.quantize(a, min0, max0, out_type='uint8')
      >>> dequantized_a = mx.nd.contrib.dequantize(quantized_a, min1, max1, out_type='float32')
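
Under the hood, uint8 quantization is just an affine map from the float range [min, max] onto the integers 0..255. The following NumPy sketch is our own illustration of that mapping and its round trip, not MXNet's exact kernel.

    import numpy as np

    def quantize_uint8(x, lo, hi):
        # Map floats in [lo, hi] onto 256 evenly spaced uint8 buckets
        scale = (hi - lo) / 255.0
        return np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)

    def dequantize_uint8(q, lo, hi):
        scale = (hi - lo) / 255.0
        return q.astype(np.float32) * scale + lo

    x = np.array([[0.1392, 0.5928], [0.6027, 0.8579]], dtype=np.float32)
    q = quantize_uint8(x, 0.0, 1.0)
    x_hat = dequantize_uint8(q, 0.0, 1.0)
    print(np.abs(x - x_hat).max())  # round-trip error is at most half a bucket, ~0.002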
  47. Fewer Parameters: Weight Pruning. Prune unused weights during training. Good, at high sparsity, for devices with fast sparse multiplication.
  48. Weight Pruning in MXNet
      >>> # Assume we have defined a model and a training data set
      >>> model.fit(train,
      ...           eval_data=val,
      ...           eval_metric='acc',
      ...           num_epoch=10,
      ...           optimizer='sparsesgd',
      ...           optimizer_params={'learning_rate': 0.1,
      ...                             'wd': 0.004,
      ...                             'momentum': 0.9,
      ...                             'pruning_switch_epoch': 5,
      ...                             'weight_sparsity': 0.8,
      ...                             'bias_sparsity': 0.0})
  49. Weight Pruning in MXNet
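
The 'weight_sparsity': 0.8 setting above asks for magnitude pruning: from the switch epoch onward, the 80% of weights smallest in absolute value are forced to zero. Here is a standalone NumPy sketch of that single step (our illustration; the real optimizer re-applies the mask on every update).

    import numpy as np

    def magnitude_prune(w, sparsity):
        # Zero out the `sparsity` fraction of weights with the smallest |value|
        threshold = np.percentile(np.abs(w), sparsity * 100)
        mask = np.abs(w) >= threshold
        return w * mask, mask

    w = np.random.randn(256, 256)
    w_pruned, mask = magnitude_prune(w, sparsity=0.8)
    print(1.0 - mask.mean())   # ~0.8 of the entries are now zero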
  50. Fewer Parameters: Efficient Architectures. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters. Good for devices with too little RAM to hold all the weights of a larger model in memory at once.
  51. Efficient Architectures in MXNet: https://mxnet.incubator.apache.org/model_zoo/
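
Pulling a compact pretrained architecture from the model zoo is a one-liner with the Gluon API. A minimal sketch, assuming MXNet 0.12+ and a random tensor standing in for a preprocessed image:

    import mxnet as mx
    from mxnet.gluon.model_zoo import vision

    # SqueezeNet 1.1: AlexNet-level accuracy at a fraction of the parameter count
    net = vision.squeezenet1_1(pretrained=True)

    x = mx.nd.random.uniform(shape=(1, 3, 224, 224))  # stand-in for a real image
    probs = net(x).softmax()
    print(probs.argmax(axis=1))                       # predicted ImageNet class index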
  52. Fewer Parameters: Tensor Decompositions. CVPR paper: arxiv.org/abs/1706.00439; code: https://github.com/tensorly/tensorly
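
The simplest instance of the idea is a rank-r factorization of a fully connected layer: one m x n weight matrix is replaced by an m x r and an r x n pair. A truncated-SVD sketch in NumPy (our generic illustration; the paper above applies analogous decompositions to convolution kernels, which are higher-order tensors):

    import numpy as np

    m, n, r = 1024, 1024, 64
    # Trained weight matrices are often close to low-rank; we fake that here
    W = np.random.randn(m, r) @ np.random.randn(r, n) + 0.01 * np.random.randn(m, n)

    # Keep only the top-r singular values: W ~= A @ B
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]   # m x r
    B = Vt[:r, :]          # r x n

    print(m * n, m * r + r * n)                           # 1048576 -> 131072 parameters
    print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # small relative error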
  53. Table of Model Optimization Techniques

                                   Winograd   Separable   Quantization   Tensor         Sparsity       Weight
                                   Conv.      Conv.                      Contractions   Exploitation   Sharing
      CPU Acceleration             +          ++          =              ++             +              +
      GPU Acceleration             +          +           +              +              =              +
      Model Size                   =          =           -              -              -              -
      Model Accuracy               =          -           -              -              -              -
      Specialized HW Acceleration  +          +           ++             +              +              +

      (+ increase, ++ large increase, = no change, - decrease)
  54. Edge Model Optimization Benefits the Cloud: models with fewer parameters often generalize better; tricks from the edge can be applied in the cloud; pre-processing with edge models decreases compute load in the cloud.
  55. Overview: Motivating Problems in DL at the Edge; Why Apache MXNet; From the Metal to the Models with MXNet; DL at the Edge with AWS
  56. The Challenge for Artificial Intelligence: SCALE. Data: PBs of existing data, aggressive migration, new data created on AWS. Training: tons of GPUs, elastic capacity, pre-built images. Prediction: tons of GPUs and CPUs, serverless, at the edge, on IoT devices.
  57. AWS Tools for Deep Learning: P2 instances (up to 40K CUDA cores); Deep Learning AMI (pre-configured for deep learning); CloudFormation template (launch a deep learning cluster).
  58. AWS Deep Learning AMI: One-Click Deep Learning. Kepler, Volta & Skylake support; Apache MXNet (and other frameworks); Python 2/3; notebooks & examples.
  59. https://aws.amazon.com/amazon-ai/amis/
  60. AWS IoT and AWS Greengrass
  61. Manage and Monitor Models on the Fly with AWS: upload captured data, tag data, escalate to an AI service or to a custom model on P2, then deploy and manage the model at the edge.
  62. Local Learning Loop: poorly classified data is fine-tuned against accurate classifications to produce an updated model.
  63. Getting Started with MXNet at the Edge + AWS IoT: http://amzn.to/2h6kPvY
  64. Running AI in Production on AWS Today
  65. We're Hiring!
  66. Thank You! Aran Khanna (AWS GRT Intern) – arankhan@amazon.com
