The Caffe2 Framework for Mobile and Embedded Deep Learning
Fei Sun
AI Platform, Facebook
Outline
• Caffe2 on mobile
• ONNX
• From research to production
• Vendor’s dilemma
• Caffe2 on embedded: benchmarking the performance
Caffe2 on Mobile
Caffe2 is...
• A lightweight open source framework for deep learning algorithms
• Primarily designed for production use cases
• Speed is a top priority
• C++ and Python interfaces
• Supports deployment on multiple platforms:
• Linux, Mac, iOS, Android, and Windows
• IoT devices, Raspberry Pi, Tegra X1, ...
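The Python interface can be exercised in a few lines. Below is a minimal inference sketch, assuming a model exported as the usual pair of serialized init_net/predict_net protobufs; the file names and the 1x3x224x224 input shape are placeholders, not specifics from the talk.

    # Minimal sketch: run inference through Caffe2's Python API.
    import numpy as np
    from caffe2.python import workspace

    # Placeholder file names for the serialized NetDef protobufs.
    with open("init_net.pb", "rb") as f:
        init_net = f.read()
    with open("predict_net.pb", "rb") as f:
        predict_net = f.read()

    # Predictor bundles the weight-initialization net and the inference net.
    p = workspace.Predictor(init_net, predict_net)

    # One 224x224 RGB image in NCHW layout (the real shape depends on the model).
    img = np.random.rand(1, 3, 224, 224).astype(np.float32)
    results = p.run([img])
    print(results[0].shape)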
Mobile Fragmentation
• Two major operating systems: Android and iOS
• 20+ chipset vendors, 25+ CPU microarchitectures, 15+ GPU architectures
• Three major graphics APIs: OpenGL, Vulkan, and Metal
• Two major compute APIs: RenderScript and OpenCL
One Framework, Multiple Backends
• NNPACK
• Metal™ / MPSCNN
• Qualcomm Snapdragon NPE
• ARM Compute Library
• CUDA/cuDNN
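As a rough illustration of backend selection on the server side, the same serialized net can be retargeted by rewriting its device options; this is a sketch under the assumption of a CUDA-enabled build. The mobile backends above (NNPACK, MPSCNN, NPE, ACL) are instead chosen when the mobile runtime is integrated and built.

    # Sketch: retarget a Caffe2 NetDef from CPU to CUDA/cuDNN by rewriting the
    # per-operator device options (assumes a CUDA-enabled Caffe2 build).
    from caffe2.proto import caffe2_pb2
    from caffe2.python import core

    predict_net = caffe2_pb2.NetDef()
    with open("predict_net.pb", "rb") as f:   # placeholder file name
        predict_net.ParseFromString(f.read())

    gpu = core.DeviceOption(caffe2_pb2.CUDA, 0)   # GPU device 0
    for op in predict_net.op:
        op.device_option.CopyFrom(gpu)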
CPU Acceleration with NNPACK
• Fast convolution algorithms
• NEON micro-kernels
• Multi-core computation
• big.LITTLE optimizations
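A hedged sketch of how the NNPACK backend can be requested from Python: Caffe2 operators carry an "engine" hint, and setting it to NNPACK on convolutions asks the runtime to use the NNPACK kernels when the build supports them. The file name is a placeholder.

    # Sketch: mark convolution operators to prefer the NNPACK engine.
    from caffe2.proto import caffe2_pb2

    predict_net = caffe2_pb2.NetDef()
    with open("predict_net.pb", "rb") as f:   # placeholder file name
        predict_net.ParseFromString(f.read())

    for op in predict_net.op:
        if op.type == "Conv":
            # engine is a hint; builds without NNPACK use the default implementation
            op.engine = "NNPACK"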
GPU Acceleration on iPhones
• Custom Metal™ kernels
• Leverage MPSCNN (Metal Performance Shaders)
• Performs best on iPhone 6s and later
GPU Acceleration on Android
• Leverage Qualcomm's Snapdragon NPE
• Supports new Qualcomm Adreno GPUs
• Runs on top of OpenCL
• Potential to use Hexagon DSPs
Caffe2 mobile integration with the Qualcomm® Snapdragon™ mobile platform
• CPU: 12 FPS
• GPU: 50 FPS
• Measured on a Galaxy S7 (Snapdragon 820, Android Marshmallow)
GPU Acceleration on Android
• Leveraging the ARM Compute Library
• Utilizes OpenGL ES 3.1
• For newer Mali GPUs, e.g., from Samsung LSI and MediaTek
• Person segmentation model: CPU 50 FPS; ACL 71 FPS with the CPU→GPU transfer, 133 FPS without
Caffe2 on Mobile
• Engage and collaborate with a few vendors:
• Support Caffe2
• Iterate on performance
• Problem: this approach is not scalable
ONNX
Support What?
• Framework backends: TensorFlow, MXNet, CNTK, ...
• Vendor and numeric libraries: Apple CoreML, Nvidia TensorRT, ARM Compute Library, Qualcomm SNPE, ...
• Supporting every framework on every vendor library requires O(n²) pairings
From Research to Production
• Research new models/operators in PyTorch
• Re-implement the models/operators in Caffe2 and retrain the models
• Deploy Caffe2 models to production
Open Neural Network Exchange (ONNX)
• Enable interoperability across frameworks and hardware vendors
• Start with base compatibility
• Build a community effort
• Across PyTorch and Caffe2 at FB:
• Close the gap in operators and programming models
• Move advanced research to production use cases
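To make the interoperability concrete, here is a hedged sketch of the PyTorch-side export path; the SqueezeNet model and file name are illustrative choices, not from the talk, and exact torch.onnx behavior varies by version.

    # Sketch: export a PyTorch model to ONNX and validate the exported graph.
    import torch
    import torchvision
    import onnx

    model = torchvision.models.squeezenet1_1(pretrained=True).eval()
    dummy_input = torch.randn(1, 3, 224, 224)        # example input for tracing
    torch.onnx.export(model, dummy_input, "squeezenet.onnx")

    onnx_model = onnx.load("squeezenet.onnx")
    onnx.checker.check_model(onnx_model)             # structural validation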
Support What All
• Framework backends: TensorFlow, MXNet, CNTK, ...
• Vendor and numeric libraries: Apple CoreML, Nvidia TensorRT, ARM Compute Library, Qualcomm SNPE, ...
• With ONNX as the common exchange format, each framework and each library needs only one connection: O(n) pairings
From Research to Production
[Diagram: two framework stacks, e.g., PyTorch and Caffe2, each split into a frontend, a representation, and a backend]
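Since ONNX targets the shared representation layer, an exported graph can be handed to a Caffe2 backend for production. A sketch follows, assuming the caffe2.python.onnx.backend module that shipped with Caffe2-era builds; the module path and file name are assumptions that vary by release.

    # Sketch: run an ONNX graph on Caffe2's ONNX backend.
    import numpy as np
    import onnx
    import caffe2.python.onnx.backend as caffe2_backend

    model = onnx.load("squeezenet.onnx")             # placeholder file name
    rep = caffe2_backend.prepare(model, device="CPU")

    img = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = rep.run([img])
    print(outputs[0].shape)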
Caffe2 on Embedded
Embedded Sea of Choices
• Mobile's two major operating systems become many
• Mobile's 20+ chipset vendors, 25+ CPU microarchitectures, and 15+ GPU architectures become many, alongside many DSPs
• Mobile's three major graphics APIs and two major compute APIs become many, often proprietary, ones
• Many design flows
Existing Challenges
• The approach of working with individual mobile vendors does not scale
• Which ML models matter?
• How to help embedded vendors improve ML model performance?
• How to help embedded vendors evaluate against the market?
AI Benchmarking
• Provide a model zoo of the important models
• Normalize the benchmarking metrics and conditions
• Automate the benchmarking process
• Measure performance honestly
• Focus on inference
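One way latency tables like the ones that follow could be produced is with Caffe2's built-in net benchmarking helper; this is a rough sketch (the blob name "data", the input shape, and the file names are placeholders), not the exact harness behind the numbers below.

    # Sketch: time a Caffe2 predict net on CPU with workspace.BenchmarkNet.
    import numpy as np
    from caffe2.proto import caffe2_pb2
    from caffe2.python import workspace

    init_net = caffe2_pb2.NetDef()
    predict_net = caffe2_pb2.NetDef()
    with open("init_net.pb", "rb") as f:       # placeholder file names
        init_net.ParseFromString(f.read())
    with open("predict_net.pb", "rb") as f:
        predict_net.ParseFromString(f.read())

    workspace.RunNetOnce(init_net)             # load the weights
    workspace.FeedBlob("data", np.random.rand(1, 3, 224, 224).astype(np.float32))
    workspace.CreateNet(predict_net)
    # 10 warm-up iterations, 50 timed iterations, no per-operator breakdown.
    ms = workspace.BenchmarkNet(predict_net.name, 10, 50, False)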
Benchmarking Starting Point

                 Nexus 6   Nexus 6P   Galaxy S7   Huawei Mate 10   Galaxy S8
ShuffleNet           108        148          84              125         112
SqueezeNet           149        279         143              161         156
ResNet50            1230       1970        1220             1510        1490
Style Transfer        52         80          56               53          39

CPU inference latency of selected Caffe2 models, in ms
Benchmarking - Add a New Model

                 Nexus 6   Nexus 6P   Galaxy S7   Huawei Mate 10   Galaxy S8
ShuffleNet           108        148          84              125         112
SqueezeNet           149        279         143              161         156
ResNet50            1230       1970        1220             1510        1490
Style Transfer        52         80          56               53          39
Inception V1         612        829         575              638         645

CPU inference latency of selected Caffe2 models, in ms
Benchmarking - Add a New Device

                 Nexus 6   Nexus 6P   Galaxy S7   Huawei Mate 10   Galaxy S8   Pixel XL
ShuffleNet           108        148          84              125         112         83
SqueezeNet           149        279         143              161         156        141
ResNet50            1230       1970        1220             1510        1490       1230
Style Transfer        52         80          56               53          39         57
Inception V1         612        829         575              638         645        597

CPU inference latency of selected Caffe2 models, in ms
Three Steps of Benchmarking
[Diagram: Model Zoo → Benchmarking (CPU and GPU, on phone and embedded devices) → Data Consumption]
Benchmarking Status
• Supported framework: Caffe2
• Supported model formats: Caffe2, ONNX
• Supported backends: CPU and GPU on Android and Linux-based systems
• Supported libraries: Eigen, MKL, NNPACK, OpenGL, CUDA
• Community help needed!
Resources
• Caffe2: https://github.com/caffe2/caffe2
• ONNX: https://github.com/onnx/onnx
• Benchmarking: https://github.com/caffe2/caffe2-benchmarking
• Model zoos: https://github.com/caffe2/models and https://github.com/onnx/models
Questions?

"The Caffe2 Framework for Mobile and Embedded Deep Learning," a Presentation from Facebook

  • 1.
    The Caffe2 Frameworkfor Mobile and Embedded Deep Learning Fei Sun AI Platform, Facebook 1
  • 2.
    • Caffe2 onmobile • ONNX • From research to production • Vendor’s dilemma • Caffe2 on embedded. Benchmarking the performance Outline 2
  • 3.
  • 4.
    • A lightweightopen source framework for deep learning algorithms • Primarily designed for production use cases • Speed is top priority • C++ / Python based interfaces • Supports deployment on multiple platforms • Linux, Mac, iOS, Android and Windows • IoT devices, Raspberry Pi, Tegra X1, ... Caffe2 is... 4
  • 5.
    Mobile Fragmentation 5 OpenGL Two major operatingsystems Android iOS 20+ chipset vendors 25+ CPU microarchitectures 15+ GPU architectures Three major graphics APIs Two major compute APIs RenderScript OpenCL Vulkan Metal
  • 6.
    One Framework, MultipleBackends ARM Compute Library NNPACK Metal™/ MPSCNN Qualcomm Snapdragon NPE CUDA/cuDNN
  • 7.
    CPU Acceleration withNNPACK 7 • Fast convolution algorithms • NEON micro-kernels • Multi-core computation • big.LITTLE optimizations
  • 8.
    • Custom Metal™Kernels • Leverage MPSCNN (Metal Performance Shaders) • Performs best on iPhone 6s and later GPU Acceleration on iPhones
  • 9.
    • Leverage Qualcomm'sSnapdragon NPE • Supports new Qualcomm Adreno GPUs • Runs on top of OpenCL • Potential to use Hexagon DSPs GPU Acceleration on Android
  • 10.
    Caffe2 mobile integration withQualcomm® Snapdragon™ mobile platform CPU 12 FPS GPU 50 FPS Galaxy S7 Snapdragon 820 Marshmallow
  • 11.
    • Leveraging ARMCompute Library • Utilizes OpenGL 3.1 • For newer Mali GPUs - ex: from Samsung LSI, MediaTek • Person segmentation model: • CPU: 50 FPS • ACL: 71 FPS with CPU->GPU, 133 FPS without GPU Acceleration on Android
  • 12.
    • Engage andcollaborate with a few vendors: • Support Caffe2 • Iterate on performance • Problem: • Not scalable Caffe2 on Mobile 12
  • 13.
  • 14.
    Support What? 14 Framework backends O (n^2)pairs Tensor Flow MXNET CNTK Vendor and numeric libraries Apple CoreML Nvidia TensorRT ARM Compute Library Qualcomm SNPE …
  • 15.
    From Research toProduction 15 • Research new models/operators in Pytorch • Re-implement the models/operators in Caffe2 Retrain the models • Deploy Caffe2 models to production
  • 16.
    • Enable interoperability •Across frameworks and hardware vendors • Starting base compatibility • Creating community effort • Across PyTorch and Caffe2 at FB • Operators and programming modes gap • Advanced research to production uses cases Open Neural Network Exchange (ONNX) 16
  • 17.
    Support What All 17 Framework backends O(n) pairs Tensor Flow MXNET CNTK Vendor and numeric libraries Apple CoreML Nvidia TensorRT ARM Compute Library Qualcomm SNPE …
  • 18.
    From Research toProduction 18 • Frontend • Representation • Backend • Frontend • Representation • Backend
  • 19.
  • 20.
    Embedded Sea ofChoices 20 Two major operating systems 20+ chipset vendors 25+ CPU microarchitectures 15+ GPU architectures Three major graphics APIs Two major compute APIs Many Many DSP Many proprietary Many Many proprietary Many design flows
  • 21.
    • The approachworking with mobile vendors does not scale • What ML models matter? • How to help embedded vendors to enhance ML model performance? • How to assist embedded vendors to evaluate against market? Existing Challenges 21
  • 22.
    • Provide amodel zoo on important models • Normalize the benchmarking metrics and conditions • Automate the benchmarking process • Honest measurement on performance • Focus on inference AI Benchmarking 22
  • 23.
    Benchmarking Starting Point 23 Nexus6 Nexus 6P Galaxy S7 Huawei Mate 10 Galaxy S8 ShuffleNet 108 148 84 125 112 SqueezeNet 149 279 143 161 156 ResNet50 1230 1970 1220 1510 1490 Style Transfer 52 80 56 53 39 CPU inference delay on select Caffe2 models in ms
  • 24.
    Benchmarking - Adda New Model 24 Nexus 6 Nexus 6P Galaxy S7 Huawei Mate 10 Galaxy S8 ShuffleNet 108 148 84 125 112 SqueezeNet 149 279 143 161 156 ResNet50 1230 1970 1220 1510 1490 Style Transfer 52 80 56 53 39 Inception V1 612 829 575 638 645 CPU inference delay on select Caffe2 models in ms
  • 25.
    Benchmarking - Adda New Device 25 Nexus 6 Nexus 6P Galaxy S7 Huawei Mate 10 Galaxy S8 Pixel XL ShuffleNet 108 148 84 125 112 83 SqueezeNet 149 279 143 161 156 141 ResNet50 1230 1970 1220 1510 1490 1230 Style Transfer 52 80 56 53 39 57 Inception V1 612 829 575 638 645 597 CPU inference delay on select Caffe2 models in ms
  • 26.
    Three Steps ofBenchmarking 26 Model Zoo Data Consumption GPU CPU Phone Embedded Benchmarking
  • 27.
    • Supported framework •Caffe2 • Supported model format • Caffe2 • ONNX • Supported backend • CPU, GPU, Android, linux based systems. • Eigen, MKL, NNPACK, OpenGL, Cuda • Community help needed! Benchmarking Status 27
  • 28.
    • Caffe2 • https://github.com/caffe2/caffe2 •ONNX • https://github.com/onnx/onnx • Benchmarking • https://github.com/caffe2/caffe2-benchmarking • Model zoo • https://github.com/caffe2/models • https://github.com/onnx/models Resources 28
  • 29.