Scaling up deep learning by scaling down
Nick Pentreath, Principal Engineer, @MLnick
About
IBM Developer / © 2020 IBM Corporation
–@MLnick on Twitter, Github, LinkedIn
–Principal Engineer, IBM CODAIT (Center
for Open-Source Data & AI Technologies)
–Machine Learning & AI
–Apache Spark committer & PMC
–Author of Machine Learning with Spark
–Various conferences & meetups
2
Improving the Enterprise AI Lifecycle in Open Source
IBM Developer / © 2020 IBM Corporation 3
–CODAIT aims to make AI solutions
dramatically easier to create,
deploy, and manage in the
enterprise.
–We contribute to and advocate for
the open-source technologies that
are foundational to IBM’s AI
offerings.
–30+ open-source developers!
Center for Open Source Data & AI Technologies
codait.org
CODAIT
Open Source @ IBM
Agenda
4
–Deep Learning overview & computational
considerations
–Evolving efficiency of model architectures
–Model compression
–Model distillation
–Conclusion
DEG / June 4, 2020 / © 2020 IBM Corporation
Machine Learning
Workflow
5
Data → Analyze → Process → Train → Deploy → Predict & Maintain
(the Train stage is compute-heavy)
DEG / June 4, 2020 / © 2020 IBM Corporation
Deep Learning
–Original theory from 1940s; computer
models originated around 1960s; fell out of
favor in 1980s/90s
–Recent resurgence due to
• Bigger (and better) data; standard datasets
(e.g. ImageNet)
• Better hardware (GPUs, TPUs)
• Improvements to algorithms, architectures and
optimization
–Leading to new state-of-the-art results in
computer vision (images and video);
speech/text; language translation and more
IBM Developer / © 2020 IBM Corporation 6
Source: Wikipedia
Modern Neural Networks
–Deep (multi-layer) networks
–Computer vision
• Convolutional neural networks (CNNs)
• Image classification, object detection,
segmentation
–Sequences and time-series
• Machine translation, text generation
• Recurrent neural networks - LSTM, GRU
–Natural language processing
• Word embeddings
• Transformers, attention mechanisms
–Deep learning frameworks
• Flexibility, computation graphs, auto-differentiation,
GPUs
IBM Developer / © 2020 IBM Corporation 7
Source: Stanford CS231n
Evolution of Training
Computation Requirements
IBM Developer / © 2020 IBM Corporation 8
Source
The computational resources required for
training AI models double every 3 to 4 months
Example: Image
Classification
IBM Developer / © 2020 IBM Corporation 9
Source
Input image Inference Prediction
beagle: 0.82
basset: 0.09
bluetick: 0.07
...
Example: Inception V3
Source
IBM Developer / © 2020 IBM Corporation 10
Effectively
matrix
multiplication
~24 million
parameters
78.8%
accuracy
(ImageNet)
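As a quick sanity check on that number, loading the stock Keras implementation reproduces roughly the same parameter count (a minimal sketch, assuming TensorFlow 2.x is installed; no pretrained weights are downloaded):

```python
import tensorflow as tf

# Inception V3 as shipped with Keras; weights=None skips the weight download.
model = tf.keras.applications.InceptionV3(weights=None)
print(f"{model.count_params():,} parameters")  # roughly 24 million
```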
Accuracy vs Computational
Complexity (ImageNet)
IBM Developer / © 2020 IBM Corporation 11
Source: Paper, blog
Computational efficiency
(ImageNet)
IBM Developer / © 2020 IBM Corporation 12
Source: Paper, blog
Deep Learning
Deployment
IBM Developer / © 2020 IBM Corporation 13
–Model training typically
uses substantial
hardware
–GPU / multi-GPU
–Cloud-based
deployment scenarios
Deep Learning
Deployment
IBM Developer / © 2020 IBM Corporation 14
–Edge devices have more limited resources
• Memory
• Compute (CPU, mobile GPU, edge TPU)
• Network bandwidth
–Also applies to low-latency applications
IBM Developer / © 2020 IBM Corporation 15
How do we improve
performance
efficiency?
–Architecture
improvements
–Model pruning
–Quantization
–Model distillation
Architecture
Improvements
IBM Developer / © 2020 IBM Corporation 16
Specialized architectures for
low-resource targets
Source
IBM Developer / © 2020 IBM Corporation 17
Inception V3: standard convolution building block
~24 million parameters, 78.8% accuracy (ImageNet)

MobileNet V1: depthwise convolution building block (~8x less computation)
~4 million parameters, 70.9% accuracy (ImageNet)
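To make the "~8x less computation" claim concrete, here is a minimal sketch (assuming TensorFlow 2.x / Keras) comparing the weight counts of a standard 3x3 convolution with a depthwise separable one; the exact saving depends on the channel counts and kernel size:

```python
import tensorflow as tf

# Standard 3x3 convolution vs. depthwise separable convolution
# (a 3x3 depthwise conv followed by a 1x1 pointwise conv), as in MobileNet V1.
standard = tf.keras.layers.Conv2D(256, 3, padding="same")
separable = tf.keras.layers.SeparableConv2D(256, 3, padding="same")

x = tf.random.normal([1, 32, 32, 128])  # toy input with 128 channels
standard(x), separable(x)               # call once so the layers get built

print(standard.count_params())   # 3*3*128*256 + 256  -> ~295k weights
print(separable.count_params())  # 3*3*128 + 128*256 + 256 -> ~34k weights (~8-9x fewer)
```

The multiply-add count scales down by roughly the same factor, which is where MobileNet V1's efficiency comes from.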
Trade off accuracy vs model
size
Source
IBM Developer / © 2020 IBM Corporation 18
–Scale the width and resolution multipliers to target the available computation budget
–Width multiplier = “thinner” models (fewer channels per layer; see the Keras sketch below)
–Resolution multiplier scales down the input image representation
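Both multipliers are exposed in the stock Keras MobileNet V1. A small sketch, assuming TensorFlow 2.x; `alpha` is the width multiplier and the reduced `input_shape` plays the role of the resolution multiplier:

```python
import tensorflow as tf

# Full-size MobileNet V1 vs. a "thinner" (alpha=0.5), lower-resolution variant.
full = tf.keras.applications.MobileNet(alpha=1.0, input_shape=(224, 224, 3), weights=None)
small = tf.keras.applications.MobileNet(alpha=0.5, input_shape=(160, 160, 3), weights=None)

# The thin variant has a fraction of the parameters (and far fewer multiply-adds).
print(full.count_params(), small.count_params())
```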
MobileNet V2
Source
IBM Developer / © 2020 IBM Corporation 19
–Same depthwise convolution backbone
–Adds linear bottlenecks & shortcut connections (block sketch below)
~3.4 million
parameters
72%
accuracy
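A minimal sketch of the V2 building block in Keras (assuming TensorFlow 2.x; batch normalization and other details from the paper are omitted). The final 1x1 projection has no activation (the "linear bottleneck") and a shortcut is added when input and output shapes match:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, out_channels, expansion=6, stride=1):
    # Expand with a 1x1 conv, filter with a 3x3 depthwise conv,
    # then project down with a linear 1x1 bottleneck (no activation).
    in_channels = x.shape[-1]
    h = layers.Conv2D(expansion * in_channels, 1, activation=tf.nn.relu6)(x)
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               activation=tf.nn.relu6)(h)
    h = layers.Conv2D(out_channels, 1)(h)      # linear bottleneck
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])               # shortcut connection
    return h

inputs = tf.keras.Input((56, 56, 32))
model = tf.keras.Model(inputs, inverted_residual(inputs, out_channels=32))
model.summary()
```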
Accuracy vs Computation - Updated
(ImageNet)
IBM Developer / © 2020 IBM Corporation 20
Source: Paper, blog
Computational efficiency - Updated
(ImageNet)
IBM Developer / © 2020 IBM Corporation 21
Source: Paper, blog
EfficientNet
Source: blog post, paper
IBM Developer / © 2020 IBM Corporation 22
–Neural Architecture
Search to find backbone
–Optimize for accuracy &
efficiency (FLOPS)
~5.3 million
parameters
77.3%
accuracy
~60 million
parameters
84.5%
accuracy
MobileNet V3
Source: GitHub, paper
IBM Developer / © 2020 IBM Corporation 23
–Hardware-aware Neural
Architecture Search
~5.4 million
parameters
75.2%
accuracy
One network to rule them all?
Source: GitHub, paper
IBM Developer / © 2020 IBM Corporation 24
–Once for All: Train One
Network and Specialize
it for Efficient
Deployment
–Manual design or NAS
is hugely costly in terms
of computation
–Train one network,
“cherry-pick” the sub-net
without additional
training
Model Pruning
IBM Developer / © 2020 IBM Corporation 25
–Reduce the # of model parameters
–Effectively like L1 regularization – remove weights that have little impact on predictions
–Sparse weights -> model compression & lower latency (see the sketch below)
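As a concrete illustration, a hedged sketch of iterative magnitude pruning with the TensorFlow Model Optimization toolkit (the 80% sparsity target and step counts are illustrative placeholders):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base = tf.keras.applications.MobileNet(weights=None)  # any Keras model works

# Ramp sparsity from 0% to 80% during fine-tuning; at each pruning step the
# lowest-magnitude weights are zeroed out.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=10000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(base, pruning_schedule=schedule)

pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# pruned.fit(..., callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
# Afterwards, tfmot.sparsity.keras.strip_pruning(pruned) removes the pruning
# wrappers so the sparse weights can actually be compressed for deployment.
```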
Model Pruning
IBM Developer / © 2020 IBM Corporation 26
Source
[Chart: ImageNet classification – Top-1 accuracy (%) vs. model sparsity (0–100%) for Inception V3 and MobileNet V1 224]
Model Pruning
IBM Developer / © 2020 IBM Corporation 27
Source
[Chart: Language translation – BLEU score vs. model sparsity (0–100%) for English–German and German–English]
Quantization
IBM Developer / © 2020 IBM Corporation 28
Quantization
IBM Developer / © 2020 IBM Corporation 29
Source
–Most DL computation uses 32-bit (or even 64-bit) floating point
–Quantization reduces the numerical precision of weights by binning values
–Popular targets are 16-bit floating point and 8-bit integer encodings (toy sketch below)
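A toy NumPy sketch of the idea behind 8-bit affine quantization: map the observed floating-point range onto 256 integer levels via a scale and zero point (real toolkits choose ranges more carefully, e.g. per channel or from calibration data):

```python
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)   # stand-in for a weight tensor

scale = (w.max() - w.min()) / 255.0            # width of each integer "bin"
zero_point = int(np.round(-w.min() / scale))   # integer that represents 0.0

w_q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)  # 8-bit
w_dq = (w_q.astype(np.float32) - zero_point) * scale                      # dequantized

print("max abs error:", np.abs(w - w_dq).max())  # bounded by about scale / 2
```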
Quantization
IBM Developer / © 2020 IBM Corporation 30
Source
–Post-training quantization
• Useful if you can’t (or don’t wish to) retrain a model
• Gives up some accuracy
• Various options (TFLite sketch below)
– Float16
– Dynamic range
– Int8
–Training-aware quantization
• Much more complex
• Can provide large efficiency gains with minimal accuracy loss
ImageNet classification – Top-1 accuracy (%):
                   Original   Post-training   Training-aware
Inception V3         78           77.2            77.5
MobileNet V2 224     71.9         63.7            70.9
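A sketch of the post-training options via the TensorFlow Lite converter (the saved-model path and representative dataset are placeholders; pick the option that matches your accuracy/latency budget):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path

# Dynamic-range quantization: 8-bit weights, activations stay float.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Float16 instead: uncomment the next line.
# converter.target_spec.supported_types = [tf.float16]

# Full int8 needs a representative dataset to calibrate activation ranges:
# converter.representative_dataset = representative_data_gen

tflite_model = converter.convert()
open("model_quant.tflite", "wb").write(tflite_model)
```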
Quantization
IBM Developer / © 2020 IBM Corporation 31
Source
ImageNet classification – model size (% of original):
                   Original   Quantized
Inception V3         100%        25%
MobileNet V2 224     100%        26%

ImageNet classification – latency (% of original):
                   Post-training   Training-aware
Inception V3          75%              48%
MobileNet V2 224     110%              61%
Quantization
IBM Developer / © 2020 IBM Corporation 32
–TensorFlow Model Optimization toolkit
–PyTorch (see the example below)
–Distiller for PyTorch
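For the PyTorch route, post-training dynamic quantization is close to a one-liner; a minimal sketch with a toy model standing in for a trained network:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Linear weights are stored as int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)  # Linear layers replaced by their dynamically quantized versions
```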
Model Distillation
IBM Developer / © 2020 IBM Corporation 33
–Large models may be over-parameterized
–Use a large, complex model to teach a smaller, simpler model
–Effectively distil the core knowledge of the large model into the small one (loss sketch below)
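A minimal PyTorch sketch of the classic (Hinton et al.) distillation loss; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: the teacher's temperature-scaled output distribution.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    # Hard-label cross-entropy keeps the student grounded in the true labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```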
Model Distillation
IBM Developer / © 2020 IBM Corporation 34
Source: Distiller docs, paper
Model Distillation
IBM Developer / © 2020 IBM Corporation 35
–BERT model distillations have been very
successful
–DistilBERT
–TinyBERT
–Others (see this blog post)
Conclusion
–Model distillation is less widely used, but can be compelling, especially for NLP tasks
–This is an area of rapid research evolution
–New, efficient model architectures appear frequently
• If one fits your needs, use it!
–Compression techniques can yield large efficiency gains
• Now well supported in DL frameworks and supporting libraries
–Consider combining pruning & quantization (though this is trickier)
DEG / June 4, 2020 / © 2020 IBM Corporation 36
Thank you
codait.org
twitter.com/MLnick
github.com/MLnick
developer.ibm.com
DEG / June 4, 2020 / © 2020 IBM Corporation 37
Check out the Model Asset Exchange
https://ibm.biz/model-exchange
Sign up for IBM Cloud
https://ibm.biz/BdqdSi
References
–Efficient Inference in Deep Learning – Where is the Problem?
–Analysis of deep neural networks
–MobileNets
–EfficientNet
–Making Neural Nets Work With Low Precision
–Speeding up BERT
–Distilling the Knowledge in a Neural Network
–Once for All: Train One Network and Specialize it for Efficient Deployment
–Distiller – PyTorch
–TensorFlow Model Optimization
–Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
DEG / June 4, 2020 / © 2020 IBM Corporation 38
IBM Developer / © 2020 IBM Corporation 39

Editor's Notes

  • #23–#24 NAS required huge computational resources to find the best architecture & train it
  • #26 Types of pruning: one-shot pruning (post-training); iterative pruning – prune, then re-train, dropping the least important weights and repeating, stopping at a target sparsity or computational requirement
  • #28 For some applications pruning can result in no loss in accuracy (even a small gain)
  • #30 Main complexity is dealing with overflow – use an intermediate larger size (e.g. a 32-bit integer accumulator), then requantize to 8-bit
  • #32 In some instances (e.g. the smaller MobileNet V2) you trade off slightly higher latency for the ~75% reduction in model size