2020/09/06
Ho Seong Lee (hoya012)
Cognex Deep Learning Lab
Research Engineer
PR-274 | Mixed Precision Training 1
Contents
• Introduction
• Related Work
• Implementation
• Results
• PyTorch 1.6 AMP New features & Experiment
• Conclusion
PR-274 | Mixed Precision Training 2
Introduction
Increasing the size of a neural network typically improves accuracy
• But also increases the memory and compute requirements for training the model.
• Introduce a methodology for training deep neural networks using half-precision floating point numbers,
without losing model accuracy or having to modify hyper-parameters.
• Introduce three techniques to prevent model accuracy loss.
• Using these techniques, demonstrate that a wide variety of network architectures and
applications can be trained to match the accuracy of FP32 training.
PR-274 | Mixed Precision Training 3
Main Contributions
Related Works
Network Compression
PR-274 | Mixed Precision Training 4
• Low-precision Training
• Train networks with low-precision weights.
• Quantization
• Quantize a pretrained model by reducing the number of bits.
• Pruning
• Remove connections according to an importance criterion.
• Dedicated architectures
• Design architectures to be memory-efficient, such as SqueezeNet, MobileNet, and ShuffleNet.
Related Works
Network Compression in PR-12 Study
PR-274 | Mixed Precision Training 5
• A total of 23 network compression papers have been covered! → 23/274 = almost 8%!
• But, as far as I know, this is the first time low-precision training has been covered.
Related Works
Related Works – Low Precision Training
• “BinaryConnect: Training deep neural networks with binary weights during propagations”, 2015 NIPS
• Propose training with binary weights; all other tensors and arithmetic remain in full precision.
• “Binarized neural networks”, 2016 NIPS
• Also binarize the activations, but gradients are stored and computed in single precision.
• “Quantized neural networks: Training neural networks with low precision weights and activations”, 2016 arXiv
• Quantize weights and activations to 2, 4, and 6 bits, but gradients remain real numbers.
• “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”, 2016 ECCV
• Binarize all tensors, including the gradients, but this leads to a non-trivial loss of accuracy.
PR-274 | Mixed Precision Training 6
Related Works
Main Contributions
• All tensors and arithmetic for forward and backward passes use reduced precision, FP16.
• No hyper-parameters (such as layer width) are adjusted.
• Models trained with these techniques do not incur accuracy loss when compared to FP32 baselines.
• Demonstrate that this technique works across a variety of applications.
PR-274 | Mixed Precision Training 7
Implementation
IEEE 754 Floating Point Representation
• A number is represented as $(-1)^S \cdot 1.M \cdot 2^{(E - \text{Bias})}$
PR-274 | Mixed Precision Training 8
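To make the formula concrete, here is a minimal NumPy sketch that splits an FP16 value into its S, E, and M fields and reconstructs it (valid for normal numbers only):

```python
import numpy as np

def decode_fp16(x):
    # View the 16 bits of an FP16 value and split them into the S / E / M fields.
    bits = np.array(x, dtype=np.float16).view(np.uint16).item()
    S = (bits >> 15) & 0x1     # 1 sign bit
    E = (bits >> 10) & 0x1F    # 5 exponent bits, Bias = 15
    M = bits & 0x3FF           # 10 mantissa bits (implicit leading 1 for normal numbers)
    return S, E, M, (-1) ** S * (1 + M / 2 ** 10) * 2.0 ** (E - 15)

print(decode_fp16(3.140625))   # (0, 16, 584, 3.140625)
```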
Implementation
PR-274 | Mixed Precision Training 9
Bonus) New Floating-Point format
| Format             | Sign  | Exponent | Mantissa |
|--------------------|-------|----------|----------|
| IEEE 754 FP32      | 1 bit | 8 bit    | 23 bit   |
| IEEE 754 FP16      | 1 bit | 5 bit    | 10 bit   |
| Google bfloat16    | 1 bit | 8 bit    | 7 bit    |
| NVIDIA TensorFloat | 1 bit | 8 bit    | 10 bit   |
| AMD FP24           | 1 bit | 7 bit    | 16 bit   |
Implementation
PR-274 | Mixed Precision Training 10
1. FP32 Master copy of weights
• In mixed precision training, weights, activations, and gradients are stored as FP16.
• In order to match the accuracy of FP32 networks, an FP32 master copy of the weights is maintained and
updated with the weight gradients during the optimizer step.
→ Storing tensors in FP16 halves the required storage and bandwidth.
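A conceptual sketch of this pattern (not the paper's code; `grad_fp16` stands in for a weight gradient produced by an FP16 backward pass):

```python
import torch

master_w = torch.randn(1024, dtype=torch.float32)   # FP32 master copy of the weights
lr = 1e-3

for step in range(100):
    w_fp16 = master_w.half()                        # FP16 copy used in the forward/backward pass
    grad_fp16 = torch.randn_like(w_fp16) * 1e-3     # stand-in for the FP16 weight gradient
    master_w -= lr * grad_fp16.float()              # the optimizer update itself runs in FP32
```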
Implementation
PR-274 | Mixed Precision Training 11
1. FP32 Master copy of weights → Why?
• The weight update (weight gradients multiplied by the learning rate) becomes too small to be represented
in FP16 (smaller than $2^{-24}$).
$W_{\text{new}} = W_{\text{old}} - \eta \cdot \frac{\partial E}{\partial W}$
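A quick NumPy check of this failure mode: near a weight of 1.0, a small update is rounded away entirely in FP16, and anything below $2^{-24}$ is not representable at all:

```python
import numpy as np

w = np.float16(1.0)
update = np.float16(2.0 ** -14)    # representable in FP16, but tiny relative to w
print(w - update == w)             # True: the update rounds away (FP16 spacing near 1.0 is ~2^-10)

print(np.float16(2.0 ** -25))      # 0.0: values below 2^-24 cannot be represented in FP16
print(np.float32(1.0) - np.float32(2.0 ** -14) == np.float32(1.0))   # False: FP32 keeps the update
```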
Implementation
PR-274 | Mixed Precision Training 12
1. FP32 Master copy of weights → Experiments
• Train the Mandarin speech model with and without an FP32 master copy.
• Updating FP16 weights directly results in an 80% relative accuracy loss, much worse than with the FP32 master copy.
Implementation
PR-274 | Mixed Precision Training 13
2. Loss Scaling
• Activation gradient values tend to be dominated by small magnitudes.
• Scaling them by a factor of 8 is sufficient to match the accuracy achieved with FP32 training.
• This means activation gradient values below $2^{-27}$ were irrelevant to the training.
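These thresholds can be checked directly in NumPy: a gradient of $2^{-27}$ flushes to zero in FP16, while scaling by 8 lands it exactly on FP16's smallest subnormal, $2^{-24}$:

```python
import numpy as np

g = 2.0 ** -27              # a small activation gradient value
print(np.float16(g))        # 0.0: underflows in FP16
print(np.float16(g * 8))    # ~6e-08 (= 2^-24): preserved after scaling the loss by 8
```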
Implementation
PR-274 | Mixed Precision Training 14
2. Loss Scaling
• One efficient way to shift the gradient values into FP16-representable range is to scale the loss value
computed in the forward pass, prior to starting back-propagation.
• This keeps the relevant gradient values from becoming zero.
• Weight gradients must be unscaled before the weight update to maintain the update magnitudes.
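A minimal sketch of this recipe with a manual, static scaling factor (`model`, `criterion`, `optimizer`, and `loader` are assumed to be a standard PyTorch setup):

```python
scale = 1024.0                                 # constant loss scaling factor

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    (loss * scale).backward()                  # scale the loss before back-propagation
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(scale)                 # unscale weight gradients before the update
    optimizer.step()
```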
Implementation
PR-274 | Mixed Precision Training 15
2. Loss Scaling – How to choose the loss scaling factor?
• A simple way is to pick a constant scaling factor empirically.
• Or, if gradient statistics are available, directly choose a factor so that its product with the maximum
absolute gradient value stays below 65,504 (the maximum value representable in FP16).
• There is no downside to choosing a large scaling factor as long as it does not cause overflow during
backpropagation.
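The statistics-based option can be sketched in one expression (hypothetical code; `model` is assumed to hold gradients from a recent backward pass):

```python
max_abs_grad = max(p.grad.abs().max().item()
                   for p in model.parameters() if p.grad is not None)
scale = 65504.0 / (2.0 * max_abs_grad)   # keep max|grad| * scale below FP16's max, with 2x headroom
```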
Implementation
PR-274 | Mixed Precision Training 16
2. Loss Scaling – Automatic Mixed Precision
• A more robust way is to choose the loss scaling factor dynamically (automatically).
• The basic idea is to start with a large scaling factor and then reconsider it in each training iteration.
• If an overflow occurs, skip the weight update and decrease the scaling factor.
• If no overflow occurs for a chosen number of iterations N, increase the scaling factor.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Use N=2000, Increase x2, Decrease x0.5
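A sketch of that loop with the constants above (this is the logic that PyTorch's GradScaler automates; `model`, `criterion`, `optimizer`, and `loader` are assumed):

```python
import torch

scale, N, good_steps = 2.0 ** 16, 2000, 0      # start large; grow after N clean iterations

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    (loss * scale).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if all(torch.isfinite(g).all() for g in grads):
        for g in grads:
            g.div_(scale)                      # unscale, then update as usual
        optimizer.step()
        good_steps += 1
        if good_steps == N:                    # no overflow for N iterations: increase x2
            scale *= 2.0
            good_steps = 0
    else:                                      # overflow: skip the update, decrease x0.5
        scale *= 0.5
        good_steps = 0
```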
Implementation
PR-274 | Mixed Precision Training 17
3. Arithmetic Precision
• Neural network arithmetic falls into three categories: vector dot-products, reductions, and point-wise
operations.
• To maintain model accuracy, the authors found that some networks require FP16 vector dot-products to
accumulate the partial products into an FP32 value, which is converted to FP16 before being written to
memory.
Reference: https://www.quora.com/How-does-Fused-Multiply-Add-FMA-work-and-what-is-its-importance-in-computing
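A NumPy illustration of why the accumulator precision matters: accumulating 4,096 FP16 partial products in FP16 drifts noticeably, while accumulating in FP32 and rounding once at the end does not:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4096).astype(np.float16)
y = rng.random(4096).astype(np.float16)

acc16 = np.float16(0.0)
for a, b in zip(x, y):
    acc16 = np.float16(acc16 + a * b)          # every partial sum rounds to FP16

acc32 = np.dot(x.astype(np.float32), y.astype(np.float32))
print(acc16, np.float16(acc32))                # FP16 accumulation drifts; FP32 stays accurate
```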
Implementation
PR-274 | Mixed Precision Training 18
3. Arithmetic Precision
• Large reductions (sums across elements of a vector) should be carried out in FP32.
• Such reductions mostly come up in batch-normalization and softmax layers.
• Both layer types in the authors' implementation still read and write FP16 tensors from memory while
performing the arithmetic in FP32. → This did not slow down the training process.
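The same effect for a large reduction, in two lines: summing 100,000 copies of 0.01 in FP16 stalls once each increment falls below half a spacing of the running sum:

```python
import numpy as np

x = np.full(100_000, 0.01, dtype=np.float16)
print(x.sum(dtype=np.float16))    # ~32.0: the FP16 running sum stalls far below the true value
print(x.sum(dtype=np.float32))    # ~1000.2: FP32 accumulation recovers the expected result
```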
Results
PR-274 | Mixed Precision Training 19
Comparison of the Baseline (FP32) with Mixed Precision
Results
PR-274 | Mixed Precision Training 20
Comparison of the Baseline (FP32) with Mixed Precision
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 21
Automatic Mixed Precision in PyTorch
• Last July, PyTorch released version 1.6, which officially supports Automatic Mixed Precision!
• We can use Automatic Mixed Precision very simply: just add 5 lines.
NVIDIA Apex AMP has been merged into PyTorch and is now deprecated!
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 22
Automatic Mixed Precision in PyTorch
• Just add 5 lines, and we can use Automatic Mixed Precision training in PyTorch!
Before: a standard FP32 training loop → After: the same loop with autocast and GradScaler (sketched below).
Reference: https://github.com/hoya012/automatic-mixed-precision-tutorials-pytorch
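For reference, a minimal sketch of those 5 lines using the PyTorch 1.6 `torch.cuda.amp` API (`model`, `criterion`, `optimizer`, and `loader` are assumed to be a standard training setup):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()                        # 1. create a gradient scaler once
for inputs, targets in loader:
    optimizer.zero_grad()
    with autocast():                         # 2. run the forward pass under autocast
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()            # 3. backward on the scaled loss
    scaler.step(optimizer)                   # 4. unscale gradients, skip the step on overflow
    scaler.update()                          # 5. adjust the scale factor dynamically
```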
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 23
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• To verify the effect of AMP, perform a simple classification experiment.
• Use the Kaggle Intel Image Classification dataset.
• It contains around 25k images of size 150x150 distributed across 6 categories.
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 24
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• Use an ImageNet-pretrained ResNet-18.
• Use a GTX 1080 Ti (without Tensor Cores) and an RTX 2080 Ti (with Tensor Cores).
• Fix the training settings (batch size = 256, epochs = 120, lr, augmentation, optimizer, etc.).
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 25
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• We can save almost 30~40% of GPU memory!
• With a Tensor Core-equipped GPU, we can also save computation time!
• NVIDIA Tensor Cores provide hardware acceleration for mixed precision training.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Conclusion
PR-274 | Mixed Precision Training 28
• Introduce a methodology for training deep neural networks using half-precision floating point.
• Introduce three techniques to prevent model accuracy loss.
• PyTorch officially supports Automatic Mixed Precision training.