David Tung
2/1/2019
DEEP GRADIENT COMPRESSION:
REDUCING THE COMMUNICATION
BANDWIDTH FOR DISTRIBUTED
TRAINING
Outline
• Introduction
• Distributed Training
• Related Work
• Deep Gradient Compression
• Experiment and Result
• Conclusion and Discussion
Introduction
• Minimize training time by reducing the
bandwidth for gradient exchange in distributed
training
• Preserve model accuracy for faster training
• Focus on reducing data communication for training on inexpensive commodity networks or on mobile devices
Introduction
(continued)
• To preserve accuracy during compression:
Momentum correction, Local gradient
clipping, Momentum factor masking and
Warm-up training
• Applied DGC to CNNs (CIFAR-10, ImageNet), RNN (Penn Treebank, NLP), and speech (LibriSpeech corpus)
• No need to modify neural network model
structure
Gradient compression of 300× to 600× without losing accuracy
Introduction – Deep
Gradient Compression
Paper
Distributed Deep Learning
Motivation
Deep Gradient
Compression (DGC)
All-Reduce
Example : Nvidia NCCL
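The figure slide above introduces all-reduce via NCCL. As a point of reference, here is a minimal PyTorch-style sketch of that dense gradient exchange, the step whose traffic DGC compresses; it assumes a process group has already been initialized with the NCCL backend, and the function name is mine, not from the paper:

```python
import torch.distributed as dist

def allreduce_dense_gradients(model, world_size):
    """Average dense gradients across all workers -- the costly exchange DGC compresses."""
    for param in model.parameters():
        if param.grad is not None:
            # NCCL ring all-reduce sums the gradient tensor across all nodes in place
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)  # turn the sum into an average
```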
Momentum Correction
Use Velocity instead
The Challenges
● AlexNet has 240 MB of weights and ResNet has about 100 MB
● For ResNet, every node has to exchange ~100 MB of gradients with every other node at each training iteration, which makes the network the bottleneck of the infrastructure
Challenges
Reaching the limits of distributed SGD for training RNNs on Common Crawl
Related Work
Related Distributed Training
Research
• Asynchronous SGD
• Gradient Quantization
• Gradient Dropping (Aji & Heafield, 2017)
• Training ImageNet in one hour (FB)
• Training ImageNet in 15 mins (PFN)
Related - Gradient Quantization
• Quantizing the gradients to low-precision
values can reduce the communication
bandwidth.
• Seide et al. (2014) proposed 1-bit SGD to reduce the gradient transfer size and achieved a 10× speedup in traditional speech applications.
Related - Gradient Dropping
• Sparsify the gradients by a single
threshold value.
• To keep the convergence speed, Gradient Dropping requires adding layer normalization
• Gradient Dropping saves 99% of gradient
exchange while incurring 0.3% loss on a
machine translation task.
Related - Training ImageNet in 1 hour
• Facebook Big Basin servers
• Large-minibatch SGD, batch size 8k
• Caffe2 trains ResNet-50
• 256 Tesla P100 GPUs
Related - Training ImageNet in 1 hour
• Used Facebook’s Big Basin GPU servers
• Each server has 8 Tesla P100 GPUs and 3.2TB
of SSDs.
• Servers have 50 Gbit/s Ethernet network cards
• ResNet-50 has approximately 25 million parameters, so the total size of the parameters is 25 × 10^6 × sizeof(float) = 100 MB
$$ Expensive hardware
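As a sanity check of the numbers above, a small back-of-the-envelope calculation; the 1 Gbit/s link speed is an assumption used only to illustrate why commodity networks struggle:

```python
# ResNet-50: ~25 million parameters stored as 32-bit floats
params = 25_000_000
payload_mb = params * 4 / 1e6          # ~100 MB of gradients per worker per iteration
link_mbps = 1000                       # assumed commodity 1 Gbit/s Ethernet
seconds_per_push = payload_mb * 8 / link_mbps
print(f"{payload_mb:.0f} MB of gradients, ~{seconds_per_push:.1f} s just to send them once at 1 Gbps")
```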
Training ImageNet in 1 hour
Related work - PFN training
ImageNet in 15 mins
PFN training ImageNet in 15
mins
PFN training ImageNet in 15
mins
Comparison
• DGC pushes the gradient compression ratio to up to
600× without expensive hardware
• DGC does not require extra layer normalization, and
thus does not need to change the model structure.
• Most importantly, Deep Gradient Compression
results in no loss of accuracy.
Deep Gradient Compression
Overview
Deep Gradient Compression
• Gradient Sparsification
• Local Gradient Accumulation
• Momentum Correction
• Local Gradient Clipping
• Momentum Factor Masking
• Warm-up Training
1. GRADIENT
SPARSIFICATION
• Reduce the communication bandwidth by
sending only the important gradients.
• Use the gradient magnitude as a simple heuristic for importance
• Only gradients larger than a threshold are transmitted (the top ~0.1%)
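A minimal sketch of magnitude-based top-k selection, assuming PyTorch tensors (the paper samples gradients to pick the threshold hierarchically; that refinement is omitted here):

```python
import torch

def sparsify(grad: torch.Tensor, ratio: float = 0.001):
    """Keep only the top `ratio` fraction of gradient entries by magnitude."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    threshold = flat.abs().topk(k).values.min()      # magnitude of the k-th largest entry
    mask = flat.abs() >= threshold
    indices = mask.nonzero(as_tuple=False).squeeze(1)
    return flat[indices], indices                    # sparse (value, index) pairs to transmit
```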
2. Local Gradient
Accumulation
• To avoid losing information, we
accumulate the rest of the gradients
locally.
• Eventually, these gradients become large
enough to be transmitted.
Accuracy Image classification: -1.6%
Accuracy speech recognition: -3.3%
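A sketch of the local accumulation that goes with the sparsification above: gradients below the threshold are not dropped but kept in a per-parameter residual buffer until they grow large enough to be sent (buffer and function names are illustrative, not the paper's code):

```python
import torch

residual = {}  # per-parameter accumulation buffers, persist across iterations

def accumulate_and_select(name: str, grad: torch.Tensor, ratio: float = 0.001):
    """Fold this iteration's gradient into the local residual; send only the large entries."""
    buf = residual.setdefault(name, torch.zeros(grad.numel(),
                                                device=grad.device, dtype=grad.dtype))
    buf += grad.flatten()
    k = max(1, int(buf.numel() * ratio))
    threshold = buf.abs().topk(k).values.min()
    mask = buf.abs() >= threshold
    indices = mask.nonzero(as_tuple=False).squeeze(1)
    values = buf[indices].clone()
    buf[indices] = 0.0   # transmitted entries are cleared; the rest keeps accumulating
    return values, indices
```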
3. Momentum Correction
● Momentum SGD uses a weighted combination of the previous update and the current gradient to smooth out noise
● The resulting vector is called the 'velocity'
● Locally accumulate the velocity rather than the raw gradient
Accuracy Image classification: -0.3%
Speech recognition: can’t converge
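A sketch of momentum correction under the sparse-update scheme, assuming plain momentum SGD on each worker: accumulate the momentum 'velocity' locally, and sparsify that accumulation, instead of accumulating raw gradients (variable names are mine):

```python
import torch

momentum = 0.9
velocity, accumulated = {}, {}   # u: momentum velocity, v: locally accumulated velocity

def momentum_correction(name: str, grad: torch.Tensor):
    """Update the velocity with the new gradient and fold it into the local accumulation."""
    g = grad.flatten()
    u = velocity.setdefault(name, torch.zeros_like(g))
    v = accumulated.setdefault(name, torch.zeros_like(g))
    u.mul_(momentum).add_(g)     # u_t = m * u_{t-1} + g_t
    v.add_(u)                    # v_t = v_{t-1} + u_t : this is what gets sparsified and sent
    return v
```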
4. Local Gradient Clipping
• Gradient clipping is widely adopted to avoid the
exploding gradient problem
• This step is conventionally executed after gradient
aggregation from all nodes.
• Perform the gradient clipping locally, before adding the current gradient to the previous accumulation
Accuracy Image classification: N/A
Speech recognition: -2.0%
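A sketch of clipping the gradient on each worker before it enters the local accumulation; scaling the global clipping threshold by N^(-1/2) reflects my reading of the paper (since N locally clipped gradients are later summed) and should be treated as an assumption:

```python
import torch

def clip_local(grad: torch.Tensor, clip_norm: float, num_workers: int) -> torch.Tensor:
    """Clip the per-worker gradient before adding it to the local accumulation buffer."""
    local_threshold = clip_norm * num_workers ** -0.5   # assumed N^{-1/2} scaling
    norm = grad.norm()
    if norm > local_threshold:
        grad = grad * (local_threshold / norm)
    return grad
```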
5. Momentum Factor Masking
There is a long-tail accumulation issue (~2k iterations)
Introduce momentum factor masking to alleviate this staleness
The mask stops the momentum for delayed gradients, preventing stale momentum from carrying the weights in the wrong direction.
Accuracy Image classification: -0.1%
Speech recognition: -0.5%
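A sketch of factor masking on top of the buffers from the momentum-correction snippet above: when an entry of the accumulated velocity is finally transmitted, the corresponding entry of the momentum velocity is zeroed so its (now stale) momentum stops influencing future updates:

```python
def mask_after_send(name: str, sent_indices):
    """Zero the transmitted entries of both buffers so stale momentum is not reapplied."""
    accumulated[name][sent_indices] = 0.0   # these entries were just communicated
    velocity[name][sent_indices] = 0.0      # stop their delayed momentum from carrying on
```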
6. Warm-up Training
Use a less aggressive learning rate to slow down the
changing speed of the neural network at the start of
training
Instead of linearly ramping up the learning rate during the
first several epochs, we exponentially increase the gradient
sparsity from a relatively small value to the final value, in
order to help the training adapt to the gradients of larger
sparsity.
Accuracy Image classification: +0.37%
Speech recognition: +0.4%
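A sketch of an exponential sparsity ramp for warm-up. The schedule 75% → 93.75% → 98.4375% → 99.6% → 99.9% over the first 4 epochs is quoted in the speaker notes; the helper below reproduces that kind of ramp:

```python
def warmup_sparsity(epoch: int, warmup_epochs: int = 4,
                    initial: float = 0.75, final: float = 0.999) -> float:
    """Exponentially increase gradient sparsity from `initial` to `final` during warm-up."""
    if epoch >= warmup_epochs:
        return final
    # shrink the kept fraction (1 - sparsity) by a constant factor each epoch
    shrink = ((1.0 - final) / (1.0 - initial)) ** (1.0 / warmup_epochs)
    return 1.0 - (1.0 - initial) * shrink ** epoch

# With the defaults this yields 0.75, 0.937, 0.984, 0.996, then 0.999 from epoch 4 onward.
```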
Deep Gradient Compression
Experiment and result
Conclusion
• Adding more nodes alone hits a scaling limit; optimizing communication comes next.
• Deep Gradient Compression compresses the gradient by
300-600× for a wide range of CNNs and RNNs.
• To achieve this compression without slowing down the
convergence, DGC employs momentum correction, local
gradient clipping, momentum factor masking and warm-up
training.
• Deep Gradient Compression reduces the required
communication bandwidth and improves the scalability of
distributed training with inexpensive, commodity networking
infrastructure.
Thank you and Q&A
Motivation
Algorithm
PFN training ImageNet in 15
mins
Distributed Deep
Learning Research
Momentum Correction
Use Velocity instead
PFN training ImageNet in 15
mins
PFN training ImageNet in 15
mins

Editor's Notes

  1. This paper addresses how large-scale distributed training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training and requires expensive high-bandwidth network infrastructure. Deep Gradient Compression (DGC) greatly reduces the communication bandwidth while preserving accuracy during compression, using inexpensive commodity hardware.
  2. Agenda: a general introduction to this paper, its background and authors; some intro to distributed deep learning training; related research work and papers on distributed training; today's paper, DGC; the experiments and detailed results (300-600×); wrap-up and discussion.
  3. Adding more nodes provides more computing power, but there is another factor, communication, which can limit distributed training scalability. 99.9% of the gradient exchange is redundant, especially for recurrent neural networks (RNNs), where the computation-to-communication ratio is low; the network bandwidth therefore becomes a significant bottleneck for scaling up distributed training. A lot of related work achieves fast training, such as 1 hour or even 15 minutes, but e.g. Uber's Horovod framework requires an expensive 40 Gbit/s network, as do other big companies like Google, Amazon and Facebook. The goal is to enable distributed training with a less expensive network, e.g. AWS 1 Gbit/s Ethernet, to democratize deep learning training on commodity hardware, and to enable training on mobile for privacy and better personalization.
  4. CIFAR-10 is an established computer-vision dataset used for object recognition. The ImageNet project is a large visual database designed for use in visual object recognition software research. ResNet-50 gradients shrink from 97 MB to 0.35 MB, and Deep Speech from 488 MB to 0.74 MB. The Penn Treebank (PTB) dataset is widely used in machine learning for NLP (natural language processing) research. LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech. DGC does not require extra layer normalization, and thus does not need to change the model structure. This matters especially for recurrent neural networks (RNNs), where the computation-to-communication ratio is low, so the network bandwidth becomes a significant bottleneck for scaling up distributed training.
  5. 2018 ICLR conference paper. Song Han: PhD from Stanford EECS, now assistant professor at MIT, where he also manages the HAN Lab; his Deep Compression paper won the 2016 ICLR best paper award. Bill Dally: professor at Stanford, chief scientist at Nvidia for 10 years. The other authors are from Tsinghua University in China.
  6. Next is an overview of deep learning training in a distributed environment.
  7. This is a general distributed system; it is similar for distributed databases, distributed computing, etc. Vertical vs. horizontal scaling: scale up or scale out.
  8. Data parallelism: different chunks of data go to different nodes, it is easier to implement, and the same model (CNN or RNN) runs on each node. Node 1 may get the batch of training images 1-32, node 2 the next 32 images, etc. All the nodes share the same model but are fed different chunks of data; they calculate local gradients on their own chunk and then exchange gradients with each other. It can be implemented in two ways: (a) parameter server (centralized), which receives the gradients from all nodes, sums them up, calculates the average, updates the local weights, and then broadcasts them to all the training nodes; (b) an all-reduce operation (decentralized). Model parallelism: different chunks of the model go to different nodes; it is harder to implement and fewer people adopt this approach.
  9. For single-node training there is no gradient exchange over the network. With all-reduce, every node receives every other node's calculated gradients and then calculates the average; one basic implementation still has a master training node (a tree structure), while more advanced ones use e.g. a butterfly structure. In the formulas (reconstructed after these notes), χ is the training dataset, w are the weights of a network, f(x, w) is the loss computed from samples x ∈ χ, η is the learning rate, N is the number of training nodes, and Bk,t for 1 ≤ k < N is a sequence of N minibatches sampled from χ at iteration t, each of size b. After T iterations we obtain Equation 2, which shows that local gradient accumulation can be considered as increasing the batch size from Nb to NbT (the second summation over τ), where T is the length of the sparse update interval between two iterations at which the gradient of w(i) is sent. Learning rate scaling (Goyal et al., 2017) is a commonly used technique to deal with large minibatches.
  10. The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. It provides all-gather, all-reduce, broadcast, and similar operations.
  11. AlexNet: 2012 ResNet 2015
  12. LARGE SCALE DISTRIBUTED NEURAL NETWORK TRAINING THROUGH ONLINE DISTILLATION (G. E. Hinton). In their first set of experiments, the goal was to approximately determine the maximum number of GPU workers that can be productively employed for SGD in the Common Crawl neural language model setup. The Common Crawl dataset is an open repository of web crawl data and the largest to-date dataset used for neural language modeling; it consists of petabytes of data collected since 2011, with crawls generally completed every month. Figure 1a plots the validation error as a function of global steps for the different numbers of workers tried, using the best learning rate for each number of workers. Increasing the number of workers (and thus the effective batch size) reduced the number of steps required to reach the best validation error until 128 workers, at which point there was no additional improvement. Even with idealized perfect infrastructure, 256 workers would at best result in the same end-to-end training time on this problem. However, because steps can take so much longer with 256 workers, going from 128 to 256 workers is highly counterproductive in practice. Figure 1b plots validation error against wall time for the same varying numbers of synchronous workers. There is a large degradation in step time, and thus learning progress, at 256 workers. Although it might be possible to improve the step time at 256 workers by using a more sophisticated scheme with backup workers (Chen et al., 2016), the operative limit to scalability on this task is the diminishing return from increasing the effective batch size, not the degradation in step times.
  13. Next related work
  14. Researchers have proposed many approaches to overcome the communication bottleneck in distributed training. We will quickly look at existing research on distributed deep learning and compare it with the paper presented today.
  15. BLEU (Bilingual Evaluation Understudy) is a score for comparing a candidate translation of text to one or more reference translations.
  16. FB - large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU, P100
  17. Closer look of FB big basin Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
  18. ImageNet top-1 validation error vs. minibatch size. large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU
  19. “Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes,” arXiv, 2017 basically it is a supercomputer
  20. Chainer: development is led by the Japanese venture company Preferred Networks in partnership with IBM, Intel, Microsoft, and Nvidia. The NVIDIA Collective Communications Library (NCCL2) targets multi-node, multi-GPU systems and provides functions such as all-gather, all-reduce, and broadcast. PyTorch has integrated NCCL2 to accelerate deep learning training on multi-GPU systems.
  21. large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU
  22. In some cases, it improves accuracy.
  23. next - today’s paper
  24. Gradient basics: training is an optimization problem. On a single node it is like finding the direction when climbing downhill. With multiple nodes, each node has its own images and finds its own direction; to merge them, the nodes need to communicate and exchange gradients over the network. This exchange can be bulky: AlexNet has 240 MB of weights and ResNet about 100 MB, so in every iteration every node has to exchange ~100 MB of gradients with every other node, which makes the network the bottleneck of the infrastructure. In synchronized training, each node needs to know every other node's computed gradients. DGC shrinks DeepSpeech gradients from 488 MB to 0.74 MB. Deep Gradient Compression enables large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitates distributed training on mobile.
  25. Some of the gradients are very small, not zero but small. So sort the gradients and only send out the top 0.1% largest gradients. However, simply doing this with a threshold doesn't even converge for CNNs or RNNs, so the small gradients still affect accuracy.
  26. If we don't send the small gradients, it hurts accuracy, so we locally accumulate the gradients over more iterations until they get large enough, then send them out. In this way, accuracy can be recovered. This is almost equivalent to increasing the batch size over N iterations, which is the mathematical way to interpret it: if we accumulate the gradients locally for 3 iterations, it is almost equivalent to increasing the batch size 3 times.
  27. Take the previous gradient into account: use a weighted average of the previous gradients and the current gradient, which gives a new vector called the velocity. We should do local accumulation of the velocity rather than local accumulation of the gradients.
  28. Gradient clipping applies to the RNN case only; the change is the order between clipping and summation. Gradients go through repeated matrix multiplications because of the chain rule, and as they propagate toward the earlier layers, if they have small values (<1) they shrink exponentially until they vanish and make it impossible for the model to learn (the vanishing gradient problem), while if they have large values (>1) they grow and eventually blow up and crash the model (the exploding gradient problem).
  29. Because of the long-tail accumulation (~2k iterations), it is necessary to cut or mask the gradients, i.e. to mask away the obsolete velocity.
  30. In the early stages of training, the network is changing rapidly, and the gradients are more diverse and aggressive. The only hyper-parameter introduced by Deep Gradient Compression is the warm-up training strategy. In all experiments related to DGC, we raise the sparsity during the warm-up period as follows: 75%, 93.75%, 98.4375%, 99.6%, 99.9%. The warm-up period for DGC is 4 epochs out of 164 for CIFAR-10 and 4 epochs out of 90 for the ImageNet dataset.
  31. Figure 6 shows the speedup of multi-node training compared with single-node training. Conventional training achieves much worse speedup with 1Gbps (Figure 6(a)) than 10Gbps Ethernet (Figure 6(b)). Nonetheless, Deep Gradient Compression enables the training with 1Gbps Ethernet to be competitive with conventional training with 10Gbps Ethernet
  32. We refer to this migration as the momentum correction. It is a tweak to the update equation and does not introduce any new hyper-parameter.
  33. Shorter training time. Equal-to-better model accuracy (no degradation). Programming? From LARGE SCALE DISTRIBUTED NEURAL NETWORK TRAINING THROUGH ONLINE DISTILLATION: as the number of machines increases, there are diminishing improvements to the time needed to train a high-quality model, to a point where adding workers does not further improve training time. For the synchronous algorithm, there are rapidly diminishing returns from increasing the effective batch size. For the asynchronous algorithm, gradient interference from inconsistent weights can cause updates to thrash and even, in some cases, result in worse final accuracy or completely stall learning progress. In our experience it can be very difficult to scale effectively much beyond a hundred GPU workers in realistic setups.
  34. The encode() function packs the 32-bit nonzero gradient values and 16-bit run lengths of zeros. In the formulas (see the reconstruction after these notes), χ is the training dataset, w are the weights of a network, f(x, w) is the loss computed from samples x ∈ χ, η is the learning rate, N is the number of training nodes, and Bk,t for 1 ≤ k < N is a sequence of N minibatches sampled from χ at iteration t, each of size b. After T iterations we obtain Equation 2, which shows that local gradient accumulation can be considered as increasing the batch size from Nb to NbT (the second summation over τ), where T is the length of the sparse update interval between two iterations at which the gradient of w(i) is sent. Learning rate scaling (Goyal et al., 2017) is a commonly used technique to deal with large minibatches.
  35. PFN’s strategies to improve all-reduce network bottleneck
  36. Downpour SGD is an asynchronous variant of SGD used in DistBelief (the predecessor to TensorFlow) at Google. It runs multiple replicas of a model in parallel on subsets of the training data. These models send their updates to a parameter server, which is split across many machines; each machine is responsible for storing and updating a fraction of the model's parameters. However, since replicas don't communicate with each other, e.g. by sharing weights or updates, their parameters are continuously at risk of diverging, hindering convergence. ImageNet in one hour: FB, large-minibatch SGD, Caffe2 trains ResNet-50 with minibatch 8192 on 256 GPUs. ImageNet in 15 minutes: Preferred Networks (a Japanese IoT company), Chainer, 1024 P100 GPUs, batch size 32k. Codistillation (Google): the idea of distillation is to first train a teacher model, which traditionally is an ensemble or another high-capacity model, and then, once this teacher model is trained, train a student model with an additional term in the loss function that encourages its predictions to be similar to the predictions of the teacher model.
  37. large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU
  38. large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU
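The update rules referred to in notes 9 and 34, reconstructed in LaTeX from the quantities those notes define (this is my transcription of the paper's Equations 1 and 2, so treat the exact typography as an approximation):

```latex
% Eq. 1: vanilla synchronous distributed SGD with N workers and minibatches B_{k,t} of size b
w_{t+1} = w_t - \eta \frac{1}{N b} \sum_{k=1}^{N} \sum_{x \in B_{k,t}} \nabla f(x, w_t)

% Eq. 2: after T iterations of local accumulation for an individual weight w^{(i)}
w_{t+T}^{(i)} = w_t^{(i)} - \eta T \cdot \frac{1}{N b T}
    \sum_{k=1}^{N} \sum_{\tau=0}^{T-1} \sum_{x \in B_{k, t+\tau}} \nabla^{(i)} f\left(x, w_{t+\tau}\right)
```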