Asynchronous, Data-Parallel
Deep Convolutional Neural Network Training
with Linear Prediction Model
for Parameter Transition
Ikuro Sato1), Ryo Fujisaki1),
Yosuke Oyama2), Akihiro Nomura2), and Satoshi Matsuoka2)
Deep Learning 3 (Nov. 16, 2017)
ICONIP 2017
1) Denso IT Laboratory,
2) Tokyo Institute of Technology, Japan
Ikuro Sato, Denso IT Laboratory, Inc. 1/25
1. Introduction
2. Method
3. Experiment
Common practices in state-of-the-art CNNs
Recent trend
Computationally intensive models tend to perform well.

Model       #multiplications per parameter   top-5 error rate @LSVRC   Reference
AlexNet     11                               16.4%                     [Krizhevsky+, NIPS2012]
VGG-19      137                              7.32%                     [Simonyan+, ICLR2015]
GoogLeNet   221                              6.67%                     [Szegedy+, CVPR2015]
ResNet      179                              3.57%                     [He+, CVPR2016]
Data-parallel, mini-batch SGD to boost training
What is it?
  Model optimization with many processors (GPUs) used in parallel
How fast is it to train computationally intensive CNNs?
  GoogLeNet training on ImageNet boosted by 16x with 32 GPUs [Iandola+, CVPR2016]
  ResNet training on ImageNet within 1h with 256 GPUs [Goyal+, 2017]
  ResNet training on ImageNet within 15 min with 1024 GPUs [Akiba+, 2017]
Two approaches: SSGD and ASGD
SSGD: Synchronous Stochastic Gradient Descent
  Allows parameter update after completing all gradient comp.
  Basic update rule:
  \[ W^{(t+1)} = W^{(t)} - \lambda \sum_{\text{all GPUs}} \left. \frac{\partial J}{\partial W} \right|_{W^{(t)}} \]
ASGD: Asynchronous Stochastic Gradient Descent
  Allows parameter update without completing all gradient comp.
  Basic update rule:
  \[ W^{(t+1)} = W^{(t)} - \lambda \sum_{\text{some GPUs}} \left. \frac{\partial J}{\partial W} \right|_{W^{(\tau)}} \]
  Gradients evaluated at old parameters: staleness \(= t - \tau > 0\)
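To make the effect of staleness concrete, here is a minimal single-worker sketch. It is not the authors' implementation: the toy cost J(w) = 0.5 * w**2, the learning rate, and the names `grad` and `run_sgd` are all assumptions chosen for illustration. With staleness 0 the loop follows the synchronous rule; with staleness 3 each update uses the gradient evaluated at parameters from three updates earlier.

```python
def grad(w):
    # Gradient of the toy cost J(w) = 0.5 * w**2 (illustrative assumption).
    return w


def run_sgd(steps, lr, staleness, w0=1.0):
    """Gradient descent in which each update uses the gradient evaluated at
    the parameters from `staleness` updates earlier (0 means no staleness)."""
    history = [w0]
    for t in range(steps):
        tau = max(0, t - staleness)  # index of the old parameters W^(tau)
        history.append(history[t] - lr * grad(history[tau]))
    return history


sync_path = run_sgd(steps=20, lr=0.5, staleness=0)
stale_path = run_sgd(steps=20, lr=0.5, staleness=3)
print(abs(sync_path[-1]), abs(stale_path[-1]))
```

For this particular learning rate the run without staleness shrinks toward the minimum at 0, while the stale run oscillates around it with growing amplitude. That is one simple picture of why old gradients can lower the cost drop per update.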
Which is faster, SSGD or ASGD?
[Figure: SSGD and ASGD placed by update frequency (low to high) and cost-drop per update (low to high), with steepest descent as reference.]
“Sync is faster” group: [Chen+, ICLR 2016 workshop] [Jin+, NIPS2016 workshop]
“Async is faster” group: [Zheng+, arxiv1609.08326] [Gupta+, ICDM2016] [Zhang+, IJCAI2016]
No conclusion yet.
Our contributions
Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD).
Mitigates the adverse effect of staleness.
Outperforms ASGD and conditionally outperforms SSGD in speed.
[Figure: SSGD, ASGD, and PP-ASGD placed by update frequency (low to high) and cost-drop per update (low to high), with steepest descent as reference. Annotations: better gradient “quality”; much higher update frequency.]
1. Introduction
2. Method
   SSGD
   ASGD
   PP-ASGD (proposed)
3. Experiment
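SSGD (with collective communication)
[Diagram labels: Load; Comp. grad.; Send grad. & update (Grad); synchronous]
Update rule (SSGD with momentum)
\[ M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{all nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^{(t)}} \]
\[ W^{(t+1)} = W^{(t)} + M^{(t)} \]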
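The synchronous momentum rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: parameters are plain Python lists, `node_grads` stands in for the gradients that collective communication would gather from every node, and the function name `ssgd_momentum_step` is hypothetical.

```python
def ssgd_momentum_step(w, m_prev, node_grads, lr, mu):
    """One synchronous update:
    M(t) = mu * M(t-1) - lr * (sum of the gradients from all nodes)
    W(t+1) = W(t) + M(t)
    """
    # Sum the per-node gradients element by element (all nodes contribute).
    total = [sum(g[i] for g in node_grads) for i in range(len(w))]
    m = [mu * mp - lr * gi for mp, gi in zip(m_prev, total)]
    w_next = [wi + mi for wi, mi in zip(w, m)]
    return w_next, m


# Two nodes, two parameters; all values are made up for the example.
w_next, m = ssgd_momentum_step(
    w=[1.0, -2.0],
    m_prev=[0.0, 0.0],
    node_grads=[[0.5, 1.0], [0.5, -1.0]],
    lr=0.1,
    mu=0.9,
)
print(w_next, m)
```

In the example the second components of the two node gradients cancel, so only the first parameter moves.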
ASGD (with collective communication)
[Diagram labels: Load; Comp. grad.; Flag / Unflag; Flagged? (yes / no); Send grad. & update (Grad); Send zero & update (Zero); asynchronous / synchronous]
Update rule (ASGD with momentum)
\[ M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^{(\tau)}} \]
\[ W^{(t+1)} = W^{(t)} + M^{(t)} \]
Gradients evaluated at stale parameters
[Oyama+, IEEE BigData 2016]
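One possible reading of the flag mechanism in the diagram is sketched below, and it is an assumption rather than a description of the authors' implementation: a node whose gradient is ready is treated as flagged and sends that gradient, while every other node sends zeros, so the collective sum only contains gradients from "some nodes". The function name `asgd_momentum_step` and the list-based parameters are hypothetical.

```python
def asgd_momentum_step(w, m_prev, node_grads, flagged, lr, mu):
    """One update in which only flagged nodes contribute a gradient.

    node_grads[i] is the (possibly stale) gradient held by node i;
    flagged[i] says whether node i sends it (True) or sends zeros (False).
    """
    zeros = [0.0] * len(w)
    sent = [g if ready else zeros for g, ready in zip(node_grads, flagged)]
    total = [sum(s[i] for s in sent) for i in range(len(w))]
    m = [mu * mp - lr * gi for mp, gi in zip(m_prev, total)]
    w_next = [wi + mi for wi, mi in zip(w, m)]
    return w_next, m


# Node 0 has finished its gradient; node 1 is still computing.
w_next, m = asgd_momentum_step(
    w=[1.0], m_prev=[-0.2], node_grads=[[2.0], [4.0]],
    flagged=[True, False], lr=0.1, mu=0.5,
)
print(w_next, m)
```

Under this reading an update still happens when no node is flagged: the momentum term alone moves the parameters.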
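PP-ASGD (proposed)
[Diagram labels: Load; Predict param.; Comp. grad.; Flag / Unflag; Flagged? (yes / no); Send grad. & update (Grad); Send zero & update (Zero); asynchronous / synchronous]
Update rule (PP-ASGD)
\[ M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^{(\tau)} + M^{(\tau-1)} \sum_{s'=1}^{s+1} \mu^{s'}} \]
\[ W^{(t+1)} = W^{(t)} + M^{(t)} \]
Gradients evaluated at predicted parameters (\(s\) = measured staleness)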
If staleness is zero (s = 0), PP-ASGD becomes Nesterov’s Accelerated Gradient method (NAG).
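The reduction to NAG can be checked numerically. The sketch below is a minimal illustration, with the hypothetical helper names `staleness_coefficient`, `predict_params`, and `nag_lookahead`, and made-up parameter values. It computes the point at which PP-ASGD evaluates the gradient and compares it, for s = 0, with the usual NAG look-ahead point W + mu * M.

```python
def staleness_coefficient(mu, s):
    """Sum of mu**k for k = 1 .. s+1, the staleness-aware coefficient."""
    return sum(mu ** k for k in range(1, s + 2))


def predict_params(w_tau, m_tau_prev, mu, s):
    """Predicted parameters: W(tau) + M(tau-1) * staleness_coefficient."""
    c = staleness_coefficient(mu, s)
    return [w + c * m for w, m in zip(w_tau, m_tau_prev)]


def nag_lookahead(w, m_prev, mu):
    """Look-ahead point used by Nesterov's Accelerated Gradient: W + mu * M."""
    return [wi + mu * mi for wi, mi in zip(w, m_prev)]


w, m_prev, mu = [1.0, -0.5], [0.5, 0.25], 0.9
print(predict_params(w, m_prev, mu, s=0))
print(nag_lookahead(w, m_prev, mu))
```

With s = 0 the sum has the single term mu, so both functions return the same point; for s > 0 the prediction extrapolates further along the stale momentum.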
Proposed prediction model for param. transition
Parameter transition modeled as stale momentum multiplied by staleness-aware coefficient
Ex) staleness of 2
[Figure: trajectory in parameter space starting from \(W^{(t)}\), shown while the grad is computing. Legend: transition by momentum; predicted transition; transition by (stale) gradients.]
Transition by momentum:
\[ W^{(t)} + \mu M^{(t-1)} \]
Predicted transition:
\[ W^{(t)} + M^{(t-1)} \sum_{s'=1}^{2+1} \mu^{s'} \]
\[ \mu + \mu^2 + \mu^3 = 2.94 \quad (\mu = 0.99) \]
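One way to read the coefficient, offered here as an interpretation rather than something stated on the slide: if every gradient were zero, the momentum rule M(t) = mu * M(t-1), W(t+1) = W(t) + M(t) would carry the parameters exactly to the predicted point after s + 1 updates. The sketch below checks this for the slide's example (staleness 2, mu = 0.99) on a single scalar parameter; the starting values and the helper name `momentum_only_rollout` are made up.

```python
def momentum_only_rollout(w, m_prev, mu, steps):
    """Apply the momentum update `steps` times with all gradients set to zero."""
    for _ in range(steps):
        m_prev = mu * m_prev  # M(t) = mu * M(t-1), gradient term omitted
        w = w + m_prev        # W(t+1) = W(t) + M(t)
    return w


mu, s = 0.99, 2
coefficient = sum(mu ** k for k in range(1, s + 2))  # mu + mu^2 + mu^3

w_t, m_prev = 1.0, -0.1  # arbitrary illustrative values
predicted = w_t + m_prev * coefficient
rolled = momentum_only_rollout(w_t, m_prev, mu, steps=s + 1)

print(round(coefficient, 2))
print(abs(predicted - rolled) < 1e-12)
```

The coefficient evaluates to about 2.94, matching the slide, and the prediction agrees with the gradient-free rollout up to floating-point rounding.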
[Figure, subsequent frames (staleness of 2): while the grad is computing, updates by stale grads move the parameters from \(W^{(t)}\) to \(W^{(t+1)}\) and then \(W^{(t+2)}\); the grad is done by \(W^{(t+3)}\). A new prediction starts from \(W^{(t+1)}\):]
\[ W^{(t+1)} + M^{(t)} \sum_{s'=1}^{2+1} \mu^{s'} \]
Hypothesis: the predicted parameter \(W^{(t)} + M^{(t-1)} \sum_{s'=1}^{2+1} \mu^{s'}\) and \(W^{(t+3)}\) are close!
1. Introduction
2. Method
3. Experiment
Training speed: PP-ASGD vs ASGD
Proposed PP-ASGD outperforms ASGD by ~2x on (randomly chosen) 32-class ImageNet.
[Figure: Validation error rate curves]
resource: 32-GPU (4-node x 8-GPU)
staleness: 8.5
Training speed: PP-ASGD vs SSGD
Proposed PP-ASGD consistently outperforms SSGD by a factor of 1.8-1.9 on 1000-class ImageNet.
[Figure: Validation error rate curves on 1000-class ImageNet. Annotation: 1.9x faster (relative speed to reach 0.6 error rate).]
staleness: 1.9-2.6

Update frequency (Hz)
GPUs   PP-ASGD (ours)   SSGD
32     13.4             4.8
64     12.1             4.7
128    9.9              4.5
256    8.2              3.9
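As a small piece of arithmetic on the table above, the ratio of update frequencies can be computed per GPU count. The numbers are copied from the table; the dictionary layout and variable names are only for illustration.

```python
# Update frequency in Hz: GPUs -> (PP-ASGD, SSGD), copied from the table.
update_hz = {
    32: (13.4, 4.8),
    64: (12.1, 4.7),
    128: (9.9, 4.5),
    256: (8.2, 3.9),
}

# How many PP-ASGD updates happen per SSGD update at each GPU count.
ratios = {gpus: pp / ss for gpus, (pp, ss) in update_hz.items()}

for gpus, ratio in ratios.items():
    print(f"{gpus:>3} GPUs: {ratio:.2f}x more updates per second")
```

The ratios fall between roughly 2.1x and 2.8x, which is larger than the reported 1.8-1.9x speed to reach the target error rate. A plausible reading is that each PP-ASGD update lowers the cost somewhat less than an SSGD update, although the slides do not state this number directly.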
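Parameter prediction accuracy
[Figure: distance between the (\(s_0\)-step) future param \(W_{\mathrm{future}}\) and the predicted param \(W_{\mathrm{pred}}(s)\), as a function of \(s\). Vertical axis: \(\| W_{\mathrm{pred}}(s) - W_{\mathrm{future}} \|_2\). Plot annotations: No prediction (ASGD); \(\| W_{\mathrm{pred}}(0) - W_{\mathrm{future}} \|_2\); Case of SSGD.]
\[ W_{\mathrm{pred}}(s) \equiv W^{(\tau)} + M^{(\tau-1)} \sum_{s'=1}^{s+1} \mu^{s'} \]
The proposed parameter transition model
  is most accurate when \(s\) = measured staleness.
  outperforms ASGD in prediction accuracy (\(s > 0\)).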
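The distance on this slide can be computed as in the sketch below. It assumes the norm is the Euclidean (L2) norm and that parameters are flat lists of floats; the function names and the numbers in the usage example are hypothetical and do not come from the experiment.

```python
import math


def predicted_params(w_tau, m_tau_prev, mu, s):
    """W_pred(s) = W(tau) + M(tau-1) * sum of mu**k for k = 1 .. s+1."""
    c = sum(mu ** k for k in range(1, s + 2))
    return [w + c * m for w, m in zip(w_tau, m_tau_prev)]


def prediction_distance(w_tau, m_tau_prev, w_future, mu, s):
    """Euclidean distance between W_pred(s) and the future parameters."""
    w_pred = predicted_params(w_tau, m_tau_prev, mu, s)
    return math.sqrt(sum((p - f) ** 2 for p, f in zip(w_pred, w_future)))


# Made-up values: scan s to see how the distance changes with the assumed staleness.
w_tau, m_tau_prev, w_future, mu = [0.0, 0.0], [1.0, 0.0], [0.75, 0.4], 0.5
for s in range(4):
    print(s, prediction_distance(w_tau, m_tau_prev, w_future, mu, s))
```

Scanning `s` in this way yields a curve like the one on the slide when the stale parameters, the stale momentum, and the later parameters are taken from recorded training snapshots.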
Conclusion
Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD).
Mitigates the adverse effect of staleness by parameter prediction.
Outperforms ASGD and conditionally outperforms SSGD in speed.
[Figure: SSGD, ASGD, and PP-ASGD placed by update frequency (low to high) and loss-drop per update (low to high), with steepest descent as reference. Annotations: better gradient “quality”; much higher update frequency.]
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 

Ikuro Sato's slides presented at ICONIP 2017

  • 1. Asynchronous, Data-Parallel Deep Convolutional Neural Network Training with Linear Prediction Model for Parameter Transition. Ikuro Sato 1), Ryo Fujisaki 1), Yosuke Oyama 2), Akihiro Nomura 2), and Satoshi Matsuoka 2). Deep Learning 3 (Nov. 16, 2017), ICONIP 2017. 1) Denso IT Laboratory, 2) Tokyo Institute of Technology, Japan. Ikuro Sato, Denso IT Laboratory, Inc. 1/25
  • 3. Common practices in state-of-the-art CNNs. Recent trend: computationally intensive models tend to perform well. Top-5 error rate @LSVRC: AlexNet 16.4% [Krizhevsky+, NIPS2012]; VGG-19 7.32% [Simonyan+, ICLR2015]; GoogLeNet 6.67% [Szegedy+, CVPR2015]; ResNet 3.57% [He+, CVPR2016]. #multiplications per parameter (plot labels): 137, 11, 221, 179.
  • 4. Data-parallel, mini-batch SGD to boost training. What is it? Model optimization with many processors (GPUs) used in parallel. How fast is it to train computationally intensive CNNs? GoogLeNet training on ImageNet boosted by 16x with 32 GPUs [Iandola+, CVPR2016]. ResNet training on ImageNet within 1h with 256 GPUs [Goyal+, 2017]. ResNet training on ImageNet within 15 min with 1024 GPUs [Akiba+, 2017].
  • 5. Two approaches: SSGD and ASGD. SSGD (Synchronous Stochastic Gradient Descent) allows a parameter update after completing all gradient computations. Basic update rule: $W^{(t+1)} = W^{(t)} - \lambda \sum_{\text{all GPUs}} \left.\frac{\partial J}{\partial W}\right|_{W^{(t)}}$. ASGD (Asynchronous Stochastic Gradient Descent) allows a parameter update without completing all gradient computations. Basic update rule: $W^{(t+1)} = W^{(t)} - \lambda \sum_{\text{some GPUs}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)}}$. Gradients are evaluated at old parameters: staleness $= t - \tau > 0$.
  • 6. Which is faster, SSGD or ASGD? Diagram: SSGD, ASGD, and steepest descent placed on two axes, update frequency (low to high) and cost-drop per update (low to high). “Sync is faster” group and “Async is faster” group: [Zheng+, arxiv1609.08326] [Gupta+, ICDM2016] [Zhang+, IJCAI2016] [Chen+, ICLR 2016 workshop] [Jin+, NIPS2016 workshop]. No conclusion yet.
  • 7. Our contributions. Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD). Mitigates the adverse effect of staleness. Outperforms ASGD and conditionally outperforms SSGD in speed. Diagram: SSGD, ASGD, PP-ASGD, and steepest descent placed on two axes, update frequency (low to high) and cost-drop per update (low to high); annotations: better gradient “quality”, much higher update frequency.
  • 9. SSGD (with collective communication). Flow (synchronous): Load; Comp. grad.; Send grad. & update. Update rule (SSGD with momentum): $M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{all nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(t)}}$, $W^{(t+1)} = W^{(t)} + M^{(t)}$.
  • 10. ASGD (with collective communication) [Oyama+, IEEE BigData 2016]. Flow: asynchronous part: Load; Comp. grad.; Flag. Synchronous part: Flagged? If yes, send grad. & update, then unflag; if no, send zero & update. Update rule (ASGD with momentum): $M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)}}$, $W^{(t+1)} = W^{(t)} + M^{(t)}$. Gradients are evaluated at stale parameters.
  • 11. PP-ASGD (proposed). Flow: same as ASGD, with an added "Predict param." step before the gradient computation. Update rule (PP-ASGD): $M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)} + M^{(\tau-1)} \sum_{k=1}^{s+1} \mu^{k}}$, $W^{(t+1)} = W^{(t)} + M^{(t)}$. Gradients are evaluated at predicted parameters ($s$ = measured staleness).
  • 12. PP-ASGD (proposed), continued. If staleness is zero ($s = 0$), the gradient is evaluated at $W^{(\tau)} + \mu M^{(\tau-1)}$, and PP-ASGD becomes Nesterov’s Accelerated Gradient method (NAG).
  • 13. Proposed prediction model for param. transition. Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient. Ex) staleness of 2. Diagram (parameter space): $W^{(t)}$.
  • 14. Proposed prediction model for param. transition (Ex: staleness of 2). Diagram adds the points $W^{(t)} + \mu M^{(t-1)}$ and $W^{(t)} + M^{(t-1)} \sum_{k=1}^{2+1} \mu^{k}$. Labels: transition by momentum; predicted transition; transition by (stale) gradients; grad (computing).
  • 15. Proposed prediction model for param. transition (Ex: staleness of 2). Same diagram, with the coefficient evaluated: $\mu + \mu^{2} + \mu^{3} = 2.94$ for $\mu = 0.99$.
  • 16. Proposed prediction model for param. transition (Ex: staleness of 2). Diagram adds $W^{(t+1)}$, reached by momentum and a stale grad; grad (computing).
  • 17. Proposed prediction model for param. transition (Ex: staleness of 2). Diagram points: $W^{(t)}$, $W^{(t)} + M^{(t-1)} \sum_{k=1}^{2+1} \mu^{k}$, $W^{(t+1)}$, $W^{(t+1)} + M^{(t)} \sum_{k=1}^{2+1} \mu^{k}$; grad (computing).
  • 18. Proposed prediction model for param. transition (Ex: staleness of 2). Diagram adds $W^{(t+2)}$; grad (computing).
  • 19. Proposed prediction model for param. transition (Ex: staleness of 2). Diagram adds $W^{(t+3)}$; grad (DONE!).
  • 20. Proposed prediction model for param. transition (Ex: staleness of 2). Diagram points: $W^{(t)}$, $W^{(t)} + M^{(t-1)} \sum_{k=1}^{2+1} \mu^{k}$, $W^{(t+1)}$, $W^{(t+1)} + M^{(t)} \sum_{k=1}^{2+1} \mu^{k}$, $W^{(t+2)}$, $W^{(t+3)}$. Hypothesis: They’re close!
  • 22. Training speed: PP-ASGD vs ASGD. Proposed PP-ASGD outperforms ASGD by ~2x on (randomly chosen) 32-class ImageNet. Figure: validation error rate curves. Resource: 32 GPUs (4 nodes x 8 GPUs). Staleness: 8.5.
  • 23. Training speed: PP-ASGD vs SSGD. Proposed PP-ASGD consistently outperforms SSGD by a factor of 1.8-1.9 on 1000-class ImageNet. Figure: validation error rate curves on 1000-class ImageNet; relative speed to reach 0.6 error rate: 1.9x faster. Staleness: 1.9-2.6. Update frequency (Hz), PP-ASGD (ours) vs SSGD: 32 GPUs: 13.4 vs 4.8; 64 GPUs: 12.1 vs 4.7; 128 GPUs: 9.9 vs 4.5; 256 GPUs: 8.2 vs 3.9.
  • 24. Parameter prediction accuracy. Figure: distance between the ($s_0$-step) future param $W_{future}$ and the predicted param $W_{pred}^{(s)}$, as a function of $s$, where $W_{pred}^{(s)} \equiv W^{(\tau)} + M^{(\tau-1)} \sum_{k=1}^{s+1} \mu^{k}$. Plotted quantity: $\| W_{pred}^{(s)} - W_{future} \|_2$. Plot labels: No prediction (ASGD); $\| W_{pred}^{(0)} - W_{future} \|_2$; Case of SSGD. The proposed parameter transition model is most accurate when $s$ = measured staleness, and it outperforms ASGD in prediction accuracy ($s > 0$).
  • 25. Conclusion. Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD). Mitigates the adverse effect of staleness by parameter prediction. Outperforms ASGD and conditionally outperforms SSGD in speed. Diagram: SSGD, ASGD, PP-ASGD, and steepest descent placed on two axes, update frequency (low to high) and loss-drop per update (low to high); annotations: better gradient “quality”, much higher update frequency.
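
The PP-ASGD update rule and its staleness-aware coefficient from the slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are plain lists of floats, there is no real communication between nodes, and the function names (staleness_coefficient, predict_params, pp_asgd_step) are hypothetical. The summation index is called k here.

```python
from typing import List, Sequence, Tuple


def staleness_coefficient(mu: float, s: int) -> float:
    """Sum of mu**k for k = 1 .. s+1, where s is the measured staleness."""
    return sum(mu ** k for k in range(1, s + 2))


def predict_params(w_stale: Sequence[float], m_stale: Sequence[float],
                   mu: float, s: int) -> List[float]:
    """Predicted parameters: W^(tau) + M^(tau-1) * sum_{k=1}^{s+1} mu^k."""
    c = staleness_coefficient(mu, s)
    return [w + c * m for w, m in zip(w_stale, m_stale)]


def pp_asgd_step(w: Sequence[float], m: Sequence[float],
                 grads: Sequence[Sequence[float]],
                 mu: float, lr: float) -> Tuple[List[float], List[float]]:
    """One momentum update with the gradients that arrived in this step.

    grads holds one gradient per contributing node; each is assumed to have
    been evaluated at parameters returned by predict_params.
    M^(t) = mu * M^(t-1) - lr * sum(grads);  W^(t+1) = W^(t) + M^(t).
    """
    summed = [sum(g) for g in zip(*grads)]
    m_new = [mu * mi - lr * gi for mi, gi in zip(m, summed)]
    w_new = [wi + mi for wi, mi in zip(w, m_new)]
    return w_new, m_new


if __name__ == "__main__":
    # Coefficient for the slides' example: staleness 2, mu = 0.99.
    print(round(staleness_coefficient(0.99, 2), 2))  # prints 2.94
```

With s = 0 the coefficient reduces to mu, so predict_params returns W + mu * M, the look-ahead point used by Nesterov's accelerated gradient, which matches the remark on slide 12.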