Representative tasks in deep learning
The graph was excerpted from https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html
The “large batch” problem
From Keskar et al.
“On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”
“It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize”
1. The gradient computed in each iteration is an average over a larger number of samples
→ gradients are “less stochastic”, which makes it harder to escape from local minima
2. The total number of iterations (= parameter updates) is smaller
(iterations per epoch = number of images / batch size; a quick numeric check follows below)
[Figure: loss-landscape sketch contrasting a sharp local minimum with a flatter minimum labeled “better model”]
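To make point 2 concrete, a quick check of iterations per epoch (the image count is ImageNet-1k’s training-set size; the batch sizes are illustrative):

```python
# Iterations per epoch = number of images / batch size.
# ImageNet-1k has 1,281,167 training images.
images = 1_281_167
for batch in (256, 8192):
    print(f"batch {batch}: {images // batch} iterations per epoch")
# batch 256:  5004 iterations per epoch
# batch 8192: 156  iterations per epoch -> 32x fewer updates per epoch
```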
“Linear scaling rule” for the large-batch problem
“If the minibatch size is k times larger, increase the learning rate by k times”
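A minimal sketch of the rule (the base values 0.1 and 256 below are illustrative, not from the slide):

```python
# Linear scaling rule: the learning rate is multiplied by the same
# factor k as the minibatch size. All values here are illustrative.
base_lr = 0.1      # learning rate tuned for the reference batch size
base_batch = 256   # reference minibatch size
batch = 8192       # k-times-larger minibatch

k = batch / base_batch    # k = 32
scaled_lr = base_lr * k   # 0.1 * 32 = 3.2
print(f"k = {k}, scaled lr = {scaled_lr}")
```

In practice the scaled learning rate is usually combined with a warmup period so the large initial rate does not destabilize early training, as in Goyal et al., “Accurate, Large Minibatch SGD”.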
Data parallel: sync vs. async
[Diagram: two data-parallel schemes.
Synchronous: every worker runs Forward → Backward, gradients are combined with All-Reduce, then every worker runs Optimize on the same averaged gradients.
Asynchronous: workers exchange gradients and parameters through a parameter server, updating without waiting for one another.]
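A minimal sketch of one synchronous step, assuming PyTorch’s torch.distributed is initialized (the slide itself is framework-agnostic):

```python
import torch.distributed as dist

def sync_step(model, loss_fn, optimizer, x, y):
    # Forward and Backward run on each worker's own shard of the batch.
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    # All-Reduce: sum gradients across workers and average, so every
    # worker sees identical gradients before Optimize.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
    optimizer.step()
```

In the asynchronous variant, each worker would instead push gradients to and pull parameters from the parameter server without this synchronization barrier.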
Reduce communication: use FP16
1. Compute gradients
2. Convert FP32 to FP16
3. All-reduce (with NCCL)
4. Convert FP16 to FP32 and update
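A sketch of the FP16 round-trip around the all-reduce, again assuming torch.distributed with the NCCL backend:

```python
import torch.distributed as dist

def allreduce_grads_fp16(model):
    # Halve the communication volume by sending FP16 gradients;
    # the master copy of the gradients stays in FP32.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        g16 = p.grad.half()                         # FP32 -> FP16
        dist.all_reduce(g16, op=dist.ReduceOp.SUM)  # all-reduce on the smaller tensor
        p.grad.copy_(g16.float() / world)           # FP16 -> FP32; update stays in FP32
```

Real implementations typically also apply loss scaling so that small gradient values do not underflow in FP16.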
Hide communication (by overlapping)
Double buffering
• Each update uses the gradients from the previous iteration (1-step stale gradients)
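A sketch of double buffering with torch.distributed’s asynchronous all-reduce (names are illustrative; for simplicity it assumes every parameter receives a gradient):

```python
import torch.distributed as dist

def train_loop(model, loss_fn, optimizer, batches):
    world = dist.get_world_size()
    prev = None  # (gradients, all-reduce handles) from the previous iteration
    for x, y in batches:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        # Launch the all-reduce of this iteration's gradients in the
        # background; it overlaps with the update below and with the
        # next iteration's forward/backward.
        grads = [p.grad.clone() for p in model.parameters()]
        handles = [dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True)
                   for g in grads]
        if prev is not None:
            old_grads, old_handles = prev
            for h in old_handles:
                h.wait()  # ensure last iteration's reduction has finished
            for p, g in zip(model.parameters(), old_grads):
                p.grad.copy_(g / world)
            optimizer.step()  # update with 1-step stale gradients
        prev = (grads, handles)
```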
Speeding up training with synchronous data parallelism
● Problem setting
● Dataset: ImageNet
● Model: ResNet-50
● How fast can 90 epochs be trained without losing accuracy? (Reducing the number of epochs itself is not allowed.)
● Training has become more than 100× faster in a little under two years
Company   Processor      Date     Training time
PFN       TITAN X ×128   2017/01  4 h
Facebook  P100 ×256      2017/06  1 h
PFN       P100 ×1024     2017/11  15 min
SONY      V100 ×2176     2018/11  3.7 min
Google    TPUv3 ×1024    2018/11  2.2 min