# Large Scale Distributed Deep Networks

Survey of the NIPS 2012 paper "Large Scale Distributed Deep Networks".


### Large Scale Distributed Deep Networks

1. 1. Large Scale Distributed Deep Networks Survey of the paper from NIPS 2012 Hiroyuki Vincent Yamazaki, Jan 8, 2016  hiroyuki.vincent.yamazaki@gmail.com
2. 2. What is Deep Learning? How can distributed computing be applied?
3. 3. – Jeff Dean, Google  GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015 “… We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.”
4. 4. What is Deep Learning?
5. 5. Multi-layered neural networks: functions f that take some input and return some output (Input → f → Output)
6. 6. Examples: an AND gate maps input (1, 0) to output 0; y(x) = 2x + 5 maps 7 to 19; an object classifier maps an image to "Cat"; a speech recognizer maps audio to "Hello world"
7. 7. Neural Networks Machine learning models, inspired by the human brain Layered units with weighted connections Signals are passed between layers  Input layer → Hidden layers → Output layer
8. 8. Steps 1. Prepare training, validation and test data 2. Define the model and its initial parameters 3. Train using the data to improve the model
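A minimal NumPy sketch of these three steps; the toy regression task (y = 2x + 5), the 80/10/10 split, the single linear unit and the learning rate are all illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Prepare training, validation and test data
#    (toy regression data for y = 2x + 5, split 80/10/10)
x = rng.uniform(-1.0, 1.0, size=(1000, 1))
y = 2.0 * x + 5.0
x_train, x_val, x_test = np.split(x, [800, 900])
y_train, y_val, y_test = np.split(y, [800, 900])

# 2. Define the model and its initial parameters (a single linear unit)
w = rng.normal(scale=0.1, size=(1, 1))
b = np.zeros(1)

# 3. Train using the data to improve the model (plain gradient descent)
learning_rate = 0.1
for epoch in range(200):
    output = x_train @ w + b              # forward pass
    grad_output = output - y_train        # derivative of 1/2 * squared error
    w -= learning_rate * x_train.T @ grad_output / len(x_train)
    b -= learning_rate * grad_output.mean(axis=0)

val_error = np.mean((x_val @ w + b - y_val) ** 2)
```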
9. 9. Here to   train?
10. 10. [Figure: input → f → output]
11. 11. [Figure: input layer, hidden layers, output layer]
12. 12. [Figure: input layer, hidden layers, output layer]
13. 13. Yes,   let’s do it
14. 14. Feed Forward 1. For each unit, compute its weighted sum based on its input 2. Pass the sum to the activation function to get the output of the unit: $z = \sum_{i=1}^{n} x_i w_i + b$, $y = \phi(z)$, where $z$ is the weighted sum, $n$ is the number of inputs, $x_i$ is the $i$-th input, $w_i$ is the weight for $x_i$, $b$ is the bias term, $\phi$ is the activation function and $y$ is the output
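A minimal sketch of one unit's feed-forward step in NumPy; the tanh activation and the example inputs and weights are assumptions, since the slide leaves the activation function unspecified.

```python
import numpy as np

def unit_forward(x, w, b, activation=np.tanh):
    """Feed-forward computation for one unit:
    z = sum_i x_i * w_i + b   (weighted sum)
    y = phi(z)                (activation applied to the sum)
    """
    z = np.dot(x, w) + b      # step 1: weighted sum
    return activation(z)      # step 2: pass through the activation

# Example unit with two inputs x_1, x_2 and weights w_1, w_2
x = np.array([1.0, 0.5])
w = np.array([0.8, -0.3])
y = unit_forward(x, w, b=0.1)
```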
15. 15. Loss 3. Given the output from the last layer, compute the loss using the Mean Squared Error (MSE) or the cross entropy: $E(W) = \frac{1}{2}(\hat{y} - y)^2$, where $E$ is the loss/error, $W$ are the weights, $\hat{y}$ are the target values and $y$ are the output values. This is the error that we want to minimize
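The same loss as a small NumPy helper; averaging over the elements when a whole batch is given is an assumption added here.

```python
import numpy as np

def mse_loss(output, target):
    """E(W) = 1/2 * (target - output)^2 from the slide (y_hat = target,
    y = output), averaged over the elements for a batch of values."""
    return 0.5 * np.mean((target - output) ** 2)

# Example: loss between a batch of outputs and their targets
loss = mse_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0]))
```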
16. 16. Back Propagation 4. Compute the gradient of the loss function with respect to the parameters, as used by Stochastic Gradient Descent (SGD) 5. Take a step proportional (scaled by the learning rate) to the negative of the gradient to adjust the weights: $\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}$, $w_{i,t+1} = w_{i,t} + \Delta w_i$, where $\alpha$ is the learning rate, typically $10^{-1}$ to $10^{-3}$
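A worked example of one SGD step for a single linear unit with the MSE loss above; the input, target, initial weight and learning rate values are made up for illustration.

```python
def sgd_step(w, grad, lr=0.1):
    """delta_w_i = -lr * dE/dw_i, then w_i(t+1) = w_i(t) + delta_w_i."""
    return w - lr * grad

# Single linear unit y = w * x, one made-up training example
x, target = 2.0, 1.0
w, lr = 0.3, 0.1
output = w * x                        # forward pass: y = 0.6
dE_dw = (output - target) * x         # dE/dw for E = 1/2 * (target - output)^2
w = sgd_step(w, dE_dw, lr)            # w moves from 0.3 to 0.38, towards the optimum 0.5
```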
17. 17. Improve the accuracy of the network by iteratively repeating these steps
18. 18. But it takes time
20. 20. AlexNet, NIPS 2012 7 layers 650K units 60M parameters
21. 21. Yes, train hard It’s too much
22. 22. How can distributed computing be applied?
23. 23. A framework, DistBelief, proposed by researchers at Google in 2012
25. 25. Asynchrony - robustness to cope with slow machines and failures of individual machines. Network Overhead - managing the amount of data sent across machines
26. 26. DistBelief Parallelization - splitting up the network/model Model Replication - processing multiple instances of the network/model asynchronously
27. 27. DistBelief Parallelization
28. 28. Split up the network among multiple machines. Speed-up gains for networks with many parameters, up to the point where communication costs dominate. Bold connections (in the figure) require network traffic
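A toy single-process sketch of the idea: one fully connected layer whose output units are partitioned across two hypothetical "machines". It only illustrates the partitioning; the actual cross-machine communication in DistBelief is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

# One fully connected layer with 8 output units, partitioned across
# 2 hypothetical machines (4 units per machine).
n_in, n_out, n_machines = 16, 8, 2
partitions = [rng.normal(size=(n_in, n_out // n_machines))
              for _ in range(n_machines)]

def layer_forward(x, partitions):
    # Each "machine" computes the outputs of the units it owns; in
    # DistBelief, only activations crossing a partition boundary would
    # actually travel over the network.
    parts = [x @ w_part for w_part in partitions]
    return np.concatenate(parts, axis=-1)

y = layer_forward(rng.normal(size=(n_in,)), partitions)   # shape (8,)
```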
29. 29. DistBelief Model Replication
30. 30. Two optimization algorithms to achieve asynchrony: Downpour SGD and Sandblaster L-BFGS
31. 31. Downpour SGD Online Asynchronous   Stochastic Gradient Descent
32. 32. 1. Split the training data into shards and assign a model replica to each data shard 2. For each model replica, fetch the parameters from the centralized, sharded parameter server 3. Compute gradients per model replica and push them back to the parameter server. Each data shard stores a subset of the complete training data
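To make the fetch/compute/push cycle concrete, here is a toy sketch of Downpour SGD with a single-shard parameter server and model replicas as Python threads. This is a simplification for illustration, not the paper's implementation, which shards both the parameters and the replicas across machines.

```python
import threading
import numpy as np

class ParameterServer:
    """Toy centralized parameter server (a single shard for simplicity)."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self._lock = threading.Lock()

    def fetch(self):
        with self._lock:
            return self.w.copy()

    def push_gradient(self, grad):
        with self._lock:
            self.w -= self.lr * grad

def replica(server, data_shard, steps=100):
    """One model replica training on its own data shard."""
    x, y = data_shard
    for _ in range(steps):
        w = server.fetch()                    # possibly out-of-date parameters
        output = x @ w
        grad = x.T @ (output - y) / len(x)    # MSE gradient for a linear model
        server.push_gradient(grad)            # asynchronous push

rng = np.random.default_rng(0)
w_true = rng.normal(size=4)
x = rng.normal(size=(1000, 4))
y = x @ w_true
shards = [(x[i::4], y[i::4]) for i in range(4)]   # 4 data shards, 4 replicas

server = ParameterServer(dim=4)
threads = [threading.Thread(target=replica, args=(server, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
```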
33. 33. Asynchrony  Model replicas and parameter server shards process data independently Network Overhead  Each machine only needs to communicate with a subset of the parameter server shards
34. 34. Batch Updates  Performing batch updates and batch push/pull to and from the parameter server → Also reduces network overhead AdaGrad  Adaptive learning rates per weight using AdaGrad improve the training results Stochasticity  Out-of-date parameters in model replicas → Not clear how this affects the training
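A minimal sketch of the AdaGrad rule mentioned above, giving each weight its own effective learning rate; the quadratic toy objective and the hyperparameter values are assumptions.

```python
import numpy as np

def adagrad_update(w, grad, accum, lr=0.01, eps=1e-8):
    """AdaGrad: each weight keeps its own accumulated squared gradient,
    so frequently updated weights get a smaller effective learning rate."""
    accum += grad ** 2
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Toy usage: repeatedly apply noisy gradients of 1/2 * ||w - 1||^2
rng = np.random.default_rng(0)
w = np.zeros(10)
accum = np.zeros_like(w)
for _ in range(100):
    grad = (w - 1.0) + 0.1 * rng.normal(size=10)
    w, accum = adagrad_update(w, grad, accum)
```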
35. 35. Sandblaster L-BFGS  Batch Distributed Parameter Storage   and Manipulation
36. 36. 1. Create model replicas 2. Load balancing by dividing computational tasks into smaller subtasks and letting a coordinator assign those subtasks to appropriate shards
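A toy sketch of the coordinator's load balancing: the batch is split into many small subtasks, and each worker pulls a new one as soon as it finishes the previous one, so fast machines naturally do more work. The sleep times stand in for actual gradient computation and are purely illustrative.

```python
import queue
import random
import threading
import time

def coordinator_run(num_workers=4, num_subtasks=32):
    """Divide the work into many small subtasks and let workers pull
    subtasks dynamically instead of assigning them statically."""
    tasks = queue.Queue()
    for i in range(num_subtasks):
        tasks.put(i)

    def worker(worker_id):
        while True:
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                return
            time.sleep(random.uniform(0.01, 0.05))   # stand-in for computation
            print(f"worker {worker_id} finished subtask {task}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

coordinator_run()
```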
37. 37. Asynchrony  Model replicas and parameter shards process data independently Network Overhead  Only a single fetch per batch
38. 38. Distributed Parameter Server  No need for a central parameter server that has to handle all the parameters Coordinator  A process that balances the load among the shards to prevent slow machines from slowing down or stopping the training
39. 39. Results
40. 40. Training speed-up is the number of times the parallelized model trains faster compared with the same model running on a single machine
41. 41. The numbers in the brackets are the number of model replicas
42. 42. Closer to the origin is better; in this case, more cost-efficient in terms of money
43. 43. Conclusion
44. 44. Significant improvements over single-machine training. DistBelief is CPU-oriented due to the CPU-GPU data transfer overhead. Unfortunately, it adds unit connectivity limitations
45. 45. If neural networks continue to scale up, distributed computing will become essential
46. 46. Specially designed hardware such as Facebook's Big Sur could address these problems
47. 47. We are strong together
48. 48. References
Large Scale Distributed Deep Networks  http://research.google.com/archive/large_deep_networks_nips2012.html
Going Deeper with Convolutions  http://arxiv.org/abs/1409.4842
ImageNet Classification with Deep Convolutional Neural Networks  http://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012
Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms  http://arxiv.org/abs/1505.04956
GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015  https://github.com/tensorflow/tensorflow/issues/23
Big Sur, Facebook, Dec 11, 2015  https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/
49. 49. Hiroyuki Vincent Yamazaki, Jan 8, 2016  hiroyuki.vincent.yamazaki@gmail.com