Large Scale Distributed Deep Networks


Survey of paper from NIPS 2012, Large Scale Distributed Deep Networks

Published in: Engineering

  1. 1. Large Scale Distributed Deep Networks Survey of paper from NIPS 2012 Hiroyuki Vincent Yamazaki, Jan 8, 2016
  2. 2. What is Deep Learning? How can distributed computing be applied?
  3. 3. “… We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.” – Jeff Dean, Google, GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015
  4. 4. What is Deep Learning?
  5. 5. Multi-layered neural networks: functions f that take some input and return some output (Input → f → Output)
  6. 6. Examples of f (Input → Output): AND: (1, 0) → 0; y(x) = 2x + 5: 7 → 19; Object Classifier: image → Cat; Speech Recognizer: audio → “Hello world”
  7. 7. Neural Networks Machine learning models, inspired by the human brain Layered units with weighted connections Signals are passed between layers
 Input layer → Hidden layers → Output layer
  8. 8. Steps 1. Prepare training, validation and test data 2. Define the model and its initial parameters 3. Train using the data to improve the model
  9. 9. Here to 
  10. 10. Input Outputf
  11. 11. Input Output Hidden Layers
  12. 12. Input Output Hidden Layers
  13. 13. Yes, 
 let’s do it
  14. 14. Feed Forward 1. For each unit, compute its weighted sum based on its input 2. Pass the sum to the activation function to get the output of the unit: $z = \sum_{i=1}^{n} x_i w_i + b$, $y = \sigma(z)$, where $z$ is the weighted sum, $n$ is the number of inputs, $x_i$ is the i-th input, $w_i$ is the weight for $x_i$, $b$ is the bias term, $y$ is the output and $\sigma$ is the activation function
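The feed-forward step for a single unit can be sketched in a few lines of Python. This is only an illustration: the logistic sigmoid is assumed as the activation function (the slide does not name one), and the names `sigmoid` and `unit_output` as well as the input values are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic activation, one common choice (an assumption, not from the slide)."""
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Two inputs, as in the slide's diagram (x1, x2, w1, w2, b); the values are made up.
y = unit_output([1.0, 0.0], [0.5, -0.5], bias=-0.5)
print(y)
```

A full layer repeats this computation once per unit, and a network feeds the outputs of one layer into the next.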
  15. 15. Loss 3. Given the output from the last layer, compute the loss using the Mean Squared Error (MSE) or the cross entropy. This is the error that we want to minimize: $E(W) = \frac{1}{2}(\hat{y} - y)^2$, where $E$ is the loss/error, $W$ is the weights, $\hat{y}$ is the target values and $y$ is the output values
  16. 16. Back Propagation 4. Compute the gradient of the loss function with respect to the parameters, as used by Stochastic Gradient Descent (SGD) 5. Take a step proportional (scaled by the learning rate) to the negative of the gradient to adjust the weights: $\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}$, $w_{i,t+1} = w_{i,t} + \Delta w_i$, where $\alpha$ is the learning rate, typically $10^{-1}$ to $10^{-3}$
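To make the loss and the weight update concrete, here is a minimal sketch for the simplest possible model: a single linear unit y = w·x + b trained on samples of y(x) = 2x + 5, the function used as an example earlier in the deck. The squared-error loss is the one from the Loss slide; the function names, the sample points and the learning rate of 0.05 are assumptions chosen for the illustration.

```python
def loss(w, b, x, target):
    """Squared error E = 1/2 * (target - y)^2 for the linear unit y = w*x + b."""
    return 0.5 * (target - (w * x + b)) ** 2

def sgd_step(w, b, x, target, lr):
    """One SGD step: move w and b against the gradient of E, scaled by lr."""
    error = (w * x + b) - target   # dE/dy
    grad_w = error * x             # dE/dw by the chain rule
    grad_b = error                 # dE/db
    return w - lr * grad_w, b - lr * grad_b

# Training samples drawn from y(x) = 2x + 5.
samples = [(x, 2 * x + 5) for x in (-1.0, 0.0, 1.0, 2.0)]

w, b = 0.0, 0.0
loss_before = sum(loss(w, b, x, t) for x, t in samples)
for epoch in range(500):
    for x, t in samples:
        w, b = sgd_step(w, b, x, t, lr=0.05)
loss_after = sum(loss(w, b, x, t) for x, t in samples)
print(w, b, loss_before, loss_after)
```

In a multi-layer network the same update applies to every weight; back propagation is the procedure that supplies the gradient for weights in the hidden layers.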
  17. 17. Improve the accuracy of the network by iteratively repeating these steps
  18. 18. But it takes time
  19. 19. 22 layers 5M parameters GoogLeNet, Google, ILSVRC 2014
  20. 20. AlexNet, NIPS 2012 7 layers 650K units 60M parameters
  21. 21. Yes, train hard It’s too much
  22. 22. How can distributed computing be applied?
  23. 23. DistBelief, a framework proposed by researchers at Google in 2012
  24. 24. Here, let 
 me help you 
 with those
  25. 25. Asynchrony: robustness to cope with slow machines and single points of failure. Network overhead: manage the amount of data sent across machines
  26. 26. DistBelief. Parallelization: splitting up the network/model. Model replication: processing multiple instances of the network/model asynchronously
  27. 27. DistBelief Parallelization
  28. 28. Split up the network among multiple machines. Speed-up gains for networks with many parameters, up to the point where communication costs dominate. Bold connections require network traffic
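As a rough illustration of splitting a model across machines, the sketch below partitions the units of one fully connected layer into two groups and checks that the combined result matches the single-machine computation. It is a toy under stated assumptions: the "machines" are just loop iterations, `layer_forward` and `split` are hypothetical helper names, and the actual partitioning scheme used by DistBelief is not shown in the slides.

```python
def layer_forward(inputs, weight_rows, biases):
    """Output of each unit: weighted sum of the inputs plus that unit's bias."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weight_rows, biases)]

def split(items, n_parts):
    """Split a list into contiguous chunks of near-equal size."""
    size = -(-len(items) // n_parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

inputs = [1.0, 2.0, 3.0]
weights = [[0.1, 0.2, 0.3], [0.0, -1.0, 1.0], [0.5, 0.5, 0.5], [1.0, 0.0, -1.0]]
biases = [0.0, 1.0, -1.0, 0.5]

single_machine = layer_forward(inputs, weights, biases)

# Each hypothetical machine owns a slice of the units. It needs a copy of the
# inputs, which is the kind of traffic that crosses machine boundaries.
partitioned = []
for rows, bs in zip(split(weights, 2), split(biases, 2)):
    partitioned.extend(layer_forward(inputs, rows, bs))

print(partitioned == single_machine)
```

The more inputs a machine has to receive from units hosted elsewhere, the more the communication cost eats into the speed-up.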
  29. 29. DistBelief Model Replication
  30. 30. Two optimization algorithms to achieve asynchrony: Downpour SGD and Sandblaster L-BFGS
  31. 31. Downpour SGD Online Asynchronous 
 Stochastic Gradient Descent
  32. 32. 1. Split the training data into shards and assign a model replica to each data shard 2. For each model replica, fetch the parameters from the centralized sharded parameter server 3. Gradients are computed per model and pushed back to the parameter server. Each data shard stores a subset of the complete training data
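The three steps can be simulated in a single process to see how replicas work from slightly stale parameters. The sketch below is a toy, not the DistBelief implementation: the `ParameterServer` class, the `fetch_every` interval, the round-robin scheduling of replicas, the learning rate and the linear model fitted to y(x) = 2x + 5 are all assumptions made for illustration, and the server is not sharded.

```python
class ParameterServer:
    """Holds the shared parameters and applies gradients pushed by replicas."""
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def fetch(self):
        return list(self.params)      # replicas receive a copy

    def push(self, grads):
        for i, g in enumerate(grads):
            self.params[i] -= self.lr * g

def gradient(params, x, target):
    """Gradient of 1/2 * (target - (w*x + b))^2 with respect to (w, b)."""
    w, b = params
    error = (w * x + b) - target
    return [error * x, error]

def downpour(shards, steps, lr=0.01, fetch_every=3):
    server = ParameterServer([0.0, 0.0], lr)
    local = [server.fetch() for _ in shards]   # one model replica per data shard
    for step in range(steps):
        for r, shard in enumerate(shards):
            if step % fetch_every == 0:        # parameters are refreshed only now and then
                local[r] = server.fetch()
            x, target = shard[step % len(shard)]
            # The gradient is computed from possibly out-of-date local parameters.
            server.push(gradient(local[r], x, target))
    return server.params

# Step 1: split the training data (samples of y = 2x + 5) into two shards.
shards = [[(-1.0, 3.0), (0.0, 5.0)], [(1.0, 7.0), (2.0, 9.0)]]
params = downpour(shards, steps=2000)
print(params)
```

Raising `fetch_every` reduces traffic to the parameter server but makes the replicas' parameters staler.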
  33. 33. Asynchrony: model replicas and parameter server shards process data independently. Network overhead: each machine only needs to communicate with a subset of the parameter server shards
  34. 34. Batch updates: performing batch updates and batch push/pull to and from the parameter server → also reduces network overhead. AdaGrad: adaptive learning rates per weight using AdaGrad improve the training results. Stochasticity: out-of-date parameters in model replicas → not clear how this affects the training
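A per-weight adaptive learning rate in the AdaGrad style can be sketched as follows: each weight's step is divided by the square root of the sum of its past squared gradients. The function name, the base learning rate of 0.1 and the small `eps` term (added to avoid division by zero) are assumptions for this example.

```python
import math

def adagrad_update(weights, grads, accum, lr=0.1, eps=1e-8):
    """Update weights in place, scaling each step by that weight's gradient history."""
    for i, g in enumerate(grads):
        accum[i] += g * g                                   # running sum of squared gradients
        weights[i] -= lr * g / (math.sqrt(accum[i]) + eps)  # per-weight effective rate
    return weights, accum

weights = [1.0, 1.0]
accum = [0.0, 0.0]
# One weight sees a large gradient and the other a small one.
adagrad_update(weights, [10.0, 0.1], accum)
print(weights)
```

On the first step both weights move by roughly the base learning rate despite gradients that differ by a factor of 100, which illustrates how the scaling evens out step sizes across weights.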
  35. 35. Sandblaster L-BFGS
 Batch Distributed Parameter Storage 
 and Manipulation
  36. 36. 1. Create model replicas 2. Load balancing by dividing computational tasks into smaller subtasks and letting a coordinator assign those subtasks to appropriate shards
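The load-balancing idea can be illustrated with a small scheduling simulation: a coordinator hands the next small subtask to whichever worker becomes free first, so a slow machine ends up doing less of the batch. This is a simplified sketch; the `schedule` function, the worker speeds and the subtask count are hypothetical, and no actual L-BFGS computation is performed.

```python
import heapq

def schedule(num_subtasks, seconds_per_subtask):
    """Assign subtasks one at a time to the worker that becomes free earliest.

    Returns the number of subtasks each worker completed and the total time.
    """
    free_at = [(0.0, worker) for worker in range(len(seconds_per_subtask))]
    heapq.heapify(free_at)
    done = [0] * len(seconds_per_subtask)
    finish = 0.0
    for _ in range(num_subtasks):
        t, worker = heapq.heappop(free_at)       # earliest available worker
        t += seconds_per_subtask[worker]
        done[worker] += 1
        finish = max(finish, t)
        heapq.heappush(free_at, (t, worker))
    return done, finish

speeds = [1.0, 1.0, 4.0]                         # the third worker is four times slower
done, finish = schedule(12, speeds)

# Baseline for comparison: a static split giving every worker the same share.
static_finish = max(12 // len(speeds) * s for s in speeds)
print(done, finish, static_finish)
```

With small subtasks the slow worker delays the batch by at most one subtask of its own, whereas with a static equal split it determines the total time.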
  37. 37. Asynchrony: model replicas and parameter shards process data independently. Network overhead: only a single fetch per batch
  38. 38. Distributed parameter server: no need for a central parameter server that has to handle all the parameters. Coordinator: a process that balances the load among the shards to prevent slow machines from slowing down or stopping the training
  39. 39. Results
  40. 40. Training speed-up is the number of times the parallelized model is faster 
 compared with a regular model running on a single machine
  41. 41. The numbers in the brackets are the number of model replicas
  42. 42. Closer to the origin is better, in this case cost efficient in terms of money
  43. 43. Conclusion
  44. 44. Significant improvements over single-machine training. DistBelief is CPU oriented due to the CPU-GPU data transfer overhead. Unfortunately, it adds unit connectivity limitations
  45. 45. If neural networks continue to scale up, distributed computing will become essential
  46. 46. Purpose-designed hardware such as Big Sur could address these problems
  47. 47. We are strong together
  48. 48. References: Large Scale Distributed Deep Networks; Going Deeper with Convolutions; ImageNet Classification with Deep Convolutional Neural Networks; Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms; GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015; Big Sur, Facebook, Dec 11, 2015
  49. 49. Hiroyuki Vincent Yamazaki, Jan 8, 2016