Scaling Up Machine Learning, the Tutorial, KDD 2011 (Part III Summary + GPU learning + Terascale linear learning)

The 30000’ Summary and 3 Cool Results
John Langford (Yahoo!) with Ron Bekkerman (LinkedIn) and Misha Bilenko (Microsoft)
http://hunch.net/~large_scale_survey
This hour’s outline
  1. A summary of results
  2. Cool uses of GPUs
  3. Terascale linear learning
A comparison is virtually impossible
Prediction performance varies wildly depending on the problem–method pair.
But we can try: use input complexity/time.
⇒ No credit for creating complexity then reducing it. (Ouch!)
Most interesting results reported. Some cases require a creative best-effort summary.
Chart: speed per method (features/s, log scale) for unsupervised learning, single vs. parallel implementations (spectral clustering, LDA variants, k-means, RBM on GPU).
Chart: speed per method (features/s) for supervised training, single vs. parallel (RBF-SVM, ensemble and boosted decision trees, linear learning).
Chart: speed per method (features/s) for supervised testing (but not training), single vs. parallel (speech WFST on GPU, MPI inference, convolutional NN on FPGA/GPU/ASIC).
My Flow Chart for Learning Optimization
  1. Choose an efficient, effective algorithm.
  2. Use compact binary representations.
  3. If (computationally constrained) then GPU
     else if (few learning steps) then Map-Reduce AllReduce
     else Research Problem.
This hour’s outline
  1. A summary of results
  2. Cool uses of GPUs
     1. RBM learning
     2. Speech Recognition
  3. Terascale linear learning
Restricted Boltzmann Machine learning (CRN11, Chapter 18)
Goal: learn weights that predict the hidden state given the features, and that can predict the features given the hidden state.* (Diagram: hidden layer connected to observed layer.)
  1. Number of parameters = hidden × observed = quadratic pain.
  2. An empirically useful method for creating relevant features for supervised learning.
(*) Lots of extra details here.
RBM parallelization
GPU = hundreds of weak processors doing vector operations on shared memory.
  1. The activation level of hidden node i is sig(Σ_j w_ij x_j). A GPU is perfectly designed for a dense matrix/vector dot product.
  2. Given activation levels, hidden nodes are independently randomly rounded to {0, 1}. Good for GPUs.
  3. Predict features given hidden units, just as in step 1. Perfect for GPUs.
  4. Shift weights to make the reconstruction more accurate. Perfect for GPUs.
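To make the four steps concrete, here is a minimal NumPy sketch of one contrastive-divergence-style update (my simplified illustration, with biases and many details omitted; the chapter's GPU implementation differs). Every step is a dense matrix operation, which is exactly what a GPU is built for.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_update(W, X, lr=0.01, rng=np.random.default_rng(0)):
    """One simplified CD-1-style update on a minibatch X (examples x observed)."""
    # 1. Activation levels of hidden nodes: a dense matrix product.
    h_prob = sigmoid(X @ W)                                   # (examples x hidden)
    # 2. Independently round hidden nodes to {0, 1}.
    h = (rng.random(h_prob.shape) < h_prob).astype(X.dtype)
    # 3. Predict (reconstruct) features from hidden units, just as in step 1.
    x_recon = sigmoid(h @ W.T)                                # (examples x observed)
    h_recon = sigmoid(x_recon @ W)
    # 4. Shift weights to make the reconstruction more accurate.
    W += lr * (X.T @ h_prob - x_recon.T @ h_recon) / X.shape[0]
    return W

# Toy usage with small dimensions (observed x hidden weight matrix).
W = np.random.default_rng(1).normal(scale=0.01, size=(784, 128))
X = (np.random.default_rng(2).random((64, 784)) < 0.1).astype(np.float64)
W = rbm_update(W, X)
```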
Parallelization Techniques
  1. Store the model in GPU memory and stream data.
  2. Use existing GPU-optimized matrix operation code.
  3. Use multicore GPU parallelism for the rest.
This is a best-case situation for GPUs: 10x to 55x speedups observed.
But maybe we just sped up a slow algorithm?
GPUs for Speech Recognition (CGYK11, Chapter 21)
Given observed utterances, we want to reconstruct the original (hidden) sequence of words via an HMM structure. (Diagram: hidden states emitting observations.)
Standard method of decoding: the forward-backward algorithm, using Bayes’ law to find the most probable utterance.
Naively, this is trivially parallelized just as before. But it’s not:
  1. The observation is non-binary. The standard approach matches the observed sound against one of very many recorded sounds via nearest-neighbor search.
  2. The state transitions are commonly beam-searched rather than handled by Bayesian integration.
  3. The entire structure is compiled into a weighted finite state transducer (WFST), which is what’s really optimized.
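For point 2, a minimal serial sketch of beam-pruned Viterbi decoding over a toy HMM (my illustration only; the chapter's decoder operates on a compiled WFST with GPU-specific data structures):

```python
import numpy as np

def beam_viterbi(log_trans, log_emit, obs, beam=8):
    """Beam-pruned Viterbi decoding over a toy HMM.

    log_trans[i, j]: log P(state j | state i)
    log_emit[j, o]:  log P(observation o | state j)
    obs:             sequence of observation indices
    Only the `beam` best hypotheses are kept per frame; the rest are pruned.
    """
    n_states = log_trans.shape[0]
    # Active hypotheses: state -> (score, path so far)
    active = {s: (log_emit[s, obs[0]], [s]) for s in range(n_states)}
    for o in obs[1:]:
        scored = {}
        for s, (score, path) in active.items():
            for t in range(n_states):
                new = score + log_trans[s, t] + log_emit[t, o]
                if t not in scored or new > scored[t][0]:
                    scored[t] = (new, path + [t])
        # Beam pruning: keep only the highest-scoring states.
        best = sorted(scored.items(), key=lambda kv: kv[1][0], reverse=True)[:beam]
        active = dict(best)
    return max(active.values(), key=lambda v: v[0])[1]

# Toy usage: 3 states, 2 observation symbols.
T = np.log(np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]))
E = np.log(np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]))
print(beam_viterbi(T, E, [0, 0, 1, 1, 1]))
```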
The approach used
Start with a careful, systematic analysis of where parallelization might help.
  1. SIMD instructions: use carefully arranged data structures so single-instruction-multiple-data works.
  2. Multicore parallelism over the GPU’s 30 cores.
  3. Use atomic instructions (atomic max, atomic swap) = thread-safe primitives.
  4. Stick the model in GPU memory, using GPU memory as (essentially) a monstrous cache.
Result: 10.5x speedup. Crucially, this makes the algorithm faster than real time. GPUs help, even for highly optimized algorithms.
This hour’s outline
  1. A summary of results
  2. Cool uses of GPUs
     1. RBM learning
     2. Speech Recognition
  3. Terascale linear learning
Terascale Linear Learning (ACDL11)
Given 2.1 terafeatures of data, how can you learn a linear predictor f_w(x) = Σ_i w_i x_i?
  1. No single-machine algorithm.
  2. No multi-machine algorithm requiring bandwidth ∝ terabytes for any single machine.
It is necessary but not sufficient to have an efficient communication mechanism.
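A rough sense of scale (my arithmetic, using the figures from the empirical-results slide below and assuming 8-byte weights): syncing the 16M-parameter model costs about 16M × 8 bytes ≈ 128 MB per node per pass, whereas shipping the 2.1-terafeature dataset to any one machine would require terabytes of bandwidth. Syncing the model is cheap; centralizing the data is not.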
MPI-style AllReduce
AllReduce = Reduce + Broadcast over a spanning tree of the nodes.
Example (7-node binary tree holding the values 1 through 7: root = 7, internal nodes = 5 and 6, leaves = 1, 2, 3, 4):
  Reducing, step 1: each internal node adds its children’s values to its own (5+1+2 = 8, 6+3+4 = 13).
  Reducing, step 2: the root adds its children’s totals (7+8+13 = 28).
  Broadcast: the root’s total is passed back down, so every node ends with 28.
Properties:
  1. Easily pipelined, so no latency concerns.
  2. Bandwidth ≤ 6n.
  3. No need to rewrite code!
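A minimal single-process simulation of tree AllReduce (sum), using the 7-node tree from the example above. This only illustrates the operation; it is not the custom MPI/Hadoop implementation described here.

```python
import numpy as np

def depth(i, parent):
    d = 0
    while parent[i] != i:
        i, d = parent[i], d + 1
    return d

def allreduce_sum(values, parent):
    """Simulate tree AllReduce (reduce up, then broadcast down) in one process.

    values: list of per-node vectors (one per node).
    parent: parent[i] is the parent of node i; the root has parent[i] == i.
    Returns a list in which every node holds the global sum.
    """
    n = len(values)
    totals = [v.copy() for v in values]
    # Reduce: process nodes deepest-first so children are summed before parents.
    for i in sorted(range(n), key=lambda i: -depth(i, parent)):
        if parent[i] != i:
            totals[parent[i]] += totals[i]
    root = next(i for i in range(n) if parent[i] == i)
    # Broadcast: every node receives the root's total.
    return [totals[root].copy() for _ in range(n)]

# 7-node binary tree from the slides: node 0 is the root holding 7.
parent = [0, 0, 0, 1, 1, 2, 2]
values = [np.array([float(v)]) for v in [7, 5, 6, 1, 2, 3, 4]]
print([v[0] for v in allreduce_sum(values, parent)])  # seven copies of 28.0
```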
An Example Algorithm: Weight averaging
n = AllReduce(1)
While (pass number < max):
  1. While (examples left): do online update.
  2. AllReduce(weights)
  3. For each weight: w ← w/n
Other algorithms implemented:
  1. Nonuniform averaging for online learning
  2. Conjugate gradient
  3. L-BFGS
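A sketch of the weight-averaging loop on top of the simulated allreduce_sum above, assuming a plain squared-loss SGD update per node. This is purely illustrative; the actual system is Vowpal Wabbit.

```python
import numpy as np  # allreduce_sum and parent come from the previous sketch

def averaged_sgd(node_data, parent, passes=2, lr=0.1, dim=4):
    """Weight averaging across nodes via (simulated) AllReduce.

    node_data: per node, a list of (x, y) pairs with x a length-`dim` array.
    """
    weights = [np.zeros(dim) for _ in node_data]
    # n = AllReduce(1): every node learns the total node count.
    n = allreduce_sum([np.array([1.0]) for _ in node_data], parent)[0][0]
    for _ in range(passes):
        for w, data in zip(weights, node_data):
            for x, y in data:                          # online updates on local examples
                w -= lr * 2.0 * (w @ x - y) * x
        summed = allreduce_sum(weights, parent)        # AllReduce(weights)
        weights = [s / n for s in summed]              # w <- w / n on every node
    return weights[0]
```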
Approach Used: Preliminaries
Optimize so few data passes are required.
Basic problem with gradient descent = confused units. For f_w(x) = Σ_i w_i x_i,
  ∂(f_w(x) − y)² / ∂w_i = 2(f_w(x) − y) x_i,
which has units of x_i. But w_i naturally has units of 1/x_i, since doubling x_i implies halving w_i to get the same prediction.
Crude fixes:
  1. Newton: multiply the gradient by the inverse Hessian (∂²/∂w_i ∂w_j)^(−1) to get the update direction... but computational complexity kills you.
  2. Normalize the update so the total step size is controlled... but this works only globally rather than per dimension.
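A tiny numerical illustration of the confused-units problem (my toy example, not from the slides): rescaling a feature makes the plain gradient step larger in exactly the coordinate where the correct weight should be smaller.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    # Plain squared-loss gradient step: the update to w_i is proportional to x_i.
    return w - lr * 2.0 * (w @ x - y) * x

x = np.array([1.0, 1.0]); y = 1.0
print(sgd_step(np.zeros(2), x, y))          # [0.2 0.2]

# Rescale feature 0 by 100 (same information, different units):
# its correct weight shrinks by 100x, yet the raw gradient step grows by 100x.
x_scaled = np.array([100.0, 1.0])
print(sgd_step(np.zeros(2), x_scaled, y))   # [20.  0.2]
```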
Approach Used
  1. Optimize hard so few data passes are required.
     1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to (Δw Δw^T)/(Δw^T Δg), where Δw is a change in the weights w and Δg is a change in the loss gradient g.
     2. Dimensionally correct, adaptive, online gradient descent for small-multiple passes:
        1. Online = update weights after seeing each example.
        2. Adaptive = learning rate of feature i scales as 1/√(Σ g_i²), where the g_i are previous gradients.
        3. Dimensionally correct = still works if you double all feature values.
     3. Use (2) to warmstart (1).
  2. Use map-only Hadoop for process control and error recovery.
  3. Use custom AllReduce code to sync state.
  4. Always save input examples in a cache file to speed later passes.
  5. Use the hashing trick to reduce input complexity.
Open source in Vowpal Wabbit 6.0. Search for it.
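A minimal sketch of the adaptive online piece (item 1.2) together with the hashing trick (item 5), assuming squared loss and an AdaGrad-style per-feature rate. It omits the dimensional-correctness and importance-weight refinements from the references, and it is not Vowpal Wabbit's implementation; the feature names below are hypothetical.

```python
import numpy as np

def hash_features(tokens, num_bits=18):
    """Hashing trick: map raw (name, value) tokens into a fixed-size sparse vector.

    Collisions are simply tolerated; that is the point of the trick.
    """
    size = 1 << num_bits
    idx = np.array([hash(name) % size for name, _ in tokens])
    val = np.array([value for _, value in tokens])
    return idx, val

def adaptive_update(w, g2, idx, val, y, lr=0.5):
    """Online update with per-feature learning rate lr / sqrt(sum of squared gradients)."""
    pred = w[idx] @ val
    grad = 2.0 * (pred - y) * val          # squared-loss gradient for the touched weights
    g2[idx] += grad ** 2
    w[idx] -= lr * grad / np.sqrt(g2[idx])
    return pred

# Toy usage.
num_bits = 18
w = np.zeros(1 << num_bits)
g2 = np.full(1 << num_bits, 1e-8)          # avoids division by zero
examples = [([("age", 0.3), ("clicked_before", 1.0)], 1.0),
            ([("age", 0.8)], 0.0)]
for tokens, y in examples:
    idx, val = hash_features(tokens, num_bits)
    adaptive_update(w, g2, idx, val, y)
```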
Empirical Results
  2.1T sparse features
  17B examples
  16M parameters
  1K nodes
  70 minutes
The end
Right now there is extreme diversity:
  1. Many different notions of large scale.
  2. Many different approaches.
What works generally? What are the natural “kinds” of large-scale learning problems? And what are good solutions for each kind?
The great diversity implies this is really the beginning.
Numbers References
  RBMs: Adam Coates, Rajat Raina, and Andrew Y. Ng, Large Scale Learning for Vision with GPUs, in the book.
  Speech: Jike Chong, Ekaterina Gonina, Kisun You, and Kurt Keutzer, in the book.
  Teralinear: Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford, Vowpal Wabbit 6.0.
  LDA II: M. Hoffman, D. Blei, F. Bach, “Online Learning for Latent Dirichlet Allocation,” Neural Information Processing Systems (NIPS) 2010, Vancouver, 2010.
  LDA III: Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, Alexander Smola, Scalable Inference in Latent Variable Models, in submission.
Online References
  Adapt I: H. Brendan McMahan and Matthew Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
  Adapt II: John Duchi, Elad Hazan, and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010.
  Dim I: http://www.machinedlearnings.com/2011/06/dimensional-analysis-and-gradient.html
  Import: Nikos Karampatziakis, John Langford, Online Importance Weight Aware Updates, UAI 2011.
Batch References
  BFGS I: Broyden, C. G. (1970), “The convergence of a class of double-rank minimization algorithms”, Journal of the Institute of Mathematics and Its Applications 6: 76-90.
  BFGS II: Fletcher, R. (1970), “A New Approach to Variable Metric Algorithms”, Computer Journal 13 (3): 317-322.
  BFGS III: Goldfarb, D. (1970), “A Family of Variable Metric Updates Derived by Variational Means”, Mathematics of Computation 24 (109): 23-26.
  BFGS IV: Shanno, David F. (July 1970), “Conditioning of quasi-Newton methods for function minimization”, Math. Comput. 24: 647-656.
  L-BFGS: Nocedal, J. (1980), “Updating Quasi-Newton Matrices with Limited Storage”, Mathematics of Computation 35: 773-782.
