One Algorithm to Rule Them All: How to Automate Statistical Computation

Delivered by Alp Kucukelbir (Data Scientist, Columbia University) at the 2016 New York R Conference on April 8th and 9th at Work-Bench.

  1. 1. Can one algorithm rule them all? How to automate statistical computations Alp Kucukelbir COLUMBIA UNIVERSITY
  2. 2. Can one algorithm rule them all? Not yet. (But some tools can help!)
  3. 3. Rajesh Ranganath Dustin Tran Andrew Gelman David Blei
  4. 4. Machine Learning data machine learning hidden patterns We want to discover hidden patterns: to study hard-to-see connections, to predict future outcomes, and to explore causal relationships.
  5. 5. How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).
  6. 6. How do we use machine learning?
  7. 7. statistical model data machine learning expert hidden patterns many months later
  8. 8. statistical model data machine learning expert hidden patterns many months later
  9. 9. statistical model data machine learning expert hidden patterns many months later Statistical Model Make assumptions about data. Capture uncertainties using probability.
  10. 10. statistical model data machine learning expert hidden patterns many months later Statistical Model Make assumptions about data. Capture uncertainties using probability.
  11. 11. statistical model data machine learning expert hidden patterns many months later Statistical Model Make assumptions about data. Capture uncertainties using probability. Machine Learning Expert aka a PhD student.
  12. 12. statistical model data machine learning expert hidden patterns many months later Statistical Model Make assumptions about data. Capture uncertainties using probability. Machine Learning Expert aka a PhD student.
  13. 13. statistical model data machine learning expert hidden patterns many months later Machine learning should be 1. Easy to use 2. Scalable 3. Flexible.
  14. 14. statistical model data automatic tool hidden patterns instant revise Machine learning should be 1. Easy to use 2. Scalable 3. Flexible.
  15. 15. statistical model data automatic tool hidden patterns instant revise Machine learning should be 1. Easy to use 2. Scalable 3. Flexible. “[Statistical] models are developed iteratively: we build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat.” (Box, 1960; Blei, 2014)
  16. 16. What does this automatic tool need to do?
  17. 17. statistical model data machine learning expert hidden patterns many months later
  18. 18. statistical model data inference (maths) inference (algorithm) hidden patterns
  19. 19. statistical model data inference (maths) inference (algorithm) hidden patterns X θ Bayesian Model: likelihood p(X | θ); prior p(θ); model p(X, θ) = p(X | θ) p(θ)
  20. 20. statistical model data inference (maths) inference (algorithm) hidden patterns X θ Bayesian Model: likelihood p(X | θ); prior p(θ); model p(X, θ) = p(X | θ) p(θ). The model describes a data generating process. The latent variables θ capture hidden patterns.
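     A minimal sketch of such a data generating process, in Python with NumPy/SciPy (the Beta-Bernoulli coin-flip model here is a hypothetical example chosen for illustration, not one from the talk):
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)

        # Prior p(theta): the hidden pattern is a coin's bias.
        theta = stats.beta(2, 2).rvs(random_state=rng)
        # Likelihood p(X | theta): observed data generated given theta.
        X = stats.bernoulli(theta).rvs(size=100, random_state=rng)

        # The model is the joint density p(X, theta) = p(X | theta) p(theta).
        def log_joint(theta, X):
            return stats.beta(2, 2).logpdf(theta) + stats.bernoulli(theta).logpmf(X).sum()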
  21. 21. statistical model data inference (maths) inference (algorithm) hidden patterns X θ Bayesian Inference: posterior p(θ | X) = p(X, θ) / ∫ p(X, θ) dθ. The posterior describes hidden patterns given data X. It is typically intractable.
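     To see why the denominator is the hard part, here is a toy check of my own (not from the slides): with a single scalar θ the evidence ∫ p(X, θ) dθ can be brute-forced on a grid, which is exactly the computation that becomes intractable once θ has more than a handful of dimensions.
        import numpy as np
        from scipy import stats

        X = np.array([1, 1, 0, 1, 0, 1, 1, 1])              # hypothetical coin flips
        grid = np.linspace(1e-6, 1 - 1e-6, 10_000)           # grid over theta in (0, 1)

        # Joint p(X, theta) on the grid: Beta(2, 2) prior times Bernoulli likelihood.
        joint = stats.beta(2, 2).pdf(grid) * np.prod(stats.bernoulli(grid[:, None]).pmf(X), axis=1)

        evidence = joint.sum() * (grid[1] - grid[0])          # p(X) ≈ ∫ p(X, theta) dtheta
        posterior = joint / evidence                          # p(theta | X) on the grid
        print(evidence, grid[np.argmax(posterior)])           # evidence and posterior mode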
  22. 22. statistical model data inference (maths) inference (algorithm) hidden patterns X θ Approximating the Posterior. Sampling: draw samples using MCMC. Variational: approximate using a simple function. The computations depend heavily on the model!
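     As a concrete instance of the sampling route, a bare-bones random-walk Metropolis sampler for the toy coin-flip posterior above (an illustrative sketch only, not the MCMC machinery used in practice):
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        X = np.array([1, 1, 0, 1, 0, 1, 1, 1])

        def log_joint(theta):
            if not (0 < theta < 1):
                return -np.inf
            return stats.beta(2, 2).logpdf(theta) + stats.bernoulli(theta).logpmf(X).sum()

        theta, samples = 0.5, []
        for _ in range(5_000):
            proposal = theta + 0.1 * rng.standard_normal()        # symmetric random-walk proposal
            if np.log(rng.uniform()) < log_joint(proposal) - log_joint(theta):
                theta = proposal                                   # accept
            samples.append(theta)

        print(np.mean(samples[1_000:]))    # posterior mean estimate, after burn-in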
  23. 23. Common Statistical Computations. Expectations: E_{q(θ;φ)}[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ. Gradients (of expectations): ∇_φ E_{q(θ;φ)}[log p(X, θ)]. Maximization (by following gradients): max_φ E_{q(θ;φ)}[log p(X, θ)].
  24. 24. Automating Expectations. Monte Carlo sampling. [Figure: f(θ) evaluated at sampled points θ^(s) on the interval [a, a+1].] ∫_a^{a+1} f(θ) dθ ≈ (1/S) Σ_{s=1}^{S} f(θ^(s)), where θ^(s) ∼ Uniform(a, a+1)
  25. 25. Automating Expectations. Monte Carlo sampling. E_{q(θ;φ)}[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ ≈ (1/S) Σ_{s=1}^{S} log p(X, θ^(s)), where θ^(s) ∼ q(θ; φ). Monte Carlo Statistical Methods, Robert and Casella, 1999; Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
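     Both estimators are one line of NumPy once we can draw from the relevant distribution. The sketch below (my own toy choices of f, a, and q) estimates the uniform integral from the previous slide and the expectation under q(θ; φ) from this one, reusing the coin-flip joint from earlier.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        S = 100_000

        # Integral of f over [a, a+1] via uniform draws.
        f, a = np.sin, 3.0
        theta = rng.uniform(a, a + 1, size=S)
        print(f(theta).mean())                        # ≈ ∫_a^{a+1} f(theta) dtheta

        # E_q[log p(X, theta)] for the coin-flip joint, with q(theta; phi) a Gaussian.
        X = np.array([1, 1, 0, 1, 0, 1, 1, 1])
        def log_joint(theta):
            return (stats.beta(2, 2).logpdf(theta)
                    + stats.bernoulli(theta).logpmf(X[:, None]).sum(axis=0))

        q = stats.norm(0.7, 0.05)                     # q(theta; phi) with phi = (0.7, 0.05)
        theta = np.clip(q.rvs(size=S, random_state=rng), 1e-6, 1 - 1e-6)
        print(log_joint(theta).mean())                # ≈ E_q[log p(X, theta)]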
  26. 26. Automating Expectations. Probability Distributions: Stan, GSL (C++); NumPy, SciPy, edward (Python); built-in (R); Distributions.jl (Julia)
  27. 27. Automating Gradients. Symbolic or Automatic Differentiation. Let f(x1, x2) = log x1 + x1 x2 − sin x2. Compute ∂f(2, 5)/∂x1.
     [Table 2 from Baydin et al.: forward-mode AD example, with y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2) at (x1, x2) = (2, 5), setting ẋ1 = 1 to compute ∂y/∂x1. The forward evaluation trace is augmented by the forward derivative trace.]
     Forward evaluation trace: v−1 = x1 = 2; v0 = x2 = 5; v1 = ln v−1 = ln 2; v2 = v−1 × v0 = 2 × 5; v3 = sin v0 = sin 5; v4 = v1 + v2 = 0.693 + 10; v5 = v4 − v3 = 10.693 + 0.959; y = v5 = 11.652
     Forward derivative trace: v̇−1 = ẋ1 = 1; v̇0 = ẋ2 = 0; v̇1 = v̇−1 / v−1 = 1/2; v̇2 = v̇−1 × v0 + v̇0 × v−1 = 1 × 5 + 0 × 2; v̇3 = v̇0 × cos v0 = 0 × cos 5; v̇4 = v̇1 + v̇2 = 0.5 + 5; v̇5 = v̇4 − v̇3 = 5.5 − 0; ẏ = v̇5 = 5.5
     Each intermediate variable vi carries a derivative v̇i = ∂vi/∂x1. Applying the chain rule to each elementary operation in the forward evaluation trace generates the corresponding derivative trace; evaluating each vi together with its v̇i gives the required derivative in the final variable, ∂y/∂x1 = ẏ = 5.5.
     Automatic differentiation in machine learning: a survey, Baydin et al., 2015
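     The forward trace above is easy to mechanize: carry each value together with its derivative and apply the chain rule per elementary operation. A tiny dual-number sketch of that idea (my own illustration, not the survey's code):
        import math

        class Dual:
            """Carries a value and its derivative with respect to one chosen input."""
            def __init__(self, val, dot=0.0):
                self.val, self.dot = val, dot
            def __add__(self, other):
                return Dual(self.val + other.val, self.dot + other.dot)
            def __sub__(self, other):
                return Dual(self.val - other.val, self.dot - other.dot)
            def __mul__(self, other):
                return Dual(self.val * other.val,
                            self.dot * other.val + self.val * other.dot)

        def log(x): return Dual(math.log(x.val), x.dot / x.val)
        def sin(x): return Dual(math.sin(x.val), x.dot * math.cos(x.val))

        def f(x1, x2):
            return log(x1) + x1 * x2 - sin(x2)

        # Seed x1 with derivative 1 to get df/dx1, exactly as in the forward trace.
        y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
        print(y.val, y.dot)    # ≈ 11.652 and 5.5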
  28. 28. #include <stan/math.hpp>
     int main() {
       using namespace std;
       stan::math::var x1 = 2, x2 = 5;
       stan::math::var f;
       f = log(x1) + x1 * x2 - sin(x2);
       cout << "f(x1, x2) = " << f.val() << endl;
       f.grad();
       cout << "df/dx1 = " << x1.adj() << endl
            << "df/dx2 = " << x2.adj() << endl;
       return 0;
     }
     The Stan math library, Carpenter et al., 2015
  29. 29. Automating Gradients. Automatic Differentiation: Stan, Adept, CppAD (C++); autograd, Tensorflow (Python); radx (R); http://www.juliadiff.org/ (Julia). Symbolic Differentiation: SymbolicC++ (C++); SymPy, Theano (Python); Deriv, Ryacas (R); http://www.juliadiff.org/ (Julia)
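     For comparison, autograd (one of the Python tools listed above) recovers the same gradient with no hand-written traces; a minimal sketch:
        import autograd.numpy as np
        from autograd import grad

        def f(x1, x2):
            return np.log(x1) + x1 * x2 - np.sin(x2)

        df_dx1 = grad(f, 0)   # derivative with respect to the first argument
        df_dx2 = grad(f, 1)   # derivative with respect to the second argument
        print(df_dx1(2.0, 5.0), df_dx2(2.0, 5.0))   # ≈ 5.5 and 2 - cos(5)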
  30. 30. Stochastic Optimization. Follow noisy unbiased gradients. [Figure 8.8 from Murphy, 2012: illustration of the LMS algorithm. Left: trajectory from θ = (−0.5, …) toward the least squares solution θ̂ = (1.45, 0.92). Right: RSS vs. iteration; note that it does not decrease monotonically.] Scale up by subsampling the data at each step. Machine Learning: a Probabilistic Perspective, Murphy, 2012
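     A minimal sketch of the idea on synthetic least-squares data (my own toy, not the book's LMS demo): subsample a minibatch, form a noisy but unbiased gradient, and follow it.
        import numpy as np

        rng = np.random.default_rng(3)
        N, true_w = 10_000, np.array([2.0, -1.0])

        A = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])   # design matrix
        y = A @ true_w + 0.1 * rng.standard_normal(N)              # noisy targets

        w, lr, batch = np.zeros(2), 0.1, 32
        for step in range(2_000):
            idx = rng.integers(0, N, size=batch)                   # subsample the data
            Ab, yb = A[idx], y[idx]
            g = 2 * Ab.T @ (Ab @ w - yb) / batch                   # noisy, unbiased gradient of the RSS
            w -= lr * g                                            # follow it
        print(w)    # close to true_w, though the objective does not fall monotonically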
  31. 31. Stochastic Optimization. Generic Implementations: Vowpal Wabbit, sgd (C++); Theano, Tensorflow (Python); sgd (R); SGDOptim.jl (Julia)
  32. 32. ADVI (Automatic Differentiation Variational Inference) An easy-to-use, scalable, flexible algorithm. mc-stan.org Stan is a probabilistic programming system. 1. Write the model in a simple language. 2. Provide data. 3. Run. RStan, PyStan, Stan.jl, ...
  33. 33. How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).
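     In practice one calls ADVI through the Stan interfaces listed above. To make the pieces of this talk concrete, here is a toy, hand-rolled version of the recipe (Monte Carlo expectations + automatic differentiation + stochastic optimization) for the coin-flip model, using a logit transform and a Gaussian approximation with reparameterized gradients. This is my own sketch of the idea, not Stan's implementation.
        import autograd.numpy as np
        from autograd import grad
        from numpy.random import default_rng

        X = np.array([1., 1., 0., 1., 0., 1., 1., 1.])     # hypothetical coin flips

        def log_joint(theta):
            log_prior = np.log(theta) + np.log(1 - theta)  # Beta(2, 2) prior, up to a constant
            log_lik = X.sum() * np.log(theta) + (len(X) - X.sum()) * np.log(1 - theta)
            return log_prior + log_lik

        def elbo(params, eps):
            mu, omega = params[0], params[1]               # variational mean and log-std
            zeta = mu + np.exp(omega) * eps                # reparameterized draws of logit(theta)
            theta = 1 / (1 + np.exp(-zeta))                # map back to (0, 1)
            log_jac = np.log(theta) + np.log(1 - theta)    # log |d theta / d zeta|
            entropy = omega + 0.5 * np.log(2 * np.pi * np.e)
            return np.mean(log_joint(theta) + log_jac) + entropy

        elbo_grad = grad(elbo)                             # automatic differentiation
        params, rng = np.array([0.0, 0.0]), default_rng(4)
        for step in range(2_000):
            eps = rng.standard_normal(10)                  # Monte Carlo noise
            params = params + 0.05 * elbo_grad(params, eps)   # stochastic gradient ascent
        print(params)   # Gaussian approximation to the posterior of logit(theta)
     Stan's ADVI automates these same steps (the transform, the gradients, the stochastic optimization) for arbitrary differentiable models.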
  34. 34. Exploring Taxi Rides Data: 1.7 million taxi rides Write down a pPCA model. (∼minutes) Use ADVI to infer subspace. (∼hours) Project data into pPCA subspace. (∼minutes) Write down a mixture model. (∼minutes) Use ADVI to find patterns. (∼minutes) Write down a supervised pPCA model. (∼minutes) Repeat. (∼hours) What would have taken us weeks → a single day.
  35. 35. statistical model data automatic tool hidden patterns instant revise Monte Carlo Statistical Methods, Robert and Casella, 1999; Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009; Automatic differentiation in machine learning: a survey, Baydin et al., 2015; The Stan math library, Carpenter et al., 2015; Machine Learning: a Probabilistic Perspective, Murphy, 2012; Automatic differentiation variational inference, K et al., 2016. proditus.com mc-stan.org Thank you!
  36. 36. EXTRA SLIDES
  37. 37. Kullback-Leibler Divergence KL(q(θ) ‖ p(θ | X)) = ∫ q(θ) log [q(θ) / p(θ | X)] dθ = E_{q(θ)}[log (q(θ) / p(θ | X))] = E_{q(θ)}[log q(θ) − log p(θ | X)]
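     A quick numerical check of this definition (my own toy): a Monte Carlo estimate of E_q[log q(θ) − log p(θ)] for two Gaussians agrees with the closed-form KL.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(5)
        q = stats.norm(0.0, 1.0)
        p = stats.norm(1.0, 2.0)

        theta = q.rvs(size=200_000, random_state=rng)
        kl_mc = np.mean(q.logpdf(theta) - p.logpdf(theta))       # E_q[log q - log p]

        # Closed form for Gaussians: log(s_p/s_q) + (s_q^2 + (m_q - m_p)^2)/(2 s_p^2) - 1/2
        kl_exact = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
        print(kl_mc, kl_exact)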
  38. 38. Related Objective Function
     L(φ) = log p(X) − KL(q(θ) ‖ p(θ | X))
          = log p(X) − E_{q(θ)}[log q(θ) − log p(θ | X)]
          = log p(X) + E_{q(θ)}[log p(θ | X)] − E_{q(θ)}[log q(θ)]
          = E_{q(θ)}[log p(θ, X)] − E_{q(θ)}[log q(θ)]
          = E_{q(θ;φ)}[log p(X, θ)] (cross-entropy) − E_{q(θ;φ)}[log q(θ; φ)] (entropy)
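     The identity L(φ) = log p(X) − KL(q(θ) ‖ p(θ | X)) can be verified numerically on a conjugate model, where both the evidence and the posterior are available in closed form (a toy check of my own, using a Gaussian mean model):
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(6)
        X = rng.normal(1.0, 1.0, size=20)                  # synthetic data

        # Model: theta ~ N(0, 1), X_i | theta ~ N(theta, 1). Conjugate Gaussian posterior:
        n = len(X)
        post = stats.norm(X.sum() / (n + 1), np.sqrt(1.0 / (n + 1)))

        def log_joint(theta):
            theta = np.atleast_1d(theta)
            log_prior = stats.norm(0, 1).logpdf(theta)
            sq = ((X[:, None] - theta[None, :]) ** 2).sum(axis=0)
            log_lik = -0.5 * n * np.log(2 * np.pi) - 0.5 * sq
            return log_prior + log_lik

        # Exact evidence from log p(X) = log p(X, theta) - log p(theta | X), at any theta.
        log_pX = log_joint(0.0)[0] - post.logpdf(0.0)

        # An arbitrary variational approximation q(theta; phi) and its Monte Carlo ELBO.
        q = stats.norm(0.5, 0.3)
        theta = q.rvs(size=100_000, random_state=rng)
        elbo = np.mean(log_joint(theta) - q.logpdf(theta))

        # Closed-form KL(q || posterior) between the two Gaussians.
        kl = (np.log(post.std() / 0.3)
              + (0.3**2 + (0.5 - post.mean()) ** 2) / (2 * post.std() ** 2) - 0.5)
        print(elbo, log_pX - kl)    # the two agree up to Monte Carlo error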
