
- 1. Can one algorithm rule them all? How to automate statistical computations. Alp Kucukelbir, Columbia University
- 2. Can one algorithm rule them all? Not yet. (But some tools can help!)
- 3. Rajesh Ranganath Dustin Tran Andrew Gelman David Blei
- 4. Machine Learning: data machine learning hidden patterns. We want to discover hidden patterns: to study hard-to-see connections, to predict future outcomes, and to explore causal relationships.
- 5. How taxis navigate the city of Porto [1.7m trips] (Kucukelbir et al., 2016).
- 6. How do we use machine learning?
- 7. statistical model data machine learning expert hidden patterns many months later
- 8. statistical model data machine learning expert hidden patterns many months later
- 9. statistical model data machine learning expert hidden patterns many months later Statistical Model Make assumptions about data. Capture uncertainties using probability.
- 10. statistical model data machine learning expert hidden patterns many months later Statistical Model Make assumptions about data. Capture uncertainties using probability.
- 11. statistical model data machine learning expert hidden patterns many months later Statistical Model Make assumptions about data. Capture uncertainties using probability. Machine Learning Expert aka a PhD student.
- 12. statistical model data machine learning expert hidden patterns many months later Statistical Model Make assumptions about data. Capture uncertainties using probability. Machine Learning Expert aka a PhD student.
- 13. statistical model data machine learning expert hidden patterns many months later Machine learning should be 1. Easy to use 2. Scalable 3. Flexible.
- 14. statistical model data automatic tool hidden patterns instant revise Machine learning should be 1. Easy to use 2. Scalable 3. Flexible.
- 15. statistical model data automatic tool hidden patterns instant revise Machine learning should be 1. Easy to use 2. Scalable 3. Flexible. “[Statistical] models are developed iteratively: we build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat.” (Box, 1960; Blei, 2014)
- 16. What does this automatic tool need to do?
- 17. statistical model data machine learning expert hidden patterns many months later
- 18. statistical model data inference (maths) inference (algorithm) hidden patterns
- 19. statistical model data inference (maths) inference (algorithm) hidden patterns X θ Bayesian Model likelihood p(X | θ) model p(X,θ) = p(X | θ) p(θ) prior p(θ)
- 20. statistical model data inference (maths) inference (algorithm) hidden patterns X θ Bayesian Model likelihood p(X | θ) model p(X,θ) = p(X | θ) p(θ) prior p(θ) The model describes a data generating process. The latent variables θ capture hidden patterns.
- 21. statistical model data inference (maths) inference (algorithm) hidden patterns X θ Bayesian Inference: posterior p(θ | X) = p(X, θ) / ∫ p(X, θ) dθ. The posterior describes hidden patterns given data X. It is typically intractable.
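In one dimension the intractable normalizing integral can simply be approximated on a grid. A toy Beta-Bernoulli sketch (model and data are my choices, purely illustrative; the deck's models are far larger):

```python
import numpy as np

# Toy model: prior p(theta) = Uniform(0,1), likelihood p(X|theta) = Bernoulli(theta).
# The posterior needs the integral of p(X, theta) over theta; in 1-D we can
# approximate that normalizing constant with a Riemann sum on a grid.
X = np.array([1, 1, 0, 1, 0, 1, 1, 1])       # 6 heads, 2 tails
theta = np.linspace(1e-6, 1 - 1e-6, 100_000)
dx = theta[1] - theta[0]
joint = theta ** X.sum() * (1 - theta) ** (len(X) - X.sum())  # p(X|theta) * p(theta)
evidence = joint.sum() * dx                  # ~ integral of p(X, theta) dtheta
posterior = joint / evidence                 # grid approximation of p(theta | X)

post_mean = (theta * posterior).sum() * dx
# Conjugacy gives a Beta(1+6, 1+2) posterior, whose mean is 7/10.
```

Grids are hopeless beyond a few dimensions, which is exactly why the deck turns to sampling and variational approximations next.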
- 22. statistical model data inference (maths) inference (algorithm) hidden patterns X θ Approximating the Posterior Sampling draw samples using MCMC Variational approximate using a simple function The computations depend heavily on the model!
- 23. Common Statistical Computations. Expectations: E_{q(θ;φ)}[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ. Gradients (of expectations): ∇_φ E_{q(θ;φ)}[log p(X, θ)]. Maximization (by following gradients): max_φ E_{q(θ;φ)}[log p(X, θ)].
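The gradient-of-expectation computation can itself be estimated by Monte Carlo. A minimal sketch using the reparameterization trick with a Gaussian q (the choice of q and of the integrand is mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Take q(theta; phi) = Normal(phi, 1) and f(theta) = theta**2, so that
# E_q[f] = phi**2 + 1 and the exact gradient is d/dphi E_q[f] = 2 * phi.
phi = 1.5

# Reparameterize: theta = phi + eps with eps ~ Normal(0, 1), so the
# gradient moves inside the expectation: d/dphi E[f(phi + eps)] = E[2*(phi + eps)].
eps = rng.standard_normal(200_000)
grad_estimate = np.mean(2.0 * (phi + eps))
# grad_estimate is close to the exact value 2 * phi = 3.0
```

This "differentiate through the samples" pattern is what automatic differentiation makes mechanical, as the next slides show.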
- 24. Automating Expectations: Monte Carlo sampling. [Figure: f(θ) over the interval [a, a+1], approximated from sampled values f(θ^(s)).] ∫_a^{a+1} f(θ) dθ ≈ (1/S) Σ_{s=1}^{S} f(θ^(s)), where θ^(s) ∼ Uniform(a, a+1)
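The uniform-sampling estimate fits in a few lines. A quick sketch with a concrete f (my choice, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_integral(f, a, S=500_000):
    """Estimate the integral of f over [a, a+1] by Monte Carlo:
    (1/S) * sum_s f(theta_s) with theta_s ~ Uniform(a, a+1)."""
    theta = rng.uniform(a, a + 1.0, size=S)
    return np.mean(f(theta))

# Example: the integral of theta**2 over [0, 1] is exactly 1/3.
est = mc_integral(lambda t: t ** 2, 0.0)
```

The estimate converges at the usual O(1/sqrt(S)) Monte Carlo rate regardless of how complicated f is.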
- 25. Automating Expectations: Monte Carlo sampling. E_{q(θ;φ)}[log p(X, θ)] = ∫ log p(X, θ) q(θ; φ) dθ ≈ (1/S) Σ_{s=1}^{S} log p(X, θ^(s)), where θ^(s) ∼ q(θ; φ). Monte Carlo Statistical Methods, Robert and Casella, 1999. Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009.
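Swapping the uniform distribution for draws from q(θ; φ) gives the expectation estimator on the slide. A minimal sketch with a toy log joint (a made-up stand-in for a real model's log p(X, θ)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in log joint: log p(X, theta) = -theta**2 / 2.
# Any function of theta would do here; real models plug in their own.
log_joint = lambda theta: -theta ** 2 / 2.0

# With q(theta; phi) = Normal(0, 1): E_q[-theta**2/2] = -E[theta**2]/2 = -1/2.
theta = rng.standard_normal(500_000)          # theta_s ~ q(theta; phi)
expectation = np.mean(log_joint(theta))       # ~ E_q[log p(X, theta)]
```

All a library needs to automate this is the ability to sample from q, which is what the distribution packages on the next slide provide.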
- 26. Automating Expectations. Probability Distributions: Stan, GSL (C++); NumPy, SciPy, edward (Python); built-in (R); Distributions.jl (Julia)
- 27. Automating Gradients: Symbolic or Automatic Differentiation. Let f(x1, x2) = log x1 + x1 x2 − sin x2. Compute ∂f(2, 5)/∂x1. Forward-mode AD augments the evaluation trace with a derivative trace; setting ẋ1 = 1 computes ∂y/∂x1:

  Forward evaluation trace          Forward derivative trace
  v−1 = x1 = 2                      v̇−1 = ẋ1 = 1
  v0 = x2 = 5                       v̇0 = ẋ2 = 0
  v1 = ln v−1 = ln 2                v̇1 = v̇−1 / v−1 = 1/2
  v2 = v−1 × v0 = 2 × 5             v̇2 = v̇−1 × v0 + v̇0 × v−1 = 1×5 + 0×2
  v3 = sin v0 = sin 5               v̇3 = v̇0 × cos v0 = 0 × cos 5
  v4 = v1 + v2 = 0.693 + 10         v̇4 = v̇1 + v̇2 = 0.5 + 5
  v5 = v4 − v3 = 10.693 + 0.959     v̇5 = v̇4 − v̇3 = 5.5 − 0
  y  = v5 = 11.652                  ẏ  = v̇5 = 5.5

  Each intermediate variable vi carries a derivative v̇i = ∂vi/∂x1; applying the chain rule to each elementary operation yields the derivative trace, and the final variable gives ∂y/∂x1.

  Automatic differentiation in machine learning: a survey, Baydin et al., 2015
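Forward mode can be implemented with dual numbers that carry a (value, derivative) pair through every elementary operation. A minimal Python sketch reproducing the trace above (names and class are mine; real AD libraries are far more complete):

```python
import math

class Dual:
    """A (value, derivative) pair; each operation applies the chain rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + v'u
        return Dual(self.val * other.val,
                    self.dot * other.val + other.dot * self.val)

def dlog(x):
    return Dual(math.log(x.val), x.dot / x.val)

def dsin(x):
    return Dual(math.sin(x.val), x.dot * math.cos(x.val))

def f(x1, x2):
    return dlog(x1) + x1 * x2 - dsin(x2)

# Seed x1's tangent with 1 to compute dy/dx1 at (2, 5).
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
# y.val ~ 11.652 and y.dot = 1/2 + 5 = 5.5, matching the table above.
```

One forward pass yields the derivative with respect to one input; reverse mode (as in the Stan example on the next slide) gets all input derivatives in a single backward sweep.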
- 28. Reverse-mode AD with the Stan math library:

  #include <stan/math.hpp>

  int main() {
    using namespace std;
    stan::math::var x1 = 2, x2 = 5;
    stan::math::var f = log(x1) + x1 * x2 - sin(x2);
    cout << "f(x1, x2) = " << f.val() << endl;
    f.grad();
    cout << "df/dx1 = " << x1.adj() << endl
         << "df/dx2 = " << x2.adj() << endl;
    return 0;
  }

  The Stan math library, Carpenter et al., 2015
- 29. Automating Gradients. Automatic Differentiation: Stan, Adept, CppAD (C++); autograd, TensorFlow (Python); radx (R); http://www.juliadiff.org/ (Julia). Symbolic Differentiation: SymbolicC++ (C++); SymPy, Theano (Python); Deriv, Ryacas (R); http://www.juliadiff.org/ (Julia)
- 30. Stochastic Optimization. Follow noisy unbiased gradients. [Figure 8.8 from Murphy, 2012: the LMS algorithm's trajectory converging to the least squares solution, alongside the objective per iteration, which does not decrease monotonically.] Scale up by subsampling the data at each step. Machine Learning: a Probabilistic Perspective, Murphy, 2012
- 31. Stochastic Optimization. Generic Implementations: Vowpal Wabbit, sgd (C++); Theano, TensorFlow (Python); sgd (R); SGDOptim.jl (Julia)
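Subsampled gradient steps for a least squares objective fit in a few lines of NumPy. A minimal sketch (synthetic data, constant step size, and batch size are all my choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic least squares problem: y = X @ w_true + noise.
N, D = 10_000, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((N, D))
y = X @ w_true + 0.1 * rng.standard_normal(N)

w = np.zeros(D)
step, batch = 0.01, 64
for t in range(2_000):
    idx = rng.integers(0, N, size=batch)                    # subsample the data
    grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch   # noisy, unbiased gradient
    w -= step * grad
# w ends up near w_true despite never touching the full-data gradient.
```

Each minibatch gradient is an unbiased estimate of the full gradient, which is exactly the "follow noisy unbiased gradients" recipe from the previous slide.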
- 32. ADVI (Automatic Differentiation Variational Inference). An easy-to-use, scalable, flexible algorithm. mc-stan.org. Stan is a probabilistic programming system. 1. Write the model in a simple language. 2. Provide data. 3. Run. RStan, PyStan, Stan.jl, ...
- 33. How taxis navigate the city of Porto [1.7m trips] (Kucukelbir et al., 2016).
- 34. Exploring Taxi Rides. Data: 1.7 million taxi rides. Write down a pPCA model. (∼minutes) Use ADVI to infer subspace. (∼hours) Project data into pPCA subspace. (∼minutes) Write down a mixture model. (∼minutes) Use ADVI to find patterns. (∼minutes) Write down a supervised pPCA model. (∼minutes) Repeat. (∼hours) What would have taken us weeks → a single day.
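The subspace-then-mixture loop can be mimicked on toy data with plain NumPy, using PCA and k-means as simple stand-ins for the pPCA and mixture models in the slide (the data and every name below are my own, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: two clusters living in a 2-D subspace of a 10-D space.
basis = rng.standard_normal((2, 10))
latents = np.vstack([rng.standard_normal((100, 2)) + [5.0, 0.0],
                     rng.standard_normal((100, 2)) - [5.0, 0.0]])
data = latents @ basis + 0.1 * rng.standard_normal((200, 10))

# "Infer the subspace": top-2 principal components via SVD.
centered = data - data.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ Vt[:2].T

# "Find patterns": a few k-means iterations in the subspace,
# initialized at the extremes of the first principal component.
centers = proj[[proj[:, 0].argmin(), proj[:, 0].argmax()]]
for _ in range(20):
    labels = np.linalg.norm(proj[:, None] - centers[None], axis=2).argmin(axis=1)
    centers = np.array([proj[labels == k].mean(axis=0) for k in range(2)])
# The two planted clusters come out as the two k-means labels.
```

The point of the slide survives the simplification: once inference is automatic, each "project, then cluster, then revise" round takes minutes rather than a bespoke derivation.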
- 35. statistical model data automatic tool hidden patterns instant revise. Monte Carlo Statistical Methods, Robert and Casella, 1999. Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009. Automatic differentiation in machine learning: a survey, Baydin et al., 2015. The Stan math library, Carpenter et al., 2015. Machine Learning: a Probabilistic Perspective, Murphy, 2012. Automatic differentiation variational inference, Kucukelbir et al., 2016. proditus.com mc-stan.org Thank you!
- 36. EXTRA SLIDES
- 37. Kullback-Leibler Divergence. KL(q(θ) ‖ p(θ | X)) = ∫ q(θ) log [q(θ) / p(θ | X)] dθ = E_{q(θ)}[log (q(θ) / p(θ | X))] = E_{q(θ)}[log q(θ) − log p(θ | X)]
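KL is itself an expectation under q, so it too can be estimated by sampling. A sketch with two Gaussians, where the closed form is available to check against (the choice of q and p is mine):

```python
import numpy as np

rng = np.random.default_rng(5)

# q = Normal(0, 1), p = Normal(1, 1).  For unit variances the closed form is
# KL(q || p) = (mu_q - mu_p)**2 / 2 = 0.5.
log_q = lambda t: -0.5 * t ** 2 - 0.5 * np.log(2 * np.pi)
log_p = lambda t: -0.5 * (t - 1.0) ** 2 - 0.5 * np.log(2 * np.pi)

theta = rng.standard_normal(500_000)                  # samples from q
kl_estimate = np.mean(log_q(theta) - log_p(theta))    # E_q[log q - log p]
# kl_estimate is close to 0.5
```

In variational inference p(θ | X) is intractable, so the KL cannot be evaluated this way directly; the next slide's objective sidesteps that.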
- 38. Related Objective Function. L(φ) = log p(X) − KL(q(θ) ‖ p(θ | X)) = log p(X) − E_{q(θ)}[log q(θ) − log p(θ | X)] = log p(X) + E_{q(θ)}[log p(X, θ)] − log p(X) − E_{q(θ)}[log q(θ)] = E_{q(θ;φ)}[log p(X, θ)] (cross-entropy) − E_{q(θ;φ)}[log q(θ; φ)] (entropy)
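The identity L(φ) = log p(X) − KL can be checked numerically in a conjugate model where every quantity has a closed form. A grid-based sketch (the toy model is my choice):

```python
import numpy as np

# Conjugate model: theta ~ Normal(0, 1), x | theta ~ Normal(theta, 1).
# Then the evidence is p(x) = Normal(x; 0, 2) and the exact posterior
# is p(theta | x) = Normal(x/2, 1/2).
x = 1.3
log_evidence = -0.5 * x ** 2 / 2.0 - 0.5 * np.log(2 * np.pi * 2.0)

# Take q equal to the exact posterior, so KL(q || posterior) = 0 and
# the ELBO should equal log p(x) exactly.
theta = np.linspace(-8.0, 8.0, 200_001)
dt = theta[1] - theta[0]
log_joint = (-0.5 * theta ** 2                 # log prior
             - 0.5 * (x - theta) ** 2          # log likelihood
             - np.log(2 * np.pi))              # both Gaussian constants
q = np.exp(-((theta - x / 2) ** 2) / (2 * 0.5)) / np.sqrt(2 * np.pi * 0.5)
elbo = (q * (log_joint - np.log(q))).sum() * dt   # E_q[log p(x,theta) - log q]
# elbo matches log_evidence, since the KL term vanishes here
```

With any other q the ELBO would fall short of log p(X) by exactly KL(q ‖ p(θ | X)), which is why maximizing the ELBO minimizes the KL.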
