H2O World 2015 - Stephen Boyd


- 1. Consensus Optimization and Machine Learning Stephen Boyd and Steven Diamond EE & CS Departments Stanford University H2O World, 11/10/2015 1
- 2. Outline Convex optimization Model fitting via convex optimization Consensus optimization and model fitting 2
- 3. Outline Convex optimization Model fitting via convex optimization Consensus optimization and model fitting Convex optimization 3
- 4. Convex optimization problem
  convex optimization problem:
      minimize    f0(x)
      subject to  fi(x) ≤ 0, i = 1, ..., m
                  Ax = b
  variable x ∈ R^n; equality constraints are linear
  f0, ..., fm are convex: for θ ∈ [0, 1],
      fi(θx + (1 − θ)y) ≤ θ fi(x) + (1 − θ) fi(y)
  i.e., fi have nonnegative (upward) curvature
  Convex optimization 4
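To make the convexity inequality concrete, here is a small numerical spot-check (an illustrative sketch, not part of the slides) using the convex function f(u) = u²:

```python
# Numerically spot-check the convexity inequality
#   f(θx + (1-θ)y) <= θ f(x) + (1-θ) f(y)
# for the (hypothetically chosen) convex function f(u) = u**2.

def f(u):
    return u * u

def jensen_gap(x, y, theta):
    """θ f(x) + (1-θ) f(y) - f(θx + (1-θ)y); nonnegative iff the inequality holds."""
    return theta * f(x) + (1 - theta) * f(y) - f(theta * x + (1 - theta) * y)

# check many (x, y, θ) combinations; all gaps should be (numerically) nonnegative
checks = [jensen_gap(x, y, t)
          for x in (-2.0, 0.5, 3.0)
          for y in (-1.0, 4.0)
          for t in (0.0, 0.25, 0.5, 1.0)]
print(all(g >= -1e-12 for g in checks))  # True: f has nonnegative curvature
```

The gap is exactly the amount by which the chord lies above the graph, so a strictly positive value (e.g. `jensen_gap(-2.0, 4.0, 0.5)` is 9.0) reflects strict curvature between distinct points.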
- 5. Why convex optimization? Convex optimization 5
- 6. Why convex optimization? we can solve convex optimization problems eﬀectively Convex optimization 5
- 7. Why convex optimization? we can solve convex optimization problems eﬀectively there are lots of applications Convex optimization 5
- 8. Application areas
  machine learning, statistics; finance; supply chain, revenue management, advertising; control; signal and image processing, vision; networking; circuit design; and many others . . .
  Convex optimization 6
- 9. Convex optimization solvers
  medium scale (1000s–10000s of variables, constraints): interior-point methods on a single machine
  large scale (100k–1B variables, constraints): custom (often problem-specific) methods, e.g., SGD
  lots of ongoing research; growing list of open-source solvers
  Convex optimization 7
- 10. Convex optimization modeling languages
  (new) high-level language support for convex optimization: describe the problem in a high-level language; the problem is compiled to standard form and solved
  implementations: YALMIP, CVX (Matlab); CVXPY (Python); Convex.jl (Julia)
  Convex optimization 8
- 11. CVXPY (Diamond & Boyd, 2013)
      minimize    ||Ax − b||_2^2 + γ||x||_1
      subject to  ||x||_∞ ≤ 1
  from cvxpy import *
  x = Variable(n)
  cost = sum_squares(A*x - b) + gamma*norm(x, 1)
  prob = Problem(Minimize(cost), [norm(x, "inf") <= 1])
  opt_val = prob.solve()
  solution = x.value
  Convex optimization 9
- 12. Example: Image in-painting
  guess pixel values in obscured/corrupted parts of an image
  total variation in-painting: choose pixel values xij ∈ R^3 to minimize the total variation
      TV(x) = Σij ||( x_{i+1,j} − x_{ij}, x_{i,j+1} − x_{ij} )||_2
  a convex problem
  Convex optimization 10
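The TV objective is easy to evaluate directly. The sketch below (illustrative only, using scalar grayscale pixels rather than the slides' xij ∈ R^3) computes it for a tiny image:

```python
import math

def total_variation(img):
    """Isotropic total variation of a 2-D grayscale image (list of rows).

    Sums, over each pixel, the 2-norm of the forward differences
    (x[i+1][j] - x[i][j], x[i][j+1] - x[i][j]), dropping differences
    that would fall outside the image boundary.
    """
    rows, cols = len(img), len(img[0])
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            dv = img[i + 1][j] - img[i][j] if i + 1 < rows else 0.0
            dh = img[i][j + 1] - img[i][j] if j + 1 < cols else 0.0
            tv += math.hypot(dv, dh)
    return tv

print(total_variation([[0.0, 1.0], [2.0, 3.0]]))  # sqrt(5) + 2 + 1 ≈ 5.236
```

In-painting then minimizes this convex function over the unknown pixels while holding the known pixels fixed; a constant image has TV zero, which is why the recovered images look smooth.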
- 13. Example 512 × 512 color image (n ≈ 800000 variables) Original Corrupted Convex optimization 11
- 14. Example Original Recovered Convex optimization 12
- 15. Example 80% of pixels removed Original Corrupted Convex optimization 13
- 16. Example 80% of pixels removed Original Recovered Convex optimization 14
- 17. Outline Convex optimization Model ﬁtting via convex optimization Consensus optimization and model ﬁtting Model ﬁtting via convex optimization 15
- 18. Predictor
  given data (xi, yi), i = 1, ..., m; x is the feature vector, y is the outcome or label
  find a predictor ψ so that y ≈ ŷ = ψ(x) for data (x, y) that you haven't seen
  ψ is a regression model for y ∈ R; ψ is a classifier for y ∈ {−1, 1}
  Model fitting via convex optimization 16
- 19. Loss minimization predictor
  predictor parametrized by θ ∈ R^n
  loss function L(xi, yi, θ) gives the misfit for data point (xi, yi)
  for given θ, the predictor is ψ(x) = argmin_y L(x, y, θ)
  how do we choose the parameter θ?
  Model fitting via convex optimization 17
- 20. Model fitting via regularized loss minimization
  choose θ by minimizing the regularized loss
      (1/m) Σ_{i=1}^m L(xi, yi, θ) + λ r(θ)
  regularization r(θ) penalizes model complexity, enforces constraints, or represents a prior
  λ > 0 scales the regularization
  for many useful cases, this is a convex problem
  Model fitting via convex optimization 18
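As an illustration of the regularized-loss framework (a hypothetical example, not from the slides), one-feature ridge regression can be minimized in closed form:

```python
def ridge_1d(xs, ys, lam):
    """Minimize (1/m) * sum((theta*x - y)^2) + lam * theta^2 in closed form.

    One-feature ridge regression: setting the derivative to zero gives
        theta = sum(x*y) / (sum(x^2) + m*lam).
    """
    m = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + m * lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # data lies exactly on y = 2x
print(ridge_1d(xs, ys, 0.0))  # 2.0: no regularization recovers the slope
print(ridge_1d(xs, ys, 1.0) < 2.0)  # True: regularization shrinks theta toward 0
```

Larger λ shrinks θ further toward zero, which is exactly the complexity penalty described above; with r(θ) = ||θ||_1 the same idea yields sparse θ instead.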
- 22. Examples
  predictor             L(x, y, θ)            ψ(x)        r(θ)
  least-squares         (θᵀx − y)²            θᵀx         0
  ridge regression      (θᵀx − y)²            θᵀx         ||θ||₂²
  lasso                 (θᵀx − y)²            θᵀx         ||θ||₁
  logistic classifier   log(1 + exp(−yθᵀx))   sign(θᵀx)   0
  SVM                   (1 − yθᵀx)₊           sign(θᵀx)   ||θ||₂²
  can mix and match, e.g., r(θ) = ||θ||₁ sparsifies; all lead to convex fitting problems
  Model fitting via convex optimization 19
- 23. Robust (Huber) regression
  loss L(x, y, θ) = φhub(θᵀx − y), where φhub is the Huber function (with threshold M > 0):
      φhub(u) = u²            if |u| ≤ M
                2M|u| − M²    if |u| > M
  same as least-squares for small residuals, but allows (some) large residuals, and so is robust to outliers
  Model fitting via convex optimization 20
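The Huber function itself is a one-liner. The following sketch (illustrative, with threshold M = 1) shows the quadratic and linear regimes:

```python
def huber(u, M=1.0):
    """Huber penalty: quadratic for |u| <= M, linear growth (slope 2M) beyond."""
    a = abs(u)
    return a * a if a <= M else 2 * M * a - M * M

print(huber(0.5))   # 0.25  (quadratic region, same as least-squares)
print(huber(3.0))   # 5.0   (linear region: 2*1*3 - 1)
print(huber(-3.0))  # 5.0   (symmetric: large negative residuals cost the same)
```

Because the penalty grows only linearly past M, a few wildly wrong measurements contribute a bounded-slope cost instead of a quadratic one, which is the source of the robustness shown in the example that follows.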
- 24. Example
  m = 450 measurements, n = 300 regressors
  choose θtrue; xi ∼ N(0, I)
  set yi = (θtrue)ᵀxi + εi, with εi ∼ N(0, 1)
  with probability p, replace yi with −yi
  data has a fraction p of (non-obvious) wrong measurements; the distributions of 'good' and 'bad' yi are the same
  try to recover θtrue ∈ R^n from the measurements y ∈ R^m
  'prescient' version: we know which measurements are wrong
  Model fitting via convex optimization 21
- 25. Example 50 problem instances, p varying from 0 to 0.15 Model ﬁtting via convex optimization 22
- 26. Example Model ﬁtting via convex optimization 23
- 27. Quantile regression
  quantile regression: use the tilted ℓ1 loss
      L(x, y, θ) = τ(r)₊ + (1 − τ)(r)₋,  with r = θᵀx − y, τ ∈ (0, 1)
  τ = 0.5: equal penalty for over- and under-estimating
  τ = 0.1: 9× more penalty for under-estimating
  τ = 0.9: 9× more penalty for over-estimating
  τ-quantile of residuals is zero
  Model fitting via convex optimization 24
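The tilted ℓ1 (pinball) loss is likewise simple to implement. The sketch below (illustrative, not from the slides) shows the asymmetric penalties for τ = 0.1:

```python
def tilted_l1(r, tau):
    """Tilted l1 (pinball) loss: tau*(r)_+ + (1 - tau)*(r)_-."""
    return tau * max(r, 0.0) + (1 - tau) * max(-r, 0.0)

print(tilted_l1(1.0, 0.5), tilted_l1(-1.0, 0.5))  # 0.5 0.5: symmetric at tau = 0.5
print(tilted_l1(-1.0, 0.1))  # 0.9: under-estimate (negative residual) penalized heavily
print(tilted_l1(1.0, 0.1))   # 0.1: over-estimate penalized lightly
```

At τ = 0.1 a unit under-estimate costs 0.9 versus 0.1 for a unit over-estimate, the 9× asymmetry on the slide, which pushes the fitted residuals so that roughly a τ fraction are negative.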
- 28. Example
  time series xt, t = 0, 1, 2, ...
  auto-regressive predictor: x̂_{t+1} = θᵀ(1, xt, ..., x_{t−M})
  M = 10 is the memory of the predictor
  use quantile regression for τ = 0.1, 0.5, 0.9
  at each time t, this gives three one-step-ahead predictions: x̂^{0.1}_{t+1}, x̂^{0.5}_{t+1}, x̂^{0.9}_{t+1}
  Model fitting via convex optimization 25
- 29. Example time series xt Model ﬁtting via convex optimization 26
- 30. Example xt and predictions x̂^{0.1}_{t+1}, x̂^{0.5}_{t+1}, x̂^{0.9}_{t+1} (training set, t = 0, ..., 399) Model fitting via convex optimization 27
- 31. Example xt and predictions x̂^{0.1}_{t+1}, x̂^{0.5}_{t+1}, x̂^{0.9}_{t+1} (test set, t = 400, ..., 449) Model fitting via convex optimization 28
- 32. Example residual distributions for τ = 0.9, 0.5, and 0.1 (training set) Model ﬁtting via convex optimization 29
- 33. Example residual distributions for τ = 0.9, 0.5, and 0.1 (test set) Model ﬁtting via convex optimization 30
- 34. Outline Convex optimization Model ﬁtting via convex optimization Consensus optimization and model ﬁtting Consensus optimization and model ﬁtting 31
- 35. Consensus optimization
  want to solve a problem with N objective terms:
      minimize Σ_{i=1}^N fi(x)
  e.g., fi is the loss function for the ith block of training data
  consensus form:
      minimize   Σ_{i=1}^N fi(xi)
      subject to xi − z = 0
  xi are local variables; z is the global variable
  xi − z = 0 are the consistency or consensus constraints
  Consensus optimization and model fitting 32
- 36. Consensus optimization via ADMM
  with x̄^k = (1/N) Σ_{i=1}^N xi^k (average over local variables):
      xi^{k+1} := argmin_{xi} ( fi(xi) + (ρ/2) ||xi − x̄^k + ui^k||₂² )
      ui^{k+1} := ui^k + (xi^{k+1} − x̄^{k+1})
  gets the global minimum, under very general conditions
  u^k is a running sum of inconsistencies (PI control)
  minimizations carried out independently and in parallel; coordination is via averaging of local variables xi
  Consensus optimization and model fitting 33
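As a sanity check (not from the slides), the two ADMM updates have a closed form when each local objective is the hypothetical quadratic fi(x) = (x − a_i)², so the whole consensus iteration fits in a few lines of plain Python:

```python
def consensus_admm(a, rho=1.0, iters=200):
    """Consensus ADMM for: minimize sum_i (x - a[i])^2, split across N 'agents'.

    The local x-update has the closed-form prox
        x_i := argmin (x - a_i)^2 + (rho/2)*(x - zbar + u_i)^2
             = (2*a_i + rho*(zbar - u_i)) / (2 + rho),
    coordination is the average zbar, and u_i accumulates inconsistencies.
    The global minimizer of the summed objective is mean(a).
    """
    N = len(a)
    u = [0.0] * N
    zbar = 0.0
    for _ in range(iters):
        x = [(2 * a[i] + rho * (zbar - u[i])) / (2 + rho) for i in range(N)]
        zbar = sum(x) / N                       # averaging step (the only coordination)
        u = [u[i] + x[i] - zbar for i in range(N)]  # running sum of inconsistencies
    return zbar

a = [1.0, 5.0, 9.0, 13.0]
print(consensus_admm(a))  # converges to mean(a) = 7.0
```

Each x-update uses only that agent's data a_i plus the shared average, mirroring the privacy and parallelism claims on the next slide; for general fi the prox step is itself a small convex problem rather than a formula.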
- 37. Consensus model fitting
  variable is θ, the parameter in the predictor
  fi(θi) is loss + (share of) regularizer for the ith data block
  θi^{k+1} minimizes local loss + an additional quadratic term
  local parameters converge to consensus, the same as if the whole data set were handled together
  privacy preserving: agents don't reveal their data to each other
  Consensus optimization and model fitting 34
- 38. Example
  SVM: hinge loss ℓ(u) = (1 − u)₊, sum-square regularization r(θ) = ||θ||₂²
  baby problem with n = 2, m = 400 to illustrate
  examples split into N = 20 groups, in the worst possible way: each group contains only positive or only negative examples
  Consensus optimization and model fitting 35
- 39. Iteration 1 [scatter plot of the two example groups with the current local classifiers] Consensus optimization and model fitting 36
- 40. Iteration 5 [scatter plot, local classifiers closer to consensus] Consensus optimization and model fitting 37
- 41. Iteration 40 [scatter plot, local classifiers in consensus] Consensus optimization and model fitting 38
- 42. CVXPY implementation (Steven Diamond)
  N = 10^5 samples, n = 10^3 (dense) features
  hinge (SVM) loss with ℓ1 regularization
  data split into 100 chunks; 100 processes on 32 cores
  26 sec per ADMM iteration
  100 iterations for the objective to converge; 10 iterations (5 minutes) to get a good model
  Consensus optimization and model fitting 39
- 43. CVXPY implementation Consensus optimization and model ﬁtting 40
- 44. H2O implementation (Tomas Nykodym)
  click-through data derived from a Kaggle data set
  20000 features, 20M examples
  logistic loss, elastic net regularization
  examples divided into 100 chunks (of different sizes)
  run on 100 H2O instances
  5 iterations to get a good global model
  Consensus optimization and model fitting 41
- 45. H2O implementation ROC, iteration 1 Consensus optimization and model ﬁtting 42
- 46. H2O implementation ROC, iteration 2 Consensus optimization and model ﬁtting 43
- 47. H2O implementation ROC, iteration 3 Consensus optimization and model ﬁtting 44
- 48. H2O implementation ROC, iteration 5 Consensus optimization and model ﬁtting 45
- 49. H2O implementation ROC, iteration 10 Consensus optimization and model ﬁtting 46
- 50. Summary
  ADMM consensus can do machine learning across distributed data sources
  the data never moves
  you get the same model as if you had collected all the data in one place
  Consensus optimization and model fitting 47
- 51. Resources
  many researchers have worked on the topics covered
  Convex Optimization
  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
  EE364a (course slides, videos, code, homework, . . . )
  software: CVX, CVXPY, Convex.jl
  all available online
  Consensus optimization and model fitting 48
