Consensus Optimization and Machine Learning

Stephen Boyd and Steven Diamond
EE & CS Departments, Stanford University
H2O World, 11/10/2015
Outline

- Convex optimization
- Model fitting via convex optimization
- Consensus optimization and model fitting
Outline: Convex optimization
Convex optimization problem

convex optimization problem:

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, ..., m
                Ax = b

- variable x ∈ R^n
- equality constraints are linear
- f_0, ..., f_m are convex: for θ ∈ [0, 1],

      f_i(θx + (1 − θ)y) ≤ θ f_i(x) + (1 − θ) f_i(y)

  i.e., the f_i have nonnegative (upward) curvature
Why convex optimization?

- we can solve convex optimization problems effectively
- there are lots of applications
Application areas

- machine learning, statistics
- finance
- supply chain, revenue management, advertising
- control
- signal and image processing, vision
- networking
- circuit design
- and many others ...
Convex optimization solvers

- medium scale (1000s-10000s of variables, constraints):
  interior-point methods on a single machine
- large scale (100k-1B variables, constraints):
  custom (often problem-specific) methods, e.g., SGD
- lots of ongoing research
- growing list of open-source solvers
Convex optimization modeling languages

- (new) high-level language support for convex optimization
- describe the problem in a high-level language
- the problem is compiled to standard form and solved
- implementations: YALMIP, CVX (Matlab); CVXPY (Python); Convex.jl (Julia)
CVXPY (Diamond & Boyd, 2013)

    minimize    ‖Ax − b‖_2^2 + γ‖x‖_1
    subject to  ‖x‖_∞ ≤ 1

from cvxpy import *

x = Variable(n)
cost = sum_squares(A*x - b) + gamma*norm(x, 1)
prob = Problem(Minimize(cost), [norm(x, "inf") <= 1])
opt_val = prob.solve()
solution = x.value
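The snippet assumes n, A, b, and gamma are already defined; a minimal setup sketch follows, with all values chosen purely for illustration (note that recent CVXPY versions prefer A @ x over A*x for matrix multiplication):

import numpy as np

# hypothetical problem data so the slide's snippet runs end to end
np.random.seed(1)
m, n = 30, 20
A = np.random.randn(m, n)
b = np.random.randn(m)
gamma = 0.1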
Example: Image in-painting

- guess pixel values in obscured/corrupted parts of an image
- total variation in-painting: choose pixel values x_{ij} ∈ R^3 to minimize the total variation

      TV(x) = ∑_{ij} ‖ (x_{i+1,j} − x_{ij},  x_{i,j+1} − x_{ij}) ‖_2

- a convex problem
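A minimal sketch of the (simpler) grayscale version using CVXPY's tv atom; the image and mask below are synthetic stand-ins, not the slide's data:

import numpy as np
import cvxpy as cp

# grayscale TV in-painting sketch; img and known are synthetic stand-ins
rng = np.random.default_rng(0)
img = rng.random((32, 32))                           # stand-in "original" image
known = (rng.random((32, 32)) < 0.2).astype(float)   # mask: ~20% of pixels kept

U = cp.Variable(img.shape)                           # pixel values to recover
fix_known = cp.multiply(known, U) == cp.multiply(known, img)
prob = cp.Problem(cp.Minimize(cp.tv(U)), [fix_known])
prob.solve()
recovered = U.value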
Example

512 × 512 color image (n ≈ 800000 variables)
[figure: Original vs. Corrupted]

Example

[figure: Original vs. Recovered]

Example

80% of pixels removed
[figure: Original vs. Corrupted]

Example

80% of pixels removed
[figure: Original vs. Recovered]
Outline: Model fitting via convex optimization
Predictor

- given data (x_i, y_i), i = 1, ..., m
- x is the feature vector, y is the outcome or label
- find a predictor ψ so that y ≈ ŷ = ψ(x) for data (x, y) that you haven't seen
- ψ is a regression model for y ∈ R
- ψ is a classifier for y ∈ {−1, 1}
Loss minimization predictor

- predictor parametrized by θ ∈ R^n
- loss function L(x_i, y_i, θ) gives the misfit for data point (x_i, y_i)
- for given θ, the predictor is ψ(x) = argmin_y L(x, y, θ)
- how do we choose the parameter θ?
Model fitting via regularized loss minimization

choose θ by minimizing the regularized loss

    (1/m) ∑_{i=1}^m L(x_i, y_i, θ) + λ r(θ)

- regularization r(θ) penalizes model complexity, enforces constraints, or represents a prior
- λ > 0 scales the regularization
- for many useful cases, this is a convex problem
Examples

    predictor            L(x, y, θ)              ψ(x)         r(θ)
    least squares        (θ^T x − y)^2           θ^T x        0
    ridge regression     (θ^T x − y)^2           θ^T x        ‖θ‖_2^2
    lasso                (θ^T x − y)^2           θ^T x        ‖θ‖_1
    logistic classifier  log(1 + exp(−y θ^T x))  sign(θ^T x)  0
    SVM                  (1 − y θ^T x)_+         sign(θ^T x)  ‖θ‖_2^2

- can mix and match, e.g., r(θ) = ‖θ‖_1 sparsifies
- all lead to convex fitting problems
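As a concrete instance of the mix-and-match idea, a short CVXPY sketch of two rows of the table (lasso and SVM); the data and parameter values here are illustrative only:

import numpy as np
import cvxpy as cp

# fitting two rows of the table on synthetic data
rng = np.random.default_rng(0)
m, n, lam = 100, 20, 0.1
X = rng.standard_normal((m, n))
y = np.sign(rng.standard_normal(m))
theta = cp.Variable(n)

# lasso: squared loss + l1 regularizer
lasso = cp.sum_squares(X @ theta - y) / m + lam * cp.norm(theta, 1)
# SVM: hinge loss + squared l2 regularizer
svm = cp.sum(cp.pos(1 - cp.multiply(y, X @ theta))) / m + lam * cp.sum_squares(theta)

cp.Problem(cp.Minimize(svm)).solve()
print(theta.value)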
Robust (Huber) regression

- loss L(x, y, θ) = φ_hub(θ^T x − y)
- φ_hub is the Huber function (with threshold M > 0):

      φ_hub(u) = u^2            for |u| ≤ M
                 2M|u| − M^2    for |u| > M

- same as least squares for small residuals, but allows (some) large residuals
- and so, robust to outliers
Example

- m = 450 measurements, n = 300 regressors
- choose θ^true; x_i ∼ N(0, I)
- set y_i = (θ^true)^T x_i + ε_i, ε_i ∼ N(0, 1)
- with probability p, replace y_i with −y_i
- data has a fraction p of (non-obvious) wrong measurements
- distributions of 'good' and 'bad' y_i are the same
- try to recover θ^true ∈ R^n from measurements y ∈ R^m
- 'prescient' version: we know which measurements are wrong
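A sketch of this experiment with CVXPY's huber atom (which implements φ_hub above); the parameter values follow the slide, and everything else (seed, M, a single p) is an illustrative choice:

import numpy as np
import cvxpy as cp

# Huber regression on the slide's synthetic setup
rng = np.random.default_rng(0)
m, n, p, M = 450, 300, 0.1, 1.0
X = rng.standard_normal((m, n))            # x_i ~ N(0, I), stacked as rows
theta_true = rng.standard_normal(n)
y = X @ theta_true + rng.standard_normal(m)
flip = rng.random(m) < p                   # the wrong measurements
y[flip] = -y[flip]

theta = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum(cp.huber(X @ theta - y, M)))).solve()
err = np.linalg.norm(theta.value - theta_true) / np.linalg.norm(theta_true)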
Example

50 problem instances, p varying from 0 to 0.15
[figure]

Example

[figure]
Quantile regression

- quantile regression: use the tilted ℓ1 loss

      L(x, y, θ) = τ(r)_+ + (1 − τ)(r)_−,   with r = θ^T x − y, τ ∈ (0, 1)

- τ = 0.5: equal penalty for over- and under-estimating
- τ = 0.1: 9× more penalty for under-estimating
- τ = 0.9: 9× more penalty for over-estimating
- the τ-quantile of the residuals is zero
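A minimal CVXPY sketch of this loss, with (r)_+ and (r)_− expressed via the pos and neg atoms; fit_quantile is a hypothetical helper name introduced here:

import cvxpy as cp

# tilted-l1 (quantile) regression; fit_quantile is a hypothetical helper
def fit_quantile(X, y, tau):
    theta = cp.Variable(X.shape[1])
    r = X @ theta - y                                # residuals
    loss = cp.sum(tau * cp.pos(r) + (1 - tau) * cp.neg(r)) / X.shape[0]
    cp.Problem(cp.Minimize(loss)).solve()
    return theta.value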
Example

- time series x_t, t = 0, 1, 2, ...
- auto-regressive predictor: x̂_{t+1} = θ^T (1, x_t, ..., x_{t−M})
- M = 10 is the memory of the predictor
- use quantile regression for τ = 0.1, 0.5, 0.9
- at each time t, gives three one-step-ahead predictions: x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9}
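The lagged feature matrix can be built as below and passed to the fit_quantile sketch above, once per τ; ar_features is a hypothetical helper, and the series x is assumed to be a 1-D numpy array:

import numpy as np

# build rows (1, x_t, ..., x_{t-M}) with targets x_{t+1}, for t = M, ..., T-2
def ar_features(x, M=10):
    rows = [np.concatenate(([1.0], x[t - M:t + 1][::-1]))
            for t in range(M, len(x) - 1)]
    return np.asarray(rows), x[M + 1:]

# usage sketch: one fit per quantile level
# X, y = ar_features(x)
# thetas = {tau: fit_quantile(X, y, tau) for tau in (0.1, 0.5, 0.9)}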
Example

time series x_t
[figure]

Example

x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (training set, t = 0, ..., 399)
[figure]

Example

x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (test set, t = 400, ..., 449)
[figure]

Example

residual distributions for τ = 0.9, 0.5, and 0.1 (training set)
[figure]

Example

residual distributions for τ = 0.9, 0.5, and 0.1 (test set)
[figure]
Outline: Consensus optimization and model fitting
Consensus optimization

want to solve a problem with N objective terms:

    minimize  ∑_{i=1}^N f_i(x)

e.g., f_i is the loss function for the ith block of training data

consensus form:

    minimize    ∑_{i=1}^N f_i(x_i)
    subject to  x_i − z = 0

- x_i are local variables
- z is the global variable
- x_i − z = 0 are the consistency or consensus constraints
Consensus optimization via ADMM

with x̄^k = (1/N) ∑_{i=1}^N x_i^k (average over local variables):

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2) ‖x_i − x̄^k + u_i^k‖_2^2 )
    u_i^{k+1} := u_i^k + (x_i^{k+1} − x̄^{k+1})

- get the global minimum, under very general conditions
- u^k is a running sum of inconsistencies (PI control)
- minimizations are carried out independently and in parallel
- coordination is via averaging of the local variables x_i
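A minimal serial sketch of these updates (see the SVM example two slides below for usage); in the slide's setting each x_i-update would run in parallel on the machine holding block i. consensus_admm and losses are names introduced here; losses[i] is assumed to map a CVXPY variable to the expression f_i:

import numpy as np
import cvxpy as cp

# serial sketch of consensus ADMM; losses[i] builds f_i from a CVXPY variable
def consensus_admm(losses, n, rho=1.0, iters=50):
    N = len(losses)
    x = np.zeros((N, n))            # local variables x_i
    u = np.zeros((N, n))            # scaled dual variables u_i
    xbar = np.zeros(n)
    for k in range(iters):
        for i in range(N):          # independent; parallel in practice
            xi = cp.Variable(n)
            cp.Problem(cp.Minimize(
                losses[i](xi) + (rho / 2) * cp.sum_squares(xi - xbar + u[i])
            )).solve()
            x[i] = xi.value
        xbar = x.mean(axis=0)       # coordination step: averaging
        u += x - xbar               # running sum of inconsistencies
    return xbar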
Consensus model fitting

- the variable is θ, the parameter in the predictor
- f_i(θ_i) is the loss + (share of) the regularizer for the ith data block
- θ_i^{k+1} minimizes the local loss + an additional quadratic term
- local parameters converge to consensus, same as if the whole data set were handled together
- privacy preserving: agents don't reveal data to each other
Example

SVM:
- hinge loss l(u) = (1 − u)_+
- sum-square regularization r(θ) = ‖θ‖_2^2

baby problem with n = 2, m = 400 to illustrate

examples split into N = 20 groups, in the worst possible way: each group contains only positive or only negative examples
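Wiring this example into the consensus_admm sketch above might look as follows; the data is a synthetic stand-in for the slide's worst-case grouping, and lam is an illustrative regularization weight:

import numpy as np
import cvxpy as cp

# synthetic stand-in for the baby problem: N = 20 single-class blocks
rng = np.random.default_rng(0)
N, n, m, lam = 20, 2, 400, 0.1
X = np.vstack([rng.normal(+1.0, 1.0, (m // 2, n)),
               rng.normal(-1.0, 1.0, (m // 2, n))])
y = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])
Xs, ys = np.split(X, N), np.split(y, N)       # each block holds one class only

# f_i = block hinge loss + its share of the l2 regularizer
losses = [lambda th, Xi=Xi, yi=yi:
              cp.sum(cp.pos(1 - cp.multiply(yi, Xi @ th))) / len(yi)
              + (lam / N) * cp.sum_squares(th)
          for Xi, yi in zip(Xs, ys)]
theta = consensus_admm(losses, n)             # from the sketch above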
Iteration 1

[figure]

Iteration 5

[figure]

Iteration 40

[figure]
CVXPY implementation (Steven Diamond)

- N = 10^5 samples, n = 10^3 (dense) features
- hinge (SVM) loss with ℓ1 regularization
- data split into 100 chunks
- 100 processes on 32 cores
- 26 sec per ADMM iteration
- 100 iterations for the objective to converge
- 10 iterations (5 minutes) to get a good model
CVXPY implementation

[figure]
H2O implementation (Tomas Nykodym)

- click-through data derived from a Kaggle data set
- 20000 features, 20M examples
- logistic loss, elastic net regularization
- examples divided into 100 chunks (of different sizes)
- run on 100 H2O instances
- 5 iterations to get a good global model
H2O implementation

ROC, iteration 1
[figure]

H2O implementation

ROC, iteration 2
[figure]

H2O implementation

ROC, iteration 3
[figure]

H2O implementation

ROC, iteration 5
[figure]

H2O implementation

ROC, iteration 10
[figure]
Summary

ADMM consensus:
- can do machine learning across distributed data sources
- the data never moves
- you get the same model as if you had collected all the data in one place
Resources

many researchers have worked on the topics covered:

- Convex Optimization (Boyd & Vandenberghe)
- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Boyd et al.)
- EE364a (course slides, videos, code, homework, ...)
- software: CVX, CVXPY, Convex.jl

all available online