Consensus Optimization and Machine Learning

Stephen Boyd and Steven Diamond
EE & CS Departments, Stanford University
H2O World, 11/10/2015
Outline

- Convex optimization
- Model fitting via convex optimization
- Consensus optimization and model fitting
Outline: Convex optimization
Convex optimization problem

convex optimization problem:

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, ..., m
                Ax = b

- variable x ∈ R^n
- equality constraints are linear
- f_0, ..., f_m are convex: for θ ∈ [0, 1],

      f_i(θx + (1 − θ)y) ≤ θ f_i(x) + (1 − θ) f_i(y)

  i.e., the f_i have nonnegative (upward) curvature
Why convex optimization?

- we can solve convex optimization problems effectively
- there are lots of applications
Application areas

- machine learning, statistics
- finance
- supply chain, revenue management, advertising
- control
- signal and image processing, vision
- networking
- circuit design
- and many others ...
Convex optimization solvers

- medium scale (1000s-10000s of variables, constraints):
  interior-point methods on a single machine
- large scale (100k-1B variables, constraints):
  custom (often problem-specific) methods, e.g., SGD
- lots of ongoing research
- growing list of open-source solvers
Convex optimization modeling languages

- (new) high-level language support for convex optimization
- describe the problem in a high-level language
- the problem is compiled to standard form and solved
- implementations: YALMIP, CVX (Matlab); CVXPY (Python); Convex.jl (Julia)
CVXPY (Diamond & Boyd, 2013)

    minimize    ‖Ax − b‖_2^2 + γ‖x‖_1
    subject to  ‖x‖_∞ ≤ 1

from cvxpy import *

x = Variable(n)
cost = sum_squares(A*x - b) + gamma*norm(x, 1)
prob = Problem(Minimize(cost), [norm(x, "inf") <= 1])
opt_val = prob.solve()
solution = x.value
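The snippet assumes n, A, b, and gamma are already defined; a minimal setup sketch follows, with all values chosen purely for illustration (note that recent CVXPY versions prefer A @ x over A*x for matrix multiplication):

import numpy as np

# hypothetical problem data so the slide's snippet runs end to end
np.random.seed(1)
m, n = 30, 20
A = np.random.randn(m, n)
b = np.random.randn(m)
gamma = 0.1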
Example: Image in-painting

- guess pixel values in obscured/corrupted parts of an image
- total variation in-painting: choose pixel values x_{ij} ∈ R^3 to minimize the total variation

      TV(x) = ∑_{ij} ‖ (x_{i+1,j} − x_{ij},  x_{i,j+1} − x_{ij}) ‖_2

- a convex problem
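A minimal sketch of the (simpler) grayscale version using CVXPY's tv atom; the image and mask below are synthetic stand-ins, not the slide's data:

import numpy as np
import cvxpy as cp

# grayscale TV in-painting sketch; img and known are synthetic stand-ins
rng = np.random.default_rng(0)
img = rng.random((32, 32))                           # stand-in "original" image
known = (rng.random((32, 32)) < 0.2).astype(float)   # mask: ~20% of pixels kept

U = cp.Variable(img.shape)                           # pixel values to recover
fix_known = cp.multiply(known, U) == cp.multiply(known, img)
prob = cp.Problem(cp.Minimize(cp.tv(U)), [fix_known])
prob.solve()
recovered = U.value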
Example

512 × 512 color image (n ≈ 800000 variables)
[figure: Original vs. Corrupted]

Example

[figure: Original vs. Recovered]

Example

80% of pixels removed
[figure: Original vs. Corrupted]

Example

80% of pixels removed
[figure: Original vs. Recovered]
Outline: Model fitting via convex optimization
Predictor

- given data (x_i, y_i), i = 1, ..., m
- x is the feature vector, y is the outcome or label
- find a predictor ψ so that y ≈ ŷ = ψ(x) for data (x, y) that you haven't seen
- ψ is a regression model for y ∈ R
- ψ is a classifier for y ∈ {−1, 1}
Loss minimization predictor

- predictor parametrized by θ ∈ R^n
- loss function L(x_i, y_i, θ) gives the misfit for data point (x_i, y_i)
- for given θ, the predictor is ψ(x) = argmin_y L(x, y, θ)
- how do we choose the parameter θ?
Model fitting via regularized loss minimization

choose θ by minimizing the regularized loss

    (1/m) ∑_{i=1}^m L(x_i, y_i, θ) + λ r(θ)

- regularization r(θ) penalizes model complexity, enforces constraints, or represents a prior
- λ > 0 scales the regularization
- for many useful cases, this is a convex problem
Examples

    predictor            L(x, y, θ)              ψ(x)         r(θ)
    least squares        (θ^T x − y)^2           θ^T x        0
    ridge regression     (θ^T x − y)^2           θ^T x        ‖θ‖_2^2
    lasso                (θ^T x − y)^2           θ^T x        ‖θ‖_1
    logistic classifier  log(1 + exp(−y θ^T x))  sign(θ^T x)  0
    SVM                  (1 − y θ^T x)_+         sign(θ^T x)  ‖θ‖_2^2

- can mix and match, e.g., r(θ) = ‖θ‖_1 sparsifies
- all lead to convex fitting problems
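As a concrete instance of the mix-and-match idea, a short CVXPY sketch of two rows of the table (lasso and SVM); the data and parameter values here are illustrative only:

import numpy as np
import cvxpy as cp

# fitting two rows of the table on synthetic data
rng = np.random.default_rng(0)
m, n, lam = 100, 20, 0.1
X = rng.standard_normal((m, n))
y = np.sign(rng.standard_normal(m))
theta = cp.Variable(n)

# lasso: squared loss + l1 regularizer
lasso = cp.sum_squares(X @ theta - y) / m + lam * cp.norm(theta, 1)
# SVM: hinge loss + squared l2 regularizer
svm = cp.sum(cp.pos(1 - cp.multiply(y, X @ theta))) / m + lam * cp.sum_squares(theta)

cp.Problem(cp.Minimize(svm)).solve()
print(theta.value)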
Robust (Huber) regression

- loss L(x, y, θ) = φ_hub(θ^T x − y)
- φ_hub is the Huber function (with threshold M > 0):

      φ_hub(u) = u^2            for |u| ≤ M
                 2M|u| − M^2    for |u| > M

- same as least squares for small residuals, but allows (some) large residuals
- and so, robust to outliers
Example

- m = 450 measurements, n = 300 regressors
- choose θ^true; x_i ∼ N(0, I)
- set y_i = (θ^true)^T x_i + ε_i, ε_i ∼ N(0, 1)
- with probability p, replace y_i with −y_i
- data has a fraction p of (non-obvious) wrong measurements
- distributions of 'good' and 'bad' y_i are the same
- try to recover θ^true ∈ R^n from measurements y ∈ R^m
- 'prescient' version: we know which measurements are wrong
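A sketch of this experiment with CVXPY's huber atom (which implements φ_hub above); the parameter values follow the slide, and everything else (seed, M, a single p) is an illustrative choice:

import numpy as np
import cvxpy as cp

# Huber regression on the slide's synthetic setup
rng = np.random.default_rng(0)
m, n, p, M = 450, 300, 0.1, 1.0
X = rng.standard_normal((m, n))            # x_i ~ N(0, I), stacked as rows
theta_true = rng.standard_normal(n)
y = X @ theta_true + rng.standard_normal(m)
flip = rng.random(m) < p                   # the wrong measurements
y[flip] = -y[flip]

theta = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum(cp.huber(X @ theta - y, M)))).solve()
err = np.linalg.norm(theta.value - theta_true) / np.linalg.norm(theta_true)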
Example

50 problem instances, p varying from 0 to 0.15
[figure]

Example

[figure]
Quantile regression

- quantile regression: use the tilted ℓ1 loss

      L(x, y, θ) = τ(r)_+ + (1 − τ)(r)_−,   with r = θ^T x − y, τ ∈ (0, 1)

- τ = 0.5: equal penalty for over- and under-estimating
- τ = 0.1: 9× more penalty for under-estimating
- τ = 0.9: 9× more penalty for over-estimating
- the τ-quantile of the residuals is zero
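A minimal CVXPY sketch of this loss, with (r)_+ and (r)_− expressed via the pos and neg atoms; fit_quantile is a hypothetical helper name introduced here:

import cvxpy as cp

# tilted-l1 (quantile) regression; fit_quantile is a hypothetical helper
def fit_quantile(X, y, tau):
    theta = cp.Variable(X.shape[1])
    r = X @ theta - y                                # residuals
    loss = cp.sum(tau * cp.pos(r) + (1 - tau) * cp.neg(r)) / X.shape[0]
    cp.Problem(cp.Minimize(loss)).solve()
    return theta.value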
Example

- time series x_t, t = 0, 1, 2, ...
- auto-regressive predictor: x̂_{t+1} = θ^T (1, x_t, ..., x_{t−M})
- M = 10 is the memory of the predictor
- use quantile regression for τ = 0.1, 0.5, 0.9
- at each time t, gives three one-step-ahead predictions: x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9}
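The lagged feature matrix can be built as below and passed to the fit_quantile sketch above, once per τ; ar_features is a hypothetical helper, and the series x is assumed to be a 1-D numpy array:

import numpy as np

# build rows (1, x_t, ..., x_{t-M}) with targets x_{t+1}, for t = M, ..., T-2
def ar_features(x, M=10):
    rows = [np.concatenate(([1.0], x[t - M:t + 1][::-1]))
            for t in range(M, len(x) - 1)]
    return np.asarray(rows), x[M + 1:]

# usage sketch: one fit per quantile level
# X, y = ar_features(x)
# thetas = {tau: fit_quantile(X, y, tau) for tau in (0.1, 0.5, 0.9)}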
Example

time series x_t
[figure]

Example

x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (training set, t = 0, ..., 399)
[figure]

Example

x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (test set, t = 400, ..., 449)
[figure]

Example

residual distributions for τ = 0.9, 0.5, and 0.1 (training set)
[figure]

Example

residual distributions for τ = 0.9, 0.5, and 0.1 (test set)
[figure]
Outline: Consensus optimization and model fitting
Consensus optimization

want to solve a problem with N objective terms:

    minimize  ∑_{i=1}^N f_i(x)

e.g., f_i is the loss function for the ith block of training data

consensus form:

    minimize    ∑_{i=1}^N f_i(x_i)
    subject to  x_i − z = 0

- x_i are local variables
- z is the global variable
- x_i − z = 0 are the consistency or consensus constraints
Consensus optimization via ADMM

with x̄^k = (1/N) ∑_{i=1}^N x_i^k (average over local variables):

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2) ‖x_i − x̄^k + u_i^k‖_2^2 )
    u_i^{k+1} := u_i^k + (x_i^{k+1} − x̄^{k+1})

- get the global minimum, under very general conditions
- u^k is a running sum of inconsistencies (PI control)
- minimizations are carried out independently and in parallel
- coordination is via averaging of the local variables x_i
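A minimal serial sketch of these updates (see the SVM example two slides below for usage); in the slide's setting each x_i-update would run in parallel on the machine holding block i. consensus_admm and losses are names introduced here; losses[i] is assumed to map a CVXPY variable to the expression f_i:

import numpy as np
import cvxpy as cp

# serial sketch of consensus ADMM; losses[i] builds f_i from a CVXPY variable
def consensus_admm(losses, n, rho=1.0, iters=50):
    N = len(losses)
    x = np.zeros((N, n))            # local variables x_i
    u = np.zeros((N, n))            # scaled dual variables u_i
    xbar = np.zeros(n)
    for k in range(iters):
        for i in range(N):          # independent; parallel in practice
            xi = cp.Variable(n)
            cp.Problem(cp.Minimize(
                losses[i](xi) + (rho / 2) * cp.sum_squares(xi - xbar + u[i])
            )).solve()
            x[i] = xi.value
        xbar = x.mean(axis=0)       # coordination step: averaging
        u += x - xbar               # running sum of inconsistencies
    return xbar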
Consensus model fitting

- the variable is θ, the parameter in the predictor
- f_i(θ_i) is the loss + (share of) the regularizer for the ith data block
- θ_i^{k+1} minimizes the local loss + an additional quadratic term
- local parameters converge to consensus, same as if the whole data set were handled together
- privacy preserving: agents don't reveal data to each other
Example

SVM:
- hinge loss l(u) = (1 − u)_+
- sum-square regularization r(θ) = ‖θ‖_2^2

baby problem with n = 2, m = 400 to illustrate

examples split into N = 20 groups, in the worst possible way: each group contains only positive or only negative examples
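Wiring this example into the consensus_admm sketch above might look as follows; the data is a synthetic stand-in for the slide's worst-case grouping, and lam is an illustrative regularization weight:

import numpy as np
import cvxpy as cp

# synthetic stand-in for the baby problem: N = 20 single-class blocks
rng = np.random.default_rng(0)
N, n, m, lam = 20, 2, 400, 0.1
X = np.vstack([rng.normal(+1.0, 1.0, (m // 2, n)),
               rng.normal(-1.0, 1.0, (m // 2, n))])
y = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])
Xs, ys = np.split(X, N), np.split(y, N)       # each block holds one class only

# f_i = block hinge loss + its share of the l2 regularizer
losses = [lambda th, Xi=Xi, yi=yi:
              cp.sum(cp.pos(1 - cp.multiply(yi, Xi @ th))) / len(yi)
              + (lam / N) * cp.sum_squares(th)
          for Xi, yi in zip(Xs, ys)]
theta = consensus_admm(losses, n)             # from the sketch above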
Iteration 1

[figure]

Iteration 5

[figure]

Iteration 40

[figure]
CVXPY implementation (Steven Diamond)

- N = 10^5 samples, n = 10^3 (dense) features
- hinge (SVM) loss with ℓ1 regularization
- data split into 100 chunks
- 100 processes on 32 cores
- 26 sec per ADMM iteration
- 100 iterations for the objective to converge
- 10 iterations (5 minutes) to get a good model
CVXPY implementation

[figure]
H2O implementation (Tomas Nykodym)

- click-through data derived from a Kaggle data set
- 20000 features, 20M examples
- logistic loss, elastic net regularization
- examples divided into 100 chunks (of different sizes)
- run on 100 H2O instances
- 5 iterations to get a good global model
H2O implementation

ROC, iteration 1
[figure]

H2O implementation

ROC, iteration 2
[figure]

H2O implementation

ROC, iteration 3
[figure]

H2O implementation

ROC, iteration 5
[figure]

H2O implementation

ROC, iteration 10
[figure]
Summary

ADMM consensus:
- can do machine learning across distributed data sources
- the data never moves
- you get the same model as if you had collected all the data in one place
Resources

many researchers have worked on the topics covered:

- Convex Optimization (Boyd & Vandenberghe)
- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Boyd et al.)
- EE364a (course slides, videos, code, homework, ...)
- software: CVX, CVXPY, Convex.jl

all available online