We consider stochastic optimization problems arising in deep learning and other areas of statistical and machine learning from a statistical decision theory perspective. In particular, we investigate the admissibility (in the sense of decision theory) of the sample average solution estimator. We show that this estimator can be inadmissible in very simple settings, a phenomenon derived from the classical James–Stein estimator. However, for many problems of interest, the sample average estimator is indeed admissible. We end with several open questions in this research direction.
1. Admissibility of solution estimators to stochastic optimization problems
Amitabh Basu
Joint work with Tu Nguyen and Ao Sun
Foundations of Deep Learning, Opening Workshop, SAMSI, Durham, August 2019
2. A general stochastic optimization problem
$F : X \times \mathbb{R}^m \to \mathbb{R}$; $\xi$ is a random variable taking values in $\mathbb{R}^m$.
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,]$$
3. A general stochastic optimization problem
Example:
Supervised machine learning: One sees samples $(z, y) \in \mathbb{R}^n \times \mathbb{R}$ of labeled data from some (joint) distribution, and one aims to find a function $f \in \mathcal{F}$ in a hypothesis class $\mathcal{F}$ that minimizes the expected loss $\mathbb{E}_{(z,y)}[\, \ell(f(z), y) \,]$, where $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is some loss function. Then $X = \mathcal{F}$, $m = n + 1$, $\xi = (z, y)$, and $F(f, (z, y)) = \ell(f(z), y)$.
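To make the mapping concrete, here is a minimal sketch of this setup, assuming a linear hypothesis class and squared loss, and anticipating the sample-average idea formalized on the later slides; all data, dimensions, and names are illustrative.

```python
# Minimal sketch: empirical risk minimization for a linear hypothesis class
# with squared loss, as an instance of the general framework above.
# Here X = F = {f_w(z) = w^T z : w in R^n}, xi = (z, y), and
# F(f_w, (z, y)) = (w^T z - y)^2. All names and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples = 5, 200
w_true = rng.normal(size=n_features)
Z = rng.normal(size=(n_samples, n_features))             # samples of z
y = Z @ w_true + rng.normal(scale=0.1, size=n_samples)   # noisy labels

# Minimize the empirical risk (1/n) sum_i (w^T z_i - y_i)^2 in closed form.
w_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
print("estimation error:", np.linalg.norm(w_hat - w_true))
```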
4. A general stochastic optimization problem
Example:
(News) vendor problem: A (news) vendor buys some units of a product (newspapers) from a supplier at a cost of $c > 0$ dollars/unit; at most $u$ units are available. Demand for the product is stochastic. The product is sold at price $p > c$ dollars/unit. At the end of the day, the vendor can return unsold product to the supplier at $r < c$ dollars/unit. Find the number of units to buy that maximizes the expected profit (equivalently, minimizes the expected loss). Here $m = 1$, $X = [0, u]$, and
$$F(x, \xi) = cx - p \min\{x, \xi\} - r \max\{x - \xi, 0\}.$$
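As an illustration, one can estimate a good order quantity from demand samples by minimizing the average loss over the samples (the idea formalized on the next slides). A minimal sketch, with illustrative parameters and an assumed gamma demand distribution:

```python
# Sketch: pick an order quantity from demand samples by minimizing the
# average of F(x, xi) = c*x - p*min(x, xi) - r*max(x - xi, 0) over samples.
# Parameters and the demand distribution are illustrative assumptions.
import numpy as np

c, p, r, u = 1.0, 2.0, 0.5, 100.0                    # cost, price, salvage, capacity
rng = np.random.default_rng(1)
demand = rng.gamma(shape=4.0, scale=10.0, size=50)   # n = 50 demand samples

def avg_loss(x, xi):
    return np.mean(c * x - p * np.minimum(x, xi) - r * np.maximum(x - xi, 0.0))

# The sample-average objective is convex piecewise linear in x, so its
# minimizer over [0, u] lies among the breakpoints {0, u, xi_1, ..., xi_n}.
candidates = np.clip(np.concatenate(([0.0, u], demand)), 0.0, u)
x_hat = min(candidates, key=lambda x: avg_loss(x, demand))
print("estimated order quantity:", round(x_hat, 2))
```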
6. Solving the problem
$F : X \times \mathbb{R}^m \to \mathbb{R}$; $\xi$ is a random variable taking values in $\mathbb{R}^m$.
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,]$$
Solve the problem given access only to $n$ i.i.d. samples of $\xi$.
Natural idea: Given samples $\xi^1, \ldots, \xi^n \in \mathbb{R}^m$, solve the deterministic problem
$$\min_{x \in X} \; \frac{1}{n} \sum_{i=1}^{n} F(x, \xi^i)$$
Stochastic optimizers call this sample average approximation (SAA); machine learners call it empirical risk minimization (ERM).
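A minimal generic sketch of SAA: draw samples, form the sample-average objective, and hand it to a numerical solver. The choice $F(x, \xi) = (x - \xi)^2$ and the solver are illustrative assumptions; for this $F$, the SAA solution is exactly the sample mean, which makes a useful sanity check.

```python
# Generic SAA/ERM sketch: minimize the sample average of F over x.
# X = R here for simplicity; F is an illustrative smooth objective.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
samples = rng.normal(loc=3.0, scale=1.0, size=100)   # n i.i.d. samples of xi

def F(x, xi):
    return (x - xi) ** 2                             # illustrative F(x, xi)

saa_objective = lambda x: np.mean(F(x[0], samples))  # (1/n) sum_i F(x, xi_i)
res = minimize(saa_objective, x0=np.zeros(1))
print("SAA solution:", res.x[0], " sample mean:", samples.mean())
```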
10. Concrete Problem
$F(x, \xi) = \xi^T x$; $X \subseteq \mathbb{R}^d$ is a compact set (e.g., a polytope, or the integer points in a polytope). So $m = d$. $\xi \sim N(\mu, \Sigma)$.
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,] = \min_{x \in X} \; \mathbb{E}_{\xi}[\, \xi^T x \,] = \min_{x \in X} \; \mu^T x$$
Solve the problem given access only to $n$ i.i.d. samples of $\xi$. Important: $\mu$ is unknown.
Sample Average Approximation (SAA):
$$\min_{x \in X} \; \frac{1}{n} \sum_{i=1}^{n} F(x, \xi^i) = \min_{x \in X} \; \bar{\xi}^T x, \quad \text{where } \bar{\xi} := \frac{1}{n} \sum_{i=1}^{n} \xi^i.$$
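For this concrete problem the SAA rule is a single linear program in the sample mean. A sketch, assuming an illustrative polytope $X = \{x \ge 0, \sum_i x_i \le 1\}$:

```python
# Sketch of the SAA rule for the concrete problem: with F(x, xi) = xi^T x,
# SAA reduces to one LP, min { xbar^T x : x in X }, in the sample mean xbar.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
d, n = 4, 30
mu = rng.normal(size=d)                           # unknown in practice
xi = rng.multivariate_normal(mu, np.eye(d), size=n)
xbar = xi.mean(axis=0)

# min xbar^T x  s.t.  sum(x) <= 1, x >= 0   (illustrative polytope X)
res = linprog(c=xbar, A_ub=np.ones((1, d)), b_ub=[1.0], bounds=[(0, None)] * d)
print("SAA solution:", res.x)
```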
13. A quick tour of Statistical Decision Theory
Set of states of nature, modeled by a set $\Theta$.
Set of possible actions to take, modeled by $A$.
In a particular state of nature $\theta \in \Theta$, the performance of any action $a \in A$ is evaluated by a loss function $L(\theta, a)$. Goal: choose an action to minimize loss.
(Partial/incomplete) information about $\theta$ is obtained through a random variable $y$ taking values in a sample space $\chi$. The distribution of $y$ depends on the particular state of nature $\theta$, denoted by $P_\theta$.
Decision rule: takes $y \in \chi$ as input and reports an action $a \in A$. Denoted by $\delta : \chi \to A$.
15. Our problem cast as statistical decision problem
$X \subseteq \mathbb{R}^d$ is a compact set; $\xi \sim N(\mu, I)$.
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,] = \min_{x \in X} \; \mathbb{E}_{\xi}[\, \xi^T x \,] = \min_{x \in X} \; \mu^T x$$
States of nature: $\Theta = \mathbb{R}^d = \{\text{all possible } \mu \in \mathbb{R}^d\}$.
Set of actions: $X \subseteq \mathbb{R}^d$.
Loss function:
$$L(\bar{\mu}, \bar{x}) = \mathbb{E}_{\xi \sim N(\bar{\mu}, I)}[\, F(\bar{x}, \xi) \,] - \min_{x \in X} \mathbb{E}_{\xi \sim N(\bar{\mu}, I)}[\, F(x, \xi) \,] = \bar{\mu}^T \bar{x} - \bar{\mu}^T x(\bar{\mu}),$$
where $x(\bar{\mu}) \in \arg\min_{x \in X} \bar{\mu}^T x$.
Sample space: $\chi = \underbrace{\mathbb{R}^d \times \cdots \times \mathbb{R}^d}_{n \text{ times}}$.
Decision rule: $\delta : \chi \to X$.
SAA: $\delta(\xi^1, \ldots, \xi^n) \in \arg\min\{\bar{\xi}^T x : x \in X\}$, where $\bar{\xi} := \frac{1}{n} \sum_{i=1}^{n} \xi^i$.
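Putting the pieces together, the loss of the SAA action can be computed by solving two LPs: one at the sample mean and one at the true $\mu$. A self-contained sketch on the same illustrative polytope as before:

```python
# Sketch: the loss L(mu, x) = mu^T x - min_{x' in X} mu^T x' of the SAA
# action, on the illustrative polytope X = {x >= 0, sum x <= 1}.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
d, n = 4, 30
mu = rng.normal(size=d)                           # true state of nature
xbar = rng.multivariate_normal(mu, np.eye(d) / n) # one draw of the sample mean

solve = lambda c: linprog(c, A_ub=np.ones((1, d)), b_ub=[1.0],
                          bounds=[(0, None)] * d).x
x_saa, x_opt = solve(xbar), solve(mu)
print("L(mu, x_saa) =", mu @ x_saa - mu @ x_opt)  # always >= 0
```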
25. How does one decide between decision rules?
States of nature $\Theta$, actions $A$, loss function $L : \Theta \times A \to \mathbb{R}$, sample space $\chi$ with distributions $\{P_\theta : \theta \in \Theta\}$.
Given a decision rule $\delta : \chi \to A$, define the risk function of this decision rule as
$$R_\delta(\theta) := \mathbb{E}_{y \sim P_\theta}[\, L(\theta, \delta(y)) \,]$$
We say that a decision rule $\delta'$ dominates a decision rule $\delta$ if $R_{\delta'}(\theta) \le R_\delta(\theta)$ for all $\theta \in \Theta$, and $R_{\delta'}(\theta^*) < R_\delta(\theta^*)$ for some $\theta^* \in \Theta$.
If a decision rule $\delta$ is not dominated by any other decision rule, we say that $\delta$ is admissible. Otherwise, it is inadmissible.
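Risk functions can be estimated by Monte Carlo, and comparing two simple rules already shows why domination is a strong requirement. A sketch in the scalar setting $y \sim N(\theta, 1)$ with squared-error loss, comparing the rules $\delta_1(y) = y$ and $\delta_2(y) = y/2$ (both rules and the setting are illustrative; neither dominates the other):

```python
# Monte Carlo estimates of the risk curves R_delta(theta) for two rules:
# delta1 = y and delta2 = y/2. delta2 wins near theta = 0, delta1 wins
# for large |theta|, so neither rule dominates the other.
import numpy as np

rng = np.random.default_rng(5)
trials = 200_000
for theta in [0.0, 0.5, 1.0, 2.0]:
    y = rng.normal(theta, 1.0, size=trials)
    risk1 = np.mean((y - theta) ** 2)      # approx 1 for every theta
    risk2 = np.mean((y / 2 - theta) ** 2)  # approx 1/4 + theta^2 / 4
    print(f"theta={theta:4.1f}  R[y]={risk1:.3f}  R[y/2]={risk2:.3f}")
```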
30. Is the Sample Average Approximation (SAA) rule admissible?
31. Admissibility in stochastic optimization
Stochastic optimization setup: $F : X \times \mathbb{R}^m \to \mathbb{R}$; $\xi$ is a random variable in $\mathbb{R}^m$. We want to solve
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,]$$
with access to $n$ i.i.d. samples of $\xi$.
Statistical decision theory view: $\xi \sim N(\mu, I)$; states of nature $\Theta = \mathbb{R}^m = \{\text{all possible } \mu \in \mathbb{R}^m\}$; set of actions $A = X$; sample space $\chi = \underbrace{\mathbb{R}^m \times \cdots \times \mathbb{R}^m}_{n \text{ times}}$.
Loss function:
$$L(\bar{\mu}, \bar{x}) = \mathbb{E}_{\xi \sim N(\bar{\mu}, I)}[\, F(\bar{x}, \xi) \,] - \min_{x \in X} \mathbb{E}_{\xi \sim N(\bar{\mu}, I)}[\, F(x, \xi) \,]$$
Given a decision rule $\delta : \chi \to X$, the risk of $\delta$ is
$$R_\delta(\mu) := \mathbb{E}_{\xi^1, \ldots, \xi^n}[\, L(\mu, \delta(\xi^1, \ldots, \xi^n)) \,]$$
Sample Average Approximation (SAA):
$$\min_{x \in X} \; \frac{1}{n} \sum_{i=1}^{n} F(x, \xi^i)$$
Sample Average Approximation (SAA) can be inadmissible!!
35. Inadmissibility of SAA: Stein's Paradox
$F(x, \xi) = \|x - \xi\|^2$, $X = \mathbb{R}^d$, $\xi \sim N(\mu, I)$.
$$\min_{x \in \mathbb{R}^d} \; \mathbb{E}_{\xi \sim N(\mu, I)}[\, \|x - \xi\|^2 \,] = \min_{x \in \mathbb{R}^d} \; \|x - \mu\|^2 + \mathbb{V}[\xi]$$
Optimal solution: $x(\bar{\mu}) = \bar{\mu}$. Optimal value: $\mathbb{V}[\xi] = d$. Hence
$$L(\bar{\mu}, \bar{x}) = \|\bar{x} - \bar{\mu}\|^2.$$
Sample Average Approximation (SAA):
$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \|x - \xi^i\|^2 \quad \Longrightarrow \quad \delta_{SAA}(\xi^1, \ldots, \xi^n) = \bar{\xi} := \frac{1}{n} \sum_{i=1}^{n} \xi^i.$$
So $\delta_{SAA}$ reports the sample mean, and the loss is exactly squared estimation error. By Stein's classical result, the James–Stein estimator dominates the sample mean when $d \ge 3$, so SAA is inadmissible here.
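The domination is easy to see numerically: with loss $\|\bar{x} - \bar{\mu}\|^2$ and $\bar{\xi} \sim N(\mu, I/n)$, James–Stein shrinkage beats the sample mean pointwise for $d \ge 3$. A Monte Carlo sketch ($\mu$, $d$, and $n$ are illustrative choices):

```python
# Sketch: James-Stein shrinkage dominates the SAA rule (the sample mean)
# under the squared-error loss above when d >= 3.
import numpy as np

rng = np.random.default_rng(6)
d, n, trials = 10, 5, 100_000
mu = np.ones(d)                                   # any fixed state of nature
sigma2 = 1.0 / n                                  # variance of each coord of xbar

xbar = mu + rng.normal(scale=np.sqrt(sigma2), size=(trials, d))
shrink = 1.0 - sigma2 * (d - 2) / np.sum(xbar**2, axis=1)
js = shrink[:, None] * xbar                       # James-Stein estimator

risk_saa = np.mean(np.sum((xbar - mu) ** 2, axis=1))  # approx d * sigma2 = 2.0
risk_js = np.mean(np.sum((js - mu) ** 2, axis=1))     # strictly smaller
print(f"risk of SAA (sample mean): {risk_saa:.3f}")
print(f"risk of James-Stein:       {risk_js:.3f}")
```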
39. Inadmissibility of SAA: Stein's Paradox
This was generalized to an arbitrary convex quadratic objective with an uncertain linear term in Davarnia and Cornuéjols 2018. Follow-up work from a Bayesian perspective appears in Davarnia, Kocuk and Cornuéjols 2018.
40. A class of problems with no Stein's paradox
THEOREM (Basu–Nguyen–Sun 2018). Consider the problem of optimizing an uncertain linear objective $\xi \sim N(\mu, I)$ over a fixed compact set $X \subseteq \mathbb{R}^d$:
$$\min_{x \in X} \; \mathbb{E}_{\xi \sim N(\mu, I)}[\, \xi^T x \,]$$
The Sample Average Approximation (SAA) rule is admissible.
43. Main technical ideas/tools
Sufficient statistic: Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ be a family of distributions for a random variable $y$ in a sample space $\chi$. A sufficient statistic for this family is a function $T : \chi \to \mathcal{T}$ such that the conditional probability $P(y \mid T = t)$ does not depend on $\theta$.
FACT: Let $\chi = \underbrace{\mathbb{R}^d \times \cdots \times \mathbb{R}^d}_{n \text{ times}}$ and $\mathcal{P} = \{\underbrace{N(\mu, I) \times \cdots \times N(\mu, I)}_{n \text{ times}} : \mu \in \mathbb{R}^d\}$, i.e., $(\xi^1, \ldots, \xi^n) \in \chi$ are i.i.d. samples from the normal distribution $N(\mu, I)$. Then $T(\xi^1, \ldots, \xi^n) = \bar{\xi} := \frac{1}{n} \sum_{i=1}^{n} \xi^i$ is a sufficient statistic for $\mathcal{P}$.
THEOREM (Rao–Blackwell, 1940s). If the loss function is convex in the action, then for any decision rule $\delta$ there exists a rule $\delta'$ that is a function only of a sufficient statistic and satisfies $R_{\delta'} \le R_\delta$.
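The Rao–Blackwell step is essentially one application of Jensen's inequality; a sketch (not the talk's proof), assuming the action set is convex so that the conditional expectation is itself a valid action:

```latex
% Sketch. Define, for a sufficient statistic T,
%   delta'(t) := E[ delta(y) | T = t ],
% a valid rule because sufficiency makes this conditional expectation
% independent of theta. If L(theta, .) is convex, Jensen's inequality gives
\[
  L\bigl(\theta,\, \delta'(t)\bigr)
    \;=\; L\bigl(\theta,\, \mathbb{E}[\delta(y) \mid T = t]\bigr)
    \;\le\; \mathbb{E}\bigl[\, L(\theta, \delta(y)) \mid T = t \,\bigr].
\]
% Taking expectations over T on both sides yields
% R_{delta'}(theta) <= R_delta(theta) for every theta.
```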
46. Main technical ideas/tools
For any decision rule $\delta$, define the function
$$F(\mu) = R_\delta(\mu) - R_{\delta_{SAA}}(\mu).$$
It suffices to show that there exists $\hat{\mu} \in \mathbb{R}^d$ such that $F(\hat{\mu}) > 0$.
First observe: $F(0) = 0$.
Then compute $\nabla^2 F(0)$ and show it has a strictly positive eigenvalue.
Use a fact from probability theory: for any Lebesgue integrable function $f : \mathbb{R}^d \to \mathbb{R}$, the map
$$\mu \mapsto \mathbb{E}_{y \sim N(\mu, \Sigma)}[\, f(y) \,] := \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \int_{\mathbb{R}^d} f(y) \exp\left( -\tfrac{1}{2} (y - \mu)^T \Sigma^{-1} (y - \mu) \right) dy$$
has derivatives of all orders, and these can be computed by taking the derivative under the integral sign.
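For instance, the first derivative under the integral sign gives the familiar Gaussian score identity; a sketch for scalar-valued $f$ (a standard computation, not taken from the talk):

```latex
% Differentiating the Gaussian density in mu pulls out the score
% Sigma^{-1}(y - mu), so
\[
  \nabla_{\mu}\, \mathbb{E}_{y \sim N(\mu,\Sigma)}\bigl[ f(y) \bigr]
  \;=\; \mathbb{E}_{y \sim N(\mu,\Sigma)}\bigl[\, f(y)\, \Sigma^{-1} (y - \mu) \,\bigr].
\]
% Differentiating once more in mu gives the entries of the Hessian
% needed for the eigenvalue argument at mu = 0.
```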
50. Open Questions
- What about nonlinear objectives over compact feasible regions? For example, what if $F(x, \xi) = x^T Q x + \xi^T x$ for some fixed PSD matrix $Q$, and $X$ is a compact (convex) set?
- What about piecewise linear objectives $F(x, \xi)$? Recall the (news) vendor problem.
- What about objectives coming from machine learning problems, such as neural network training with squared or logistic loss (admissibility of "empirical risk minimization")? Maybe this depends on the hypothesis class that is being learnt?
METATHEOREM (from Gérard Cornuéjols): Admissible if and only if the feasible region is bounded!?