We consider stochastic optimization problems arising in deep learning and other areas of statistical and machine learning from a statistical decision theory perspective. In particular, we investigate the admissibility (in the sense of decision theory) of the sample average solution estimator. We show that this estimator can be inadmissible in very simple settings, a phenomenon derived from the classical James–Stein estimator. However, for many problems of interest, the sample average estimator is indeed admissible. We end with several open questions in this research direction.
1. Admissibility of solution estimators to stochastic optimization problems
Amitabh Basu
Joint work with Tu Nguyen and Ao Sun
Foundations of Deep Learning, Opening Workshop, SAMSI, Durham, August 2019
2. A general stochastic optimization problem
$F : X \times \mathbb{R}^m \to \mathbb{R}$; $\xi$ is a random variable taking values in $\mathbb{R}^m$.
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,]$$
3. A general stochastic optimization problem
Example:
Supervised machine learning: One sees samples $(z, y) \in \mathbb{R}^n \times \mathbb{R}$ of labeled data from some (joint) distribution, and one aims to find a function $f \in \mathcal{F}$ in a hypothesis class $\mathcal{F}$ that minimizes the expected loss $\mathbb{E}_{(z,y)}[\, \ell(f(z), y) \,]$, where $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is some loss function. Then $X = \mathcal{F}$, $m = n + 1$, $\xi = (z, y)$, and $F(f, (z, y)) = \ell(f(z), y)$.
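To make the mapping concrete, here is a minimal sketch of this setup, assuming a linear hypothesis class and squared loss, and anticipating the sample-average idea formalized on the later slides; all data, dimensions, and names are illustrative.

```python
# Minimal sketch: empirical risk minimization for a linear hypothesis class
# with squared loss, as an instance of the general framework above.
# Here X = F = {f_w(z) = w^T z : w in R^n}, xi = (z, y), and
# F(f_w, (z, y)) = (w^T z - y)^2. All names and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples = 5, 200
w_true = rng.normal(size=n_features)
Z = rng.normal(size=(n_samples, n_features))             # samples of z
y = Z @ w_true + rng.normal(scale=0.1, size=n_samples)   # noisy labels

# Minimize the empirical risk (1/n) sum_i (w^T z_i - y_i)^2 in closed form.
w_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
print("estimation error:", np.linalg.norm(w_hat - w_true))
```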
4. A general stochastic optimization problem
Example:
(News) vendor problem: A (news) vendor buys some units of a product (newspapers) from a supplier at a cost of $c > 0$ dollars/unit; at most $u$ units are available. Demand for the product is stochastic. The product is sold at price $p > c$ dollars/unit. At the end of the day, the vendor can return unsold product to the supplier at $r < c$ dollars/unit. Find the number of units to buy that maximizes the expected profit (equivalently, minimizes the expected loss). Here $m = 1$, $X = [0, u]$, and
$$F(x, \xi) = cx - p \min\{x, \xi\} - r \max\{x - \xi, 0\}.$$
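As an illustration, one can estimate a good order quantity from demand samples by minimizing the average loss over the samples (the idea formalized on the next slides). A minimal sketch, with illustrative parameters and an assumed gamma demand distribution:

```python
# Sketch: pick an order quantity from demand samples by minimizing the
# average of F(x, xi) = c*x - p*min(x, xi) - r*max(x - xi, 0) over samples.
# Parameters and the demand distribution are illustrative assumptions.
import numpy as np

c, p, r, u = 1.0, 2.0, 0.5, 100.0                    # cost, price, salvage, capacity
rng = np.random.default_rng(1)
demand = rng.gamma(shape=4.0, scale=10.0, size=50)   # n = 50 demand samples

def avg_loss(x, xi):
    return np.mean(c * x - p * np.minimum(x, xi) - r * np.maximum(x - xi, 0.0))

# The sample-average objective is convex piecewise linear in x, so its
# minimizer over [0, u] lies among the breakpoints {0, u, xi_1, ..., xi_n}.
candidates = np.clip(np.concatenate(([0.0, u], demand)), 0.0, u)
x_hat = min(candidates, key=lambda x: avg_loss(x, demand))
print("estimated order quantity:", round(x_hat, 2))
```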
6. Solving the problem
$F : X \times \mathbb{R}^m \to \mathbb{R}$; $\xi$ is a random variable taking values in $\mathbb{R}^m$.
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,]$$
Solve the problem given access only to $n$ i.i.d. samples of $\xi$.
Natural idea: Given samples $\xi^1, \ldots, \xi^n \in \mathbb{R}^m$, solve the deterministic problem
$$\min_{x \in X} \; \frac{1}{n} \sum_{i=1}^{n} F(x, \xi^i)$$
Stochastic optimizers call this sample average approximation (SAA); machine learners call it empirical risk minimization (ERM).
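A minimal generic sketch of SAA: draw samples, form the sample-average objective, and hand it to a numerical solver. The choice $F(x, \xi) = (x - \xi)^2$ and the solver are illustrative assumptions; for this $F$, the SAA solution is exactly the sample mean, which makes a useful sanity check.

```python
# Generic SAA/ERM sketch: minimize the sample average of F over x.
# X = R here for simplicity; F is an illustrative smooth objective.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
samples = rng.normal(loc=3.0, scale=1.0, size=100)   # n i.i.d. samples of xi

def F(x, xi):
    return (x - xi) ** 2                             # illustrative F(x, xi)

saa_objective = lambda x: np.mean(F(x[0], samples))  # (1/n) sum_i F(x, xi_i)
res = minimize(saa_objective, x0=np.zeros(1))
print("SAA solution:", res.x[0], " sample mean:", samples.mean())
```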
10. Concrete Problem
$F(x, \xi) = \xi^T x$; $X \subseteq \mathbb{R}^d$ is a compact set (e.g., a polytope, or the integer points in a polytope). So $m = d$. $\xi \sim N(\mu, \Sigma)$.
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,] = \min_{x \in X} \; \mathbb{E}_{\xi}[\, \xi^T x \,] = \min_{x \in X} \; \mu^T x$$
Solve the problem given access only to $n$ i.i.d. samples of $\xi$. Important: $\mu$ is unknown.
Sample Average Approximation (SAA):
$$\min_{x \in X} \; \frac{1}{n} \sum_{i=1}^{n} F(x, \xi^i) = \min_{x \in X} \; \bar{\xi}^T x, \quad \text{where } \bar{\xi} := \frac{1}{n} \sum_{i=1}^{n} \xi^i.$$
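For this concrete problem the SAA rule is a single linear program in the sample mean. A sketch, assuming an illustrative polytope $X = \{x \ge 0, \sum_i x_i \le 1\}$:

```python
# Sketch of the SAA rule for the concrete problem: with F(x, xi) = xi^T x,
# SAA reduces to one LP, min { xbar^T x : x in X }, in the sample mean xbar.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
d, n = 4, 30
mu = rng.normal(size=d)                           # unknown in practice
xi = rng.multivariate_normal(mu, np.eye(d), size=n)
xbar = xi.mean(axis=0)

# min xbar^T x  s.t.  sum(x) <= 1, x >= 0   (illustrative polytope X)
res = linprog(c=xbar, A_ub=np.ones((1, d)), b_ub=[1.0], bounds=[(0, None)] * d)
print("SAA solution:", res.x)
```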
13. A quick tour of Statistical Decision Theory
Set of states of nature, modeled by a set $\Theta$.
Set of possible actions to take, modeled by $A$.
In a particular state of nature $\theta \in \Theta$, the performance of any action $a \in A$ is evaluated by a loss function $L(\theta, a)$. Goal: choose an action to minimize loss.
(Partial/incomplete) information about $\theta$ is obtained through a random variable $y$ taking values in a sample space $\chi$. The distribution of $y$ depends on the particular state of nature $\theta$, denoted by $P_\theta$.
Decision rule: takes $y \in \chi$ as input and reports an action $a \in A$. Denoted by $\delta : \chi \to A$.
15. Our problem cast as statistical decision problem
$X \subseteq \mathbb{R}^d$ is a compact set; $\xi \sim N(\mu, I)$.
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,] = \min_{x \in X} \; \mathbb{E}_{\xi}[\, \xi^T x \,] = \min_{x \in X} \; \mu^T x$$
States of nature: $\Theta = \mathbb{R}^d = \{\text{all possible } \mu \in \mathbb{R}^d\}$.
Set of actions: $X \subseteq \mathbb{R}^d$.
Loss function:
$$L(\bar{\mu}, \bar{x}) = \mathbb{E}_{\xi \sim N(\bar{\mu}, I)}[\, F(\bar{x}, \xi) \,] - \min_{x \in X} \mathbb{E}_{\xi \sim N(\bar{\mu}, I)}[\, F(x, \xi) \,] = \bar{\mu}^T \bar{x} - \bar{\mu}^T x(\bar{\mu}),$$
where $x(\bar{\mu}) \in \arg\min_{x \in X} \bar{\mu}^T x$.
Sample space: $\chi = \underbrace{\mathbb{R}^d \times \cdots \times \mathbb{R}^d}_{n \text{ times}}$.
Decision rule: $\delta : \chi \to X$.
SAA: $\delta(\xi^1, \ldots, \xi^n) \in \arg\min\{\bar{\xi}^T x : x \in X\}$, where $\bar{\xi} := \frac{1}{n} \sum_{i=1}^{n} \xi^i$.
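Putting the pieces together, the loss of the SAA action can be computed by solving two LPs: one at the sample mean and one at the true $\mu$. A self-contained sketch on the same illustrative polytope as before:

```python
# Sketch: the loss L(mu, x) = mu^T x - min_{x' in X} mu^T x' of the SAA
# action, on the illustrative polytope X = {x >= 0, sum x <= 1}.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
d, n = 4, 30
mu = rng.normal(size=d)                           # true state of nature
xbar = rng.multivariate_normal(mu, np.eye(d) / n) # one draw of the sample mean

solve = lambda c: linprog(c, A_ub=np.ones((1, d)), b_ub=[1.0],
                          bounds=[(0, None)] * d).x
x_saa, x_opt = solve(xbar), solve(mu)
print("L(mu, x_saa) =", mu @ x_saa - mu @ x_opt)  # always >= 0
```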
25. How does one decide between decision rules?
States of nature $\Theta$, actions $A$, loss function $L : \Theta \times A \to \mathbb{R}$, sample space $\chi$ with distributions $\{P_\theta : \theta \in \Theta\}$.
Given a decision rule $\delta : \chi \to A$, define the risk function of this decision rule as
$$R_\delta(\theta) := \mathbb{E}_{y \sim P_\theta}[\, L(\theta, \delta(y)) \,]$$
We say that a decision rule $\delta'$ dominates a decision rule $\delta$ if $R_{\delta'}(\theta) \le R_\delta(\theta)$ for all $\theta \in \Theta$, and $R_{\delta'}(\theta^*) < R_\delta(\theta^*)$ for some $\theta^* \in \Theta$.
If a decision rule $\delta$ is not dominated by any other decision rule, we say that $\delta$ is admissible. Otherwise, it is inadmissible.
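Risk functions can be estimated by Monte Carlo, and comparing two simple rules already shows why domination is a strong requirement. A sketch in the scalar setting $y \sim N(\theta, 1)$ with squared-error loss, comparing the rules $\delta_1(y) = y$ and $\delta_2(y) = y/2$ (both rules and the setting are illustrative; neither dominates the other):

```python
# Monte Carlo estimates of the risk curves R_delta(theta) for two rules:
# delta1 = y and delta2 = y/2. delta2 wins near theta = 0, delta1 wins
# for large |theta|, so neither rule dominates the other.
import numpy as np

rng = np.random.default_rng(5)
trials = 200_000
for theta in [0.0, 0.5, 1.0, 2.0]:
    y = rng.normal(theta, 1.0, size=trials)
    risk1 = np.mean((y - theta) ** 2)      # approx 1 for every theta
    risk2 = np.mean((y / 2 - theta) ** 2)  # approx 1/4 + theta^2 / 4
    print(f"theta={theta:4.1f}  R[y]={risk1:.3f}  R[y/2]={risk2:.3f}")
```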
30. Is the Sample Average Approximation (SAA) rule admissible?
31. Admissibility in stochastic optimization
Stochastic optimization setup: $F : X \times \mathbb{R}^m \to \mathbb{R}$; $\xi$ is a random variable in $\mathbb{R}^m$. We want to solve
$$\min_{x \in X} \; \mathbb{E}_{\xi}[\, F(x, \xi) \,]$$
with access to $n$ i.i.d. samples of $\xi$.
Statistical decision theory view: $\xi \sim N(\mu, I)$; states of nature $\Theta = \mathbb{R}^m = \{\text{all possible } \mu \in \mathbb{R}^m\}$; set of actions $A = X$; sample space $\chi = \underbrace{\mathbb{R}^m \times \cdots \times \mathbb{R}^m}_{n \text{ times}}$.
Loss function:
$$L(\bar{\mu}, \bar{x}) = \mathbb{E}_{\xi \sim N(\bar{\mu}, I)}[\, F(\bar{x}, \xi) \,] - \min_{x \in X} \mathbb{E}_{\xi \sim N(\bar{\mu}, I)}[\, F(x, \xi) \,]$$
Given a decision rule $\delta : \chi \to X$, the risk of $\delta$ is
$$R_\delta(\mu) := \mathbb{E}_{\xi^1, \ldots, \xi^n}[\, L(\mu, \delta(\xi^1, \ldots, \xi^n)) \,]$$
Sample Average Approximation (SAA):
$$\min_{x \in X} \; \frac{1}{n} \sum_{i=1}^{n} F(x, \xi^i)$$
Sample Average Approximation (SAA) can be inadmissible!!
35. Inadmissibility of SAA: Stein's Paradox
$F(x, \xi) = \|x - \xi\|^2$, $X = \mathbb{R}^d$, $\xi \sim N(\mu, I)$.
$$\min_{x \in \mathbb{R}^d} \; \mathbb{E}_{\xi \sim N(\mu, I)}[\, \|x - \xi\|^2 \,] = \min_{x \in \mathbb{R}^d} \; \|x - \mu\|^2 + \mathbb{V}[\xi]$$
Optimal solution: $x(\bar{\mu}) = \bar{\mu}$. Optimal value: $\mathbb{V}[\xi] = d$. Hence
$$L(\bar{\mu}, \bar{x}) = \|\bar{x} - \bar{\mu}\|^2.$$
Sample Average Approximation (SAA):
$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \|x - \xi^i\|^2 \quad \Longrightarrow \quad \delta_{SAA}(\xi^1, \ldots, \xi^n) = \bar{\xi} := \frac{1}{n} \sum_{i=1}^{n} \xi^i.$$
So $\delta_{SAA}$ reports the sample mean, and the loss is exactly squared estimation error. By Stein's classical result, the James–Stein estimator dominates the sample mean when $d \ge 3$, so SAA is inadmissible here.
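The domination is easy to see numerically: with loss $\|\bar{x} - \bar{\mu}\|^2$ and $\bar{\xi} \sim N(\mu, I/n)$, James–Stein shrinkage beats the sample mean pointwise for $d \ge 3$. A Monte Carlo sketch ($\mu$, $d$, and $n$ are illustrative choices):

```python
# Sketch: James-Stein shrinkage dominates the SAA rule (the sample mean)
# under the squared-error loss above when d >= 3.
import numpy as np

rng = np.random.default_rng(6)
d, n, trials = 10, 5, 100_000
mu = np.ones(d)                                   # any fixed state of nature
sigma2 = 1.0 / n                                  # variance of each coord of xbar

xbar = mu + rng.normal(scale=np.sqrt(sigma2), size=(trials, d))
shrink = 1.0 - sigma2 * (d - 2) / np.sum(xbar**2, axis=1)
js = shrink[:, None] * xbar                       # James-Stein estimator

risk_saa = np.mean(np.sum((xbar - mu) ** 2, axis=1))  # approx d * sigma2 = 2.0
risk_js = np.mean(np.sum((js - mu) ** 2, axis=1))     # strictly smaller
print(f"risk of SAA (sample mean): {risk_saa:.3f}")
print(f"risk of James-Stein:       {risk_js:.3f}")
```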
39. Inadmissibility of SAA: Stein's Paradox
This was generalized to an arbitrary convex quadratic objective with an uncertain linear term in Davarnia and Cornuéjols 2018. Follow-up work from a Bayesian perspective appears in Davarnia, Kocuk and Cornuéjols 2018.
40. A class of problems with no Stein's paradox
THEOREM (Basu–Nguyen–Sun 2018). Consider the problem of optimizing an uncertain linear objective $\xi \sim N(\mu, I)$ over a fixed compact set $X \subseteq \mathbb{R}^d$:
$$\min_{x \in X} \; \mathbb{E}_{\xi \sim N(\mu, I)}[\, \xi^T x \,]$$
The Sample Average Approximation (SAA) rule is admissible.
43. Main technical ideas/tools
Sufficient statistic: Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ be a family of distributions for a random variable $y$ in a sample space $\chi$. A sufficient statistic for this family is a function $T : \chi \to \mathcal{T}$ such that the conditional probability $P(y \mid T = t)$ does not depend on $\theta$.
FACT: Let $\chi = \underbrace{\mathbb{R}^d \times \cdots \times \mathbb{R}^d}_{n \text{ times}}$ and $\mathcal{P} = \{\underbrace{N(\mu, I) \times \cdots \times N(\mu, I)}_{n \text{ times}} : \mu \in \mathbb{R}^d\}$, i.e., $(\xi^1, \ldots, \xi^n) \in \chi$ are i.i.d. samples from the normal distribution $N(\mu, I)$. Then $T(\xi^1, \ldots, \xi^n) = \bar{\xi} := \frac{1}{n} \sum_{i=1}^{n} \xi^i$ is a sufficient statistic for $\mathcal{P}$.
THEOREM (Rao–Blackwell, 1940s). If the loss function is convex in the action, then for any decision rule $\delta$ there exists a rule $\delta'$ that is a function only of a sufficient statistic and satisfies $R_{\delta'} \le R_\delta$.
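The Rao–Blackwell step is essentially one application of Jensen's inequality; a sketch (not the talk's proof), assuming the action set is convex so that the conditional expectation is itself a valid action:

```latex
% Sketch. Define, for a sufficient statistic T,
%   delta'(t) := E[ delta(y) | T = t ],
% a valid rule because sufficiency makes this conditional expectation
% independent of theta. If L(theta, .) is convex, Jensen's inequality gives
\[
  L\bigl(\theta,\, \delta'(t)\bigr)
    \;=\; L\bigl(\theta,\, \mathbb{E}[\delta(y) \mid T = t]\bigr)
    \;\le\; \mathbb{E}\bigl[\, L(\theta, \delta(y)) \mid T = t \,\bigr].
\]
% Taking expectations over T on both sides yields
% R_{delta'}(theta) <= R_delta(theta) for every theta.
```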
46. Main technical ideas/tools
For any decision rule $\delta$, define the function
$$F(\mu) = R_\delta(\mu) - R_{\delta_{SAA}}(\mu).$$
It suffices to show that there exists $\hat{\mu} \in \mathbb{R}^d$ such that $F(\hat{\mu}) > 0$.
First observe: $F(0) = 0$.
Then compute $\nabla^2 F(0)$ and show it has a strictly positive eigenvalue.
Use a fact from probability theory: for any Lebesgue integrable function $f : \mathbb{R}^d \to \mathbb{R}$, the map
$$\mu \mapsto \mathbb{E}_{y \sim N(\mu, \Sigma)}[\, f(y) \,] := \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \int_{\mathbb{R}^d} f(y) \exp\left( -\tfrac{1}{2} (y - \mu)^T \Sigma^{-1} (y - \mu) \right) dy$$
has derivatives of all orders, and these can be computed by taking the derivative under the integral sign.
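For instance, the first derivative under the integral sign gives the familiar Gaussian score identity; a sketch for scalar-valued $f$ (a standard computation, not taken from the talk):

```latex
% Differentiating the Gaussian density in mu pulls out the score
% Sigma^{-1}(y - mu), so
\[
  \nabla_{\mu}\, \mathbb{E}_{y \sim N(\mu,\Sigma)}\bigl[ f(y) \bigr]
  \;=\; \mathbb{E}_{y \sim N(\mu,\Sigma)}\bigl[\, f(y)\, \Sigma^{-1} (y - \mu) \,\bigr].
\]
% Differentiating once more in mu gives the entries of the Hessian
% needed for the eigenvalue argument at mu = 0.
```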
50. Open Questions
- What about nonlinear objectives over compact feasible regions? For example, what if $F(x, \xi) = x^T Q x + \xi^T x$ for some fixed PSD matrix $Q$, and $X$ is a compact (convex) set?
- What about piecewise linear objectives $F(x, \xi)$? Recall the (news) vendor problem.
- What about objectives coming from machine learning problems, such as neural network training with squared or logistic loss (admissibility of "empirical risk minimization")? Maybe this depends on the hypothesis class that is being learnt?
METATHEOREM (from Gérard Cornuéjols): Admissible if and only if the feasible region is bounded!?