Stochastic Decomposition


        Suvrajeet Sen
 Lecture at the Winter School
          April 2013




The Plan – Part I (70 minutes)
• Review of Basics: 2-stage Stochastic Linear Programs
  (SLP)
• Deterministic Algorithms for 2-stage SLP
   – Subgradients of the Expected Recourse Function
   – Subgradient Methods
   – Deterministic Decomposition (Kelley/Benders/L-shaped)
• Stochastic Algorithms for 2-stage SLP
   – Stochastic Quasi-gradient Methods
   – Sample Average Approximation
   – Stochastic Decomposition (SD) Algorithm
• Computational Results
Review of Basics: 2-stage
Stochastic Linear Programs



The commonly stated 2-stage SLP
 (will be stated again, as needed)




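A standard way to write the program this slide refers to, with notation (costs c and g, first-stage set X, recourse function h) assumed here for concreteness:

$$\min_{x \in X} \; c^\top x + \mathbb{E}\big[h(x, \tilde{\omega})\big], \qquad X = \{x : Ax \le b,\; x \ge 0\}.$$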
The Recourse Function and its Expectation




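In the same assumed notation, the recourse function and its expectation can be written as

$$h(x,\omega) = \min_{y \ge 0} \{\, g^\top y : W y \ge r(\omega) - T(\omega)\,x \,\}, \qquad \mathbb{E}[h(x,\tilde{\omega})] = \sum_{\omega \in \Omega} p(\omega)\, h(x,\omega),$$

together with the LP dual form that later slides rely on for cut generation:

$$h(x,\omega) = \max \{\, \pi^\top \big(r(\omega) - T(\omega)\,x\big) : W^\top \pi \le g \,\}.$$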
Wait a minute!




One Approach: Use Sampling




Deterministic Algorithms for
2-stage SLP

Subgradient Method, Kelley/Benders/L-shaped Method



Subgradient Method
          (Shor/Polyak/Nesterov/Nemirovski…)
• At iteration k, let x^k be given. (An interchange of expectation and
  subdifferentiation is required here.)
• Let g^k be a subgradient of the objective at x^k.
• Then x^{k+1} = Π_X(x^k − α_k g^k), where Π_X denotes the
  projection operator onto the set X of decisions, and α_k > 0 is the step size.
• As mentioned earlier, this subgradient is very difficult to compute!
• No concern about loss of convexity due to sampling
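A minimal Python sketch of this iteration on a toy objective with a cheap subgradient oracle; the objective, the box set X, and the 1/k step rule are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def project_box(x, lo, hi):
    """Stand-in for the projection operator Pi_X, with X a box."""
    return np.clip(x, lo, hi)

def subgradient_method(subgrad, x0, lo, hi, iters=500):
    """Iterate x_{k+1} = Pi_X(x_k - alpha_k * g_k) with alpha_k = 1/k."""
    x = np.array(x0, dtype=float)
    for k in range(1, iters + 1):
        g = subgrad(x)                 # any g_k in the subdifferential at x_k
        x = project_box(x - g / k, lo, hi)
    return x

# Toy convex objective f(x) = ||x - a||_1, whose subgradient is sign(x - a).
a = np.array([0.3, -0.7])
print(subgradient_method(lambda x: np.sign(x - a), [0.0, 0.0], -1.0, 1.0))
```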
Strengths and Weaknesses of Subgradient Methods

• Strengths
   – Easy to Program, no master problem, and easily parallelizable
   – Recently there have been improvements in step size rules
• Weaknesses
   – Difficult to establish lower bounds (and hence optimality guarantees, in general)
   – Traditional step-size rules (e.g., constant/k) need a lot of fine-tuning
   – Convergence
       • Method makes good progress early on, but like other steepest-descent type
         methods, there is zig-zagging behavior
   – Need ways to stop the algorithm
       • Difficult because upper and lower bounds on objective values are difficult to
         obtain



Kelley’s Cutting Plane/Benders’/L-shaped
        Decomposition for 2-stage SLP
• Let ω̃ be a random variable defined on a
  probability space (Ω, 𝒜, P)
• Then a stochastic program is given by

     min_{x ∈ X} c^⊤x + E[h(x, ω̃)]
KBL Decomposition (J. Benders/Van Slyke/Wets)

• At iteration k, let x^k and the cuts from earlier iterations be given. Recall
  the dual form
     h(x, ω) = max { π^⊤( r(ω) − T(ω)x ) : W^⊤π ≤ g }

• Then define the cut generated at x^k, with π^k(ω) an optimal dual
  solution for each outcome ω:
     ℓ_k(x) = E[ π^k(ω̃)^⊤ ( r(ω̃) − T(ω̃)x ) ]

• Let f_k(x) = c^⊤x + max{ ℓ_t(x) : t = 1, …, k }
• Then x^{k+1} ∈ argmin_{x ∈ X} f_k(x)
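To make the "expensive operation" concrete, here is a sketch of the cut computation for a tiny made-up instance with three scenarios, using scipy's LP solver to recover the second-stage duals (all instance data are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Tiny illustrative instance: h(x, d) = min { y : y >= d - x, y >= 0 },
# i.e., g = [1], W = [[1]], T = [[1]], r(omega) = [d];  E[h](x) = E[max(d - x, 0)].
scenarios = [(0.25, 1.0), (0.5, 2.0), (0.25, 3.0)]   # (probability, demand d)
g, W, T = np.array([1.0]), np.array([[1.0]]), np.array([[1.0]])

def kbl_cut(x):
    """Solve one second-stage LP per scenario; aggregate the duals into a cut
       E[h](x') >= alpha + beta @ x' that is tight at the current x."""
    alpha, beta = 0.0, np.zeros_like(x)
    for p, d in scenarios:
        r = np.array([d])
        res = linprog(g, A_ub=-W, b_ub=-(r - T @ x),
                      bounds=[(0, None)], method="highs")
        pi = -res.ineqlin.marginals      # dual vertex: pi >= 0, W^T pi <= g
        alpha += p * (pi @ r)
        beta += -p * (T.T @ pi)
    return alpha, beta

a, b = kbl_cut(np.array([1.5]))
print(a, b)   # supporting hyperplane of E[h] at x = 1.5
```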
Comparing Subgradient Method and KBL Decomposition

• Both evaluate subgradients of the expected recourse function

• This is an expensive operation (it requires solving as many second-stage
LPs as there are scenarios)
• Step size in KBL is implicit (user need not worry)
• Master program grows without bound and looks unstable in
the early rounds
• Stopping rule is automatic (Upper Bound – Lower Bound ≤ ε)
• KBL’s use of master can be a bottleneck for parallelization


KBL Graphical Illustration

[Figure: the expected recourse function together with its outer approximations f_{k−1} and f_k built from cuts]
Regularization of the Master Problem
           (Ruszczynski/Kiwiel/Lemarechal …)

• Addresses the following issue:
– The master program grows without bound and looks unstable in
  the early rounds
• Include an incumbent x̄^k and a proximity measure
from the incumbent, using σ > 0 as a weight:
     min_x f_k(x) + (σ/2) ‖x − x̄^k‖²
• Particularly useful for Stochastic Decomposition.
Stochastic Algorithms for
2-stage SLP

    Stochastic Quasi-Gradient, SAA, and SD



Some “concrete” instances
                          Table 1: SP Test Instances

  Problem Name | Domain     | # of 1st-stage variables | # of 2nd-stage variables | # of random variables | Universe of scenarios | Comment
  LandS        | Generation | 4                        | 12                       | 3                     |                       | Made-up
  20TERM       | Logistics  | 63                       | 764                      | 40                    |                       | Semi-real
  SSN          | Telecom    | 89                       | 706                      | 86                    |                       | Semi-real
  STORM        | Logistics  | 121                      | 1259                     | 117                   |                       | Semi-real

Notice the size of the scenario universe: using a deterministic algorithm would be a “non-starter”.
Numerous Large-scale Applications
• Wind-integration in Economic Dispatch
  – How should conventional resources be
    dispatched in the presence of intermittent
    resources?
• Supply chain planning
  – Inventory Transhipment between Regional
    Distribution Centers, Local Warehouses, Outlets
• “Everyone” wants to “solve” Stochastic Unit
  Commitment
So what do we mean by “solve”?
I. At the very least
  – An algorithm which, under specified assumptions, will
     • Provide a first-stage decision with known metrics of optimality,
       i.e., report a statistically quantified error
     • Be reasonably fast on easily accessible machines
II. There are other things that people want
Stochastic Quasi-gradient Method (SQG) (Ermoliev/
       Gaivoronski/Lan/Nemirovski/Uryasev …)
• At iteration k, let x^k be given. Sample an outcome ω^k
• Replace the subgradient used in subgradient optimization with an
unbiased estimate G^k

• Then, x^{k+1} = Π_X(x^k − α_k G^k)
   with E[G^k | x^k] a subgradient of the objective at x^k
   and step sizes satisfying Σ_k α_k = ∞, Σ_k α_k² < ∞
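A minimal SQG sketch in Python on a newsvendor-style recourse; the instance (exponential demand, order cost c, shortage cost q) and the 1/k step rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
c, q, x_max = 1.0, 4.0, 10.0        # illustrative order cost, shortage cost, bound

def sqg_newsvendor(iters=5000):
    """x_{k+1} = Pi_X(x_k - alpha_k * G_k), where G_k is an unbiased estimate of
       a subgradient of f(x) = c*x + E[q*max(D - x, 0)] built from ONE outcome."""
    x = 0.0
    for k in range(1, iters + 1):
        d = rng.exponential(scale=2.0)           # sample omega^k (a demand)
        G = c - q * (1.0 if d > x else 0.0)      # stochastic quasi-gradient
        x = min(max(x - (1.0 / k) * G, 0.0), x_max)   # project onto X = [0, x_max]
    return x

print(sqg_newsvendor())   # analytic optimum: P(D > x) = c/q  =>  x* = 2*ln(4) ~ 2.77
```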
Comments on SQG

• Because this is a sampling-based algorithm, you must replicate so that you
can estimate “variability” (of what?)
• If you replicate M times, you will get M first-stage decisions. Which of
these should you use?
 – Could evaluate each of the M first-stage decisions, and then choose the one with
   smallest objective estimate
 – For realistic models, this can be computationally time-consuming
• Could simply choose the mean of replications
 – Less expensive, but may be unreliable
• But most importantly, it does not provide lower bounds



Sample Average Approximation (SAA, Shapiro, Kleywegt,
           Homem-de-Mello, Linderoth, Wright ….)

• Choose a sample size N; Solve a sampled problem; Repeat M times
• Since you replicate M times, you will get M decisions. Which of these
should you use?
 – Could evaluate each of the M decisions, and then choose the one with
   smallest estimate
     • For realistic models, this can be computationally time-consuming
 – Could simply choose the mean of replications
     • Less expensive, but may be unreliable
• Very widely cited, sometimes misused (because some non-expert users
appear to choose M=1)!


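A sketch of SAA with M replications on the same style of instance (again an illustrative assumption, not one of the lecture's test problems); each replication solves its sampled problem exactly via an empirical quantile, and the M decisions are then compared:

```python
import numpy as np

rng = np.random.default_rng(1)
c, q = 1.0, 4.0
N, M = 500, 10                      # sample size per replication, # of replications

# For min_x c*x + (q/N)*sum(max(d_i - x, 0)), the sampled problem is solved
# exactly by the (1 - c/q) empirical quantile of the drawn demands.
decisions = [np.quantile(rng.exponential(2.0, size=N), 1.0 - c / q)
             for _ in range(M)]

# Pick a decision: evaluate each candidate on a common, larger evaluation sample
# and keep the one with the smallest objective estimate (the slide's first option).
d_eval = rng.exponential(2.0, size=100_000)
obj = [c * x + q * np.mean(np.maximum(d_eval - x, 0.0)) for x in decisions]
best = decisions[int(np.argmin(obj))]
print(best, np.mean(decisions))     # best-of-M vs. mean-of-replications
```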
Stochastic Decomposition (SD)

• Allow arbitrarily many outcomes (scenarios)
  including continuous random variables
• Perhaps interface with a simulator …




Some Central Questions for SP

• Instead of choosing a sample size at the start, can we
  decide how much to sample, on-the-fly?
   – The analogous question for nonlinear programming would
     be: instead of choosing the number of iterations at the
     start, can we decide the number of iterations, on-the-fly? –
     Yes!
   – So can we do this with sample size selection? Perhaps!
• If we are to determine a sample size on-the-fly, what is
  the smallest number by which to increment the sample
  size, and yet guarantee asymptotic convergence?

Under some assumptions …
• Fixed Recourse Matrix
• Current computer implementations assume
  – Relatively complete recourse (under revision)
  – Second-stage cost is deterministic (under revision)




Approximating the recourse
             function in SD
• At the start of iteration k, sample one more
  outcome … say ω^k, independently of ω^1, …, ω^{k−1}
• Given x^k, let π^k solve the second-stage dual LP for the
  outcome ω^k
• Define V_k = V_{k−1} ∪ {π^k} and calculate, for each
  previously sampled outcome, the best dual vertex in V_k
• Notice the mapping of outcomes to finitely many dual
  vertices.
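A standard way to write these steps, following the usual SD notation (the symbols mirror the dual form given earlier and are assumptions on my part):

$$\pi^k \in \arg\max_{\pi} \{\, \pi^\top \big(r(\omega^k) - T(\omega^k)\,x^k\big) : W^\top \pi \le g \,\}, \qquad V_k = V_{k-1} \cup \{\pi^k\},$$
$$\pi_t^k \in \arg\max_{\pi \in V_k} \; \pi^\top \big(r(\omega^t) - T(\omega^t)\,x^k\big), \qquad t = 1, \dots, k.$$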
Approximation used in SD
• The estimated “cut” in SD is given below (see the formula
  following this slide)
• To calculate this “cut” requires one LP corresponding
  to the most recent outcome and the “argmax”
  operations at the bottom of the previous slide
• In addition, all previous cuts need to be updated
  … to make old cuts consistent with the changing
  sample size over iterations.
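In the notation above, the estimated cut is the sample-mean lower bound built from the argmax vertices; since each π_t^k is dual feasible, it satisfies

$$\frac{1}{k} \sum_{t=1}^{k} h(x, \omega^t) \;\ge\; \frac{1}{k} \sum_{t=1}^{k} (\pi_t^k)^\top \big( r(\omega^t) - T(\omega^t)\,x \big),$$

with equality in each term at x = x^k by construction of the argmax.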
Update Previous Cuts
• Updating previously generated subgradients
  – Why?
  – Because … early cuts (based on small sample sizes) can
    cut away the optimum forever!

[Figure: the expected recourse function with an early cut that can cause trouble]
How do we get around this?
• Suppose we know a lower bound (e.g., zero) on the
  recourse function; then we can blend such a lower
  bound into the older cuts so that these older cuts
  “fade” away.

[Figure: the expected recourse function and the early cut from the previous slide]
Updating Previous Cuts.
• If we assume that all recourse function lower bounds
  are uniformly zero,
   – Then, for t < k, the “cut” from iteration t takes the form shown
     below:




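With a uniform lower bound of zero, the update amounts to shrinking each stored cut toward that bound: at iteration k, the cut generated at iteration t < k (with stored coefficients α_t, β_t) is used as

$$x \;\mapsto\; \frac{t}{k}\,\big( \alpha_t + \beta_t^\top x \big),$$

which is exactly the coefficient rescaling by t/k that appears in the algorithmic summary later.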
Alternative Sampled Approximations

[Figure comparing approximations of the expected recourse function: the sampled expected recourse function (SAA); a linearization of the sampled ERF (SA or SQG); a lower bound on the linearization of the sample mean (the SD cut); and updated cuts from previous iterations of SD]
SUMMARY OF APPROXIMATIONS

[Figure: schematic comparison of the SAA, SD, and SA approximations]
Incumbent Solutions and Proximal Master Program
    (QP) … also called Regularized Master Program

• An incumbent is the best solution “estimated”
  through iteration k and will be denoted x̄^k
• Given an incumbent, we set up a proximal master
  program as follows (see the formulation below).
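A standard form of such a proximal master program, assuming f_k denotes the current cut-based approximation of the expected recourse function and x̄^k the incumbent:

$$x^{k+1} \in \arg\min_{x \in X} \; c^\top x + f_k(x) + \frac{\sigma}{2}\, \big\| x - \bar{x}^k \big\|^2.$$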
Benefits of the Proximal Master
• Can show that a finite master problem,
  consisting of n₁ + 3 optimality cuts, is enough!
  (Here n₁ is the number of first-stage variables)
• Convergence can be proven to a unique limit
  which is optimal (with probability 1).
• Stopping rules based on QP duality are
  reasonably efficient.

Algorithmic Summary
0. Initialize with the same candidate and incumbent x. Set k = 1.
1. Use sampling to obtain an outcome ω^k.
2. Derive an SD cut at the candidate and the incumbent solutions.
           This calls for
           – solution of 2 subproblems using ω^k
           – adding any new dual vertex to a list V_k
           – for each prior outcome {ω^t}_{t=1}^{k}, choosing the best
             subproblem dual vertex seen thus far
3. Update cut t by multiplying its coefficients by (t/k), for t = 1, …, k.
4. Solve the updated QP master.
5. Ascertain whether the new candidate becomes the new incumbent.
6. If stopping rules are not met, increment k and repeat from 1.
(The stopping rule is based on bootstrapping primal and dual QPs.)
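The loop above, sketched in Python on a one-dimensional newsvendor where the second-stage dual vertices can be enumerated by hand. The instance, plus two simplifications (the cut is formed at the candidate only, and the QP master is solved with a bounded scalar minimizer), are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
c, q, x_max, sigma = 1.0, 4.0, 10.0, 1.0   # illustrative costs and proximal weight

def sd_newsvendor(iters=400):
    omegas = []                 # sampled demands omega^1, ..., omega^k
    V = {0.0}                   # dual vertex list V_k for max{pi*(d-x) : 0 <= pi <= q}
    cuts = []                   # (t, alpha_t, beta_t): the cut built at iteration t
    cand = inc = 5.0            # step 0: candidate and incumbent start equal

    def f_approx(x, k):         # max of (t/k)-rescaled cuts; 0 is a valid lower bound
        return max([0.0] + [(t / k) * (a + b * x) for (t, a, b) in cuts])

    for k in range(1, iters + 1):
        omegas.append(rng.exponential(2.0))        # step 1: sample omega^k
        V.add(q if omegas[-1] > cand else 0.0)     # step 2: new dual vertex
        pis = [max(V, key=lambda p: p * (d - cand)) for d in omegas]  # argmax over V_k
        alpha = float(np.mean([p * d for p, d in zip(pis, omegas)]))
        beta = -float(np.mean(pis))
        cuts.append((k, alpha, beta))              # cut: mean of pi_t*(d_t - x)
        # steps 3-4: old cuts are rescaled by t/k inside f_approx; solve the
        # proximal master  min c*x + f_k(x) + (sigma/2)*(x - incumbent)^2
        cand = minimize_scalar(
            lambda x: c * x + f_approx(x, k) + 0.5 * sigma * (x - inc) ** 2,
            bounds=(0.0, x_max), method="bounded").x
        if c * cand + f_approx(cand, k) < c * inc + f_approx(inc, k):
            inc = cand                             # step 5: incumbent test
    return inc

print(sd_newsvendor())   # minimizer of c*x + q*E[max(D - x, 0)] is 2*ln(4) ~ 2.77
```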
Comparisons including 2-stage SD

  Feature \ Method               | SQG Algorithm | SAA          | Stochastic Decomposition
  Subgradient or Estimation      | Estimation    | Estimation   | Estimation
  Step Length Choice Required    | Yes           | Depends      | Not Needed
  Stopping Rules                 | Unknown       | Well Studied | Resolved
  Parallel Computations          | Good          | Depends      | Not known
  Continuous Random Variables    | Yes           | Yes          | Yes
  First-stage Integer Variables  | No            | Yes          | Yes
  Second-stage Integer Variables | No            | Yes          | No

Of course, for small instances we can always try deterministic equivalents!
The Plan – Part II
• Resume (from Computational Results)
• How were these obtained?
  – SD Solution Quality
     • In-sample optimality tests
     • Out-of-sample, Demo (Yifan Liu)
• An Example: Wind Energy with Sub-hourly
  Dispatch
• Summary
Recall this question: What do we mean by
                     “solve”?
I. At the very least
  – An algorithm which, under specified assumptions, will
     • Provide a first-stage decision with known metrics of optimality,
       i.e., report a statistically quantified error
     • Be reasonably fast on easily accessible machines
II. There are other things that people want
What else do people want?
There are other things that people want
• Evidence
   – Experiment with some realistic instances
      • Please note that CEP1 and PGP2 are not realistic. They are for
        debugging!
   – Some numerical controls
      • E.g. Can we reduce “bias”/non-optimality?
• Output for decision support
   – Dual estimates of first-stage
   – Histograms
      • Recourse function
      • Dual prices of second-stage

Some “concrete” instances
                          Table 1: SP Test Instances

  Problem Name | Domain     | # of 1st-stage variables | # of 2nd-stage variables | # of random variables | Universe of scenarios | Comment
  LandS        | Generation | 4                        | 12                       | 3                     |                       | Made-up
  20TERM       | Logistics  | 63                       | 764                      | 40                    |                       | Semi-real
  SSN          | Telecom    | 89                       | 706                      | 86                    |                       | Semi-real
  STORM        | Logistics  | 121                      | 1259                     | 117                   |                       | Semi-real

Notice the size of the scenario universe: using a deterministic algorithm would be a “non-starter”.
Computational Results - SAA
           Table 2: Statistical Quantification with SAA

  SAA estimates using a computational grid:

  Instance Name | Bound  | Average Value | 95% CI
  LandS         | OBJ-UB | 225.624       |
  LandS         | OBJ-LB | 225.62        |
  20TERM        | OBJ-UB | 254311.55     |
  20TERM        | OBJ-LB | 254298.57     |
  SSN           | OBJ-UB | 9.913         |
  SSN           | OBJ-LB | 9.84          |
  STORM         | OBJ-UB | 15498739.41   |
  STORM         | OBJ-LB | 15498657.8    |

Source: Linderoth, Shapiro and Wright, Annals of Operations Research, Vol. 142, pp. 215–241 (2006).
Computational configuration: grid computing using 100’s of PCs, but only 100 PCs at any given time.
Comment by SS: In 2005, an average PC was a Pentium IV with clock speed 2–2.4 GHz. Each instance of SSN took 30–45 mins. of wall-clock time. Replications: 7–10.
One Example: SSN with Latin
      Hypercube Sampling

  Sample Size | Lower Bound      | Upper Bound
  50          | 10.10 (+/- 0.81) | 11.38 (+/- 0.023)
  100         | 8.90 (+/- 0.36)  | 10.542 (+/- 0.021)
  500         | 9.87 (+/- 0.22)  | 10.069 (+/- 0.026)
  1000        | 9.83 (+/- 0.29)  | 9.996 (+/- 0.025)
  5000        | 9.84 (+/- 0.10)  | 9.913 (+/- 0.022)
PC Configuration for SD
•   MacBook Air
•   Processor: Intel Core i5
•   Clock Speed: 1.8 GHz
•   4 GB of 1600 MHz DDR3 memory
•   Replications: 30 for each instance



Computational Results - SD
                Table 3: Statistical Quantification with SAA and SD

  Instance Name | Bound  | SAA Avg. Value (Grid) | SD Avg. Value (Laptop) | % Difference in Avg. Values
  LandS         | OBJ-UB | 225.624               | 225.54                 | 0.037%
  LandS         | OBJ-LB | 225.62                | 225.24                 | 0.168%
  20TERM        | OBJ-UB | 254311.55             | 254476.87              | 0.065%
  20TERM        | OBJ-LB | 254298.57             | 253905.44              | 0.154%
  SSN           | OBJ-UB | 9.913                 | 9.91                   | 0.03%
  SSN           | OBJ-LB | 9.84                  | 9.76                   | 0.813%
  STORM         | OBJ-UB | 15498739.41           | 15498624.37            | 0.0007%
  STORM         | OBJ-LB | 15498657.8            | 15496619.98            | 0.013%
SD Solution quality and time
• Solutions are of comparable quality
• Processors are somewhat similar
• Solution times
   – The comparable time for SSN is 50 mins for 30 replications on one
     processor
   – Compare: (30–45) mins × (7–10) replications × 100 procs = (21,000–45,000)
     processor-mins
   – Note: this is only the time for a sample size of 5000. (But remember, there
     were other sample sizes: 50, 100, 500, …, which we didn’t count.)
• Are we beating Moore’s Law? Yes: doubling computational speed
  every 9 months!
How were these obtained?
• In-sample stopping rules: Lower Bounds
  – Check Stability in Set of Dual Vertices
  – Bootstrapped duality gap estimate
     • The latter tests whether the current primal and dual
       solutions from the Proximal Master Program are also
       reasonably “good” solutions for Primal and Dual
       Problems of a Resampled Proximal Master over
       Multiple Replications
• Out-of-sample: Upper Bounds

Stability in Set of Dual Vertices




[Figure: Pi-ratio versus iteration number for LandS, 20TERM, SSN, and STORM (vertical scale 0.0–1.0; iteration ranges roughly 0–300, 0–800, 0–2500, and 0–2000, respectively)]
Primal and Dual Regularized Values




LB and UB of SSN objective function estimates

[Figure: distribution functions Fn(x) of SSN objective estimates for x between roughly 9.5 and 11.0, under tight, nominal, and loose tolerances, illustrating bias/non-optimality reduction as the tolerance is tightened]
SSN: Sonet-Switched Network
                Solution and evaluation time for 20 replications (Tight)

Evaluation values range from 9.944203 to 10.279154 (a 3.3% difference).

  Replication No. | Solution Time (s) | Evaluation Time (s)
  0               | 313.881051        | 588.627184
  1               | 203.690471        | 651.042227
  2               | 465.949416        | 547.517996
  3               | 313.606587        | 589.429828
  4               | 355.160629        | 599.331808
  5               | 334.764385        | 616.674899
  6               | 529.661100        | 604.565630
  7               | 327.888471        | 545.132039
  8               | 169.432233        | 655.688530
  9               | 301.293535        | 604.098964
  10              | 697.304541        | 567.361433
  11              | 315.097318        | 532.595026
  12              | 247.439006        | 555.664374
  13              | 560.934417        | 577.258900
  14              | 342.787909        | 506.836774
  15              | 247.356803        | 570.577690
  16              | 184.941379        | 517.999214
  17              | 339.951593        | 589.265958
  18              | 339.381007        | 579.033263
  19              | 411.092300        | 602.349102
  Total           | 7001.614151       | 11601.050839
Solutions
• Mean of Replications: average all solutions
• For each seed s, let x̂_s ∈ argmin_x f_s(x),
  where f_s denotes the final approximation for seed s
• Compromise solution: a single decision obtained by minimizing the
  average of the final approximations across the replications
Upper Bounds
• For each replication, the objective function
  evaluations can be very time consuming
• We report Objective of both Mean and
  Compromise Solutions




Main Take Away

                      In SP,
     Numerical Optimization Meets Statistics
So When You Design Algorithms, Don’t Forget What
          You Need to Deliver: Statistics
  Most Numerical Optimization Methods were Not
             Designed for This Goal.
• Does the speed-up with SD beat Moore’s Law? Yes,
  doubling computational speed every 9 months!
