4. MOTIVATION – Dynamic Programming
▪ Solving Bellman's equation:
V_t(S_t) = max_x ( C(S_t, x) + γ E[ V_{t+1}(S_{t+1}) | S_t ] )
From Optimal Learning (Powell & Ryzhov 2012)
[Figure: backward pass over the states S_T, S_{T−1}, S_{T−2}, …, S_0]
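The backward pass can be sketched on a toy finite-horizon problem; the states, actions, contribution, and transition below are illustrative stand-ins, not the energy-storage model.

```python
# Backward-pass sketch of Bellman's equation on a toy chain:
# V_t(S_t) = max_x ( C(S_t, x) + gamma * E[ V_{t+1}(S_{t+1}) | S_t ] ).
GAMMA = 0.9
T = 5
STATES = (0, 1, 2)      # illustrative storage levels
ACTIONS = (-1, 0, 1)    # discharge, hold, charge

def contribution(s, x):
    # hypothetical one-period reward: holding energy pays, acting costs
    return s - abs(x)

def transition(s, x):
    # deterministic toy transition, clipped to the state space
    return min(max(s + x, 0), 2)

def backward_pass():
    V = {(T, s): 0.0 for s in STATES}       # terminal condition V_T = 0
    for t in range(T - 1, -1, -1):          # sweep t = T-1, ..., 0
        for s in STATES:
            V[(t, s)] = max(
                contribution(s, x) + GAMMA * V[(t + 1, transition(s, x))]
                for x in ACTIONS
            )
    return V
```

Because the transition here is deterministic, the expectation reduces to a single successor value; the stochastic model would average over sampled successors instead.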
6. MOTIVATION - Literature
▪ "A Comparison of Approximate Dynamic Programming Techniques on Benchmark Energy Storage Problems: Does Anything Work?"
  ▪ Jiang, Pham & Powell, 2014
▪ "Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy Storage Problems"
  ▪ Salas & Powell, 2014
10. MODEL – Action Constraints
Storage capacity:
▪ x_t^{WR} + x_t^{GR} ≤ R^{max} − R_t
▪ x_t^{RD} + x_t^{RG} ≤ R_t
Demand satisfied:
▪ x_t^{WD} + ρ^d x_t^{RD} + x_t^{GD} = D_t
Charging / discharging rates:
▪ x_t^{WR} + x_t^{GR} ≤ γ^c
▪ x_t^{RD} + x_t^{RG} ≤ γ^d
Flow conservation:
▪ x_t^{WR} + x_t^{WD} ≤ E_t
Maximal wind usage:
▪ x_t^{WD} = min(D_t, E_t)
▪ x_t^{WR} = min(R^{max} − R_t, E_t − x_t^{WD})
From Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy
Storage Problems (Salas, Powell) [Except Maximal Wind Usage]
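The constraints transcribe directly into code. The flow names (wr, gr, rd, rg, wd, gd) mirror the superscripts, but the dict layout and function names are mine, not from the slides.

```python
def max_wind_usage(D_t, E_t, R_t, R_max):
    """Maximal-wind-usage rule: serve demand from wind first, then store
    leftover wind up to the remaining storage capacity."""
    x_wd = min(D_t, E_t)                  # wind -> demand
    x_wr = min(R_max - R_t, E_t - x_wd)   # wind -> storage
    return x_wd, x_wr

def is_feasible(x, D_t, E_t, R_t, R_max, gamma_c, gamma_d, rho_d, tol=1e-9):
    """Check one decision x (a dict of flows) against the action constraints."""
    return (
        x["wr"] + x["gr"] <= R_max - R_t + tol        # storage capacity
        and x["rd"] + x["rg"] <= R_t + tol            # cannot over-discharge
        and abs(x["wd"] + rho_d * x["rd"] + x["gd"] - D_t) <= tol  # demand met
        and x["wr"] + x["gr"] <= gamma_c + tol        # charging rate
        and x["rd"] + x["rg"] <= gamma_d + tol        # discharging rate
        and x["wr"] + x["wd"] <= E_t + tol            # wind availability
    )
```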
11. MODEL – Transition Functions
Storage:
R_{t+1} = R_t + φᵀ x_t, with φ = (0, −1, 0, ρ^c, ρ^c, −1)
Simplified in the stochastic model:
R_{t+1} = R_t − x_t^{RD} + x_t^{WR} + x_t^{GR} − x_t^{RG}
From Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy
Storage Problems (Salas, Powell)
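The compact form R_{t+1} = R_t + φᵀx_t codes directly. The ordering of the decision vector as (x^WD, x^RD, x^GD, x^WR, x^GR, x^RG) is inferred from the coefficient pattern in φ and should be checked against the source.

```python
def storage_transition(R_t, x_t, rho_c=1.0):
    """R_{t+1} = R_t + phi . x_t with phi = (0, -1, 0, rho_c, rho_c, -1),
    assuming x_t is ordered (x_WD, x_RD, x_GD, x_WR, x_GR, x_RG).
    With rho_c = 1 this reduces to the simplified stochastic-model form."""
    phi = (0.0, -1.0, 0.0, rho_c, rho_c, -1.0)
    return R_t + sum(p * xi for p, xi in zip(phi, x_t))
```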
12. MODEL – Transition Functions (cont.)
Wind:
▪ E_{t+1} = E_t + Ê_{t+1}
▪ Ê_t ~ N(μ_E, σ_E²)
Demand:
▪ D_t = max(0, 3 − 4 sin(2πt/T))
Price:
▪ P_{t+1} = P_t + ε_{0,t+1} + 1{u_{t+1} ≤ 0.031} · ε_{1,t+1}
▪ ε_{0,t} ~ N(μ_P, σ_P²)
▪ ε_{1,t} ~ N(0, 50²)
▪ u_t ~ U(0, 1)
From Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy Storage Problems (Salas, Powell)
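The three exogenous processes can each be simulated one step at a time; μ_E, σ_E, μ_P, and σ_P are left as parameters here rather than the benchmark's values.

```python
import math
import random

def next_wind(E_t, mu_E, sigma_E, rng):
    """E_{t+1} = E_t + Ehat_{t+1}, with Ehat ~ N(mu_E, sigma_E^2)."""
    return E_t + rng.gauss(mu_E, sigma_E)

def demand(t, T):
    """D_t = max(0, 3 - 4 sin(2 pi t / T))."""
    return max(0.0, 3.0 - 4.0 * math.sin(2.0 * math.pi * t / T))

def next_price(P_t, mu_P, sigma_P, rng):
    """P_{t+1} = P_t + eps0 + 1{u <= 0.031} * eps1, with eps1 ~ N(0, 50^2):
    a random walk plus a rare, large jump."""
    eps0 = rng.gauss(mu_P, sigma_P)
    jump = rng.gauss(0.0, 50.0) if rng.random() <= 0.031 else 0.0
    return P_t + eps0 + jump
```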
13. MODEL – Objective Function
Reward function:
C(S_t, x_t) = P_t D_t − P_t ( x_t^{GR} − ρ^d x_t^{RG} + x_t^{GD} ) − P^h R_{t+1}
Objective function:
F^{π*} = max_{π∈Π} E [ Σ_{t∈T} C(S_t, A_t^π(S_t)) ]
From Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy
Storage Problems (Salas, Powell)
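One reading of the reward function above, coded directly; P_h is the holding-cost coefficient, and the grouping of the grid-purchase terms is my reconstruction, to be verified against Salas & Powell.

```python
def reward(P_t, D_t, x, R_next, rho_d, P_h=0.0):
    """C(S_t, x_t) = P_t D_t - P_t (x_GR - rho_d x_RG + x_GD) - P_h R_{t+1}.
    x is a dict of flows keyed "gr", "rg", "gd": grid purchases cost money,
    discharging to the grid earns revenue, holding energy costs P_h per unit."""
    grid_terms = x["gr"] - rho_d * x["rg"] + x["gd"]
    return P_t * D_t - P_t * grid_terms - P_h * R_next
```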
15. ALGORITHMS – Q-Learning
▪ Action-value function Q
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ max_x Q_t(S_{t+1}, x) − Q_t(S_t, x_t) ]
▪ Off-policy
  ▪ Selection policy: ε-greedy
  ▪ Evaluation policy: pure greedy
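A tabular sketch of the update and the ε-greedy selection policy; the defaultdict table and function names are mine, not the study's implementation.

```python
import random
from collections import defaultdict

def q_update(Q, s, x, r, s_next, actions, alpha, gamma):
    """Q_{t+1}(S_t,x_t) = Q_t(S_t,x_t)
       + alpha [ R_{t+1} + gamma max_x' Q_t(S_{t+1},x') - Q_t(S_t,x_t) ]."""
    target = r + gamma * max(Q[(s_next, xp)] for xp in actions)
    Q[(s, x)] += alpha * (target - Q[(s, x)])
    return Q[(s, x)]

def epsilon_greedy(Q, s, actions, epsilon, rng):
    """Off-policy selection: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda x: Q[(s, x)])
```

Evaluating with the pure-greedy policy just means calling `epsilon_greedy` with `epsilon=0`.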
16. ALGORITHMS – Q-Learning
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ max_x Q_t(S_{t+1}, x) − Q_t(S_t, x_t) ]
[Figure: from state S_t with action x_t, candidate actions x_1, x_2 at S_{t+1} have values Q_t(S_{t+1}, x_1), Q_t(S_{t+1}, x_2); the update backs up the maximum]
17. ALGORITHMS – SARSA
▪ SARSA: [S_t, x_t, R_{t+1}, S_{t+1}, x_{t+1}]
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ Q_t(S_{t+1}, x_{t+1}) − Q_t(S_t, x_t) ]
▪ On-policy
  ▪ Evaluation policy is the selection policy
18. ALGORITHMS – SARSA
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ Q_t(S_{t+1}, x_{t+1}) − Q_t(S_t, x_t) ]
[Figure: the update backs up Q_t(S_{t+1}, x_{t+1}) for the action actually taken at S_{t+1}]
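The only change from the Q-learning update is that the backed-up value is the Q-value of the action actually taken next, not the maximum. A minimal sketch, using a plain dict as the table:

```python
def sarsa_update(Q, s, x, r, s_next, x_next, alpha, gamma):
    """On-policy update: Q_{t+1}(S_t,x_t) = Q_t(S_t,x_t)
       + alpha [ R_{t+1} + gamma Q_t(S_{t+1},x_{t+1}) - Q_t(S_t,x_t) ]."""
    q_sx = Q.get((s, x), 0.0)
    target = r + gamma * Q.get((s_next, x_next), 0.0)  # value of the action taken
    Q[(s, x)] = q_sx + alpha * (target - q_sx)
    return Q[(s, x)]
```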
19. ALGORITHMS – SARSA(λ)
▪ SARSA(λ) adds a backward pass along the visited path (an eligibility trace):
Q_{t+1}(s, x) = Q_t(s, x) + α δ_t e_t(s, x)
δ_t = R_{t+1} + γ Q_t(S_{t+1}, x_{t+1}) − Q_t(S_t, x_t)
e_t(s, x) = γλ e_{t−1}(s, x) + 1 if (s, x) = (S_t, x_t); γλ e_{t−1}(s, x) otherwise
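One step of the trace update can be sketched as follows, with Q as a defaultdict table and e as a dict of traces (both layouts are mine):

```python
from collections import defaultdict

def sarsa_lambda_update(Q, e, s, x, r, s_next, x_next, alpha, gamma, lam):
    """One SARSA(lambda) step: compute delta_t, decay all traces by
    gamma*lambda, bump the visited pair by 1, then credit every traced pair."""
    delta = r + gamma * Q[(s_next, x_next)] - Q[(s, x)]
    for key in list(e):
        e[key] *= gamma * lam                 # e_t = gamma*lambda*e_{t-1} ...
    e[(s, x)] = e.get((s, x), 0.0) + 1.0      # ... + 1 for (S_t, x_t)
    for key, trace in e.items():
        Q[key] += alpha * delta * trace       # Q_{t+1} = Q_t + alpha*delta*e_t
    return delta
```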
21. ALGORITHMS – Step Sizes
▪ Updating equation:
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ max_x Q_t(S_{t+1}, x) − Q_t(S_t, x_t) ]
▪ Rearranged:
Q_{t+1}(S_t, x_t) = (1 − α) Q_t(S_t, x_t) + α [ R_{t+1} + γ max_x Q_t(S_{t+1}, x) ]
23. ALGORITHMS – Step Sizes
▪ Constant: α = c
▪ Harmonic: α = a / (a + n)
▪ 1/n: α = 1/n
▪ Ryzhov formula
  ▪ Ryzhov, Frazier & Powell (2014)
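The three closed-form rules as functions of the update count n; the default constants c and a are placeholders, not tuned values from the study.

```python
def alpha_constant(n, c=0.1):
    """Constant rule: alpha = c for every update."""
    return c

def alpha_harmonic(n, a=25.0):
    """Harmonic rule: alpha = a / (a + n); larger a slows the decline."""
    return a / (a + n)

def alpha_one_over_n(n):
    """1/n rule: alpha = 1/n, the sample-average step size."""
    return 1.0 / n
```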
25. METHODOLOGY – Simulation
▪ 1 deterministic problem
▪ 17 stochastic problems
▪ 256 sample paths per stochastic problem
▪ Training iterations: transitions generated by Monte Carlo simulation
▪ ε-greedy action selection
▪ Evaluation: averaged over all sample paths
27. DATA – Deterministic Parameters
▪ R^{max} = 100
▪ R^{min} = 0
▪ R_0 = 0
▪ ρ^c = ρ^d = 0.90
▪ γ^c = γ^d = 0.10
▪ T = ๐๐๐๐
31. DATA – Stochastic Parameters
▪ R^{max} = 30
▪ R^{min} = 0
▪ R_0 = 25
▪ ρ^c = ρ^d = 1.00
▪ γ^c = γ^d = 5
▪ T = ๐๐๐
▪ P^{max} = 70
▪ P^{min} = 30
▪ E^{max} = 7.00
▪ E^{min} = 1.00
36. RESULTS – METRICS
▪ Action traces (deterministic)
▪ Step size over updates
▪ Stored energy over time vs. benchmark
▪ Performance over number of training iterations
42. STEP SIZE TUNING
▪ Declining step sizes
▪ Tunable parameters
  ▪ Harmonic: a / (a + n)
  ▪ Ryzhov: estimator update factor ν
45. RYZHOV STEP SIZE
▪ No-discount formula:
α_{n−1} = (σ̄^n)² / ( (σ̄^n)² + (σ^n)² )
▪ Tunable parameter ν:
μ^n = (1 − ν_{n−1}) μ^{n−1} + ν_{n−1} q̂^n
(σ^n)² = (1 − ν_{n−1}) (σ^{n−1})² + ν_{n−1} (q̂^n − μ^{n−1})²
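The two estimator recursions code directly as exponential smoothing with update factor ν; how these estimates feed the step-size ratio itself follows Ryzhov, Frazier & Powell and is not reproduced here.

```python
def ryzhov_estimates(mu_prev, var_prev, q_n, nu):
    """Smoothed mean and variance of the observed values q^n:
    mu^n        = (1 - nu) mu^{n-1}        + nu q^n
    (sigma^n)^2 = (1 - nu) (sigma^{n-1})^2 + nu (q^n - mu^{n-1})^2"""
    mu = (1.0 - nu) * mu_prev + nu * q_n
    var = (1.0 - nu) * var_prev + nu * (q_n - mu_prev) ** 2
    return mu, var
```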
46. RYZHOV STEP SIZES
[Figure: Ryzhov step size vs. update step, at training iterations 1, 128, 2k, 4k, 8k, 16k, 32k, 65k, 131k, 262k, 524k, 1M, 2M, and 4M]
47. Ryzhov Explanation
▪ The step size should level off at a constant determined by the variance of the rewards.
▪ For this problem the rewards are deterministic, so the measured variance is the variance across the different states that are visited.
55. Explanations
▪ Ryzhov flattens out very noticeably because the step size stays too high: the updates repeatedly overcompensate and bounce around.
▪ Harmonic looks like it is still improving. It needs to decline more slowly rather than go to 0 so quickly; perhaps use the Ryzhov estimate as the harmonic constant.
56. STEP SIZE PERFORMANCE
[Figure: proportion of optimal vs. training iteration, across all problems, for the Constant 0.1, Harmonic, 1/n, and Ryzhov step sizes]
59. FURTHER EXPLORATION
▪ Additional renewable energy sources / storage devices
▪ Finer levels of discretization
▪ Removing the wind-usage restriction
▪ Additional algorithms
  ▪ Actor-Critic
  ▪ Gradient descent (linear/non-linear VFAs)