4. MOTIVATION – Dynamic Programming
▪ Solving Bellman's equation:
V_t(S_t) = max_x ( C(S_t, x) + γ E[ V_{t+1}(S_{t+1}) | S_t ] )
From Optimal Learning (Powell & Ryzhov 2012)
[Figure: backward pass over the states S_T, S_{T−1}, S_{T−2}, …, S_0]
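The backward pass can be sketched on a toy finite-horizon problem; the states, actions, contribution, and transition below are illustrative stand-ins, not the energy-storage model.

```python
# Backward-pass sketch of Bellman's equation on a toy chain:
# V_t(S_t) = max_x ( C(S_t, x) + gamma * E[ V_{t+1}(S_{t+1}) | S_t ] ).
GAMMA = 0.9
T = 5
STATES = (0, 1, 2)      # illustrative storage levels
ACTIONS = (-1, 0, 1)    # discharge, hold, charge

def contribution(s, x):
    # hypothetical one-period reward: holding energy pays, acting costs
    return s - abs(x)

def transition(s, x):
    # deterministic toy transition, clipped to the state space
    return min(max(s + x, 0), 2)

def backward_pass():
    V = {(T, s): 0.0 for s in STATES}       # terminal condition V_T = 0
    for t in range(T - 1, -1, -1):          # sweep t = T-1, ..., 0
        for s in STATES:
            V[(t, s)] = max(
                contribution(s, x) + GAMMA * V[(t + 1, transition(s, x))]
                for x in ACTIONS
            )
    return V
```

Because the transition here is deterministic, the expectation reduces to a single successor value; the stochastic model would average over sampled successors instead.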
6. MOTIVATION - Literature
▪ "A Comparison of Approximate Dynamic Programming Techniques on Benchmark Energy Storage Problems: Does Anything Work?"
  ▪ Jiang, Pham & Powell, 2014
▪ "Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy Storage Problems"
  ▪ Salas & Powell, 2014
10. MODEL – Action Constraints
Storage capacity:
▪ x_t^{WR} + x_t^{GR} ≤ R^{max} − R_t
▪ x_t^{RD} + x_t^{RG} ≤ R_t
Demand satisfied:
▪ x_t^{WD} + ρ^d x_t^{RD} + x_t^{GD} = D_t
Charging / discharging rates:
▪ x_t^{WR} + x_t^{GR} ≤ γ^c
▪ x_t^{RD} + x_t^{RG} ≤ γ^d
Flow conservation:
▪ x_t^{WR} + x_t^{WD} ≤ E_t
Maximal wind usage:
▪ x_t^{WD} = min(D_t, E_t)
▪ x_t^{WR} = min(R^{max} − R_t, E_t − x_t^{WD})
From Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy
Storage Problems (Salas, Powell) [Except Maximal Wind Usage]
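The constraints transcribe directly into code. The flow names (wr, gr, rd, rg, wd, gd) mirror the superscripts, but the dict layout and function names are mine, not from the slides.

```python
def max_wind_usage(D_t, E_t, R_t, R_max):
    """Maximal-wind-usage rule: serve demand from wind first, then store
    leftover wind up to the remaining storage capacity."""
    x_wd = min(D_t, E_t)                  # wind -> demand
    x_wr = min(R_max - R_t, E_t - x_wd)   # wind -> storage
    return x_wd, x_wr

def is_feasible(x, D_t, E_t, R_t, R_max, gamma_c, gamma_d, rho_d, tol=1e-9):
    """Check one decision x (a dict of flows) against the action constraints."""
    return (
        x["wr"] + x["gr"] <= R_max - R_t + tol        # storage capacity
        and x["rd"] + x["rg"] <= R_t + tol            # cannot over-discharge
        and abs(x["wd"] + rho_d * x["rd"] + x["gd"] - D_t) <= tol  # demand met
        and x["wr"] + x["gr"] <= gamma_c + tol        # charging rate
        and x["rd"] + x["rg"] <= gamma_d + tol        # discharging rate
        and x["wr"] + x["wd"] <= E_t + tol            # wind availability
    )
```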
11. MODEL – Transition Functions
Storage:
R_{t+1} = R_t + φᵀ x_t, with φ = (0, −1, 0, ρ^c, ρ^c, −1)
Simplified in the stochastic model:
R_{t+1} = R_t − x_t^{RD} + x_t^{WR} + x_t^{GR} − x_t^{RG}
From Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy
Storage Problems (Salas, Powell)
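The compact form R_{t+1} = R_t + φᵀx_t codes directly. The ordering of the decision vector as (x^WD, x^RD, x^GD, x^WR, x^GR, x^RG) is inferred from the coefficient pattern in φ and should be checked against the source.

```python
def storage_transition(R_t, x_t, rho_c=1.0):
    """R_{t+1} = R_t + phi . x_t with phi = (0, -1, 0, rho_c, rho_c, -1),
    assuming x_t is ordered (x_WD, x_RD, x_GD, x_WR, x_GR, x_RG).
    With rho_c = 1 this reduces to the simplified stochastic-model form."""
    phi = (0.0, -1.0, 0.0, rho_c, rho_c, -1.0)
    return R_t + sum(p * xi for p, xi in zip(phi, x_t))
```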
12. MODEL – Transition Functions (cont.)
Wind:
▪ E_{t+1} = E_t + Ê_{t+1}
▪ Ê_t ~ N(μ_E, σ_E²)
Demand:
▪ D_t = max(0, 3 − 4 sin(2πt/T))
Price:
▪ P_{t+1} = P_t + ε_{0,t+1} + 1{u_{t+1} ≤ 0.031} · ε_{1,t+1}
▪ ε_{0,t} ~ N(μ_P, σ_P²)
▪ ε_{1,t} ~ N(0, 50²)
▪ u_t ~ U(0, 1)
From Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy Storage Problems (Salas, Powell)
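The three exogenous processes can each be simulated one step at a time; μ_E, σ_E, μ_P, and σ_P are left as parameters here rather than the benchmark's values.

```python
import math
import random

def next_wind(E_t, mu_E, sigma_E, rng):
    """E_{t+1} = E_t + Ehat_{t+1}, with Ehat ~ N(mu_E, sigma_E^2)."""
    return E_t + rng.gauss(mu_E, sigma_E)

def demand(t, T):
    """D_t = max(0, 3 - 4 sin(2 pi t / T))."""
    return max(0.0, 3.0 - 4.0 * math.sin(2.0 * math.pi * t / T))

def next_price(P_t, mu_P, sigma_P, rng):
    """P_{t+1} = P_t + eps0 + 1{u <= 0.031} * eps1, with eps1 ~ N(0, 50^2):
    a random walk plus a rare, large jump."""
    eps0 = rng.gauss(mu_P, sigma_P)
    jump = rng.gauss(0.0, 50.0) if rng.random() <= 0.031 else 0.0
    return P_t + eps0 + jump
```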
13. MODEL – Objective Function
Reward function:
C(S_t, x_t) = P_t D_t − P_t ( x_t^{GR} − ρ^d x_t^{RG} + x_t^{GD} ) − P^h R_{t+1}
Objective function:
F^{π*} = max_{π∈Π} E [ Σ_{t∈T} C(S_t, A_t^π(S_t)) ]
From Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy
Storage Problems (Salas, Powell)
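One reading of the reward function above, coded directly; P_h is the holding-cost coefficient, and the grouping of the grid-purchase terms is my reconstruction, to be verified against Salas & Powell.

```python
def reward(P_t, D_t, x, R_next, rho_d, P_h=0.0):
    """C(S_t, x_t) = P_t D_t - P_t (x_GR - rho_d x_RG + x_GD) - P_h R_{t+1}.
    x is a dict of flows keyed "gr", "rg", "gd": grid purchases cost money,
    discharging to the grid earns revenue, holding energy costs P_h per unit."""
    grid_terms = x["gr"] - rho_d * x["rg"] + x["gd"]
    return P_t * D_t - P_t * grid_terms - P_h * R_next
```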
15. ALGORITHMS – Q-Learning
▪ Action-value function Q
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ max_x Q_t(S_{t+1}, x) − Q_t(S_t, x_t) ]
▪ Off-policy
  ▪ Selection policy: ε-greedy
  ▪ Evaluation policy: pure greedy
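A tabular sketch of the update and the ε-greedy selection policy; the defaultdict table and function names are mine, not the study's implementation.

```python
import random
from collections import defaultdict

def q_update(Q, s, x, r, s_next, actions, alpha, gamma):
    """Q_{t+1}(S_t,x_t) = Q_t(S_t,x_t)
       + alpha [ R_{t+1} + gamma max_x' Q_t(S_{t+1},x') - Q_t(S_t,x_t) ]."""
    target = r + gamma * max(Q[(s_next, xp)] for xp in actions)
    Q[(s, x)] += alpha * (target - Q[(s, x)])
    return Q[(s, x)]

def epsilon_greedy(Q, s, actions, epsilon, rng):
    """Off-policy selection: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda x: Q[(s, x)])
```

Evaluating with the pure-greedy policy just means calling `epsilon_greedy` with `epsilon=0`.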
16. ALGORITHMS – Q-Learning
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ max_x Q_t(S_{t+1}, x) − Q_t(S_t, x_t) ]
[Figure: from state S_t with action x_t, candidate actions x_1, x_2 at S_{t+1} have values Q_t(S_{t+1}, x_1), Q_t(S_{t+1}, x_2); the update backs up the maximum]
17. ALGORITHMS – SARSA
▪ SARSA: [S_t, x_t, R_{t+1}, S_{t+1}, x_{t+1}]
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ Q_t(S_{t+1}, x_{t+1}) − Q_t(S_t, x_t) ]
▪ On-policy
  ▪ Evaluation policy is the selection policy
18. ALGORITHMS – SARSA
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ Q_t(S_{t+1}, x_{t+1}) − Q_t(S_t, x_t) ]
[Figure: the update backs up Q_t(S_{t+1}, x_{t+1}) for the action actually taken at S_{t+1}]
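The only change from the Q-learning update is that the backed-up value is the Q-value of the action actually taken next, not the maximum. A minimal sketch, using a plain dict as the table:

```python
def sarsa_update(Q, s, x, r, s_next, x_next, alpha, gamma):
    """On-policy update: Q_{t+1}(S_t,x_t) = Q_t(S_t,x_t)
       + alpha [ R_{t+1} + gamma Q_t(S_{t+1},x_{t+1}) - Q_t(S_t,x_t) ]."""
    q_sx = Q.get((s, x), 0.0)
    target = r + gamma * Q.get((s_next, x_next), 0.0)  # value of the action taken
    Q[(s, x)] = q_sx + alpha * (target - q_sx)
    return Q[(s, x)]
```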
19. ALGORITHMS – SARSA(λ)
▪ SARSA(λ) adds a backward pass along the visited path (an eligibility trace):
Q_{t+1}(s, x) = Q_t(s, x) + α δ_t e_t(s, x)
δ_t = R_{t+1} + γ Q_t(S_{t+1}, x_{t+1}) − Q_t(S_t, x_t)
e_t(s, x) = γλ e_{t−1}(s, x) + 1 if (s, x) = (S_t, x_t); γλ e_{t−1}(s, x) otherwise
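One step of the trace update can be sketched as follows, with Q as a defaultdict table and e as a dict of traces (both layouts are mine):

```python
from collections import defaultdict

def sarsa_lambda_update(Q, e, s, x, r, s_next, x_next, alpha, gamma, lam):
    """One SARSA(lambda) step: compute delta_t, decay all traces by
    gamma*lambda, bump the visited pair by 1, then credit every traced pair."""
    delta = r + gamma * Q[(s_next, x_next)] - Q[(s, x)]
    for key in list(e):
        e[key] *= gamma * lam                 # e_t = gamma*lambda*e_{t-1} ...
    e[(s, x)] = e.get((s, x), 0.0) + 1.0      # ... + 1 for (S_t, x_t)
    for key, trace in e.items():
        Q[key] += alpha * delta * trace       # Q_{t+1} = Q_t + alpha*delta*e_t
    return delta
```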
21. ALGORITHMS – Step Sizes
▪ Updating equation:
Q_{t+1}(S_t, x_t) = Q_t(S_t, x_t) + α [ R_{t+1} + γ max_x Q_t(S_{t+1}, x) − Q_t(S_t, x_t) ]
▪ Rearranged:
Q_{t+1}(S_t, x_t) = (1 − α) Q_t(S_t, x_t) + α [ R_{t+1} + γ max_x Q_t(S_{t+1}, x) ]
23. ALGORITHMS – Step Sizes
▪ Constant: α = c
▪ Harmonic: α = a / (a + n)
▪ 1/n: α = 1/n
▪ Ryzhov formula
  ▪ Ryzhov, Frazier & Powell (2014)
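The three closed-form rules as functions of the update count n; the default constants c and a are placeholders, not tuned values from the study.

```python
def alpha_constant(n, c=0.1):
    """Constant rule: alpha = c for every update."""
    return c

def alpha_harmonic(n, a=25.0):
    """Harmonic rule: alpha = a / (a + n); larger a slows the decline."""
    return a / (a + n)

def alpha_one_over_n(n):
    """1/n rule: alpha = 1/n, the sample-average step size."""
    return 1.0 / n
```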
25. METHODOLOGY – Simulation
▪ 1 deterministic problem
▪ 17 stochastic problems
▪ 256 sample paths per stochastic problem
▪ Training iterations: transitions generated by Monte Carlo simulation
▪ ε-greedy action selection
▪ Evaluation: averaged over all sample paths
27. DATA – Deterministic Parameters
▪ R^{max} = 100
▪ R^{min} = 0
▪ R_0 = 0
▪ ρ^c = ρ^d = 0.90
▪ γ^c = γ^d = 0.10
▪ T = ๐๐๐๐
31. DATA – Stochastic Parameters
▪ R^{max} = 30
▪ R^{min} = 0
▪ R_0 = 25
▪ ρ^c = ρ^d = 1.00
▪ γ^c = γ^d = 5
▪ T = ๐๐๐
▪ P^{max} = 70
▪ P^{min} = 30
▪ E^{max} = 7.00
▪ E^{min} = 1.00
36. RESULTS – METRICS
▪ Action traces (deterministic)
▪ Step size over updates
▪ Stored energy over time vs. benchmark
▪ Performance over number of training iterations
42. STEP SIZE TUNING
▪ Declining step sizes
▪ Tunable parameters
  ▪ Harmonic: a / (a + n)
  ▪ Ryzhov: estimator update factor ν
45. RYZHOV STEP SIZE
▪ No-discount formula:
α_{n−1} = (σ̄^n)² / ( (σ̄^n)² + (σ^n)² )
▪ Tunable parameter ν:
μ^n = (1 − ν_{n−1}) μ^{n−1} + ν_{n−1} q̂^n
(σ^n)² = (1 − ν_{n−1}) (σ^{n−1})² + ν_{n−1} (q̂^n − μ^{n−1})²
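The two estimator recursions code directly as exponential smoothing with update factor ν; how these estimates feed the step-size ratio itself follows Ryzhov, Frazier & Powell and is not reproduced here.

```python
def ryzhov_estimates(mu_prev, var_prev, q_n, nu):
    """Smoothed mean and variance of the observed values q^n:
    mu^n        = (1 - nu) mu^{n-1}        + nu q^n
    (sigma^n)^2 = (1 - nu) (sigma^{n-1})^2 + nu (q^n - mu^{n-1})^2"""
    mu = (1.0 - nu) * mu_prev + nu * q_n
    var = (1.0 - nu) * var_prev + nu * (q_n - mu_prev) ** 2
    return mu, var
```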
46. RYZHOV STEP SIZES
[Figure: Ryzhov step size vs. update step, at training iterations 1, 128, 2k, 4k, 8k, 16k, 32k, 65k, 131k, 262k, 524k, 1M, 2M, and 4M]
47. Ryzhov Explanation
▪ The step size should level off at a constant determined by the variance of the rewards.
▪ For this problem the rewards are deterministic, so the measured variance is the variance across the different states that are visited.
55. Explanations
▪ Ryzhov flattens out very noticeably because the step size stays too high: the updates repeatedly overcompensate and bounce around.
▪ Harmonic looks like it is still improving. It needs to decline more slowly rather than go to 0 so quickly; perhaps use the Ryzhov estimate as the harmonic constant.
56. STEP SIZE PERFORMANCE
[Figure: proportion of optimal vs. training iteration, across all problems, for the Constant 0.1, Harmonic, 1/n, and Ryzhov step sizes]
59. FURTHER EXPLORATION
▪ Additional renewable energy sources / storage devices
▪ Finer levels of discretization
▪ Removing the wind-usage restriction
▪ Additional algorithms
  ▪ Actor-Critic
  ▪ Gradient descent (linear/non-linear VFAs)