Understanding (Exact) Dynamic Programming through
Bellman Operators
Ashwin Rao
ICME, Stanford University
January 14, 2019
Ashwin Rao (Stanford) Bellman Operators January 14, 2019 1 / 11
Overview
1 Vector Space of Value Functions
2 Bellman Operators
3 Contraction and Monotonicity
4 Policy Evaluation
5 Policy Iteration
6 Value Iteration
7 Policy Optimality
Vector Space of Value Functions
Assume State space S consists of n states: {s1, s2, . . . , sn}
Assume Action space A consists of m actions {a1, a2, . . . , am}
This exposition extends easily to continuous state/action spaces too
We denote a stochastic policy as π(a|s) (probability of “a given s”)
Abusing notation, deterministic policy denoted as π(s) = a
Consider an n-dimensional vector space, each dimension corresponding to a state in S
A vector in this space is a specific Value Function (VF) v: S → R
With coordinates [v(s1), v(s2), . . . , v(sn)]
Value Function (VF) for a policy π is denoted as vπ : S → R
Optimal VF denoted as v∗ : S → R such that for any s ∈ S,
$$v_*(s) = \max_{\pi} v_\pi(s)$$
Some more notation
Denote $R_s^a$ as the Expected Reward upon action a in state s
Denote $P_{s,s'}^a$ as the probability of transition s → s' upon action a
Define
$$R_\pi(s) = \sum_{a \in A} \pi(a|s) \cdot R_s^a$$
$$P_\pi(s, s') = \sum_{a \in A} \pi(a|s) \cdot P_{s,s'}^a$$
Denote $R_\pi$ as the vector $[R_\pi(s_1), R_\pi(s_2), \ldots, R_\pi(s_n)]$
Denote $P_\pi$ as the matrix $[P_\pi(s_i, s_{i'})]$, $1 \le i, i' \le n$
Denote γ as the MDP discount factor
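As a concrete illustration of these definitions, here is a minimal sketch in plain Python that builds Rπ and Pπ for a hypothetical MDP with 2 states and 2 actions. The numbers in `R`, `P` and `pi`, and the function names `policy_reward` and `policy_transition`, are assumptions made for illustration and are not from the slides.

```python
# Hypothetical MDP with 2 states and 2 actions (all numbers are illustrative).
R = [[1.0, 0.0],                 # R[s][a]: expected reward for action a in state s
     [0.0, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s][a][s2]: probability of s -> s2 under action a
     [[0.7, 0.3], [0.1, 0.9]]]
pi = [[0.5, 0.5],                # pi[s][a]: probability of action a given state s
      [0.25, 0.75]]


def policy_reward(R, pi):
    """R_pi(s) = sum over actions a of pi(a|s) * R_s^a."""
    return [sum(p_a * r_a for p_a, r_a in zip(pi[s], R[s]))
            for s in range(len(R))]


def policy_transition(P, pi):
    """P_pi(s, s2) = sum over actions a of pi(a|s) * P_{s,s2}^a."""
    n = len(P)
    return [[sum(pi[s][a] * P[s][a][s2] for a in range(len(pi[s])))
             for s2 in range(n)]
            for s in range(n)]


R_pi = policy_reward(R, pi)      # vector with one entry per state
P_pi = policy_transition(P, pi)  # n x n matrix whose rows each sum to 1
print(R_pi)
print(P_pi)
```

Each row of `P_pi` is a probability distribution over next states, because it is a convex combination of the rows `P[s][a]`.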
Bellman Operators Bπ and B∗
We define operators that transform a VF vector to another VF vector
Bellman Policy Operator Bπ (for policy π) operating on VF vector v:
Bπv = Rπ + γPπ · v
Bπ is a linear operator with fixed point vπ, meaning Bπvπ = vπ
Bellman Optimality Operator B∗ operating on VF vector v:
$$(B_* v)(s) = \max_{a} \Big\{ R_s^a + \gamma \sum_{s' \in S} P_{s,s'}^a \cdot v(s') \Big\}$$
B∗ is a non-linear operator with fixed point v∗, meaning B∗v∗ = v∗
Define a function G mapping a VF v to a deterministic “greedy”
policy G(v) as follows:
$$G(v)(s) = \arg\max_{a} \Big\{ R_s^a + \gamma \sum_{s' \in S} P_{s,s'}^a \cdot v(s') \Big\}$$
$B_{G(v)} v = B_* v$ for any VF v (Policy G(v) achieves the max in $B_*$)
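The two operators and the greedy map G can be sketched in a few lines of plain Python. The MDP below is hypothetical (its numbers, the value of `GAMMA`, and all function names are assumptions for illustration). A stochastic policy is stored as `pi[s][a]`, and `greedy` returns a deterministic policy in one-hot form, breaking ties in favor of the lowest action index.

```python
GAMMA = 0.9                                                  # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                                 # R[s][a], illustrative
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]     # P[s][a][s2], illustrative
N_STATES, N_ACTIONS = 2, 2


def q_value(v, s, a):
    """One-step lookahead: R_s^a + gamma * sum over s2 of P_{s,s2}^a * v(s2)."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_STATES))


def bellman_policy(pi, v):
    """B_pi v = R_pi + gamma * P_pi v, written as an average of lookaheads over pi(a|s)."""
    return [sum(pi[s][a] * q_value(v, s, a) for a in range(N_ACTIONS))
            for s in range(N_STATES)]


def bellman_optimal(v):
    """(B_* v)(s): maximum of the one-step lookahead over actions."""
    return [max(q_value(v, s, a) for a in range(N_ACTIONS))
            for s in range(N_STATES)]


def greedy(v):
    """G(v): deterministic policy (one-hot rows) that attains the max in B_*."""
    policy = []
    for s in range(N_STATES):
        best = max(range(N_ACTIONS), key=lambda a: q_value(v, s, a))
        policy.append([1.0 if a == best else 0.0 for a in range(N_ACTIONS)])
    return policy


v = [0.0, 1.0]                          # an arbitrary VF vector
lhs = bellman_policy(greedy(v), v)      # B_{G(v)} v
rhs = bellman_optimal(v)                # B_* v
print(greedy(v), lhs, rhs)
```

For any `v`, `lhs` and `rhs` agree up to floating-point rounding, which is the identity $B_{G(v)} v = B_* v$ in numeric form.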
Contraction and Monotonicity of Operators
Both Bπ and B∗ are γ-contraction operators in L∞ norm, meaning:
For any two VFs v1 and v2,
$$\| B_\pi v_1 - B_\pi v_2 \|_\infty \le \gamma \, \| v_1 - v_2 \|_\infty$$
$$\| B_* v_1 - B_* v_2 \|_\infty \le \gamma \, \| v_1 - v_2 \|_\infty$$
So, for γ < 1, we can invoke the Contraction Mapping Theorem to claim that each operator has a unique fixed point
We use the notation v1 ≤ v2 for any two VFs v1, v2 to mean:
v1(s) ≤ v2(s) for all s ∈ S
Also, both Bπ and B∗ are monotonic, meaning:
For any two VFs v1 and v2,
v1 ≤ v2 ⇒ Bπv1 ≤ Bπv2
v1 ≤ v2 ⇒ B∗v1 ≤ B∗v2
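Both properties can be spot-checked numerically. The sketch below samples random VF vectors for a hypothetical 2-state MDP and tests the contraction and monotonicity inequalities for one deterministic policy and for B∗. The MDP numbers, `GAMMA`, the helper names and the tolerance `eps` are assumptions for illustration, and a sampled check is evidence, not a proof.

```python
import random

GAMMA = 0.9                                                  # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                                 # R[s][a], illustrative
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]     # P[s][a][s2], illustrative
N_STATES, N_ACTIONS = 2, 2


def q_value(v, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_STATES))


def bellman_policy(pi, v):
    """B_pi v for a deterministic policy given as a list of actions, pi[s] = a."""
    return [q_value(v, s, pi[s]) for s in range(N_STATES)]


def bellman_optimal(v):
    return [max(q_value(v, s, a) for a in range(N_ACTIONS))
            for s in range(N_STATES)]


def sup_norm(u, w):
    """L-infinity distance between two VF vectors."""
    return max(abs(x - y) for x, y in zip(u, w))


def holds_on_samples(op, trials=1000, seed=0, eps=1e-12):
    """Return True if both inequalities hold for every sampled pair of vectors."""
    rng = random.Random(seed)
    for _ in range(trials):
        v1 = [rng.uniform(-10.0, 10.0) for _ in range(N_STATES)]
        v2 = [rng.uniform(-10.0, 10.0) for _ in range(N_STATES)]
        # Contraction: ||op(v1) - op(v2)|| <= gamma * ||v1 - v2||
        if sup_norm(op(v1), op(v2)) > GAMMA * sup_norm(v1, v2) + eps:
            return False
        # Monotonicity: v1 <= v3 componentwise implies op(v1) <= op(v3)
        v3 = [x + rng.uniform(0.0, 5.0) for x in v1]
        if any(x > y + eps for x, y in zip(op(v1), op(v3))):
            return False
    return True


policy_ok = holds_on_samples(lambda v: bellman_policy([0, 1], v))
optimal_ok = holds_on_samples(bellman_optimal)
print(policy_ok, optimal_ok)
```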
Policy Evaluation
Bπ satisfies the conditions of Contraction Mapping Theorem
Bπ has a unique fixed point vπ, meaning Bπvπ = vπ
This is a succinct representation of Bellman Expectation Equation
Starting with any VF v and repeatedly applying Bπ, we will reach vπ
$$\lim_{N \to \infty} B_\pi^N v = v_\pi \quad \text{for any VF } v$$
This is a succinct representation of the Policy Evaluation Algorithm
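A minimal sketch of Policy Evaluation as repeated application of Bπ, in plain Python. The 2-state MDP, `GAMMA`, the stopping tolerance `tol` and the function names are assumptions for illustration. The policy is deterministic and stored as a list of actions.

```python
GAMMA = 0.9                                                  # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                                 # R[s][a], illustrative
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]     # P[s][a][s2], illustrative
N_STATES = 2


def bellman_policy(pi, v):
    """B_pi v for a deterministic policy pi[s] = a."""
    return [R[s][pi[s]] + GAMMA * sum(P[s][pi[s]][s2] * v[s2]
                                      for s2 in range(N_STATES))
            for s in range(N_STATES)]


def sup_norm(u, w):
    return max(abs(x - y) for x, y in zip(u, w))


def evaluate_policy(pi, v0=None, tol=1e-10, max_iter=10_000):
    """Apply B_pi repeatedly until successive iterates differ by less than tol."""
    v = list(v0) if v0 is not None else [0.0] * N_STATES
    for _ in range(max_iter):
        new_v = bellman_policy(pi, v)
        if sup_norm(new_v, v) < tol:
            return new_v
        v = new_v
    return v


pi = [0, 1]                                        # action 0 in state 0, action 1 in state 1
v_from_zero = evaluate_policy(pi)
v_from_other = evaluate_policy(pi, v0=[100.0, -50.0])
print(v_from_zero, v_from_other)
```

Both starting points lead to the same vector up to the tolerance, as the uniqueness of the fixed point predicts. For this small example the fixed point can also be obtained by solving the linear system $(I - \gamma P_\pi) v = R_\pi$ directly.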
Policy Improvement
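Let $\pi_k$ and $v_{\pi_k}$ denote the Policy and the VF for the Policy in iteration k of Policy Iteration
Policy Improvement Step is: $\pi_{k+1} = G(v_{\pi_k})$, i.e. deterministic greedy
Earlier we argued that $B_* v = B_{G(v)} v$ for any VF v. Therefore,
$$B_* v_{\pi_k} = B_{G(v_{\pi_k})} v_{\pi_k} = B_{\pi_{k+1}} v_{\pi_k} \qquad (1)$$
We also know from operator definitions that $B_* v \ge B_\pi v$ for all π, v
$$B_* v_{\pi_k} \ge B_{\pi_k} v_{\pi_k} = v_{\pi_k} \qquad (2)$$
Combining (1) and (2), we get:
$$B_{\pi_{k+1}} v_{\pi_k} \ge v_{\pi_k}$$
Monotonicity of $B_{\pi_{k+1}}$ implies
$$B_{\pi_{k+1}}^N v_{\pi_k} \ge \ldots \ge B_{\pi_{k+1}}^2 v_{\pi_k} \ge B_{\pi_{k+1}} v_{\pi_k} \ge v_{\pi_k}$$
$$v_{\pi_{k+1}} = \lim_{N \to \infty} B_{\pi_{k+1}}^N v_{\pi_k} \ge v_{\pi_k}$$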
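The chain of inequalities above can be traced numerically for a single improvement step. The sketch below uses a hypothetical 2-state MDP. Its numbers, `GAMMA`, the starting policy `pi_k`, the tolerances and the function names are assumptions for illustration. Policies are deterministic lists of actions.

```python
GAMMA = 0.9                                                  # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                                 # R[s][a], illustrative
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]     # P[s][a][s2], illustrative
N_STATES, N_ACTIONS = 2, 2


def q_value(v, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_STATES))


def bellman_policy(pi, v):
    return [q_value(v, s, pi[s]) for s in range(N_STATES)]


def greedy(v):
    """G(v) as a list of actions; ties go to the lowest action index."""
    return [max(range(N_ACTIONS), key=lambda a: q_value(v, s, a))
            for s in range(N_STATES)]


def evaluate_policy(pi, tol=1e-10, max_iter=10_000):
    """Iterate B_pi from the zero vector until the update is smaller than tol."""
    v = [0.0] * N_STATES
    for _ in range(max_iter):
        new_v = bellman_policy(pi, v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


pi_k = [0, 0]                          # arbitrary starting policy
v_k = evaluate_policy(pi_k)            # v_{pi_k}
pi_next = greedy(v_k)                  # pi_{k+1} = G(v_{pi_k})

# Repeated application of B_{pi_{k+1}} to v_{pi_k} should never decrease any entry.
chain = [v_k]
for _ in range(50):
    chain.append(bellman_policy(pi_next, chain[-1]))
chain_is_monotone = all(
    all(b >= a - 1e-9 for a, b in zip(prev, cur))
    for prev, cur in zip(chain, chain[1:])
)

v_next = evaluate_policy(pi_next)      # v_{pi_{k+1}}, the limit of the chain
print(pi_next, v_k, v_next, chain_is_monotone)
```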
Policy Iteration
We have shown that in iteration k + 1 of Policy Iteration, $v_{\pi_{k+1}} \ge v_{\pi_k}$
If $v_{\pi_{k+1}} = v_{\pi_k}$, the above inequalities would hold as equalities
So this would mean $B_* v_{\pi_k} = v_{\pi_k}$
But $B_*$ has a unique fixed point $v_*$
So this would mean $v_{\pi_k} = v_*$
Thus, at each iteration, Policy Iteration either strictly improves the
VF or achieves the optimal VF v∗
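A minimal sketch of the full Policy Iteration loop for a hypothetical 2-state MDP. The MDP numbers, `GAMMA`, the tolerances and the function names are assumptions for illustration. The loop stops when the greedy policy no longer changes, a common stopping rule assumed here; in that case $v_{\pi_{k+1}} = v_{\pi_k}$, so the argument above applies. The iteration cap is a guard that is not part of the slides.

```python
GAMMA = 0.9                                                  # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                                 # R[s][a], illustrative
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]     # P[s][a][s2], illustrative
N_STATES, N_ACTIONS = 2, 2


def q_value(v, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_STATES))


def bellman_optimal(v):
    return [max(q_value(v, s, a) for a in range(N_ACTIONS))
            for s in range(N_STATES)]


def greedy(v):
    """G(v) as a list of actions; ties go to the lowest action index."""
    return [max(range(N_ACTIONS), key=lambda a: q_value(v, s, a))
            for s in range(N_STATES)]


def evaluate_policy(pi, tol=1e-10, max_iter=10_000):
    """Policy Evaluation: iterate B_pi from zero until the update is below tol."""
    v = [0.0] * N_STATES
    for _ in range(max_iter):
        new_v = [q_value(v, s, pi[s]) for s in range(N_STATES)]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


def policy_iteration(pi):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    history = []
    # There are only N_ACTIONS ** N_STATES deterministic policies, so cap the loop.
    for _ in range(N_ACTIONS ** N_STATES + 1):
        v = evaluate_policy(pi)
        history.append((pi, v))
        new_pi = greedy(v)
        if new_pi == pi:
            break
        pi = new_pi
    return pi, v, history


pi_star, v_star, history = policy_iteration([0, 0])
for policy, value in history:
    print(policy, value)
```

Each printed VF dominates the previous one componentwise, and the final VF satisfies $B_* v = v$ up to the evaluation tolerance.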
Value Iteration
B∗ satisfies the conditions of Contraction Mapping Theorem
B∗ has a unique fixed point v∗, meaning B∗v∗ = v∗
This is a succinct representation of Bellman Optimality Equation
Starting with any VF v and repeatedly applying B∗, we will reach v∗
$$\lim_{N \to \infty} B_*^N v = v_* \quad \text{for any VF } v$$
This is a succinct representation of the Value Iteration Algorithm
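A minimal sketch of Value Iteration as repeated application of B∗, in plain Python. The 2-state MDP, `GAMMA`, the stopping tolerance `tol` and the function names are assumptions for illustration.

```python
GAMMA = 0.9                                                  # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                                 # R[s][a], illustrative
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]     # P[s][a][s2], illustrative
N_STATES, N_ACTIONS = 2, 2


def bellman_optimal(v):
    """(B_* v)(s) = max over a of R_s^a + gamma * sum over s2 of P_{s,s2}^a * v(s2)."""
    return [max(R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2]
                                      for s2 in range(N_STATES))
                for a in range(N_ACTIONS))
            for s in range(N_STATES)]


def value_iteration(v0=None, tol=1e-10, max_iter=10_000):
    """Apply B_* repeatedly until successive iterates differ by less than tol."""
    v = list(v0) if v0 is not None else [0.0] * N_STATES
    for _ in range(max_iter):
        new_v = bellman_optimal(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


v_star = value_iteration()
v_star_other_start = value_iteration(v0=[-30.0, 40.0])
print(v_star, v_star_other_start)
```

Because B∗ is a γ-contraction, the distance to v∗ shrinks by at least a factor γ per application, so the number of iterations needed grows roughly like log(1/tol) / log(1/γ).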
Greedy Policy from Optimal VF is an Optimal Policy
Earlier we argued that $B_{G(v)} v = B_* v$ for any VF v. Therefore,
$$B_{G(v_*)} v_* = B_* v_*$$
But $v_*$ is the fixed point of $B_*$, meaning $B_* v_* = v_*$. Therefore,
$$B_{G(v_*)} v_* = v_*$$
But we know that $B_{G(v_*)}$ has a unique fixed point $v_{G(v_*)}$. Therefore,
$$v_* = v_{G(v_*)}$$
This says that simply following the deterministic greedy policy G(v∗)
(created from the Optimal VF v∗) in fact achieves the Optimal VF v∗
In other words, G(v∗) is an Optimal (Deterministic) Policy
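This conclusion can be checked numerically: approximate v∗ by iterating B∗, extract the greedy policy, evaluate that policy by iterating its own operator, and compare the two VFs. The sketch below uses a hypothetical 2-state MDP. Its numbers, `GAMMA`, the tolerance and the function names are assumptions for illustration.

```python
GAMMA = 0.9                                                  # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]                                 # R[s][a], illustrative
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]     # P[s][a][s2], illustrative
N_STATES, N_ACTIONS = 2, 2


def q_value(v, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_STATES))


def fixed_point(op, tol=1e-10, max_iter=10_000):
    """Iterate a contraction op from the zero vector until the update is below tol."""
    v = [0.0] * N_STATES
    for _ in range(max_iter):
        new_v = op(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


def bellman_optimal(v):
    return [max(q_value(v, s, a) for a in range(N_ACTIONS))
            for s in range(N_STATES)]


def greedy(v):
    """G(v) as a list of actions; ties go to the lowest action index."""
    return [max(range(N_ACTIONS), key=lambda a: q_value(v, s, a))
            for s in range(N_STATES)]


v_star = fixed_point(bellman_optimal)                 # approximates v_*
pi_greedy = greedy(v_star)                            # G(v_*)
v_greedy = fixed_point(                               # approximates v_{G(v_*)}
    lambda v: [q_value(v, s, pi_greedy[s]) for s in range(N_STATES)]
)
gap = max(abs(x - y) for x, y in zip(v_star, v_greedy))
print(pi_greedy, v_star, v_greedy, gap)
```

The printed `gap` is on the order of the stopping tolerance, consistent with $v_* = v_{G(v_*)}$.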