Role of Bellman’s Equation in Reinforcement Learning
Dr. Varun Kumar
Outline
1 Policy in Reinforcement Learning
2 Basic Problem
3 Bellman’s Optimality Criterion
4 Dynamic Programming Algorithm
5 Bellman’s Optimality Equation
6 Example
7 References
Policy in Reinforcement Learning
A brief overview of reinforcement learning
⇒ At each transition from one state to another, a cost is incurred by the agent.
⇒ At the nth transition from state i to state j under action a_{ik}, the agent incurs a cost

Cost = γ^n g(i, a_{ik}, j)    (1)

♦ g(·, ·, ·) → Prescribed cost function
♦ γ → Discount factor, 0 ≤ γ < 1
⇒ Reinforcement learning needs a proper policy, i.e., a mapping from states to actions.
⇒ Policy: a rule used by the agent to decide what to do, given knowledge of the current state of the environment. It is denoted as

π = {µ_0, µ_1, µ_2, ...}    (2)
Dr. Varun Kumar Machine Learning 3 / 17
Continued–
where µ_n is a function that maps the state X_n = i into an action
A_n = a at time-step n = 0, 1, 2, .... This mapping is such that

µ_n(i) ∈ A_i for all states i ∈ X

A_i → Set of all possible actions the agent can take in state i.
Types of policy:
Non-stationary → Time-varying → π = {µ_0, µ_1, µ_2, ...}
Stationary → Time-invariant → π = {µ, µ, µ, ...}
⇒ A stationary policy specifies exactly the same action each time a
particular state is visited (see the sketch below).
⇒ Under a stationary policy, the underlying Markov chain may be stationary or
non-stationary.
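As a concrete illustration, a policy is just a rule mapping states (and, if non-stationary, the time-step) to actions. A minimal Python sketch, with made-up state and action labels:

```python
# Minimal sketch of a policy as a mapping from states to actions.
# State and action labels here are made up for illustration.

mu = {"s0": "a0", "s1": "a1", "s2": "a0"}        # one mapping µ

def pi_stationary(n, state):
    """Stationary policy: the same µ at every time-step n."""
    return mu[state]

mu_per_step = [                                   # a different µ_n per step
    {"s0": "a0", "s1": "a0", "s2": "a1"},         # µ_0
    {"s0": "a1", "s1": "a1", "s2": "a0"},         # µ_1
]

def pi_nonstationary(n, state):
    """Non-stationary policy π = {µ_0, µ_1, ...}: the mapping varies with n."""
    return mu_per_step[n][state]

print(pi_stationary(0, "s1"), pi_stationary(7, "s1"))        # a1 a1
print(pi_nonstationary(0, "s1"), pi_nonstationary(1, "s1"))  # a0 a1
```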
Basic problem in dynamic programming
Infinite horizon problem
⇒ In an infinite-horizon problem, the cost accumulates over an infinite
number of stages.
⇒ The infinite-horizon formulation also provides a reasonable approximation
for problems with a large but finite number of stages.
⇒ Let g(X_n, µ_n(X_n), X_{n+1}) be the observed cost incurred as a result of a
transition from state X_n to state X_{n+1} under the action of policy
µ_n(X_n). The total expected cost in an infinite-horizon problem is

J^π(i) = E[ Σ_{n=0}^{∞} γ^n g(X_n, µ_n(X_n), X_{n+1}) | X_0 = i ]    (3)

J^π(i) → Cost-to-go function
γ → Discount factor
X_0 = i → Starting state
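Equation (3) can be approximated by simulation: run many episodes under the policy, sum the discounted costs, and average. A hedged sketch, where `mu` and `step` are hypothetical stand-ins for the policy and the environment model:

```python
import random

# Monte Carlo sketch of the cost-to-go J^π(i) in (3): simulate episodes
# under the policy, accumulate γ^n · g(X_n, µ(X_n), X_{n+1}), and average.
# Truncation at n_max is safe because γ^n shrinks the tail (γ < 1).
# `mu` and `step` are hypothetical stand-ins for the policy and environment.

def estimate_cost_to_go(i, mu, step, gamma=0.9, n_max=200, episodes=2000):
    total = 0.0
    for _ in range(episodes):
        state, discounted = i, 0.0
        for n in range(n_max):
            state, cost = step(state, mu(state))   # sample one transition
            discounted += (gamma ** n) * cost      # γ^n g(X_n, µ(X_n), X_{n+1})
        total += discounted
    return total / episodes                        # empirical mean

# Toy two-state chain used only to exercise the estimator.
def step(state, action):
    return random.choice([0, 1]), (1.0 if state == 0 else 2.0)

print(estimate_cost_to_go(0, mu=lambda s: "noop", step=step))  # ≈ 14.5
```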
Bellman’s optimality criterion
Note:
A stationary Markovian decision process describes the interaction
between an agent and its environment.
The aim is to find a stationary policy, π = {µ, µ, µ, ...},
that minimizes the cost-to-go function J^π(i) for all initial states i.
Bellman’s optimality criterion
Statement: An optimal policy has the property that whatever the initial
state and initial decision are, the remaining decisions must constitute an
optimal policy starting from the state resulting from the first decision.
Decision: A choice of control at a particular time.
Policy: Entire control sequence or control function.
Finite horizon
A finite-horizon problem is one for which the cost-to-go function is defined as

J_0(X_0) = E[ g_K(X_K) + Σ_{n=0}^{K−1} g_n(X_n, µ_n(X_n), X_{n+1}) ]    (4)

1 K is the planning horizon (number of stages)
2 g_K(X_K) is the terminal cost
3 The expectation is taken with respect to the remaining states X_1, X_2, ..., given the starting state X_0
Optimal policy
⇒ Let π* = {µ*_0, µ*_1, ..., µ*_{K−1}} be an optimal policy for the finite-horizon problem.
⇒ Consider a sub-problem where the environment is in state X_n at time
n and we want to minimize the cost-to-go function J_n(X_n).
Continued–
J_n(X_n) = E[ g_K(X_K) + Σ_{k=n}^{K−1} g_k(X_k, µ_k(X_k), X_{k+1}) ]    (5)

for n = 0, 1, ..., K − 1. Here the truncated policy π*_n = {µ*_n, µ*_{n+1}, ..., µ*_{K−1}}
will be optimal for the sub-problem.
Dynamic-programming algorithm
The dynamic-programming algorithm proceeds backward in time, from stage K − 1 to stage 0.
Let π = {µ_0, µ_1, ..., µ_{K−1}} denote a permissible policy.
For each n = 0, 1, ..., K − 1, let π_n = {µ_n, µ_{n+1}, ..., µ_{K−1}}.
J*_n(X_n) is the optimal cost for the last (K − n) stages: the sub-problem
starts at state X_n at time n and ends at time K.

J*_n(X_n) = min_{π_n} E_{X_{n+1},...,X_K}[ g_K(X_K) + Σ_{k=n}^{K−1} g_k(X_k, µ_k(X_k), X_{k+1}) ]
          = min_{µ_n} E_{X_{n+1}}[ g_n(X_n, µ_n(X_n), X_{n+1}) + J*_{n+1}(X_{n+1}) ]    (6)
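A minimal sketch of the backward recursion (6) on a toy problem; the transition matrices, stage costs, and terminal costs below are illustrative values, not taken from the slides:

```python
import numpy as np

# Backward recursion of eq. (6): start from J*_K(x) = g_K(x) and step
# back through n = K-1, ..., 0. The numbers below are illustrative only.

K, actions = 3, (0, 1)
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),   # P(y | x, a): rows x, cols y
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
g = {0: np.array([[1.0, 4.0], [2.0, 0.0]]),   # stage cost g(x, a, y)
     1: np.array([[2.0, 1.0], [0.0, 3.0]])}
g_K = np.array([0.0, 5.0])                    # terminal cost g_K(x)

J = g_K.copy()                                # J*_K
policy = []
for n in reversed(range(K)):                  # n = K-1, ..., 0
    # Q[a, x] = sum_y P(y|x,a) * (g(x,a,y) + J*_{n+1}(y))
    Q = np.stack([(P[a] * (g[a] + J)).sum(axis=1) for a in actions])
    policy.insert(0, Q.argmin(axis=0))        # µ*_n(x)
    J = Q.min(axis=0)                         # J*_n(x)

print("J*_0 =", J, " µ*_0 =", policy[0])
```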
Bellman’s optimality equation
⇒ The dynamic-programming algorithm deals with a finite-horizon problem.
Aim:
⇒ To extend the dynamic-programming algorithm to an infinite-horizon
problem.
Consider the discounted problem described by the cost-to-go function in (3)
under a stationary policy π = {µ, µ, ...}. Two steps serve this
objective:
1 Reverse the time index of the algorithm.
2 Define the cost g_n(X_n, µ(X_n), X_{n+1}) as

g_n(X_n, µ(X_n), X_{n+1}) = γ^n g(X_n, µ(X_n), X_{n+1})    (7)
Reformulating the dynamic-programming algorithm then gives

J_{n+1}(X_0) = min_µ E_{X_1}[ g(X_0, µ(X_0), X_1) + γ J_n(X_1) ]    (8)
Continued–
Let J*(i) denote the optimal infinite-horizon cost for the initial state
X_0 = i; then it can be expressed as

J*(i) = lim_{K→∞} J_K(i)    (9)

To express the optimal infinite-horizon cost J*(i), we proceed in two
stages.

1 Evaluate the expectation of the cost g(i, µ(i), X_1) with respect to X_1:

E[g(i, µ(i), X_1)] = Σ_{j=1}^{N} p_ij g(i, µ(i), j)    (10)

(a) N → Number of states of the environment.
(b) p_ij → Transition probability from state X_0 = i to X_1 = j.
Continued–
The quantity defined in (10) is the immediate expected cost incurred at
state i by the action recommended by the policy µ. Denote this cost
by c(i, µ(i)):

c(i, µ(i)) = Σ_{j=1}^{N} p_ij g(i, µ(i), j)    (11)

2 Similarly, evaluate the expectation of the optimal cost J*(X_1) with respect to X_1:

E[J*(X_1)] = Σ_{j=1}^{N} p_ij J*(j)    (12)

Combining (11) and (12) yields Bellman's optimality equation:

J*(i) = min_µ [ c(i, µ(i)) + γ Σ_{j=1}^{N} p_ij(µ) J*(j) ],   i = 1, 2, ..., N    (13)
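Since γ < 1, the right-hand side of (13) is a contraction, so iterating it converges to J* (value iteration). A sketch with illustrative two-state, two-action data (the probabilities and costs are made up for the example):

```python
import numpy as np

# Value iteration: repeatedly apply the right-hand side of (13),
# J(i) <- min_µ [ c(i, µ(i)) + γ Σ_j p_ij(µ) J(j) ].
# The two-state, two-action model below is illustrative only.

gamma = 0.9
p = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),   # p_ij under action 0
     1: np.array([[0.4, 0.6], [0.5, 0.5]])}   # p_ij under action 1
c = {0: np.array([2.0, 1.0]),                 # c(i, a): immediate expected cost
     1: np.array([0.5, 3.0])}

J = np.zeros(2)
while True:
    Q = np.stack([c[a] + gamma * p[a] @ J for a in (0, 1)])  # bracket of (13)
    J_new = Q.min(axis=0)
    if np.max(np.abs(J_new - J)) < 1e-10:     # fixed point reached (γ < 1)
        break
    J = J_new

print("J* ≈", J, " optimal actions µ*(i):", Q.argmin(axis=0))
```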
Policy iteration
Policy iteration alternates two steps until the policy stops changing: (i) policy evaluation, computing the cost-to-go J^µ(i) of the current stationary policy µ from the immediate expected cost c(i, µ(i)) in (11) and the transition probabilities p_ij(µ); (ii) policy improvement, replacing µ(i) by the action that minimizes the bracketed term in (13).
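A hedged sketch of this loop, reusing the illustrative two-state model from the value-iteration sketch above; policy evaluation is done exactly by solving the linear system (I − γP_µ)J = c_µ:

```python
import numpy as np

# Policy iteration: evaluate the current stationary policy µ exactly,
# then improve it greedily; repeat until µ is stable. Same illustrative
# two-state model as in the value-iteration sketch.

gamma = 0.9
p = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.4, 0.6], [0.5, 0.5]])}
c = {0: np.array([2.0, 1.0]),
     1: np.array([0.5, 3.0])}

mu = np.zeros(2, dtype=int)                       # start: action 0 everywhere
while True:
    # Policy evaluation: solve (I - γ P_µ) J = c_µ for J^µ
    P_mu = np.stack([p[mu[i]][i] for i in range(2)])
    c_mu = np.array([c[mu[i]][i] for i in range(2)])
    J = np.linalg.solve(np.eye(2) - gamma * P_mu, c_mu)
    # Policy improvement: greedy action w.r.t. the bracket of (13)
    Q = np.stack([c[a] + gamma * p[a] @ J for a in (0, 1)])
    mu_new = Q.argmin(axis=0)
    if np.array_equal(mu_new, mu):                # stable policy -> optimal
        break
    mu = mu_new

print("µ* =", mu, " J^µ* =", J)
```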
Example 1
Dice game (stated in terms of reward)
For each round r = 1, 2, 3, ...
⇒ You can choose to stay or quit.
⇒ If you quit, you get $10 and the game ends.
⇒ If you stay, you get $4 and then roll a 6-sided die.
If the die shows 1 or 2, the game ends.
Otherwise, continue to the next round.
Continued–
Expected utility (of always staying)

Expected utility = (1/3)(4) + (2/3)(1/3)(8) + (2/3)^2(1/3)(12) + ··· = Σ_{n=1}^{∞} (2/3)^{n−1}(1/3)(4n) = 12

Here (2/3)^{n−1}(1/3) is the probability that the game ends after exactly n rounds, each of which pays $4.
MDP for dice game
Continued–
From the figure above:
⇒ Initial state → In
⇒ Next state → In or End (game over)
⇒ Successor function: (s, a) → transition probabilities T(s, a, s′), e.g., T(In, stay, In) = 2/3
⇒ Cost → Reward ($4 or $10)
⇒ Aim: Maximize the reward
⇒ Policy: formulated in terms of rewards to be maximized, not penalties (costs) to be minimized
Transition probability table T(s, a, s′), where s → current state, s′ → next state:

s    a     s′    T(s, a, s′)
In   Quit  End   1
In   Stay  In    2/3
In   Stay  End   1/3
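With these quantities, the expected utilities of the two stationary policies follow directly; a short sketch confirming that staying is optimal:

```python
# Expected utility of the dice game from state In, using T(s, a, s').
# V_quit = 10. For stay: V = 4 + (2/3) V, the geometric series above.

T = {("In", "Quit", "End"): 1.0,
     ("In", "Stay", "In"): 2/3,
     ("In", "Stay", "End"): 1/3}
reward = {"Quit": 10.0, "Stay": 4.0}

V_quit = reward["Quit"]
V_stay = reward["Stay"] / (1 - T[("In", "Stay", "In")])   # = 12
print(f"quit: {V_quit}, stay: {V_stay}")                   # stay is optimal
```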
References
E. Alpaydin, Introduction to Machine Learning. MIT Press, 2020.
T. M. Mitchell, The Discipline of Machine Learning. Carnegie Mellon University,
School of Computer Science, Machine Learning Department, 2006, vol. 9.
J. Grus, Data Science from Scratch: First Principles with Python. O'Reilly Media,
2019.