Value Functions and Markov Decision Process
Easwar Subramanian
TCS Innovation Labs, Hyderabad
Email : easwar.subramanian@tcs.com / cs5500.2020@iith.ac.in
August 12, 2022
Overview
1 Review
2 Value Function
3 Markov Decision Process
Review
Markov Property
A state s_t of a stochastic process {s_t}_{t∈T} is said to have the Markov property if

P(s_{t+1} | s_t) = P(s_{t+1} | s_1, · · · , s_t)

The state s_t at time t captures all relevant information from the history and is a sufficient
statistic of the future
State Transition Matrix
State Transition Probability
For a Markov state s and a successor state s′, the state transition probability is defined by

P_{ss′} = P(s_{t+1} = s′ | s_t = s)

The state transition matrix P then denotes the transition probabilities from all states s to all
successor states s′ (with each row summing to 1)

    [ P_11  P_12  ···  P_1n ]
P = [   ⋮           ⋱    ⋮  ]
    [ P_n1  P_n2  ···  P_nn ]
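As an aside (not part of the original slides), these definitions are easy to exercise in code. The sketch below builds a transition matrix for a hypothetical 3-state Markov chain, checks that every row sums to 1, and samples a trajectory in which the next state depends only on the current one:

```python
import numpy as np

# Hypothetical 3-state Markov chain (illustrative values only)
P = np.array([
    [0.5, 0.3, 0.2],   # P(s' | s = 0)
    [0.1, 0.6, 0.3],   # P(s' | s = 1)
    [0.0, 0.4, 0.6],   # P(s' | s = 2)
])

# Each row of a valid transition matrix sums to 1
assert np.allclose(P.sum(axis=1), 1.0)

def sample_chain(P, s0, steps, rng):
    """Sample a trajectory; the next state depends only on the current one (Markov property)."""
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P, s0=0, steps=10, rng=np.random.default_rng(0)))
```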
Markov Chain
A stochastic process {s_t}_{t∈T} is a Markov process or Markov chain if it satisfies the
Markov property for every state s_t. It is represented by a tuple < S, P >, where S denotes
the set of states and P denotes the state transition probability

No notion of reward or action
Markov Reward Process

A Markov reward process is a tuple < S, P, R, γ >; it is a Markov chain with values

▶ S : (Finite) set of states
▶ P : State transition probability
▶ R : Reward for being in state s_t, given by a deterministic function R

r_{t+1} = R(s_t)

▶ γ : Discount factor such that γ ∈ [0, 1]
▶ In general, the reward function can also be an expectation: R(s_t = s) = E[r_{t+1} | s_t = s]

No notion of action
Value Function
Snakes and Ladders : Revisited
▶ Reward R : R(s) = −1 for s ∈ {s1, · · · , s99}, and R(s100) = 0
▶ Discount factor γ = 1
Snakes and Ladders : Revisited
Question : Are all intermediate states equally 'valuable' just because they have equal reward?
Value Function
The value function V(s) gives the long-term value of state s ∈ S

V(s) = E(G_t | s_t = s) = E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )

▶ Value function V(s) determines the value of being in state s
▶ V(s) measures the potential future rewards we may get from being in state s
▶ V(s) is independent of t
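One way to make this definition concrete is to estimate V(s) by averaging sampled returns. The sketch below assumes a generic MRP given by a transition matrix P and a reward vector R with r_{t+1} = R(s_t), and γ < 1 so that truncating episodes at a long horizon introduces negligible bias; none of these objects come from the slides.

```python
import numpy as np

def mc_value_estimate(P, R, gamma, s, episodes=10_000, horizon=200, seed=0):
    """Monte Carlo estimate of V(s) = E[ sum_k gamma^k r_{t+k+1} | s_t = s ]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        state, G, discount = s, 0.0, 1.0
        for _ in range(horizon):
            G += discount * R[state]              # reward r_{t+1} = R(s_t)
            discount *= gamma
            state = rng.choice(len(P), p=P[state])
        total += G
    return total / episodes
```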
Value Function Computation : Example
Consider the following MRP. Assume γ = 1
▶ V(s1) = 6.8
▶ V(s2) = 1 + γ · 6 = 7
▶ V(s3) = 3 + γ · 6 = 9
▶ V(s4) = 6
Example : Snakes and Ladders
Question : How can we evaluate the value of each state in a large MRP such as 'Snakes
and Ladders'?
Decomposition of Value Function
Let s and s′ be the states at time steps t and t + 1; the value function can be
decomposed into the sum of two parts

▶ Immediate reward r_{t+1}
▶ Discounted value of the next state s′ (i.e. γV(s′))

V(s) = E(G_t | s_t = s) = E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )
     = E( r_{t+1} + γV(s_{t+1}) | s_t = s )
Decomposition of Value Function
Recall that

G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + · · · = Σ_{k=0}^∞ γ^k r_{t+k+1}

Then

V(s) = E(G_t | s_t = s) = E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )
     = E( r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + · · · | s_t = s )
     = E(r_{t+1} | s_t = s) + Σ_{k=1}^∞ γ^k E(r_{t+k+1} | s_t = s)
     = E(r_{t+1} | s_t = s) + γ Σ_{s′∈S} P(s′ | s) Σ_{k=0}^∞ γ^k E(r_{t+k+2} | s_t = s, s_{t+1} = s′)
     = E(r_{t+1} | s_t = s) + γ Σ_{s′∈S} P(s′ | s) Σ_{k=0}^∞ γ^k E(r_{t+k+2} | s_{t+1} = s′)    (Markov property)
     = E(r_{t+1} + γV(s_{t+1}) | s_t = s)

where the inner sum in the second-to-last line is exactly V(s′): the reward index shifts by one
because the return is now measured from time t + 1
Value Function : Evaluation
We have

V(s) = E(r_{t+1} + γV(s_{t+1}) | s_t = s)

For a state s with successor states s′_a, s′_b, s′_c and s′_d, this becomes

V(s) = R(s) + γ [ P_{ss′_a} V(s′_a) + P_{ss′_b} V(s′_b) + P_{ss′_c} V(s′_c) + P_{ss′_d} V(s′_d) ]
Value Function Computation : Example
Consider the following MRP. Assume γ = 1
▶ V(s4) = 6
▶ V(s3) = 3 + γ · 6 = 9
▶ V(s2) = 1 + γ · 6 = 7
▶ V(s1) = −1 + γ · (0.6 · 7 + 0.4 · 9) = 6.8
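The MRP diagram for this example is not reproduced in this transcript, but the arithmetic implies its structure: s1 has reward −1 and moves to s2 with probability 0.6 and to s3 with probability 0.4; s2 (reward 1) and s3 (reward 3) both move to s4, which has value 6. Taking that reconstruction as an assumption, the back-substitution checks out:

```python
gamma = 1.0
V4 = 6.0                                    # given: V(s4) = 6
V3 = 3 + gamma * V4                         # 9.0
V2 = 1 + gamma * V4                         # 7.0
V1 = -1 + gamma * (0.6 * V2 + 0.4 * V3)     # 6.8
print(V1, V2, V3, V4)
```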
Bellman Equation for Markov Reward Process
V(s) = E(r_{t+1} + γV(s_{t+1}) | s_t = s)

For any successor state s′ ∈ S of s with transition probability P_{ss′}, we can rewrite the
above equation (using the definition of expectation) as

V(s) = E(r_{t+1} | s_t = s) + γ Σ_{s′∈S} P_{ss′} V(s′)

This is the Bellman equation for value functions
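Although the slides proceed to a closed-form solution, the same equation also yields an iterative evaluation scheme: treat it as the fixed-point update V ← R + γPV, writing R as a vector and P as the matrix introduced on the next slide. A minimal sketch, assuming γ < 1 so the update is a contraction:

```python
import numpy as np

def iterative_evaluation(R, P, gamma, tol=1e-10):
    """Fixed-point iteration V <- R + gamma * P @ V; converges for gamma < 1."""
    V = np.zeros_like(R, dtype=float)
    while True:
        V_new = R + gamma * (P @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```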
Snakes and Ladders
Question : How can we evaluate the value of (all) states using the value function
decomposition?

V(s) = E(r_{t+1} | s_t = s) + γ Σ_{s′∈S} P_{ss′} V(s′)
Bellman Equation in Matrix Form
Let S = {1, 2, · · · , n} and let P be known. Then one can write the Bellman equation as

V = R + γPV

where

[ V(1) ]   [ R(1) ]       [ P_11  P_12  ···  P_1n ] [ V(1) ]
[ V(2) ]   [ R(2) ]       [ P_21  P_22  ···  P_2n ] [ V(2) ]
[   ⋮  ] = [   ⋮  ] + γ · [   ⋮           ⋱    ⋮  ] [   ⋮  ]
[ V(n) ]   [ R(n) ]       [ P_n1  P_n2  ···  P_nn ] [ V(n) ]

Solving for V, we get

V = (I − γP)^{−1} R

The discount factor should be γ < 1 for the inverse to exist
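In code, the closed-form solution is a single linear solve; solving (I − γP)V = R directly is preferable to forming the inverse explicitly. A minimal sketch, reusing the hypothetical 3-state chain from the earlier sketch together with an assumed reward vector:

```python
import numpy as np

def solve_mrp(R, P, gamma):
    """Closed-form MRP evaluation: solve (I - gamma P) V = R (requires gamma < 1)."""
    return np.linalg.solve(np.eye(len(R)) - gamma * P, R)

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])
R = np.array([1.0, 0.0, 2.0])   # assumed reward vector, for illustration only
print(solve_mrp(R, P, gamma=0.9))
```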
Example : Snakes and Ladders
▶ We can now compute the value of states in such a 'large' MRP using the matrix form of
the Bellman equation
▶ Since each play incurs reward −1, the negative of the value of a state gives the expected
number of plays needed to reach the goal state s100 from that state
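To make this concrete, here is a sketch that builds the transition matrix of a simplified snakes-and-ladders board and solves the Bellman system. The die mechanics, overshoot rule, and jump table are assumptions for illustration; the lecture's actual board is not reproduced here. With R(s) = −1 and γ = 1, the goal square is absorbing, so we solve (I − Q)V = R over the non-goal squares (γ = 1 is fine here because Q is substochastic), and −V(s) is the expected number of plays from s.

```python
import numpy as np

# Hypothetical jump table: ladder bottoms -> tops, snake heads -> tails.
# Assumes no chained jumps (a jump never lands on another jump square).
JUMPS = {3: 22, 27: 9, 50: 91, 95: 75}

def board_values(jumps, n=100):
    """V(s) for squares 1..n-1 of a snakes-and-ladders MRP with R(s) = -1, gamma = 1."""
    P = np.zeros((n + 1, n + 1))                  # squares 1..n; index 0 unused
    for s in range(1, n):
        for d in range(1, 7):                     # fair six-sided die
            nxt = s + d
            nxt = jumps.get(nxt, nxt) if nxt <= n else s   # overshoot: stay put
            P[s, nxt] += 1.0 / 6.0
    # Goal square n is absorbing with V(n) = 0; jump squares are never occupied.
    transient = [s for s in range(1, n) if s not in jumps]
    Q = P[np.ix_(transient, transient)]
    V = np.linalg.solve(np.eye(len(transient)) - Q, -np.ones(len(transient)))
    return dict(zip(transient, V))

values = board_values(JUMPS)
print(-values[1])   # expected number of plays from square 1
```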
Few Remarks on Discounting
V(s) = E(G_t | s_t = s) = E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )

▶ Mathematically convenient to discount rewards
▶ Avoids infinite returns in cyclic and infinite horizon settings
▶ The discount rate determines the present value of future rewards
▶ Offers a trade-off between 'myopic' and 'far-sighted' evaluation of rewards
▶ In certain classes of MDPs, it is sometimes possible to use undiscounted rewards (i.e.
γ = 1), for example, if all sequences terminate
Markov Decision Process
A Markov decision process is a tuple < S, A, P, R, γ > where

▶ S : (Finite) set of states
▶ A : (Finite) set of actions
▶ P : State transition probability

P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a),  a_t ∈ A

▶ R : Reward for taking action a_t in state s_t and transitioning to state s_{t+1}, given by
the deterministic function R

r_{t+1} = R(s_t, a_t, s_{t+1})

▶ γ : Discount factor such that γ ∈ [0, 1]
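To make the definition concrete in code (all state names, actions and numbers below are assumptions for illustration, not from the slides), an MDP can be stored as a table mapping each (state, action) pair to a distribution over (next state, reward) outcomes:

```python
import random

# P[(s, a)] is a list of (probability, next_state, reward) triples;
# rewards follow r_{t+1} = R(s_t, a_t, s_{t+1}).
P = {
    ("s1", "left"):  [(1.0, "s2", -1.0)],
    ("s1", "right"): [(0.8, "s3", -1.0), (0.2, "s1", -1.0)],
    ("s2", "left"):  [(1.0, "s1", -1.0)],
    ("s2", "right"): [(1.0, "s3", 10.0)],
    ("s3", "left"):  [(1.0, "s3", 0.0)],   # s3 is absorbing
    ("s3", "right"): [(1.0, "s3", 0.0)],
}
gamma = 0.9

def step(state, action, rng=random):
    """Sample (next_state, reward) from the MDP dynamics."""
    probs, outcomes = zip(*[(p, (s2, r)) for p, s2, r in P[(state, action)]])
    return rng.choices(outcomes, weights=probs)[0]

print(step("s1", "right"))
```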
Wealth Management Problem
▶ States S : Current value of the portfolio and current valuation of the instruments in the
portfolio
▶ Actions A : Buy / sell instruments of the portfolio
▶ Reward R : Return on the portfolio compared to the previous decision epoch
Navigation Problem
▶ States S : Squares of the grid
▶ Actions A : Any of the four possible directions
▶ Reward R : −1 for every move made until reaching the goal state
Example : Atari Games
▶ States S : Set of all possible (Atari) images
▶ Actions A : Move the paddle up or down
▶ Reward R : +1 for making the opponent miss the ball; −1 if the agent misses the ball; 0
otherwise
Flow Diagram
▶ The goal is to choose a sequence of actions such that the expected total discounted
future reward E(G_t | s_t = s) is maximized, where

G_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
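The interaction loop behind this objective can be sketched by reusing the hypothetical step function from the MDP sketch above: repeatedly pick an action (here, uniformly at random), step the environment, and accumulate the discounted return G_t:

```python
import random

def rollout(s0, horizon=50, gamma=0.9, rng=random):
    """One episode under a uniformly random policy; returns the discounted return G_t."""
    state, G, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        action = rng.choice(["left", "right"])
        state, reward = step(state, action, rng)
        G += discount * reward
        discount *= gamma
    return G

print(rollout("s1"))
```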
Windy Grid World : Stochastic Environment
Recall that given an MDP < S, A, P, R, γ >, the state transition probability P is defined as

P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a),  a_t ∈ A

▶ In general, note that even after choosing action a in state s (as prescribed by the
policy), the next state s′ need not be a fixed state
Finite and Infinite Horizon MDPs
▶ If T is fixed and finite, the resulting MDP is a finite horizon MDP
  ★ Wealth management problem
▶ If T is infinite, the resulting MDP is an infinite horizon MDP
  ★ Certain Atari games
▶ When |S| is finite, the MDP is called a finite state MDP
Grid World Example
Question : Is Grid World a finite or infinite horizon problem? Why?
(It is a stochastic shortest path MDP)

▶ For finite horizon MDPs and stochastic shortest path MDPs, one can use γ = 1
