Value Functions and Markov Decision Process
Easwar Subramanian
TCS Innovation Labs, Hyderabad
Email : easwar.subramanian@tcs.com / cs5500.2020@iith.ac.in
August 12, 2022
Overview
1 Review
2 Value Function
3 Markov Decision Process
Review
Markov Property
A state s_t of a stochastic process {s_t}_{t∈T} is said to have the Markov property if

P(s_{t+1} | s_t) = P(s_{t+1} | s_1, · · · , s_t)

The state s_t at time t captures all relevant information from the history and is a sufficient
statistic of the future
State Transition Matrix
State Transition Probability
For a Markov state s and a successor state s′, the state transition probability is defined by

P_{ss′} = P(s_{t+1} = s′ | s_t = s)

The state transition matrix P then denotes the transition probabilities from all states s to all
successor states s′ (with each row summing to 1)

    [ P_11  P_12  ···  P_1n ]
P = [   ⋮           ⋱    ⋮  ]
    [ P_n1  P_n2  ···  P_nn ]
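As an aside (not part of the original slides), these definitions are easy to exercise in code. The sketch below builds a transition matrix for a hypothetical 3-state Markov chain, checks that every row sums to 1, and samples a trajectory in which the next state depends only on the current one:

```python
import numpy as np

# Hypothetical 3-state Markov chain (illustrative values only)
P = np.array([
    [0.5, 0.3, 0.2],   # P(s' | s = 0)
    [0.1, 0.6, 0.3],   # P(s' | s = 1)
    [0.0, 0.4, 0.6],   # P(s' | s = 2)
])

# Each row of a valid transition matrix sums to 1
assert np.allclose(P.sum(axis=1), 1.0)

def sample_chain(P, s0, steps, rng):
    """Sample a trajectory; the next state depends only on the current one (Markov property)."""
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P, s0=0, steps=10, rng=np.random.default_rng(0)))
```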
Markov Chain
A stochastic process {s_t}_{t∈T} is a Markov process or Markov chain if it satisfies the
Markov property for every state s_t. It is represented by a tuple < S, P >, where S denotes
the set of states and P denotes the state transition probability

No notion of reward or action
Markov Reward Process

A Markov reward process is a tuple < S, P, R, γ >; it is a Markov chain with values

▶ S : (Finite) set of states
▶ P : State transition probability
▶ R : Reward for being in state s_t, given by a deterministic function R

r_{t+1} = R(s_t)

▶ γ : Discount factor such that γ ∈ [0, 1]
▶ In general, the reward function can also be an expectation: R(s_t = s) = E[r_{t+1} | s_t = s]

No notion of action
Value Function
Snakes and Ladders : Revisited
▶ Reward R : R(s) = −1 for s ∈ {s1, · · · , s99}, and R(s100) = 0
▶ Discount factor γ = 1
Snakes and Ladders : Revisited
Question : Are all intermediate states equally 'valuable' just because they have equal reward?
Value Function
The value function V(s) gives the long-term value of state s ∈ S

V(s) = E(G_t | s_t = s) = E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )

▶ Value function V(s) determines the value of being in state s
▶ V(s) measures the potential future rewards we may get from being in state s
▶ V(s) is independent of t
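One way to make this definition concrete is to estimate V(s) by averaging sampled returns. The sketch below assumes a generic MRP given by a transition matrix P and a reward vector R with r_{t+1} = R(s_t), and γ < 1 so that truncating episodes at a long horizon introduces negligible bias; none of these objects come from the slides.

```python
import numpy as np

def mc_value_estimate(P, R, gamma, s, episodes=10_000, horizon=200, seed=0):
    """Monte Carlo estimate of V(s) = E[ sum_k gamma^k r_{t+k+1} | s_t = s ]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        state, G, discount = s, 0.0, 1.0
        for _ in range(horizon):
            G += discount * R[state]              # reward r_{t+1} = R(s_t)
            discount *= gamma
            state = rng.choice(len(P), p=P[state])
        total += G
    return total / episodes
```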
Value Function Computation : Example
Consider the following MRP. Assume γ = 1
▶ V(s1) = 6.8
▶ V(s2) = 1 + γ · 6 = 7
▶ V(s3) = 3 + γ · 6 = 9
▶ V(s4) = 6
Example : Snakes and Ladders
Question : How can we evaluate the value of each state in a large MRP such as 'Snakes
and Ladders'?
Decomposition of Value Function
Let s and s′ be the states at time steps t and t + 1; the value function can be
decomposed into the sum of two parts

▶ Immediate reward r_{t+1}
▶ Discounted value of the next state s′ (i.e. γV(s′))

V(s) = E(G_t | s_t = s) = E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )
     = E( r_{t+1} + γV(s_{t+1}) | s_t = s )
Decomposition of Value Function
Recall that

G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + · · · = Σ_{k=0}^∞ γ^k r_{t+k+1}

Then

V(s) = E(G_t | s_t = s) = E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )
     = E( r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + · · · | s_t = s )
     = E(r_{t+1} | s_t = s) + Σ_{k=1}^∞ γ^k E(r_{t+k+1} | s_t = s)
     = E(r_{t+1} | s_t = s) + γ Σ_{s′∈S} P(s′ | s) Σ_{k=0}^∞ γ^k E(r_{t+k+2} | s_t = s, s_{t+1} = s′)
     = E(r_{t+1} | s_t = s) + γ Σ_{s′∈S} P(s′ | s) Σ_{k=0}^∞ γ^k E(r_{t+k+2} | s_{t+1} = s′)    (Markov property)
     = E(r_{t+1} + γV(s_{t+1}) | s_t = s)

where the inner sum in the second-to-last line is exactly V(s′): the reward index shifts by one
because the return is now measured from time t + 1
Value Function : Evaluation
We have

V(s) = E(r_{t+1} + γV(s_{t+1}) | s_t = s)

For a state s with successor states s′_a, s′_b, s′_c and s′_d, this becomes

V(s) = R(s) + γ [ P_{ss′_a} V(s′_a) + P_{ss′_b} V(s′_b) + P_{ss′_c} V(s′_c) + P_{ss′_d} V(s′_d) ]
Value Function Computation : Example
Consider the following MRP. Assume γ = 1
▶ V(s4) = 6
▶ V(s3) = 3 + γ · 6 = 9
▶ V(s2) = 1 + γ · 6 = 7
▶ V(s1) = −1 + γ · (0.6 · 7 + 0.4 · 9) = 6.8
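The MRP diagram for this example is not reproduced in this transcript, but the arithmetic implies its structure: s1 has reward −1 and moves to s2 with probability 0.6 and to s3 with probability 0.4; s2 (reward 1) and s3 (reward 3) both move to s4, which has value 6. Taking that reconstruction as an assumption, the back-substitution checks out:

```python
gamma = 1.0
V4 = 6.0                                    # given: V(s4) = 6
V3 = 3 + gamma * V4                         # 9.0
V2 = 1 + gamma * V4                         # 7.0
V1 = -1 + gamma * (0.6 * V2 + 0.4 * V3)     # 6.8
print(V1, V2, V3, V4)
```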
Bellman Equation for Markov Reward Process
V(s) = E(r_{t+1} + γV(s_{t+1}) | s_t = s)

For any successor state s′ ∈ S of s with transition probability P_{ss′}, we can rewrite the
above equation (using the definition of expectation) as

V(s) = E(r_{t+1} | s_t = s) + γ Σ_{s′∈S} P_{ss′} V(s′)

This is the Bellman equation for value functions
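Although the slides proceed to a closed-form solution, the same equation also yields an iterative evaluation scheme: treat it as the fixed-point update V ← R + γPV, writing R as a vector and P as the matrix introduced on the next slide. A minimal sketch, assuming γ < 1 so the update is a contraction:

```python
import numpy as np

def iterative_evaluation(R, P, gamma, tol=1e-10):
    """Fixed-point iteration V <- R + gamma * P @ V; converges for gamma < 1."""
    V = np.zeros_like(R, dtype=float)
    while True:
        V_new = R + gamma * (P @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```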
Snakes and Ladders
Question : How can we evaluate the value of (all) states using the value function
decomposition?

V(s) = E(r_{t+1} | s_t = s) + γ Σ_{s′∈S} P_{ss′} V(s′)
Bellman Equation in Matrix Form
Let S = {1, 2, · · · , n} and let P be known. Then one can write the Bellman equation as

V = R + γPV

where

[ V(1) ]   [ R(1) ]       [ P_11  P_12  ···  P_1n ] [ V(1) ]
[ V(2) ]   [ R(2) ]       [ P_21  P_22  ···  P_2n ] [ V(2) ]
[   ⋮  ] = [   ⋮  ] + γ · [   ⋮           ⋱    ⋮  ] [   ⋮  ]
[ V(n) ]   [ R(n) ]       [ P_n1  P_n2  ···  P_nn ] [ V(n) ]

Solving for V, we get

V = (I − γP)^{−1} R

The discount factor should be γ < 1 for the inverse to exist
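In code, the closed-form solution is a single linear solve; solving (I − γP)V = R directly is preferable to forming the inverse explicitly. A minimal sketch, reusing the hypothetical 3-state chain from the earlier sketch together with an assumed reward vector:

```python
import numpy as np

def solve_mrp(R, P, gamma):
    """Closed-form MRP evaluation: solve (I - gamma P) V = R (requires gamma < 1)."""
    return np.linalg.solve(np.eye(len(R)) - gamma * P, R)

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])
R = np.array([1.0, 0.0, 2.0])   # assumed reward vector, for illustration only
print(solve_mrp(R, P, gamma=0.9))
```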
Example : Snakes and Ladders
▶ We can now compute the value of states in such a 'large' MRP using the matrix form of
the Bellman equation
▶ Since each play incurs reward −1, the negative of the value of a state gives the expected
number of plays needed to reach the goal state s100 from that state
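To make this concrete, here is a sketch that builds the transition matrix of a simplified snakes-and-ladders board and solves the Bellman system. The die mechanics, overshoot rule, and jump table are assumptions for illustration; the lecture's actual board is not reproduced here. With R(s) = −1 and γ = 1, the goal square is absorbing, so we solve (I − Q)V = R over the non-goal squares (γ = 1 is fine here because Q is substochastic), and −V(s) is the expected number of plays from s.

```python
import numpy as np

# Hypothetical jump table: ladder bottoms -> tops, snake heads -> tails.
# Assumes no chained jumps (a jump never lands on another jump square).
JUMPS = {3: 22, 27: 9, 50: 91, 95: 75}

def board_values(jumps, n=100):
    """V(s) for squares 1..n-1 of a snakes-and-ladders MRP with R(s) = -1, gamma = 1."""
    P = np.zeros((n + 1, n + 1))                  # squares 1..n; index 0 unused
    for s in range(1, n):
        for d in range(1, 7):                     # fair six-sided die
            nxt = s + d
            nxt = jumps.get(nxt, nxt) if nxt <= n else s   # overshoot: stay put
            P[s, nxt] += 1.0 / 6.0
    # Goal square n is absorbing with V(n) = 0; jump squares are never occupied.
    transient = [s for s in range(1, n) if s not in jumps]
    Q = P[np.ix_(transient, transient)]
    V = np.linalg.solve(np.eye(len(transient)) - Q, -np.ones(len(transient)))
    return dict(zip(transient, V))

values = board_values(JUMPS)
print(-values[1])   # expected number of plays from square 1
```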
Few Remarks on Discounting
V(s) = E(G_t | s_t = s) = E( Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s )

▶ Mathematically convenient to discount rewards
▶ Avoids infinite returns in cyclic and infinite horizon settings
▶ The discount rate determines the present value of future rewards
▶ Offers a trade-off between 'myopic' and 'far-sighted' evaluation of rewards
▶ In certain classes of MDPs, it is sometimes possible to use undiscounted rewards (i.e.
γ = 1), for example, if all sequences terminate
Markov Decision Process
A Markov decision process is a tuple < S, A, P, R, γ > where

▶ S : (Finite) set of states
▶ A : (Finite) set of actions
▶ P : State transition probability

P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a),  a_t ∈ A

▶ R : Reward for taking action a_t in state s_t and transitioning to state s_{t+1}, given by
the deterministic function R

r_{t+1} = R(s_t, a_t, s_{t+1})

▶ γ : Discount factor such that γ ∈ [0, 1]
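To make the definition concrete in code (all state names, actions and numbers below are assumptions for illustration, not from the slides), an MDP can be stored as a table mapping each (state, action) pair to a distribution over (next state, reward) outcomes:

```python
import random

# P[(s, a)] is a list of (probability, next_state, reward) triples;
# rewards follow r_{t+1} = R(s_t, a_t, s_{t+1}).
P = {
    ("s1", "left"):  [(1.0, "s2", -1.0)],
    ("s1", "right"): [(0.8, "s3", -1.0), (0.2, "s1", -1.0)],
    ("s2", "left"):  [(1.0, "s1", -1.0)],
    ("s2", "right"): [(1.0, "s3", 10.0)],
    ("s3", "left"):  [(1.0, "s3", 0.0)],   # s3 is absorbing
    ("s3", "right"): [(1.0, "s3", 0.0)],
}
gamma = 0.9

def step(state, action, rng=random):
    """Sample (next_state, reward) from the MDP dynamics."""
    probs, outcomes = zip(*[(p, (s2, r)) for p, s2, r in P[(state, action)]])
    return rng.choices(outcomes, weights=probs)[0]

print(step("s1", "right"))
```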
Wealth Management Problem
▶ States S : Current value of the portfolio and current valuation of the instruments in the
portfolio
▶ Actions A : Buy / sell instruments of the portfolio
▶ Reward R : Return on the portfolio compared to the previous decision epoch
Navigation Problem
▶ States S : Squares of the grid
▶ Actions A : Any of the four possible directions
▶ Reward R : −1 for every move made until reaching the goal state
Example : Atari Games
▶ States S : Set of all possible (Atari) images
▶ Actions A : Move the paddle up or down
▶ Reward R : +1 for making the opponent miss the ball; −1 if the agent misses the ball; 0
otherwise
Flow Diagram
▶ The goal is to choose a sequence of actions such that the expected total discounted
future reward E(G_t | s_t = s) is maximized, where

G_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
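The interaction loop behind this objective can be sketched by reusing the hypothetical step function from the MDP sketch above: repeatedly pick an action (here, uniformly at random), step the environment, and accumulate the discounted return G_t:

```python
import random

def rollout(s0, horizon=50, gamma=0.9, rng=random):
    """One episode under a uniformly random policy; returns the discounted return G_t."""
    state, G, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        action = rng.choice(["left", "right"])
        state, reward = step(state, action, rng)
        G += discount * reward
        discount *= gamma
    return G

print(rollout("s1"))
```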
Windy Grid World : Stochastic Environment
Recall that given an MDP < S, A, P, R, γ >, the state transition probability P is defined as

P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a),  a_t ∈ A

▶ In general, note that even after choosing action a in state s (as prescribed by the
policy), the next state s′ need not be a fixed state
Finite and Infinite Horizon MDPs
▶ If T is fixed and finite, the resulting MDP is a finite horizon MDP
  ★ Wealth management problem
▶ If T is infinite, the resulting MDP is an infinite horizon MDP
  ★ Certain Atari games
▶ When |S| is finite, the MDP is called a finite state MDP
Grid World Example
Question : Is Grid World a finite or infinite horizon problem? Why?
(It is a stochastic shortest path MDP)

▶ For finite horizon MDPs and stochastic shortest path MDPs, one can use γ = 1
