Value Iteration Algorithm
Example
Dr. Surya Prakash
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Indore, Indore-453552, INDIA
E-mail: surya@iiti.ac.in
Quick Recap
 Policy iteration algorithm
–Iterative policy evaluation
–Policy improvement
 Value iteration algorithm
–Iterative policy evaluation + Policy improvement
Policy Iteration Algorithm
 In policy iteration:
– we iteratively alternate policy evaluation and policy improvement.
 policy evaluation:
– we keep the policy constant and update the utility (value) based on that policy
 policy improvement:
– we keep the utility (value) constant and update the policy based on that utility
Policy Iteration Algorithm
 The utility of a state is the sum of its immediate reward and the discounted utility of its successor state
 Here, every utility is defined w.r.t. a certain policy
– For instance,
• policy π₁ has its associated utility v₁
• policy π₂ has its associated utility v₂
• …
• and policy πᵢ has its associated utility vᵢ
V(s) = R_{ss'} + γV(s')
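The equation above is written as if the transition were deterministic. For the stochastic world used in the example below, the utility under a policy π is an expectation over next states; the standard Bellman expectation form (the notation P(s' | s, π(s)) is introduced here for clarity, not taken from the slides) is

V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R_{ss'} + \gamma V^{\pi}(s') \right]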
Value Iteration Algorithm
 The policy iteration algorithm has two parts
– Policy evaluation
– Policy improvement
 Value iteration clubs these two parts together
 Value iteration uses a simple backup operation that combines
– the policy improvement step, and
– a truncated policy evaluation step
Value Iteration Algorithm
//V(terminal)=0 necessarily
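The algorithm on this slide is shown as an image that is not reproduced in this transcript. The backup it applies to every non-terminal state on each sweep, i.e. the standard value-iteration update from Sutton & Barto (Chapter 4), written here with an explicit transition model P and reward R(s, a, s'), is

V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right], \qquad V(\text{terminal}) = 0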
Value Iteration
 Policy evaluation – how to get V(s)?
– For a fixed policy, these are linear equations that can be solved directly
– Alternatively, iterative policy evaluation can be used to get the V(s) values for a given policy
 Value iteration – how to get V(s)?
– The equations are no longer linear here, so we cannot solve them directly
– as a result, we have to use an iterative procedure to solve them
– the non-linearity is due to the max operation
Example – Value Iteration
 Grid world
– Actions: UP, DOWN, LEFT, RIGHT
Example – Value Iteration
 As in policy iteration, we start by initializing the utility of every state to zero, and we set γ to 0.5
– v(s) = 0 for all s
– γ = 0.5
Example – Value Iteration
 What we need to do is loop through the states using the Bellman equation.
 Taking r(s) as the reward function, the value of a state s can be given as:
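The formula itself appears only as an image on the slide; with the action-independent reward r(s) described in the notes below, it presumably takes the form

V(s) \leftarrow r(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a) \, V(s')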
 r(s) is the reward value of a state (the reward obtained on moving into state s)
 This is a different notion of reward (the value is independent of the action)
 Reaching a state from anywhere, with any action, yields the same reward
Example – Value Iteration
 Stochastic world:
– the world is non-deterministic
– From a certain state, if we choose the same action, we are not guaranteed to move into the same next state.
– for example, the robot has some probability of malfunctioning.
– For instance,
• If it decides to go left, it will most likely actually go left.
• However, there is a small possibility, no matter how tiny it may be, that it goes wild and moves in a direction other than left.
Example – Value Iteration
 Stochastic world:
– the probability of actually moving in the intended direction is 0.8
– there is a 0.1 probability of moving 90 degrees to the left of the intended direction, and
– another 0.1 probability of moving 90 degrees to the right of the intended direction
 Reward:
– In our grid world, a normal state has a reward of -0.04
– the good green ending state has a reward of +1, and
– the bad red ending state has a reward of -1
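For concreteness, here is a minimal Python sketch of this transition model and reward function. The 3×4 grid layout, the row-major state numbering 0–11, and the wall/terminal positions (state 5 as the wall, state 3 as the +1 terminal, state 7 as the -1 terminal) are illustrative assumptions, not read off the slides:

# Minimal sketch of the stochastic grid-world dynamics described above.
# Assumptions (not from the slides): 3x4 grid, states numbered 0..11 row by row,
# state 5 is a wall, state 3 is the +1 terminal, state 7 is the -1 terminal.
ROWS, COLS = 3, 4
WALLS = {5}
TERMINALS = {3: +1.0, 7: -1.0}
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
# Perpendicular "slip" directions for each intended action.
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def step(state, direction):
    # Deterministic move; hitting the edge or a wall keeps the agent in place.
    r, c = divmod(state, COLS)
    dr, dc = ACTIONS[direction]
    nr, nc = r + dr, c + dc
    nxt = nr * COLS + nc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or nxt in WALLS:
        return state
    return nxt

def transition_probs(state, action):
    # P(s' | s, a): 0.8 for the intended direction, 0.1 for each perpendicular slip.
    probs = {}
    for direction, p in [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

def reward(state):
    # Action-independent reward r(s): +1 / -1 at the terminals, -0.04 otherwise.
    return TERMINALS.get(state, -0.04)

print(transition_probs(8, "UP"))   # from the bottom-left corner: {4: 0.8, 8: 0.1, 9: 0.1}

Any other layout only changes the constants at the top; the 0.8/0.1/0.1 slip structure is what matters here.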
Example – Value Iteration
 Let’s start from state s = 0:
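As a concrete check (assuming state 0 is a normal, non-terminal state), with every utility still zero the first update gives

V(0) = r(0) + \gamma \max_{a} \sum_{s'} P(s' \mid 0, a) \, V(s') = -0.04 + 0.5 \times 0 = -0.04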
Example – Value Iteration
 We are using an in-place procedure
– this means that from now on, whenever we see v(0), it is -0.04 instead of 0
 Next, for s = 1, we have
Example – Value Iteration
 This is repeated for states 2, 3, …, 11
 And, we get these utility values for all the states
Example – Value Iteration
 Now it is time to iterate again
 The utility values need to be computed from s = 0 to s = 11 again
Example – Value Iteration
 And, iterate again:
Example – Value Iteration
 Repeat the iterations until the change in utility between two consecutive iterations is marginal
 After 11 iterations:
– the change in the utility value of any state is smaller than 0.001
 We stop here, and the utility we get is the utility associated with the optimal policy
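Written out, the stopping test used here is

\Delta = \max_{s} \left| V_{k+1}(s) - V_k(s) \right| < \theta, \qquad \theta = 0.001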
Example – Value Iteration
 Compared with policy iteration, the reason value iteration works is that it incorporates the max operation during the value updates.
 Since we choose the maximum utility in each iteration, this
– implicitly performs an argmax operation that excludes the suboptimal actions, and
– converges to the optimal actions
Getting the Optimal Policy
 Using value iteration, we have determined the utility of the optimal policy
 Now, how do we get the optimal policy itself?
– Similar to what is done in policy iteration, we can get the optimal policy by applying the following equation for each state.
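The equation itself is shown as an image on the slide; with the action-independent reward used in this example, the greedy extraction presumably takes the form

\pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \, V(s')

(the immediate reward r(s) does not depend on the action, so it drops out of the argmax).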
Getting the Optimal Policy
 Comparison of the utilities from the policy and value iteration algorithms:
– If we compare the utilities obtained using value iteration with those obtained using policy iteration, we find that the utility values are very close
 The obtained utilities are the solutions of the Bellman equations
 Policy iteration and value iteration are just two alternative methods for solving the Bellman equations
V(s) = R_{ss'} + γV(s')
Getting the Optimal Policy
 For the same MDP with the same Bellman equations, regardless of the method, we would expect to get the same results, right?
– Theoretically, yes
 In practice, slightly different results are obtained
– This is because of differences such as the stopping criteria in the policy iteration and value iteration algorithms
Getting the Optimal Policy
 Slightly different utility values usually do not affect the choice of policy
– Since the policy is determined by the relative rankings of the utility values, not by their absolute values, slightly different utility values usually do not affect the choice of policy
 When determining the optimal policy, if there is a tie between actions, we randomly choose one of them as the optimal action.
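A small Python sketch of this tie-breaking rule; the function name and the sample action values are illustrative only:

import random

def greedy_action(q_values, tol=1e-9):
    """Return an action with maximal value; break ties uniformly at random."""
    best = max(q_values.values())
    tied = [a for a, q in q_values.items() if abs(q - best) <= tol]
    return random.choice(tied)

# Two actions tie, so one of them is picked at random as the optimal action.
print(greedy_action({"UP": 0.42, "LEFT": 0.42, "RIGHT": 0.17, "DOWN": 0.05}))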
Identical Outcomes: Policy and Value Iteration
 We see here that the use of policy iteration and value iteration results in identical policies
Effects of Discount Factor
 Changing the discount factor does not change the fact that these two methods are still solving the same Bellman equations.
 As with γ = 0.5, when γ is 0.1 or 0.9,
– the utilities from policy iteration and value iteration are slightly different, while the policies are identical
Effects of Discount Factor
 Larger γ requires more iterations
– As with the number of sweeps of policy evaluation during policy iteration, in value iteration a larger γ requires more iterations.
– For our example,
• it takes 4 iterations when γ is 0.1 for the change in utility values (∆) to fall below 0.001
• it requires 11 iterations when γ is 0.5
• it requires 67 iterations when γ is 0.9
 As in policy iteration,
– a larger γ tends to generate better results but demands the price of more computation
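This behaviour follows from the fact that the value-iteration backup is a γ-contraction in the max norm: each sweep shrinks the distance to the optimal utility by at least a factor of γ,

\lVert V_k - V^{*} \rVert_{\infty} \le \gamma^{k} \, \lVert V_0 - V^{*} \rVert_{\infty}

so reaching a fixed threshold takes far more sweeps as γ approaches 1.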
Pseudo-code of Value Iteration
 Here
– the threshold θ is used as the stop criterion (as in policy iteration)
– initialization of a policy is not required (unlike policy iteration)
 We do not need a policy during value iteration
– we do not need to consider a policy until the very end
– after the utility has converged, we derive a policy, which is the optimal policy
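A compact Python sketch of the procedure described above: a threshold θ as the stop criterion, no policy initialization, and a greedy policy derived only after the utilities have converged. The two-state MDP at the top is a toy placeholder; plugging in the grid-world model sketched earlier gives the example used in these slides.

# Value iteration: sweep all states with the max-backup, stop when the largest
# change Delta falls below theta, then extract the greedy policy at the end.
GAMMA, THETA = 0.5, 0.001
STATES = ["A", "B", "T"]                       # "T" is a terminal state
ACTIONS = ["stay", "go"]
P = {                                          # P[s][a] = {s': probability}
    "A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"stay": {"B": 1.0}, "go": {"T": 0.8, "B": 0.2}},
}
R = {"A": -0.04, "B": -0.04, "T": 1.0}         # action-independent reward r(s)

def value_iteration(gamma=GAMMA, theta=THETA):
    V = {s: 0.0 for s in STATES}               # v(s) = 0 for all s; no policy needed
    sweeps = 0
    while True:
        delta, sweeps = 0.0, sweeps + 1
        for s in STATES:                       # in-place sweep over all states
            v_old = V[s]
            if s in P:                         # non-terminal: max-backup
                V[s] = R[s] + gamma * max(
                    sum(p * V[s2] for s2, p in P[s][a].items()) for a in ACTIONS)
            else:                              # terminal: utility is just its reward
                V[s] = R[s]
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:                      # stop criterion, as described above
            break
    # The greedy (optimal) policy is derived only after V has converged.
    policy = {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in P}
    return V, policy, sweeps

V, policy, sweeps = value_iteration()
print(sweeps, V, policy)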
From MDP to Reinforcement Learning
 At first glance,
– MDP seems to be super useful in many aspects of real life
– Not only simple games like Pac-Man but also complex systems like the stock market may be represented as an MDP
• for instance, in the stock market, prices are states and buy/hold/sell are actions
From MDP to Reinforcement Learning
 However, there is a catch:
– we do not know the reward function or the transition model
– if we somehow knew the reward function of the MDP representing the stock market, we could quickly become millionaires
– In most real-life MDPs, we can access neither the reward function nor the transition model
From MDP to Reinforcement Learning
 In real life (contrary to the Pac-Man game), we do not know
– where the diamond is,
– where the poison is,
– where the walls are,
– how big the map is,
– what the probability is that the robot accurately executes the intended action,
– what the robot will do when it does not accurately execute our intended action,
– etc.
From MDP to Reinforcement Learning
 All we know is the following:
– choose an action,
– reach a new state,
– receive -0.04 (pay a penalty of 0.04),
– continue choosing actions,
– reach another state,
– receive -0.04…
From MDP to Reinforcement Learning
 In other words:
– In an MDP, we assume a fully observable environment, while in real life it is not.
 Methods such as policy iteration and value iteration can solve a fully observable MDP
 In contrast, when the reward function and transition model are not known, that is where reinforcement learning fits in
From MDP to Reinforcement Learning
 Since we do not know the reward function and the transition model, we need to learn them
– this is where reinforcement learning helps
 Reinforcement learning approaches
– Monte Carlo Approach
– Temporal Difference Learning
References
 Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press (Chapter 4). http://incompleteideas.net/book/ebook/
 Markov decision process: value iteration with code implementation: https://medium.com/@ngao7/markov-decision-process-value-iteration-2d161d50a6ff
 Markov decision process: policy iteration with code implementation: https://medium.com/@ngao7/markov-decision-process-policy-iteration-42d35ee87c82
Projects
 Tools:
– OpenAI Gym - a toolkit for developing and comparing RL algorithms
– Python + TensorFlow (TF-Agents)
– MuJoCo - Advanced physics simulation
 Problems
– Robot navigation
– Stock trading
– Traffic Light Control
– Point cloud completion
– Self-driving taxis
– Inverted Pendulum (CartPole Game)
– Atari games - Breakout, Montezuma's Revenge, and Space Invaders
Thank You