Value Iteration Algorithm
Example
Dr. Surya Prakash
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Indore, Indore-453552, INDIA
E-mail: surya@iiti.ac.in
Quick Recap
 Policy iteration algorithm
–Iterative policy evaluation
–Policy improvement
 Value iteration algorithm
–Iterative policy evaluation + Policy improvement
Policy Iteration Algorithm
 In policy iteration:
– we iteratively alternate policy evaluation and policy improvement.
 policy evaluation:
– we keep the policy constant and update the utility (value) based on that policy
 policy improvement:
– we keep the utility (value) constant and update the policy based on that utility
Policy Iteration Algorithm
 The utility of a state is the sum of its immediate reward and the discounted utility of its successor state
 Here, every utility is defined w.r.t. a certain policy
– For instance,
• policy π₁ has its associated utility v₁
• policy π₂ has its associated utility v₂
• …
• and policy πᵢ has its associated utility vᵢ
V(s) = R_{ss'} + γV(s')
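The equation above is written as if the transition were deterministic. For the stochastic world used in the example below, the utility under a policy π is an expectation over next states; the standard Bellman expectation form (the notation P(s' | s, π(s)) is introduced here for clarity, not taken from the slides) is

V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R_{ss'} + \gamma V^{\pi}(s') \right]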
Value Iteration Algorithm
 The policy iteration algorithm has two parts
– Policy evaluation
– Policy improvement
 Value iteration clubs these two parts together
 Value iteration uses a simple backup operation that combines
– the policy improvement step, and
– a truncated policy evaluation step
Value Iteration Algorithm
//V(terminal)=0 necessarily
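The algorithm on this slide is shown as an image that is not reproduced in this transcript. The backup it applies to every non-terminal state on each sweep, i.e. the standard value-iteration update from Sutton & Barto (Chapter 4), written here with an explicit transition model P and reward R(s, a, s'), is

V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right], \qquad V(\text{terminal}) = 0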
Value Iteration
 Policy evaluation – how to get V(s)?
– For a fixed policy, these are linear equations that can be solved directly
– Alternatively, iterative policy evaluation can be used to get the V(s) values for a given policy
 Value iteration – how to get V(s)?
– The equations are no longer linear here, so we cannot solve them directly
– as a result, we have to use an iterative procedure to solve them
– the non-linearity is due to the max operation
Example – Value Iteration
 Grid world
– Actions: UP, DOWN, LEFT, RIGHT
Example – Value Iteration
 As in policy iteration, we start by initializing the utility of every state to zero, and we set γ to 0.5
– v(s) = 0 for all s
– γ = 0.5
Example – Value Iteration
 What we need to do is loop through the states using the Bellman equation.
 Taking r(s) as the reward function, the value of a state s can be given as:
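The formula itself appears only as an image on the slide; with the action-independent reward r(s) described in the notes below, it presumably takes the form

V(s) \leftarrow r(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a) \, V(s')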
 r(s) is the reward value of a state (the reward obtained on moving into state s)
 This is a different notion of reward (the value is independent of the action)
 Reaching a state from anywhere, with any action, yields the same reward
Example – Value Iteration
 Stochastic world:
– the world is non-deterministic
– From a certain state, if we choose the same action, we are not guaranteed to move into the same next state.
– for example, the robot has some probability of malfunctioning.
– For instance,
• If it decides to go left, it will most likely actually go left.
• However, there is a small possibility, no matter how tiny it may be, that it goes wild and moves in a direction other than left.
Example – Value Iteration
 Stochastic world:
– the probability of actually moving in the intended direction is 0.8
– there is a 0.1 probability of moving 90 degrees to the left of the intended direction, and
– another 0.1 probability of moving 90 degrees to the right of the intended direction
 Reward:
– In our grid world, a normal state has a reward of -0.04
– the good green ending state has a reward of +1, and
– the bad red ending state has a reward of -1
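For concreteness, here is a minimal Python sketch of this transition model and reward function. The 3×4 grid layout, the row-major state numbering 0–11, and the wall/terminal positions (state 5 as the wall, state 3 as the +1 terminal, state 7 as the -1 terminal) are illustrative assumptions, not read off the slides:

# Minimal sketch of the stochastic grid-world dynamics described above.
# Assumptions (not from the slides): 3x4 grid, states numbered 0..11 row by row,
# state 5 is a wall, state 3 is the +1 terminal, state 7 is the -1 terminal.
ROWS, COLS = 3, 4
WALLS = {5}
TERMINALS = {3: +1.0, 7: -1.0}
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
# Perpendicular "slip" directions for each intended action.
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def step(state, direction):
    # Deterministic move; hitting the edge or a wall keeps the agent in place.
    r, c = divmod(state, COLS)
    dr, dc = ACTIONS[direction]
    nr, nc = r + dr, c + dc
    nxt = nr * COLS + nc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or nxt in WALLS:
        return state
    return nxt

def transition_probs(state, action):
    # P(s' | s, a): 0.8 for the intended direction, 0.1 for each perpendicular slip.
    probs = {}
    for direction, p in [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

def reward(state):
    # Action-independent reward r(s): +1 / -1 at the terminals, -0.04 otherwise.
    return TERMINALS.get(state, -0.04)

print(transition_probs(8, "UP"))   # from the bottom-left corner: {4: 0.8, 8: 0.1, 9: 0.1}

Any other layout only changes the constants at the top; the 0.8/0.1/0.1 slip structure is what matters here.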
Example – Value Iteration
 Let’s start from state s = 0:
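As a concrete check (assuming state 0 is a normal, non-terminal state), with every utility still zero the first update gives

V(0) = r(0) + \gamma \max_{a} \sum_{s'} P(s' \mid 0, a) \, V(s') = -0.04 + 0.5 \times 0 = -0.04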
Example – Value Iteration
 We are using an in-place procedure
– this means that from now on, whenever we see v(0), it is -0.04 instead of 0
 Next, for s = 1, we have
Example – Value Iteration
 This is repeated for states 2, 3, …, 11
 And, we get these utility values for all the states
Example – Value Iteration
 Now it is time to iterate again
 The utility values need to be computed from s = 0 to s = 11 again
Example – Value Iteration
 And, iterate again:
Example – Value Iteration
 Repeat the iterations until the change in utility between two consecutive iterations is marginal
 After 11 iterations:
– the change in the utility value of any state is smaller than 0.001
 We stop here, and the utility we get is the utility associated with the optimal policy
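Written out, the stopping test used here is

\Delta = \max_{s} \left| V_{k+1}(s) - V_k(s) \right| < \theta, \qquad \theta = 0.001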
Example – Value Iteration
 Compared with policy iteration, the reason value iteration works is that it incorporates the max operation during the value updates.
 Since we choose the maximum utility in each iteration, this
– implicitly performs an argmax operation that excludes the suboptimal actions, and
– converges to the optimal actions
Getting the Optimal Policy
 Using value iteration, we have determined the utility of the optimal policy
 Now, how do we get the optimal policy itself?
– Similar to what is done in policy iteration, we can get the optimal policy by applying the following equation for each state.
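The equation itself is shown as an image on the slide; with the action-independent reward used in this example, the greedy extraction presumably takes the form

\pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \, V(s')

(the immediate reward r(s) does not depend on the action, so it drops out of the argmax).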
Getting the Optimal Policy
 Comparison of the utilities from the policy and value iteration algorithms:
– If we compare the utilities obtained using value iteration with those obtained using policy iteration, we find that the utility values are very close
 The obtained utilities are the solutions of the Bellman equations
 Policy iteration and value iteration are just two alternative methods for solving the Bellman equations
V(s) = R_{ss'} + γV(s')
Getting the Optimal Policy
 For the same MDP with the same Bellman equations, regardless of the method, we would expect to get the same results, right?
– Theoretically, yes
 In practice, slightly different results are obtained
– This is because of differences such as the stopping criteria in the policy iteration and value iteration algorithms
Getting the Optimal Policy
 Slightly different utility values usually do not affect the choice of policy
– Since the policy is determined by the relative rankings of the utility values, not by their absolute values, slightly different utility values usually do not affect the choice of policy
 When determining the optimal policy, if there is a tie between actions, we randomly choose one of them as the optimal action.
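A small Python sketch of this tie-breaking rule; the function name and the sample action values are illustrative only:

import random

def greedy_action(q_values, tol=1e-9):
    """Return an action with maximal value; break ties uniformly at random."""
    best = max(q_values.values())
    tied = [a for a, q in q_values.items() if abs(q - best) <= tol]
    return random.choice(tied)

# Two actions tie, so one of them is picked at random as the optimal action.
print(greedy_action({"UP": 0.42, "LEFT": 0.42, "RIGHT": 0.17, "DOWN": 0.05}))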
Identical Outcomes: Policy and Value Iteration
 We see here that the use of policy iteration and value iteration results in identical policies
Effects of Discount Factor
 Changing the discount factor does not change the fact that these two methods are still solving the same Bellman equations.
 As with γ = 0.5, when γ is 0.1 or 0.9,
– the utilities from policy iteration and value iteration are slightly different, while the policies are identical
Effects of Discount Factor
 Larger γ requires more iterations
– As with the number of sweeps of policy evaluation during policy iteration, in value iteration a larger γ requires more iterations.
– For our example,
• it takes 4 iterations when γ is 0.1 for the change in utility values (∆) to fall below 0.001
• it requires 11 iterations when γ is 0.5
• it requires 67 iterations when γ is 0.9
 As in policy iteration,
– a larger γ tends to generate better results but demands the price of more computation
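This behaviour follows from the fact that the value-iteration backup is a γ-contraction in the max norm: each sweep shrinks the distance to the optimal utility by at least a factor of γ,

\lVert V_k - V^{*} \rVert_{\infty} \le \gamma^{k} \, \lVert V_0 - V^{*} \rVert_{\infty}

so reaching a fixed threshold takes far more sweeps as γ approaches 1.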
Pseudo-code of Value Iteration
 Here
– the threshold θ is used as the stop criterion (as in policy iteration)
– initialization of a policy is not required (unlike policy iteration)
 We do not need a policy during value iteration
– we do not need to consider a policy until the very end
– after the utility has converged, we derive a policy, which is the optimal policy
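A compact Python sketch of the procedure described above: a threshold θ as the stop criterion, no policy initialization, and a greedy policy derived only after the utilities have converged. The two-state MDP at the top is a toy placeholder; plugging in the grid-world model sketched earlier gives the example used in these slides.

# Value iteration: sweep all states with the max-backup, stop when the largest
# change Delta falls below theta, then extract the greedy policy at the end.
GAMMA, THETA = 0.5, 0.001
STATES = ["A", "B", "T"]                       # "T" is a terminal state
ACTIONS = ["stay", "go"]
P = {                                          # P[s][a] = {s': probability}
    "A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"stay": {"B": 1.0}, "go": {"T": 0.8, "B": 0.2}},
}
R = {"A": -0.04, "B": -0.04, "T": 1.0}         # action-independent reward r(s)

def value_iteration(gamma=GAMMA, theta=THETA):
    V = {s: 0.0 for s in STATES}               # v(s) = 0 for all s; no policy needed
    sweeps = 0
    while True:
        delta, sweeps = 0.0, sweeps + 1
        for s in STATES:                       # in-place sweep over all states
            v_old = V[s]
            if s in P:                         # non-terminal: max-backup
                V[s] = R[s] + gamma * max(
                    sum(p * V[s2] for s2, p in P[s][a].items()) for a in ACTIONS)
            else:                              # terminal: utility is just its reward
                V[s] = R[s]
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:                      # stop criterion, as described above
            break
    # The greedy (optimal) policy is derived only after V has converged.
    policy = {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in P}
    return V, policy, sweeps

V, policy, sweeps = value_iteration()
print(sweeps, V, policy)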
From MDP to Reinforcement Learning
 At first glance,
– MDP seems to be super useful in many aspects of real life
– Not only simple games like Pac-Man but also complex systems like the stock market may be represented as an MDP
• for instance, in the stock market, prices are states and buy/hold/sell are actions
From MDP to Reinforcement Learning
 However, there is a catch:
– we do not know the reward function or the transition model
– if we somehow knew the reward function of the MDP representing the stock market, we could quickly become millionaires
– In most real-life MDPs, we can access neither the reward function nor the transition model
From MDP to Reinforcement Learning
 In real life (contrary to the Pac-Man game), we do not know
– where the diamond is,
– where the poison is,
– where the walls are,
– how big the map is,
– what the probability is that the robot accurately executes the intended action,
– what the robot will do when it does not accurately execute our intended action,
– etc.
From MDP to Reinforcement Learning
 All we know is the following:
– choose an action,
– reach a new state,
– receive -0.04 (pay a penalty of 0.04),
– continue choosing actions,
– reach another state,
– receive -0.04…
From MDP to Reinforcement Learning
 In other words:
– In an MDP, we assume a fully observable environment, while in real life it is not.
 Methods such as policy iteration and value iteration can solve a fully observable MDP
 In contrast, when the reward function and transition model are not known, that is where reinforcement learning fits in
From MDP to Reinforcement Learning
 Since we do not know the reward function and the transition model, we need to learn them
– this is where reinforcement learning helps
 Reinforcement learning approaches
– Monte Carlo Approach
– Temporal Difference Learning
References
 Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press (Chapter 4). http://incompleteideas.net/book/ebook/
 Markov decision process: value iteration with code implementation: https://medium.com/@ngao7/markov-decision-process-value-iteration-2d161d50a6ff
 Markov decision process: policy iteration with code implementation: https://medium.com/@ngao7/markov-decision-process-policy-iteration-42d35ee87c82
Projects
 Tools:
– OpenAI Gym - a toolkit for developing and comparing RL algorithms
– Python + TensorFlow (TF-Agents)
– MuJoCo - Advanced physics simulation
 Problems
– Robot navigation
– Stock trading
– Traffic Light Control
– Point cloud completion
– Self-driving taxis
– Inverted Pendulum (CartPole Game)
– Atari games - Breakout, Montezuma's Revenge, and Space Invaders
Thank You