Monte Carlo Method
Presented by: Shajal Afaq
M.Tech CSE (2nd year)
Monte Carlo Methods

❐ Monte Carlo methods learn from complete sample returns
  – Only defined for episodic tasks
❐ Monte Carlo methods learn directly from experience
  – On-line: No model necessary and still attains optimality
  – Simulated: No need for a full model
Monte Carlo Policy Evaluation

❐ Goal: learn V_π(s)
❐ Given: some number of episodes under π which contain s
❐ Idea: average returns observed after visits to s
❐ Every-Visit MC: average returns for every time s is visited in an episode
❐ First-visit MC: average returns only for the first time s is visited in an episode
❐ Both converge asymptotically
First-visit Monte Carlo policy evaluation
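A minimal Python sketch of first-visit MC policy evaluation, assuming each episode is available as a list of (state, reward) pairs generated by following π; the helper name generate_episode and the dictionary-based value table are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, num_episodes, gamma=1.0):
    """Estimate V_pi(s) by averaging the returns that follow first visits to s.

    generate_episode() is assumed to return one episode as a list of
    (state, reward) pairs obtained by following the policy pi.
    """
    returns_sum = defaultdict(float)   # total return observed after first visits to s
    returns_count = defaultdict(int)   # number of episodes in which s was visited
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()                 # [(s_0, r_1), (s_1, r_2), ...]
        states = [s for s, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):    # walk backwards, accumulating the return
            s, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:                  # first visit to s in this episode
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```

Every-visit MC differs only in dropping the first-visit check.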
Blackjack example

❐ Object: have your card sum be greater than the dealer's without exceeding 21.
❐ States (200 of them):
  – current sum (12–21)
  – dealer's showing card (ace–10)
  – do I have a usable ace?
❐ Reward: +1 for winning, 0 for a draw, −1 for losing
❐ Actions: stick (stop receiving cards), hit (receive another card)
❐ Policy: stick if my sum is 20 or 21, else hit
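As a concrete illustration, the fixed policy above fits in a tiny function; the (player_sum, dealer_card, usable_ace) state tuple and the constant names are assumptions chosen to match the 200-state description.

```python
# Action constants are illustrative; any encoding works.
HIT, STICK = 0, 1

def stick_on_20_policy(state):
    """The fixed evaluation policy from the slide: stick on 20 or 21, else hit."""
    player_sum, dealer_card, usable_ace = state      # e.g. (13, 10, False)
    return STICK if player_sum >= 20 else HIT
```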
Blackjack value functions

[Figure: state-value estimates for the stick-on-20-or-21 policy, after 10,000 and after 500,000 episodes, shown separately for usable ace and no usable ace]
Backup diagram for Monte Carlo

❐ Entire episode included
❐ Only one choice at each state (unlike DP)
❐ MC does not bootstrap
❐ Time required to estimate one state does not depend on the total number of states

[Backup diagram: a single sampled trajectory from the root state down to the terminal state]
Monte Carlo Estimation of Action Values (Q)

❐ Monte Carlo is most useful when a model is not available
  – We want to learn Q*
❐ Q_π(s,a): average return starting from state s and action a, then following π (see the sketch below)
❐ Also converges asymptotically if every state–action pair is visited
❐ Exploring starts: every state–action pair has a non-zero probability of being the starting pair
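The estimate itself is just a running average of sampled returns per state–action pair; a minimal sketch, with the dictionary names as illustrative assumptions:

```python
from collections import defaultdict

returns_sum = defaultdict(float)    # sum of sampled returns for each (s, a)
returns_count = defaultdict(int)    # how many returns have been seen for (s, a)
Q = defaultdict(float)              # current Monte Carlo estimate of Q_pi(s, a)

def record_return(s, a, G):
    """Fold one sampled return G into the running average for (s, a)."""
    returns_sum[(s, a)] += G
    returns_count[(s, a)] += 1
    Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
```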
Monte Carlo Control

❐ MC policy iteration: policy evaluation using MC methods followed by policy improvement
❐ Policy improvement step: greedify with respect to the value (or action-value) function (loop sketched below)

[Diagram: generalized policy iteration — evaluation (π → Q) alternating with improvement (Q → greedy(Q))]
Convergence of MC Control

❐ Policy improvement theorem tells us:

$$Q_{\pi_k}\bigl(s, \pi_{k+1}(s)\bigr) = Q_{\pi_k}\bigl(s, \arg\max_a Q_{\pi_k}(s,a)\bigr) = \max_a Q_{\pi_k}(s,a) \;\ge\; Q_{\pi_k}\bigl(s, \pi_k(s)\bigr) = V_{\pi_k}(s)$$

❐ This assumes exploring starts and an infinite number of episodes for MC policy evaluation
❐ To solve the latter:
  – update only to a given level of performance
  – alternate between evaluation and improvement per episode
Blackjack example continued

❐ Exploring starts
❐ Initial policy as described before

[Figure: the resulting policy π* (HIT/STICK regions over player sum 11–21 vs. dealer showing card A–10, with separate panels for usable ace and no usable ace) and the corresponding value function V*]
On-policy Monte Carlo Control

❐ On-policy: learn about the policy currently being executed
❐ How do we get rid of exploring starts?
  – Need soft policies: π(s,a) > 0 for all s and a
  – e.g. ε-soft policy (see the sketch below):

$$\pi(s,a) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{greedy action} \\[4pt] \dfrac{\varepsilon}{|A(s)|} & \text{each non-max action} \end{cases}$$

❐ Similar to GPI: move policy towards greedy policy (i.e. ε-soft)
❐ Converges to the best ε-soft policy
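A minimal sketch of ε-soft action selection derived from current Q estimates; the function names and the dictionary layout of Q are assumptions.

```python
import random

def epsilon_soft_probs(Q, s, actions, epsilon=0.1):
    """Action probabilities of an epsilon-soft policy built from Q."""
    greedy = max(actions, key=lambda a: Q[(s, a)])
    n = len(actions)
    return {a: (1 - epsilon + epsilon / n) if a == greedy else epsilon / n
            for a in actions}

def sample_action(Q, s, actions, epsilon=0.1):
    """Draw an action: every action keeps probability >= epsilon/|A(s)|."""
    probs = epsilon_soft_probs(Q, s, actions, epsilon)
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```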
Off-policy Monte Carlo control

❐ Behavior policy generates behavior in the environment
❐ Estimation policy is the policy being learned about
❐ Weight returns from the behavior policy by the ratio of their probabilities under the estimation and behavior policies (importance sampling; see the sketch below)
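The per-return weight is the product of probability ratios along the rest of the episode; a sketch, with pi_prob and b_prob as assumed lookup functions for the estimation and behavior policies:

```python
def importance_weight(episode_tail, pi_prob, b_prob):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs from time t onward."""
    w = 1.0
    for s, a in episode_tail:
        if pi_prob(s, a) == 0.0:
            return 0.0                         # the estimation policy never takes a here
        w *= pi_prob(s, a) / b_prob(s, a)      # requires b(s, a) > 0, i.e. a soft behavior policy
    return w
```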
Incremental Implementation

❐ MC can be implemented incrementally
  – saves memory
❐ Compute the weighted average of each return

Non-incremental:

$$V_n = \frac{\sum_{k=1}^{n} w_k R_k}{\sum_{k=1}^{n} w_k}$$

Incremental equivalent:

$$V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}}\bigl(R_{n+1} - V_n\bigr), \qquad W_{n+1} = W_n + w_{n+1}, \qquad V_0 = W_0 = 0$$
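The incremental form maps directly onto a small accumulator; a sketch (the class name is illustrative):

```python
class IncrementalWeightedAverage:
    def __init__(self):
        self.V = 0.0      # current weighted-average estimate (V_0 = 0)
        self.W = 0.0      # cumulative weight                 (W_0 = 0)

    def update(self, R, w=1.0):
        """Fold in one return R with weight w (importance ratio off-policy, 1 on-policy)."""
        self.W += w                                # W_{n+1} = W_n + w_{n+1}
        if self.W > 0.0:
            self.V += (w / self.W) * (R - self.V)  # V_{n+1} = V_n + (w_{n+1}/W_{n+1})(R_{n+1} - V_n)
        return self.V
```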
Summary

❐ MC has several advantages over DP:
  – Can learn directly from interaction with the environment
  – No need for full models
  – No need to learn about ALL states
  – Less harm from violations of the Markov property (later in the book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
  – exploring starts, soft policies
❐ No bootstrapping (as opposed to DP)
Thank you