Lecture notes: Presentation Transcript

    • Reinforcement Learning
      Michael L. Littman
      Slides from http://www.cs.vu.nl/~elena/ml_13light.ppt which appear to have been adapted from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps
    • Reinforcement Learning
      • Control learning
      • Control policies that choose optimal actions
      • Q learning
      • Convergence
      [Read Ch. 13] [Exercise 13.2]
    • Control Learning
      • Consider learning to choose actions, e.g.,
      • Robot learning to dock on battery charger
      • Learning to choose actions to optimize factory output
      • Learning to play Backgammon
      • Note several problem characteristics:
      • Delayed reward
      • Opportunity for active exploration
      • Possibility that state only partially observable
      • Possible need to learn multiple tasks with same sensors/effectors
    • One Example: TD-Gammon
      • [Tesauro, 1995]
      • Learn to play Backgammon.
      • Immediate reward:
      • +100 if win
      • -100 if lose
      • 0 for all other states
      • Trained by playing 1.5 million games against itself.
      • Now approximately equal to best human player.
    • Reinforcement Learning Problem
      • Goal: learn to choose actions that maximize
      • r_0 + γ r_1 + γ² r_2 + … , where 0 ≤ γ < 1
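      As a concrete illustration (not from the slides), here is a minimal Python sketch of this discounted sum, using a hypothetical reward sequence and γ = 0.9:

        # Sketch: discounted return r_0 + gamma*r_1 + gamma^2*r_2 + ...
        def discounted_return(rewards, gamma=0.9):
            """Sum of rewards, each discounted by gamma**i (0 <= gamma < 1)."""
            return sum((gamma ** i) * r for i, r in enumerate(rewards))

        # Hypothetical episode: zero reward until a final reward of +100.
        print(discounted_return([0, 0, 0, 100], gamma=0.9))  # 0.9**3 * 100 = 72.9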
    • Markov Decision Processes
      • Assume
      • finite set of states S
      • set of actions A
      • at each discrete time agent observes state s_t ∈ S and chooses action a_t ∈ A
      • then receives immediate reward r_t
      • and state changes to s_{t+1}
      • Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
        • i.e., r_t and s_{t+1} depend only on current state and action
        • functions δ and r may be nondeterministic
        • functions δ and r not necessarily known to agent
    • Agent’s Learning Task
      • Execute actions in environment, observe results, and
      • learn action policy π : S → A that maximizes E[r_t + γ r_{t+1} + γ² r_{t+2} + …] from any starting state in S
      • here 0 ≤ γ < 1 is the discount factor for future rewards (sometimes makes sense with γ = 1)
      • Note something new:
      • Target function is π : S → A
      • But, we have no training examples of form ⟨s, a⟩
      • Training examples are of form ⟨⟨s, a⟩, r⟩
    • Value Function
      • To begin, consider deterministic worlds...
      • For each possible policy π the agent might adopt, we can define an evaluation function over states:
        V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_i γ^i r_{t+i}
      where r_t, r_{t+1}, ... are generated by following policy π starting at state s. Restated, the task is to learn the optimal policy π*:
        π* ≡ argmax_π V^π(s), (∀s)
    • What to Learn
      • We might try to have the agent learn the evaluation function V^π* (which we write as V*)
      • It could then do a look-ahead search to choose the best action from any state s because
        π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
      • A problem:
      • This can work if the agent knows δ : S × A → S, and r : S × A → ℝ
      • But when it doesn't, it can't choose actions this way
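      As a sketch of this look-ahead (assuming δ and r are known and V* has already been learned as a table; the function name and dict-based interface are assumptions), the greedy choice π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))] might look like:

        # Sketch: one-step look-ahead through known delta and r, using a learned V*.
        def greedy_action(s, actions, delta, reward, V, gamma=0.9):
            """Return argmax_a [ r(s, a) + gamma * V(delta(s, a)) ]."""
            return max(actions, key=lambda a: reward[(s, a)] + gamma * V[delta[(s, a)]])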
    • Q Function
      • Define a new function very similar to V*:
        Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
      • The optimal policy can then be expressed as π*(s) = argmax_a Q(s, a)
      If the agent learns Q, it can choose the optimal action even without knowing δ! Q is the evaluation function the agent will learn [Watkins, 1989].
    • Training Rule to Learn Q
      • Note Q and V* are closely related:
        V*(s) = max_{a'} Q(s, a')
      This allows us to write Q recursively as
        Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
      Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule
        Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
      where s' is the state resulting from applying action a in state s.
    • Q Learning for Deterministic Worlds
      • For each s, a initialize the table entry Q̂(s, a) ← 0
      • Observe the current state s
      • Do forever:
        • Select an action a and execute it
        • Receive immediate reward r
        • Observe the new state s'
        • Update the table entry for Q̂(s, a) as follows:
          Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
        • s ← s'
    • Updating Q̂
      • Notice that if rewards are non-negative, then
        Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)   (∀ s, a, n)   and   0 ≤ Q̂_n(s, a) ≤ Q(s, a)
    • Convergence Theorem
      • Q̂ converges to Q. Consider the case of a deterministic world, with bounded immediate rewards, where each ⟨s, a⟩ is visited infinitely often.
      • Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
      • Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,
        Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|
      • For any table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is
        |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                                  = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                                  ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                                  ≤ γ Δ_n
    • Note we used the general fact that
        |max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|
      This works with operators other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
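      A quick numeric illustration (assumed values) of this non-expansion fact:

        # |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)|
        f1 = {"a1": 3.0, "a2": 7.0, "a3": 1.0}  # e.g., a row of Q-hat_n
        f2 = {"a1": 4.0, "a2": 5.5, "a3": 2.0}  # e.g., the corresponding row of Q

        lhs = abs(max(f1.values()) - max(f2.values()))  # |7.0 - 5.5| = 1.5
        rhs = max(abs(f1[a] - f2[a]) for a in f1)       # max(1.0, 1.5, 1.0) = 1.5
        assert lhs <= rhs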
    • Non-deterministic Case (1)
      • What if reward and next state are non-deterministic?
      • We redefine V and Q by taking expected values:
        V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + …] = E[Σ_i γ^i r_{t+i}]
        Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
    • Nondeterministic Case (2)
      • Q learning generalizes to nondeterministic worlds.
      • Alter the training rule to
        Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]
      where
        α_n = 1 / (1 + visits_n(s, a))
      • Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]. Standard properties of the learning rate: Σ_n α_n = ∞ and Σ_n α_n² < ∞.
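      A minimal sketch of this update with the visit-count learning rate (the surrounding bookkeeping and function signature are assumptions for illustration):

        from collections import defaultdict

        Q = defaultdict(float)       # Q-hat table
        visits = defaultdict(int)    # visits_n(s, a)

        def q_update(s, a, r, s_next, actions, gamma=0.9):
            visits[(s, a)] += 1
            alpha = 1.0 / (1 + visits[(s, a)])   # alpha_n = 1 / (1 + visits_n(s, a))
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            # Q_n(s,a) <- (1 - alpha_n)*Q_{n-1}(s,a) + alpha_n*[r + gamma*max_a' Q_{n-1}(s',a')]
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target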
    • Temporal Difference Learning (1)
      • Q learning: reduce the discrepancy between successive Q estimates.
      • One-step time difference:
        Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
      • Why not two steps?
        Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
      • Or n?
        Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)
      • Blend all of these:
        Q^λ(s_t, a_t) ≡ (1 − λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + …]
    • Temporal Difference Learning (2)
      • The TD(λ) algorithm uses the above training rule
      • Sometimes converges faster than Q learning
      • Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]
      • Tesauro's TD-Gammon uses this algorithm
      • Equivalent recursive expression:
        Q^λ(s_t, a_t) = r_t + γ [(1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]
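      A sketch of the blended return, truncated to the rewards observed in a finite trajectory (the truncation and the input format are assumptions; the slides' formula is an infinite sum). Here rewards[i] holds r_{t+i} and bootstraps[n-1] holds max_a Q̂(s_{t+n}, a):

        def n_step_return(rewards, bootstraps, n, gamma=0.9):
            """Q^(n): n discounted rewards plus gamma**n * max_a Q-hat(s_{t+n}, a)."""
            g = sum((gamma ** i) * rewards[i] for i in range(n))
            return g + (gamma ** n) * bootstraps[n - 1]

        def lambda_return(rewards, bootstraps, lam=0.8, gamma=0.9):
            """Truncated Q^lambda = (1 - lam) * sum_n lam**(n-1) * Q^(n)."""
            N = len(rewards)
            return (1 - lam) * sum((lam ** (n - 1)) * n_step_return(rewards, bootstraps, n, gamma)
                                   for n in range(1, N + 1))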
    • Subtleties and Ongoing Research
      • Replace table with neural net or other generalizer
      • Handle case where state only partially observable (next?)
      • Design optimal exploration strategies
      • Extend to continuous action, state
      • Learn and use an estimate δ̂ : S × A → S of the environment model
      • Relationship to dynamic programming (next?)