1. Reinforcement Learning
   Michael L. Littman
   Slides from http://www.cs.vu.nl/~elena/ml_13light.ppt, which appear to have been adapted from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps
2. Reinforcement Learning
   - Control learning
   - Control policies that choose optimal actions
   - Q learning
   - Convergence
   [Read Ch. 13] [Exercise 13.2]
3. Control Learning
   Consider learning to choose actions, e.g.,
   - Robot learning to dock on battery charger
   - Learning to choose actions to optimize factory output
   - Learning to play Backgammon
   Note several problem characteristics:
   - Delayed reward
   - Opportunity for active exploration
   - Possibility that state is only partially observable
   - Possible need to learn multiple tasks with the same sensors/effectors
4. One Example: TD-Gammon [Tesauro, 1995]
   Learn to play Backgammon.
   Immediate reward:
   - +100 if win
   - -100 if lose
   - 0 for all other states
   Trained by playing 1.5 million games against itself.
   Now approximately equal to the best human players.
5. Reinforcement Learning Problem
   Goal: learn to choose actions that maximize
   r_0 + γ r_1 + γ^2 r_2 + ... , where 0 ≤ γ < 1
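To make the objective concrete, here is a minimal Python sketch that computes the discounted return r_0 + γ r_1 + γ^2 r_2 + ... for a finite reward sequence; the example rewards and the choice γ = 0.9 are illustrative assumptions, not from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Illustrative reward sequence: nothing until a final payoff of +100.
print(discounted_return([0, 0, 0, 100], gamma=0.9))  # 72.9 = 0.9**3 * 100
```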
6. Markov Decision Processes
   Assume
   - finite set of states S
   - set of actions A
   - at each discrete time step, the agent observes state s_t ∈ S and chooses action a_t ∈ A
   - then receives immediate reward r_t
   - and the state changes to s_{t+1}
   Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
   - i.e., r_t and s_{t+1} depend only on the current state and action
   - the functions δ and r may be nondeterministic
   - the functions δ and r are not necessarily known to the agent
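To ground the notation, here is a minimal sketch of a deterministic MDP, assuming a made-up five-state corridor world; the names N_STATES, ACTIONS, delta, and reward are hypothetical and simply mirror δ and r on this slide. Later sketches in these notes reuse this toy world.

```python
# A minimal deterministic MDP sketch: a 1-D corridor with states 0..4.
# Actions "left"/"right"; stepping into state 4 (the goal) pays +100, else 0.

N_STATES = 5
ACTIONS = ("left", "right")

def delta(s, a):
    """Transition function delta(s, a) -> next state (deterministic)."""
    if a == "right":
        return min(s + 1, N_STATES - 1)
    return max(s - 1, 0)

def reward(s, a):
    """Reward function r(s, a): +100 for stepping into the goal state."""
    return 100 if delta(s, a) == N_STATES - 1 and s != N_STATES - 1 else 0
```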
7. Agent's Learning Task
   Execute actions in the environment, observe results, and
   learn an action policy π : S → A that maximizes
   E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]
   from any starting state in S.
   Here 0 ≤ γ < 1 is the discount factor for future rewards (it sometimes makes sense to use γ = 1).
   Note something new:
   - The target function is π : S → A
   - But we have no training examples of the form ⟨s, a⟩
   - Training examples are of the form ⟨⟨s, a⟩, r⟩
8. Value Function
   To begin, consider deterministic worlds...
   For each possible policy π the agent might adopt, we can define an evaluation function over states:
   V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ≡ Σ_{i=0}^∞ γ^i r_{t+i}
   where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
   Restated, the task is to learn the optimal policy π*:
   π* ≡ argmax_π V^π(s), (∀s)
10. What to Learn
   We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
   It could then do a look-ahead search to choose the best action from any state s, because
   π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
   A problem:
   - This can work if the agent knows δ : S × A → S and r : S × A → ℝ
   - But when it doesn't, it can't choose actions this way
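As a small illustration of the problem this slide raises, the look-ahead rule below reuses the hypothetical corridor world sketched earlier; it computes argmax_a [ r(s, a) + γ V*(δ(s, a)) ] and only works because delta and reward are available to the caller. The v_star table is hand-filled and is an assumption.

```python
# One-step look-ahead with V*, reusing the toy delta/reward above (an assumption).
# Note the agent needs the model: both delta and r appear explicitly.

def greedy_from_v(s, v_star, gamma=0.9):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]."""
    return max(ACTIONS, key=lambda a: reward(s, a) + gamma * v_star[delta(s, a)])

# Hypothetical V* table for the corridor: V*(s) = gamma**(3 - s) * 100 for s < 4, 0 at the goal.
v_star = [72.9, 81.0, 90.0, 100.0, 0.0]
print(greedy_from_v(0, v_star))  # "right"
```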
11. Q Function
   Define a new function very similar to V*:
   Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
   If the agent learns Q, it can choose the optimal action even without knowing δ!
   π*(s) = argmax_a Q(s, a)
   Q is the evaluation function the agent will learn [Watkins 1989].
12. Training Rule to Learn Q
   Note Q and V* are closely related:
   V*(s) = max_{a'} Q(s, a')
   This allows us to write Q recursively as
   Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
   Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule
   Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
   where s' is the state resulting from applying action a in state s.
13. Q Learning for Deterministic Worlds
   For each s, a initialize the table entry Q̂(s, a) ← 0
   Observe current state s
   Do forever:
   - Select an action a and execute it
   - Receive immediate reward r
   - Observe the new state s'
   - Update the table entry for Q̂(s, a) as follows:
     Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
   - s ← s'
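Here is a minimal sketch of this algorithm in Python, run on the hypothetical corridor world from the earlier sketch (the slide itself is world-agnostic). The slide leaves "select an action a" open; ε-greedy exploration with random tie-breaking is one common choice and is an assumption here.

```python
import random

GAMMA, EPSILON, EPISODES = 0.9, 0.1, 500
GOAL = N_STATES - 1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}   # initialize table

for _ in range(EPISODES):
    s = 0                                         # observe current state s
    while s != GOAL:
        if random.random() < EPSILON:             # explore: random action
            a = random.choice(ACTIONS)
        else:                                     # exploit: greedy, random tie-break
            best = max(Q[(s, act)] for act in ACTIONS)
            a = random.choice([act for act in ACTIONS if Q[(s, act)] == best])
        r, s_next = reward(s, a), delta(s, a)     # receive reward r, observe new state s'
        # Q(s, a) <- r + gamma * max_a' Q(s', a')   (deterministic-world update)
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, act)] for act in ACTIONS)
        s = s_next                                # s <- s'

print(max(Q[(0, act)] for act in ACTIONS))        # approaches gamma**3 * 100 = 72.9
```

With enough episodes the learned Q̂(0, "right") approaches γ^3 · 100 = 72.9 in this toy world, matching the hand-computed optimal value from the look-ahead sketch above.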
14. Updating Q
   Notice that if rewards are non-negative, then
   (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
   and
   (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)
15. Convergence Theorem
   Q̂ converges to Q. Consider the case of a deterministic world, with bounded immediate rewards, where each ⟨s, a⟩ is visited infinitely often.
   Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.
   Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,
   Δ_n = max_{s,a} | Q̂_n(s, a) - Q(s, a) |
   For any table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is
   | Q̂_{n+1}(s, a) - Q(s, a) | = | (r + γ max_{a'} Q̂_n(s', a')) - (r + γ max_{a'} Q(s', a')) |
     = γ | max_{a'} Q̂_n(s', a') - max_{a'} Q(s', a') |
     ≤ γ max_{a'} | Q̂_n(s', a') - Q(s', a') |
     ≤ γ Δ_n
16. Note we used the general fact that:
   | max_a f_1(a) - max_a f_2(a) | ≤ max_a | f_1(a) - f_2(a) |
   This works with operators other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
17. Non-deterministic Case (1)
   What if reward and next state are non-deterministic?
   We redefine V and Q by taking expected values:
   V^π(s) ≡ E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ] ≡ E[ Σ_{i=0}^∞ γ^i r_{t+i} ]
   Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]
18. Nondeterministic Case (2)
   Q learning generalizes to nondeterministic worlds.
   Alter the training rule to
   Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n-1}(s', a') ]
   where
   α_n = 1 / (1 + visits_n(s, a))
   Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992].
   Standard conditions on the learning rates: Σ_n α_n = ∞ and Σ_n α_n^2 < ∞.
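A sketch of this altered training rule, assuming the same (state, action)-keyed table and ACTIONS tuple as the earlier sketches; the names visits and q_update are hypothetical. The decaying learning rate α_n averages out the noise in rewards and transitions.

```python
from collections import defaultdict

visits = defaultdict(int)   # visits_n(s, a): how often each pair has been updated

def q_update(Q, s, a, r, s_next, gamma=0.9):
    """One nondeterministic-world Q update with alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, act)] for act in ACTIONS)
    # Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [ r + gamma max_a' Q_{n-1}(s',a') ]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```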
19. Temporal Difference Learning (1)
   Q learning: reduce the discrepancy between successive Q estimates.
   One-step time difference:
   Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
   Why not two steps?
   Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ^2 max_a Q̂(s_{t+2}, a)
   Or n?
   Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)
   Blend all of these:
   Q^λ(s_t, a_t) ≡ (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]
20. Temporal Difference Learning (2)
   Equivalent expression:
   Q^λ(s_t, a_t) = r_t + γ [ (1 - λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]
   - The TD(λ) algorithm uses the above training rule
   - Sometimes converges faster than Q learning
   - Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]
   - Tesauro's TD-Gammon uses this algorithm
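Here is a sketch of the blended return from these two slides, computed in the "forward view" from a recorded finite episode. It assumes the (state, action)-keyed Q̂ table and ACTIONS tuple from the earlier sketches; how the end of the episode is handled (remaining λ weight goes to the full return) is an assumption the slides do not spell out.

```python
def lambda_return(rewards, states, t, Q, lam=0.7, gamma=0.9):
    """
    Forward-view TD(lambda) target for step t of a recorded episode:
      Q_lambda = (1 - lam) * sum_{n>=1} lam**(n-1) * Q_n,
    where Q_n = r_t + ... + gamma**(n-1) r_{t+n-1} + gamma**n max_a Q[(s_{t+n}, a)].
    rewards[i] is r_i, states[i] is s_i, and len(states) == len(rewards) + 1.
    """
    T = len(rewards)
    total, weight = 0.0, 1.0      # weight tracks lam**(n-1)
    n_step = 0.0                  # running sum r_t + gamma*r_{t+1} + ...
    for n in range(1, T - t + 1):
        n_step += (gamma ** (n - 1)) * rewards[t + n - 1]
        if t + n < T:
            # n-step estimate: discounted rewards plus a bootstrap from Q-hat
            q_n = n_step + (gamma ** n) * max(Q[(states[t + n], a)] for a in ACTIONS)
            total += (1 - lam) * weight * q_n
            weight *= lam
        else:
            # episode ended: remaining lambda weight goes to the full return
            total += weight * n_step
    return total
```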
21. Subtleties and Ongoing Research
   - Replace the Q̂ table with a neural net or other generalizer
   - Handle case where state only partially observable (next?)
   - Design optimal exploration strategies
   - Extend to continuous actions, states
   - Learn and use δ̂ : S × A → S
   - Relationship to dynamic programming (next?)
