
Lecture notes



  1. Reinforcement Learning in Partially Observable Environments (Michael L. Littman)
  2. Temporal Difference Learning (1)
     - Q-learning: reduce the discrepancy between successive Q estimates (Q̂ is the current estimate, γ the discount factor).
     - One-step time difference: Q^(1)(s_t, a_t) = r_t + γ max_a Q̂(s_{t+1}, a)
     - Why not two steps? Q^(2)(s_t, a_t) = r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
     - Or n? Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)
     - Blend all of these: Q^λ(s_t, a_t) = (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ... ]
  3. Temporal Difference Learning (2)
     - The TD(λ) algorithm uses the above training rule.
     - Sometimes converges faster than Q-learning.
     - Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992].
     - Tesauro's TD-Gammon uses this algorithm.
     - Bias-variance tradeoff [Kearns & Singh, 2000].
     - Implemented using "eligibility traces" [Sutton, 1988]; see the sketch after this slide.
     - Helps overcome non-Markov environments [Loch & Singh, 1998].
     - Equivalent expression: Q^λ(s_t, a_t) = r_t + γ [ (1 - λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]
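As a concrete illustration of the eligibility-trace idea above, here is a minimal tabular sketch in Python. It uses an on-policy Sarsa(λ)-style update with accumulating traces rather than any specific algorithm from the lecture; the env.reset()/env.step() interface, the ε-greedy policy, and all hyperparameter values are assumptions made for this sketch.

```python
import numpy as np

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.95, lam=0.8, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating eligibility traces.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); this interface is an
    assumption for illustration, not part of the lecture.
    """
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        E = np.zeros_like(Q)          # eligibility traces, reset each episode
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            # One-step TD error; the lambda-return is realized by the traces.
            delta = r + (0.0 if done else gamma * Q[s2, a2]) - Q[s, a]
            E[s, a] += 1.0            # accumulate trace for the visited pair
            Q += alpha * delta * E    # credit all recently visited pairs
            E *= gamma * lam          # decay traces
            s, a = s2, a2
    return Q
```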
  4. Non-Markov Examples: Can you solve them?
  5. Markov Decision Processes
     - Recall the MDP setting (a minimal coded sketch follows this slide):
     - a finite set of states S
     - a set of actions A
     - at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
     - it receives reward r_t, and the state changes to s_{t+1}
     - Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
       - r_t and s_{t+1} depend only on the current state and action
       - the functions δ and r may be nondeterministic
       - the functions δ and r are not necessarily known to the agent
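A minimal sketch of this setting plus a one-step Q-learning update, shown on a tiny made-up two-state MDP; the transition table δ, the rewards, and the hyperparameters are invented for illustration, not taken from the lecture.

```python
import numpy as np

# A tiny deterministic MDP in the slide's notation: delta(s, a) gives the
# next state, r(s, a) the reward.  This two-state chain is a made-up example.
n_states, n_actions = 2, 2
delta = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
reward = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

s = 0
for _ in range(1000):
    a = np.random.randint(n_actions)              # explore randomly
    s2, r = delta[(s, a)], reward[(s, a)]
    # One-step Q-learning: move Q(s,a) toward r + gamma * max_a' Q(s',a').
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2

print(Q)
```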
  6. Partially Observable MDPs
     - Same as an MDP, but with an additional observation function ω that translates the state into what the learner can observe: o_t = ω(s_t).
     - Transitions and rewards still depend on the state, but the learner only sees a "shadow" (a small wrapper sketch follows this slide).
     - How can we learn what to do?
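To make the "shadow" concrete, here is a small hypothetical wrapper that hides the true state behind an observation function; the mdp interface and the example aliasing function are assumptions, not part of the lecture.

```python
class PartiallyObservableWrapper:
    """Wrap an MDP so the learner sees omega(s_t) instead of s_t.

    The wrapped `mdp` is assumed to expose reset() -> state and
    step(action) -> (state, reward, done); `omega` maps a state to an
    observation (possibly many-to-one), which is the "shadow" the slide
    describes.  The interface names are assumptions for illustration.
    """
    def __init__(self, mdp, omega):
        self.mdp, self.omega = mdp, omega

    def reset(self):
        return self.omega(self.mdp.reset())

    def step(self, action):
        s_next, reward, done = self.mdp.step(action)
        return self.omega(s_next), reward, done   # true state stays hidden

# Example aliasing: two different corridor cells look identical to the agent.
omega = lambda s: "corridor" if s in (2, 5) else f"cell-{s}"
```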
  7. State Approaches to POMDPs
     - Q-learning (dynamic programming) over different notions of "state":
       - observations
       - short histories
       - learn a POMDP model: most likely state
       - learn a POMDP model: information state
       - learn a predictive model: predictive state
       - experience as state
     - Advantages, disadvantages?
  8. Learning a POMDP
     - Input: a history (action-observation sequence).
     - Output: a POMDP that "explains" the data.
     - EM, an iterative algorithm (Baum et al. 70; Chrisman 92); a compact sketch follows this slide.
     - E step (forward-backward): compute state occupation probabilities.
     - M step (fractional counting): re-estimate the POMDP model.
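A compact sketch of this EM loop for a single action-observation history: the E step runs forward-backward to get state occupation probabilities, and the M step re-estimates the model by fractional counting. The array shapes (T[a, s, s'], O[s, o]), the assumption that observations depend only on the state reached, and the scaling details are choices made for this illustration rather than the exact algorithm from the lecture.

```python
import numpy as np

def em_pomdp(actions, obs, n_states, n_actions, n_obs, iters=50,
             rng=np.random.default_rng(0)):
    """Baum-Welch-style EM sketch for a POMDP model, given one history.

    Assumes observations depend on the state reached (o_t ~ O[s_t]); the
    initial state distribution is kept fixed (uniform) for brevity.
    """
    T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # T[a, s, s']
    O = rng.dirichlet(np.ones(n_obs), size=n_states)                  # O[s, o]
    b0 = np.full(n_states, 1.0 / n_states)
    L = len(actions)

    for _ in range(iters):
        # E step: forward-backward along the single history.
        alpha = np.zeros((L + 1, n_states))
        beta = np.ones((L + 1, n_states))
        alpha[0] = b0
        for t in range(1, L + 1):
            a, o = actions[t - 1], obs[t - 1]
            alpha[t] = (alpha[t - 1] @ T[a]) * O[:, o]
            alpha[t] /= alpha[t].sum()             # scale for numerical stability
        for t in range(L, 0, -1):
            a, o = actions[t - 1], obs[t - 1]
            beta[t - 1] = T[a] @ (O[:, o] * beta[t])
            beta[t - 1] /= beta[t - 1].sum()

        gamma = alpha * beta                       # state occupation probabilities
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: fractional counting of transitions and observations.
        T_num = np.zeros_like(T)
        O_num = np.zeros_like(O)
        for t in range(1, L + 1):
            a, o = actions[t - 1], obs[t - 1]
            xi = alpha[t - 1][:, None] * T[a] * (O[:, o] * beta[t])[None, :]
            T_num[a] += xi / xi.sum()              # posterior over (s_{t-1}, s_t)
            O_num[:, o] += gamma[t]
        T = T_num / np.maximum(T_num.sum(axis=2, keepdims=True), 1e-12)
        O = O_num / np.maximum(O_num.sum(axis=1, keepdims=True), 1e-12)
    return T, O
```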
  9. EM Pitfalls
     - Each iteration increases the data likelihood.
     - Local maxima (Shatkay & Kaelbling 97; Nikovski 99).
     - Rarely learns a good model.
     - Hidden states are truly unobservable.
  10. Information State
     - Assumes an objective reality and a known "map" (model).
     - Also called a belief state: it represents the learner's location.
     - A vector of probabilities, one for each state.
     - Easily updated if the model is known (an example sketch follows this slide).
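The belief-state update itself is one step of predict-then-correct arithmetic. Below is a minimal sketch, assuming the model is given as arrays T[a, s, s'] and O[s, o]; the shapes and the two-state example numbers are invented for illustration.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One step of belief-state tracking with a known model.

    b : current belief, a probability vector over states
    a : action just taken
    o : observation just received
    T : T[a, s, s'] transition probabilities (assumed shape)
    O : O[s, o] probability of observing o in state s (assumed shape)
    """
    b_pred = b @ T[a]                 # predict: where could we be after a?
    b_new = b_pred * O[:, o]          # correct: weight by observation likelihood
    return b_new / b_new.sum()        # renormalize ("50% here, 50% there")

# Example: two aliased states and a "move" action that may slip.
T = np.array([[[0.1, 0.9],    # action 0 from state 0
               [0.9, 0.1]]])  # action 0 from state 1
O = np.array([[0.8, 0.2],     # state 0 usually yields observation 0
              [0.2, 0.8]])
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, O=O))   # -> [0.2, 0.8]
```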
  11. Plan with Information States
     - Now the learner is 50% here and 50% there instead of in any particular state.
     - Good news: Markov in these belief vectors.
     - Bad news: the state space is continuous.
     - Good news: it can be solved.
     - Bad news: ...slowly.
     - More bad news: the model is approximate!
  12. Predictions as State
     - Idea: key information can come from the distant past, but it never needs to reach too far into the future (Littman et al. 02).
     - The slide works through a colored-grid example, tracking predictions such as "if I go up, will I see blue?" and "if I go left, will I see red?", and showing which parts of the history can be forgotten; the diagram and prediction table are not reproduced here.
  13. Experience as State
     - Nearest sequence memory (McCallum 1995): relate the current episode to past experience.
     - The k longest matches are treated as the same state for purposes of estimating value and updating (a simplified sketch follows this slide).
     - Current work: extend TD(λ), extend the notion of similarity (allow for soft matches, sensors).
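A much-simplified sketch of the nearest-sequence-memory idea: score each past step that took the candidate action by the length of the (action, observation) suffix it shares with the current episode, then average the values of the k best matches. The data layout, the matching rule, and the stored value field are assumptions for illustration and omit most of McCallum's actual algorithm (for example, how values are updated).

```python
from collections import namedtuple

# One remembered time step: what was done, what was seen, and a value estimate.
Step = namedtuple("Step", "action observation reward value")

def suffix_match_length(history, t, current):
    """Length of the longest common (action, observation) suffix between the
    experience preceding history[t] and the current episode.
    `current` is a list of (action, observation) pairs."""
    n = 0
    while (n < t and n < len(current)
           and history[t - 1 - n][:2] == current[len(current) - 1 - n][:2]):
        n += 1
    return n

def nsm_value_estimate(history, current, action, k=4):
    """Average the stored values of the k past steps that took `action`
    and whose preceding experience best matches the current episode."""
    candidates = [(suffix_match_length(history, t, current), step.value)
                  for t, step in enumerate(history) if step.action == action]
    candidates.sort(reverse=True)          # longest matches first
    top = candidates[:k]
    return sum(v for _, v in top) / max(len(top), 1)
```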
  14. Classification Dialog (Keim & Littman 99)
     - Does the user want to travel to Roma, Torino, or Merino?
     - States: S_R, S_T, S_M, done; transitions lead to done.
     - Actions:
       - QC (What city?)
       - QR, QT, QM (Going to X?)
       - R, T, M (I think X)
     - Observations: yes, no (more reliable), R, T, M (T and M confusable).
     - Objective: reward for the correct classification, cost for questions.
     - (A data-only sketch of this POMDP follows this slide.)
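For concreteness, here is the dialog POMDP written out as plain data. The state, action, and observation names come from the slide; every number (question cost, classification reward, confusion probabilities) is a made-up placeholder, not a value from Keim & Littman.

```python
# State, action, and observation names follow the slide.
STATES = ["S_R", "S_T", "S_M", "done"]
ACTIONS = ["QC", "QR", "QT", "QM", "R", "T", "M"]
OBSERVATIONS = ["yes", "no", "R", "T", "M"]

QUESTION_COST = -1.0        # hypothetical cost per question
CORRECT_REWARD = 10.0       # hypothetical reward for the right classification
WRONG_REWARD = -10.0        # hypothetical penalty for the wrong one

def reward(state, action):
    """Reward for taking `action` in hidden `state` (classification ends
    the dialog, i.e. transitions to "done")."""
    if action.startswith("Q"):
        return QUESTION_COST
    return CORRECT_REWARD if state == f"S_{action}" else WRONG_REWARD

# Observation model for "QC" (What city?): the user names the city, but the
# slide notes that T and M are easily confused; the 0.7/0.3 split is assumed.
P_OBS_GIVEN_QC = {
    "S_R": {"R": 0.9, "T": 0.05, "M": 0.05},
    "S_T": {"T": 0.7, "M": 0.3},
    "S_M": {"M": 0.7, "T": 0.3},
}
```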
  15. Incremental Pruning Output
     - The optimal plan varies with the priors (S_R = S_M).
     - (The slide shows a policy diagram over the belief simplex with corners S_T, S_R, S_M; figure not reproduced.)
  16-20. Optimal policy plots for priors S_T = 0.00, 0.02, 0.22, 0.76, and 0.90 (figures not reproduced).
  21. Wrap Up
     - Reinforcement learning: get the right answer without being told. It is hard, and less developed than supervised learning.
     - Lecture slides are on the web: ml02-rl1.ppt, ml02-rl2.ppt.