Successfully reported this slideshow.
Upcoming SlideShare
×

# Lecture notes

398 views

Published on

• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

### Lecture notes

1. 1. Reinforcement Learning in Partially Observable Environments Michael L. Littman
2. 2. Temporal Difference Learning (1) Q learning: reduce discrepancy between successive Q estimates One step time difference: Why not two steps? Or n ? Blend all of these:
3. 3. Temporal Difference Learning (2) <ul><li>TD(  ) algorithm uses above training rule </li></ul><ul><li>Sometimes converges faster than Q learning </li></ul><ul><li>converges for learning V * for any 0    1 [Dayan, 1992] </li></ul><ul><li>Tesauro's TD-Gammon uses this algorithm </li></ul><ul><li>Bias-variance tradeoff [Kearns & Singh 2000] </li></ul><ul><li>Implemented using “eligibility traces” [Sutton 1988] </li></ul><ul><li>Helps overcome non-Markov environments [Loch & Singh, 1998] </li></ul>Equivalent expression:
4. 4. non-Markov Examples <ul><li>Can you solve them? </li></ul>
5. 5. Markov Decision Processes <ul><li>Recall MDP: </li></ul><ul><li>finite set of states S </li></ul><ul><li>set of actions A </li></ul><ul><li>at each discrete time agent observes state s t  S and chooses action a t  A </li></ul><ul><li>receives reward r t, and state changes to s t+1 </li></ul><ul><li>Markov assumption: s t+1 =  (s t , a t ) and r t = r(s t , a t ) </li></ul><ul><ul><li>r t and s t+1 depend only on current state and action </li></ul></ul><ul><ul><li>functions  and r may be nondeterministic </li></ul></ul><ul><ul><li>functions  and r not necessarily known to agent </li></ul></ul>
6. 6. Partially Observable MDPs <ul><li>Same as MDP, but additional observation function  that translates the state into what the learner can observe: o t =  (s t ) </li></ul><ul><li>Transitions and rewards still depend on state, but learner only sees a “shadow”. </li></ul><ul><li>How can we learn what to do? </li></ul>
7. 7. State Approaches to POMDPs <ul><li>Q learning (dynamic programming) states: </li></ul><ul><ul><li>observations </li></ul></ul><ul><ul><li>short histories </li></ul></ul><ul><ul><li>learn POMDP model: most likely state </li></ul></ul><ul><ul><li>learn POMDP model: information state </li></ul></ul><ul><ul><li>learn predictive model: predictive state </li></ul></ul><ul><ul><li>experience as state </li></ul></ul><ul><li>Advantages, disadvantages? </li></ul>
8. 8. Learning a POMDP <ul><li>Input: history (action-observation seq). </li></ul><ul><li>Output: POMDP that “explains” the data. </li></ul><ul><li>EM, iterative algorithm (Baum et al. 70; Chrisman 92) </li></ul>state occupation probabilities POMDP model E: Forward-backward M: Fractional counting
9. 9. EM Pitfalls <ul><li>Each iteration increases data likelihood. </li></ul><ul><li>Local maxima. (Shatkay & Kaelbling 97; Nikovski 99) </li></ul><ul><li>Rarely learns good model. </li></ul><ul><li>Hidden states are truly unobservable. </li></ul>
10. 10. Information State <ul><li>Assumes: </li></ul><ul><li>objective reality </li></ul><ul><li>known “map” </li></ul><ul><li>Also belief state : represents location. </li></ul><ul><li>Vector of probabilities, one for each state. </li></ul><ul><li>Easily updated if model known. (Ex.) </li></ul>
11. 11. Plan with Information States <ul><li>Now, learner is 50% here and 50% there instead of in any particular state. </li></ul><ul><li>Good news: Markov in these vectors </li></ul><ul><li>Bad news: States continuous </li></ul><ul><li>Good news: Can be solved </li></ul><ul><li>Bad news: ...slowly </li></ul><ul><li>More bad news: Model is approximate! </li></ul>
12. 12. Predictions as State <ul><li>Idea: Key information from distance past, but never too far in the future. (Littman et al. 02) </li></ul><ul><li>start at blue : down red ( left red ) odd up __? </li></ul><ul><li>history: forget </li></ul>up blue left red up not blue left not red up blue left not red predict: up blue ? left red ? up not blue left red
13. 13. Experience as State <ul><li>Nearest sequence memory (McCallum 1995) </li></ul><ul><li>Relate current episode to past experience. </li></ul><ul><li>k longest matches considered to be the same for purposes of estimating value and updating. </li></ul><ul><li>Current work: Extend TD(  ), extend notion of similarity (allow for soft matches, sensors) </li></ul>
14. 14. Classification Dialog (Keim & Littman 99) <ul><li>User to travel to Roma, Torino, or Merino? </li></ul><ul><li>States: S R , S T , S M , done . Transitions to done . </li></ul><ul><li>Actions: </li></ul><ul><ul><li>QC (What city?), </li></ul></ul><ul><ul><li>QR , QT , QM (Going to X ?), </li></ul></ul><ul><ul><li>R , T , M (I think X ). </li></ul></ul><ul><li>Observations: </li></ul><ul><ul><li>Yes, no (more reliable), R, T, M (T/M confusable). </li></ul></ul><ul><li>Objective: </li></ul><ul><ul><li>Reward for correct class, cost for questions. </li></ul></ul>
15. 15. Incremental Pruning Output <ul><li>Optimal plan varies with priors ( S R = S M ). </li></ul>S T S R S M
16. 16. S T = 0.00
17. 17. S T = 0.02
18. 18. S T = 0.22
19. 19. S T = 0.76
20. 20. S T = 0.90
21. 21. Wrap Up <ul><li>Reinforcement learning: Get the right answer without being told. Hard, less developed than supervised learning. </li></ul><ul><li>Lecture slides on web. </li></ul><ul><li>http://www.cs.rutgers.edu/~mlittman/talks/ </li></ul><ul><li>ml02-rl1.ppt </li></ul><ul><li>ml02-rl2.ppt </li></ul>