Reinforcement Learning


  1. Reinforcement Learning
     Lisa Torrey, University of Wisconsin – Madison
     HAMLET 2009
  2. Outline
     - Reinforcement learning: What is it and why is it important in machine learning? What machine learning algorithms exist for it?
     - Q-learning in theory: How does it work? How can it be improved?
     - Q-learning in practice: What are the challenges? What are the applications?
     - Link with psychology: Do people use similar mechanisms? Do people use other methods that could inspire algorithms?
     - Resources for future reference
  3. Outline (repeat of slide 2)
  4. Machine Learning
     - Classification: where AI meets statistics
     - Given: training data (x1, y1), (x2, y2), (x3, y3), …
     - Learn: a model for making a single prediction or decision
     [diagram: training data → classification algorithm → model; x_new → model → y_new]
  5. Animal/Human Learning
     - Memorization: x1 → y1
     - Classification: x_new → y_new
     - Procedural: decisions in an environment
     - Other?
  6. Procedural Learning
     - Learning how to act to accomplish goals
     - Given: an environment that contains rewards
     - Learn: a policy for acting
     - Important differences from classification: you don't get examples of correct answers, and you have to try things in order to learn
  7. A Good Policy
     [figure]
  8. What You Know Matters
     - Do you know your environment? (the effects of actions, the rewards)
     - If yes, you can use Dynamic Programming: more like planning than learning; Value Iteration and Policy Iteration
     - If no, you can use Reinforcement Learning (RL): acting and observing in the environment
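For the dynamic-programming case, a minimal value iteration sketch on a toy, fully known environment might look like the following (the states, transitions, and rewards are illustrative placeholders, not taken from the slides):

```python
# Value iteration on a tiny, fully known, deterministic MDP.
# States, actions, transitions, and rewards below are toy placeholders.
states = [0, 1, 2]
actions = ["left", "right"]
# transition[(s, a)] = next state; reaching state 2 yields reward 1.
transition = {(0, "right"): 1, (1, "right"): 2, (2, "right"): 2,
              (0, "left"): 0, (1, "left"): 0, (2, "left"): 1}
reward = {(s, a): (1.0 if transition[(s, a)] == 2 else 0.0)
          for s in states for a in actions}

gamma = 0.9                       # discount factor
V = {s: 0.0 for s in states}      # state values, initially zero

for _ in range(100):              # sweep until (approximately) converged
    V = {s: max(reward[(s, a)] + gamma * V[transition[(s, a)]] for a in actions)
         for s in states}

# The greedy policy with respect to the converged values.
policy = {s: max(actions, key=lambda a: reward[(s, a)] + gamma * V[transition[(s, a)]])
          for s in states}
print(V, policy)
```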
  9. RL as Operant Conditioning
     - RL shapes behavior using reinforcement
     - An agent takes actions in an environment (in episodes); those actions change the state and trigger rewards
     - Through experience, the agent learns a policy for acting: given a state, choose an action, so as to maximize cumulative reward during an episode
     - Interesting things about this problem:
       - Requires solving credit assignment: what action(s) are responsible for a reward?
       - Requires both exploring and exploiting: do what looks best, or see if something else is really best?
  10. Types of Reinforcement Learning
      - Search-based: evolution directly on a policy, e.g. genetic algorithms
      - Model-based: build a model of the environment, then use dynamic programming; memory-intensive
      - Model-free: learn a policy without any model; temporal difference (TD) methods; requires limited episodic memory (though more helps)
  11. Types of Model-Free RL
      - Actor-critic learning: the TD version of Policy Iteration
      - Q-learning: the TD version of Value Iteration; this is the most widely used RL algorithm
  12. Outline (repeat of slide 2)
  13. Q-Learning: Definitions
      - Current state: s
      - Current action: a
      - Transition function: δ(s, a) = s′
      - Reward function: r(s, a) ∈ ℝ
      - Policy: π(s) = a
      - Q(s, a) ≈ value of taking action a from state s
      - Markov property: the next state and reward are independent of previous states given the current state
      - In classification we would have examples (s, π(s)) to learn from
  14. The Q-Function
      - Q(s, a) estimates the discounted cumulative reward obtained by starting in state s, taking action a, and following the current policy thereafter
      - Suppose we have the optimal Q-function: what is the optimal policy in state s? The action argmax_b Q(s, b)
      - But we don't have the optimal Q-function at first
      - So act as if we do, and update it after each step so it's closer to optimal
      - Eventually it will be optimal!
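A minimal sketch of a tabular Q-function and its greedy policy in code (the dictionary representation and names here are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

# Tabular Q-function: Q[(state, action)] -> estimated discounted return.
Q = defaultdict(float)            # unseen state-action pairs default to 0

def greedy_policy(state, actions):
    """pi(s) = argmax_b Q(s, b): choose the action with the highest Q-value."""
    return max(actions, key=lambda a: Q[(state, a)])
```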
  15. Q-Learning: The Procedure
      [agent–environment diagram: the agent starts with Q(s1, a) = 0 and chooses π(s1) = a1; the environment returns s2 = δ(s1, a1) and reward r2 = r(s1, a1); the agent updates Q(s1, a1) ← Q(s1, a1) + Δ and chooses π(s2) = a2; the environment returns s3 = δ(s2, a2) and r3 = r(s2, a2); and so on]
  16. Q-Learning: Updates
      - The basic update equation
      - With a discount factor to give later rewards less impact
      - With a learning rate for non-deterministic worlds
  18. Q-Learning: Update Example
      [gridworld figure: states 1–11]
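The equations themselves did not survive the slide export; the standard tabular Q-learning updates these bullets describe (reconstructed from the usual textbook presentation, not copied from the slides) are:

```latex
% Basic update (deterministic world):
Q(s, a) \leftarrow r(s, a) + \max_{b} Q(s', b)
% With a discount factor \gamma to give later rewards less impact:
Q(s, a) \leftarrow r(s, a) + \gamma \max_{b} Q(s', b)
% With a learning rate \alpha for non-deterministic worlds:
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{b} Q(s', b) - Q(s, a) \right]
```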
  19. Q-Learning: Update Example
      [gridworld figure: states 1–11]
  20. Q-Learning: Update Example
      [gridworld figure: states 1–11]
  21. The Need for Exploration
      [gridworld figure: states 1–11, with one state marked "Explore!"]
  22. Explore/Exploit Tradeoff
      - You can't always choose the action with the highest Q-value: the Q-function is initially unreliable, and you need to explore until it is optimal
      - Most common method: ε-greedy — take a random action in a small fraction (ε) of steps, and decay ε over time
      - There is some work on optimizing exploration (Kearns & Singh, ML 1998), but people usually use this simple method
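A sketch of ε-greedy action selection (the function and variable names are illustrative assumptions):

```python
import random

def epsilon_greedy(state, actions, Q, epsilon):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# A common schedule: start with some exploration and decay it over time.
epsilon = 0.1
decay = 0.999
# epsilon *= decay   # applied once per step or per episode
```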
  23. Q-Learning: Convergence
      - Under certain conditions, Q-learning will converge to the correct Q-function:
        - The environment model doesn't change
        - States and actions are finite
        - Rewards are bounded
        - The learning rate decays with visits to each state–action pair
        - The exploration method guarantees infinite visits to every state–action pair over an infinite training period
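Putting the pieces so far together, a minimal tabular Q-learning loop might look like the following sketch (the `env` object with `reset`/`step` methods is an assumed stand-in for the transition function δ and reward function r, not something defined in the slides):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    Assumes env.reset() -> initial state and
    env.step(state, action) -> (next_state, reward, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Explore vs. exploit.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(s, a)
            # Temporal-difference update toward r + gamma * max_b Q(s', b).
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```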
  24. Extensions: SARSA
      - SARSA: take exploration into account in updates
      - Use the action actually chosen in updates
      [gridworld figure with a pit ("PIT!"), comparing the path under regular Q-learning vs. SARSA]
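For reference, the difference between the two updates (standard forms, reconstructed rather than copied from the slides): Q-learning bootstraps from the best-looking action, while SARSA bootstraps from the action a′ actually chosen next.

```latex
% Q-learning (off-policy): bootstrap from the greedy action in s'
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{b} Q(s', b) - Q(s, a) \right]
% SARSA (on-policy): bootstrap from the action a' actually taken in s'
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]
```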
  26. Extensions: Look-ahead
      - Look-ahead: do updates over multiple states
      - Use some episodic memory to speed credit assignment
      [gridworld figure: states 1–11]
      - TD(λ): a weighted combination of look-ahead distances; the parameter λ controls the weighting
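The weighting the slides refer to is, in the standard TD(λ) formulation (not shown on the slides), a geometric mixture of the n-step returns G_t^{(n)}:

```latex
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)}
```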
  28. Extensions: Eligibility Traces
      - Eligibility traces: look-ahead with less memory
      - Visiting a state leaves a trace that decays
      - Update multiple states at once; states get credit according to their trace
      [gridworld figure: states 1–11]
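One common way to implement this is the backward view with accumulating traces; the sketch below assumes the standard γλ decay and that the TD error has already been computed for the current step (names and defaults are illustrative):

```python
from collections import defaultdict

# Q-values and eligibility traces, both keyed by (state, action).
Q = defaultdict(float)
trace = defaultdict(float)

def traced_update(s, a, td_error, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view eligibility traces: bump the visited pair's trace,
    credit every traced pair in proportion to its trace, then decay all traces."""
    trace[(s, a)] += 1.0                          # accumulating trace on visit
    for key in list(trace.keys()):
        Q[key] += alpha * td_error * trace[key]   # credit assignment via the trace
        trace[key] *= gamma * lam                 # traces decay each step
        if trace[key] < 1e-8:
            del trace[key]                        # drop negligible traces (the memory saving)
```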
  29. Extensions: Options and Hierarchies
      - Options: create higher-level actions
      - Hierarchical RL: design a tree of RL tasks
      [figure: task hierarchy — Whole Maze with subtasks Room A and Room B]
  30. Extensions: Function Approximation
      - Function approximation allows complex environments where the Q-function table would be too big (or infinitely big!)
      - Describe a state by a feature vector f = (f1, f2, …, fn); then the Q-function can be any regression model
      - E.g. linear regression: Q(s, a) = w1 f1 + w2 f2 + … + wn fn
      - Cost: convergence guarantees go away in theory, though often not in practice
      - Benefit: generalization over similar states
      - Easiest if the approximator can be updated incrementally, like neural networks with gradient descent, but you can also do this in batches
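A minimal sketch of the linear case with an incremental gradient update (one weight vector per action and a semi-gradient step are standard choices assumed here, not details from the slides):

```python
import numpy as np

class LinearQ:
    """Q(s, a) = w_a . f(s): a separate weight vector per action over state features."""
    def __init__(self, n_features, actions):
        self.w = {a: np.zeros(n_features) for a in actions}

    def value(self, features, action):
        return float(self.w[action] @ features)

    def update(self, features, action, target, alpha):
        # Incremental (semi-)gradient step toward the TD target,
        # e.g. target = r + gamma * max_b Q(s', b).
        error = target - self.value(features, action)
        self.w[action] += alpha * error * features
```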
  31. Outline (repeat of slide 2)
  32. Challenges in Reinforcement Learning
      - Feature/reward design can be very involved:
        - Online learning (no time for tuning)
        - Continuous features (handled by tiling)
        - Delayed rewards (handled by shaping)
      - Parameters can have large effects on learning speed; tuning has just one effect: slowing it down
      - Realistic environments can have partial observability
      - Realistic environments can be non-stationary
      - There may be multiple agents
  33. Applications of Reinforcement Learning
      - Tesauro 1995: Backgammon
      - Crites & Barto 1996: Elevator scheduling
      - Kaelbling et al. 1996: Packaging task
      - Singh & Bertsekas 1997: Cell phone channel allocation
      - Nevmyvaka et al. 2006: Stock investment decisions
      - Ipek et al. 2008: Memory control in hardware
      - Kosorok 2009: Chemotherapy treatment decisions
      - But no textbook "killer app". Why? Just behind the times? Too much design and tuning required? Training too long or expensive? Too much focus on toy domains in research?
  34. Outline (repeat of slide 2)
  35. Do Brains Perform RL?
      - Should machine learning researchers care? Planes don't fly the way birds do; should machines learn the way people do? But why not look for inspiration?
      - Psychological research does show neuron activity associated with rewards: really prediction error (actual − expected), primarily in the striatum
  36. Support for Reward Systems
      - Schönberg et al., J. Neuroscience 2007: good learners have stronger signals in the striatum than bad learners
      - Frank et al., Science 2004: Parkinson's patients learn better from negatives; on dopamine medication, they learn better from positives
      - Bayer & Glimcher, Neuron 2005: average firing rate corresponds to positive prediction errors (interestingly, not to negative ones)
      - Cohen & Ranganath, J. Neuroscience 2007: ERP magnitude predicts whether subjects change behavior after losing
  37. Support for Specific Mechanisms
      - Various results in animals support different algorithms:
        - Montague et al., J. Neuroscience 1996: TD
        - O'Doherty et al., Science 2004: actor-critic
        - Daw, Nature 2005: parallel model-free and model-based
        - Morris et al., Nature 2006: SARSA
        - Roesch et al., Nature 2007: Q-learning
      - Other results support extensions:
        - Bogacz et al., Brain Research 2005: eligibility traces
        - Daw, Nature 2006: novelty bonuses to promote exploration
      - Mixed results on reward discounting (short vs. long term):
        - Ainslie 2001: people are more impulsive than algorithms
        - McClure et al., Science 2004: two parallel systems
        - Frank et al., PNAS 2007: controlled by genetic differences
        - Schweighofer et al., J. Neuroscience 2008: influenced by serotonin
  38. What People Do Better
      - Parallelism: separate systems for positive/negative errors; multiple algorithms running simultaneously
      - Use of RL in combination with other systems:
        - Planning: reasoning about why things do or don't work
        - Advice: someone to imitate or correct us
        - Transfer: knowledge about similar tasks
      - More impulsivity (is this necessarily better?)
      - The goal for machine learning: take inspiration from humans without being limited by their shortcomings
      [slide callout: "My work"]
  39. Resources on Reinforcement Learning
      - Reinforcement Learning, Sutton & Barto, MIT Press 1998: the standard reference book on computational RL
      - Reinforcement Learning, Dayan, Encyclopedia of Cognitive Science 2001: a briefer introduction that still touches on many computational issues
      - Reinforcement learning: the good, the bad, and the ugly, Dayan & Niv, Current Opinions in Neurobiology 2008: a comprehensive survey of work on RL in the human brain
