- 1. Reinforcement Learning
  Lisa Torrey, University of Wisconsin – Madison, HAMLET 2009
- 2. Outline
  - Reinforcement learning: What is it and why is it important in machine learning? What machine learning algorithms exist for it?
  - Q-learning in theory: How does it work? How can it be improved?
  - Q-learning in practice: What are the challenges? What are the applications?
  - Link with psychology: Do people use similar mechanisms? Do people use other methods that could inspire algorithms?
  - Resources for future reference
- 3. Outline (repeated)
- 4. Machine Learning
  - Classification: where AI meets statistics
  - Given: training data
  - Learn: a model for making a single prediction or decision
  [Figure: training data (x1, y1), (x2, y2), (x3, y3), … feeds a classification algorithm, which produces a model mapping xnew to ynew]
- 5. Animal/Human Learning
  [Figure: kinds of learning — memorization (x1, y1), classification (xnew, ynew), procedural (environment, decision), other?]
- 6. Procedural Learning
  - Learning how to act to accomplish goals
  - Given: an environment that contains rewards
  - Learn: a policy for acting
  - Important differences from classification:
    - You don't get examples of correct answers
    - You have to try things in order to learn
- 7. A Good Policy
- 8. What You Know Matters
  - Do you know your environment? (The effects of actions; the rewards)
  - If yes, you can use Dynamic Programming
    - More like planning than learning
    - Value Iteration and Policy Iteration (a sketch follows this slide)
  - If no, you can use Reinforcement Learning (RL)
    - Acting and observing in the environment
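When the transition and reward functions are known, dynamic programming applies directly. Below is a minimal sketch of value iteration in Python, assuming a small deterministic tabular environment; the `transitions[s][a]` and `rewards[s][a]` dictionaries are hypothetical stand-ins for δ(s, a) and r(s, a), not something specified in the slides.

```python
# Minimal value-iteration sketch for a known, deterministic, tabular environment.
# `transitions` and `rewards` are hypothetical stand-ins for delta(s, a) and r(s, a).

def value_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        max_change = 0.0
        for s in states:
            best = max(rewards[s][a] + gamma * V[transitions[s][a]] for a in actions)
            max_change = max(max_change, abs(best - V[s]))
            V[s] = best
        if max_change < tol:
            return V
```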
- 9. RL as Operant Conditioning
  - RL shapes behavior using reinforcement
    - An agent takes actions in an environment (in episodes)
    - Those actions change the state and trigger rewards
  - Through experience, an agent learns a policy for acting
    - Given a state, choose an action
    - Maximize cumulative reward during an episode
  - Interesting things about this problem:
    - Requires solving credit assignment: what action(s) are responsible for a reward?
    - Requires both exploring and exploiting: do what looks best, or see if something else is really best?
- 10. Types of Reinforcement Learning
  - Search-based: evolution directly on a policy (e.g. genetic algorithms)
  - Model-based: build a model of the environment, then use dynamic programming (memory-intensive)
  - Model-free: learn a policy without any model
    - Temporal difference (TD) methods
    - Requires limited episodic memory (though more helps)
- 11. Types of Model-Free RL
  - Actor-critic learning: the TD version of Policy Iteration
  - Q-learning: the TD version of Value Iteration
    - This is the most widely used RL algorithm
- 12. Outline (repeated)
- 13. Q-Learning: Definitions
  - Current state: s
  - Current action: a
  - Transition function: δ(s, a) = s′
  - Reward function: r(s, a) ∈ ℝ
  - Policy: π(s) = a
  - Q(s, a) ≈ value of taking action a from state s
  - Markov property: the next state is independent of previous states given the current state
  - In classification we'd have examples (s, π(s)) to learn from
- 14. The Q-function
  - Q(s, a) estimates the discounted cumulative reward (written out below)
    - Starting in state s
    - Taking action a
    - Following the current policy thereafter
  - Suppose we have the optimal Q-function; what's the optimal policy in state s?
    - The action argmaxb Q(s, b)
  - But we don't have the optimal Q-function at first
    - Let's act as if we do, and update it after each step so it's closer to optimal
    - Eventually it will be optimal!
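In symbols, a hedged reconstruction consistent with the definitions on slides 13–14 (the discount factor γ is introduced on slide 17):

```latex
Q^{\pi}(s, a) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 = s,\; a_0 = a,\; a_{t>0} = \pi(s_t)\right],
\qquad
\pi^{*}(s) \;=\; \arg\max_{b}\, Q^{*}(s, b)
```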
- 15. Q-Learning: The Procedure
  [Figure: agent–environment loop. The agent starts with Q(s1, a) = 0 and chooses π(s1) = a1; the environment returns δ(s1, a1) = s2 and r(s1, a1) = r2; the agent updates Q(s1, a1) ← Q(s1, a1) + Δ and chooses π(s2) = a2; the environment returns δ(s2, a2) = s3 and r(s2, a2) = r3; and so on]
- 16. Q-Learning: Updates
  - The basic update equation
  - With a discount factor to give later rewards less impact
  - With a learning rate for non-deterministic worlds
  (These equations appeared as images in the slides; a reconstructed form is given after the update-example slides below.)
- 17–18. Q-Learning: Update Example
  [Figure: grid world with states numbered 1–11]
- 19–20. Q-Learning: Update Example (continued)
  [Figure: grid world with states numbered 1–11]
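The three update formulas named on slide 16 were shown as images; a hedged reconstruction of the standard tabular Q-learning forms they describe (γ is the discount factor, α the learning rate, s′ = δ(s, a)) is:

```latex
% Basic update:
Q(s, a) \;\leftarrow\; r(s, a) + \max_{b} Q(s', b)

% With a discount factor \gamma to give later rewards less impact:
Q(s, a) \;\leftarrow\; r(s, a) + \gamma \max_{b} Q(s', b)

% With a learning rate \alpha for non-deterministic worlds:
Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{b} Q(s', b) - Q(s, a) \right]
```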
- 21. The Need for Exploration
  [Figure: the same grid world, with one state marked "Explore!"]
- 22. Explore/Exploit Tradeoff
  - Can't always choose the action with the highest Q-value
    - The Q-function is initially unreliable
    - Need to explore until it is optimal
  - Most common method: ε-greedy (sketched below)
    - Take a random action in a small fraction of steps (ε)
    - Decay ε over time
  - There is some work on optimizing exploration (Kearns & Singh, ML 1998), but people usually use this simple method
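A minimal sketch of tabular Q-learning with ε-greedy exploration, putting the procedure, the update, and the exploration rule together. The `env` object with `reset()` and `step(state, action)` methods is a hypothetical interface for illustration, not something defined in the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9,
               epsilon=0.2, epsilon_decay=0.995):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(state, action)

            # Q-learning update: bootstrap from the best next action.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state

        epsilon *= epsilon_decay  # decay exploration over time

    return Q
```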
- 23. Q-Learning: Convergence
  - Under certain conditions, Q-learning will converge to the correct Q-function:
    - The environment model doesn't change
    - States and actions are finite
    - Rewards are bounded
    - The learning rate decays with visits to state-action pairs
    - The exploration method guarantees infinite visits to every state-action pair over an infinite training period
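The slide does not give a specific decay schedule; the standard requirement is that the per-pair learning rates satisfy the stochastic-approximation conditions, and one common illustrative choice (an assumption, not from the slides) is to decay with the visit count N(s, a):

```latex
\sum_{n=1}^{\infty} \alpha_n(s, a) = \infty,
\qquad
\sum_{n=1}^{\infty} \alpha_n^{2}(s, a) < \infty,
\qquad \text{e.g.}\quad \alpha_n(s, a) = \frac{1}{1 + N_n(s, a)}
```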
- 24. Extensions: SARSA
  - SARSA: take exploration into account in updates
  - Use the action actually chosen in updates (see the sketch below)
  [Figure: grid world with a pit ("PIT!"), comparing the regular Q-learning behavior and the SARSA behavior]
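A minimal sketch contrasting the SARSA update with the regular Q-learning update, under the same hypothetical tabular setup as the earlier sketch: SARSA bootstraps from the action actually chosen (including exploratory ones), while Q-learning bootstraps from the best next action.

```python
# SARSA (on-policy): the next action is chosen first (e.g. epsilon-greedily),
# and the update bootstraps from that actually-chosen action.
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha=0.1, gamma=0.9):
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# Regular Q-learning (off-policy): the update bootstraps from the best next action,
# regardless of which action the agent will actually take.
def q_update(Q, actions, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```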
- 26. Extensions: Look-ahead
  - Look-ahead: do updates over multiple states
  - Use some episodic memory to speed credit assignment
  [Figure: grid world with states numbered 1–11]
  - TD(λ): a weighted combination of look-ahead distances
  - The parameter λ controls the weighting (a reconstruction of the weighting follows)
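The weighting is not written out on the slide; the standard λ-return it refers to combines the n-step look-ahead targets with geometrically decaying weights. The n-step target shown here bootstraps with a max over actions (a Watkins-style Q-learning form, which is an assumption on my part):

```latex
G_t^{\lambda} \;=\; (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)},
\qquad
G_t^{(n)} \;=\; r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n} \max_{b} Q(s_{t+n}, b)
```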
- 28. Extensions: Eligibility Traces
  - Eligibility traces: look-ahead with less memory (sketched below)
    - Visiting a state leaves a trace that decays
    - Update multiple states at once
    - States get credit according to their trace
  [Figure: grid world with states numbered 1–11]
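A minimal sketch of the eligibility-trace mechanism (a replacing-trace variant; names and structure are illustrative, not taken from the slides): every visited state-action pair keeps a decaying trace, and each TD error is broadcast to all pairs in proportion to their traces. Watkins' Q(λ) additionally resets traces after exploratory actions; that detail is omitted here.

```python
from collections import defaultdict

def trace_update_step(Q, E, actions, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9, lam=0.8):
    """One Q-learning step with eligibility traces (replacing-trace sketch)."""
    # TD error for the current transition.
    best_next = max(Q[(next_state, a)] for a in actions)
    delta = reward + gamma * best_next - Q[(state, action)]

    # Mark the current pair as eligible (replacing trace).
    E[(state, action)] = 1.0

    # Broadcast the error to every eligible pair, then decay the traces.
    for key in list(E.keys()):
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam

# Typical setup: Q = defaultdict(float); E = defaultdict(float), reset at episode start.
```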
- 29. Extensions: Options and Hierarchies
  - Options: create higher-level actions
  - Hierarchical RL: design a tree of RL tasks
  [Figure: task tree with "Whole Maze" at the root and "Room A" and "Room B" below it]
- 30. Extensions: Function Approximation
  - Function approximation: allows complex environments
    - The Q-function table could be too big (or infinitely big!)
  - Describe a state by a feature vector f = (f1, f2, …, fn)
  - Then the Q-function can be any regression model
    - E.g. linear regression: Q(s, a) = w1 f1 + w2 f2 + … + wn fn (sketched below)
  - Cost: convergence guarantees go away in theory, though often not in practice
  - Benefit: generalization over similar states
  - Easiest if the approximator can be updated incrementally, like neural networks with gradient descent, but it can also be done in batches
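A minimal sketch of the linear case, with one weight vector per action and an incremental gradient-style update toward the Q-learning target. The feature extractor `features(state)` returning a NumPy array is a hypothetical helper, not something defined in the slides.

```python
import numpy as np

class LinearQ:
    """Linear Q-function approximation: Q(s, a) = w_a . f(s)."""

    def __init__(self, n_features, actions, alpha=0.01, gamma=0.9):
        self.w = {a: np.zeros(n_features) for a in actions}
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma

    def q(self, f, action):
        return float(np.dot(self.w[action], f))

    def update(self, f, action, reward, f_next, done):
        # Q-learning target, bootstrapped from the best next action.
        best_next = 0.0 if done else max(self.q(f_next, a) for a in self.actions)
        target = reward + self.gamma * best_next
        # Incremental gradient step: for a linear model, the gradient of Q w.r.t. w_a is f.
        error = target - self.q(f, action)
        self.w[action] += self.alpha * error * f

# Usage sketch: f = features(state)   # hypothetical feature extractor returning a NumPy array
```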
- 31. Outline (repeated)
- 32. Challenges in Reinforcement Learning
  - Feature/reward design can be very involved
    - Online learning (no time for tuning)
    - Continuous features (handled by tiling)
    - Delayed rewards (handled by shaping)
  - Parameters can have large effects on learning speed
    - Tuning has just one effect: slowing it down
  - Realistic environments can have partial observability
  - Realistic environments can be non-stationary
  - There may be multiple agents
- 33. Applications of Reinforcement Learning
  - Tesauro 1995: Backgammon
  - Crites & Barto 1996: Elevator scheduling
  - Kaelbling et al. 1996: Packaging task
  - Singh & Bertsekas 1997: Cell phone channel allocation
  - Nevmyvaka et al. 2006: Stock investment decisions
  - Ipek et al. 2008: Memory control in hardware
  - Kosorok 2009: Chemotherapy treatment decisions
  - No textbook “killer app”; possible reasons:
    - Just behind the times?
    - Too much design and tuning required?
    - Training too long or expensive?
    - Too much focus on toy domains in research?
- 34. Outline (repeated)
- 35. Do Brains Perform RL?
  - Should machine learning researchers care?
    - Planes don't fly the way birds do; should machines learn the way people do?
    - But why not look for inspiration?
  - Psychological research does show neuron activity associated with rewards
    - Really prediction error: actual minus expected
    - Primarily in the striatum
- 36. Support for Reward Systems
  - Schönberg et al., J. Neuroscience 2007: good learners have stronger signals in the striatum than bad learners
  - Frank et al., Science 2004: Parkinson's patients learn better from negatives; on dopamine medication, they learn better from positives
  - Bayer & Glimcher, Neuron 2005: average firing rate corresponds to positive prediction errors (interestingly, not to negative ones)
  - Cohen & Ranganath, J. Neuroscience 2007: ERP magnitude predicts whether subjects change behavior after losing
- 37. Support for Specific Mechanisms
  - Various results in animals support different algorithms:
    - Montague et al., J. Neuroscience 1996: TD
    - O'Doherty et al., Science 2004: Actor-critic
    - Daw, Nature 2005: Parallel model-free and model-based
    - Morris et al., Nature 2006: SARSA
    - Roesch et al., Nature 2007: Q-learning
  - Other results support extensions:
    - Bogacz et al., Brain Research 2005: Eligibility traces
    - Daw, Nature 2006: Novelty bonuses to promote exploration
  - Mixed results on reward discounting (short vs. long term):
    - Ainslie 2001: people are more impulsive than algorithms
    - McClure et al., Science 2004: Two parallel systems
    - Frank et al., PNAS 2007: Controlled by genetic differences
    - Schweighofer et al., J. Neuroscience 2008: Influenced by serotonin
- 38. What People Do Better
  - Parallelism
    - Separate systems for positive/negative errors
    - Multiple algorithms running simultaneously
  - Use of RL in combination with other systems
    - Planning: reasoning about why things do or don't work
    - Advice: someone to imitate or correct us
    - Transfer: knowledge about similar tasks (my work)
  - More impulsivity
    - Is this necessarily better?
  - The goal for machine learning: take inspiration from humans without being limited by their shortcomings
- 39. Resources on Reinforcement Learning
  - Reinforcement Learning, Sutton & Barto, MIT Press 1998
    - The standard reference book on computational RL
  - Reinforcement Learning, Dayan, Encyclopedia of Cognitive Science 2001
    - A briefer introduction that still touches on many computational issues
  - Reinforcement learning: the good, the bad, and the ugly, Dayan & Niv, Current Opinion in Neurobiology 2008
    - A comprehensive survey of work on RL in the human brain
