Reinforcement Learning

  1. Lisa Torrey, University of Wisconsin – Madison, HAMLET 2009
  2.
     • Reinforcement learning
        ◦ What is it and why is it important in machine learning?
        ◦ What machine learning algorithms exist for it?
     • Q-learning in theory
        ◦ How does it work?
        ◦ How can it be improved?
     • Q-learning in practice
        ◦ What are the challenges?
        ◦ What are the applications?
     • Link with psychology
        ◦ Do people use similar mechanisms?
        ◦ Do people use other methods that could inspire algorithms?
     • Resources for future reference
  3.
     • Reinforcement learning
        ◦ What is it and why is it important in machine learning?
        ◦ What machine learning algorithms exist for it?
  4.
     • Classification: where AI meets statistics
        ◦ Given: training data
        ◦ Learn: a model for making a single prediction or decision
     [Diagram: Training Data (x1, y1), (x2, y2), (x3, y3), … → Classification Algorithm → Model; the Model maps a new input xnew to a prediction ynew.]
  5. [Diagram contrasting kinds of learning: Memorization (x1 → y1), Classification (xnew → ynew), Procedural (environment → decision), Other?]
  6.
     • Learning how to act to accomplish goals
        ◦ Given: an environment that contains rewards
        ◦ Learn: a policy for acting
     • Important differences from classification
        ◦ You don’t get examples of correct answers
        ◦ You have to try things in order to learn
  7.
     • Do you know your environment?
        ◦ The effects of actions
        ◦ The rewards
     • If yes, you can use Dynamic Programming
        ◦ More like planning than learning
        ◦ Value Iteration and Policy Iteration (Value Iteration is sketched below)
     • If no, you can use Reinforcement Learning (RL)
        ◦ Acting and observing in the environment
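     Since the slide mentions Value Iteration, here is a minimal Python sketch of it for a known, deterministic environment. This is a generic illustration, not code from the talk; it uses the δ(s, a) and r(s, a) notation introduced on slide 12, and the discount and tolerance values are illustrative assumptions.

     def value_iteration(states, actions, delta, r, gamma=0.9, tol=1e-6):
         # Start with a value of zero for every state.
         V = {s: 0.0 for s in states}
         while True:
             max_change = 0.0
             for s in states:
                 # Bellman backup: best one-step reward plus discounted value of s'.
                 best = max(r(s, a) + gamma * V[delta(s, a)] for a in actions)
                 max_change = max(max_change, abs(best - V[s]))
                 V[s] = best
             if max_change < tol:
                 return V   # the greedy policy with respect to V is then (near-)optimal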
  8.
     • RL shapes behavior using reinforcement
        ◦ The agent takes actions in an environment (in episodes)
        ◦ Those actions change the state and trigger rewards
     • Through experience, an agent learns a policy for acting
        ◦ Given a state, choose an action
        ◦ Maximize cumulative reward during an episode
     • Interesting things about this problem
        ◦ Requires solving credit assignment
           ▪ What action(s) are responsible for a reward?
        ◦ Requires both exploring and exploiting
           ▪ Do what looks best, or see if something else is really best?
  9.
     • Search-based: evolution directly on a policy
        ◦ E.g. genetic algorithms
     • Model-based: build a model of the environment
        ◦ Then you can use dynamic programming
        ◦ Memory-intensive learning method
     • Model-free: learn a policy without any model
        ◦ Temporal difference (TD) methods
        ◦ Requires limited episodic memory (though more helps)
  10.
     • Actor-critic learning
        ◦ The TD version of Policy Iteration
     • Q-learning
        ◦ The TD version of Value Iteration
        ◦ This is the most widely used RL algorithm
  11.
     • Reinforcement learning
        ◦ What is it and why is it important in machine learning?
        ◦ What machine learning algorithms exist for it?
     • Q-learning in theory
        ◦ How does it work?
        ◦ How can it be improved?
  12.
     • Current state: s
     • Current action: a
     • Transition function: δ(s, a) = sʹ
     • Reward function: r(s, a) ∈ ℝ
     • Policy: π(s) = a
     • Q(s, a) ≈ value of taking action a from state s
     Markov property: the next state and reward are independent of previous states given the current state.
     In classification we’d have examples (s, π(s)) to learn from. (A toy environment in this notation is sketched below.)
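     To make the notation concrete, here is a hypothetical Python environment in this vocabulary. Only the 1–11 cell numbering comes from the maze figures later in the deck; the layout, goal cell, and +10 reward are illustrative assumptions, not details from the talk.

     class MazeEnvironment:
         # Deterministic toy world: states are the maze cells 1..11.
         def __init__(self):
             self.actions = ["up", "down", "left", "right"]
             self.goal = 11   # hypothetical goal cell

         def delta(self, s, a):
             # Transition function delta(s, a) = s' on a 3-cells-per-row grid.
             moves = {"left": -1, "right": +1, "up": -3, "down": +3}
             s_next = s + moves[a]
             return s_next if 1 <= s_next <= 11 else s   # hitting a wall: stay put

         def r(self, s, a):
             # Reward function r(s, a): +10 for stepping onto the goal, else 0.
             return 10.0 if self.delta(s, a) == self.goal else 0.0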
  13.
     • Q(s, a) estimates the discounted cumulative reward (written out in full below)
        ◦ Starting in state s
        ◦ Taking action a
        ◦ Following the current policy thereafter
     • Suppose we have the optimal Q-function
        ◦ What’s the optimal policy in state s?
        ◦ The action argmax_b Q(s, b)
     • But we don’t have the optimal Q-function at first
        ◦ Let’s act as if we do
        ◦ And update it after each step so it’s closer to optimal
        ◦ Eventually it will be optimal!
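     For reference, the discounted cumulative reward that Q(s, a) estimates can be written out as follows; this equation is not on the slide, and γ is the discount factor introduced on slide 15:

     Q^{\pi}(s, a) = \mathbb{E}\Big[\, r(s_0, a_0) + \gamma\, r(s_1, a_1) + \gamma^2 r(s_2, a_2) + \cdots \;\Big|\; s_0 = s,\ a_0 = a,\ a_t = \pi(s_t) \text{ for } t \ge 1 \,\Big]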
  14. [Diagram: the agent–environment loop.
      The agent starts in s1 with Q(s1, a) = 0 and chooses π(s1) = a1.
      The environment returns δ(s1, a1) = s2 and reward r(s1, a1) = r2.
      The agent updates Q(s1, a1) ← Q(s1, a1) + Δ, then chooses π(s2) = a2,
      the environment returns δ(s2, a2) = s3 with r(s2, a2) = r3, and so on.]
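     A minimal Python sketch of this loop. The function name, the policy argument, and the update_q argument are hypothetical (an ε-greedy policy and an update rule are sketched after slides 20 and 15); the Q-table is assumed to be a defaultdict(float) keyed by (state, action).

     from collections import defaultdict

     def run_episode(env, Q, policy, update_q, start_state, max_steps=100):
         # One episode of the agent-environment loop from slide 14.
         s = start_state
         for _ in range(max_steps):
             a = policy(Q, s, env.actions)                     # pi(s) = a
             s_next = env.delta(s, a)                          # delta(s, a) = s'
             reward = env.r(s, a)                              # r(s, a)
             update_q(Q, s, a, reward, s_next, env.actions)    # Q(s, a) <- Q(s, a) + Delta
             s = s_next
             if s == env.goal:                                 # episode ends at the goal cell
                 break

     # Example wiring with the hypothetical helpers from the other sketches:
     # env = MazeEnvironment(); Q = defaultdict(float)
     # run_episode(env, Q, epsilon_greedy, update_q, start_state=1)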
  15.
     • The basic update equation:
        Q(s, a) ← r(s, a) + max_b Q(sʹ, b)
     • With a discount factor γ to give later rewards less impact:
        Q(s, a) ← r(s, a) + γ max_b Q(sʹ, b)
     • With a learning rate α for non-deterministic worlds (a Python version of this full form follows):
        Q(s, a) ← (1 − α) Q(s, a) + α [ r(s, a) + γ max_b Q(sʹ, b) ]
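     A sketch of the last (full) form of the update in Python, for a tabular Q stored in a defaultdict(float); the learning-rate and discount values are illustrative placeholders, not settings from the talk.

     ALPHA = 0.1   # learning rate (illustrative value)
     GAMMA = 0.9   # discount factor (illustrative value)

     def update_q(Q, s, a, reward, s_next, actions):
         # Q(s, a) <- (1 - alpha) Q(s, a) + alpha [ r(s, a) + gamma * max_b Q(s', b) ]
         best_next = max(Q[(s_next, b)] for b in actions)
         Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (reward + GAMMA * best_next)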
  16. [Figure: maze with cells 1–11; the agent makes its first update to Q(s1, a1).]
  17. [Figure: maze with cells 1–11; an early update, Q(s9, a) ← 0.]
  18. [Figure: maze with cells 1–11; an update to Q(s8, a) in which the values 0 and 2 appear, as reward starts to propagate back.]
  19. [Figure: maze with cells 1–11; from state 2, argmax_a Q(s2, a) picks the action that currently looks best, but the agent also needs to Explore!]
  20.
     • Can’t always choose the action with highest Q-value
        ◦ The Q-function is initially unreliable
        ◦ Need to explore until it is optimal
     • Most common method: ε-greedy (sketched below)
        ◦ Take a random action in a small fraction of steps (ε)
        ◦ Decay ε over time
     • There is some work on optimizing exploration
        ◦ Kearns & Singh, ML 1998
        ◦ But people usually use this simple method
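     A sketch of ε-greedy action selection; the ε value and the decay schedule in the comment are illustrative choices, not settings from the talk.

     import random

     EPSILON = 0.1   # fraction of steps spent exploring (illustrative)

     def epsilon_greedy(Q, s, actions, epsilon=EPSILON):
         if random.random() < epsilon:
             return random.choice(actions)                # explore: random action
         return max(actions, key=lambda a: Q[(s, a)])     # exploit: argmax_a Q(s, a)

     # Decaying epsilon over time (e.g. epsilon = EPSILON / (1 + episode)) shifts
     # the agent from exploring toward exploiting as the Q-function becomes reliable.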
  21.
     • Under certain conditions, Q-learning will converge to the correct Q-function
        ◦ The environment model doesn’t change
        ◦ States and actions are finite
        ◦ Rewards are bounded
        ◦ The learning rate decays with visits to state–action pairs
        ◦ The exploration method would guarantee infinite visits to every state–action pair over an infinite training period
  22.
     • SARSA: take exploration into account in updates
        ◦ Use the action actually chosen in updates (both rules are sketched in Python below)
     • Regular Q-learning:  Q(s, a) ← r(s, a) + γ max_b Q(sʹ, b)
     • SARSA:  Q(s, a) ← r(s, a) + γ Q(sʹ, aʹ)
     [Figure: the maze example with a PIT! cell.]
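     The two rules side by side in Python, written without the learning rate to match the slide; the γ default is an illustrative value.

     def q_learning_update(Q, s, a, reward, s_next, actions, gamma=0.9):
         # Regular: bootstrap from the best next action, whether or not it is taken.
         Q[(s, a)] = reward + gamma * max(Q[(s_next, b)] for b in actions)

     def sarsa_update(Q, s, a, reward, s_next, a_next, gamma=0.9):
         # SARSA: bootstrap from the action actually chosen (possibly exploratory).
         Q[(s, a)] = reward + gamma * Q[(s_next, a_next)]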
  23.
     • Look-ahead: do updates over multiple states
        ◦ Use some episodic memory to speed credit assignment
        ◦ Two-step example:  Q(s, a) ← r(s, a) + γ r(sʹ, aʹ) + γ² Q(sʹʹ, aʹʹ)
     • TD(λ): a weighted combination of look-ahead distances
        ◦ The parameter λ controls the weighting (the λ-return is written out below)
     [Figure: the maze with cells 1–11, showing a multi-step backup.]
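     For reference, TD(λ)’s weighted combination of look-ahead distances is usually written as the λ-return; this is a standard formulation, not an equation shown on the slide:

     G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)},
     \qquad
     G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n} Q(s_{t+n}, a_{t+n})

     Setting λ = 0 recovers the one-step update, while λ → 1 approaches a full episodic (Monte Carlo) return.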
  24.
     • Eligibility traces: look-ahead with less memory
        ◦ Visiting a state leaves a trace that decays
        ◦ Update multiple states at once
        ◦ States get credit according to their trace (sketched below)
     [Figure: the maze with cells 1–11, with decaying traces along the visited path.]
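     A rough sketch of the trace mechanism in Python, SARSA(λ)-style with accumulating traces; the parameter values are illustrative, and the trace table is assumed to be a defaultdict(float) over (state, action) pairs.

     def trace_update(Q, trace, s, a, reward, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8):
         delta = reward + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # one-step TD error
         trace[(s, a)] += 1.0                                       # visiting leaves a trace
         for key in list(trace):
             Q[key] += alpha * delta * trace[key]                   # credit by trace strength
             trace[key] *= gamma * lam                              # traces decay over time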
  25.
     • Options: create higher-level actions
     • Hierarchical RL: design a tree of RL tasks
     [Figure: a task hierarchy with “Whole Maze” at the root and “Room A” and “Room B” as subtasks.]
  26.
     • Function approximation: allow complex environments
        ◦ The Q-function table could be too big (or infinitely big!)
        ◦ Describe a state by a feature vector f = (f1, f2, …, fn)
        ◦ Then the Q-function can be any regression model
           ▪ E.g. linear regression: Q(s, a) = w1 f1 + w2 f2 + … + wn fn
        ◦ Cost: convergence guarantees go away in theory, though often not in practice
        ◦ Benefit: generalization over similar states
        ◦ Easiest if the approximator can be updated incrementally, like neural networks with gradient descent, but you can also do this in batches (an incremental version is sketched below)
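     A sketch of the linear-regression version with an incremental gradient-style update. The per-action weight vectors, the feature-vector inputs, and the parameter defaults are illustrative choices, not details from the talk.

     import numpy as np

     class LinearQ:
         # Q(s, a) = w1*f1 + ... + wn*fn, with one weight vector per action.
         def __init__(self, n_features, actions, alpha=0.01, gamma=0.9):
             self.w = {a: np.zeros(n_features) for a in actions}
             self.alpha, self.gamma = alpha, gamma

         def value(self, f, a):
             # f is the feature vector describing the current state.
             return float(np.dot(self.w[a], f))

         def update(self, f, a, reward, f_next, actions):
             # Move w[a] a small step toward the one-step bootstrapped target.
             target = reward + self.gamma * max(self.value(f_next, b) for b in actions)
             error = target - self.value(f, a)
             self.w[a] += self.alpha * error * f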
  27.
     • Reinforcement learning
        ◦ What is it and why is it important in machine learning?
        ◦ What machine learning algorithms exist for it?
     • Q-learning in theory
        ◦ How does it work?
        ◦ How can it be improved?
     • Q-learning in practice
        ◦ What are the challenges?
        ◦ What are the applications?
  28.
     • Feature/reward design can be very involved
        ◦ Online learning (no time for tuning)
        ◦ Continuous features (handled by tiling)
        ◦ Delayed rewards (handled by shaping)
     • Parameters can have large effects on learning speed
        ◦ Tuning has just one effect: slowing it down
     • Realistic environments can have partial observability
     • Realistic environments can be non-stationary
     • There may be multiple agents
  29.
     • Tesauro 1995: Backgammon
     • Crites & Barto 1996: Elevator scheduling
     • Kaelbling et al. 1996: Packaging task
     • Singh & Bertsekas 1997: Cell phone channel allocation
     • Nevmyvaka et al. 2006: Stock investment decisions
     • Ipek et al. 2008: Memory control in hardware
     • Kosorok 2009: Chemotherapy treatment decisions
     • No textbook “killer app”
        ◦ Just behind the times?
        ◦ Too much design and tuning required?
        ◦ Training too long or expensive?
        ◦ Too much focus on toy domains in research?
  30.
     • Reinforcement learning
        ◦ What is it and why is it important in machine learning?
        ◦ What machine learning algorithms exist for it?
     • Q-learning in theory
        ◦ How does it work?
        ◦ How can it be improved?
     • Q-learning in practice
        ◦ What are the challenges?
        ◦ What are the applications?
     • Link with psychology
        ◦ Do people use similar mechanisms?
        ◦ Do people use other methods that could inspire algorithms?
  31.
     • Should machine learning researchers care?
        ◦ Planes don’t fly the way birds do; should machines learn the way people do?
        ◦ But why not look for inspiration?
     • Psychological research does show neuron activity associated with rewards
        ◦ Really prediction error: actual − expected
        ◦ Primarily in the striatum
  32.
     • Schönberg et al., J. Neuroscience 2007
        ◦ Good learners have stronger signals in the striatum than bad learners
     • Frank et al., Science 2004
        ◦ Parkinson’s patients learn better from negatives
        ◦ On dopamine medication, they learn better from positives
     • Bayer & Glimcher, Neuron 2005
        ◦ Average firing rate corresponds to positive prediction errors
        ◦ Interestingly, not to negative ones
     • Cohen & Ranganath, J. Neuroscience 2007
        ◦ ERP magnitude predicts whether subjects change behavior after losing
  33.
     • Various results in animals support different algorithms
        ◦ Montague et al., J. Neuroscience 1996: TD
        ◦ O’Doherty et al., Science 2004: Actor-critic
        ◦ Daw, Nature 2005: Parallel model-free and model-based
        ◦ Morris et al., Nature 2006: SARSA
        ◦ Roesch et al., Nature 2007: Q-learning
     • Other results support extensions
        ◦ Bogacz et al., Brain Research 2005: Eligibility traces
        ◦ Daw, Nature 2006: Novelty bonuses to promote exploration
     • Mixed results on reward discounting (short vs. long term)
        ◦ Ainslie 2001: People are more impulsive than algorithms
        ◦ McClure et al., Science 2004: Two parallel systems
        ◦ Frank et al., PNAS 2007: Controlled by genetic differences
        ◦ Schweighofer et al., J. Neuroscience 2008: Influenced by serotonin
  34.
     • Parallelism
        ◦ Separate systems for positive/negative errors
        ◦ Multiple algorithms running simultaneously
     • Use of RL in combination with other systems
        ◦ Planning: reasoning about why things do or don’t work
        ◦ Advice: someone to imitate or correct us
        ◦ Transfer: knowledge about similar tasks
     • More impulsivity
        ◦ Is this necessarily better?
     • The goal for machine learning: take inspiration from humans without being limited by their shortcomings
     [Annotation on the slide: “My work”]
  35.
     • Reinforcement Learning, Sutton & Barto, MIT Press 1998
        ◦ The standard reference book on computational RL
     • Reinforcement Learning, Dayan, Encyclopedia of Cognitive Science 2001
        ◦ A briefer introduction that still touches on many computational issues
     • Reinforcement learning: the good, the bad, and the ugly, Dayan & Niv, Current Opinion in Neurobiology 2008
        ◦ A comprehensive survey of work on RL in the human brain
