Reinforcement Learning
Lisa Torrey, University of Wisconsin – Madison
HAMLET 2009
Outline
- Reinforcement learning: What is it, and why is it important in machine learning? What machine learning algorithms exist for it?
- Q-learning in theory: How does it work? How can it be improved?
- Q-learning in practice: What are the challenges? What are the applications?
- Link with psychology: Do people use similar mechanisms? Do people use other methods that could inspire algorithms?
- Resources for future reference
Machine Learning
Classification: where AI meets statistics
- Given: training data (x1, y1), (x2, y2), (x3, y3), ...
- Learn: a model for making a single prediction or decision
[Diagram: the training data feeds a classification algorithm, which produces a model; the model maps a new input xnew to a prediction ynew]
Animal/Human Learning
[Diagram contrasting forms of learning: memorization (x1 → y1), classification (xnew → ynew), procedural learning (environment → decision), and others]
Procedural Learning
Learning how to act to accomplish goals
- Given: an environment that contains rewards
- Learn: a policy for acting
Important differences from classification:
- You don't get examples of correct answers
- You have to try things in order to learn
A Good Policy
What You Know Matters
Do you know your environment?
- The effects of actions
- The rewards
If yes, you can use Dynamic Programming
- More like planning than learning
- Value Iteration and Policy Iteration
If no, you can use Reinforcement Learning (RL)
- Acting and observing in the environment
RL as Operant Conditioning
RL shapes behavior using reinforcement
- The agent takes actions in an environment (in episodes)
- Those actions change the state and trigger rewards
Through experience, an agent learns a policy for acting
- Given a state, choose an action
- Maximize cumulative reward during an episode
Interesting things about this problem:
- Requires solving credit assignment: which action(s) are responsible for a reward?
- Requires both exploring and exploiting: do what looks best, or see if something else is really best?
Types of Reinforcement Learning
- Search-based: evolution directly on a policy (e.g. genetic algorithms)
- Model-based: build a model of the environment, then use dynamic programming (a memory-intensive learning method)
- Model-free: learn a policy without any model, using temporal difference (TD) methods; requires limited episodic memory (though more helps)
Types of Model-Free RL
- Actor-critic learning: the TD version of Policy Iteration
- Q-learning: the TD version of Value Iteration; the most widely used RL algorithm
Q-Learning: Definitions
- Current state: s
- Current action: a
- Transition function: δ(s, a) = s′ (Markov property: the next state depends only on the current state and action, not on previous states)
- Reward function: r(s, a) ∈ ℝ
- Policy: π(s) = a (in classification we'd have examples (s, π(s)) to learn from)
- Q(s, a) ≈ the value of taking action a from state s
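To make these definitions concrete, here is a minimal sketch in Python of a hypothetical three-state corridor; the states, actions, transition function δ, reward function r, and fixed policy π below are illustrative choices, not part of the original slides.

    # Hypothetical toy environment: a 3-state corridor with a goal at state 2.
    STATES = [0, 1, 2]
    ACTIONS = ["left", "right"]

    def delta(s, a):
        """Transition function delta(s, a) = s' (deterministic here)."""
        return min(s + 1, 2) if a == "right" else max(s - 1, 0)

    def reward(s, a):
        """Reward function r(s, a): 1 for stepping into the goal state."""
        return 1.0 if s != 2 and delta(s, a) == 2 else 0.0

    def pi(s):
        """A fixed policy pi(s) = a; learning will replace this."""
        return "right"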
The Q-function
Q(s, a) estimates the discounted cumulative reward obtained by:
- Starting in state s
- Taking action a
- Following the current policy thereafter
Suppose we have the optimal Q-function. What's the optimal policy in state s?
- The action argmax_b Q(s, b)
But we don't have the optimal Q-function at first
- So act as if we do, and update it after each step so it's closer to optimal
- Eventually it will be optimal!
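As a sketch (reusing the toy states and actions above), the Q-function can be stored as a table that starts at zero, and the policy it implies is just an argmax over that table:

    from collections import defaultdict

    # Q[(s, a)] starts at 0 for every state-action pair.
    Q = defaultdict(float)

    def greedy_action(Q, s, actions):
        """The policy implied by the current Q-function: argmax_b Q(s, b)."""
        return max(actions, key=lambda a: Q[(s, a)])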
Q-Learning: The Procedure
[Diagram of the agent-environment loop: starting from s1 with Q(s1, a) = 0, the agent chooses π(s1) = a1; the environment returns δ(s1, a1) = s2 and r(s1, a1) = r2; the agent updates Q(s1, a1) ← Q(s1, a1) + Δ and chooses π(s2) = a2; the environment returns δ(s2, a2) = s3 and r(s2, a2) = r3; and so on]
Q-Learning: Updates
The basic update equation:
  Q(s, a) ← r(s, a) + max_b Q(s′, b)
With a discount factor γ to give later rewards less impact:
  Q(s, a) ← r(s, a) + γ max_b Q(s′, b)
With a learning rate α for non-deterministic worlds:
  Q(s, a) ← (1 − α) Q(s, a) + α [r(s, a) + γ max_b Q(s′, b)]

Q-Learning: Update Example
[Worked example, shown over several slides, on a small gridworld with states numbered 1–11]
The Need for Exploration
[The same gridworld with states 1–11, with one state marked "Explore!"]
Explore/Exploit Tradeoff
Can't always choose the action with the highest Q-value
- The Q-function is initially unreliable
- Need to explore until it is optimal
Most common method: ε-greedy
- Take a random action in a small fraction ε of steps
- Decay ε over time
There is some work on optimizing exploration (Kearns & Singh, ML 1998), but people usually use this simple method
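A sketch of ε-greedy action selection with a decaying ε; the starting value and decay rate below are arbitrary illustrative choices:

    import random

    def epsilon_greedy(Q, s, actions, epsilon):
        """Explore with probability epsilon; otherwise exploit argmax_b Q(s, b)."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def decayed_epsilon(episode, start=0.2, decay=0.995):
        """One simple way to decay epsilon over training episodes."""
        return start * (decay ** episode)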
Q-Learning: Convergence
Under certain conditions, Q-learning will converge to the correct Q-function:
- The environment model doesn't change
- States and actions are finite
- Rewards are bounded
- The learning rate decays with visits to state-action pairs
- The exploration method guarantees infinite visits to every state-action pair over an infinite training period
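For instance, one common way to satisfy the learning-rate condition (a sketch, not the only valid schedule) is to let α for each state-action pair decay with its visit count:

    from collections import defaultdict

    visits = defaultdict(int)   # N(s, a): how often each pair has been updated

    def decayed_alpha(s, a):
        """A learning rate that decays with visits, e.g. alpha = 1 / N(s, a)."""
        visits[(s, a)] += 1
        return 1.0 / visits[(s, a)]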
Extensions: SARSA
SARSA: take exploration into account in updates
- Use the action actually chosen in the update, rather than the best action
[Diagram contrasting the regular Q-learning update with the SARSA update on a gridworld containing a pit ("PIT!")]
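A sketch contrasting the two updates (constants and Q-table layout follow the earlier sketches): Q-learning bootstraps from the best next action, while SARSA bootstraps from the action the agent actually chose next.

    ALPHA, GAMMA = 0.1, 0.9

    def q_learning_update(Q, s, a, r, s_next, actions):
        """Off-policy target: r + gamma * max_b Q(s', b)."""
        target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    def sarsa_update(Q, s, a, r, s_next, a_next):
        """On-policy target: r + gamma * Q(s', a'), using the chosen a'."""
        target = r + GAMMA * Q[(s_next, a_next)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])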
Extensions: Look-ahead
Look-ahead: do updates over multiple states
- Use some episodic memory to speed credit assignment
[Gridworld example with states 1–11 illustrating multi-step updates]
TD(λ): a weighted combination of look-ahead distances
- The parameter λ controls the weighting
Extensions: Eligibility Traces
Eligibility traces: look-ahead with less memory
- Visiting a state leaves a trace that decays
- Update multiple states at once
- States get credit according to their trace
[Gridworld example with states 1–11 showing decaying traces along the visited path]
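A sketch of accumulating eligibility traces (one common variant; λ and the other constants are illustrative): each visit bumps a trace, and a single TD error is then spread over all recently visited pairs in proportion to their traces.

    from collections import defaultdict

    ALPHA, GAMMA, LAMBDA = 0.1, 0.9, 0.8
    traces = defaultdict(float)     # e(s, a): eligibility of each pair

    def trace_step(Q, td_error, s, a):
        """Apply one TD error to every eligible pair, then decay the traces."""
        traces[(s, a)] += 1.0                   # visiting leaves a trace
        for key, e in list(traces.items()):
            Q[key] += ALPHA * td_error * e      # credit proportional to trace
            traces[key] = GAMMA * LAMBDA * e    # traces decay each step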
Extensions: Options and Hierarchies
Options: create higher-level actions
Hierarchical RL: design a tree of RL tasks
[Example hierarchy: Whole Maze at the root, with Room A and Room B as subtasks]
Extensions: Function Approximation
Function approximation: allows complex environments
- The Q-function table could be too big (or infinitely big!)
- Describe a state by a feature vector f = (f1, f2, ..., fn)
- Then the Q-function can be any regression model, e.g. linear regression: Q(s, a) = w1 f1 + w2 f2 + ... + wn fn
- Cost: convergence guarantees go away in theory, though often not in practice
- Benefit: generalization over similar states
- Easiest if the approximator can be updated incrementally, like neural networks with gradient descent, but you can also do this in batches
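A sketch of the linear case with an incremental (semi-gradient) update; the feature function and action set here are hypothetical placeholders:

    import numpy as np

    ALPHA, GAMMA = 0.01, 0.9
    ACTIONS = ["left", "right"]

    def features(s):
        """Hypothetical feature vector f = (f1, ..., fn) describing state s."""
        return np.array([1.0, float(s), float(s) ** 2])

    # One weight vector per action, so Q(s, a) = w_a . f(s) as on the slide.
    weights = {a: np.zeros(3) for a in ACTIONS}

    def q_value(s, a):
        return weights[a] @ features(s)

    def fa_update(s, a, r, s_next):
        """Semi-gradient Q-learning step on the linear weights."""
        target = r + GAMMA * max(q_value(s_next, b) for b in ACTIONS)
        td_error = target - q_value(s, a)
        weights[a] += ALPHA * td_error * features(s)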
Challenges in Reinforcement Learning
- Feature/reward design can be very involved
  - Online learning (no time for tuning)
  - Continuous features (handled by tiling)
  - Delayed rewards (handled by shaping)
- Parameters can have large effects on learning speed
  - Tuning has just one effect: slowing it down
- Realistic environments can have partial observability
- Realistic environments can be non-stationary
- There may be multiple agents
Applications of Reinforcement Learning
- Tesauro 1995: Backgammon
- Crites & Barto 1996: Elevator scheduling
- Kaelbling et al. 1996: Packaging task
- Singh & Bertsekas 1997: Cell phone channel allocation
- Nevmyvaka et al. 2006: Stock investment decisions
- Ipek et al. 2008: Memory control in hardware
- Kosorok 2009: Chemotherapy treatment decisions
No textbook “killer app”. Possible reasons:
- Just behind the times?
- Too much design and tuning required?
- Training too long or expensive?
- Too much focus on toy domains in research?
Do Brains Perform RL?
Should machine learning researchers care?
- Planes don't fly the way birds do; should machines learn the way people do?
- But why not look for inspiration?
Psychological research does show neuron activity associated with rewards
- Really a prediction error: actual – expected
- Primarily in the striatum
Support for Reward Systems
- Schönberg et al., J. Neuroscience 2007: Good learners have stronger signals in the striatum than bad learners
- Frank et al., Science 2004: Parkinson's patients learn better from negatives; on dopamine medication, they learn better from positives
- Bayer & Glimcher, Neuron 2005: Average firing rate corresponds to positive prediction errors (interestingly, not to negative ones)
- Cohen & Ranganath, J. Neuroscience 2007: ERP magnitude predicts whether subjects change behavior after losing

Support for Specific Mechanisms
Various results in animals support different algorithms:
- Montague et al., J. Neuroscience 1996: TD
- O'Doherty et al., Science 2004: Actor-critic
- Daw, Nature 2005: Parallel model-free and model-based
- Morris et al., Nature 2006: SARSA
- Roesch et al., Nature 2007: Q-learning
Other results support extensions:
- Bogacz et al., Brain Research 2005: Eligibility traces
- Daw, Nature 2006: Novelty bonuses to promote exploration
Mixed results on reward discounting (short vs. long term):
- Ainslie 2001: People are more impulsive than algorithms
- McClure et al., Science 2004: Two parallel systems
- Frank et al., PNAS 2007: Controlled by genetic differences
- Schweighofer et al., J. Neuroscience 2008: Influenced by serotonin
What People Do Better
- Parallelism
  - Separate systems for positive/negative errors
  - Multiple algorithms running simultaneously
- Use of RL in combination with other systems
  - Planning: reasoning about why things do or don't work
  - Advice: someone to imitate or correct us
  - Transfer: knowledge about similar tasks (my work)
- More impulsivity
  - Is this necessarily better?
The goal for machine learning: take inspiration from humans without being limited by their shortcomings
Resources on Reinforcement Learning
- Reinforcement Learning, Sutton & Barto, MIT Press 1998: the standard reference book on computational RL
- Reinforcement Learning, Dayan, Encyclopedia of Cognitive Science 2001: a briefer introduction that still touches on many computational issues
- Reinforcement learning: the good, the bad, and the ugly, Dayan & Niv, Current Opinions in Neurobiology 2008: a comprehensive survey of work on RL in the human brain