LECTURE 21: REINFORCEMENT LEARNING
Objectives:
Definitions and Terminology
Agent-Environment Interaction
Markov Decision Processes
Value Functions
Bellman Equations
Resources:
RSAB/TO: The RL Problem
AM: RL Tutorial
RKVM: RL SIM
TM: Intro to RL
GT: Active Speech
Attributions
These slides were originally developed by R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction. (They have been reformatted and slightly annotated to better integrate them into this course.)
The original slides have been incorporated into many machine learning courses, including Tim Oates’ Introduction to Machine Learning, which contains links to several good lectures on various topics in machine learning (and is where I first found these slides).
A slightly more advanced version of the same material is available as part of Andrew Moore’s excellent set of statistical data mining tutorials.
The objectives of this lecture are:
describe the RL problem;
present idealized form of the RL problem for which we have precise theoretical results;
introduce key components of the mathematics: value functions and Bellman equations;
describe trade-offs between applicability and mathematical tractability.
The Agent-Environment Interface
[Figure: the agent-environment interaction loop, showing the trajectory of states, actions, and rewards s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, ...]
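As a rough companion to the figure, here is a minimal Python sketch of that interaction loop. The `env` and `policy` interfaces (`env.reset()`, `env.step()`, and a policy callable) are assumptions made for illustration only; they are not part of the original slides.

```python
def run_episode(env, policy, max_steps=100):
    """Step through the s_t, a_t, r_{t+1}, s_{t+1}, ... loop shown in the figure.

    Assumed (hypothetical) interfaces for this sketch:
      env.reset()   -> initial state
      env.step(a)   -> (next_state, reward, done)
      policy(s)     -> action to take in state s
    """
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # agent selects a_t based on s_t
        state, reward, done = env.step(action)   # environment returns r_{t+1} and s_{t+1}
        total_reward += reward
        if done:                                 # a terminal state ends the episode
            break
    return total_reward
```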
The Agent Learns a Policy
Definition of a policy:
The policy at step t, π_t, is a mapping from states to action probabilities.
π_t(s, a) = probability that a_t = a when s_t = s (a minimal tabular sketch of such a policy appears below).
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent’s goal is to get as much reward as it can over the long run.
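To make the policy definition above concrete, here is a minimal Python sketch of a stochastic policy stored as a table of action probabilities. The specific states and actions are hypothetical placeholders, not taken from the slides.

```python
import random

# A stochastic policy pi(s, a): for each state, a distribution over actions.
# The states "s0", "s1" and actions "left", "right" are illustrative only.
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def pi(state, action):
    """pi(s, a) = probability of taking `action` in `state`."""
    return policy[state][action]

def sample_action(state):
    """Draw a_t ~ pi(s_t, .) for the given state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(pi("s0", "left"))     # 0.8
print(sample_action("s1"))  # usually "right"
```

Reinforcement learning methods would then adjust the probabilities in such a table as the agent accumulates experience.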
Learning can occur in several ways:
Adaptation of classifier parameters based on prior and current data (e.g., many help systems now ask you “was this answer helpful to you”).
Selection of the most appropriate next training pattern during classifier training (e.g., active learning).
Common algorithm design issues include rate of convergence, bias vs. variance, adaptation speed, and batch vs. incremental adaptation.
Getting the Degree of Abstraction Right
Time steps need not refer to fixed intervals of real time (e.g., each new training pattern can be considered a time step).
Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), or “mental” (e.g., a shift in focus of attention). Actions can be rule-based (e.g., a user expresses a preference) or mathematics-based (e.g., assignment of a class or update of a probability).
States can be low-level “sensations”, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being “surprised” or “lost”). States can be hidden or observable.
A reinforcement learning (RL) agent is not like a whole animal or robot, which consists of many RL agents as well as other components.
The environment is not necessarily unknown to the agent, only incompletely controllable.
Reward computation is in the agent’s environment because the agent cannot change it arbitrarily.
