Reinforcement Learning
Lisa Torrey, University of Wisconsin – Madison
HAMLET 2009
Outline
- Reinforcement learning: What is it, and why is it important in machine learning? What machine learning algorithms exist for it?
- Q-learning in theory: How does it work? How can it be improved?
- Q-learning in practice: What are the challenges? What are the applications?
- Link with psychology: Do people use similar mechanisms? Do people use other methods that could inspire algorithms?
- Resources for future reference
Machine Learning
Classification: where AI meets statistics
- Given: training data (x1, y1), (x2, y2), (x3, y3), ...
- Learn: a model for making a single prediction or decision
[Diagram: the training data feeds a classification algorithm, which produces a model; the model maps a new input xnew to a prediction ynew]
Animal/Human Learning
[Diagram contrasting forms of learning: memorization (x1 → y1), classification (xnew → ynew), procedural learning (environment → decision), and others]
Procedural Learning
Learning how to act to accomplish goals
- Given: an environment that contains rewards
- Learn: a policy for acting
Important differences from classification:
- You don't get examples of correct answers
- You have to try things in order to learn
A Good Policy
What You Know Matters
Do you know your environment?
- The effects of actions
- The rewards
If yes, you can use Dynamic Programming
- More like planning than learning
- Value Iteration and Policy Iteration
If no, you can use Reinforcement Learning (RL)
- Acting and observing in the environment
RL as Operant Conditioning
RL shapes behavior using reinforcement
- The agent takes actions in an environment (in episodes)
- Those actions change the state and trigger rewards
Through experience, an agent learns a policy for acting
- Given a state, choose an action
- Maximize cumulative reward during an episode
Interesting things about this problem:
- Requires solving credit assignment: which action(s) are responsible for a reward?
- Requires both exploring and exploiting: do what looks best, or see if something else is really best?
Types of Reinforcement Learning
- Search-based: evolution directly on a policy (e.g. genetic algorithms)
- Model-based: build a model of the environment, then use dynamic programming (a memory-intensive learning method)
- Model-free: learn a policy without any model, using temporal difference (TD) methods; requires limited episodic memory (though more helps)
Types of Model-Free RL
- Actor-critic learning: the TD version of Policy Iteration
- Q-learning: the TD version of Value Iteration; the most widely used RL algorithm
Q-Learning: Definitions
- Current state: s
- Current action: a
- Transition function: δ(s, a) = s′ (Markov property: the next state depends only on the current state and action, not on previous states)
- Reward function: r(s, a) ∈ ℝ
- Policy: π(s) = a (in classification we'd have examples (s, π(s)) to learn from)
- Q(s, a) ≈ the value of taking action a from state s
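To make these definitions concrete, here is a minimal sketch in Python of a hypothetical three-state corridor; the states, actions, transition function δ, reward function r, and fixed policy π below are illustrative choices, not part of the original slides.

    # Hypothetical toy environment: a 3-state corridor with a goal at state 2.
    STATES = [0, 1, 2]
    ACTIONS = ["left", "right"]

    def delta(s, a):
        """Transition function delta(s, a) = s' (deterministic here)."""
        return min(s + 1, 2) if a == "right" else max(s - 1, 0)

    def reward(s, a):
        """Reward function r(s, a): 1 for stepping into the goal state."""
        return 1.0 if s != 2 and delta(s, a) == 2 else 0.0

    def pi(s):
        """A fixed policy pi(s) = a; learning will replace this."""
        return "right"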
The Q-function
Q(s, a) estimates the discounted cumulative reward obtained by:
- Starting in state s
- Taking action a
- Following the current policy thereafter
Suppose we have the optimal Q-function. What's the optimal policy in state s?
- The action argmax_b Q(s, b)
But we don't have the optimal Q-function at first
- So act as if we do, and update it after each step so it's closer to optimal
- Eventually it will be optimal!
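As a sketch (reusing the toy states and actions above), the Q-function can be stored as a table that starts at zero, and the policy it implies is just an argmax over that table:

    from collections import defaultdict

    # Q[(s, a)] starts at 0 for every state-action pair.
    Q = defaultdict(float)

    def greedy_action(Q, s, actions):
        """The policy implied by the current Q-function: argmax_b Q(s, b)."""
        return max(actions, key=lambda a: Q[(s, a)])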
Q-Learning: The Procedure
[Diagram of the agent-environment loop: starting from s1 with Q(s1, a) = 0, the agent chooses π(s1) = a1; the environment returns δ(s1, a1) = s2 and r(s1, a1) = r2; the agent updates Q(s1, a1) ← Q(s1, a1) + Δ and chooses π(s2) = a2; the environment returns δ(s2, a2) = s3 and r(s2, a2) = r3; and so on]
Q-Learning: Updates
The basic update equation:
  Q(s, a) ← r(s, a) + max_b Q(s′, b)
With a discount factor γ to give later rewards less impact:
  Q(s, a) ← r(s, a) + γ max_b Q(s′, b)
With a learning rate α for non-deterministic worlds:
  Q(s, a) ← (1 − α) Q(s, a) + α [r(s, a) + γ max_b Q(s′, b)]

Q-Learning: Update Example
[Worked example, shown over several slides, on a small gridworld with states numbered 1–11]
The Need for Exploration
[The same gridworld with states 1–11, with one state marked "Explore!"]
Explore/Exploit Tradeoff
Can't always choose the action with the highest Q-value
- The Q-function is initially unreliable
- Need to explore until it is optimal
Most common method: ε-greedy
- Take a random action in a small fraction ε of steps
- Decay ε over time
There is some work on optimizing exploration (Kearns & Singh, ML 1998), but people usually use this simple method
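A sketch of ε-greedy action selection with a decaying ε; the starting value and decay rate below are arbitrary illustrative choices:

    import random

    def epsilon_greedy(Q, s, actions, epsilon):
        """Explore with probability epsilon; otherwise exploit argmax_b Q(s, b)."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def decayed_epsilon(episode, start=0.2, decay=0.995):
        """One simple way to decay epsilon over training episodes."""
        return start * (decay ** episode)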
Q-Learning: Convergence
Under certain conditions, Q-learning will converge to the correct Q-function:
- The environment model doesn't change
- States and actions are finite
- Rewards are bounded
- The learning rate decays with visits to state-action pairs
- The exploration method guarantees infinite visits to every state-action pair over an infinite training period
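For instance, one common way to satisfy the learning-rate condition (a sketch, not the only valid schedule) is to let α for each state-action pair decay with its visit count:

    from collections import defaultdict

    visits = defaultdict(int)   # N(s, a): how often each pair has been updated

    def decayed_alpha(s, a):
        """A learning rate that decays with visits, e.g. alpha = 1 / N(s, a)."""
        visits[(s, a)] += 1
        return 1.0 / visits[(s, a)]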
Extensions: SARSA
SARSA: take exploration into account in updates
- Use the action actually chosen in the update, rather than the best action
[Diagram contrasting the regular Q-learning update with the SARSA update on a gridworld containing a pit ("PIT!")]
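A sketch contrasting the two updates (constants and Q-table layout follow the earlier sketches): Q-learning bootstraps from the best next action, while SARSA bootstraps from the action the agent actually chose next.

    ALPHA, GAMMA = 0.1, 0.9

    def q_learning_update(Q, s, a, r, s_next, actions):
        """Off-policy target: r + gamma * max_b Q(s', b)."""
        target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    def sarsa_update(Q, s, a, r, s_next, a_next):
        """On-policy target: r + gamma * Q(s', a'), using the chosen a'."""
        target = r + GAMMA * Q[(s_next, a_next)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])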
Extensions: Look-ahead
Look-ahead: do updates over multiple states
- Use some episodic memory to speed credit assignment
[Gridworld example with states 1–11 illustrating multi-step updates]
TD(λ): a weighted combination of look-ahead distances
- The parameter λ controls the weighting
Extensions: Eligibility Traces
Eligibility traces: look-ahead with less memory
- Visiting a state leaves a trace that decays
- Update multiple states at once
- States get credit according to their trace
[Gridworld example with states 1–11 showing decaying traces along the visited path]
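A sketch of accumulating eligibility traces (one common variant; λ and the other constants are illustrative): each visit bumps a trace, and a single TD error is then spread over all recently visited pairs in proportion to their traces.

    from collections import defaultdict

    ALPHA, GAMMA, LAMBDA = 0.1, 0.9, 0.8
    traces = defaultdict(float)     # e(s, a): eligibility of each pair

    def trace_step(Q, td_error, s, a):
        """Apply one TD error to every eligible pair, then decay the traces."""
        traces[(s, a)] += 1.0                   # visiting leaves a trace
        for key, e in list(traces.items()):
            Q[key] += ALPHA * td_error * e      # credit proportional to trace
            traces[key] = GAMMA * LAMBDA * e    # traces decay each step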
Extensions: Options and Hierarchies
Options: create higher-level actions
Hierarchical RL: design a tree of RL tasks
[Example hierarchy: Whole Maze at the root, with Room A and Room B as subtasks]
Extensions: Function Approximation
Function approximation: allows complex environments
- The Q-function table could be too big (or infinitely big!)
- Describe a state by a feature vector f = (f1, f2, ..., fn)
- Then the Q-function can be any regression model, e.g. linear regression: Q(s, a) = w1 f1 + w2 f2 + ... + wn fn
- Cost: convergence guarantees go away in theory, though often not in practice
- Benefit: generalization over similar states
- Easiest if the approximator can be updated incrementally, like neural networks with gradient descent, but you can also do this in batches
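A sketch of the linear case with an incremental (semi-gradient) update; the feature function and action set here are hypothetical placeholders:

    import numpy as np

    ALPHA, GAMMA = 0.01, 0.9
    ACTIONS = ["left", "right"]

    def features(s):
        """Hypothetical feature vector f = (f1, ..., fn) describing state s."""
        return np.array([1.0, float(s), float(s) ** 2])

    # One weight vector per action, so Q(s, a) = w_a . f(s) as on the slide.
    weights = {a: np.zeros(3) for a in ACTIONS}

    def q_value(s, a):
        return weights[a] @ features(s)

    def fa_update(s, a, r, s_next):
        """Semi-gradient Q-learning step on the linear weights."""
        target = r + GAMMA * max(q_value(s_next, b) for b in ACTIONS)
        td_error = target - q_value(s, a)
        weights[a] += ALPHA * td_error * features(s)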
Challenges in Reinforcement Learning
- Feature/reward design can be very involved
  - Online learning (no time for tuning)
  - Continuous features (handled by tiling)
  - Delayed rewards (handled by shaping)
- Parameters can have large effects on learning speed
  - Tuning has just one effect: slowing it down
- Realistic environments can have partial observability
- Realistic environments can be non-stationary
- There may be multiple agents
Applications of Reinforcement Learning
- Tesauro 1995: Backgammon
- Crites & Barto 1996: Elevator scheduling
- Kaelbling et al. 1996: Packaging task
- Singh & Bertsekas 1997: Cell phone channel allocation
- Nevmyvaka et al. 2006: Stock investment decisions
- Ipek et al. 2008: Memory control in hardware
- Kosorok 2009: Chemotherapy treatment decisions
No textbook “killer app”. Possible reasons:
- Just behind the times?
- Too much design and tuning required?
- Training too long or expensive?
- Too much focus on toy domains in research?
Do Brains Perform RL?
Should machine learning researchers care?
- Planes don't fly the way birds do; should machines learn the way people do?
- But why not look for inspiration?
Psychological research does show neuron activity associated with rewards
- Really a prediction error: actual – expected
- Primarily in the striatum
Support for Reward Systems
- Schönberg et al., J. Neuroscience 2007: Good learners have stronger signals in the striatum than bad learners
- Frank et al., Science 2004: Parkinson's patients learn better from negatives; on dopamine medication, they learn better from positives
- Bayer & Glimcher, Neuron 2005: Average firing rate corresponds to positive prediction errors (interestingly, not to negative ones)
- Cohen & Ranganath, J. Neuroscience 2007: ERP magnitude predicts whether subjects change behavior after losing

Support for Specific Mechanisms
Various results in animals support different algorithms:
- Montague et al., J. Neuroscience 1996: TD
- O'Doherty et al., Science 2004: Actor-critic
- Daw, Nature 2005: Parallel model-free and model-based
- Morris et al., Nature 2006: SARSA
- Roesch et al., Nature 2007: Q-learning
Other results support extensions:
- Bogacz et al., Brain Research 2005: Eligibility traces
- Daw, Nature 2006: Novelty bonuses to promote exploration
Mixed results on reward discounting (short vs. long term):
- Ainslie 2001: People are more impulsive than algorithms
- McClure et al., Science 2004: Two parallel systems
- Frank et al., PNAS 2007: Controlled by genetic differences
- Schweighofer et al., J. Neuroscience 2008: Influenced by serotonin
What People Do Better
- Parallelism
  - Separate systems for positive/negative errors
  - Multiple algorithms running simultaneously
- Use of RL in combination with other systems
  - Planning: reasoning about why things do or don't work
  - Advice: someone to imitate or correct us
  - Transfer: knowledge about similar tasks (my work)
- More impulsivity
  - Is this necessarily better?
The goal for machine learning: take inspiration from humans without being limited by their shortcomings
Resources on Reinforcement Learning
- Reinforcement Learning, Sutton & Barto, MIT Press 1998: the standard reference book on computational RL
- Reinforcement Learning, Dayan, Encyclopedia of Cognitive Science 2001: a briefer introduction that still touches on many computational issues
- Reinforcement learning: the good, the bad, and the ugly, Dayan & Niv, Current Opinions in Neurobiology 2008: a comprehensive survey of work on RL in the human brain