Reinforcement Learning Michael L. Littman Slides from http://www.cs.vu.nl/~elena/ml_13light.ppt which appear to have been adapted from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps
Control policies that choose optimal actions
[Read Ch. 13] [Exercise 13.2]
Consider learning to choose actions, e.g.,
Robot learning to dock on battery charger
Learning to choose actions to optimize factory output
Learning to play Backgammon
Note several problem characteristics:
Opportunity for active exploration
Possibility that state only partially observable
Possible need to learn multiple tasks with same sensors/effectors
One Example: TD-Gammon
Learn to play Backgammon. Immediate reward:
+100 if win
-100 if lose
0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to best human player.
Reinforcement Learning Problem
Goal: learn to choose actions that maximize
r₀ + γr₁ + γ²r₂ + … , where 0 ≤ γ < 1
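As a small illustration (not from the original slides), here is a minimal Python sketch of the discounted return being maximized; the reward sequence is a made-up example:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^i * r_i over a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: a win (+100) on the fourth step, zero reward before that.
print(discounted_return([0, 0, 0, 100], gamma=0.9))  # 0.9**3 * 100 = 72.9
```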
Markov Decision Processes
finite set of states S
set of actions A
at each discrete time step t, agent observes state sₜ ∈ S and chooses action aₜ ∈ A
then receives immediate reward rₜ
and state changes to sₜ₊₁
Markov assumption: sₜ₊₁ = δ(sₜ, aₜ) and rₜ = r(sₜ, aₜ)
i.e., rₜ and sₜ₊₁ depend only on current state and action
functions δ and r may be nondeterministic
functions δ and r not necessarily known to agent
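As a concrete (hypothetical) illustration of these pieces, a small deterministic MDP can be written down as plain Python dictionaries; the states, actions, transition function δ, and reward function r below are invented for the example:

```python
# A toy deterministic MDP; states/actions/rewards are made up for illustration.
states = ["s1", "s2", "s3"]      # S
actions = ["left", "right"]      # A

# delta: (state, action) -> next state
delta = {
    ("s1", "left"): "s1", ("s1", "right"): "s2",
    ("s2", "left"): "s1", ("s2", "right"): "s3",
    ("s3", "left"): "s2", ("s3", "right"): "s3",
}

# r: (state, action) -> immediate reward; entering the goal state s3 pays 100.
reward = {(s, a): (100 if s_next == "s3" and s != "s3" else 0)
          for (s, a), s_next in delta.items()}
```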
Agent’s Learning Task
Execute actions in environment, observe results, and
learn action policy π : S → A that maximizes E[rₜ + γrₜ₊₁ + γ²rₜ₊₂ + …] from any starting state in S
here 0 ≤ γ < 1 is the discount factor for future rewards (sometimes makes sense with γ = 1)
Note something new:
Target function is π : S → A
But, we have no training examples of form ⟨s, a⟩
Training examples are of form ⟨⟨s, a⟩, r⟩
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states
V^π(s) ≡ rₜ + γrₜ₊₁ + γ²rₜ₊₂ + … ≡ Σᵢ γⁱ rₜ₊ᵢ
where rₜ, rₜ₊₁, ... are generated by following policy π starting at state s. Restated, the task is to learn the optimal policy π*:
π* ≡ argmax_π V^π(s), (∀s)
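A rough Python sketch of evaluating V^π for a given policy in a deterministic world (reusing the toy delta/reward dictionaries sketched earlier; the finite horizon is an approximation of the infinite discounted sum):

```python
def evaluate_policy(policy, delta, reward, states, gamma=0.9, horizon=100):
    """Approximate V^pi(s) for each state s by rolling the policy forward
    and accumulating the discounted rewards it generates."""
    V = {}
    for start in states:
        s, total = start, 0.0
        for i in range(horizon):
            a = policy[s]                        # action chosen by pi in s
            total += (gamma ** i) * reward[(s, a)]
            s = delta[(s, a)]                    # deterministic transition
        V[start] = total
    return V

# Example: a policy that always moves right, using the toy MDP above.
# V = evaluate_policy({s: "right" for s in states}, delta, reward, states)
```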
What to Learn
We might try to have agent learn the evaluation function V^π* (which we write as V*)
It could then do a look-ahead search to choose the best action from any state s because
π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
This can work if agent knows δ : S × A → S, and r : S × A → ℝ
But when it doesn't, it can't choose actions this way
Define a new function very similar to V*:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If agent learns Q, it can choose optimal action even without knowing δ! Q is the evaluation function the agent will learn [Watkins 1989].
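Sketched in Python, choosing the optimal action from a learned Q table needs nothing but the table itself (representing Q as a dictionary keyed by (state, action) is an assumption for the example):

```python
def greedy_action(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a): best action from the Q table alone,
    with no knowledge of delta or r required."""
    return max(actions, key=lambda a: Q[(s, a)])
```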
Training Rule to Learn Q
Note Q and V* are closely related:
V*(s) = max_a' Q(s, a')
This allows us to write Q recursively as
Q(sₜ, aₜ) = r(sₜ, aₜ) + γ V*(δ(sₜ, aₜ)) = r(sₜ, aₜ) + γ max_a' Q(sₜ₊₁, a')
Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule
Q̂(s, a) ← r + γ max_a' Q̂(s', a')
where s' is the state resulting from applying action a in state s.
Q Learning for Deterministic Worlds
For each s, a initialize table entry Q̂(s, a) ← 0
Observe current state s
Do forever:
  Select an action a and execute it
  Receive immediate reward r
  Observe the new state s'
  Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_a' Q̂(s', a')
  s ← s'
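A minimal Python sketch of this loop, assuming a hypothetical environment interface step(s, a) -> (r, s_next) and ε-greedy action selection (the exploration scheme is an added assumption, not part of the slide's algorithm):

```python
import random
from collections import defaultdict

def q_learning(step, states, actions, episodes=1000, steps_per_episode=100,
               gamma=0.9, epsilon=0.1):
    """Tabular Q-learning for a deterministic world.
    step(s, a) -> (r, s_next) is an assumed environment interface."""
    Q = defaultdict(float)                          # table entries start at 0
    for _ in range(episodes):
        s = random.choice(states)                   # observe current state s
        for _ in range(steps_per_episode):
            if random.random() < epsilon:           # select an action a ...
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next = step(s, a)                  # ... execute, get r and s'
            # Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, x)] for x in actions)
            s = s_next                              # s <- s'
    return Q
```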
Updating Q̂
Notice if rewards are non-negative, then
(∀ s, a, n)  Q̂ₙ₊₁(s, a) ≥ Q̂ₙ(s, a)
and
(∀ s, a, n)  0 ≤ Q̂ₙ(s, a) ≤ Q(s, a)
Q̂ converges to Q. Consider case of a deterministic world, with bounded immediate rewards, where each ⟨s, a⟩ is visited infinitely often.
Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂ₙ be the table after n updates, and Δₙ be the maximum error in Q̂ₙ; that is
Δₙ = max_{s,a} |Q̂ₙ(s, a) − Q(s, a)|
For any table entry Q̂ₙ(s, a) updated on iteration n+1, the error in the revised estimate Q̂ₙ₊₁(s, a) is
|Q̂ₙ₊₁(s, a) − Q(s, a)| = |(r + γ max_a' Q̂ₙ(s', a')) − (r + γ max_a' Q(s', a'))|
= γ |max_a' Q̂ₙ(s', a') − max_a' Q(s', a')|
≤ γ max_a' |Q̂ₙ(s', a') − Q(s', a')|
≤ γ Δₙ
Note we used the general fact that
|max_a f₁(a) − max_a f₂(a)| ≤ max_a |f₁(a) − f₂(a)|
This works with operators other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
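A quick numeric sanity check of the non-expansion fact (the values are invented for illustration):

```python
f1 = {"a1": 3.0, "a2": 7.0}
f2 = {"a1": 5.0, "a2": 6.0}
lhs = abs(max(f1.values()) - max(f2.values()))   # |7 - 6| = 1
rhs = max(abs(f1[a] - f2[a]) for a in f1)        # max(2, 1) = 2
assert lhs <= rhs                                # non-expansion holds here
```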
Non-deterministic Case (1)
What if reward and next state are non-deterministic?
We redefine V and Q by taking expected values:
V^π(s) ≡ E[rₜ + γrₜ₊₁ + γ²rₜ₊₂ + …] = E[Σᵢ γⁱ rₜ₊ᵢ]
Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
Nondeterministic Case (2)
Q learning generalizes to nondeterministic worlds. Alter the training rule to
Q̂ₙ(s, a) ← (1 − αₙ) Q̂ₙ₋₁(s, a) + αₙ [r + γ max_a' Q̂ₙ₋₁(s', a')]
where
αₙ = 1 / (1 + visitsₙ(s, a))
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]. Standard properties of the learning rates: Σₙ αₙ = ∞ and Σₙ αₙ² < ∞.
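A sketch of a single update under this nondeterministic rule (representing Q and the visit counts as dictionaries is an assumption for the example):

```python
def nondeterministic_q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One step of: Q(s,a) <- (1-alpha_n) Q(s,a) + alpha_n [r + gamma max_a' Q(s',a')],
    with alpha_n = 1 / (1 + visits_n(s,a))."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Q.get((s_next, x), 0.0) for x in actions)
    Q[(s, a)] = (1.0 - alpha) * Q.get((s, a), 0.0) + alpha * target
```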
Temporal Difference Learning (1)
Q learning: reduce discrepancy between successive Q estimates.
One-step time difference:
Q^(1)(sₜ, aₜ) ≡ rₜ + γ max_a Q̂(sₜ₊₁, a)
Why not two steps?
Q^(2)(sₜ, aₜ) ≡ rₜ + γrₜ₊₁ + γ² max_a Q̂(sₜ₊₂, a)
Or n?
Q^(n)(sₜ, aₜ) ≡ rₜ + γrₜ₊₁ + … + γ^(n−1) rₜ₊ₙ₋₁ + γ^n max_a Q̂(sₜ₊ₙ, a)
Blend all of these:
Q^λ(sₜ, aₜ) ≡ (1 − λ)[Q^(1)(sₜ, aₜ) + λ Q^(2)(sₜ, aₜ) + λ² Q^(3)(sₜ, aₜ) + …]
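A rough Python sketch of computing the blended estimate Q^λ from a stored trajectory (a list of (s, a, r) tuples) and a current Q̂ table; truncating the blend at the end of the stored trajectory is a simplifying assumption:

```python
def q_lambda_target(trajectory, Q, actions, t, gamma=0.9, lam=0.5):
    """Blend the n-step estimates Q^(n) with weights (1 - lambda) * lambda^(n-1),
    truncated at the end of the stored trajectory."""
    T = len(trajectory)
    target, reward_sum = 0.0, 0.0
    for n in range(1, T - t + 1):
        reward_sum += (gamma ** (n - 1)) * trajectory[t + n - 1][2]   # r_{t+n-1}
        if t + n < T:                                                 # bootstrap term
            s_n = trajectory[t + n][0]
            q_n = reward_sum + (gamma ** n) * max(Q.get((s_n, x), 0.0) for x in actions)
        else:
            q_n = reward_sum
        target += (1 - lam) * (lam ** (n - 1)) * q_n                  # weight for Q^(n)
    return target
```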
Temporal Difference Learning (2)
TD(λ) algorithm uses above training rule
Sometimes converges faster than Q learning
Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]
Tesauro's TD-Gammon uses this algorithm
Subtleties and Ongoing Research
Replace table with neural net or other generalizer
Handle case where state only partially observable (next?)