1. Reinforcement Learning Michael L. Littman Slides from http://www.cs.vu.nl/~elena/ml_13light.ppt which appear to have been adapted from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps
16. Note we used the general fact that |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|. This works with things other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
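The non-expansion property above is easy to check numerically. The sketch below (my own illustration, not from the slides) verifies |max f − max g| ≤ max |f − g| on random vectors:

```python
import random

# Numeric check of the non-expansion property of max:
#   |max_a f(a) - max_a g(a)| <= max_a |f(a) - g(a)|
# which is the fact the convergence argument relies on.
random.seed(0)
for _ in range(1000):
    f = [random.uniform(-10, 10) for _ in range(5)]
    g = [random.uniform(-10, 10) for _ in range(5)]
    lhs = abs(max(f) - max(g))
    rhs = max(abs(a - b) for a, b in zip(f, g))
    assert lhs <= rhs + 1e-12  # never violated
```

Any operator with this property (e.g. min, or a fixed weighted average) can replace max in the update while preserving convergence, which is the point of the citation.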
18. Nondeterministic Case (2) Q learning generalizes to nondeterministic worlds. Alter the training rule to Q̂_n(s,a) ← (1 − α_n) Q̂_{n−1}(s,a) + α_n [r + γ max_{a′} Q̂_{n−1}(s′,a′)], where α_n = 1/(1 + visits_n(s,a)). Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]. Standard properties of the learning rate: Σ_n α_n = ∞ and Σ_n α_n² < ∞.
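A minimal sketch of this nondeterministic update, on an assumed toy two-state MDP of my own (stochastic rewards and transitions, not from the slides). The learning rate α_n = 1/(1 + visits_n(s,a)) decays per state-action pair, so the rule averages over noisy targets instead of overwriting the estimate:

```python
import random
from collections import defaultdict

GAMMA = 0.9

def step(state, action):
    """Hypothetical stochastic world: noisy reward, random next state."""
    reward = (1.0 if action == 1 else 0.0) + random.gauss(0, 0.1)
    next_state = random.choice([0, 1])
    return reward, next_state

def q_learn(steps=20000):
    Q = defaultdict(float)       # Q-hat estimates, default 0
    visits = defaultdict(int)    # visits_n(s, a)
    state = 0
    random.seed(1)
    for _ in range(steps):
        action = random.choice([0, 1])           # explore uniformly
        reward, next_state = step(state, action)
        visits[(state, action)] += 1
        alpha = 1.0 / (1 + visits[(state, action)])
        target = reward + GAMMA * max(Q[(next_state, a)] for a in (0, 1))
        # Blend old estimate with the new sampled target (the altered rule):
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
        state = next_state
    return Q
```

With this decaying α the estimates settle despite reward noise; here action 1 (expected reward 1) ends up valued about 1 higher than action 0 in each state.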
19. Temporal Difference Learning (1) Q learning: reduce discrepancy between successive Q estimates. One-step time difference: Q^(1)(s_t, a_t) = r_t + γ max_a Q̂(s_{t+1}, a). Why not two steps? Q^(2)(s_t, a_t) = r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a). Or n? Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + ⋯ + γ^(n−1) r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a). Blend all of these: Q^λ(s_t, a_t) = (1 − λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ⋯].
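These targets can be computed directly from a recorded trajectory. A sketch under assumed data (a short made-up reward sequence and fixed Q̂ estimates, not from the slides); a real implementation would truncate the λ-sum at the episode end and give the remaining weight to the final return:

```python
GAMMA = 0.9

# Assumed toy trajectory: rewards r_t, r_{t+1}, ... and the bootstrapped
# estimate max_a Qhat(s_{t+n}, a) at each visited state.
rewards = [1.0, 0.0, 2.0, 1.0]           # r_t, r_{t+1}, r_{t+2}, r_{t+3}
max_qhat = [5.0, 4.0, 6.0, 3.0, 2.0]     # max_a Qhat(s_{t+n}, a), n = 0..4

def n_step_target(n):
    """Q^(n): n discounted rewards plus a discounted bootstrapped tail."""
    tail = GAMMA ** n * max_qhat[n]
    return sum(GAMMA ** k * rewards[k] for k in range(n)) + tail

def lambda_target(lam, n_max=4):
    """Q^lambda = (1 - lam) * sum_{n>=1} lam^(n-1) * Q^(n), truncated at n_max."""
    return (1 - lam) * sum(lam ** (n - 1) * n_step_target(n)
                           for n in range(1, n_max + 1))
```

At λ = 0 the blend reduces to the one-step target Q^(1); as λ → 1 it weights longer lookaheads more heavily, trading bootstrap bias for reward-sequence variance.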