
Reinforcement Learning 10. On-policy Control with Approximation


A summary of Chapter 10: On-policy Control with Approximation of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. The full book is available on Professor Sutton's website.

Published in: Technology

  1. 1. Chapter 10: On-policy Control with Approximation Seungjae Ryan Lee
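  2. 2. Episodic 1-step semi-gradient Sarsa ● Approximate action values $\hat{q}(s, a, \mathbf{w})$ instead of state values ● Use the Sarsa target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)$ as the update target ● For a fixed policy, converges in the same way as TD(0), with the same kind of error bound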
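
The one-step update on this slide can be sketched as follows. This is a minimal illustration that assumes a linear approximator (so the gradient of the estimate is just the feature vector); the names `q_hat` and `sarsa_update` and the feature vectors are hypothetical, not taken from the slides.

```python
import numpy as np

def q_hat(w, x):
    """Linear action-value estimate: q_hat(s, a, w) = w . x(s, a)."""
    return float(np.dot(w, x))

def sarsa_update(w, x, reward, x_next, alpha, gamma, terminal):
    """One episodic semi-gradient Sarsa step; returns the updated weights.

    x and x_next are feature vectors for (S, A) and (S', A').
    For a linear q_hat, the gradient with respect to w is x itself.
    """
    # At a terminal transition the target is the reward alone.
    target = reward if terminal else reward + gamma * q_hat(w, x_next)
    td_error = target - q_hat(w, x)
    return w + alpha * td_error * x
```

The update is called "semi-gradient" because the target is treated as a constant: only the estimate for the current pair (S, A) is differentiated.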
  3. 3. Control with Episodic 1-step semi-gradient Sarsa ● Select actions and improve the policy by acting ε-greedily w.r.t. the current estimates $\hat{q}(S_t, \cdot, \mathbf{w}_t)$
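
A small sketch of ε-greedy action selection over the current action-value estimates. The function name and the random tie-breaking are assumptions made for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one.

    q_values: 1-D array of q_hat(S, a, w) for every action a.
    Ties among greedy actions are broken at random.
    """
    q_values = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))
```

Here `rng` is expected to be a NumPy generator, for example `np.random.default_rng(0)`.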
  4. 4. Mountain Car Example ● Task: Drive an underpowered car up a steep mountain road ○ Gravity is stronger than the car's engine ○ Must swing back and forth to build enough inertia ● State: position, velocity ● Actions: Forward (+1), Reverse (-1), No-op (0) ● Reward: -1 on every step until the goal is reached
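
For concreteness, here is a sketch of the task dynamics. The bounds and the physics constants follow the description of this task in the book chapter rather than the slide itself; the function name is hypothetical.

```python
import math

X_MIN, X_MAX = -1.2, 0.5      # position bounds; the goal is at X_MAX
V_MIN, V_MAX = -0.07, 0.07    # velocity bounds

def mountain_car_step(x, v, action):
    """Advance one step; action is +1 (forward), -1 (reverse) or 0 (no-op).

    Returns (next_x, next_v, reward, done).
    """
    # Engine push plus gravity along the slope, then clip the velocity.
    v = v + 0.001 * action - 0.0025 * math.cos(3 * x)
    v = min(max(v, V_MIN), V_MAX)
    x = x + v
    if x <= X_MIN:            # hitting the left wall stops the car
        x, v = X_MIN, 0.0
    done = x >= X_MAX
    if done:
        x = X_MAX
    return x, v, -1.0, done
```

Because the engine term (0.001) is smaller than the largest gravity term (0.0025), full throttle alone cannot climb the hill from the valley floor, which is why the car has to swing back and forth.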
  5. 5. Approximation for Mountain Car ● Tile coding used to construct binary features (8 tilings)
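
A simplified sketch of tile coding for the two state variables. It assumes 8 grid tilings with uniform diagonal offsets, which is a simplification and not the tile-coding software used for the book's experiments; the function name and the grid size are hypothetical.

```python
import math

def active_tiles(x, v, n_tilings=8, tiles_per_dim=8,
                 x_range=(-1.2, 0.5), v_range=(-0.07, 0.07)):
    """Return one active feature index per tiling for the state (x, v)."""
    # Scale each variable so that one tile has width 1.
    xs = (x - x_range[0]) / (x_range[1] - x_range[0]) * tiles_per_dim
    vs = (v - v_range[0]) / (v_range[1] - v_range[0]) * tiles_per_dim
    side = tiles_per_dim + 1          # one extra tile to cover the offsets
    indices = []
    for t in range(n_tilings):
        offset = t / n_tilings        # shift each tiling by a fraction of a tile
        ix = int(math.floor(xs + offset))
        iv = int(math.floor(vs + offset))
        indices.append(t * side * side + ix * side + iv)
    return indices
```

Each state activates exactly one tile per tiling, so the binary feature vector has 8 ones and a linear estimate is just the sum of 8 weights.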
  6. 6. Results of Mountain Car ● Plot the cost-to-go function: $-\max_a \hat{q}(s, a, \mathbf{w})$ ● Initial action values set to 0 ○ Very optimistic, since all true values in this task are negative
  7. 7. Results of Mountain Car
  8. 8. Episodic n-step Semi-gradient Sarsa ● Use n-step return as the update target
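
The n-step return used as the target can be computed as below. This is a sketch; the function name is hypothetical, and the bootstrap value stands for the estimate at the state-action pair reached after n steps.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted n-step return.

    rewards: [R_{t+1}, ..., R_{t+n}]
    bootstrap_value: q_hat(S_{t+n}, A_{t+n}, w), or 0.0 if the episode
        ended within these n steps.
    """
    g = bootstrap_value
    # Fold from the last reward backwards: g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

The weight update then has the same form as in the one-step case, with this return in place of the one-step target.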
  9. 9. Episodic n-step Semi-gradient Sarsa in Practice
  10. 10. Episodic n-step Semi-gradient Sarsa Results ● On Mountain Car, n = 8 compared with n = 1 gives: ○ Faster learning ○ Better asymptotic performance
  11. 11. Episodic n-step Semi-gradient Sarsa Results ● Best performance for intermediate values of n
  12. 12. Average Reward Setting ● Quality of a policy is defined by the average reward $r(\pi)$ obtained while following it ● Applies to continuing tasks, without discounting
  13. 13. Differential Return and Value Functions ● Differential return: sum of the differences between rewards and the average reward ● Differential value functions: expected differential returns
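
Written out, the quantities named on the last two slides are as follows. The notation follows the book; $r(\pi)$ is the average reward of policy $\pi$.

```latex
\begin{align*}
r(\pi) &\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h}
          \mathbb{E}\left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right] \\
G_t &\doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots \\
v_\pi(s) &\doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
q_\pi(s, a) &\doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]
\end{align*}
```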
  14. 14. Bellman Equations ● Remove all γ ● Replace each reward with the difference between the reward and the true average reward
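
Applying those two changes to the usual Bellman equations gives the differential forms below, in the book's notation.

```latex
\begin{align*}
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{r, s'} p(s', r \mid s, a)
            \left[ r - r(\pi) + v_\pi(s') \right] \\
q_\pi(s, a) &= \sum_{r, s'} p(s', r \mid s, a)
            \left[ r - r(\pi) + \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \right]
\end{align*}
```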
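  15. 15. Differential semi-gradient Sarsa ● Same update rule with differential TD error ● Original TD error: $\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$ ● Differential TD error: $\delta_t = R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$, where $\bar{R}_t$ is the current estimate of the average reward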
  16. 16. Differential semi-gradient Sarsa
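
One step of differential semi-gradient Sarsa can be sketched as follows, assuming a linear approximator. The function and argument names are hypothetical; `beta` is the step size for the average-reward estimate.

```python
import numpy as np

def differential_sarsa_update(w, avg_reward, x, reward, x_next, alpha, beta):
    """One differential semi-gradient Sarsa step with a linear q_hat.

    x and x_next are feature vectors for (S, A) and (S', A').
    Returns (new_w, new_avg_reward, td_error).
    """
    # Differential TD error: no discounting, reward measured against the average.
    td_error = reward - avg_reward + np.dot(w, x_next) - np.dot(w, x)
    new_avg_reward = avg_reward + beta * td_error
    new_w = w + alpha * td_error * x
    return new_w, new_avg_reward, float(td_error)
```

Both the weights and the average-reward estimate are driven by the same TD error, so the estimate of the average reward is learned alongside the action values.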
  17. 17. Access-Control Queuing Example ● Agent controls access to 10 servers ○ Agent can accept or reject the customer at the head of the queue ● Customers arrive at a single queue ○ Customers have 4 different priorities, randomly distributed ○ Pay a reward of 1, 2, 4, or 8 when granted access to a server ● Each busy server becomes free with probability p = 0.06 on each time step
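
A toy simulation of this task might look like the sketch below. It assumes equally likely priorities, a reward of 0 for a rejected customer, a freeing probability of 0.06 as in the book's version of the example, and that servers are freed after the accept/reject decision in each step. The ordering within a step and all names are assumptions.

```python
import random

REWARDS = (1, 2, 4, 8)   # payment for each of the four priorities
N_SERVERS = 10
P_FREE = 0.06            # per-step chance that a busy server frees up

def queue_step(free_servers, priority, accept, rng):
    """One step of the access-control task.

    State is (free_servers, priority of the customer at the head of the queue).
    Returns (next_free_servers, next_priority, reward).
    """
    reward = 0
    if accept and free_servers > 0:
        reward = REWARDS[priority]
        free_servers -= 1
    # Each busy server becomes free independently.
    busy = N_SERVERS - free_servers
    free_servers += sum(rng.random() < P_FREE for _ in range(busy))
    next_priority = rng.randrange(len(REWARDS))
    return free_servers, next_priority, reward
```

With only 10 x 4 states and 2 actions, the action values fit in a small table, which is a special case of linear function approximation with one-hot features.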
  18. 18. Access-Control Queuing Results ● Tabular solution with differential semi-gradient Sarsa
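  19. 19. Differential n-step Semi-gradient Sarsa ● Use the differential n-step return as the update target ○ $G_{t:t+n} = R_{t+1} - \bar{R}_{t+n-1} + \cdots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$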
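
In the average reward setting, the n-step return drops the discount factor and subtracts the average-reward estimate from each reward. A minimal sketch, with a hypothetical function name:

```python
def differential_n_step_return(rewards, avg_reward, bootstrap_value):
    """Sum of (R - average reward) over n steps plus the bootstrapped value.

    rewards: [R_{t+1}, ..., R_{t+n}]
    avg_reward: current estimate of the average reward
    bootstrap_value: q_hat(S_{t+n}, A_{t+n}, w)
    """
    return sum(r - avg_reward for r in rewards) + bootstrap_value
```

The n-step TD error is this return minus the current estimate for the pair visited at time t.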
  20. 20. Thank you! Original content from ● Reinforcement Learning: An Introduction by Sutton and Barto