Published in: Education, Technology

  1. 1. Introduction to Machine Learning. Lecture 22: Reinforcement Learning. Albert Orriols i Puig, http://www.albertorriols.net, aorriols@salle.url.edu. Artificial Intelligence – Machine Learning, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull
  2. 2. Recap of Lecture 21: Value functions. Vπ(s): long-term reward estimation from state s following policy π. Qπ(s,a): long-term reward estimation from state s executing action a and then following policy π. The long-term reward is a recency-weighted average of the received rewards. Interaction sequence: … s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, …
  3. 3. Recap of Lecture 21: Policy. A policy, π, is a mapping from states, s∈S, and actions, a∈A(s), to the probability π(s, a) of taking action a when in state s.
  4. 4. Today’s Agenda: Bellman equations for value functions. Optimal policy. Learning the optimal policy. Q-learning.
  5. 5. Let’s Estimate the Future Reward. I want to estimate what my reward will be, given a certain state and a policy π. There are two estimates: the state-value function Vπ(s) and the action-value function Qπ(s,a).
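The formulas on this slide were images and did not survive in the transcript. As a sketch, the usual textbook definitions (Sutton and Barto's notation, assumed here rather than taken from the slide) express both estimates as the expected discounted return $R_t$ with discount factor $\gamma$:

```latex
\[
R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
\]
\[
V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}
           = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\}
\]
\[
Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s,\ a_t = a \}
             = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s,\ a_t = a \Big\}
\]
```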
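  6. 6. Bellman Equation for a Policy π. Playing a little with the equations, the return is split into the immediate reward plus the discounted return from the next state; therefore the value of a state can be written in terms of the values of its successor states, which finally gives the Bellman equation for Vπ.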
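The intermediate steps were images in the original deck. A plausible reconstruction, assuming the standard derivation and the hypothetical symbols $P^{a}_{ss'}$ (probability of reaching $s'$ from $s$ under action $a$) and $R^{a}_{ss'}$ (expected reward for that transition), is:

```latex
\[
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots
    = r_{t+1} + \gamma R_{t+1}
\]
\[
V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}
           = E_{\pi}\{ r_{t+1} + \gamma R_{t+1} \mid s_t = s \}
\]
\[
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'}
             \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]
\]
```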
  7. 7. Q-value Bellman Equation. If we estimate the q-value, the same reasoning gives a Bellman equation for Qπ(s,a).
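The slide's equation is missing from the transcript. Under the same assumed notation ($P^{a}_{ss'}$ for transition probabilities and $R^{a}_{ss'}$ for expected transition rewards, neither of which appears in the surviving text), the standard form would be:

```latex
\[
Q^{\pi}(s,a) = \sum_{s'} P^{a}_{ss'}
               \Big[ R^{a}_{ss'} + \gamma \sum_{a'} \pi(s',a')\, Q^{\pi}(s',a') \Big]
\]
\[
V^{\pi}(s) = \sum_{a} \pi(s,a)\, Q^{\pi}(s,a)
\]
```

The second line only restates how the two value functions relate: the state value is the policy-weighted average of the action values.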
  8. 8. Calculation of Value Functions. How to calculate the value functions for a given policy: (1) Solve a set of linear equations: the Bellman equation for Vπ is a system of |S| linear equations. (2) Iterative method (convergence proved): calculate the value by sweeping through the states. (3) Greedy methods.
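As a minimal sketch of the iterative method, the code below sweeps through the states and applies the Bellman equation as an update until the values stop changing. The two-state MDP, its action names, and its rewards are hypothetical and chosen only so that the result can be checked by hand.

```python
# Hypothetical two-state MDP, used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
# Equiprobable policy: pi[s][a] is the probability of taking a in s.
pi = {s: {a: 0.5 for a in P[s]} for s in P}
GAMMA = 0.9


def evaluate_policy(P, pi, gamma, tol=1e-10):
    """Iterative policy evaluation: sweep the states until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman equation for V^pi used as an assignment.
            v_new = sum(
                pi[s][a] * prob * (r + gamma * V[s2])
                for a in P[s]
                for prob, s2, r in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


V = evaluate_policy(P, pi, GAMMA)
print(V)
```

For this toy problem the linear system can also be solved by hand (0.55·V0 − 0.45·V1 = 0.5 and −0.45·V0 + 0.55·V1 = 1), which gives V0 = 7.25 and V1 = 7.75; the sweeps converge to the same values.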
  9. 9. Example: The Gridworld. Rewards: -1 if the agent tries to move out of the grid; 0 for all other moves, except those from states A and B. From A, all four actions yield a reward of 10 and take the agent to A’. From B, all four actions yield a reward of 5 and take the agent to B’. Panel (b): value function obtained by solving the Bellman equation for Vπ. Policy: equal probability for each movement; γ=0.9.
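The figure is not preserved, so the sketch below has to assume a layout: a 5x5 grid with A, A’, B and B’ placed as in Sutton and Barto's gridworld example, and an off-grid move leaving the agent where it is. Those positions are assumptions, not something the transcript states. The rewards, the equiprobable policy and γ=0.9 are taken from the slide.

```python
GAMMA = 0.9
N = 5  # assumed grid size

# Assumed (row, col) positions, following Sutton and Barto's example.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east


def step(state, move):
    """Return (next_state, reward) for a deterministic move."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0  # tried to leave the grid: stay put, reward -1


def evaluate_random_policy(tol=1e-8):
    """Iterative policy evaluation for the equiprobable policy."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            v_new = 0.0
            for m in MOVES:
                s2, reward = step(s, m)
                v_new += 0.25 * (reward + GAMMA * V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


V = evaluate_random_policy()
for r in range(N):
    print(" ".join(f"{V[(r, c)]:5.1f}" for c in range(N)))
```

Under these assumptions the value of A comes out at roughly 8.8, below its immediate reward of 10, because A’ sits on the bottom edge where the random policy often bumps into the wall.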
  10. 10. Looking for the Optimal Policy
  11. 11. Optimal Policy. We search for a policy that achieves a lot of reward over the long run. Value functions enable us to define a partial order over policies: a policy π is better than or equal to π’ if its expected return is greater than or equal to that of π’ for all states. Optimal policies π* share the optimal state-value function V*, which satisfies the Bellman optimality equation.
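The equation that closed this slide is missing from the transcript. In the standard formulation (with the assumed symbols $P^{a}_{ss'}$ for transition probabilities and $R^{a}_{ss'}$ for expected transition rewards), the ordering and the optimal value functions read:

```latex
\[
\pi \ge \pi' \iff V^{\pi}(s) \ge V^{\pi'}(s) \quad \text{for all } s \in S
\]
\[
V^{*}(s) = \max_{\pi} V^{\pi}(s)
         = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'}
           \big[ R^{a}_{ss'} + \gamma V^{*}(s') \big]
\]
\[
Q^{*}(s,a) = \sum_{s'} P^{a}_{ss'}
             \big[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \big]
\]
```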
  12. 12. Learning Optimal Policies
  13. 13. Focusing on the Objective. We want to find the optimal policy. There are many methods for this purpose. Dynamic programming: policy iteration, value iteration, [asynchronous versions]. RL algorithms: Q-learning, Sarsa, TD-learning. We are going to see Q-learning.
  14. 14. Q-learning. RL algorithms: learning by doing. Temporal difference method: learn directly from raw experience without a model of the environment’s dynamics. Advantages: no model of the world needed; good policies before learning the optimal policy; reacts to changes in the environment.
  15. 15. Dynamic Programming in Brief. Needs a model of the environment to compute true expected values. A very informative backup.
  16. 16. Temporal Difference Learning. No model of the world needed. Most incremental.
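The backup diagrams for these two slides are not preserved. As a sketch in standard notation (assumed, not taken from the slides), the contrast is between a full backup, which averages over every successor using the model, and a sample backup, which uses only the one transition actually experienced:

```latex
% Dynamic programming: full backup, needs the model behind the expectation
\[
V(s_t) \leftarrow E_{\pi}\{ r_{t+1} + \gamma V(s_{t+1}) \}
\]
% Temporal difference, TD(0): sample backup after a single step
\[
V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]
\]
```

Here $\alpha$ is the learning rate. The TD update can be applied after every step, which is what makes it incremental.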
  17. 17. Q-learning. Based on Q-backups. The learned action-value function Q directly approximates Q*, independent of the policy being followed.
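The Q-backup itself was an image on the slide. The standard one-step Q-learning update, with learning rate $\alpha$ and discount factor $\gamma$, is:

```latex
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big]
\]
```

The $\max$ over next actions is what makes the method independent of the policy being followed: the target assumes greedy behaviour from $s_{t+1}$ onward, whatever action the agent actually takes there.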
  18. 18. Q-learning: Pseudo code
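The pseudo code on this slide was an image and is not in the transcript. The sketch below is one common way to write tabular Q-learning with ε-greedy exploration; it is not the author's listing. The six-state corridor environment, the function names, and the parameter defaults are all hypothetical.

```python
import random

N_STATES = 6            # hypothetical corridor: states 0..5, goal at state 5
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1
ACTIONS = (LEFT, RIGHT)


def step(state, action):
    """Deterministic move; reward 1 on reaching the goal, 0 otherwise."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def epsilon_greedy(Q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    # Break ties at random so that the untrained agent still moves around.
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def q_learning(episodes=500, alpha=0.65, gamma=0.9, epsilon=0.1,
               max_steps=1000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0  # every episode starts at the left end
        for _ in range(max_steps):
            a = epsilon_greedy(Q, s, epsilon, rng)
            s2, r, done = step(s, a)
            # Off-policy target: best action value in the next state.
            target = r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q


Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)
```

The goal state's row of Q is never updated and stays at zero, so the target for the final transition reduces to the reward alone.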
  19. 19. Q-learning in Action: 15x15 maze world; R(goal)=1; R(other)=0; γ=0.9; α=0.65
  20. 20. Q-learning in Action: Initial policy
  21. 21. Q-learning in Action: After 20 episodes
  22. 22. Q-learning in Action: After 30 episodes
  23. 23. Q-learning in Action: After 100 episodes
  24. 24. Q-learning in Action: After 150 episodes
  25. 25. Q-learning in Action: After 200 episodes
  26. 26. Q-learning in Action: After 250 episodes
  27. 27. Q-learning in Action: After 300 episodes
  28. 28. Q-learning in Action: After 350 episodes
  29. 29. Q-learning in Action: After 400 episodes
  30. 30. Some Last Remarks. Exploration regime: explore vs. exploit; ε-greedy action selection; soft-max action selection. Initialization of Q-values: be optimistic. Learning rate α: in stationary environments, α(s) = 1 / (number of visits to state s); in non-stationary environments, α takes a constant value. The higher the value, the higher the influence of recent experiences.
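The remarks above can be sketched in a few lines. The function names and the `temperature` parameter are hypothetical. The soft-max rule shown is the usual Boltzmann form, which the slide does not spell out.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature):
    """Boltzmann distribution: a higher temperature means more exploration."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_select(q_values, temperature, rng):
    """Sample an action index according to the soft-max probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]


def sample_average(rewards):
    """A step size of 1/n turns the incremental update into a plain average."""
    estimate = 0.0
    for n, r in enumerate(rewards, start=1):
        estimate += (1.0 / n) * (r - estimate)
    return estimate


rng = random.Random(0)
q = [0.2, 1.0, 0.5]
probs = softmax_probs(q, temperature=0.5)
action = softmax_select(q, 0.5, rng)
```

With a 1/n step size every past reward keeps equal weight, which suits stationary problems. A constant α instead weights recent rewards more heavily, which is why it is preferred when the environment changes.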
  31. 31. Next Class: Reinforcement learning with LCSs