
- 1. Introduction to Machine Learning Lecture 22 Reinforcement Learning Albert Orriols i Puig http://www.albertorriols.net aorriols@salle.url.edu Artificial Intelligence – Machine Learning Enginyeria i Arquitectura La Salle Universitat Ramon Llull
- 2. Recap of Lecture 21 Value functions. Vπ(s): long-term reward estimate from state s following policy π. Qπ(s,a): long-term reward estimate from state s, executing action a and then following policy π. The long-term reward is a recency-weighted average of the received rewards: st, at → rt+1, st+1, at+1 → rt+2, st+2, at+2 → rt+3, st+3, …
- 3. Recap of Lecture 21 Policy. A policy, π, is a mapping from states, s∈S, and actions, a∈A(s), to the probability π(s, a) of taking action a when in state s.
- 4. Today’s Agenda Bellman equations for value functions; optimal policy; learning the optimal policy; Q-learning
- 5. Let’s Estimate the Future Reward We want to estimate the reward we will obtain given a certain state and a policy π: for the state-value function Vπ(s), and for the action-value function Qπ(s,a).
- 6. Bellman Equation for a Policy π Starting from the definition of the return, Vπ(s) = Eπ[rt+1 + γ rt+2 + γ² rt+3 + … | st = s]. Therefore, Vπ(s) = Eπ[rt+1 + γ Vπ(st+1) | st = s]. Finally, Vπ(s) = Σa π(s,a) Σs' P(s'|s,a) [R(s,a,s') + γ Vπ(s')]
- 7. Q-value Bellman Equation If we estimate the q-value instead: Qπ(s,a) = Σs' P(s'|s,a) [R(s,a,s') + γ Σa' π(s',a') Qπ(s',a')]
- 8. Calculation of Value Functions How to calculate the value functions for a given policy: 1. Solve a set of linear equations: the Bellman equation for Vπ is a system of |S| linear equations. 2. Iterative method (convergence proved): compute the values by sweeping through the states. 3. Greedy methods.
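Method 1 above can be sketched directly: for a fixed policy the Bellman equation is linear, so all |S| values come from one linear solve. The two-state MDP below is a made-up toy example, not one from the lecture.

```python
import numpy as np

# Policy evaluation by solving the |S| linear Bellman equations at once:
# v = r_pi + gamma * P_pi @ v   =>   (I - gamma * P_pi) v = r_pi.
# The two-state MDP below is a made-up toy, not one from the lecture.
gamma = 0.9
P_pi = np.array([[0.5, 0.5],      # P_pi[s, s']: transitions under policy pi
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])       # expected immediate reward in each state

v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

State 1 is absorbing with zero reward, so v[1] = 0 and v[0] = 1 / (1 − 0.45).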
- 9. Example: The Gridworld Rewards: -1 if the agent tries to move off the grid; 0 for all other transitions except from states A and B. From A, all four actions yield a reward of 10 and take the agent to A’. From B, all four actions yield a reward of 5 and take the agent to B’. The value function in (b) is obtained by solving the Bellman equations for the policy that takes each of the four movements with equal probability, with γ = 0.9.
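Method 2 (iterative sweeps) can be sketched on this gridworld. The positions of A, A’, B, B’ below follow the classic Sutton and Barto layout and are an assumption, since the slide’s figure is not reproduced here.

```python
import numpy as np

# Iterative policy evaluation on the 5x5 gridworld under the equiprobable
# random policy. A/A'/B/B' positions follow the standard Sutton & Barto
# layout (an assumption: the slide's figure is not reproduced here).
N, gamma = 5, 0.9
A, A2 = (0, 1), (4, 1)   # from A, every action -> A', reward +10
B, B2 = (0, 3), (2, 3)   # from B, every action -> B', reward +5
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, m):
    if s == A:
        return A2, 10.0
    if s == B:
        return B2, 5.0
    r, c = s[0] + m[0], s[1] + m[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0           # off-grid move: stay put, reward -1

V = np.zeros((N, N))
for _ in range(1000):        # sweep until (practically) converged
    newV = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            for m in moves:  # each action has probability 1/4
                (r2, c2), rew = step((r, c), m)
                newV[r, c] += 0.25 * (rew + gamma * V[r2, c2])
    if np.max(np.abs(newV - V)) < 1e-10:
        V = newV
        break
    V = newV
```

With this layout the sweeps converge to the familiar values: A is worth about 8.8 (less than its immediate reward of 10, since A’ lies near the penalized bottom edge) and B about 5.3.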
- 10. Looking for the Optimal Policy
- 11. Optimal Policy We search for a policy that achieves a lot of reward over the long run. Value functions enable us to define a partial order over policies: a policy π is better than or equal to π’ if its expected return is greater than or equal to that of π’ for all states. Optimal policies π* share the optimal state-value function V*, which can be written as V*(s) = maxπ Vπ(s) for all s∈S.
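The definition V*(s) = maxπ Vπ(s) can be illustrated by brute force on a tiny MDP: evaluate every deterministic policy and take the state-wise maximum. The two-state, two-action dynamics below are invented for illustration.

```python
import numpy as np

# Brute-force illustration of V*(s) = max over policies of V^pi(s):
# enumerate all deterministic policies of a made-up 2-state, 2-action MDP,
# evaluate each by solving its linear Bellman system, take the max per state.
gamma = 0.9
# P[a][s, s'] and R[a][s]: toy dynamics and expected rewards (invented).
P = {0: np.array([[1.0, 0.0], [0.0, 1.0]]),   # action 0: stay
     1: np.array([[0.0, 1.0], [1.0, 0.0]])}   # action 1: switch state
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0])}

def evaluate(policy):  # policy: (action in s0, action in s1)
    P_pi = np.stack([P[policy[s]][s] for s in range(2)])
    r_pi = np.array([R[policy[s]][s] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

values = {pi: evaluate(pi) for pi in [(0, 0), (0, 1), (1, 0), (1, 1)]}
V_star = np.max(np.stack(list(values.values())), axis=0)
```

As the theory promises, a single policy, here (0, 1), attains the maximum in every state simultaneously.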
- 12. Learning Optimal Policies
- 13. Focusing on the Objective We want to find the optimal policy. There are many methods for this purpose: dynamic programming (policy iteration, value iteration, and their asynchronous versions) and RL algorithms (Q-learning, Sarsa, TD-learning). We are going to see Q-learning.
- 14. Q-learning RL algorithms: learning by doing. Q-learning is a temporal-difference method: it learns directly from raw experience, without a model of the environment’s dynamics. Advantages: no model of the world needed; good policies before learning the optimal policy; reacts to changes in the environment.
- 15. Dynamic Programming in Brief Needs a model of the environment to compute true expected values. A very informative backup.
- 16. Temporal Difference Learning No model of the world needed. Most incremental.
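The TD(0) state-value update, which the methods below build on, can be sketched on a single hypothetical transition:

```python
# TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
def td0_update(V, s, r, s_next, alpha=0.5, gamma=0.9):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

# Hypothetical values and one hypothetical observed transition s0 -(r=0)-> s1.
V = {"s0": 0.0, "s1": 1.0}
td0_update(V, "s0", 0.0, "s1")   # V["s0"] moves halfway toward 0 + 0.9 * 1.0
```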
- 17. Q-learning Based on Q-backups: Q(st, at) ← Q(st, at) + α [rt+1 + γ maxa' Q(st+1, a') − Q(st, at)]. The learned action-value function Q directly approximates Q*, independent of the policy being followed.
- 18. Q-learning: Pseudocode
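A runnable sketch of the standard Q-learning loop the slide’s pseudocode refers to: initialize Q, then on each step choose an ε-greedy action, observe r and s', and back up toward r + γ maxa' Q(s', a'). The one-dimensional corridor environment is invented for illustration, not the maze from the slides.

```python
import random

# Minimal Q-learning on a made-up 1-D corridor: states 0..4, actions
# -1 (left) / +1 (right), reward 1 on reaching terminal goal state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]
alpha, gamma, eps = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def choose(s):
    # epsilon-greedy selection with random tie-breaking
    if random.random() < eps:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

random.seed(0)
for _ in range(200):                          # episodes
    s = 0
    while s != GOAL:
        a = choose(s)
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0
        boot = 0.0 if s2 == GOAL else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * boot - Q[(s, a)])
        s = s2

# Greedy policy after learning: move right toward the goal in every state.
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
```

Note that the behavior policy explores (ε-greedy) while the backup uses max over a', which is why Q-learning is off-policy.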
- 19. Q-learning in Action 15x15 maze world; R(goal)=1; R(other)=0; γ=0.9; α=0.65
- 20. Q-learning in Action Initial policy
- 21. Q-learning in Action After 20 episodes
- 22. Q-learning in Action After 30 episodes
- 23. Q-learning in Action After 100 episodes
- 24. Q-learning in Action After 150 episodes
- 25. Q-learning in Action After 200 episodes
- 26. Q-learning in Action After 250 episodes
- 27. Q-learning in Action After 300 episodes
- 28. Q-learning in Action After 350 episodes
- 29. Q-learning in Action After 400 episodes
- 30. Some Last Remarks Exploration regime: explore vs. exploit; ε-greedy action selection; soft-max action selection. Initialization of Q-values: be optimistic. Learning rate α: in stationary environments, α(s) = 1 / (number of visits to state s); in non-stationary environments, α takes a constant value; the higher the value, the higher the influence of recent experiences.
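The two action-selection rules mentioned above can be sketched as follows; the Q-values for the single state are hypothetical.

```python
import math
import random

# Hypothetical Q-values for one state, used by both selection rules.
Q = {"left": 0.2, "right": 0.5, "stay": 0.1}

def epsilon_greedy(Q, eps=0.1):
    # With probability eps explore uniformly, else exploit the best action.
    if random.random() < eps:
        return random.choice(list(Q))
    return max(Q, key=Q.get)

def softmax(Q, tau=0.5):
    # Soft-max (Boltzmann) exploration: higher-valued actions are picked
    # more often; the temperature tau controls how greedy the choice is.
    weights = [math.exp(q / tau) for q in Q.values()]
    return random.choices(list(Q), weights=weights)[0]
```

With eps = 0 the ε-greedy rule is purely greedy; as tau → 0 the soft-max rule approaches the same greedy choice.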
- 31. Next Class Reinforcement learning with LCSs
- 32. Introduction to Machine Learning Lecture 22 Reinforcement Learning Albert Orriols i Puig http://www.albertorriols.net aorriols@salle.url.edu Artificial Intelligence – Machine Learning Enginyeria i Arquitectura La Salle Universitat Ramon Llull
