Lecture22

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Favorites, Groups & Events

    Lecture22 - Presentation Transcript

    1. Introduction to Machine Learning Lecture 22 Reinforcement Learning Albert Orriols i Puig http://www.albertorriols.net htt // lb t i l t aorriols@salle.url.edu Artificial Intelligence – Machine Learning g g Enginyeria i Arquitectura La Salle Universitat Ramon Llull
    2. Recap of Lecture 21 Value functions Vπ(s): Long-term reward estimation from s a e s following po cy π o state o o g policy Qπ(s,a): Long-term reward estimation from s a e s e ecu g ac o a o state executing action and then following policy π The long term reward is a recency weighted average of recency-weighted the received rewards …r … at rt+1 at+1 rt+2 at+2 rt+3 at+3 t st st+1 st+2 st+3 Slide 2 Artificial Intelligence Machine Learning
    3. Recap of Lecture 21 Policy A policy, π, is a mapping from states, s∈S, and actions, a∈A(s), to the probability π(s, a) of taking action a when in state s. Slide 3 Artificial Intelligence Machine Learning
    4. Today’s Agenda Bellman equations for value functions Optimal policy Learning the optimal policy Q-learning Slide 4 Artificial Intelligence Machine Learning
    5. Let’s Estimate the Future Reward I want to estimate which will be my reward g y given a certain state and a policy π For the state value function Vπ(s) state-value For the action-value function Qπ(s,a) Slide 5 Artificial Intelligence Machine Learning
    6. Bellman Equation for a Policy π Playing a little with the equations yg q Therefore Finally Slide 6 Artificial Intelligence Machine Learning
    7. Q-value Bellman Equation If we estimate the q-value q Slide 7 Artificial Intelligence Machine Learning
    8. Calculation of Value Functions How to calculate the value functions for a given policy g p y Solve a set of linear equations 1. Bellman equation for Vπ This is a system of |S| linear equations Iterative method (convergence proved) 2. Calculate the value by sweeping through the states Greedy methods 3. Slide 8 Artificial Intelligence Machine Learning
    9. Example: The Gridworld Rewards -1 if the agent goes out of the grid 0 for all the other states except from state A and B From A, all four actions yield a reward of 10 and take the agent to A’ From B, all four actions yield a reward of 5 and take the agent to B’ (b) obtained by solving Policy = equal probability for each movement γ=0.9 Slide 9 Artificial Intelligence Machine Learning
    10. Looking for the Optimal Policy Slide 10 Artificial Intelligence Machine Learning
    11. Optimal Policy We search for a policy that achieves a lot of reward over p y the long run Value functions enable us to define a partial order over policies A policy π is better than or equal to π’ if its expected return is π greater than or equal to that of π’ for all states Optimal policies π* share the optimal state value function V* π state-value V Which can be written as Slide 11 Artificial Intelligence Machine Learning
    12. Learning Optimal Policies Slide 12 Artificial Intelligence Machine Learning
    13. Focusing on the Objective We want to find the optimal policy p p y There are many methods for this purpose Dynamic programming D i i Policy iteration Value iteration [Asynchronous versions] RL algorithms Q-learning Sarsa TD-learning We are going to see Q-learning Slide 13 Artificial Intelligence Machine Learning
    14. Q-learning RL algorithms g Learning by doing Temporal difference method Learn directly from raw experience without a model of the environment’s dynamics Advantages No model of the world needed Good policies before learning the optimal policy Reacts to changes in the environment g Slide 14 Artificial Intelligence Machine Learning
    15. Dynamic Programming in Brief Needs a model of the environment to compute true expected values A very informative backup Slide 15 Artificial Intelligence Machine Learning
    16. Temporal Difference Leraning No model of the world needed Most incremental Slide 16 Artificial Intelligence Machine Learning
    17. Q-learning Based on Q-backups Q p The learned action-value function Q directly approximates Q*, independent of the policy being followed Slide 17 Artificial Intelligence Machine Learning
    18. Q-learning: Pseudo code Pseudo code for Q-learning Q g Slide 18 Artificial Intelligence Machine Learning
    19. Q-learning in Action 15x15 maze world; R(goal)=1; R(other)=0 γ=0.9 α=0.65 Slide 19
    20. Q-learning in Action Initial policy Slide 20
    21. Q-learning in Action After 20 episodes Slide 21
    22. Q-learning in Action After 30 episodes Slide 22
    23. Q-learning in Action After 100 episodes Slide 23
    24. Q-learning in Action After 150 episodes Slide 24
    25. Q-learning in Action After 200 episodes Slide 25
    26. Q-learning in Action After 250 episodes Slide 26
    27. Q-learning in Action After 300 episodes Slide 27
    28. Q-learning in Action After 350 episodes Slide 28
    29. Q-learning in Action After 400 episodes Slide 29
    30. Some Last Remarks Exploration regime p g Explore vs. exploit ε-greedy ε greedy action selection Soft-max action selection Initialization f Q-values: b optimistic I iti li ti of Q l be ti i ti Learning rate α In stationary environments α(s) = 1 / (number of visits to state s) In non-stationary environments α takes a constant value The higher the value the higher the influence of recent value, experiences Slide 30 Artificial Intelligence Machine Learning
    31. Next Class Reinforcement l Rif t learning with LCSs i ith LCS Slide 31 Artificial Intelligence Machine Learning
    32. Introduction to Machine Learning Lecture 22 Reinforcement Learning Albert Orriols i Puig http://www.albertorriols.net htt // lb t i l t aorriols@salle.url.edu Artificial Intelligence – Machine Learning g g Enginyeria i Arquitectura La Salle Universitat Ramon Llull
    SlideShare Zeitgeist 2009

    + Albert Orriols-PuigAlbert Orriols-Puig Nominate

    custom

    250 views, 0 favs, 2 embeds more stats

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 250
      • 246 on SlideShare
      • 4 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 4
    Most viewed embeds
    • 3 views on http://www.albertorriols.net
    • 1 views on http://localhost

    more

    All embeds
    • 3 views on http://www.albertorriols.net
    • 1 views on http://localhost

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories