Lecture22

  • Introduction to Machine Learning, Lecture 22: Reinforcement Learning. Albert Orriols i Puig, http://www.albertorriols.net, aorriols@salle.url.edu. Artificial Intelligence – Machine Learning. Enginyeria i Arquitectura La Salle, Universitat Ramon Llull.
  • Recap of Lecture 21: Value functions. Vπ(s): long-term reward estimated from state s when following policy π. Qπ(s,a): long-term reward estimated from state s when executing action a and then following policy π. The long-term reward is a recency-weighted (discounted) sum of the rewards received along the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \dots$: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, with discount rate $0 \le \gamma < 1$.
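A minimal Python sketch of this recency weighting (the helper name `discounted_return` is hypothetical): the return of a finite episode can be accumulated backwards with $R_t = r_{t+1} + \gamma R_{t+1}$, so each reward is discounted once per step of delay.

```python
def discounted_return(rewards, gamma):
    """Return R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...

    `rewards` lists r_{t+1}, r_{t+2}, ... for a finite episode.
    """
    total = 0.0
    # Work backwards using R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A reward of 1 arriving three steps from now is worth gamma^2 = 0.81 today.
print(discounted_return([0, 0, 1], gamma=0.9))
```

The backward loop avoids computing powers of γ explicitly and is the same recursion that the Bellman equations exploit.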
  • Recap of Lecture 21: Policy. A policy, π, is a mapping from states, s∈S, and actions, a∈A(s), to the probability π(s, a) of taking action a when in state s.
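One simple way to store a stochastic policy is a table of probabilities per state. The sketch below assumes two hypothetical states (`s0`, `s1`) and four moves under the equiprobable policy. It checks that each row is a probability distribution over A(s) and samples a ~ π(s, ·) with `random.choices`.

```python
import random

# Hypothetical states and actions, for illustration only.
ACTIONS = ["north", "south", "east", "west"]
# pi[s][a] = probability of taking action a in state s (equiprobable here).
pi = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in ["s0", "s1"]}


def check_policy(policy):
    """Each row pi(s, .) must be a probability distribution over A(s)."""
    for state, probs in policy.items():
        assert all(p >= 0.0 for p in probs.values()), state
        assert abs(sum(probs.values()) - 1.0) < 1e-9, state


def sample_action(policy, state, rng=random):
    """Draw an action a with probability pi(state, a)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


check_policy(pi)
print(sample_action(pi, "s0", random.Random(0)))
```

A deterministic policy is the special case where one action per state has probability 1.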
  • Today’s Agenda: Bellman equations for value functions; optimal policy; learning the optimal policy; Q-learning.
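  • Let’s Estimate the Future Reward: I want to estimate what my reward will be, given a certain state and a policy π. For the state-value function: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$. For the action-value function: $Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$.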
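  • Bellman Equation for a Policy π: Playing a little with the equations, $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = r_{t+1} + \gamma R_{t+1}$. Therefore $V^\pi(s) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$. Finally, $V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$, where $P^a_{ss'}$ is the probability of moving from s to s' under action a and $R^a_{ss'}$ is the expected immediate reward for that transition.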
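To see the Bellman equation at work, take a hypothetical two-state MDP (not from the lecture) with actions `stay` and `move`, the equiprobable policy and γ = 0.9. Writing out the equation gives V(0) = 0.45 V(0) + 0.45 V(1) + 0.5 and V(1) = 0.45 V(0) + 0.45 V(1) + 1, whose solution is V(0) = 7.25, V(1) = 7.75. The sketch checks that this V is a fixed point of the right-hand side.

```python
# Hypothetical two-state MDP (not from the lecture):
# P[s][a] lists (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}
# Equiprobable policy pi(s, a) and discount rate gamma.
pi = {s: {"stay": 0.5, "move": 0.5} for s in P}
GAMMA = 0.9


def bellman_backup(V, s):
    """Right-hand side of the Bellman equation for V^pi at state s:
    sum_a pi(s,a) sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma V(s')]."""
    return sum(
        pi[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
        for a in P[s]
    )


# Solving the two linear equations by hand gives V(0) = 7.25, V(1) = 7.75.
V = {0: 7.25, 1: 7.75}
for s in P:
    # Both numbers agree (up to floating-point rounding): V is a fixed point.
    print(s, V[s], bellman_backup(V, s))
```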
  • Q-value Bellman Equation: If we estimate the Q-value instead, $Q^\pi(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi(s',a') Q^\pi(s',a') \right]$.
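Action values follow from state values through $Q^\pi(s,a) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$, and averaging them over the policy, $V^\pi(s) = \sum_a \pi(s,a) Q^\pi(s,a)$, gives V back. The sketch uses a hypothetical two-state MDP (actions `stay` and `move`, equiprobable policy, γ = 0.9, for which V(0) = 7.25 and V(1) = 7.75 solve the Bellman equations). It builds Q from V and checks that Q satisfies the Bellman equation for Q.

```python
# Hypothetical two-state MDP: P[s][a] lists (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}
pi = {s: {"stay": 0.5, "move": 0.5} for s in P}  # equiprobable policy
GAMMA = 0.9
V = {0: 7.25, 1: 7.75}  # V^pi of this MDP under the equiprobable policy


def q_from_v(V, s, a):
    """Q^pi(s,a) = sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma V^pi(s')]."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def q_backup(Q, s, a):
    """Right-hand side of the Bellman equation for Q^pi."""
    return sum(
        p * (r + GAMMA * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in P[s2]))
        for p, s2, r in P[s][a]
    )


Q = {(s, a): q_from_v(V, s, a) for s in P for a in P[s]}
for (s, a), q in sorted(Q.items()):
    print(s, a, round(q, 3), round(q_backup(Q, s, a), 3))
# Averaging Q over the policy recovers V: V^pi(s) = sum_a pi(s,a) Q^pi(s,a).
print([sum(pi[s][a] * Q[(s, a)] for a in P[s]) for s in P])
```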
  • Calculation of Value Functions: How to calculate the value functions for a given policy: 1. Solve a set of linear equations: the Bellman equation for Vπ is a system of |S| linear equations. 2. Iterative method (convergence proved): calculate the value by sweeping through the states. 3. Greedy methods.
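Methods 1 and 2 can be compared on a hypothetical two-state MDP (actions `stay` and `move`, equiprobable policy, γ = 0.9). The linear system $(I - \gamma P_\pi) V = R_\pi$ is solved directly with Cramer's rule, which is only practical because |S| = 2 here. The iterative version sweeps through the states with in-place backups until the largest change falls below a threshold `theta`, which is an assumed stopping rule.

```python
# Hypothetical two-state MDP: P[s][a] lists (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}
pi = {s: {"stay": 0.5, "move": 0.5} for s in P}  # equiprobable policy
GAMMA = 0.9
STATES = [0, 1]


def policy_evaluation(theta=1e-10):
    """Method 2: sweep through the states, applying the Bellman backup in place,
    until the largest change during a sweep falls below theta."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = sum(
                pi[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V


def linear_solution():
    """Method 1: solve (I - gamma P_pi) V = R_pi directly (Cramer's rule, |S| = 2)."""
    # Expected one-step reward and state-transition matrix under pi.
    R_pi = [sum(pi[s][a] * p * r for a in P[s] for p, _, r in P[s][a]) for s in STATES]
    P_pi = [[sum(pi[s][a] * p for a in P[s] for p, s2, _ in P[s][a] if s2 == t)
             for t in STATES] for s in STATES]
    a00, a01 = 1 - GAMMA * P_pi[0][0], -GAMMA * P_pi[0][1]
    a10, a11 = -GAMMA * P_pi[1][0], 1 - GAMMA * P_pi[1][1]
    det = a00 * a11 - a01 * a10
    return {0: (R_pi[0] * a11 - a01 * R_pi[1]) / det,
            1: (a00 * R_pi[1] - R_pi[0] * a10) / det}


print(policy_evaluation())
print(linear_solution())
```

For large state spaces the sweeping method is usually preferred, since it never forms or inverts the |S| × |S| matrix.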
  • Example: The Gridworld. Rewards: -1 if the agent tries to move off the grid; 0 for all other moves, except from states A and B. From A, all four actions yield a reward of +10 and take the agent to A’. From B, all four actions yield a reward of +5 and take the agent to B’. Policy: equal probability for each movement; γ = 0.9. (Figure: panel (b) shows the state values obtained by solving the Bellman equations for this policy.)
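A sketch that evaluates the equiprobable policy on this gridworld by sweeping through the states. The grid size (5x5), the positions of A, A’, B and B’, and the rule that a move off the grid leaves the agent where it is are assumptions taken from the standard Sutton and Barto version of this example. Under these assumptions V(A) comes out below 10 (about 8.8), because A’ sits on the bottom edge where the random policy often earns -1. V(B) ends up slightly above 5 (about 5.3).

```python
GAMMA = 0.9
N = 5
# Assumed layout (Sutton & Barto's version): (row, column), row 0 at the top.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east


def step(state, move):
    """Deterministic dynamics; returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0  # every action from A jumps to A' with +10
    if state == B:
        return B_PRIME, 5.0   # every action from B jumps to B' with +5
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0  # assumed: moving off the grid leaves the agent in place


def evaluate_random_policy(theta=1e-8):
    """Iterative policy evaluation for pi(s, a) = 1/4 in every state."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            new_v = 0.0
            for m in MOVES:
                s2, reward = step(s, m)
                new_v += 0.25 * (reward + GAMMA * V[s2])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V


V = evaluate_random_policy()
for r in range(N):
    print(" ".join(f"{V[(r, c)]:5.1f}" for c in range(N)))
```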
  • Looking for the Optimal Policy
  • Optimal Policy: We search for a policy that achieves a lot of reward over the long run. Value functions define a partial order over policies: a policy π is better than or equal to π' if its expected return is greater than or equal to that of π' for all states, i.e., π ≥ π' if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$. Optimal policies π* share the optimal state-value function $V^*(s) = \max_\pi V^\pi(s)$ for all $s \in S$, which can be written as $V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$.
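Turning the Bellman optimality equation into an update rule gives value iteration, a dynamic-programming method. The sketch below runs it on the gridworld, with the same assumed Sutton and Barto layout (5x5 grid, A at row 0 column 1, A’ directly below it on the bottom row, B at row 0 column 3, B’ two rows below it, off-grid moves leave the agent in place). Because every move is deterministic, the sum over s' has a single term. As a closed-form check, the best plan from A is to collect +10, land on A’ and walk four steps back up to A. That gives V*(A) = 10 + γ^5 V*(A), i.e. V*(A) = 10 / (1 − 0.9^5) ≈ 24.4.

```python
GAMMA = 0.9
N = 5
# Assumed layout (Sutton & Barto's version): (row, column), row 0 at the top.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}


def step(state, move):
    """Deterministic dynamics; returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0  # assumed: moving off the grid leaves the agent in place


def backup(V, s, move):
    """One-step lookahead value r + gamma V(s') for a deterministic move."""
    s2, reward = step(s, move)
    return reward + GAMMA * V[s2]


def value_iteration(theta=1e-10):
    """Apply V(s) <- max_a [r + gamma V(s')] in sweeps until values stop changing."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            best = max(backup(V, s, m) for m in MOVES.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V


def greedy_policy(V):
    """Any policy that is greedy with respect to V* is optimal."""
    return {s: max(MOVES, key=lambda name: backup(V, s, MOVES[name])) for s in V}


V_star = value_iteration()
policy = greedy_policy(V_star)
for r in range(N):
    print(" ".join(f"{V_star[(r, c)]:5.1f}" for c in range(N)))
# Closed-form check for A: 10 / (1 - gamma^5)
print(V_star[A], 10 / (1 - GAMMA ** 5))
```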
  • Learning Optimal Policies
  • Focusing on the Objective: We want to find the optimal policy. There are many methods for this purpose: dynamic programming (policy iteration, value iteration, and their asynchronous versions) and RL algorithms (Q-learning, Sarsa, TD-learning). We are going to see Q-learning.
  • Q-learning: RL algorithms learn by doing. Q-learning is a temporal-difference method that learns directly from raw experience, without a model of the environment’s dynamics. Advantages: no model of the world is needed; it yields good policies before the optimal policy has been learned; it reacts to changes in the environment.
  • Dynamic Programming in Brief: Needs a model of the environment to compute true expected values. A very informative backup.
  • Temporal-Difference Learning: No model of the world needed. Most incremental.
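The standard TD(0) update, V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)], uses only a sampled transition, so no model is needed and V changes after every single step. The sketch runs it on a hypothetical three-state chain; the chain, α = 0.5 and the episode count are illustrative assumptions. The learner only calls `env_step` and never reads transition probabilities or expected rewards.

```python
# Hypothetical episodic chain 0 -> 1 -> 2 -> end; only the last step pays reward 1.
TERMINAL = "end"


def env_step(state):
    """Environment: returns a sampled (next_state, reward); the learner never sees P or R."""
    if state == 2:
        return TERMINAL, 1.0
    return state + 1, 0.0


def td0(episodes=200, alpha=0.5, gamma=0.9):
    V = {0: 0.0, 1: 0.0, 2: 0.0}
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            s2, r = env_step(s)
            v_next = 0.0 if s2 == TERMINAL else V[s2]  # terminal state has value 0
            # TD(0): V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            V[s] += alpha * (r + gamma * v_next - V[s])
            s = s2
    return V


V = td0()
print(V)  # tends to V(0) = 0.81, V(1) = 0.9, V(2) = 1.0 as episodes accumulate
```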
  • Q-learning: Based on Q-backups, $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) \right]$. The learned action-value function Q directly approximates Q*, independent of the policy being followed.
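As a worked example with hypothetical numbers (only α = 0.65 and γ = 0.9 come from the maze experiment in this lecture), suppose Q(s_t, a_t) = 0.5. The agent receives r_{t+1} = 0, and the best action value in the next state is max_a Q(s_{t+1}, a) = 1.0. One Q-backup then gives

$$
Q(s_t,a_t) \leftarrow 0.5 + 0.65\,\left[\,0 + 0.9 \times 1.0 - 0.5\,\right] = 0.5 + 0.65 \times 0.4 = 0.76
$$

so the estimate moves 65% of the way toward the one-step target 0.9. The max in the target is what makes Q-learning off-policy: the target does not depend on which action the behaviour policy actually takes next.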
  • Q-learning: Pseudocode
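A sketch of the standard one-step tabular Q-learning loop, which may differ in details from the slide. It initializes Q, then for each step chooses a from s with an ε-greedy policy, observes r and s', applies the Q-backup and sets s ← s'. The 4x4 grid, start and goal cells, ε = 0.1, episode count, step cap and seed are assumptions. R(goal) = 1, R(other) = 0, γ = 0.9 and α = 0.65 follow the maze experiment in this lecture.

```python
import random

# Hypothetical 4x4 open grid (a small stand-in for the lecture's 15x15 maze).
N = 4
START, GOAL = (0, 0), (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east


def step(state, action):
    """Move if the target cell is inside the grid, otherwise stay; R(goal)=1, else 0."""
    r, c = state[0] + action[0], state[1] + action[1]
    nxt = (r, c) if 0 <= r < N and 0 <= c < N else state
    return nxt, (1.0 if nxt == GOAL else 0.0)


def choose_action(Q, s, epsilon, rng):
    """epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    best = max(Q[s])
    return rng.choice([a for a, q in enumerate(Q[s]) if q == best])


def q_learning(episodes=1000, alpha=0.65, gamma=0.9, epsilon=0.1, max_steps=200, seed=0):
    rng = random.Random(seed)
    # Initialize Q(s, a) arbitrarily (here: zeros).
    Q = {(r, c): [0.0] * len(ACTIONS) for r in range(N) for c in range(N)}
    for _ in range(episodes):
        s = START
        for _ in range(max_steps):
            a = choose_action(Q, s, epsilon, rng)  # behaviour policy
            s2, reward = step(s, ACTIONS[a])
            # Q-backup: bootstrap from the best next action (0 once the goal is reached).
            future = 0.0 if s2 == GOAL else max(Q[s2])
            Q[s][a] += alpha * (reward + gamma * future - Q[s][a])
            s = s2
            if s == GOAL:
                break
    return Q


def greedy_path(Q, max_len=2 * N * N):
    """Follow the learned greedy policy from START (capped to avoid loops)."""
    path, s = [START], START
    while s != GOAL and len(path) <= max_len:
        a = max(range(len(ACTIONS)), key=lambda i: Q[s][i])
        s, _ = step(s, ACTIONS[a])
        path.append(s)
    return path


Q = q_learning()
print(greedy_path(Q))
```

Exploration uses ε-greedy, yet the backup always bootstraps from the greedy value. This is why the learned Q can approach Q* while the agent still explores.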
  • Q-learning in Action: 15x15 maze world; R(goal) = 1; R(other) = 0; γ = 0.9; α = 0.65
  • Q-learning in Action: initial policy
  • Q-learning in Action: after 20 episodes
  • Q-learning in Action: after 30 episodes
  • Q-learning in Action: after 100 episodes
  • Q-learning in Action: after 150 episodes
  • Q-learning in Action: after 200 episodes
  • Q-learning in Action: after 250 episodes
  • Q-learning in Action: after 300 episodes
  • Q-learning in Action: after 350 episodes
  • Q-learning in Action: after 400 episodes
  • Some Last Remarks: Exploration regime: explore vs. exploit, using ε-greedy action selection or soft-max action selection. Initialization of Q-values: be optimistic. Learning rate α: in stationary environments, α(s) = 1 / (number of visits to state s); in non-stationary environments, α takes a constant value, and the higher the value, the greater the influence of recent experiences.
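Minimal sketches of the two action-selection rules and of the visit-count learning rate. The temperature τ of the soft-max (Boltzmann) rule and all function and class names are assumptions. Optimistic initialization simply means starting the Q table above the values you expect to learn, so that untried actions look attractive and get explored.

```python
import math
import random
from collections import defaultdict


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon (uniform action), otherwise exploit.
    Ties among the best actions are broken at random."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def softmax_probs(q_values, tau):
    """Boltzmann (soft-max) distribution: P(a) proportional to exp(Q(a) / tau)."""
    m = max(q_values)  # subtracting the max avoids overflow and leaves P unchanged
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_action(q_values, tau, rng=random):
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


class VisitCountRate:
    """alpha(s) = 1 / (number of visits to s), the stationary-environment schedule."""

    def __init__(self):
        self.visits = defaultdict(int)

    def __call__(self, state):
        self.visits[state] += 1
        return 1.0 / self.visits[state]


# In a non-stationary environment one would instead keep alpha fixed, e.g. 0.1.
alpha = VisitCountRate()
print(alpha("s0"), alpha("s0"), alpha("s1"))  # 1.0 0.5 1.0
print(epsilon_greedy([0.0, 5.0, 1.0], epsilon=0.1))
print(softmax_probs([1.0, 2.0, 3.0], tau=1.0))
```

A large τ makes soft-max nearly uniform (more exploration); a small τ makes it nearly greedy. Unlike ε-greedy, soft-max rarely picks actions whose estimates are clearly bad.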
  • Next Class: Reinforcement learning with LCSs