
Reinforcement Learning


  1. Reinforcement Learning: An Introduction (Ch. 3, 4, 6), R. Sutton and A. Barto. Presented by Jung-Yeol Lee, KAIST AIPR Lab., 3 June 2010
  2. Contents
     • Reinforcement learning
     • Markov decision processes
     • Value functions
     • Policy iteration
     • Value iteration
     • Sarsa
     • Q-learning
  3. Reinforcement Learning
     • An approach to machine learning
     • Learning how to take actions in an environment that responds to those actions and presents new situations
     • The aim is to find a policy that maps situations to actions
     • The agent must discover which actions yield the most reward over the long run
  4. Agent-Environment Interface
     • Agent
       - The learner and decision maker
     • Environment
       - Everything outside the agent
       - Responds to actions and presents new situations
       - Gives a reward (feedback, or reinforcement)
  5. Agent-Environment Interface (cont'd)
     [Diagram: the agent receives state $s_t$ and reward $r_t$ and emits action $a_t$; the environment returns $s_{t+1}$ and $r_{t+1}$]
     • Agent and environment interact at time steps $t = 0, 1, 2, 3, \dots$
       - The environment's state $s_t \in S$, where $S$ is the set of possible states
       - An action $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions available in state $s_t$
       - A numerical reward $r_{t+1} \in \mathbb{R}$
     • Agent's policy $\pi_t$
       - $\pi_t(s, a)$ is the probability that $a_t = a$ if $s_t = s$
       - $\pi_t(s) \in A(s)$ for a deterministic policy
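The interaction loop on this slide can be written directly as code. Below is a minimal Python sketch, not from the slides: `env.reset()`, `env.step()`, and `RandomAgent` are assumed, Gym-style names used only for illustration.

```python
import random

class RandomAgent:
    """Stands in for a learned policy: picks actions uniformly at random."""
    def __init__(self, actions):
        self.actions = list(actions)

    def act(self, state):
        return random.choice(self.actions)

def run_episode(env, agent):
    """One episode of the loop: observe s_t, emit a_t, receive r_{t+1} and s_{t+1}."""
    state = env.reset()                         # assumed environment interface
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)
        state, reward, done = env.step(action)  # environment responds with r, s'
        total_reward += reward
    return total_reward
```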
  6. Goals and Rewards
     • Goal
       - What we want to achieve, not how we want to achieve it
     • Rewards
       - Formalize the idea of a goal
       - A numerical value given by the environment
  7. Returns
     • A specific function of the reward sequence
     • Types of returns
       - Episodic tasks: $R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T$, where $T$ is the final time step
       - Continuing tasks ($T = \infty$): using a discount rate $\gamma$, $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $0 \le \gamma \le 1$
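A small Python sketch of the two return definitions above; the reward sequence in the usage lines is made-up illustrative data.

```python
def episodic_return(rewards):
    """R_t = r_{t+1} + r_{t+2} + ... + r_T for an episodic task."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    """R_t = sum_{k>=0} gamma^k * r_{t+k+1} for a continuing task, 0 <= gamma <= 1."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Made-up reward sequence, only to show the effect of discounting.
rewards = [1.0, 0.0, 0.0, 5.0]
print(episodic_return(rewards))            # 6.0
print(discounted_return(rewards, 0.9))     # 1 + 0.9**3 * 5 ≈ 4.645
```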
  8. Exploration vs. Exploitation
     • Exploration
       - Discover better action selections
       - Improve the agent's knowledge
     • Exploitation
       - Maximize reward based on what the agent already knows
     • Exploration-exploitation dilemma
       - Neither can be pursued exclusively without failing at the task
  9. Markov Property
     • The state signal retains all relevant information
     • "Independence of path" property
     • Formally, $\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$
  10. Markov Decision Processes (MDP)
      • A 4-tuple $(S, A, T, R)$
        - $S$ is a set of states
        - $A$ is a set of actions
        - Transition probabilities $T(s, a, s') = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$, for all $s, s' \in S$, $a \in A(s)$
        - Expected reward $R(s, a, s') = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$
      • Finite MDP: the state and action spaces are finite
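One possible way to hold the $(S, A, T, R)$ tuple in code, as a minimal Python sketch; the `FiniteMDP` name and field layout are my own choices, not from the slides.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FiniteMDP:
    states: List[int]                         # S: the nonterminal states
    actions: Callable[[int], List[str]]       # A(s): actions available in state s
    T: Callable[[int, str, int], float]       # T(s, a, s') = Pr{s_{t+1}=s' | s_t=s, a_t=a}
    R: Callable[[int, str, int], float]       # R(s, a, s') = E[r_{t+1} | s, a, s']
    gamma: float = 1.0                        # discount rate
```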
  11. Example: Gridworld
      • $S = \{1, 2, \dots, 14\}$
      • $A = \{\mathit{up}, \mathit{down}, \mathit{right}, \mathit{left}\}$
      • E.g., $T(5, \mathit{right}, 6) = 1$, $T(5, \mathit{right}, 10) = 0$, $T(7, \mathit{right}, 7) = 1$
      • $R(s, a, s') = -1$ for all $s, s', a$
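A sketch of this gridworld as a `FiniteMDP` (reusing the dataclass sketched after slide 10). It assumes the standard 4x4 layout from Sutton and Barto: cells numbered 0..15 row-major, corner cells 0 and 15 terminal, and moves that would leave the grid keeping the state unchanged, which reproduces the transition examples on the slide.

```python
ACTIONS = ['up', 'down', 'right', 'left']
MOVES = {'up': -4, 'down': +4, 'right': +1, 'left': -1}
ALL_CELLS = list(range(16))        # cells 0 and 15 are the terminal corners

def next_state(s, a):
    """Deterministic successor; moves off the grid leave the state unchanged."""
    row, col = divmod(s, 4)
    if (a == 'up' and row == 0) or (a == 'down' and row == 3) \
            or (a == 'left' and col == 0) or (a == 'right' and col == 3):
        return s
    return s + MOVES[a]

def make_gridworld():
    return FiniteMDP(
        states=list(range(1, 15)),                                # S = {1, ..., 14}
        actions=lambda s: ACTIONS,
        T=lambda s, a, s2: 1.0 if next_state(s, a) == s2 else 0.0,
        R=lambda s, a, s2: -1.0,                                  # -1 on every transition
        gamma=1.0,
    )
```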
  12. Value Functions
      • "How good" it is to be in a given state, or to perform a given action in a given state
      • The state-value function for policy $\pi$
        - Expected return when starting in $s$ and following $\pi$
        - $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$
      • The action-value function for policy $\pi$
        - Expected return when starting from $s$, taking action $a$, and thereafter following $\pi$
        - $Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$
  13. Bellman Equation
      • A particular recursive relationship satisfied by value functions
      • The Bellman equation for $V^\pi$:
        $V^\pi(s) = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$
        $= E_\pi\{r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s\}$
        $= \sum_a \pi(s, a) \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_{t+1} = s'\} \right]$
        $= \sum_a \pi(s, a) \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^\pi(s') \right]$
      • The value function $V^\pi$ is the unique solution to its Bellman equation
  14. Optimal Value Functions
      • Policies are partially ordered
        - $\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$
      • Optimal policy $\pi^*$
        - A policy that is better than or equal to all other policies
      • Optimal state-value function $V^*$
        - $V^*(s) = \max_\pi V^\pi(s)$, for all $s \in S$
      • Optimal action-value function $Q^*$
        - $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, for all $s \in S$ and $a \in A(s)$
        - $Q^*(s, a) = E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}$
  15. Bellman Optimality Equation
      • The Bellman optimality equation for $V^*$:
        $V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)$
        $= \max_a E_{\pi^*}\{R_t \mid s_t = s, a_t = a\}$
        $= \max_a E_{\pi^*}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$
        $= \max_a E_{\pi^*}\{r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s, a_t = a\}$
        $= \max_a E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}$
        $= \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$
  16. Bellman Optimality Equation (cont'd)
      • The Bellman optimality equation for $Q^*$:
        $Q^*(s, a) = E\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a\}$
        $= \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$
      • Optimal policy from $Q^*$: $\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$
      • Optimal policy from $V^*$: any policy that is greedy with respect to $V^*$, since $V^*(s) = \max_{a \in A(s)} Q^*(s, a)$
  17. Dynamic Programming (DP)
      • Algorithms for computing optimal policies given a perfect model of the environment
      • Limited utility in reinforcement learning, but theoretically important
      • A foundation for understanding the other methods
  18. Policy Evaluation
      • How to compute the state-value function $V^\pi$
      • Recall the Bellman equation for $V^\pi$:
        $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^\pi(s') \right]$
      • A sequence of approximate value functions $V_0, V_1, V_2, \dots$
      • Successive approximation:
        $V_{k+1}(s) = \sum_a \pi(s, a) \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$
      • $V_k$ converges to $V^\pi$ as $k \to \infty$
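A minimal sketch of iterative policy evaluation on the gridworld sketched above; it reuses `FiniteMDP`, `make_gridworld`, `ACTIONS`, and `ALL_CELLS` from the earlier snippets, and names such as `policy_evaluation` are my own.

```python
def policy_evaluation(mdp, policy, all_states, theta=1e-6, max_sweeps=10_000):
    """Iterative policy evaluation; policy(s, a) returns pi(s, a).

    max_sweeps guards against policies that never reach a terminal state
    when gamma = 1, whose values would otherwise grow without bound.
    """
    V = {s: 0.0 for s in all_states}            # terminal states keep V = 0
    for _ in range(max_sweeps):
        delta = 0.0
        for s in mdp.states:                    # sweep nonterminal states only
            v = sum(policy(s, a) * mdp.T(s, a, s2) *
                    (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                    for a in mdp.actions(s) for s2 in all_states)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V

# Usage: V for the equiprobable random policy on the 4x4 gridworld.
mdp = make_gridworld()
V_random = policy_evaluation(mdp, lambda s, a: 1.0 / len(ACTIONS), ALL_CELLS)
```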
  19. Policy Improvement
      • Policy improvement theorem (proof in Appendix 1)
        - If $Q^\pi(s, \pi'(s)) \ge V^\pi(s)$ for all $s \in S$, then $V^{\pi'}(s) \ge V^\pi(s)$
        - It is better to switch to action $\pi'(s)$ if $Q^\pi(s, \pi'(s)) > V^\pi(s)$
      • The new greedy policy $\pi'$
        - Select the action that appears best according to $Q^\pi$:
          $\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^\pi(s') \right]$
      • What if $V^{\pi'} = V^\pi$?
        - Then both $\pi'$ and $\pi$ are optimal policies
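The greedy improvement step, sketched in Python against the same `FiniteMDP` interface used above; `greedy_policy` is an illustrative name of my own.

```python
def greedy_policy(mdp, V, all_states):
    """Return the deterministic policy that is greedy with respect to V."""
    def q(s, a):
        # One-step lookahead: sum_{s'} T(s, a, s') [R(s, a, s') + gamma * V(s')]
        return sum(mdp.T(s, a, s2) * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                   for s2 in all_states)
    return {s: max(mdp.actions(s), key=lambda a: q(s, a)) for s in mdp.states}
```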
  20. Policy Iteration
      • $\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$, where $\xrightarrow{E}$ denotes a policy evaluation and $\xrightarrow{I}$ denotes a policy improvement
      • Policy iteration finishes when the policy is stable
  21. Policy Iteration (cont'd)
      Initialization
        Initialize $V(s)$ and $\pi(s)$ arbitrarily for all $s \in S$
      Policy Evaluation
        repeat
          $\Delta \leftarrow 0$
          for each $s \in S$:
            $v \leftarrow V(s)$
            $V(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V(s') \right]$
            $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
        until $\Delta < \theta$ (a small positive number)
      Policy Improvement
        policy-stable $\leftarrow$ true
        for each $s \in S$:
          $b \leftarrow \pi(s)$
          $\pi(s) \leftarrow \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$
          if $b \ne \pi(s)$, then policy-stable $\leftarrow$ false
        if policy-stable, then stop; else go to Policy Evaluation
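Putting the two steps together, a sketch of the policy-iteration loop built from the `policy_evaluation` and `greedy_policy` sketches above (and `make_gridworld`/`ALL_CELLS` from the gridworld snippet).

```python
def policy_iteration(mdp, all_states):
    """Alternate evaluation (E) and greedy improvement (I) until the policy is stable."""
    pi = {s: mdp.actions(s)[0] for s in mdp.states}       # arbitrary initial policy
    while True:
        # Evaluate the current deterministic policy.
        V = policy_evaluation(mdp,
                              lambda s, a: 1.0 if a == pi[s] else 0.0,
                              all_states)
        # Improve it greedily; stop when no state changes its action.
        new_pi = greedy_policy(mdp, V, all_states)
        if new_pi == pi:
            return pi, V
        pi = new_pi

# Usage on the gridworld sketched earlier.
optimal_pi, optimal_V = policy_iteration(make_gridworld(), ALL_CELLS)
```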
  22. Value Iteration
      • Turn the Bellman optimality equation into an update rule:
        $V_{k+1}(s) = \max_a E\{r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a\}$
        $= \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$, for all $s \in S$
      • Output a policy $\pi$ such that $\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$
  23. Value Iteration (cont'd)
      Initialize $V$ arbitrarily, e.g., $V(s) = 0$ for all $s \in S$
      repeat
        $\Delta \leftarrow 0$
        for each $s \in S$:
          $v \leftarrow V(s)$
          $V(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$
          $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
      until $\Delta < \theta$ (a small positive number)
      Output a deterministic policy $\pi$ such that $\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$
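A Python sketch of this value-iteration pseudocode against the same `FiniteMDP` interface; it reuses `greedy_policy` from the earlier snippet to extract the output policy.

```python
def value_iteration(mdp, all_states, theta=1e-6):
    """Value iteration: the Bellman optimality backup used as an update rule."""
    V = {s: 0.0 for s in all_states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v = max(sum(mdp.T(s, a, s2) * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                        for s2 in all_states)
                    for a in mdp.actions(s))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            # Output the deterministic policy that is greedy w.r.t. the final V.
            return greedy_policy(mdp, V, all_states), V

# Usage on the gridworld sketched earlier.
vi_policy, vi_V = value_iteration(make_gridworld(), ALL_CELLS)
```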
  24. Temporal-Difference (TD) Prediction
      • A model-free method
      • Basic update rule:
        NewEstimate $\leftarrow$ OldEstimate + StepSize $\left[ \text{Target} - \text{OldEstimate} \right]$
      • The simplest TD method, TD(0):
        $V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$
        $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$, where the bracketed term is the TD error
      • $\alpha$: step size
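A sketch of tabular TD(0) prediction. Unlike the DP sketches above, it needs only sampled transitions: `sample_step(s, a) -> (reward, next_state, done)` is an assumed helper, not something defined on the slides (for the deterministic gridworld it could be built from `next_state`).

```python
import random

def td0_prediction(policy, sample_step, start_states, gamma=1.0,
                   alpha=0.1, episodes=1000):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = {}
    for _ in range(episodes):
        s = random.choice(start_states)
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = sample_step(s, a)
            target = r if done else r + gamma * V.get(s_next, 0.0)
            V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
            s = s_next
    return V

# One possible sampler for the gridworld sketched earlier:
# sample_step = lambda s, a: (-1.0, next_state(s, a), next_state(s, a) in (0, 15))
```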
  25. Advantages of TD Prediction Methods
      • Bootstrapping
        - Estimates are built on other estimates (a guess from a guess)
      • Over DP methods
        - Model-free: no model of the environment is required
      • Wait only one time step
        - Applicable to continuing tasks with no episodes
      • Guaranteed convergence to the correct answer
        - With a sufficiently small step size $\alpha$
        - If all actions are selected infinitely often
  26. Sarsa: On-Policy TD Control
      • On-policy
        - Improve the policy that is used to make decisions
      • Estimate $Q^\pi$ under the current policy $\pi$
      • Apply the TD(0) update to action values:
        $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
        - Performed after every quintuple of events $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, hence the name Sarsa
        - If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$
      • Change $\pi$ toward greediness w.r.t. $Q$
      • Converges if all state-action pairs are visited infinitely often and the policy converges to the greedy policy (e.g., $\varepsilon = 1/t$ in $\varepsilon$-greedy)
  27. Sarsa: On-Policy TD Control (cont'd)
      Initialize $Q(s, a)$ arbitrarily
      Repeat (for each episode):
        Initialize $s$
        Choose $a$ from $s$ using the policy derived from $Q$ (e.g., $\varepsilon$-greedy)
        Repeat (for each step of the episode):
          Take action $a$, observe $r$, $s'$
          Choose $a'$ from $s'$ using the policy derived from $Q$ (e.g., $\varepsilon$-greedy)
          $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$
          $s \leftarrow s'$; $a \leftarrow a'$
        until $s$ is terminal
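A Python sketch of this Sarsa pseudocode with ε-greedy action selection, using the same assumed `sample_step(s, a) -> (reward, next_state, done)` interface as the TD(0) sketch.

```python
import random
from collections import defaultdict

def sarsa(actions, sample_step, start_states, gamma=1.0, alpha=0.1,
          epsilon=0.1, episodes=5000):
    """On-policy TD control: update from each quintuple (s, a, r, s', a')."""
    Q = defaultdict(float)                       # Q[(s, a)], initialised to 0

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = random.choice(start_states)
        a = eps_greedy(s)
        done = False
        while not done:
            r, s_next, done = sample_step(s, a)
            a_next = eps_greedy(s_next)
            # Q of a terminal state counts as 0, as on the previous slide.
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```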
  28. Q-Learning: Off-Policy TD Control
      • Off-policy
        - The behavior policy differs from the estimation policy (which may be deterministic, e.g., greedy)
      • The simplest form, one-step Q-learning, is defined by
        $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
      • Directly approximates $Q^*$, the optimal action-value function
      • $Q_t$ converges to $Q^*$ with probability 1
        - Provided all state-action pairs continue to be updated
  29. Q-Learning: Off-Policy TD Control (cont'd)
      Initialize $Q(s, a)$ arbitrarily
      Repeat (for each episode):
        Initialize $s$
        Repeat (for each step of the episode):
          Choose $a$ from $s$ using the policy derived from $Q$ (e.g., $\varepsilon$-greedy)
          Take action $a$, observe $r$, $s'$
          $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
          $s \leftarrow s'$
        until $s$ is terminal
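A Python sketch of one-step Q-learning in the same style; the only change from the Sarsa sketch is that the update bootstraps from $\max_{a'} Q(s', a')$ rather than from the action actually taken next.

```python
import random
from collections import defaultdict

def q_learning(actions, sample_step, start_states, gamma=1.0, alpha=0.1,
               epsilon=0.1, episodes=5000):
    """Off-policy TD control: behave eps-greedily, bootstrap from the greedy value."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(start_states)
        done = False
        while not done:
            # Behavior policy: eps-greedy with respect to the current Q.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next, done = sample_step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```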
  30. Example: Cliffwalking
      • $\varepsilon$-greedy action selection with $\varepsilon = 0.1$ (fixed)
      • Sarsa
        - Learns the longer but safer path
      • Q-learning
        - Learns the optimal policy
      • If $\varepsilon$ were gradually reduced, both methods would converge to the optimal policy
  31. Summary
      • Goal of reinforcement learning
        - Find an optimal policy that maximizes the long-term reward
      • Model-based methods
        - Policy iteration: a sequence of improving policies and value functions
        - Value iteration: backup operations toward $V^*$
      • Model-free methods
        - Sarsa estimates $Q^\pi$ for the behavior policy $\pi$ and changes $\pi$ toward greediness w.r.t. $Q$
        - Q-learning directly approximates the optimal action-value function $Q^*$
  32. References
      [1] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998, pp. 51-158.
      [2] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach, 2nd ed. Prentice Hall, 2003, pp. 613-784.
  33. Q&A
      • Thank you
  34. Appendix 1. Policy Improvement Theorem
      • Proof (given as an equation image on the slide; not captured in this transcript)
  35. Appendix 2. Convergence of Value Iteration
      • Proof (given as an equation image on the slide; not captured in this transcript)
  36. Appendix 3. Target Estimation
      • DP target: $V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$
      • Simple TD target: $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
