Hierarchical Reinforcement Learning


  1. Hierarchical Reinforcement Learning - Mausam [A Survey and Comparison of HRL Techniques]
  2. The Outline of the Talk
     • MDPs and Bellman's curse of dimensionality.
     • RL: simultaneous learning and planning.
     • Explore avenues to speed up RL.
     • Illustrate prominent HRL methods.
     • Compare prominent HRL methods.
     • Discuss future research.
     • Summarise.
  3. Decision Making
     [Diagram, slide courtesy Dan Weld: the agent perceives the environment and asks "What action next?" before acting.]
  4. Personal Printerbot
     • States (S): {loc, has-robot-printout, user-loc, has-user-printout}, map
     • Actions (A): {move_n, move_s, move_e, move_w, extend-arm, grab-page, release-pages}
     • Reward (R): +20 if h-u-po, else -1
     • Goal (G): all states with h-u-po true.
     • Start state: a state with h-u-po false.
  5. Episodic Markov Decision Process
     ⟨S, A, P, R, G, s0⟩
     • S: set of environment states.
     • A: set of available actions.
     • P: probability transition model, P(s'|s,a)*.
     • R: reward model, R(s)*.
     • G: absorbing goal states.
     • s0: start state.
     • γ: discount factor**.
     * Markovian assumption.  ** Bounds R for infinite horizon.
     Episodic MDP ≡ MDP with absorbing goals.
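
A small Python sketch can make the Printerbot MDP of slides 4-5 concrete. The two-location "map", the identifier names, and the enumeration below are illustrative assumptions; only the state variables, actions, reward, and goal condition come from the deck.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative two-cell map; the real deck discretises a whole floor.
LOCATIONS = ["printer-room", "user-office"]
ACTIONS = ["move_n", "move_s", "move_e", "move_w",
           "extend-arm", "grab-page", "release-pages"]

@dataclass(frozen=True)
class State:
    loc: str                   # robot location
    has_robot_printout: bool   # h-r-po
    user_loc: str
    has_user_printout: bool    # h-u-po

# S: every combination of the state variables (the map itself is left out here).
STATES = [State(loc, hr, ul, hu)
          for loc, hr, ul, hu in product(LOCATIONS, [False, True],
                                         LOCATIONS, [False, True])]

def reward(s: State) -> float:
    """R(s): +20 once the user has the printout, -1 per step otherwise."""
    return 20.0 if s.has_user_printout else -1.0

def is_goal(s: State) -> bool:
    """G: the absorbing goal states are exactly those with h-u-po true."""
    return s.has_user_printout
```
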
  6. Goal of an Episodic MDP
     • Find a policy π: S → A which maximises expected discounted reward for a fully observable* episodic MDP, if the agent is allowed to execute for an indefinite horizon.
     * Non-noisy, complete-information perceptors.
  7. Solution of an Episodic MDP
     • Define V*(s): optimal reward starting in state s.
     • Value iteration: start with an estimate of V*(s) and successively re-estimate it until it converges to a fixed point.
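
The value-iteration backup appears on the slide only as an image; a standard form consistent with the notation of slide 5 (reward model R(s), transition model P(s'|s,a), discount γ) would be:

```latex
V_{k+1}(s) \;=\; R(s) \;+\; \gamma \,\max_{a \in A} \sum_{s'} P(s' \mid s, a)\, V_k(s'),
\qquad V_k(g) = 0 \ \ \text{for all } g \in G .
```

Iteration stops once max_s |V_{k+1}(s) - V_k(s)| falls below a chosen tolerance.
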
  8. Complexity of Value Iteration
     • Each iteration: polynomial in |S|.
     • Number of iterations: polynomial in |S|.
     • Overall: polynomial in |S|.
     • But |S| is exponential in the number of features in the domain*.
     * Bellman's curse of dimensionality.
  9. The Outline of the Talk
     • MDPs and Bellman's curse of dimensionality.
     • RL: simultaneous learning and planning.
     • Explore avenues to speed up RL.
     • Illustrate prominent HRL methods.
     • Compare prominent HRL methods.
     • Discuss future research.
     • Summarise.
  10. Learning
      [Diagram: data from the environment drives learning.]
      • Gain knowledge.
      • Gain understanding.
      • Gain skills.
      • Modification of behavioural tendency.
  11. Decision Making while Learning*
      [Diagram: the agent receives percepts (data) from the environment and asks "What action next?" before acting.]
      • Gain knowledge.
      • Gain understanding.
      • Gain skills.
      • Modification of behavioural tendency.
      * Known as Reinforcement Learning.
  12. Reinforcement Learning
      • Unknown transition model P and reward R.
      • Learning component: estimate P and R from data observed in the environment.
      • Planning component: decide which actions to take to maximise reward.
      • Exploration vs. exploitation:
        - GLIE (Greedy in the Limit with Infinite Exploration).
  13. Planning vs. MDPs vs. RL
      • MDPs model system dynamics; MDP algorithms solve the optimisation equations.
      • Planning is modelled with MDPs; planning algorithms speed up MDP algorithms.
      • RL is modelled over MDPs; RL algorithms use the MDP equations as their basis, and speed up simultaneous planning and learning.
  14. Exploration vs. Exploitation
      • Exploration: choose actions that visit new states, to obtain more data for better learning.
      • Exploitation: choose actions that maximise the reward under the currently learnt model.
      • One solution: GLIE, Greedy in the Limit with Infinite Exploration.
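
One concrete GLIE scheme, chosen here only for illustration (the talk does not commit to a particular one), is ε-greedy selection with ε = 1/N(s): every action keeps being tried infinitely often, yet the policy becomes greedy in the limit.

```python
import random
from collections import defaultdict

def glie_epsilon_greedy(Q, state, actions, visit_count, rng=random):
    """Pick an action epsilon-greedily with epsilon = 1 / N(state)."""
    visit_count[state] += 1
    epsilon = 1.0 / visit_count[state]
    if rng.random() < epsilon:
        return rng.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

# Example usage with a tabular Q and a visit counter (hypothetical state/actions):
Q = defaultdict(float)
N = defaultdict(int)
action = glie_epsilon_greedy(Q, "hallway", ["move_e", "move_w"], N)
```
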
  15. Model-Based Learning
      • First learn the model, then use MDP algorithms.
      • The naive version is very slow; optimisations have been proposed (DYNA, Prioritised Sweeping, etc.).
      • Uses less data, but is comparatively slow.
  16. Model-Free Learning
      • Learn the policy without learning an explicit model: do not estimate P and R explicitly.
      • E.g. Temporal Difference learning.
      • Very popular and fast, but requires a lot of data.
  17. Learning
      • Model-based learning:
        - learn the model, then plan;
        - requires less data, more computation.
      • Model-free learning:
        - plan without learning an explicit model;
        - requires a lot of data, less computation.
  18. Q-Learning
      • Instead of learning P and R, learn Q* directly.
      • Q*(s,a): optimal reward starting in s if the first action is a and the optimal policy is followed thereafter.
      • Q* directly defines the optimal policy: the optimal action in s is the one with the maximum Q* value, π*(s) = argmax_a Q*(s,a).
  19. Q-Learning
      • Given an experience tuple ⟨s, a, s', r⟩, move the old estimate of the Q value toward the new sample estimate (see the sketch below).
      • Under suitable assumptions and GLIE exploration, Q-learning converges to the optimal policy.
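
A minimal tabular sketch of that update, assuming the standard rule Q(s,a) ← (1-α)·Q(s,a) + α·(r + γ·max_a' Q(s',a')); the learning rate α and the example state and action names are illustrative choices, not taken from the slide.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.95):
    """One Q-learning step for the experience tuple <s, a, s', r>.

    Moves the old estimate Q(s,a) toward the new sample estimate
    r + gamma * max_a' Q(s', a').
    """
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    target = r + gamma * best_next               # new estimate
    Q[(s, a)] += alpha * (target - Q[(s, a)])    # blend with the old estimate

# Example usage on a hypothetical Printerbot transition:
Q = defaultdict(float)
q_learning_update(Q, "hallway", "move_e", "end-of-hallway", -1.0,
                  ["move_e", "move_w", "move_n", "move_s"])
```
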
  20. Semi-MDP: When Actions Take Time
      • The semi-MDP equation and the semi-MDP Q-learning equation (see the sketch below), where the experience tuple is ⟨s, a, s', r, N⟩.
      • r = the discounted reward accumulated while action a was executing.
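
The two equations referred to above are reproduced here in their usual semi-MDP form, under the standard assumption that N is the number of time steps the temporally extended action took:

```latex
% Semi-MDP Bellman equation (standard form):
Q(s,a) \;=\; R(s,a) \;+\; \sum_{s',\,N} \gamma^{N}\, P(s', N \mid s, a)\, \max_{a'} Q(s', a')

% Semi-MDP Q-learning update for the experience tuple <s, a, s', r, N>:
Q(s,a) \;\leftarrow\; (1-\alpha)\, Q(s,a) \;+\; \alpha \bigl( r + \gamma^{N} \max_{a'} Q(s', a') \bigr)
```
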
  21. Printerbot
      • The Paul G. Allen Center has 85,000 sq ft of space; each floor is about 85,000 / 7 ≈ 12,000 sq ft.
      • Discretise a floor into 12,000 locations.
      • State space (without the map): 2 × 2 × 12,000 × 12,000 ≈ 5.8 × 10^8 states: very large!
      • How do humans do this kind of decision making?
  22. The Outline of the Talk
      • MDPs and Bellman's curse of dimensionality.
      • RL: simultaneous learning and planning.
      • Explore avenues to speed up RL.
      • Illustrate prominent HRL methods.
      • Compare prominent HRL methods.
      • Discuss future research.
      • Summarise.
  23. 1. The Mathematical Perspective: A Structure Paradigm
      • S: relational MDPs
      • A: concurrent MDPs
      • P: dynamic Bayes nets
      • R: continuous-state MDPs
      • G: conjunction of state variables
      • V: algebraic decision diagrams
      • π: decision list (RMDP)
  24. 2. Modular Decision Making
  25. 2. Modular Decision Making
      • Go out of the room.
      • Walk in the hallway.
      • Go into the room.
  26. 2. Modular Decision Making
      • Humans plan modularly, at different granularities of understanding.
      • Going out of one room is similar to going out of another room.
      • The navigation steps do not depend on whether we have the printout or not.
  27. 3. Background Knowledge
      • Classical planners using additional control knowledge can scale up to larger problems (e.g. HTN planning, TLPlan).
      • What forms of control knowledge can we provide to our Printerbot?
        - First pick up the printouts, then deliver them.
        - For navigation, consider rooms and the hallway separately, etc.
  28. A Mechanism That Exploits All Three Avenues: Hierarchies
      • A way to add a special (hierarchical) structure on different parameters of an MDP.
      • Draws on the intuition and reasoning in human decision making.
      • A way to provide additional control knowledge to the system.
  29. The Outline of the Talk
      • MDPs and Bellman's curse of dimensionality.
      • RL: simultaneous learning and planning.
      • Explore avenues to speed up RL.
      • Illustrate prominent HRL methods.
      • Compare prominent HRL methods.
      • Discuss future research.
      • Summarise.
  30. Hierarchy
      • A hierarchy of behaviours, skills, modules, subtasks, macro-actions, etc.:
        - picking up the pages,
        - collision avoidance,
        - the fetch-pages phase,
        - walking in the hallway.
      • HRL ≡ RL with temporally extended actions.
  31. Hierarchical Algorithms ≡ Gating Mechanism
      [Diagram: a gate g selects among behaviours b_i; the hierarchy can have multiple levels.]
      • Hierarchical learning can mean:
        - learning the gating function,
        - learning the individual behaviours,
        - or learning both.
  32. Option: Move East Until the End of the Hallway
      • Start: any state in the hallway.
      • Execute: the policy as shown.
      • Terminate: when s is the end of the hallway.
  33. Options [Sutton, Precup, Singh '99]
      • An option is a well-defined behaviour: o = ⟨I_o, π_o, β_o⟩.
      • I_o: set of states (I_o ⊆ S) in which o can be initiated.
      • π_o(s): the policy (S → A*) followed while o is executing.
      • β_o(s): the probability that o terminates in s.
      * Can be a policy over lower-level options.
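
A minimal Python rendering of the option triple, with the "move east until the end of the hallway" option of slide 32 as the example; the field names and the hallway state labels are my own illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any
Action = Any

@dataclass
class Option:
    """o = <I_o, pi_o, beta_o> in the sense of Sutton, Precup & Singh (1999)."""
    initiation_set: Set[State]              # I_o: states where o may start
    policy: Callable[[State], Action]       # pi_o: action (or lower-level option) to take
    termination: Callable[[State], float]   # beta_o(s): probability of terminating in s

# "Move east until the end of the hallway" (hypothetical state names):
move_east = Option(
    initiation_set={"hallway-0", "hallway-1", "hallway-2"},
    policy=lambda s: "move_e",
    termination=lambda s: 1.0 if s == "end-of-hallway" else 0.0,
)
```
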
  34. Learning
      • An option is a temporally extended action with a well-defined policy.
      • The set of options (O) replaces the set of actions (A).
      • Learning occurs outside the options.
      • Learning over options ≡ semi-MDP Q-learning.
  35. Machine: Move East + Collision Avoidance
      [Finite-state-machine diagram: the machine executes Move_e along the hallway; on encountering an obstacle it chooses between calling sub-machines M1 and M2 (short detours built from Move_w/Move_n and Move_w/Move_s steps, each ending in Return); at the end of the hallway the machine returns.]
  36. Hierarchies of Abstract Machines [Parr, Russell '97]
      • A machine is a partial policy represented by a finite state automaton.
      • A node can:
        - execute a ground action,
        - call a machine as a subroutine,
        - choose the next node, or
        - return to the calling machine.
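
A toy sketch of how a HAM's node types might be written down in Python, loosely following the Move-east machine of slide 35. The node names, the table layout, and the helper below are illustrative assumptions, not the formalism's actual notation.

```python
# Each machine is a table: node name -> (node type, payload).
# The four node types follow slide 36: act, call, choose, return.
MOVE_EAST_MACHINE = {
    "go":          ("act",    "move_e"),                       # execute a ground action
    "at-obstacle": ("choose", ["avoid-north", "avoid-south"]), # choice left to learning
    "avoid-north": ("call",   "M1"),                           # call a sub-machine
    "avoid-south": ("call",   "M2"),
    "done":        ("return", None),                           # return to the caller
}

def is_choice_node(machine, node):
    """Learning (slide 38) happens only at choice nodes of the flattened [s, m] space."""
    kind, _ = machine[node]
    return kind == "choose"
```
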
  38. Learning
      • Learning occurs within machines, since machines are only partially defined.
      • Flatten all machines out and consider states [s, m], where s is a world state and m a machine node ≡ an MDP.
      • reduce(S ∘ M): consider only states whose machine node is a choice node ≡ a semi-MDP.
      • Learning ≈ semi-MDP Q-learning.
  39. Task Hierarchy: MAXQ Decomposition [Dietterich '00]
      [Task-graph diagram: Root decomposes into Fetch and Deliver; each of these uses Navigate(loc) together with Take or Give respectively; Take uses Extend-arm and Grab, Give uses Extend-arm and Release; Navigate(loc) uses Move_e, Move_w, Move_s, Move_n. Children of a task are unordered.]
  40. MAXQ Decomposition
      • Augment the state s with the subtask i: [s, i].
      • Define C([s,i], j) as the reward received in i after j finishes.
      • Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))*
        (the reward received while navigating plus the reward received after navigation).
      • Express V in terms of C; learn C instead of learning Q.
      * Note the context-free nature of the Q-value.
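
For reference, the same decomposition in its general MAXQ form (a standard statement from Dietterich's paper, written here in the bracket notation of this slide):

```latex
Q([s,i],\, j) \;=\; V([s,j]) \;+\; C([s,i],\, j),
\qquad
V([s,i]) \;=\;
\begin{cases}
\max_{j}\, Q([s,i],\, j) & \text{if } i \text{ is a composite subtask,}\\[2pt]
\mathbb{E}\,[\, r \mid s, i \,] & \text{if } i \text{ is a primitive action.}
\end{cases}
```
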
  41. The Outline of the Talk
      • MDPs and Bellman's curse of dimensionality.
      • RL: simultaneous learning and planning.
      • Explore avenues to speed up RL.
      • Illustrate prominent HRL methods.
      • Compare prominent HRL methods.
      • Discuss future research.
      • Summarise.
  42. 1. State Abstraction
      • An abstract state is a state with fewer state variables; different world states map to the same abstract state.
      • If we can drop some state variables, we can reduce the learning time considerably.
      • We may use different abstract states for different macro-actions.
  43. State Abstraction in MAXQ
      • Relevance: only some variables are relevant to a task.
        - Fetch: user-loc is irrelevant.
        - Navigate(printer-room): h-r-po, h-u-po, user-loc are irrelevant.
        - Fewer parameters for the V of lower levels.
      • Funnelling: a subtask maps many states to a smaller set of states.
        - Fetch: all states map to h-r-po = true, loc = printer-room.
        - Fewer parameters for the C of higher levels.
  44. State Abstraction in Options, HAM
      • Options: learning is required only in states that are terminal states for some option.
      • HAM: the original work has no abstraction.
        - Extension: a three-way value decomposition*, Q([s,m], n) = V([s,n]) + C([s,m], n) + C_ex([s,m]); similar abstractions are employed.
      * [Andre, Russell '02]
  45. 2. Optimality
      • Hierarchical optimality vs. recursive optimality.
  46. Optimality
      • Options: hierarchical
        - Use (A ∪ O): global**
        - Interrupt options.
      • HAM: hierarchical*
      • MAXQ: recursive*
        - Interrupt subtasks.
        - Use pseudo-rewards.
        - Iterate!
      * Equations can be defined for both optimalities.
      ** The advantage of using macro-actions may be lost.
  47. 3. Language Expressiveness
      • Options
        - Can only input a complete policy.
      • HAM
        - Can input a complete policy.
        - Can input a task hierarchy.
        - Can represent "amount of effort".
        - Later extended to partial programs.
      • MAXQ
        - Cannot input a policy (full or partial).
  48. 4. Knowledge Requirements
      • Options: requires a complete specification of the policy (though one could learn option policies, given the subtasks).
      • HAM: medium requirements.
      • MAXQ: minimal requirements.
  49. 5. Models Advanced
      • Options: concurrency.
      • HAM: richer representations, concurrency.
      • MAXQ: continuous time, state, and actions; multi-agent settings; average reward.
      • In general, more researchers have followed MAXQ:
        - less input knowledge,
        - value decomposition.
  50. 6. Structure Paradigm
      • S: Options, MAXQ
      • A: all
      • P: none
      • R: MAXQ
      • G: all
      • V: MAXQ
      • π: all
  51. The Outline of the Talk
      • MDPs and Bellman's curse of dimensionality.
      • RL: simultaneous learning and planning.
      • Explore avenues to speed up RL.
      • Illustrate prominent HRL methods.
      • Compare prominent HRL methods.
      • Discuss future research.
      • Summarise.
  52. Directions for Future Research
      • Bidirectional state abstractions.
      • Hierarchies over other RL research:
        - model-based methods,
        - function approximators.
      • Probabilistic planning: hierarchical P and hierarchical R.
      • Imitation learning.
  53. Directions for Future Research
      • Theory:
        - bounds (goodness of a hierarchy),
        - non-asymptotic analysis.
      • Automated discovery:
        - discovery of hierarchies,
        - discovery of state abstractions.
      • Apply...
  54. Applications
      • Toy robot
      • Flight simulator
      • AGV scheduling
      • Keepaway soccer
      [Images courtesy various sources; the AGV scheduling figure shows parts, assemblies, a warehouse, and stations P1-P4 and D1-D4.]
  55. Thinking Big...
      • "... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*." -- David Andre
      • Use planners, theorem provers, etc. as components in a big hierarchical solver.
  56. The Outline of the Talk
      • MDPs and Bellman's curse of dimensionality.
      • RL: simultaneous learning and planning.
      • Explore avenues to speed up RL.
      • Illustrate prominent HRL methods.
      • Compare prominent HRL methods.
      • Discuss future research.
      • Summarise.
  57. How to Choose an Appropriate Hierarchy
      • Look at the available domain knowledge:
        - if some behaviours are completely specified: Options;
        - if some behaviours are partially specified: HAM;
        - if little domain knowledge is available: MAXQ.
      • All three can be used in tandem to specify different behaviours.
  58. The Structure Paradigm
      • An organised way to view optimisations.
      • Assists in figuring out unexploited avenues for speedup.
  59. Main Ideas in the HRL Community
      • Hierarchies speed up learning.
      • Value function decomposition.
      • State abstractions.
      • Greedy non-hierarchical execution.
      • Context-free learning and pseudo-rewards.
      • Policy improvement by re-estimation and re-learning.
