Hierarchical Reinforcement Learning
Published in: Education, Technology

Transcript

  • 1. Hierarchical Reinforcement Learning Mausam [A Survey and Comparison of HRL techniques]
  • 2. The Outline of the Talk
    • MDPs and Bellman’s curse of dimensionality.
    • RL: Simultaneous learning and planning.
    • Explore avenues to speed up RL.
    • Illustrate prominent HRL methods.
    • Compare prominent HRL methods.
    • Discuss future research.
    • Summarise
  • 3. Decision Making (slide courtesy of Dan Weld) [Figure: agent-environment loop labelled Environment, Percept, Action.] What action next?
  • 4. Personal Printerbot
    • States ( S ) : {loc,has-robot-printout, user-loc,has-user-printout},map
    • Actions ( A ) : {move n ,move s ,move e ,move w , extend-arm,grab-page,release-pages}
    • Reward ( R ) : +20 if h-u-po (has-user-printout) holds, else -1
    • Goal ( G ) : All states with h-u-po true.
    • Start state : A state with h-u-po false.
  • 5. Episodic Markov Decision Process
    • $\langle S, A, P, R, G, s_0 \rangle$
    • S : Set of environment states.
    • A : Set of available actions.
    • P : Probability Transition model. P (s’|s,a)*
    • R : Reward model. R (s)*
    • G : Absorbing goal states.
    • $s_0$ : Start state.
    • $\gamma$ : Discount factor**.
    * Markovian assumption. ** Bounds R for infinite horizon. Episodic MDP $\equiv$ MDP with absorbing goals.
  • 6. Goal of an Episodic MDP
    • Find a policy ( $\pi : S \to A$ ) which:
      • maximises expected discounted reward
      • for a fully observable* Episodic MDP,
      • if the agent is allowed to execute for an indefinite horizon.
    * Non-noisy, complete-information perceptors.
  • 7. Solution of an Episodic MDP
    • Define V *(s) : Optimal reward starting in state s.
    • Value Iteration : Start with an estimate of V *(s) and successively re-estimate it to converge to a fixed point.
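The re-estimation loop described on this slide can be sketched as follows. This is a minimal illustration, not the talk's own code: the four-cell chain, the action names, and the convention that the reward R(s') is collected on entering s' are all assumptions made for the example.

```python
def value_iteration(states, actions, P, R, goals, gamma=0.9, tol=1e-8):
    """P[(s, a)] is a list of (next_state, probability); R[s] is the reward
    collected on entering s (an assumed convention for this sketch)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in goals:
                continue  # absorbing goal: its value stays 0
            # Bellman backup: best expected one-step reward plus discounted value
            best = max(
                sum(p * (R[s2] + gamma * V[s2]) for s2, p in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Hypothetical deterministic hallway: cells 0..3, cell 3 is the goal.
states = [0, 1, 2, 3]
actions = ["left", "right"]
P = {}
for s in states:
    P[(s, "left")] = [(max(s - 1, 0), 1.0)]
    P[(s, "right")] = [(min(s + 1, 3), 1.0)]
R = {0: -1.0, 1: -1.0, 2: -1.0, 3: 20.0}  # mirrors the +20 / -1 Printerbot reward

V = value_iteration(states, actions, P, R, goals={3})
print(V)
```

Each sweep touches every state once, which is why the cost per iteration grows with | S |.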
  • 8. Complexity of Value Iteration
    • Each iteration – polynomial in | S |
    • Number of iterations – polynomial in | S |
    • Overall – polynomial in | S |
    • Polynomial in | S | is not enough: | S | itself is exponential in the number of features in the domain*.
    * Bellman’s curse of dimensionality
  • 9. The Outline of the Talk
    • MDPs and Bellman’s curse of dimensionality.
    • RL: Simultaneous learning and planning.
    • Explore avenues to speed up RL.
    • Illustrate prominent HRL methods.
    • Compare prominent HRL methods.
    • Discuss future research.
    • Summarise
  • 10. Learning [Figure labelled Environment, Data.]
    • Gain knowledge
    • Gain understanding
    • Gain skills
    • Modification of behavioural tendency
  • 11. Decision Making while Learning* [Figure: agent-environment loop labelled Environment, Percepts, Datum, Action.] What action next? (* Known as Reinforcement Learning.)
    • Gain knowledge
    • Gain understanding
    • Gain skills
    • Modification of behavioural tendency
  • 12. Reinforcement Learning
    • Unknown P and reward R .
    • Learning Component : Estimate the P and R values via data observed from the environment.
    • Planning Component : Decide which actions to take that will maximise reward.
    • Exploration vs. Exploitation
      • GLIE (Greedy in the Limit with Infinite Exploration)
  • 13. Planning vs. MDP vs. RL
    • MDPs model system dynamics.
    • MDP algorithms solve the optimisation equations.
    • Planning is modeled as MDPs.
    • Planning algorithms speed up MDP algorithms.
    • RL is modeled over MDPs.
    • RL algorithms use MDP equations as basis.
    • RL algorithms speed up algorithms for simultaneous planning and learning.
  • 14. Exploration vs. Exploitation
    • Exploration : Choose actions that visit new states in order to obtain more data for better learning.
    • Exploitation : Choose actions that maximise the reward given current learnt model.
    • A solution : GLIE (Greedy in the Limit with Infinite Exploration).
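One common way to get GLIE behaviour is an epsilon-greedy rule whose epsilon decays with the number of visits to the current state. The sketch below assumes the schedule epsilon = 1 / (1 + visits), which is only one possible choice; the function names and the Q-table layout are hypothetical.

```python
import random


def glie_epsilon(visits):
    """Exploration rate that decays to 0 as the visit count grows.
    The 1 / (1 + visits) schedule is an assumption of this sketch."""
    return 1.0 / (1 + visits)


def choose_action(Q, state, actions, visits, rng=random):
    """Epsilon-greedy: explore with probability epsilon, else exploit Q."""
    if rng.random() < glie_epsilon(visits):
        return rng.choice(actions)  # exploration
    # exploitation: greedy with respect to the current estimates
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


Q = {("s", "a"): 0.0, ("s", "b"): 1.0}
early = glie_epsilon(0)  # first visit: always explore
late_choice = choose_action(Q, "s", ["a", "b"], visits=10**9, rng=random.Random(0))
print(early, late_choice)
```

Early visits explore almost uniformly, while heavily visited states are handled almost greedily, which matches the "greedy in the limit" half of the name.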
  • 15. Model Based Learning
    • First learn the model.
    • Then use MDP algorithms.
    • Very slow, and uses a lot of data.
    • Optimisations proposed – DYNA, Prioritised Sweeping etc.
    • These use less data, but are comparatively slow.
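The "first learn the model" step can be sketched as maximum-likelihood counting over observed transitions. The class name, method names, and the choice to attach rewards to the state entered are assumptions for illustration, not part of the talk.

```python
from collections import defaultdict


class ModelEstimator:
    """Count-based estimates of P(s'|s,a) and R(s) from experience."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(float)
        self.reward_count = defaultdict(int)

    def observe(self, s, a, s_next, r):
        """Record one experience tuple (s, a, s', r)."""
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[s_next] += r
        self.reward_count[s_next] += 1

    def transition(self, s, a):
        """Estimated distribution over next states; empty if (s, a) is unseen."""
        counts = self.next_counts.get((s, a), {})
        total = sum(counts.values())
        return {s2: c / total for s2, c in counts.items()} if total else {}

    def reward(self, s):
        """Average reward observed on entering s (0.0 if never seen)."""
        n = self.reward_count.get(s, 0)
        return self.reward_sum[s] / n if n else 0.0


model = ModelEstimator()
for _ in range(3):
    model.observe("hall", "move_e", "door", -1.0)
model.observe("hall", "move_e", "hall", -1.0)  # one slip: the robot stayed put
P_hat = model.transition("hall", "move_e")
print(P_hat, model.reward("door"))
```

Once the estimates are good enough, they can be handed to an MDP algorithm such as value iteration.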
  • 16. Model Free Learning
    • Learn the policy without learning an explicit model.
    • Do not estimate P and R explicitly.
    • E.g. : Temporal Difference Learning.
    • Very popular and fast, but requires a lot of data.
  • 17. Learning
    • Model-based learning
      • Learn the model, and do planning
      • Requires less data, more computation
    • Model-free learning
      • Plan without learning an explicit model
      • Requires a lot of data, less computation
  • 18. Q-Learning
    • Instead of learning P and R , learn Q * directly.
    • Q *(s,a) : Optimal reward starting in s, if the first action is a, and after that the optimal policy is followed.
    • Q * directly defines the optimal policy:
    Optimal policy is the action with maximum Q * value.
  • 19. Q-Learning
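    • Given an experience tuple $\langle s, a, s', r \rangle$, update (with learning rate $\alpha$):
      $Q(s,a) \leftarrow Q(s,a) + \alpha \, [\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,]$
      Here $r + \gamma \max_{a'} Q(s',a')$ is the new estimate of the Q value and $Q(s,a)$ is the old estimate of the Q value.
    • Under suitable assumptions and GLIE exploration, Q-Learning converges to the optimal Q *.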
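A single tabular Q-Learning step on one experience tuple might look like the sketch below. The dictionary-based Q-table, the default of 0.0 for unseen entries, and the parameter values are assumptions chosen for the example.

```python
def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.9):
    """Move the old estimate Q(s, a) a fraction alpha towards the new
    estimate r + gamma * max_a' Q(s', a')."""
    old = Q.get((s, a), 0.0)
    new = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = old + alpha * (new - old)
    return Q[(s, a)]


Q = {}
actions = ["a", "b"]
first = q_update(Q, "s", "a", "t", 10.0, actions, alpha=0.5)   # 0 + 0.5 * (10 - 0)
Q[("t", "b")] = 2.0                                            # pretend t was explored
second = q_update(Q, "s", "a", "t", 10.0, actions, alpha=0.5)  # 5 + 0.5 * (11.8 - 5)
print(first, second)
```

No transition model appears anywhere in the update, which is what makes the method model-free.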
  • 20. Semi-MDP: When actions take time.
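    • The Semi-MDP equation:
      $V^*(s) = \max_a \big[\, R(s,a) + \sum_{s',N} \gamma^N P(s',N \mid s,a)\, V^*(s') \,\big]$
    • Semi-MDP Q-Learning equation (with learning rate $\alpha$):
      $Q(s,a) \leftarrow Q(s,a) + \alpha \, [\, r + \gamma^N \max_{a'} Q(s',a') - Q(s,a) \,]$
    • where the experience tuple is $\langle s, a, s', r, N \rangle$ and N is the number of time steps action a took.
    • r = accumulated discounted reward while action a was executing.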
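The difference from ordinary Q-Learning is that the successor value is discounted by gamma to the power N, and r is itself a discounted sum over the N steps. A minimal sketch, with hypothetical helper names and a dictionary Q-table:

```python
def accumulate(rewards, gamma):
    """Return (r, N): the discounted sum of the per-step rewards seen while
    a temporally extended action ran, and the number of steps it took."""
    r = sum((gamma ** k) * rk for k, rk in enumerate(rewards))
    return r, len(rewards)


def smdp_q_update(Q, s, a, s_next, r, n, actions, alpha=0.1, gamma=0.9):
    """Semi-MDP Q-Learning step for the tuple (s, a, s', r, N)."""
    old = Q.get((s, a), 0.0)
    new = r + (gamma ** n) * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = old + alpha * (new - old)
    return Q[(s, a)]


# A three-step action: two -1 steps, then the +20 goal reward.
r, n = accumulate([-1.0, -1.0, 20.0], gamma=0.5)
Q = {("goal", "stay"): 8.0}
updated = smdp_q_update(Q, "start", "deliver", "goal", r, n, ["stay"], alpha=1.0, gamma=0.5)
print(r, n, updated)
```

With N = 1 the update reduces to the ordinary Q-Learning step.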
  • 21. Printerbot
    • Paul G. Allen Center has 85000 sq ft space
    • Each floor ~ 85000/7 ~ 12000 sq ft
    • Discretise location on a floor: 12000 parts.
    • State Space (without map) : 2*2*12000*12000 = 576,000,000 states. Very large!
    • How do humans do the decision making?
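The arithmetic on this slide is easy to check. The sketch below reads the four factors as the two printout flags and the two locations from the Printerbot state definition, and counts the seven actions listed there; that reading, and the variable names, are assumptions.

```python
total_area_sqft = 85_000
floors = 7
area_per_floor = total_area_sqft / floors  # about 12,143; the slide rounds to 12,000

location_cells = 12_000  # discretised robot location, and likewise user location
# has-robot-printout (2) x has-user-printout (2) x loc x user-loc
flat_states = 2 * 2 * location_cells * location_cells

num_actions = 7  # four moves, extend-arm, grab-page, release-pages
flat_q_entries = flat_states * num_actions

print(round(area_per_floor), flat_states, flat_q_entries)
```

A flat Q-table would therefore need about four billion entries before the map is even considered.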
  • 22. The Outline of the Talk
    • MDPs and Bellman’s curse of dimensionality.
    • RL: Simultaneous learning and planning.
    • Explore avenues to speed up RL.
    • Illustrate prominent HRL methods.
    • Compare prominent HRL methods.
    • Discuss future research.
    • Summarise
  • 23. 1. The Mathematical Perspective: A Structure Paradigm
    • S : Relational MDP
    • A : Concurrent MDP
    • P : Dynamic Bayes Nets
    • R : Continuous-state MDP
    • G : Conjunction of state variables
    • V : Algebraic Decision Diagrams
    • $\pi$ : Decision List (RMDP)
  • 24. 2. Modular Decision Making
  • 25. 2. Modular Decision Making
    • Go out of room
    • Walk in hallway
    • Go in the room
  • 26. 2. Modular Decision Making
    • Humans plan modularly at different granularities of understanding.
    • Going out of one room is similar to going out of another room.
    • Navigation steps do not depend on whether we have the print out or not.
  • 27. 3. Background Knowledge
    • Classical Planners using additional control knowledge can scale up to larger problems.
    • (E.g. : HTN planning, TLPlan)
    • What forms of control knowledge can we provide to our Printerbot?
      • First pick printouts, then deliver them.
      • Navigation – consider rooms, hallway, separately, etc.
  • 28. A mechanism that exploits all three avenues : Hierarchies
    • Way to add a special (hierarchical) structure on different parameters of an MDP.
    • Draws from the intuition and reasoning in human decision making.
    • Way to provide additional control knowledge to the system.
  • 29. The Outline of the Talk
    • MDPs and Bellman’s curse of dimensionality.
    • RL: Simultaneous learning and planning.
    • Explore avenues to speed up RL.
    • Illustrate prominent HRL methods.
    • Compare prominent HRL methods.
    • Discuss future research.
    • Summarise
  • 30. Hierarchy
    • Hierarchy of : Behaviour, Skill, Module, SubTask, Macro-action, etc.
      • picking the pages
      • collision avoidance
      • fetch pages phase
      • walk in hallway
    • HRL $\equiv$ RL with temporally extended actions
  • 31. Hierarchical Algos $\equiv$ Gating Mechanism
    • Hierarchical Learning
      • Learning the gating function
      • Learning the individual behaviours
      • Learning both
    * Can be a multi-level hierarchy. g is a gate; $b_i$ is a behaviour.
  • 32. Option : Move e until end of hallway
    • Start : Any state in the hallway.
    • Execute : policy as shown.
    • Terminate : when s is end of hallway.
  • 33. Options [Sutton, Precup, Singh’99]
    • An option is a well defined behaviour.
    • $o = \langle I_o, \pi_o, \beta_o \rangle$
    • $I_o$ : Set of states ( $I_o \subseteq S$ ) in which o can be initiated.
    • $\pi_o(s)$ : Policy ( $S \to A$* ) when o is executing.
    • $\beta_o(s)$ : Probability that o terminates in s.
    *Can be a policy over lower level options.
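The three components of an option map directly onto a small data structure. The sketch below encodes the "move e until end of hallway" option from the previous slide and runs it to termination; the five-cell hallway, the `step` function, and all names are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Option:
    initiation: FrozenSet[int]       # I_o: states where the option may start
    policy: Callable[[int], str]     # pi_o: action to take in each state
    beta: Callable[[int], float]     # beta_o: termination probability per state


def run_option(option, state, step, gamma, rng):
    """Execute the option until it terminates; return (s', r, N), where r is
    the discounted reward accumulated over the N steps taken."""
    assert state in option.initiation
    total, discount, n = 0.0, 1.0, 0
    while True:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        n += 1
        if rng.random() < option.beta(state):
            return state, total, n


END = 4  # hypothetical hallway with cells 0..4


def step(state, action):
    """Deterministic toy dynamics: 'e' moves one cell east, each step costs 1."""
    return (min(state + 1, END) if action == "e" else state), -1.0


move_east = Option(
    initiation=frozenset(range(END)),
    policy=lambda s: "e",
    beta=lambda s: 1.0 if s == END else 0.0,
)
result = run_option(move_east, 1, step, gamma=0.5, rng=random.Random(0))
print(result)
```

The returned triple supplies the s', r and N components of a Semi-MDP experience tuple.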
  • 34. Learning
    • An option is a temporally extended action with a well-defined policy.
    • Set of options ( O ) replaces the set of actions ( A )
    • Learning occurs outside options.
    • Learning over options $\equiv$ Semi-MDP Q-Learning.
  • 35. Machine: Move e + Collision Avoidance [Figure: finite-state machine. Node and edge labels: Move e, Choose, Return, End of hallway, ¬End of hallway, Obstacle, Call M1, Call M2; sub-machines M1 and M2 are built from Move w, Move n, Move s and Return nodes.]
  • 36. Hierarchies of Abstract Machines [Parr, Russell’97]
    • A machine is a partial policy represented by a Finite State Automaton.
    • Node :
      • Execute a ground action.
      • Call a machine as a subroutine.
      • Choose the next node.
      • Return to the calling machine.
  • 37. Hierarchies of Abstract Machines
    • A machine is a partial policy represented by a Finite State Automaton.
    • Node :
      • Execute a ground action.
      • Call a machine as subroutine.
      • Choose the next node.
      • Return to the calling machine.
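The four node types can be sketched as a tiny interpreter with a call stack. This is a deliberately simplified illustration: transitions here ignore the world state, the machines are invented for the example, and the `chooser` callback stands in for whatever the learner would decide at a choice node.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:      # execute a ground action, then go to node `next`
    name: str
    next: str


@dataclass
class Call:        # call another machine, then resume at node `next`
    machine: str
    next: str


@dataclass
class Choose:      # pick one of several successor nodes
    options: List[str]


@dataclass
class Return:      # hand control back to the calling machine
    pass


def run(machines, root, chooser, max_steps=100):
    """Run `root` from its 'start' node and return the ground actions emitted."""
    stack = [(root, "start")]
    trace = []
    while stack and len(trace) < max_steps:
        mname, nname = stack.pop()
        node = machines[mname][nname]
        if isinstance(node, Action):
            trace.append(node.name)
            stack.append((mname, node.next))
        elif isinstance(node, Call):
            stack.append((mname, node.next))       # where to resume afterwards
            stack.append((node.machine, "start"))  # callee runs first
        elif isinstance(node, Choose):
            stack.append((mname, chooser(mname, nname, node.options)))
        # Return: nothing is pushed, so the caller's frame is popped next
    return trace


machines = {  # hypothetical machines, loosely modelled on the hallway example
    "main": {
        "start": Action("move e", "pick"),
        "pick": Choose(["via_m1", "via_m2"]),
        "via_m1": Call("M1", "done"),
        "via_m2": Call("M2", "done"),
        "done": Return(),
    },
    "M1": {"start": Action("move n", "end"), "end": Return()},
    "M2": {"start": Action("move s", "end"), "end": Return()},
}
first = run(machines, "main", lambda m, n, opts: opts[0])
last = run(machines, "main", lambda m, n, opts: opts[-1])
print(first, last)
```

Only the `Choose` nodes leave anything undecided, which is why learning can be restricted to those points.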
  • 38. Learning
    • Learning occurs within machines, as machines are only partially defined.
    • Flatten all machines out and consider states [s,m], where s is a world state and m is a machine node $\equiv$ MDP.
    • reduce ( $S \circ M$ ) : Consider only states where the machine node is a choice node $\equiv$ Semi-MDP.
    • Learning $\approx$ Semi-MDP Q-Learning
  • 39. Task Hierarchy: MAXQ Decomposition [Dietterich’00] [Figure: task graph with nodes Root; Fetch, Deliver; Take, Give, Navigate(loc); Extend-arm, Grab, Release; Move e, Move w, Move s, Move n.] Children of a task are unordered.
  • 40. MAXQ Decomposition
    • Augment the state s by adding the subtask i : [s,i].
    • Define C ([s,i],j) as the reward received in i after j finishes.
    • Q ( [s,Fetch],Navigate(prr)) = V ([s,Navigate(prr)]) + C ([s,Fetch],Navigate(prr)) *
    • Express V in terms of C
    • Learn C , instead of learning Q
    * Observe the context-free nature of the Q-value: V ([s,Navigate(prr)]) is the reward received while navigating, and C ([s,Fetch],Navigate(prr)) is the reward received after navigation.
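Expressing V in terms of C gives a simple recursion: Q of a parent-child pair is the child's own value plus the completion value, and the value of a composite task is the best Q over its children. The sketch below evaluates that recursion over given tables; the task names, the single state "s", and all the numbers are invented for the example.

```python
def V(state, task, children, C, primitive_value):
    """Value of `task` in `state`: a table lookup for primitive actions,
    otherwise the best Q-value over the task's children."""
    if task not in children:
        return primitive_value[(state, task)]
    return max(Q(state, task, child, children, C, primitive_value)
               for child in children[task])


def Q(state, parent, child, children, C, primitive_value):
    """Q([s, parent], child) = V([s, child]) + C([s, parent], child)."""
    return (V(state, child, children, C, primitive_value)
            + C.get((state, parent, child), 0.0))


children = {  # hypothetical fragment of a task hierarchy
    "Fetch": ["Navigate", "Take"],
    "Navigate": ["move_e", "move_w"],
}
primitive_value = {("s", "move_e"): -1.0, ("s", "move_w"): -1.0, ("s", "Take"): -1.0}
C = {  # completion values: reward received in the parent after the child finishes
    ("s", "Navigate", "move_e"): -2.0,
    ("s", "Navigate", "move_w"): -5.0,
    ("s", "Fetch", "Navigate"): -4.0,
    ("s", "Fetch", "Take"): -10.0,
}
v_nav = V("s", "Navigate", children, C, primitive_value)
q_fetch_nav = Q("s", "Fetch", "Navigate", children, C, primitive_value)
v_fetch = V("s", "Fetch", children, C, primitive_value)
print(v_nav, q_fetch_nav, v_fetch)
```

Because V of Navigate does not mention Fetch, the same navigation values can be reused under any parent that calls it.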
  • 41. The Outline of the Talk
    • MDPs and Bellman’s curse of dimensionality.
    • RL: Simultaneous learning and planning.
    • Explore avenues to speed up RL.
    • Illustrate prominent HRL methods.
    • Compare prominent HRL methods.
    • Discuss future research.
    • Summarise
  • 42. 1. State Abstraction
    • Abstract state : A state with fewer state variables; different world states map to the same abstract state.
    • If we can drop some state variables, we can reduce the learning time considerably!
    • We may use different abstract states for different macro-actions.
  • 43. State Abstraction in MAXQ
    • Relevance : Only some variables are relevant for the task.
      • Fetch : user-loc irrelevant
      • Navigate(printer-room) : h-r-po, h-u-po, user-loc irrelevant
      • Fewer params for V of lower levels.
    • Funnelling : Subtask maps many states to smaller set of states.
      • Fetch : All states map to h-r-po =true, loc =pr.room.
      • Fewer params for C of higher levels.
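The relevance condition amounts to projecting the full state onto the variables a subtask cares about. A minimal sketch, assuming the Printerbot variables from earlier in the talk and treating the listed variables as the irrelevant ones for each subtask; the dictionary layout and function name are hypothetical.

```python
VARIABLES = ("loc", "has_robot_printout", "user_loc", "has_user_printout")

# Variables each subtask may ignore, following the relevance bullets above.
IRRELEVANT = {
    "Fetch": {"user_loc"},
    "Navigate(printer-room)": {"has_robot_printout", "has_user_printout", "user_loc"},
}


def abstract_state(state, task):
    """Project a full world state onto the variables relevant to `task`."""
    drop = IRRELEVANT.get(task, set())
    return tuple((v, state[v]) for v in VARIABLES if v not in drop)


s1 = {"loc": 5, "has_robot_printout": False, "user_loc": 10, "has_user_printout": False}
s2 = {"loc": 5, "has_robot_printout": True, "user_loc": 99, "has_user_printout": False}

nav_1 = abstract_state(s1, "Navigate(printer-room)")
nav_2 = abstract_state(s2, "Navigate(printer-room)")
fetch_1 = abstract_state(s1, "Fetch")
fetch_2 = abstract_state(s2, "Fetch")
print(nav_1 == nav_2, fetch_1 == fetch_2)
```

Under this projection the navigation subtask only distinguishes robot locations, so its value table needs one entry per location rather than one per full world state.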
  • 44. State Abstraction in Options, HAM
    • Options : Learning required only in states that are terminal states for some option.
    • HAM : Original work has no abstraction.
      • Extension: Three-way value decomposition*:
      • Q ([s,m],n) = V ([s,n]) + C ([s,m],n) + C ex ([s,m])
      • Similar abstractions are employed.
    *[Andre,Russell’02]
  • 45. 2. Optimality: Hierarchical Optimality vs. Recursive Optimality
  • 46. Optimality
    • Options : Hierarchical
      • Use ( $A \cup O$ ) : Global**
      • Interrupt options
    • HAM : Hierarchical*
    • MAXQ : Recursive*
      • Interrupt subtasks
      • Use Pseudo-rewards
      • Iterate!
    * Can define equations for both optimalities. ** The advantage of using macro-actions may be lost.
  • 47. 3. Language Expressiveness
    • Option
      • Can only input a complete policy
    • HAM
      • Can input a complete policy.
      • Can input a task hierarchy.
      • Can represent “amount of effort”.
      • Later extended to partial programs.
    • MAXQ
      • Cannot input a policy (full/partial)
  • 48. 4. Knowledge Requirements
    • Options
      • Requires complete specification of policy.
      • One could learn option policies – given subtasks.
    • HAM
      • Medium requirements
    • MAXQ
      • Minimal requirements
  • 49. 5. Models advanced
    • Options : Concurrency
    • HAM : Richer representation, Concurrency
    • MAXQ : Continuous time, state, actions; Multi-agents, Average-reward.
    • In general, more researchers have followed MAXQ
      • Less input knowledge
      • Value decomposition
  • 50. 6. Structure Paradigm
    • S : Options, MAXQ
    • A : All
    • P : None
    • R : MAXQ
    • G : All
    • V : MAXQ
    • $\pi$ : All
  • 51. The Outline of the Talk
    • MDPs and Bellman’s curse of dimensionality.
    • RL: Simultaneous learning and planning.
    • Explore avenues to speed up RL.
    • Illustrate prominent HRL methods.
    • Compare prominent HRL methods.
    • Discuss future research.
    • Summarise
  • 52. Directions for Future Research
    • Bidirectional State Abstractions
    • Hierarchies over other RL research
      • Model based methods
      • Function Approximators
    • Probabilistic Planning
      • Hierarchical P and Hierarchical R
    • Imitation Learning
  • 53. Directions for Future Research
    • Theory
      • Bounds (goodness of hierarchy)
      • Non-asymptotic analysis
    • Automated Discovery
      • Discovery of Hierarchies
      • Discovery of State Abstraction
    • Apply…
  • 54. Applications
    • Toy Robot
    • Flight Simulator
    • AGV Scheduling
    • Keepaway soccer
    Images courtesy of various sources. [Figure: AGV scheduling layout labelled Parts, Assemblies, Warehouse, P1 to P4 and D1 to D4.]
  • 55. Thinking Big…
    • “... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*.” -- David Andre
    • Use planners, theorem provers, etc. as components in big hierarchical solver.
  • 56. The Outline of the Talk
    • MDPs and Bellman’s curse of dimensionality.
    • RL: Simultaneous learning and planning.
    • Explore avenues to speed up RL.
    • Illustrate prominent HRL methods.
    • Compare prominent HRL methods.
    • Discuss future research.
    • Summarise
  • 57. How to choose appropriate hierarchy
    • Look at available domain knowledge
      • If some behaviours are completely specified – options
      • If some behaviours are partially specified – HAM
      • If less domain knowledge available – MAXQ
    • We can use all three to specify different behaviours in tandem.
  • 58. The Structure Paradigm
    • Organised way to view optimisations.
    • Assists in figuring out unexploited avenues for speedup.
  • 59. Main ideas in HRL community
    • Hierarchies speed up learning
    • Value function decomposition
    • State Abstractions
    • Greedy non-hierarchical execution
    • Context-free learning and pseudo-rewards
    • Policy improvement by re-estimation and re-learning.