
# Economic Hierarchical Q-Learning


Slides from my oral presentation of my paper at AAAI'08 in Chicago. The paper was co-authored with Ruggiero Cavallo and David Parkes; it was based on my undergraduate thesis, which David advised.

• HRL is a variation on RL in which the problem is decomposed into a set of sub-problems. These sub-problems can then be solved more or less independently and their solutions combined into a solution to the original problem. This approach has several potential advantages. First, state abstraction: in many cases, certain aspects of the original state space can be ignored in the context of a particular sub-problem, allowing that sub-problem to be solved in a much smaller “abstract” state space. Second, the hierarchical structure of the decomposition lends itself to value decomposition: traditional RL Q-values can instead be expressed as a sum of several components, which can often be re-used, reducing the number of values that must be learned. Additionally, the solution policy for a given sub-problem may be re-usable in other parts of the hierarchy.
• Convert to a non-technical slide on HRL. Why HRL: it allows state abstraction and decomposition into sub-problems.
• To help illustrate these concepts, we introduce the HOFuel domain, constructed to emphasize the distinction between the RO and HO solution policies. It is a grid-world navigation task with a fuel constraint. Running into walls is a no-op with a penalty; add opti
• But HRL can introduce a tension for some domains; solving sub-problems without enough regard for how their individual solutions affect overall solution quality can lead to solutions that are sub-optimal from the perspective of the original problem. Additionally, the structure of the hierarchy itself may artificially limit solution quality. We thus differentiate between three concepts of optimality. The first, global optimality, is equivalent to the traditional notion of optimality in reinforcement learning.
• The second, hierarchical optimality, is equivalent to global optimality except where constrained by the hierarchy.
• The third, recursive optimality, is defined as each subtask being solved optimally with respect to the solutions of the sub-problems below it in the hierarchy. The globally optimal solution policy is always at least as good as the HO solution; similarly, the HO solution policy is always at least as good as the RO solution policy. RO is easier to achieve, because the agent only has to reason about local rewards. Resolving this tension is the focus of my work.
• We conceptualize the hierarchy as though each sub-problem is being solved by a different agent. Dietterich (2000) noted that exit-reward payments could alter incentives in the problem to make the RO and HO solutions equivalent. We took further inspiration from the Hayek system developed by Eric Baum, in which agents buy and sell control of the world to solve the problem in an evolutionary context. Hayek was itself based on Holland classifiers; both systems were applied to traditional RL, not HRL.
• HRL decompositions can improve learning speed by allowing extraneous state variables within a given subtask to be ignored within that subtask.
• EHQ follows this decomposition framework, as do several other HRL algorithms in the literature. Notably, not all of them model QE explicitly (or at all).
• ALispQ and HOCQ provide impressive HO convergence results; however, EHQ can achieve HO using a simple and decentralized pricing mechanism.
• Add rewards in timesteps ….
• Modeling QE allows for HO convergence, but QE often depends on many state variables, lessening the potential for state abstraction and slowing learning. In practice, we found it beneficial to limit Ej to the set of reachable exit-states, as discovered empirically during learning. (Briefly mention the other possible normalizations if time permits.)
• Replace this with a high-level overview of the algorithm? (i.e., the agent at each node in the hierarchy does a form of Q-learning to update its local QV and QC values. The parent models the expected reward of invoking a macroaction, implemented by a child agent, by receiving a “bid” from that agent: its expected reward for the given state. When the parent chooses a macroaction to invoke, control is passed to the child agent along with information about the subsidies that child will be paid for its possible exit-states. When the child reaches an exit-state, it receives the subsidy for the state it achieved. Control is returned to the parent, which receives reward equal to the child’s bid less the subsidy it paid the child.)
• Normalizing to min reachable (briefly mention the other possible normalizations if time permits)
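The transfer scheme in the notes above reduces to simple accounting for each macroaction invocation. A toy sketch, using the illustrative reward numbers from the slide walkthrough (the `settle_macroaction` helper is hypothetical, not from the paper):

```python
# Toy accounting for one EHQ macroaction invocation.
# The child accrues local rewards while it controls the world,
# pays its bid to the parent on exit, and receives the exit-state subsidy.
def settle_macroaction(local_rewards, bid, subsidy):
    """Return (child_net, parent_net) for one child invocation."""
    child_income = sum(local_rewards)          # reward accrued during execution
    child_net = child_income - bid + subsidy   # pays bid, receives subsidy
    parent_net = bid - subsidy                 # receives bid, pays subsidy
    return child_net, parent_net

# Numbers from the slide walkthrough: rewards +5, +2, -6, +3; bid 4; subsidy 1
child, parent = settle_macroaction([+5, +2, -6, +3], bid=4, subsidy=1)
```

Here the child nets (5 + 2 − 6 + 3) − 4 + 1 = 1 and the parent nets 4 − 1 = 3; the subsidy is what lets the parent steer the child toward exit-states it prefers.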
### Economic Hierarchical Q-Learning

1. Economic Hierarchical Q-Learning
   Erik G. Schultink, Ruggiero Cavallo and David C. Parkes
   Harvard University
   AAAI-08, July 17, 2008
2. Introduction
   - Economic paradigms applied to hierarchical reinforcement learning
   - Building on the work of:
     - The Holland classifier system (Holland 1986)
     - Eric Baum’s Hayek system, with competitive, evolutionary agents that buy and sell control of the world to collectively solve the problem (Baum et al. 1998)
   - Our thesis is that price systems can help resolve the tension between recursive optimality and hierarchical optimality
   - We introduce the EHQ algorithm
3. Hierarchical Reinforcement Learning
   - Decompose the problem into sub-problems: each sub-problem is solved by a different agent
   - Leaf nodes are primitive actions; non-leaf nodes are macroactions
   - State abstraction addresses the curse of dimensionality, leaving a smaller state space to explore
   - Rewards accrue only for primitive actions
   - Credit assignment problem: how to distribute reward in the system?
   - Example hierarchy: Root → {Eat Breakfast, Drive to work}; Eat Breakfast → {eat donut, drink coffee, eat cereal}; Drive to work → {stop, drive forward, turn right, turn left}
10. Hierarchical Reinforcement Learning
    Decompose an MDP, M, into a set of subtasks {M0, M1, …, Mn}, where Mi consists of:
    - Ti: termination predicate partitioning Mi into active states Si and exit-states Ei
    - Ai: set of actions that can be performed in Mi
    - Ri: local-reward function
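The subtask definition above maps naturally onto a small data structure. A minimal sketch (the names, the state encoding, and the example subtask are illustrative assumptions, not the paper's code):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

State = Tuple[int, ...]  # hypothetical state encoding, e.g. (x, y, fuel)

@dataclass(frozen=True)
class Subtask:
    """One subtask M_i in the hierarchy."""
    name: str
    actions: FrozenSet[str]                              # A_i
    terminated: Callable[[State], bool]                  # T_i: True in exit-states E_i
    local_reward: Callable[[State, str, State], float]   # R_i(s, a, s')

# Illustrative subtask: exit once the agent reaches column 0
leave_room = Subtask(
    name="leave-left-room",
    actions=frozenset({"north", "south", "east", "west", "fill-up"}),
    terminated=lambda s: s[0] == 0,
    local_reward=lambda s, a, s2: -1.0,  # unit step cost (an assumption)
)
```

The termination predicate implicitly partitions the subtask's states into active states and exit-states, matching the slide's Si / Ei split.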
11. Hierarchical Reinforcement Learning
    A hierarchical policy π is a set {π1, π2, …, πn}, where πi is a mapping from a state s to either a primitive action a or a sub-policy πj
12. HOFuel Domain
    - Grid-world navigation task
    - A = {north, south, east, west, fill-up}
    - The fill-up action is available only in the left-hand room
    - Begin with 5 units of fuel
    - Based on concepts described by Dietterich (2000)
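A minimal HOFuel-style environment sketch. Only the action set, the 5-unit fuel budget, the left-room-only fill-up, and penalized wall bumps come from the slides and notes; the grid dimensions, start/goal positions, and reward magnitudes are assumptions:

```python
class HOFuel:
    """Toy grid-world with a fuel constraint (layout and rewards assumed)."""
    ACTIONS = ("north", "south", "east", "west", "fill-up")
    MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}

    def __init__(self, width=6, height=3, start=(0, 1), goal=(5, 1), fuel=5):
        self.width, self.height = width, height
        self.pos, self.goal, self.fuel = start, goal, fuel

    def in_left_room(self):
        return self.pos[0] < self.width // 2  # fill-up only works here

    def step(self, action):
        """Return (reward, done) for one primitive action."""
        if action == "fill-up":
            if self.in_left_room():
                self.fuel = 5
            return -1.0, False                # step cost (assumed)
        dx, dy = self.MOVES[action]
        x, y = self.pos[0] + dx, self.pos[1] + dy
        if not (0 <= x < self.width and 0 <= y < self.height) or self.fuel == 0:
            return -1.0, False                # bumping a wall is a penalized no-op
        self.pos, self.fuel = (x, y), self.fuel - 1
        done = self.pos == self.goal
        return (10.0 if done else -1.0), done
```

The fuel constraint is what separates the RO and HO policies in this domain: a recursively optimal "Leave left room" agent has no local reason to fill up or to prefer one door over another.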
13. Hierarchy for HOFuel
    Root → {Leave left room, Reach goal}; primitive actions: fill-up, north, east, south, west
    fill-up is available only in the “Leave left room” macroaction
14. Optimality Concepts
    - Global Optimality
    - Hierarchical Optimality
    - Recursive Optimality

15. Optimality Concepts: Hierarchical Optimality
    A hierarchically optimal (HO) policy selects the same primitive actions as the optimal policy in every state, subject to the constraints of the hierarchy. (Dietterich 2000a)

16. Optimality Concepts: Recursive Optimality
    A policy is recursively optimal (RO) if, for each subtask Mi in the hierarchy, the policy πi is optimal given the policies for all descendants of Mi in the hierarchy.
17. Optimality in HOFuel
    (figure: hierarchically optimal vs. recursively optimal paths under the Root → {Leave left room, Reach goal} hierarchy)
18. Intuitive Motivation for EHQ
    Transfer between agents to incentivize “Leave left room” to choose the upper door over the lower door
19. Safe State Abstraction
    - To obtain hierarchical optimality, we must use state abstractions that are safe: the optimal policy in the original space is also optimal in the abstract space.
    - Principles for safe state abstraction are shown in [Dietterich 2000].
20. Value Decomposition
    Different HRL algorithms use different additive decompositions of Q(s,a). In the most general form, Q(s,a) can be decomposed into:
    - QV(i,s,a): expected discounted reward to i upon completion of a
    - QC(i,s,a): expected discounted reward to i after a completes, until i exits
    - QE(i,s,a): expected total discounted reward after subtask i exits
    QV and QC are local reward to subtask i; QE is reward not seen directly by subtask i. (Dietterich 2000a, Andre and Russell 2002)
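In the slide's notation, the three components sum to the full Q-value. A plain statement of this additive decomposition, consistent with Dietterich (2000a) and Andre and Russell (2002):

```latex
Q(i, s, a) \;=\; \underbrace{Q_V(i, s, a) + Q_C(i, s, a)}_{\text{local reward to subtask } i}
          \;+\; \underbrace{Q_E(i, s, a)}_{\text{reward not seen directly by subtask } i}
```

EHQ's subsidies stand in for the last term, which is why the child never needs to model what happens after its subtask exits.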
21. Decentralization
    An HRL algorithm is decentralized if every agent in the hierarchy needs only locally stored information to select an action.
22. Summary of Related HRL Algorithms (* shown only empirically)
    - HAMQ [Parr, Russell 1998]
    - MAXQQ [Dietterich 2000]
    - ALispQ [Andre and Russell 2002]
    - HOCQ [Marthi and Russell 2006]

25. EHQ Transfer System
    (figure: a parent agent with three child agents)
26. EHQ Transfer System
    Children submit bids (bid = V*(s) = expected reward they will obtain during execution, including the expected exit-state subsidy)

27. EHQ Transfer System
    Parent passes control to the “winning” child (based on the exploration policy)

28. EHQ Transfer System
    Child executes until it reaches an exit-state; reward accrues to the child (+5, +2, −6, +3 in the figure)

29. EHQ Transfer System
    Child returns control and pays its bid to the parent (+4 to the parent, −4 from the child)

30. EHQ Transfer System
    Parent pays the child a subsidy for the exit-state obtained (−1 from the parent, +1 to the child)

31. EHQ Subsidy Policy
    Rather than explicitly model QE, EHQ provides subsidies to the child subtask for the quality, from the perspective of the parent, of the exit-state the child achieves

32. EHQ Transfer System
    During execution, both parent and child update their local Q-values based on their stream of rewards
33. EHQ Algorithm (pseudocode figure)
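The bid/subsidy handoff on the preceding slides can be sketched in a few lines. Everything here (interfaces, γ = 1, learning parameters) is an illustrative assumption rather than the paper's exact algorithm:

```python
import random
from collections import defaultdict

class EHQAgent:
    """Agent for one subtask: does ordinary Q-learning on its local rewards."""
    def __init__(self, actions, alpha=0.5, epsilon=0.1):
        self.Q = defaultdict(float)   # local action values for this subtask
        self.actions = list(actions)
        self.alpha, self.epsilon = alpha, epsilon

    def bid(self, s):
        # Bid = V*(s): expected reward during execution, incl. exit subsidy
        return max(self.Q[(s, a)] for a in self.actions)

    def choose(self, s):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, r, s2, done):
        nxt = 0.0 if done else max(self.Q[(s2, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (r + nxt - self.Q[(s, a)])

def run_macroaction(child, step_fn, s, subsidies):
    """Parent invokes a child: the child bids, runs to an exit-state, and
    receives the exit subsidy; the parent nets bid minus subsidy paid."""
    bid = child.bid(s)
    while True:
        a = child.choose(s)
        s2, r, exited = step_fn(s, a)      # one primitive step in the world
        if exited:
            r += subsidies.get(s2, 0.0)    # subsidy announced by the parent
        child.update(s, a, r, s2, exited)
        s = s2
        if exited:
            return s, bid - subsidies.get(s, 0.0)
```

For example, with a one-step world `step_fn = lambda s, a: ("exit", -1.0, True)` and subsidies `{"exit": 2.0}`, the child learns from reward −1 + 2 and the parent's reward for the macroaction is its child's bid less the 2.0 subsidy, exactly the transfer pattern of slides 26–32.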
34. HOFuel Convergence Results (figure)
35. HOFuel Subsidy Convergence
    (figure, for the Root → {Leave left room, Reach goal} hierarchy)
36. Taxi Domain
    RO = HO in this domain, which is taken from [Dietterich 2000]
37. TaxiFuel Convergence
    - Taxi domain with a fuel constraint
    - EHQ appears to converge, but does not clearly surpass MAXQQ
    - Not a large difference between HO and RO solution quality

39. Conclusions
    - Simple, intuitive reward transfers between agents in the hierarchy can lead to HO convergence
    - EHQ limits the amount of knowledge a given agent in the hierarchy must have about the rest of the system
    - Modeling exit-state value is critical for achieving HO solution policies; EHQ uses subsidies for this purpose
40. References
    - Andre, D., and Russell, S. 2002. State abstraction for programmable reinforcement learning agents. In AAAI-02. Edmonton, Alberta: AAAI Press.
    - Baum, E. B., and Durdanovich, I. 1998. Evolution of cooperative problem-solving in an artificial economy. Journal of Artificial Intelligence Research.
    - Dean, T., and Lin, S.-H. 1995. Decomposition techniques for planning in stochastic domains. In IJCAI-95, 1121–1127. San Francisco, CA: Morgan Kaufmann Publishers.
    - Dietterich, T. G. 2000a. Hierarchical reinforcement learning with MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.

41. References
    - Dietterich, T. G. 2000b. State abstraction in MAXQ hierarchical reinforcement learning. Advances in Neural Information Processing Systems 12:994–1000.
    - Holland, J. 1986. Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems. In Machine Learning, volume 2. San Mateo, CA: Morgan Kaufmann.
    - Marthi, B.; Russell, S.; and Andre, D. 2006. A compact, hierarchically optimal Q-function decomposition. In UAI-06.
    - Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10.