Hierarchical POMDP Planning and Execution
Joelle Pineau
Machine Learning Lunch, November 20, 2000
Partially Observable MDP
POMDPs are characterized by:
- States: s ∈ S
- Actions: a ∈ A
- Observations: o ∈ O
- Transition probabilities: T(s,a,s') = Pr(s'|s,a)
- Observation probabilities: O(o,a,s') = Pr(o|s',a)
- Rewards: R(s,a)
- Beliefs: b(s_t) = Pr(s_t | o_t, a_t, ..., o_0, a_0)
[Figure: POMDP state diagram with states S1, S2, S3]
The problem
How can we find good policies for complex POMDPs? Is there a principled way to provide near-optimal policies?
Proposed Approach
Exploit structure in the problem domain. What type of structure? Action set partitioning.
[Figure: action set partitioning tree with internal nodes Act, InvestigateHealth, Move, Navigate, and primitive actions CheckPulse, CheckMeds, AskWhere, Left, Right, Up, Down]
Hierarchical POMDP Planning
What do we start with?
- A full POMDP model: {S_o, A_o, O_o, M_o}.
- An action set partitioning graph.
Key idea: break the problem into many "related" POMDPs, each of which has only a subset of A_o, thereby imposing a policy constraint.
But why? Exact POMDP value iteration has exponential run-time per step, generating up to O(|A| |V_{n-1}|^{|O|}) value vectors.
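The blow-up can be seen by counting value vectors: exact value iteration can produce up to |A|·|V_{t-1}|^{|O|} vectors per step, so a subtask restricted to a subset of A_o is dramatically cheaper. A rough sketch with illustrative sizes:

```python
def alpha_vector_bound(num_actions, num_obs, horizon):
    """Worst-case number of alpha-vectors after each step of exact
    POMDP value iteration: |V_t| = |A| * |V_{t-1}|^|O|, |V_0| = 1."""
    n = 1
    for _ in range(horizon):
        n = num_actions * n ** num_obs
    return n

# Full 4-action problem vs. a 2-action subtask, 4 observations, horizon 2:
full = alpha_vector_bound(num_actions=4, num_obs=4, horizon=2)  # 1024
sub = alpha_vector_bound(num_actions=2, num_obs=4, horizon=2)   # 32
```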
Example
[Figure: state transition diagram over M, K, B with transition probabilities 0.8 and 0.1]
POMDP:
- S_o = {Meds, Kitchen, Bedroom}
- A_o = {ClarifyTask, CheckMeds, GoToKitchen, GoToBedroom}
- O_o = {Noise, Meds, Kitchen, Bedroom}
[Figure: value function over belief space, with policy regions GoToKitchen, ClarifyTask, GoToBedroom, CheckMeds across MedsState, KitchenState, BedroomState]
Hierarchical POMDP
Action partitioning:
- Act: Move, CheckMeds
- Move: ClarifyTask, GoToKitchen, GoToBedroom
Local Value Function and Policy: Move Controller
[Figure: local value function over belief space, with policy regions ClarifyTask, GoToKitchen, GoToBedroom across MedsState, KitchenState, BedroomState]
Modeling Abstract Actions
Problem: we need model parameters for the abstract action Move.
Solution: use the local policy of the corresponding low-level controller.
General form: Pr(s_j | s_i, a_k^abstract) = Pr(s_j | s_i, Policy(a_k^abstract, s_i))
Example: Pr(s_j | MedsState, Move) = Pr(s_j | MedsState, ClarifyTask)
[Figure: Policy(Move, s_i) over belief space, with regions ClarifyTask, GoToKitchen, GoToBedroom across MedsState, KitchenState, BedroomState]
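This construction can be written directly: the abstract action's transition row for state s_i is copied from the primitive action that the local controller would choose there. A sketch, assuming hypothetical transition matrices; only the ClarifyTask-in-MedsState choice comes from the slide:

```python
import numpy as np

# Hypothetical primitive-action transition matrices over
# (MedsState, KitchenState, BedroomState); numbers are illustrative.
T = {
    "ClarifyTask": np.array([[0.8, 0.1, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.1, 0.1, 0.8]]),
    "GoToKitchen": np.array([[0.1, 0.8, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.1, 0.8, 0.1]]),
    "GoToBedroom": np.array([[0.1, 0.1, 0.8],
                             [0.1, 0.1, 0.8],
                             [0.1, 0.1, 0.8]]),
}

def abstract_transition(T, local_policy):
    """Pr(s'|s, a_abstract) = Pr(s'|s, Policy(a_abstract, s)):
    each row of the abstract model is the corresponding row of the
    primitive action the local controller chooses in that state."""
    rows = [T[local_policy[s]][s, :] for s in range(len(local_policy))]
    return np.vstack(rows)

# Move's local policy picks ClarifyTask in MedsState (as on the slide);
# the other two entries are illustrative.
policy_move = ["ClarifyTask", "GoToKitchen", "GoToBedroom"]
T_move = abstract_transition(T, policy_move)
```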
Local Value Function and Policy: Act Controller
[Figure: local value function over belief space, with policy regions Move and CheckMeds across MedsState, KitchenState, BedroomState]
Comparing Policies
[Figure: hierarchical policy vs. optimal policy over the belief simplex; legend: ClarifyTask, CheckMeds, GoToKitchen, GoToBedroom]
Bounding the value of the approximation
The value function of the top-level controller is an upper bound on the value of the approximation. Why? We were optimistic when modeling the abstract action.
Similarly, we can find a lower bound. How? We can take a "worst-case" view when modeling the abstract action.
If we partition the action set differently, we get different bounds.
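As an illustration of the optimistic/pessimistic idea (a simplified construction, not the talk's exact derivation): per state, model the abstract action's reward with the best primitive reward in its subtask for an upper bound, and with the worst for a lower bound.

```python
def abstract_reward_bound(R, subtask_actions, num_states, optimistic=True):
    """Optimistic (upper-bound) vs. pessimistic (lower-bound) reward
    model for an abstract action: take the best or worst primitive
    reward in the subtask, state by state. Illustrative sketch only."""
    pick = max if optimistic else min
    return [pick(R[a][s] for a in subtask_actions) for s in range(num_states)]

# Hypothetical per-state rewards for two primitive actions in a subtask:
R = {"GoToKitchen": [0.0, 1.0, 0.0], "GoToBedroom": [0.0, 0.0, 1.0]}
upper = abstract_reward_bound(R, ["GoToKitchen", "GoToBedroom"], 3, optimistic=True)
lower = abstract_reward_bound(R, ["GoToKitchen", "GoToBedroom"], 3, optimistic=False)
```

Planning with the two models brackets the value of the hierarchical approximation, and different action partitionings yield different brackets.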
A real dialogue management example
Subtask hierarchy (from the figure): Act at the root, with subtasks CheckHealth, Phone, DoMeds, CheckWeather, Move, Greet.
Primitive actions: AskGoWhere, GoToRoom, GoToKitchen, GoToFollow, VerifyRoom, VerifyKitchen, VerifyFollow, GreetGeneral, GreetMorning, GreetNight, RespondThanks, AskWeatherTime, SayCurrent, SayToday, SayTomorrow, StartMeds, NextMeds, ForceMeds, QuitMeds, AskCallWho, CallHelp, CallNurse, CallRelative, VerifyHelp, VerifyNurse, VerifyRelative, AskHealth, OfferHelp, SayTime.
Results:
Final words
We presented: a general framework to exploit structure in POMDPs.
Future work: automatic generation of good action partitionings; conditions for additional observation abstraction; bigger problems!