Dana Nau and Vikas Shivashankar: Lecture slides for Automated Planning and Acting (updated 4/24/15)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Chapter 4
Deliberation with Probabilistic Domain Models
Dana S. Nau and Vikas Shivashankar
University of Maryland
Probabilistic Planning Domain
●  Actions have multiple possible outcomes
Ø  Each outcome has a probability
●  Several possible action representations
Ø  Bayes nets, probabilistic operators, …
●  Book doesn’t commit to any representation
Ø  Only deals with the underlying semantics
●  Σ = (S,A,γ,Pr,cost)
Ø  S = set of states
Ø  A = set of actions
Ø  γ : S × A → 2S
Ø  Pr(s′ | s, a) = probability of going to state s′ if we perform a in s
•  Require Pr(s′ | s, a) > 0 for every s′ in γ(s,a)
Ø  cost: S × A → +
•  cost(s,a) = cost of action a in state s
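Not from the slides: a minimal Python sketch of one way the tuple Σ = (S, A, γ, Pr, cost) could be represented. All names here (SSPDomain, gamma, prob, cost, applicable) are illustrative assumptions, not notation from the book.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, List, Tuple

State = Hashable
Action = Hashable

@dataclass
class SSPDomain:
    """Illustrative container for a probabilistic planning domain Σ = (S, A, γ, Pr, cost)."""
    states: FrozenSet[State]                              # S
    actions: FrozenSet[Action]                            # A
    gamma: Dict[Tuple[State, Action], FrozenSet[State]]   # γ(s,a) ⊆ S
    prob: Dict[Tuple[State, Action, State], float]        # Pr(s' | s, a), > 0 for every s' in γ(s,a)
    cost: Dict[Tuple[State, Action], float]               # cost(s, a)

    def applicable(self, s: State) -> List[Action]:
        """Actions a such that γ(s, a) is defined and nonempty."""
        return [a for a in self.actions if self.gamma.get((s, a))]
```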
Notation from Chapter 5
●  Policy π : Sπ → A
Ø  Sπ ⊆ S
Ø  ∀s ∈ Sπ , π(s) ∈ Applicable(s)
●  γ̂(s,π) = {s and all descendants of s reachable by π}
Ø  the transitive closure of γ with π
●  Graph(s,π) = rooted graph induced by π
Ø  {nodes} = γ̂(s,π)
Ø  {edges} = ∪a∈Applicable(s) {(s,s′) | s′ ∈ γ(s,a)}
Ø  root = s
●  leaves(s,π) = {states in γ̂(s,π) that aren’t in Sπ}
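As an illustration only (reusing the hypothetical SSPDomain sketch above, with a policy given as a dict from states to actions), γ̂(s,π) and leaves(s,π) amount to a simple reachability traversal:

```python
from collections import deque

def reachable_and_leaves(domain, s0, policy):
    """Return (γ̂(s0,π), leaves(s0,π)): states reachable from s0 under π, and those where π is undefined."""
    reachable = {s0}
    frontier = deque([s0])
    while frontier:
        s = frontier.popleft()
        if s not in policy:                        # π stops here; s is a leaf
            continue
        for s_next in domain.gamma[(s, policy[s])]:
            if s_next not in reachable:
                reachable.add(s_next)
                frontier.append(s_next)
    leaves = {s for s in reachable if s not in policy}
    return reachable, leaves
```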
Stochastic Systems
●  Stochastic shortest path (SSP) problem: a triple (Σ, s0, Sg)
●  Solution for (Σ, s0, Sg):
Ø  policy π such that s0 ∈ Sπ and leaves(s0,π) ⋂ Sg ≠ ∅
●  π is closed if π doesn’t stop at non-goal states unless no action is applicable
Ø  for every state in γ︎(s,π), either
•  s ∈ Sπ (i.e., π specifies an action at s)
•  s ∈ Sg (i.e., s is a goal state)
•  Applicable(s) = ∅ (i.e., there are no applicable actions at s)
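Continuing the same illustrative sketch (a hypothetical helper, not the book's code), the closedness condition above can be checked directly from the definitions:

```python
def is_closed(domain, s0, goals, policy):
    """True if π never stops at a reachable non-goal state that still has an applicable action."""
    reachable, _ = reachable_and_leaves(domain, s0, policy)
    for s in reachable:
        if s in policy or s in goals:
            continue
        if domain.applicable(s):                   # π stops here although it could act
            return False
    return True
```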
Policies
●  Robot r1 starts at location l1
Ø  s0 = s1 in the diagram
●  Objective is to get r1 to location l4
Ø  Sg = {s4}
●  π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4))}
Ø  Solution?
Ø  Closed?
[Figure: example domain — robot r1 moves among locations l1–l5; Start = s1, Goal = s4]
Histories
●  History: a sequence of states, starting at s0
σ = 〈s0, s1, s2, s3, …, sh〉
or (not in book): σ = 〈s0, s1, s2, s3, …〉
●  Let H(π) = {all histories that can be produced by following π from s0 to a state in leaves(s0, π)}
●  If σ ∈ H(π) then Pr (σ | π) = ∏i ≥ 0 Pr (si+1 | si, π(si))
Ø  Thus ∑σ ∈ H(π) Pr (σ | π) = 1
●  Probability that π will stop at a goal state:
Ø  Pr (Sg | π) = ∑ {Pr (σ | π) | σ ∈ H(π) and σ ends at a state in Sg}
Unsafe Solutions
●  A solution π is unsafe if
Ø  0 < Pr (Sg | π) < 1
●  Example:
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4))}
●  H(π1) contains two histories:
Ø  σ1 = 〈s1, s2, s3, s4〉   Pr (σ1 | π1) = 1 × 0.8 × 1 = 0.8
Ø  σ2 = 〈s1, s2, s5〉   Pr (σ2 | π1) = 1 × 0.2 = 0.2
●  Pr (Sg | π1) = 0.8
[Figure: example domain with an explicit dead end; Start = s1, Goal = s4]
Unsafe Solutions
●  A solution π is unsafe if
Ø  0 < Pr (Sg | π) < 1
●  Example:
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l6)), (s6, move(r1,l6,l5))}
●  H(π2) contains two histories:
Ø  σ1 = 〈s1, s2, s3, s4〉   Pr (σ1 | π2) = 1 × 0.8 × 1 = 0.8
Ø  σ3 = 〈s1, s2, s5, s6, s5, s6, …〉   Pr (σ3 | π2) = 1 × 0.2 × 1 × 1 × 1 × … = 0.2
●  Pr (Sg | π2) = 0.8
[Figure: example domain extended with state s6 = at(r1,l6) and a wait action; the figure labels this an implicit dead end; Start = s1, Goal = s4]
Safe Solutions
●  A solution π is safe if
Ø  Pr (Sg | π) = 1
●  An acyclic safe solution:
π3 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l4))}
●  H(π3) contains two histories:
Ø  σ1 = 〈s1, s2, s3, s4〉   Pr (σ1 | π3) = 1 × 0.8 × 1 = 0.8
Ø  σ4 = 〈s1, s2, s5, s4〉   Pr (σ4 | π3) = 1 × 0.2 × 1 = 0.2
●  Pr (Sg | π3) = 0.8 + 0.2 = 1
Safe Solutions
●  A solution π is safe if
Ø  Pr (Sg | π) = 1
●  A cyclic safe solution:
π4 = {(s1, move(r1,l1,l4))}
●  H(π4) contains infinitely many histories:
Ø  σ5 = 〈s1, s4〉   Pr (σ5 | π4) = 0.5
Ø  σ6 = 〈s1, s1, s4〉   Pr (σ6 | π4) = 0.5 × 0.5 = 0.25
Ø  σ7 = 〈s1, s1, s1, s4〉   Pr (σ7 | π4) = 0.5 × 0.5 × 0.5 = 0.125
• • •
Ø  σ∞ = 〈s1, s1, s1, s1, …〉   Pr (σ∞ | π4) = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × … = 0
●  Pr (Sg | π4) = 0.5 + 0.25 + 0.125 + … = 1
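For cyclic policies such as π4, Pr(Sg | π) can also be obtained without enumerating histories: it satisfies p(s) = 1 for goal states and p(s) = ∑s′ Pr(s′ | s, π(s)) p(s′) elsewhere, and iterating from 0 converges toward the goal-reaching probability. A rough sketch (an approximation by repeated sweeps, illustrative only, same hypothetical SSPDomain):

```python
def goal_reach_probability(domain, policy, goals, n_sweeps=1000):
    """Approximate Pr(Sg | π) from every state by iterating the reachability equations."""
    p = {s: (1.0 if s in goals else 0.0) for s in domain.states}
    for _ in range(n_sweeps):
        for s in domain.states:
            if s in goals or s not in policy:
                continue
            a = policy[s]
            p[s] = sum(domain.prob[(s, a, s2)] * p[s2] for s2 in domain.gamma[(s, a)])
    return p
```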
  
Example
●  Example:
π = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, move(r1,l4,l1)), (s5, move(r1,l5,l1))}
●  What is Pr (Sg | π)?
Expected Cost
●  cost(s,a) = cost of using a in s
●  Example:
Ø  cost(s,a) = 1 for each “horizontal” action
Ø  cost(s,a) = 100 for each “vertical” action
●  Cost of a history:
Ø  Let σ = 〈s0, s1, …〉 ∈ H(π)
Ø  cost(σ | π) = ∑i ≥ 0 cost(si, π(si))
●  Let π be a safe solution
●  Expected cost of following π to a goal:
Ø  Vπ(s) = 0 if s is a goal
Ø  Vπ(s) = cost(s,π(s)) + ∑s′∈γ(s,π(s)) Pr (s′ | s, π(s)) Vπ(s′) otherwise
●  If s = s0 then
Ø  Vπ(s0) = ∑σ ∈ H(π) cost(σ | π) Pr(σ | s0, π)
[Figure: applying π(s) in state s leads to successor states s1, …, sn]
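A corresponding sketch of evaluating Vπ numerically by sweeping the two equations above (illustrative; exact evaluation would solve the equivalent linear system instead):

```python
def policy_value(domain, policy, goals, n_sweeps=1000):
    """Approximate Vπ(s) for a safe solution π by repeated sweeps of the recursion."""
    V = {s: 0.0 for s in domain.states}
    for _ in range(n_sweeps):
        for s in domain.states:
            if s in goals or s not in policy:
                continue
            a = policy[s]
            V[s] = domain.cost[(s, a)] + sum(
                domain.prob[(s, a, s2)] * V[s2] for s2 in domain.gamma[(s, a)])
    return V
```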
Example
π3 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l4))}

H(π3) contains two histories:
σ1 = 〈s1, s2, s3, s4〉   Pr (σ1 | π3) = 0.8   cost(σ1 | π3) = 100 + 1 + 100 = 201
σ2 = 〈s1, s2, s5, s4〉   Pr (σ2 | π3) = 0.2   cost(σ2 | π3) = 100 + 1 + 100 = 201

Vπ3(s1) = 201 × 0.8 + 201 × 0.2 = 201
Safe Solutions
π4 = {(s1, move(r1,l1,l4))}
●  H(π4) contains infinitely many histories:
Ø  σ5 = 〈s1, s4〉   Pr (σ5 | π4) = 0.5   cost(σ5 | π4) = 1
Ø  σ6 = 〈s1, s1, s4〉   Pr (σ6 | π4) = 0.25   cost(σ6 | π4) = 2
Ø  σ7 = 〈s1, s1, s1, s4〉   Pr (σ7 | π4) = 0.125   cost(σ7 | π4) = 3
• • •
●  Vπ4(s1) = 1 × 0.5 + 2 × 0.25 + 3 × 0.125 + 4 × 0.0625 + … = 2
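The same value also follows in one line from the recursive definition of Vπ on the Expected Cost slide, without summing the series: with cost(s1, π4(s1)) = 1, outcome probabilities ½ and ½, and Vπ4(s4) = 0,

Vπ4(s1) = 1 + ½ Vπ4(s1) + ½ Vπ4(s4) = 1 + ½ Vπ4(s1), hence Vπ4(s1) = 2.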
Planning as Optimization
●  Let π and π′ be safe solutions
●  π dominates π′ if Vπ(s) ≤ Vπ′(s) for every state where both π and π′ are defined
Ø  i.e., Vπ(s) ≤ Vπ′(s) for every s in Sπ ∩ Sπ′
●  π is optimal if π dominates every safe solution π′
●  V*(s) = min{Vπ(s) | π is a safe solution for which π(s) is defined}
= expected cost of getting from s to a goal using an optimal safe solution
●  Optimality principle (also called Bellman’s theorem):
Ø  V*(s) = 0, if s is a goal
Ø  V*(s) = mina∈Applicable(s) {cost(s,a) + ∑s′ ∈ γ(s,a) Pr (s′ | s,a) V*(s′)}, otherwise
Policy Iteration
●  Let (Σ,s0,Sg) be a safe SSP (i.e., Sg is reachable from every state)
●  Let π be a safe solution that is defined at every state in S
●  Let s be a state, and let a ∈ Applicable(s)
Ø  Cost-to-go: expected cost at s if we start with a, and use π afterward
Ø  Qπ(s,a) = cost(s,a) + ∑s′ ∈ γ(s,a) Pr (s′ | s,a) Vπ(s′)
●  For every s, let π′(s) ∈ argmina∈Applicable(s) Qπ(s,a)
Ø  Then π′ is a safe solution and dominates π
●  PI(Σ, s0, Sg, π0)
    π ← π0
    loop
        compute Vπ (n equations and n unknowns, where n = |S|)
        for every non-goal state s do
            π′(s) ← any action in argmina∈Applicable(s) Qπ(s,a)
        if π′ = π then return π
        π ← π′
●  Converges in a finite number of iterations
Tie-breaking rule: if π(s) ∈ argmina∈Applicable(s) Qπ(s,a), then use π′(s) = π(s)
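A compact, runnable sketch of PI under these assumptions (illustrative names; it reuses the hypothetical policy_value sketch from the Expected Cost slide for the evaluation step, which approximates the linear system by iteration rather than solving it exactly):

```python
def q_value(domain, V, s, a):
    """Cost-to-go Qπ(s,a) = cost(s,a) + Σ Pr(s'|s,a) Vπ(s')."""
    return domain.cost[(s, a)] + sum(
        domain.prob[(s, a, s2)] * V[s2] for s2 in domain.gamma[(s, a)])

def policy_iteration(domain, goals, policy0):
    """Alternate policy evaluation and greedy improvement until π is stable."""
    policy = dict(policy0)
    while True:
        V = policy_value(domain, policy, goals)              # evaluation step
        new_policy = {}
        for s in domain.states:
            if s in goals:
                continue
            best = min(domain.applicable(s), key=lambda a: q_value(domain, V, s, a))
            # tie-breaking rule from the slide: keep π(s) if it is already a minimizer
            if q_value(domain, V, s, policy[s]) <= q_value(domain, V, s, best) + 1e-9:
                best = policy[s]
            new_policy[s] = best
        if new_policy == policy:
            return policy
        policy = new_policy
```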
Example
Start with
π0 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l4))}
Vπ(s4) = 0
Vπ(s3) = 100 + Vπ(s4) = 100
Vπ(s5) = 100 + Vπ(s4) = 100
Vπ(s2) = 1 + (0.8 Vπ(s3) + 0.2 Vπ(s5)) = 101
Vπ(s1) = 100 + Vπ(s2) = 201

Q(s1, move(r1,l1,l2)) = 100 + 101 = 201
Q(s1, move(r1,l1,l4)) = 1 + ½ × 201 + ½ × 0 = 101.5
argmin = move(r1,l1,l4)
Q(s2, move(r1,l2,l3)) = 1 + (0.8 × 100 + 0.2 × 100) = 101
Q(s2, move(r1,l2,l1)) = 100 + 201 = 301
argmin = move(r1,l2,l3)
Q(s3, move(r1,l3,l4)) = 100 + 0 = 100
Q(s3, move(r1,l3,l2)) = 100 + 101 = 201
argmin = move(r1,l3,l4)
Q(s5, move(r1,l5,l4)) = 100 + 0 = 100
Q(s5, move(r1,l5,l2)) = 100 + 101 = 201
argmin = move(r1,l5,l4)
Example
π = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s5, move(r1,l5,l4))}
Vπ(s4) = 0
Vπ(s3) = 100 + Vπ(s4) = 100
Vπ(s5) = 100 + Vπ(s4) = 100
Vπ(s2) = 1 + (0.8 Vπ(s3) + 0.2 Vπ(s5)) = 101
Vπ(s1) = 1 + ½ Vπ(s1) + ½ Vπ(s4) = 2

Q(s1, move(r1,l1,l2)) = 100 + 101 = 201
Q(s1, move(r1,l1,l4)) = 1 + ½ × 2 + ½ × 0 = 2
argmin = move(r1,l1,l4)
Q(s2, move(r1,l2,l3)) = 1 + (0.8 × 100 + 0.2 × 100) = 101
Q(s2, move(r1,l2,l1)) = 100 + 2 = 102
argmin = move(r1,l2,l3)
Q(s3, move(r1,l3,l4)) = 100 + 0 = 100
Q(s3, move(r1,l3,l2)) = 100 + 101 = 201
argmin = move(r1,l3,l4)
Q(s5, move(r1,l5,l4)) = 100 + 0 = 100
Q(s5, move(r1,l5,l2)) = 100 + 101 = 201
argmin = move(r1,l5,l4)
Value Iteration (Synchronous Version)
●  Let (Σ,s0,Sg) be a safe SSP
●  Start with an arbitrary cost V(s) for each s and a small η > 0
VI(Σ, s0, Sg, V)
    π ← ∅
    loop
        Vold ← V
        for every non-goal state s do
            for every a ∈ Applicable(s) do
                Q(s,a) ← cost(s,a) + ∑s′ ∈ S Pr (s′ | s,a) Vold(s′)
            V(s) ← mina∈Applicable(s) Q(s,a)
            π(s) ← argmina∈Applicable(s) Q(s,a)
        if maxs ∈ S ∖ Sg |V(s) – Vold(s)| < η then exit the loop and return π
●  |V(s) – Vold(s)| is the residual of s
●  maxs ∈ S ∖ Sg |V(s) – Vold(s)| is the residual
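A matching Python sketch of the synchronous version (illustrative only, same hypothetical SSPDomain as before):

```python
def value_iteration(domain, goals, eta=0.2):
    """Synchronous VI: sweep all non-goal states using Vold until the residual drops below η."""
    V = {s: 0.0 for s in domain.states}          # arbitrary initial values
    policy = {}
    while True:
        V_old = dict(V)
        for s in domain.states:
            if s in goals:
                continue
            q = {a: domain.cost[(s, a)]
                    + sum(domain.prob[(s, a, s2)] * V_old[s2] for s2 in domain.gamma[(s, a)])
                 for a in domain.applicable(s)}
            best = min(q, key=q.get)
            V[s], policy[s] = q[best], best
        residual = max(abs(V[s] - V_old[s]) for s in domain.states if s not in goals)
        if residual < eta:
            return V, policy
```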
Example
●  aij = the action that moves from si to sj
Ø  e.g., a12 = move(r1,l1,l2))	
  
●  η = 0.2
●  V(s) = 0 for all s
Q(s1, a12) = 100 + 0 = 100
Q(s1, a14) = 1 + (½×0 + ½×0) = 1
min = 1
Q(s2, a21) = 100 + 0 = 100
Q(s2, a23) = 1 + (½×0 + ½×0) = 1
min = 1
Q(s3, a32) = 1 + 0 = 1
Q(s3, a34) = 100 + 0 = 100
min = 1
Q(s5, a52) = 1 + 0 = 1
Q(s5, a54) = 100 + 0 = 100
min = 1
residual = max(1–0, 1–0, 1–0, 1–0) = 1
Example
●  V(s1) = 1; V(s2) = 1; V(s3) = 1; V(s4) = 0; V(s5) = 1
Q(s1, a12) = 100 + 1 = 101
Q(s1, a14) = 1 + (½×1 + ½×0) = 1½
min = 1½
Q(s2, a21) = 100 + 1 = 101
Q(s2, a23) = 1 + (½×1 + ½×1) = 2
min = 2
Q(s3, a32) = 1 + 1 = 2
Q(s3, a34) = 100 + 0 = 100
min = 2
Q(s5, a52) = 1 + 1 = 2
Q(s5, a54) = 100 + 0 = 100
min = 2
residual = max(1½–1, 2–1, 2–1, 2–1) = 1
Example
●  V(s1) = 1½; V(s2) = 2; V(s3) = 2; V(s4) = 0; V(s5) = 2
Q(s1, a12) = 100 + 2 = 102
Q(s1, a14) = 1 + (½×1½ + ½×0) = 1¾
min = 1¾
Q(s2, a21) = 100 + 1½ = 101½
Q(s2, a23) = 1 + (½×2 + ½×2) = 3
min = 3
Q(s3, a32) = 1 + 2 = 3
Q(s3, a34) = 100 + 0 = 100
min = 3
Q(s5, a52) = 1 + 2 = 3
Q(s5, a54) = 100 + 0 = 100
min = 3
residual = max(1¾–1½, 3–2, 3–2, 3–2) = 1
Example
●  V(s1) = 1¾; V(s2) = 3; V(s3) = 3; V(s4) = 0; V(s5) = 3
Q(s1, a12) = 100 + 3 = 103
Q(s1, a14) = 1 + (½×1¾ + ½×0) = 1⅞
min = 1⅞
Q(s2, a21) = 100 + 1¾ = 101¾
Q(s2, a23) = 1 + (½×3 + ½×3) = 4
min = 4
Q(s3, a32) = 1 + 3 = 4
Q(s3, a34) = 100 + 0 = 100
min = 4
Q(s5, a52) = 1 + 3 = 4
Q(s5, a54) = 100 + 0 = 100
min = 4
residual = max(1⅞–1¾, 4–3, 4–3, 4–3) = 1
●  How long before residual < η = 0.2?
●  How long if the “vertical” actions cost 10 instead of 100?
Discussion
●  Policy iteration computes an entire policy in each iteration, and computes values based on that policy
Ø  More work per iteration, because it needs to solve a set of simultaneous equations
Ø  Usually converges in a smaller number of iterations
●  Value iteration computes new values in each iteration, and chooses a policy based on those values
Ø  In general, the values are not the values that one would get from the chosen policy or any other policy
Ø  Less work per iteration, because it doesn’t need to solve a set of equations
Ø  Usually takes more iterations to converge
●  What I showed you was the synchronous version of Value Iteration
•  For each s, compute new values of Q and V using Vold
Ø  Asynchronous version: compute new values of Q and V using V
•  New values may depend on which nodes have already been updated
Value Iteration
●  Start with an arbitrary cost V(s) for each s, and a small η > 0
●  Synchronous version:
VI(Σ, s0, Sg, V)
    π ← ∅
    loop
        Vold ← V
        for every s ∈ S ∖ Sg do
            for every a ∈ Applicable(s) do
                Q(s,a) ← cost(s,a) + ∑s′ ∈ S Pr (s′ | s,a) Vold(s′)
            V(s) ← mina∈Applicable(s) Q(s,a)
            π(s) ← argmina∈Applicable(s) Q(s,a)
        if maxs ∈ S ∖ Sg |V(s) – Vold(s)| < η then return π
●  maxs ∈ S ∖ Sg |V(s) – Vold(s)| is the residual
●  |V(s) – Vold(s)| is the residual of s
●  Asynchronous version:
VI(Σ, s0, Sg, V)
    π ← ∅
    loop
        r ← 0   // the residual
        for every s ∈ S ∖ Sg do
            r ← max(r, Bellman-Update(s,V,π))
        if r < η then return π
Bellman-Update(s,V,π)
    vold ← V(s)
    for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr (s′|s,a) V(s′)
    V(s) ← mina∈Applicable(s) Q(s,a)
    π(s) ← argmina∈Applicable(s) Q(s,a)
    return |V(s) – vold|
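And a corresponding sketch of the asynchronous version, where each Bellman update reads the latest V in place (illustrative only, same hypothetical SSPDomain):

```python
def bellman_update(domain, V, policy, s):
    """Update V(s) and π(s) in place; return the residual |V(s) - vold| of s."""
    v_old = V[s]
    q = {a: domain.cost[(s, a)]
            + sum(domain.prob[(s, a, s2)] * V[s2] for s2 in domain.gamma[(s, a)])
         for a in domain.applicable(s)}
    best = min(q, key=q.get)
    V[s], policy[s] = q[best], best
    return abs(V[s] - v_old)

def async_value_iteration(domain, goals, eta=0.2):
    """Asynchronous VI: repeated Bellman updates until a full sweep has residual < η."""
    V = {s: 0.0 for s in domain.states}
    policy = {}
    while True:
        residual = 0.0
        for s in domain.states:
            if s not in goals:
                residual = max(residual, bellman_update(domain, V, policy, s))
        if residual < eta:
            return V, policy
```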
Discussion (Continued)
●  For both, the number of iterations is polynomial in the number of states
Ø  But the number of states is usually quite large
Ø  In each iteration, need to examine the entire state space
●  Thus, these algorithms can take huge amounts of time and space
●  Use search techniques to avoid searching the entire space
AO* (requires Σ to be acyclic)
AO*(Σ, s0, Sg, h)
    π ← ∅; V ← ∅
    Envelope ← {s0}   // all generated states
    loop
        if leaves(s0,π) ⊆ Sg then return π
        select s ∈ leaves(s0,π) ∖ Sg
        for all a ∈ Applicable(s)
            for all s′ ∈ γ(s,a) ∖ Envelope do
                V(s′) ← h(s′); add s′ to Envelope
        AO-Update(s,V,π)
    return π
AO-Update(s,V,π)
    Z ← {s}   // ancestors of updated nodes
    while Z ≠ ∅ do
        select any s ∈ Z such that γ(s,π(s)) ∩ Z = ∅
        remove s from Z
        Bellman-Update(s,V,π)
        Z ← Z ∪ {s′ ∈ Sπ | s ∈ γ(s′,π(s′))}
●  h is the heuristic function
Ø  Must have h(s) = 0 for every s in Sg
Ø  Example: h(s) = 0 for all s
Bellman-Update(s,V,π)
    vold ← V(s)
    for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr (s′|s,a) V(s′)
    V(s) ← mina∈Applicable(s) Q(s,a)
    π(s) ← argmina∈Applicable(s) Q(s,a)
    return |V(s) – vold|
[Figure: small acyclic example graph over states s1–s6, with action costs c = 1, 100, 10, 20, 1 and outcome probabilities 0.2/0.8 and 0.5/0.5; Start and Goal marked]
LAO* (can handle cycles)
LAO*(Σ, s0, Sg, h)
    π ← ∅; V ← ∅
    Envelope ← {s0}   // all generated states
    loop
        if leaves(s0,π) ⊆ Sg then return π
        select s ∈ leaves(s0,π) ∖ Sg
        for all a ∈ Applicable(s)
            for all s′ ∈ γ(s,a) ∖ Envelope do
                V(s′) ← h(s′); add s′ to Envelope
        LAO-Update(s,V,π)
    return π
LAO-Update(s,V,π)
    Z ← {s} ∪ {s′ ∈ Sπ | s ∈ γ(s′,π)}
    for every s ∈ Z do Bellman-Update(s,V,π)
    loop
        residual ← max{Bellman-Update(s,V,π) | s ∈ Sπ}
        if leaves(s0,π) changed or residual ≤ η then break
Bellman-Update(s,V,π)
    vold ← V(s)
    for every a ∈ Applicable(s) do
        Q(s,a) ← cost(s,a) + ∑s′∈S Pr (s′|s,a) V(s′)
    V(s) ← mina∈Applicable(s) Q(s,a)
    π(s) ← argmina∈Applicable(s) Q(s,a)
    return |V(s) – vold|
