Abstract: This PDSG workshop introduces basic concepts of Bellman Equations. Concepts covered are States, Actions, Rewards, Value Function, Discount Factor, Bellman Equation, Bellman Optimality, Deterministic vs. Non-Deterministic, Policy vs. Plan, and Lifespan Penalty.
Level: Intermediate
Requirements: Should have some prior familiarity with graph theory and basic statistics. No prior programming knowledge is required.
2. Introduction
• A method for calculating value functions in dynamic environments.
• Invented by Richard Ernest Bellman in 1953.
• Bellman is the father of Dynamic Programming, which led to modern Reinforcement Learning.
• Concepts include:
• Reward
• Discount Factor
• Deterministic vs. Non-Deterministic
• Plan vs Policy
3. Basics
• Terminology
S -> Set of all possible States
A -> Set of all possible Actions from a given State
s -> A specific State
a -> A specific Action
[Figure: a grid world of states Sa through Sm; Si is the Start Node and Sm is the Goal Node]
S -> { Sa, Sb, Sc … Sm }
Aa -> { Down, Right }
Ab -> { Down, Left }
Ac -> { Down, Left }
Ad -> { Up, Down, Right }
...
Am -> {} Goal State
s -> Si (for example, the Start Node)
a -> an action drawn from Ai -> { Up, Right }
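Below is a minimal sketch of how this grid world might be encoded in Python. The state names and action sets come from the slide; the dictionary layout, and deriving S and A from it, are illustrative assumptions.

```python
# Hypothetical encoding of the slide's grid world:
# each state maps to the set of actions available from it.
actions = {
    "Sa": {"Down", "Right"},
    "Sb": {"Down", "Left"},
    "Sc": {"Down", "Left"},
    "Sd": {"Up", "Down", "Right"},
    # ... remaining states elided, as on the slide ...
    "Si": {"Up", "Right"},
    "Sm": set(),  # Goal State: no actions lead out of it
}

S = set(actions)  # S -> set of all possible States
s = "Si"          # s -> a specific State (the Start Node)
A = actions[s]    # A -> actions available from s: {"Up", "Right"}
print(S, A)
```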
4. Reward
• Terminology
R -> The ‘Reward’ for being at some state.
v(s) -> Value Function – the anticipated reward for being at a specific state.
[Figure: the same grid world; the Goal Node Sm has R = 1, and all other (non-goal) nodes have R = 0]
v(Sa) -> 0
v(Sb) -> 0
v(Sc) -> 0
v(Sd) -> 0
…
v(Sl) -> 0
v(Sm) -> 1 (Goal State)
Without a Plan or Policy, a Reward cannot be anticipated until we reach the Goal Node.
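Continuing the sketch, the reward and the initial value function for this world could look as follows (a hypothetical encoding; only the Goal State carries a reward):

```python
# States of the grid world (abbreviated, as on the slide).
S = {"Sa", "Sb", "Sc", "Sd", "Si", "Sl", "Sm"}

# R: reward of 1 at the Goal State Sm, 0 at every other state.
R = {s: (1 if s == "Sm" else 0) for s in S}

# v(s): before any planning, only the goal's reward is anticipated.
v = dict(R)
print(v["Sm"], v["Si"])  # 1 0
```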
5. Discount Factor
• Terminology
t, t+1, t+2, … -> Time step intervals, each corresponding to an action and a new state.
R_{t+1} -> The Reward at the next time step, after an action has occurred.
S_{t+1} -> The State at the next time step, after an action has occurred.
γ -> (gamma) Discount Factor, between 0 and 1.
• The Discount Factor accounts for uncertainty in obtaining a future reward.
[Figure: a chain of states S_{n-2} → S_{n-1} → S_n, where S_n is the Goal Node with R = 1 and the other states have R = 0]
If at the Goal, receiving the reward is certain. If one step away, receiving the reward is less certain. Even further away, receiving the reward is even less certain.
6. Bellman Equation
• Principle of the Bellman Equation
v(s) = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + … + γⁿ·R_{t+n}
The value of some state s is the sum of rewards along the path to a terminal state, with the reward of each successive state discounted: the reward at the next step after taking some action a, then the reward at the subsequent state discounted by γ, then the reward at the next subsequent state further discounted by γ², and so on. The Discount Factor increases exponentially with each step.
[Figure: the chain S_{n-2} → S_{n-1} → S_n with γ = 1; S_n is the Goal Node, R = 1]
With γ = 1 (no discounting):
v(S_n) = 1
v(S_{n-1}) = 0 + 1 = 1
v(S_{n-2}) = 0 + 0 + 1 = 1
[Figure: the same chain with γ = 0.9]
With γ = 0.9:
v(S_n) = 1
v(S_{n-1}) = 0 + .9(1) = 0.9
v(S_{n-2}) = 0 + 0 + .9·.9(1) = 0.81
Note, the Reward and the value are not the same thing.
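The two cases above can be checked with a short sketch. The three-state chain and its rewards (0, 0, 1) come from the slide; the helper function itself is illustrative.

```python
def discounted_value(rewards, gamma):
    """Sum of rewards, each discounted by gamma raised to its step index."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards along the path S_{n-2} -> S_{n-1} -> S_n (Goal): 0, 0, 1
path = [0, 0, 1]
print(round(discounted_value(path, gamma=1.0), 2))  # 1.0  -> no discounting
print(round(discounted_value(path, gamma=0.9), 2))  # 0.81 -> 0 + 0 + .9*.9(1)
```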
7. Bellman Principle of Optimality
• Bellman Equation – Factored
v(s) = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + … + γⁿ·R_{t+n}
The tail γ·R_{t+1} + γ²·R_{t+2} + … is γ times v(S_{t+1}), so the equation factors to:
v(s) = R_t + γ·v(S_{t+1})
• Bellman Optimality – the value of a state is based on the best (optimal) action for that state, and each subsequent state:
v(s) = max_a( R(s,a) + γ·v(S_{t+1}) )
where the max is taken over the actions a available at state s; the maximizing action a is the optimal action.
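A sketch of this deterministic backup in Python. The two-state chain, the transition table, and all names here are assumptions for illustration; only the update rule itself is from the slide.

```python
GAMMA = 0.9

# Hypothetical two-state chain: from S1, moving Right reaches the Goal G.
actions    = {"S1": ["Right"], "G": []}
next_state = {("S1", "Right"): "G"}
R          = {("S1", "Right"): 0}   # no reward for the move itself
v          = {"S1": 0.0, "G": 1.0}  # the goal's value is known

def backup(s):
    """v(s) = max over a of ( R(s,a) + gamma * v(S_{t+1}) )."""
    return max(R[(s, a)] + GAMMA * v[next_state[(s, a)]]
               for a in actions[s])

v["S1"] = backup("S1")
print(v["S1"])  # 0.9
```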
8. Bellman Optimality Example
[Figure: a grid world with a Goal node (R = 1), a Pit (R = -1), and a Wall; γ = 0.9]
Calculate 1 step away: the state adjacent to the goal gets v = .9(1) = 0.9. The best action there is to move to the goal node.
Calculate adjacent steps: the next states out get v = .9(.9) = 0.81. The best action is to move to the node with the highest value.
Calculate adjacent steps again: v = .9(.81) ≈ 0.73.
And again: v = .9(.73) ≈ 0.66.
This produces a plan: the Optimal Action (Move) for each state.
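These sweeps can be reproduced with a small value-iteration sketch. The 3x4 layout, with the Goal top-right, the Pit below it, and one Wall, is an assumption chosen to match the slide's figure.

```python
# Deterministic value iteration on an assumed 3x4 grid world.
GAMMA = 0.9
ROWS, COLS = 3, 4
GOAL, PIT, WALL = (0, 3), (1, 3), (1, 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # Up, Down, Left, Right

v = {GOAL: 1.0, PIT: -1.0}  # terminal values fixed by their rewards
free = [(r, c) for r in range(ROWS) for c in range(COLS)
        if (r, c) not in (GOAL, PIT, WALL)]
v.update({s: 0.0 for s in free})

for sweep in range(10):  # enough sweeps for this small grid to settle
    for r, c in free:
        succs = [(r + dr, c + dc) for dr, dc in MOVES]
        succs = [s for s in succs if s in v]  # drop walls and off-grid moves
        # Best action: move to the reachable neighbor of highest value.
        v[(r, c)] = max(GAMMA * v[s] for s in succs)

for s in [(0, 2), (0, 1), (2, 2), (2, 1)]:
    print(s, round(v[s], 2))  # 0.9, 0.81, 0.73, 0.66 as on the slide
```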
9. Deterministic vs. Non-Deterministic
• Deterministic – The action taken has a 100% certainty of the expected (desired) outcome => Plan.
e.g., in our Grid World example, there is a 100% certainty that if the action is to move left, you will move left.
• Non-Deterministic (Stochastic) – The action taken has less than a 100% certainty of the expected outcome => Policy.
e.g., if a Robot is in a standing state and the action is to run, there may be an 80% probability of succeeding, but a 20% probability of falling down.
10. Bellman Optimality with Probabilities
• Terminology
R(s,a) -> The Reward when at state s and action a is taken.
P(s,a,S_{t+1}′) -> The probability, when at state s and action a is taken, of ending up in a given successor state S_{t+1}′.
• When the outcome is stochastic, we replace the value of the desired state with the sum, over the possible successor states, of each state's value times its probability:
v(s) = max_a( R(s,a) + γ·v(S_{t+1}) )
becomes
v(s) = max_a( R(s,a) + γ·Σ_{S_{t+1}′} P(s,a,S_{t+1}′)·v(S_{t+1}′) )
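A sketch of the stochastic backup. The transition-table format of (probability, successor) pairs and the action name are assumptions, with numbers chosen to mirror the next slide's v(Sb) computation.

```python
GAMMA = 0.9

# P: (state, action) -> list of (probability, successor state) pairs.
P = {("Sb", "toGoal"): [(0.8, "Goal"), (0.1, "Sa"), (0.1, "Sc")]}
R = {("Sb", "toGoal"): 0}
v = {"Goal": 1.0, "Sa": 0.0, "Sb": 0.0, "Sc": 0.0}

def q(s, a):
    """R(s,a) + gamma * sum over successors of P(s,a,s') * v(s')."""
    return R[(s, a)] + GAMMA * sum(p * v[s2] for p, s2 in P[(s, a)])

print(round(q("Sb", "toGoal"), 2))  # 0 + .9*(.8*1 + .1*0 + .1*0) = 0.72
```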
11. Bellman Optimality Example
[Figure: a grid with the Goal (R = 1), Sb, and Sa in the top row; the Pit (R = -1), a Wall, and Sc in the bottom row; γ = 0.9]
Sb, Right -> { 80% Left, 10% Right, 10% Down }
v(Sb) = 0 + .9( .8(1) + .1(0) + .1(0) ) = 0.72
Sc, Up -> { 80% Up, 10% Left, 10% Right }
v(Sc) = 0 + .9( .8(.72) + .1(0) + .1(-1) )
[Figure: the grid now shows v(Sb) = 0.72 and v(Sc) = 0.48]
12. Greedy vs. Optimal
[Figure: the same grid; choosing Up at Sc gives v(Sc) = 0.48, γ = 0.9]
• Greedy – Take the Action with the highest Probability of a Reward -> Plan (act as if deterministic).
Sc, Up -> { 80% Sb, 10% Left, 10% Right (‘The Pit’ – terminal state) }
10% of the time we will end up in the negative terminal state!
13. Greedy vs. Optimal
[Figure: the same grid; choosing Left at Sc gives v(Sc) = 0.07, γ = 0.9]
• Optimal – Take the Action that guarantees we will eventually proceed towards a positive reward -> Policy.
Sc, Left -> { 80% Wall and bounce back to Sc, 10% Up (Sb), 10% Down }
If we choose Left, we have an 80% chance of bouncing into the wall and being back where we were. If we keep bouncing off the wall, eventually we will go Up or Down (10% of the time), and never go into the Pit!
v(Sc, Left) = 0 + .9( .8(0) + .1(.72) + .1(0) )
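The greedy/optimal trade-off can be sketched by comparing one-step backups for the two actions at Sc. The transition lists follow the slide's figures (the Down successor for Left is assumed to have value 0); the slide's displayed values appear to include further iteration and rounding, so only the comparison matters here.

```python
GAMMA = 0.9
v = {"Sb": 0.72, "Sa": 0.0, "Sc": 0.0, "Pit": -1.0}

# (state, action) -> [(probability, successor)], per the slide's figures.
P = {
    ("Sc", "Up"):   [(0.8, "Sb"), (0.1, "Sa"), (0.1, "Pit")],  # greedy
    ("Sc", "Left"): [(0.8, "Sc"), (0.1, "Sb"), (0.1, "Sa")],   # wall bounce
}

def q(s, a):
    return GAMMA * sum(p * v[s2] for p, s2 in P[(s, a)])

# Up backs up more value but routes 10% of outcomes into the Pit;
# Left backs up less but can never reach the Pit.
print(round(q("Sc", "Up"), 2), round(q("Sc", "Left"), 2))
```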
14. Lifespan Penalty
• Lifespan Penalty – There is a cost to each action.
[Figure: the same grid, but every non-terminal state now carries a penalty of R = -0.1; γ = 0.9]
Sc, Left -> { 80% Wall, 10% Left, 10% Right }
v(Sc, Left) = 0 + .9( .8(-.1) + .1(.72) + .1(-.1) ) ≈ -0.02
[Figure: with the penalty, the grid shows v(Sc, Left) = -0.02]
When there is a penalty on each action, the best policy might be to take the chance of falling into the pit!
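A quick arithmetic check of the slide's penalty computation (the three branch values are read directly off the slide's equation):

```python
GAMMA = 0.9

# Branches of (probability, backed-up value) from the slide's equation:
# 80% wall bounce at -0.1, 10% up to Sb worth 0.72, 10% down at -0.1.
branches = [(0.8, -0.1), (0.1, 0.72), (0.1, -0.1)]

v_sc_left = GAMMA * sum(p * val for p, val in branches)
print(round(v_sc_left, 2))  # -0.02: the penalty flips Left's value negative
```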
15. Not Covered
• When probabilities are learned (not known in advance) -> Backward Propagation.
• Suboptimal solutions for HUGE search spaces.
THIS IS MORE LIKE THE REAL WORLD!