Markov Decision Process
Hamed Abdi
PhD Candidate in Computational Cognitive Modeling
Institute for Cognitive & Brain Science (ICBS)
Outline
Introduction
Decision Theory
Intelligent Agents
Simple Decisions
Complex Decisions
Value Iteration
Policy Iteration
Partially Observable MDP
Dopamine-Based Learning
Introduction
• How should we make decisions so as to maximize payoff (reward)?
• How should we do this when others may not go along?
• How should we do this when the payoff (reward) may be far in the future?
“Preferred Outcomes” or “Utility”
Decision Theory
Decision Theory = Probability Theory + Utility Theory
• Maximize reward → Utility Theory
• Other agents → Game Theory
• Sequence of actions → Markov Decision Process

Properties of Task Environments
• Fully observable vs. partially observable
• Single-agent vs. multi-agent
• Deterministic vs. stochastic
• Episodic vs. sequential
• Static vs. dynamic
• Discrete vs. continuous
• Known vs. unknown
Agents
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.

The Structure of Agents
Agent = Architecture + Program
• Simple reflex agents
• Model-based reflex agents
• Goal-based agents
• Utility-based agents
Maximum Expected Utility (MEU)
A rational agent should choose the action that maximizes the agent's expected utility:
action = argmax_a EU(a | e)
EU(a | e) = ∑_s' P(RESULT(a) = s' | a, e) U(s')

The Value of Information
VPI(Ej) = [∑_k P(Ej = ejk | e) EU(a_ejk | e, Ej = ejk)] − EU(a | e)
Here Ej is a random variable, and the setting is a nondeterministic, partially observable environment.
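To make the MEU computation concrete, here is a minimal Python sketch (not from the original slides): the outcome probabilities and utilities below are invented placeholder numbers, and the agent simply picks the action with the largest expected utility.

```python
# MEU principle (sketch): choose the action whose expected utility,
# summed over the possible resulting states, is largest.
# The outcome model and utilities are invented illustrative numbers.

outcome_probs = {                            # P(RESULT(a) = s' | a, e)
    "stay": {"safe": 0.9, "risky": 0.1},
    "move": {"safe": 0.4, "risky": 0.6},
}
utility = {"safe": 5.0, "risky": 20.0}       # U(s')

def expected_utility(action):
    """EU(a | e) = sum over s' of P(s' | a, e) * U(s')."""
    return sum(p * utility[s] for s, p in outcome_probs[action].items())

best_action = max(outcome_probs, key=expected_utility)
print(best_action, expected_utility(best_action))
```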
Sequential Decision Problems
Policy and Optimal Policy
Optimal policies depend on the reward function and on the horizon (finite horizon vs. infinite horizon).
Markov Decision Process (MDP)
A sequential decision problem for a fully observable, stochastic environment with a
Markovian transition model and additive rewards is called a Markov decision process and
consists of four components:
• S: a set of states (with an initial state S0)
• A: a set ACTIONS(s) of actions available in each state
• T: a transition model P(s' | s, a)
• R: a reward function R(s) (or, more generally, R(s, a, s'))
[Diagram: MDP dynamics with states St, St+1, St+2, actions At, At+1, and rewards Rt, Rt+1, Rt+2.]
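To make the four components concrete, a finite MDP can be stored as plain dictionaries; the two-state example below is an invented toy problem, not one from the slides.

```python
# A tiny finite MDP stored as plain dictionaries, following the
# S / A / T / R components listed above (all values are invented).

S = ["s0", "s1"]                              # set of states, s0 is the initial state
A = {"s0": ["stay", "go"],                    # ACTIONS(s): actions available in each state
     "s1": ["stay"]}
T = {                                         # transition model P(s' | s, a)
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
}
R = {"s0": -0.04, "s1": 1.0}                  # reward function R(s)

# Sanity check: every transition distribution sums to 1.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```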
Assumptions
First-Order Markovian Dynamics (history independence)
• P(St+1 | At, St, At-1, St-1, ..., S0) = P(St+1 | At, St)
First-Order Markovian Reward Process
• P(Rt+1 | At, St, At-1, St-1, ..., S0) = P(Rt+1 | At, St)
Stationary Dynamics and Reward
• P(St+1 | At, St) = P(Sk+1 | Ak, Sk) for all t, k
• The world dynamics do not depend on absolute time.
Full Observability
Utilities of Sequences
1. Additive rewards:
Uh([s0, s1, s2, . . .]) = R(s0) + R(s1) + R(s2) + · · ·
2. Discounted rewards:
Uh([s0, s1, s2, . . .]) = R(s0) + γR(s1) + γ²R(s2) + · · ·
• With discounted rewards (γ < 1), the utility of an infinite sequence is finite:
Uh([s0, s1, s2, . . .]) = ∑_t γ^t R(st) ≤ ∑_t γ^t Rmax = Rmax / (1 − γ)
The discount factor γ is a number between 0 and 1.
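The bound above is easy to check numerically; the short sketch below sums discounted rewards for an arbitrary, made-up reward sequence and compares the result against Rmax / (1 − γ).

```python
# Discounted utility of an invented reward sequence, checked against the
# Rmax / (1 - gamma) bound for gamma < 1.

gamma = 0.9
rewards = [1.0, 0.5, 1.0, 0.2, 1.0]                 # illustrative R(s_t) values

utility = sum(gamma ** t * r for t, r in enumerate(rewards))
bound = max(rewards) / (1 - gamma)                  # Rmax / (1 - gamma)

print(f"discounted utility = {utility:.3f}, upper bound = {bound:.3f}")
assert utility <= bound
```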
The Bellman Equation for Utilities
The utility of a state is the immediate reward for that state plus the expected discounted
utility of the next state, assuming that the agent chooses the optimal action:
U(s) = R(s) + γ max_{a∈A(s)} ∑_s' P(s' | s, a) U(s')

The Value Iteration Algorithm
Value iteration applies this equation as an update, U'(s) ← R(s) + γ max_{a∈A(s)} ∑_s' P(s' | s, a) U(s'), to every state repeatedly until the utilities converge.
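A minimal sketch of value iteration on a dictionary-style toy MDP follows; the states, rewards, discount factor, and convergence threshold are illustrative assumptions, not values from the slides.

```python
# Value iteration (sketch): repeatedly apply the Bellman update
#   U(s) <- R(s) + gamma * max_a sum_s' P(s' | s, a) U(s')
# to every state of a small, invented MDP until the utilities converge.

gamma, eps = 0.9, 1e-6                        # discount factor, convergence threshold

A = {"s0": ["stay", "go"], "s1": ["stay"]}    # ACTIONS(s)
T = {("s0", "stay"): {"s0": 1.0},             # transition model P(s' | s, a)
     ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}}
R = {"s0": -0.04, "s1": 1.0}                  # reward function R(s)

U = {s: 0.0 for s in R}                       # initial utility estimates
while True:
    U_new = {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                                   for a in A[s])
             for s in R}
    delta = max(abs(U_new[s] - U[s]) for s in R)
    U = U_new
    if delta < eps:                           # stop when the largest change is tiny
        break

print(U)   # approximate utilities U(s)
```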
Policy Iteration Algorithm
The policy iteration algorithm alternates the following two steps, beginning from some
initial policy π0:
• Policy evaluation: given a policy πi, calculate Ui = Uπi, the utility of each state if πi were
to be executed.
• Policy improvement: calculate a new MEU policy πi+1, using one-step look-ahead based
on Ui:
πi+1(s) = argmax_{a∈A(s)} ∑_s' P(s' | s, a) Ui(s')
Modified Policy Iteration Algorithm (MPI)
Asynchronous Policy Iteration Algorithm (API)
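For comparison, here is a minimal policy iteration sketch on the same kind of toy MDP (again with invented states and rewards): policy evaluation approximates Uπi by repeated sweeps of the fixed-policy Bellman equation, and policy improvement performs the one-step look-ahead above. A fuller implementation would solve the evaluation step exactly, or stop it early as in the modified policy iteration mentioned above.

```python
# Policy iteration (sketch): alternate policy evaluation and greedy
# policy improvement on a small, invented MDP.

gamma = 0.9

A = {"s0": ["stay", "go"], "s1": ["stay"]}    # ACTIONS(s)
T = {("s0", "stay"): {"s0": 1.0},             # transition model P(s' | s, a)
     ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}}
R = {"s0": -0.04, "s1": 1.0}                  # reward function R(s)

def evaluate(pi, sweeps=100):
    """Approximate U_pi by repeated sweeps of the fixed-policy Bellman equation."""
    U = {s: 0.0 for s in R}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, pi[s])].items())
             for s in R}
    return U

pi = {s: A[s][0] for s in R}                  # arbitrary initial policy pi_0
while True:
    U = evaluate(pi)                          # policy evaluation
    new_pi = {s: max(A[s],                    # policy improvement (one-step look-ahead)
                     key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
              for s in R}
    if new_pi == pi:                          # unchanged policy -> done
        break
    pi = new_pi

print(pi, U)
```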
POMDP
Markov decision processes: the environment was fully observable, so the agent always knows which state it is in.
Partially observable MDP: the environment is only partially observable, so the agent does not necessarily know which state it is in and cannot simply execute the action π(s) recommended for that state.
The utility of a state s and the optimal action in s depend not just on s, but also on how much the agent knows when it is in s.
Definition of POMDP
A POMDP has the same elements as an MDP (the transition model P(s' | s, a), the actions A(s), and the reward function R(s)) plus a sensor model P(e | s).

Belief State
The belief state b is a probability distribution over the possible physical states. If b(s) was the previous belief state, and the agent does action a and then perceives evidence e, then the new belief state is given by
b'(s') = α P(e | s') ∑_s P(s' | s, a) b(s)
where α is a normalizing constant that makes the belief state sum to 1.
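Here is a minimal sketch of this belief update (the filtering step), using an invented two-state transition and sensor model; the unnormalized products are computed first, and α is simply whatever factor makes them sum to 1.

```python
# Belief update (sketch): b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s)
# on an invented two-state POMDP.

T = {("s0", "go"): {"s0": 0.3, "s1": 0.7},    # transition model P(s' | s, a)
     ("s1", "go"): {"s0": 0.1, "s1": 0.9}}
O = {"s0": {"ping": 0.2, "quiet": 0.8},       # sensor model P(e | s)
     "s1": {"ping": 0.9, "quiet": 0.1}}

def update_belief(b, a, e):
    """Return the new belief state after doing action a and perceiving e."""
    unnormalized = {s2: O[s2][e] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in b)
                    for s2 in O}
    alpha = 1.0 / sum(unnormalized.values())  # normalizing constant
    return {s2: alpha * v for s2, v in unnormalized.items()}

b = {"s0": 0.5, "s1": 0.5}                    # uniform prior belief
print(update_belief(b, "go", "ping"))
```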
POMDP
The optimal action depends only on the agent's current belief state.

The decision cycle of a POMDP agent:
1. Given the current belief state b, execute the action a = π∗(b).
2. Receive percept e.
3. Set the current belief state to b'(b, a, e) and repeat, where
   b'(s') = α P(e | s') ∑_s P(s' | s, a) b(s)

Define a reward function for belief states: ρ(b) = ∑_s b(s) R(s).
Solving a POMDP on a physical state space can then be reduced to solving an MDP on the corresponding belief-state space.
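The decision cycle itself can be written as a short loop. In the sketch below, the functions policy and perceive are hypothetical placeholders (a fixed action and a random percept), since computing a genuinely optimal belief-state policy π∗(b) is much harder than the loop suggests; the transition and sensor models repeat the toy example from the previous sketch so the block runs on its own.

```python
# Decision cycle of a POMDP agent (sketch): act from the current belief,
# receive a percept, update the belief, repeat.  policy() and perceive()
# are placeholders; the models repeat the toy example above.

import random

T = {("s0", "go"): {"s0": 0.3, "s1": 0.7},    # transition model P(s' | s, a)
     ("s1", "go"): {"s0": 0.1, "s1": 0.9}}
O = {"s0": {"ping": 0.2, "quiet": 0.8},       # sensor model P(e | s)
     "s1": {"ping": 0.9, "quiet": 0.1}}

def update_belief(b, a, e):
    """b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s)."""
    u = {s2: O[s2][e] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in b) for s2 in O}
    alpha = 1.0 / sum(u.values())             # normalizing constant
    return {s2: alpha * v for s2, v in u.items()}

def policy(b):
    """Placeholder for pi*(b): always chooses 'go'."""
    return "go"

def perceive():
    """Placeholder percept from the environment (random for illustration)."""
    return random.choice(["ping", "quiet"])

b = {"s0": 0.5, "s1": 0.5}                    # initial belief state
for step in range(3):
    a = policy(b)                             # 1. execute a = pi*(b)
    e = perceive()                            # 2. receive percept e
    b = update_belief(b, a, e)                # 3. update the belief state and repeat
    print(step, a, e, b)
```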
Dynamic Bayesian Network (DBN)
In the DBN, the single state St becomes a set of state variables Xt, and there may be
multiple evidence variables Et.
Known values: At-2, Et-1, Rt-1, At-1, Et, Rt
[Diagram: DBN for a POMDP with state variables Xt-1 … Xt+2, actions At-2 … At+1, rewards Rt-1 … Rt+2, evidence variables Et-1 … Et+2, and a utility node Ut+2.]
Dopaminergic System
• Prediction error in humans
• Reinforcement learning
• Reward-based learning
• Decision making
• Action selection (what to do next)
• Time perception
Dopamine Functions:
• Motor control
• Reward behavior
• Addiction
• Synaptic plasticity
• Nausea
• …
References
• Feinberg, E. A. and Shwartz, A. (eds.), Handbook of Markov Decision Processes, Kluwer, Boston, MA, 2002.
• Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 1998.
• Gurney, K., Prescott, T. J., and Redgrave, P., A computational model of action selection in the basal ganglia, Biological Cybernetics, 84(6), 401-423, 2001.
• Russell, S. and Norvig, P., Artificial Intelligence: A Modern Approach, Prentice Hall, Upper Saddle River, NJ, 1995.
Thanks for your attention.
