Markov Decision Process
Hamed Abdi
PhD Candidate in Computational Cognitive Modeling
Institute for Cognitive & Brain Science (ICBS)
Outline
Introduction
Decision Theory
Intelligent Agents
Simple Decisions
Complex Decisions
Value Iteration
Policy Iteration
Partially Observable MDP
Dopamine-Based Learning
Introduction
• How should we make decisions so as to maximize payoff (reward)?
• How should we do this when others may not go along?
• How should we do this when the payoff (reward) may be far in the future?
“Preferred Outcomes” or “Utility”
Decision Theory
Decision Theory = Probability Theory + Utility Theory
• Maximize reward → Utility Theory
• Other agents → Game Theory
• Sequence of actions → Markov Decision Process

Properties of Task Environments
• Fully observable vs. partially observable
• Single-agent vs. multi-agent
• Deterministic vs. stochastic
• Episodic vs. sequential
• Static vs. dynamic
• Discrete vs. continuous
• Known vs. unknown
Agents
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.

The Structure of Agents
Agent = Architecture + Program
• Simple reflex agents
• Model-based reflex agents
• Goal-based agents
• Utility-based agents
Maximum Expected Utility (MEU)
A rational agent should choose the action that maximizes the agent's expected utility:
action = argmax_a EU(a | e)
EU(a | e) = ∑_s' P(RESULT(a) = s' | a, e) U(s')

The Value of Information
VPI(Ej) = [∑_k P(Ej = ejk | e) EU(a_ejk | e, Ej = ejk)] − EU(a | e)
Here Ej is a random variable, and the setting is a nondeterministic, partially observable environment.
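To make the MEU computation concrete, here is a minimal Python sketch (not from the original slides): the outcome probabilities and utilities below are invented placeholder numbers, and the agent simply picks the action with the largest expected utility.

```python
# MEU principle (sketch): choose the action whose expected utility,
# summed over the possible resulting states, is largest.
# The outcome model and utilities are invented illustrative numbers.

outcome_probs = {                            # P(RESULT(a) = s' | a, e)
    "stay": {"safe": 0.9, "risky": 0.1},
    "move": {"safe": 0.4, "risky": 0.6},
}
utility = {"safe": 5.0, "risky": 20.0}       # U(s')

def expected_utility(action):
    """EU(a | e) = sum over s' of P(s' | a, e) * U(s')."""
    return sum(p * utility[s] for s, p in outcome_probs[action].items())

best_action = max(outcome_probs, key=expected_utility)
print(best_action, expected_utility(best_action))
```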
Sequential Decision Problems
Policy and Optimal Policy
Optimal policies depend on the reward function and on the horizon (finite horizon vs. infinite horizon).
Markov Decision Process (MDP)
A sequential decision problem for a fully observable, stochastic environment with a
Markovian transition model and additive rewards is called a Markov decision process and
consists of four components:
• S: a set of states (with an initial state S0)
• A: a set ACTIONS(s) of actions available in each state
• T: a transition model P(s' | s, a)
• R: a reward function R(s) (or, more generally, R(s, a, s'))
[Diagram: MDP dynamics with states St, St+1, St+2, actions At, At+1, and rewards Rt, Rt+1, Rt+2.]
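To make the four components concrete, a finite MDP can be stored as plain dictionaries; the two-state example below is an invented toy problem, not one from the slides.

```python
# A tiny finite MDP stored as plain dictionaries, following the
# S / A / T / R components listed above (all values are invented).

S = ["s0", "s1"]                              # set of states, s0 is the initial state
A = {"s0": ["stay", "go"],                    # ACTIONS(s): actions available in each state
     "s1": ["stay"]}
T = {                                         # transition model P(s' | s, a)
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
}
R = {"s0": -0.04, "s1": 1.0}                  # reward function R(s)

# Sanity check: every transition distribution sums to 1.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```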
Assumptions
First-Order Markovian Dynamics (history independence)
• P(St+1 | At, St, At-1, St-1, ..., S0) = P(St+1 | At, St)
First-Order Markovian Reward Process
• P(Rt+1 | At, St, At-1, St-1, ..., S0) = P(Rt+1 | At, St)
Stationary Dynamics and Reward
• P(St+1 | At, St) = P(Sk+1 | Ak, Sk) for all t, k
• The world dynamics do not depend on absolute time.
Full Observability
Utilities of Sequences
1. Additive rewards:
Uh([s0, s1, s2, . . .]) = R(s0) + R(s1) + R(s2) + · · ·
2. Discounted rewards:
Uh([s0, s1, s2, . . .]) = R(s0) + γR(s1) + γ²R(s2) + · · ·
• With discounted rewards (γ < 1), the utility of an infinite sequence is finite:
Uh([s0, s1, s2, . . .]) = ∑_t γ^t R(st) ≤ ∑_t γ^t Rmax = Rmax / (1 − γ)
The discount factor γ is a number between 0 and 1.
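The bound above is easy to check numerically; the short sketch below sums discounted rewards for an arbitrary, made-up reward sequence and compares the result against Rmax / (1 − γ).

```python
# Discounted utility of an invented reward sequence, checked against the
# Rmax / (1 - gamma) bound for gamma < 1.

gamma = 0.9
rewards = [1.0, 0.5, 1.0, 0.2, 1.0]                 # illustrative R(s_t) values

utility = sum(gamma ** t * r for t, r in enumerate(rewards))
bound = max(rewards) / (1 - gamma)                  # Rmax / (1 - gamma)

print(f"discounted utility = {utility:.3f}, upper bound = {bound:.3f}")
assert utility <= bound
```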
The Bellman Equation for Utilities
The utility of a state is the immediate reward for that state plus the expected discounted
utility of the next state, assuming that the agent chooses the optimal action:
U(s) = R(s) + γ max_{a∈A(s)} ∑_s' P(s' | s, a) U(s')

The Value Iteration Algorithm
Value iteration applies this equation as an update, U'(s) ← R(s) + γ max_{a∈A(s)} ∑_s' P(s' | s, a) U(s'), to every state repeatedly until the utilities converge.
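A minimal sketch of value iteration on a dictionary-style toy MDP follows; the states, rewards, discount factor, and convergence threshold are illustrative assumptions, not values from the slides.

```python
# Value iteration (sketch): repeatedly apply the Bellman update
#   U(s) <- R(s) + gamma * max_a sum_s' P(s' | s, a) U(s')
# to every state of a small, invented MDP until the utilities converge.

gamma, eps = 0.9, 1e-6                        # discount factor, convergence threshold

A = {"s0": ["stay", "go"], "s1": ["stay"]}    # ACTIONS(s)
T = {("s0", "stay"): {"s0": 1.0},             # transition model P(s' | s, a)
     ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}}
R = {"s0": -0.04, "s1": 1.0}                  # reward function R(s)

U = {s: 0.0 for s in R}                       # initial utility estimates
while True:
    U_new = {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                                   for a in A[s])
             for s in R}
    delta = max(abs(U_new[s] - U[s]) for s in R)
    U = U_new
    if delta < eps:                           # stop when the largest change is tiny
        break

print(U)   # approximate utilities U(s)
```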
Policy Iteration Algorithm
The policy iteration algorithm alternates the following two steps, beginning from some
initial policy π0:
• Policy evaluation: given a policy πi, calculate Ui = Uπi, the utility of each state if πi were
to be executed.
• Policy improvement: calculate a new MEU policy πi+1, using one-step look-ahead based
on Ui:
πi+1(s) = argmax_{a∈A(s)} ∑_s' P(s' | s, a) Ui(s')
Modified Policy Iteration Algorithm (MPI)
Asynchronous Policy Iteration Algorithm (API)
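For comparison, here is a minimal policy iteration sketch on the same kind of toy MDP (again with invented states and rewards): policy evaluation approximates Uπi by repeated sweeps of the fixed-policy Bellman equation, and policy improvement performs the one-step look-ahead above. A fuller implementation would solve the evaluation step exactly, or stop it early as in the modified policy iteration mentioned above.

```python
# Policy iteration (sketch): alternate policy evaluation and greedy
# policy improvement on a small, invented MDP.

gamma = 0.9

A = {"s0": ["stay", "go"], "s1": ["stay"]}    # ACTIONS(s)
T = {("s0", "stay"): {"s0": 1.0},             # transition model P(s' | s, a)
     ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}}
R = {"s0": -0.04, "s1": 1.0}                  # reward function R(s)

def evaluate(pi, sweeps=100):
    """Approximate U_pi by repeated sweeps of the fixed-policy Bellman equation."""
    U = {s: 0.0 for s in R}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, pi[s])].items())
             for s in R}
    return U

pi = {s: A[s][0] for s in R}                  # arbitrary initial policy pi_0
while True:
    U = evaluate(pi)                          # policy evaluation
    new_pi = {s: max(A[s],                    # policy improvement (one-step look-ahead)
                     key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
              for s in R}
    if new_pi == pi:                          # unchanged policy -> done
        break
    pi = new_pi

print(pi, U)
```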
POMDP
Markov decision processes: the environment was fully observable, so the agent always knows which state it is in.
Partially observable MDP: the environment is only partially observable, so the agent does not necessarily know which state it is in and cannot simply execute the action π(s) recommended for that state.
The utility of a state s and the optimal action in s depend not just on s, but also on how much the agent knows when it is in s.
Definition of POMDP
A POMDP has the same elements as an MDP (the transition model P(s' | s, a), the actions A(s), and the reward function R(s)) plus a sensor model P(e | s).

Belief State
The belief state b is a probability distribution over the possible physical states. If b(s) was the previous belief state, and the agent does action a and then perceives evidence e, then the new belief state is given by
b'(s') = α P(e | s') ∑_s P(s' | s, a) b(s)
where α is a normalizing constant that makes the belief state sum to 1.
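Here is a minimal sketch of this belief update (the filtering step), using an invented two-state transition and sensor model; the unnormalized products are computed first, and α is simply whatever factor makes them sum to 1.

```python
# Belief update (sketch): b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s)
# on an invented two-state POMDP.

T = {("s0", "go"): {"s0": 0.3, "s1": 0.7},    # transition model P(s' | s, a)
     ("s1", "go"): {"s0": 0.1, "s1": 0.9}}
O = {"s0": {"ping": 0.2, "quiet": 0.8},       # sensor model P(e | s)
     "s1": {"ping": 0.9, "quiet": 0.1}}

def update_belief(b, a, e):
    """Return the new belief state after doing action a and perceiving e."""
    unnormalized = {s2: O[s2][e] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in b)
                    for s2 in O}
    alpha = 1.0 / sum(unnormalized.values())  # normalizing constant
    return {s2: alpha * v for s2, v in unnormalized.items()}

b = {"s0": 0.5, "s1": 0.5}                    # uniform prior belief
print(update_belief(b, "go", "ping"))
```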
POMDP
The optimal action depends only on the agent's current belief state.

The decision cycle of a POMDP agent:
1. Given the current belief state b, execute the action a = π∗(b).
2. Receive percept e.
3. Set the current belief state to b'(b, a, e) and repeat, where
   b'(s') = α P(e | s') ∑_s P(s' | s, a) b(s)

Define a reward function for belief states: ρ(b) = ∑_s b(s) R(s).
Solving a POMDP on a physical state space can then be reduced to solving an MDP on the corresponding belief-state space.
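The decision cycle itself can be written as a short loop. In the sketch below, the functions policy and perceive are hypothetical placeholders (a fixed action and a random percept), since computing a genuinely optimal belief-state policy π∗(b) is much harder than the loop suggests; the transition and sensor models repeat the toy example from the previous sketch so the block runs on its own.

```python
# Decision cycle of a POMDP agent (sketch): act from the current belief,
# receive a percept, update the belief, repeat.  policy() and perceive()
# are placeholders; the models repeat the toy example above.

import random

T = {("s0", "go"): {"s0": 0.3, "s1": 0.7},    # transition model P(s' | s, a)
     ("s1", "go"): {"s0": 0.1, "s1": 0.9}}
O = {"s0": {"ping": 0.2, "quiet": 0.8},       # sensor model P(e | s)
     "s1": {"ping": 0.9, "quiet": 0.1}}

def update_belief(b, a, e):
    """b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s)."""
    u = {s2: O[s2][e] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in b) for s2 in O}
    alpha = 1.0 / sum(u.values())             # normalizing constant
    return {s2: alpha * v for s2, v in u.items()}

def policy(b):
    """Placeholder for pi*(b): always chooses 'go'."""
    return "go"

def perceive():
    """Placeholder percept from the environment (random for illustration)."""
    return random.choice(["ping", "quiet"])

b = {"s0": 0.5, "s1": 0.5}                    # initial belief state
for step in range(3):
    a = policy(b)                             # 1. execute a = pi*(b)
    e = perceive()                            # 2. receive percept e
    b = update_belief(b, a, e)                # 3. update the belief state and repeat
    print(step, a, e, b)
```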
Dynamic Bayesian Network (DBN)
In the DBN, the single state St becomes a set of state variables Xt, and there may be
multiple evidence variables Et.
Known values: At-2, Et-1, Rt-1, At-1, Et, Rt
[Diagram: DBN for a POMDP with state variables Xt-1 … Xt+2, actions At-2 … At+1, rewards Rt-1 … Rt+2, evidence variables Et-1 … Et+2, and a utility node Ut+2.]
Dopaminergic System
• Prediction error in humans
• Reinforcement learning
• Reward-based learning
• Decision making
• Action selection (what to do next)
• Time perception
Dopamine Functions:
• Motor control
• Reward behavior
• Addiction
• Synaptic plasticity
• Nausea
• …
References
• Feinberg, E. A. and Shwartz, A. (eds.), Handbook of Markov Decision Processes, Kluwer, Boston, MA, 2002.
• Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 1998.
• Gurney, K., Prescott, T. J., and Redgrave, P., A computational model of action selection in the basal ganglia, Biological Cybernetics, 84(6), 401-423, 2001.
• Russell, S. and Norvig, P., Artificial Intelligence: A Modern Approach, Prentice Hall, Upper Saddle River, NJ, 1995.
Thanks for your attention.
