1
(Module-2)
Markov Decision Process (MDP)
2
Markov Decision Process (MDP)
• An MDP is a mathematical framework for modeling
sequential decision-making problems under uncertainty.
• It assumes that the environment has the Markov
property, meaning the future depends only on the
current state and action, not on the past history.
• MDPs are used to model such problems across many domains, including robotics, game playing, and resource allocation (a toy example is sketched below).
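Concretely, an MDP is specified by the tuple (S, A, P, R, γ): a set of states, a set of actions, transition probabilities P(s' | s, a), a reward function R(s, a), and a discount factor γ. The Python sketch below writes down a toy two-state, two-action MDP of this form; the state names, probabilities, and rewards are invented purely for illustration.

# Toy MDP specified as (S, A, P, R, gamma); all numbers are illustrative.
states = ["high_battery", "low_battery"]
actions = ["search", "recharge"]

# P[(s, a)][s'] = P(s' | s, a). The Markov property means these
# probabilities depend only on the current state and action.
P = {
    ("high_battery", "search"):   {"high_battery": 0.7, "low_battery": 0.3},
    ("high_battery", "recharge"): {"high_battery": 1.0},
    ("low_battery",  "search"):   {"high_battery": 0.1, "low_battery": 0.9},
    ("low_battery",  "recharge"): {"high_battery": 1.0},
}

# R[(s, a)]: expected immediate reward for taking action a in state s.
R = {
    ("high_battery", "search"): 2.0, ("high_battery", "recharge"): 0.0,
    ("low_battery",  "search"): 1.0, ("low_battery",  "recharge"): 0.0,
}

gamma = 0.9  # discount factor, 0 <= gamma < 1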
3
Key Equations:
Note: The goal in an MDP is to find an optimal policy (π*) that maximizes the expected
cumulative discounted reward over time.
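Written out, that objective is (in the standard formulation):
π* = argmax_π E_π [ Σ_{t=0}^{∞} γ^t * R(s_t, a_t) ],   where 0 ≤ γ < 1 is the discount factor.
The discount factor trades off immediate against future rewards and keeps the infinite sum finite.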
4
Bellman Equation in MDP
The Bellman equation is a cornerstone of Markov Decision
Processes (MDPs), providing a powerful tool for understanding
and optimizing sequential decision-making under uncertainty.
The standard Bellman equation expresses the value of a state, V(s), as the expected return of taking the best action from that state, combining the immediate reward with the discounted value of the next state reached:
V(s) = max_a [ R(s, a) + γ * Σ_{s'} P(s' | s, a) * V(s') ]
5
V(s'): Expected value of the next state s'
The Bellman equation uses the value of the next state, V(s'), to estimate the value of the current state, V(s), together with the immediate reward R(s, a) obtained by taking action a in state s.
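A quick worked instance of this backup (the numbers are made up purely for illustration): suppose taking action a in state s yields R(s, a) = 1 and leads to s'_1 with probability 0.8, where V(s'_1) = 10, and to s'_2 with probability 0.2, where V(s'_2) = 0. With γ = 0.9, the bracketed term for that action is
1 + 0.9 * (0.8 * 10 + 0.2 * 0) = 1 + 7.2 = 8.2,
and V(s) is the maximum of such terms over all available actions.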
6
The optimal Bellman equation characterizes the optimal value function (V*): the function that assigns to each state the highest expected total reward achievable under the optimal policy.
Optimal Bellman Equation
V*(s) = max_a [ R(s, a) + γ * Σ_{s'} P(s' | s, a) * V*(s') ]
This equation says that the optimal value of a state equals the immediate reward from taking the best possible action in that state, plus the discounted expected value of the next state reached when continuing to act optimally.
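Value iteration, mentioned later in these slides, turns this equation directly into an algorithm: it applies the optimal Bellman backup to every state repeatedly until the values stop changing. Below is a minimal, self-contained sketch in Python; the toy MDP and the stopping tolerance are illustrative choices, not taken from the slides.

# Value iteration: repeatedly apply the optimal Bellman backup
#   V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s') ]
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    V = {s: 0.0 for s in states}  # start from all-zero value estimates
    while True:
        V_new = {
            s: max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
        # Stop once the largest change across states falls below the tolerance.
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# Toy two-state MDP (the same illustrative example as before).
states = ["high_battery", "low_battery"]
actions = ["search", "recharge"]
P = {
    ("high_battery", "search"):   {"high_battery": 0.7, "low_battery": 0.3},
    ("high_battery", "recharge"): {"high_battery": 1.0},
    ("low_battery",  "search"):   {"high_battery": 0.1, "low_battery": 0.9},
    ("low_battery",  "recharge"): {"high_battery": 1.0},
}
R = {
    ("high_battery", "search"): 2.0, ("high_battery", "recharge"): 0.0,
    ("low_battery",  "search"): 1.0, ("low_battery",  "recharge"): 0.0,
}
print(value_iteration(states, actions, P, R, gamma=0.9))  # prints V*(s) per state

Each sweep brings the estimates closer to V*; the Banach fixed point theorem discussed later in this module is what guarantees this convergence.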
7
Q-value Function: Mathematical Notation
8
Mathematical Notation:
• max_a' Q(s', a'): The maximum expected future reward achievable from the next state s', considering all possible actions a'.
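This term is the lookahead piece of the Bellman equation for the Q-value function, which in the same style as the state-value equations above reads (standard form):
Q(s, a) = R(s, a) + γ * Σ_{s'} P(s' | s, a) * max_a' Q(s', a')
So Q(s, a) scores committing to a particular action a in state s and then acting optimally from the next state onward.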
12
Key Differences:
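In short, V(s) rates a state assuming the best action will be taken from it, while Q(s, a) rates a specific state-action pair. Under the optimal policy the two are linked by the standard identities:
V*(s) = max_a Q*(s, a)
Q*(s, a) = R(s, a) + γ * Σ_{s'} P(s' | s, a) * V*(s')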
13
Cauchy Sequences
14
Cauchy Sequences
• Cauchy Sequence: A sequence of elements in a metric space
where the elements become arbitrarily close to each other as
the sequence progresses.
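In symbols, the standard definition reads: a sequence (x_n) in a metric space (X, d) is Cauchy if for every ε > 0 there exists an N such that d(x_m, x_n) < ε for all m, n > N. For example, x_n = 1/n is a Cauchy sequence of real numbers.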
22
BANACH'S FIXED POINT THEOREM
25
Banach's Fixed Point Theorem
• Banach's fixed point theorem states that if a contraction mapping T acts on a complete metric space, then there is guaranteed to be a unique point x in that space such that T(x) = x. This point x is called the fixed point of T.
• The theorem also suggests a way to find this fixed point: pick any starting point x0 and repeatedly apply T; the sequence x0, T(x0), T(T(x0)), … converges to the unique fixed point (sketched in code below).
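The sketch below illustrates that construction in Python with a simple contraction on the real line; the particular mapping and tolerance are illustrative choices, not taken from the slides.

def fixed_point_iteration(T, x0, tol=1e-10, max_iters=1000):
    # Repeatedly apply the contraction T; by Banach's theorem the
    # iterates converge to the unique fixed point.
    x = x0
    for _ in range(max_iters):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Example: T(x) = 0.5 * x + 1 is a contraction on R with factor 0.5;
# its unique fixed point is x = 2, and the iteration finds it.
print(fixed_point_iteration(lambda x: 0.5 * x + 1, x0=0.0))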
27
Banach's Fixed Point Theorem
• Banach's fixed point theorem plays a crucial role
in proving the convergence of some important
algorithms in reinforcement learning.
• Dynamic programming algorithms, such as value iteration and policy iteration, apply the Bellman equation repeatedly to improve the agent's value estimates and policy; the contraction argument below explains why these updates converge.
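The connection is a standard result: the optimal Bellman backup T, defined by (T V)(s) = max_a [ R(s, a) + γ * Σ_{s'} P(s' | s, a) * V(s') ], is a contraction mapping with factor γ in the max norm,
|| T V1 - T V2 ||_∞ ≤ γ * || V1 - V2 ||_∞,   for 0 ≤ γ < 1.
Because the space of bounded value functions with this norm is complete, Banach's theorem guarantees that T has a unique fixed point, namely V*, and that repeated backups (value iteration) converge to it from any starting estimate.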