Chapter 3: Finite Markov Decision Processes
Seungjae Ryan Lee
Markov Decision Process (MDP)
● Simplified, flexible reinforcement learning problem
● Consists of states, actions, and rewards
○ States: information available to the agent
○ Actions: choices made by the agent
○ Rewards: basis for evaluating choices
Agent and Environment
● Agent: the learner; takes actions
● Environment: everything outside the agent; returns states and rewards
Agent-Environment Boundary
● Anything the agent cannot arbitrarily change is part of the environment
○ Agent might still know everything about the environment
● Different boundaries for different purposes
[Diagram: robot example. The “Brain” is the agent; the machinery, sensors, and battery are part of the environment]
Agent-Environment Interactions
1. Agent observes a state
2. Agent takes an action
3. Agent receives a reward and a new state
4. Agent takes another action
5. Repeat
Transition Probability
● p(s', r | s, a): probability of reaching state s' with reward r after taking action a in state s (see the formula below)
● Fully describes the dynamics of a finite MDP
● All other properties of the environment can be deduced from it
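In the book's notation (the slide's formula image is missing, so this is a reconstruction), the four-argument dynamics function is:

p(s', r \mid s, a) \doteq \Pr\{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \}
\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1 \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s)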
Expected Rewards
● r(s, a): expected reward after taking action a in state s
● r(s, a, s'): expected reward after taking action a in state s and arriving in state s'
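Both quantities can be derived from the dynamics p (reconstructed in the book's notation, where p(s' \mid s, a) = \sum_r p(s', r \mid s, a)):

r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)
r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}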
Recycling Robot Example
● States: Battery status (high or low)
● Actions
○ Search: High reward. Battery status can be lowered or depleted.
○ Wait: Low reward. Battery status does not change.
○ Recharge: No reward. Battery status is changed to high.
● If the battery becomes depleted, the robot receives a reward of -3 and the battery is recharged to high.
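Not part of the original slide: a minimal sketch of how these dynamics could be written down as a table p(s', r | s, a), in the spirit of the book's recycling-robot example. ALPHA, BETA, R_SEARCH, and R_WAIT are placeholder values; the slide only fixes the -3 reward for a depleted battery.

# Recycling robot dynamics as a table: p[(state, action)] = [(probability, next_state, reward), ...]
ALPHA, BETA = 0.7, 0.6        # assumed probabilities of keeping the current battery level while searching
R_SEARCH, R_WAIT = 2.0, 1.0   # assumed rewards; the slide only says searching pays more than waiting

p = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],  # battery depleted: reward -3, reset to high
    ("low", "wait"):     [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: the outcome probabilities of every state-action pair sum to 1.
assert all(abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
           for outcomes in p.values())

This table is exactly the four-argument dynamics function from the Transition Probability slide, so the other quantities (expected rewards, state-transition probabilities) can be computed from it.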
Transition Graph
● Graphical summary of MDP dynamics
Designing Rewards
● Reward hypothesis
○ Goals and purposes can be represented by maximization of cumulative reward
● Rewards should tell the agent what you want achieved, not how to achieve it
[Figures: example reward designs. +1 for each box; reward proportional to forward motion; always -1 per time step]
Episodic Tasks
● Interactions can be broken into episodes
● Episodes end in a special terminal state
● Each episode is independent
[Figures: a game episode finishes when the game ends; a maze episode finishes when the agent is out of the maze]
Return for Episodic Tasks
● Sum of rewards from time step t to the end of the episode
● Time of termination: T
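Reconstructed from the book, the return for an episodic task is the plain sum of rewards up to the terminal time T:

G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T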
Continuing Tasks
● Cannot be naturally broken into episodes
● Goes on without limit
● Example: stock trading
Return for Continuing Tasks
● Undiscounted sum of rewards is almost always infinite
● Need to discount future rewards by a factor γ (see the formula below)
○ If γ = 0, the return only considers the immediate reward (myopic)
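Reconstructed from the book, the discounted return for a continuing task is:

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma < 1

With bounded rewards and γ < 1, this infinite sum is finite.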
Unified Notation for Return
● Cumulative reward
● T can be a finite number or infinity
● Future rewards can be discounted with a factor γ
○ If T = ∞, then γ must be less than 1
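A single expression, reconstructed from the book, covers both the episodic and the continuing case (allowing either T = ∞ or γ = 1, but not both):

G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k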
Policy
● Mapping from states to probabilities of selecting each possible action
● π(a | s): probability of selecting action a in state s when following policy π
State-value function
● v_π(s): expected return when starting in state s and following policy π thereafter
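In the book's notation:

v_\pi(s) \doteq \mathbb{E}_\pi[ G_t \mid S_t = s ] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]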
Action-value function
● q_π(s, a): expected return when taking action a in state s and following policy π thereafter
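Likewise, in the book's notation:

q_\pi(s, a) \doteq \mathbb{E}_\pi[ G_t \mid S_t = s, A_t = a ]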
Bellman Equation
● Recursive relationship between the value of a state, v_π(s), and the values of its possible successor states, v_π(s')
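The equation itself (reconstructed from the book) expresses v_π at a state in terms of the dynamics p and the values of successor states:

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big] \quad \text{for all } s \in \mathcal{S}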
Optimal Policies and Value Functions
● A policy π_* is optimal if v_π_*(s) ≥ v_π(s) for any policy π and all states s
● There can be multiple optimal policies
● All optimal policies share the same optimal value functions v_* and q_*:
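The shared optimal value functions referenced above, in the book's notation:

v_*(s) \doteq \max_\pi v_\pi(s) \qquad q_*(s, a) \doteq \max_\pi q_\pi(s, a)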
Bellman Optimality Equation
● The Bellman equation written for the optimal value functions, without reference to any specific policy
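Reconstructed from the book, for v_* and q_* respectively:

v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big]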
Solving Bellman Optimality Equation
● System of n equations in n unknowns, one per state (nonlinear because of the max operator)
● Possible to find the exact optimal policy
● Impractical in most environments
○ Need to know the dynamics of the environment
○ Need extreme computational power
○ Need Markov property
→ In most cases, approximation is the best possible solution.
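Not part of the original slides: a sketch of what solving the Bellman optimality equation looks like when the MDP is as small as the recycling robot, by repeatedly applying the optimality backup (essentially value iteration, treated in the next chapter). It reuses the placeholder dynamics table from the recycling-robot sketch; GAMMA and all numeric values are assumptions.

GAMMA = 0.9                                          # assumed discount factor
ALPHA, BETA, R_SEARCH, R_WAIT = 0.7, 0.6, 2.0, 1.0   # assumed values, as before

p = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
    ("low", "wait"):     [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}
actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

v = {s: 0.0 for s in actions}      # initial guess for v*
for _ in range(1000):              # apply the Bellman optimality backup until numerically converged
    v = {s: max(sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in p[(s, a)])
                for a in actions[s])
         for s in actions}

# Greedy action in each state with respect to the converged values: an optimal policy.
policy = {s: max(actions[s],
                 key=lambda a: sum(prob * (r + GAMMA * v[s2])
                                   for prob, s2, r in p[(s, a)]))
          for s in actions}
print(v, policy)

With two states this is trivial; the slide's point is that the same computation becomes infeasible once the state space is large or the dynamics p are unknown.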
Approximation
● Does not require complete knowledge of the environment
● Less memory and computational power needed
● Can focus learning on frequently encountered states
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai
