Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Reinforcement Learning 3. Finite Markov Decision Processes


Published on

A summary of Chapter 3: Finite Markov Decision Processes of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book in Professor Sutton's website:

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Reinforcement Learning 3. Finite Markov Decision Processes

  1. 1. Chapter 3: Finite Markov Decision Processes Seungjae Ryan Lee
  2. 2. ● Simplified, flexible reinforcement learning problem ● Consists of States , Actions , Rewards Markov Decision Process (MDP) States Info available to agent Actions Choice made by agent Rewards Basis for evaluating choices
  3. 3. Agent The learner Takes action Everything outside the agent Returns state and reward Environment
  4. 4. ● Anything the agent cannot arbitrarily change is part of the environment ○ Agent might still know everything about the environment ● Different boundaries for different purposes Agent-Environment Boundary Machinery Sensors Battery“Brain”
  5. 5. 1. Agent observes a state 2. Agent takes action 3. Agent receives reward and new state 4. Agent takes another action 5. Repeat Agent-Environment Interactions
  6. 6. Transition Probability ● Probability of reaching state and reward by taking action on state ● Fully describes the dynamics of a finite MDP ● Can deduce other properties of the environment
  7. 7. Expected Rewards ● Expected reward of taking action on state ● Expected reward of arriving in state by taking action on state
  8. 8. Recycling Robot Example ● States: Battery status (high or low) ● Actions ○ Search: High reward. Battery status can be lowered or depleted. ○ Wait: Low reward. Battery status does not change. ○ Recharge: No reward. Battery status changed to high. ● If battery is depleted, -3 reward and battery status changed to high.
  9. 9. Transition Graph ● Graphical summary of MDP dynamics
  10. 10. Designing Rewards ● Reward hypothesis ○ Goals and purposes can be represented by maximization of cumulative reward ● Tell what you want to achieve, not how +1 for each boxProportional to forward action Always -1
  11. 11. Episodic Tasks ● Interactions can be broken into episodes ● Episodes end in a special terminal state ● Each episode is independent Finished when the game ends Finished when the agent is out of the maze
  12. 12. Return for Episodic Tasks ● Sum of rewards from time step ● Time of termination:
  13. 13. Continuing Tasks ● Cannot be naturally broken into episodes ● Goes on without limit Stock Trading
  14. 14. Return for Continuing Tasks ● Sum of rewards is almost always infinite ● Need to discount future rewards by factor ○ If , the return only considers immediate reward (myopic)
  15. 15. Unified Notation for Return ● Cumulative reward ● can be a finite number or infinity ● Future rewards can be discounted with factor ○ If , then must be less than 1.
  16. 16. Policy ● Mapping from states to probabilities of selecting each possible action ● : Probability of selecting action in state
  17. 17. State-value function ● Expected return from state and following policy
  18. 18. Action-value function ● Expected return from taking action in state and following policy
  19. 19. Bellman Equation ● Recursive relationship between and
  20. 20. Optimal Policies and Value Functions ● For any policy , for all states ● There can be multiple optimal policies ● All optimal policies share same optimal value functions:
  21. 21. Bellman Optimality Equation ● Bellman Equation for optimal policies
  22. 22. Solving Bellman Optimality Equation ● Linear system: equations, unknowns ● Possible to find the exact optimal policy ● Impractical in most environments ○ Need to know the dynamics of the environment ○ Need extreme computational power ○ Need Markov property → In most cases, approximation is the best possible solution.
  23. 23. Approximation ● Does not require complete knowledge of environment ● Less memory and computational power needed ● Can focus learning on frequently encountered states
  24. 24. Thank you! Original content from ● Reinforcement Learning: An Introduction by Sutton and Barto You can find more content in ● ●