Chapter 4: Dynamic Programming
Seungjae Ryan Lee
Dynamic Programming
● Algorithms that compute optimal policies given a perfect model of the environment
● Use value functions to structure the search for good policies
● Foundation for all methods that follow
[Diagram: spectrum of methods: Dynamic Programming, Monte Carlo, Temporal Difference]
Policy Evaluation (Prediction)
● Compute the state-value function v_π for a given policy π
● Use the Bellman equation for v_π:
  v_π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ v_π(s′) ]
Iterative Policy Evaluation
● Solving the resulting linear system directly is tedious → use iterative methods
● Define a sequence of approximate value functions v_0, v_1, v_2, …
● Expected update using the Bellman equation:
  v_{k+1}(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ v_k(s′) ]
○ Update based on expectation of all possible next states
Iterative Policy Evaluation in Practice
● In-place methods usually converge faster than keeping two arrays
● Terminate policy evaluation when Δ = max_s |v_{k+1}(s) − v_k(s)| is sufficiently small (see the sketch below)
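Below is a minimal Python sketch of in-place iterative policy evaluation with this stopping rule. The model interface p(s, a) → list of (probability, next state, reward) triples and the policy interface pi(s) → {action: probability} are assumptions made for illustration, not part of the original slides.

```python
# Minimal sketch of in-place iterative policy evaluation (hypothetical interfaces).
# Assumes p(s, a) -> list of (prob, next_state, reward) and
# pi(s) -> dict mapping each action to its selection probability.
def policy_evaluation(states, p, pi, gamma=1.0, theta=1e-6):
    V = {s: 0.0 for s in states}              # v_0 initialized to zero
    while True:
        delta = 0.0
        for s in states:                      # one in-place sweep over the states
            v_old = V[s]
            # Expected update: average over actions, next states, and rewards
            V[s] = sum(prob_a * sum(prob * (r + gamma * V[s2])
                                    for prob, s2, r in p(s, a))
                       for a, prob_a in pi(s).items())
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                     # largest change is sufficiently small
            return V
```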
Gridworld Example
● Deterministic state transitions
● Off-the-grid actions leave the state unchanged
● Undiscounted, episodic task (a model sketch follows below)
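As a concrete reference, here is a hypothetical 4×4 gridworld in the spirit of Example 4.1 of Sutton and Barto (two terminal corner states, reward −1 per step); the layout and reward values are assumptions for illustration, not taken from the slides.

```python
# Hypothetical 4x4 gridworld: deterministic moves, off-grid actions leave the
# state unchanged, reward -1 per step, undiscounted, states 0 and 15 terminal.
N = 4
STATES = list(range(N * N))
ACTIONS = ['up', 'down', 'left', 'right']
TERMINAL = {0, N * N - 1}

def step(s, a):
    """Deterministic transition; moving off the grid leaves the state unchanged."""
    if s in TERMINAL:
        return s
    row, col = divmod(s, N)
    if a == 'up':
        row = max(row - 1, 0)
    elif a == 'down':
        row = min(row + 1, N - 1)
    elif a == 'left':
        col = max(col - 1, 0)
    elif a == 'right':
        col = min(col + 1, N - 1)
    return row * N + col

def p(s, a):
    """Model in the (prob, next_state, reward) form used by the sketches here."""
    if s in TERMINAL:
        return [(1.0, s, 0.0)]        # terminal states are absorbing, no more reward
    return [(1.0, step(s, a), -1.0)]  # every non-terminal step costs -1
```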
Policy Evaluation in Gridworld
● Equiprobable random policy: each of the four actions is selected with probability 0.25 (see the usage sketch below)
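For illustration, evaluating the equiprobable random policy might look like this; policy_evaluation, STATES, ACTIONS, and p are the hypothetical definitions from the earlier snippets.

```python
# Evaluate the equiprobable random policy on the gridworld sketch above.
def random_policy(s):
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

V = policy_evaluation(STATES, p, random_policy, gamma=1.0)
# With a reward of -1 per step, V[s] is minus the expected number of steps
# a random walk from s needs to reach a terminal corner.
```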
Policy Improvement - One state
● Suppose we know v_π for some deterministic policy π
● For a state s, see if there is a better action a ≠ π(s)
● Check if q_π(s, a) > v_π(s)
○ If true, greedily selecting a in s (and following π thereafter) is better than always following π
○ Special case of the Policy Improvement Theorem (a small check is sketched below)
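A one-state improvement check could be sketched as follows; q_from_v and has_better_action are hypothetical helpers, and p(s, a) is the same assumed (prob, next_state, reward) model interface as above.

```python
# Sketch: compute q_pi(s, a) from v_pi by a one-step lookahead, then check
# whether any action beats the policy's current choice at state s.
def q_from_v(V, p, s, a, gamma=1.0):
    return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p(s, a))

def has_better_action(V, p, s, actions, pi_s, gamma=1.0):
    """True if greedily switching away from the current action pi_s improves on v_pi(s)."""
    return any(q_from_v(V, p, s, a, gamma) > V[s] for a in actions if a != pi_s)
```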
Policy Improvement Theorem
For policies π and π′, if q_π(s, π′(s)) ≥ v_π(s) for all states s,
then π′ is at least as good a policy as π, i.e. v_{π′}(s) ≥ v_π(s) for all s.
(Strict inequality v_{π′}(s) > v_π(s) at any state s where q_π(s, π′(s)) > v_π(s))
Policy Improvement
● Find a better policy using the computed value function v_π
● Use the new greedy policy:
  π′(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [ r + γ v_π(s′) ]
● By construction, π′ satisfies the condition of the Policy Improvement Theorem (see the sketch below)
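A greedy improvement step might be sketched like this; the function name and the (prob, next_state, reward) model interface are illustrative assumptions.

```python
# Sketch of greedy policy improvement: for every state, pick the action with
# the highest one-step lookahead value under the current estimate V.
def policy_improvement(states, actions, p, V, gamma=1.0):
    def q(s, a):
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p(s, a))
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```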
Guarantees of Policy Improvement
● If v_{π′} = v_π for the greedy policy π′, then v_π satisfies the Bellman optimality equation, so both π and π′ are optimal.
→ Policy improvement returns a strictly better policy unless the current policy is already optimal
Policy Iteration
● Alternate policy evaluation and policy improvement (see the sketch below)
● Each iteration yields a strictly better policy, unless the current policy is already optimal
● Guaranteed to converge in a finite number of iterations for finite MDPs
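Putting the two previous sketches together, policy iteration could look roughly like this; policy_evaluation and policy_improvement are the hypothetical helpers above, and for an undiscounted task the initial policy should be able to reach a terminal state (or use γ < 1) so that the evaluation step converges.

```python
# Sketch of policy iteration: alternate full policy evaluation and greedy
# improvement until the policy stops changing (finite for finite MDPs).
def policy_iteration(states, actions, p, gamma=1.0, theta=1e-6):
    policy = {s: actions[0] for s in states}     # arbitrary initial deterministic policy
    while True:
        # Wrap the deterministic policy as a distribution for policy_evaluation.
        pi = lambda s: {policy[s]: 1.0}
        V = policy_evaluation(states, p, pi, gamma, theta)
        new_policy = policy_improvement(states, actions, p, V, gamma)
        if new_policy == policy:                 # policy is stable => optimal
            return new_policy, V
        policy = new_policy
```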
Policy Iteration in Practice
● Initialize each policy evaluation with the value function of the previous policy for quicker convergence
● Often converges in surprisingly few iterations
Value Iteration
● “Truncate” policy evaluation
○ Don’t wait until Δ is sufficiently small
○ Update the value of each state only once per sweep
● Evaluation and improvement can be combined into a single update operation
○ The Bellman optimality equation turned into an update rule:
  v_{k+1}(s) = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ v_k(s′) ]
Value Iteration in Practice
● Terminate when Δ = max_s |v_{k+1}(s) − v_k(s)| is sufficiently small (see the sketch below)
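A sketch of value iteration with this stopping rule, again assuming the illustrative (prob, next_state, reward) model interface:

```python
# Sketch of value iteration: the Bellman optimality equation used as an
# in-place update rule, stopping when the largest change falls below theta.
def value_iteration(states, actions, p, gamma=1.0, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = max(sum(prob * (r + gamma * V[s2]) for prob, s2, r in p(s, a))
                       for a in actions)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Extract a greedy policy from the final value estimates.
    policy = {s: max(actions,
                     key=lambda a: sum(prob * (r + gamma * V[s2])
                                       for prob, s2, r in p(s, a)))
              for s in states}
    return policy, V
```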
Asynchronous Dynamic Programming
● Don’t sweep over the entire state set systematically
○ Some states may be updated many times before others are updated once
○ Order or skip states so that information propagates efficiently
● Can be intermixed with real-time interaction (see the sketch below)
○ Update states according to the agent’s experience
○ Allows focusing updates on the states most relevant to the agent
● To guarantee convergence, every state must continue to be updated
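One simple flavor of asynchronous DP is sketched below: value-iteration backups applied to states chosen in an arbitrary order (here uniformly at random) rather than in full sweeps. The function name and sampling scheme are illustrative assumptions.

```python
import random

# Sketch of asynchronous value iteration: back up one randomly chosen state
# at a time instead of sweeping the whole state set. Convergence still
# requires that every state keeps being selected.
def async_value_iteration(states, actions, p, gamma=1.0, num_updates=100_000):
    V = {s: 0.0 for s in states}
    for _ in range(num_updates):
        s = random.choice(list(states))   # could instead prioritize relevant states
        V[s] = max(sum(prob * (r + gamma * V[s2]) for prob, s2, r in p(s, a))
                   for a in actions)
    return V
```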
Generalized Policy Iteration
● The general idea of letting policy evaluation and policy improvement interact
○ The policy is improved with respect to the current value function
○ The value function is driven toward the value of the current policy
● Describes almost all reinforcement learning methods
● If the process stabilizes, the resulting policy and value function are optimal
Efficiency of Dynamic Programming
● Worst-case time polynomial in the number of states and actions
○ Exponentially faster than direct search in policy space
● More practical than linear programming methods for larger problems
○ Asynchronous DP is often preferred for very large state spaces
● Typically converges much faster than the worst-case guarantee
○ Good initial value estimates speed convergence further
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai
