Chapter 7: n-step Bootstrapping
Seungjae Ryan Lee
Recap: MC vs TD
● Monte Carlo: wait until the end of the episode, then update toward the full return
  ○ Return: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$
  ○ MC error: $G_t - V(S_t)$
● 1-step TD / TD(0): wait only until the next time step, then update toward the bootstrapping target $R_{t+1} + \gamma V(S_{t+1})$
  ○ TD error: $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
n-step Bootstrapping
● Perform updates based on an intermediate number of rewards: more than one, but fewer than all of them until termination
● Freed from the “tyranny of the time step” of TD
  ○ The time step for action selection (one step) is decoupled from the bootstrapping interval (n steps)
● Called n-step TD methods since they still bootstrap
n-step TD Prediction
● Use the truncated n-step return as the target
  ○ Use the next n rewards, then bootstrap from the current value estimate:
    $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$
● Needs future rewards that are not yet available at time step $t$
  ○ $V(S_t)$ cannot be updated until time step $t+n$
n-step TD Prediction: Pseudocode
The pseudocode loops over time steps: once the n rewards are available, it computes the n-step return $G_{t:t+n}$ and then updates $V(S_t)$ toward it.
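Below is a minimal Python sketch of this pseudocode. The tabular environment interface (env.reset(), env.step(action) returning (next_state, reward, done), env.num_states) and the policy callable are illustrative assumptions, not part of the original slides:

```python
import numpy as np

def n_step_td_prediction(env, policy, n, alpha=0.1, gamma=1.0, episodes=100):
    """Tabular n-step TD prediction (Sutton & Barto, Section 7.1).
    Assumed interface: env.reset() -> state, env.step(a) -> (state, reward, done),
    env.num_states; policy(state) -> action."""
    V = np.zeros(env.num_states)                 # value estimates
    for _ in range(episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[0] is an unused placeholder
        T, t = float('inf'), 0                   # T: episode length, unknown at first
        while True:
            if t < T:
                s, r, done = env.step(policy(states[t]))
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # time whose estimate is updated
            if tau >= 0:
                # Compute the truncated n-step return G_{tau:tau+n}
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:                  # bootstrap from V if not yet terminal
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

Storing whole-episode lists keeps the sketch short; the book's pseudocode keeps only the last n + 1 states and rewards using indices mod n + 1.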
n-step TD Prediction: Convergence
● The n-step return has the error reduction property
  ○ Its expectation is a better estimate of $v_\pi$ than $V_{t+n-1}$ is, in the worst-state sense:
    $\max_s \big| \mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \big| \le \gamma^n \max_s \big| V_{t+n-1}(s) - v_\pi(s) \big|$
● Converges to the true value function under appropriate technical conditions
Random Walk Example
● 19-state random walk; rewards only on exit (-1 on the left exit, +1 on the right exit)
● An n-step return propagates the terminal reward back to the n most recent states
[Figure: a sample trajectory on the random walk S1 … S19, highlighting the states updated by 1-step and 2-step returns]
Random Walk Example: n-step TD Prediction
● An intermediate value of n performs best
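For concreteness, here is a sketch of this environment under the same assumed interface (the class name and details are illustrative):

```python
import random

class RandomWalk:
    """19-state random walk (cf. the example above): start in the middle and move
    left or right with equal probability; exiting left yields -1, right yields +1.
    Matches the environment interface assumed by n_step_td_prediction."""
    def __init__(self, n_states=19):
        self.num_states = n_states

    def reset(self):
        self.state = self.num_states // 2        # start in the center state
        return self.state

    def step(self, action):                      # action is -1 (left) or +1 (right)
        self.state += action
        if self.state < 0:
            return self.state, -1.0, True        # left exit
        if self.state >= self.num_states:
            return self.state, 1.0, True         # right exit
        return self.state, 0.0, False

env = RandomWalk()
policy = lambda s: random.choice([-1, 1])        # equiprobable random policy
V = n_step_td_prediction(env, policy, n=4, alpha=0.4, episodes=100)
```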
n-step Sarsa
● Extend n-step TD prediction to control (Sarsa)
  ○ Use action values Q instead of state values V
  ○ Use an ε-greedy policy
● Redefine the n-step return in terms of Q:
  $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$
● The update then extends Sarsa naturally:
  $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \big[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \big]$
n-step Sarsa vs. Sarsa(0)
● Gridworld with a nonzero reward only at the goal
● n-step Sarsa can learn much more from a single episode: 1-step Sarsa strengthens only the last action value on the path to the goal, while n-step Sarsa strengthens the last n
n-step Sarsa: Pseudocode
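A minimal sketch of n-step Sarsa under the same assumed environment interface (plus an assumed env.num_actions attribute); the ε-greedy policy is derived from Q as the slides describe:

```python
import random
import numpy as np

def n_step_sarsa(env, n, alpha=0.5, gamma=1.0, epsilon=0.1, episodes=500):
    """Tabular n-step Sarsa with an eps-greedy policy (Sutton & Barto, Section 7.2).
    Assumes the same env interface as above, plus an env.num_actions attribute."""
    Q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(env.num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        states, actions, rewards = [s], [eps_greedy(s)], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s, r, done = env.step(actions[t])
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(s))
            tau = t - n + 1
            if tau >= 0:
                # n-step return from sampled rewards ...
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                # ... bootstrapped with the sampled action value Q(S_{tau+n}, A_{tau+n})
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```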
n-step Expected Sarsa
● Same n-step return as Sarsa except for the last step
  ○ Instead of the sampled action value, consider all possible actions at the last state:
    $G_{t:t+n} \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \bar{V}_{t+n-1}(S_{t+n})$
    where $\bar{V}_t(s) \doteq \sum_a \pi(a \mid s) Q_t(s, a)$
● Same update as Sarsa
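In the n-step Sarsa sketch above, only the bootstrap term changes; a hedged illustration, where pi(state) is an assumed callable returning a vector of action probabilities:

```python
import numpy as np

def expected_sarsa_bootstrap(Q, state, pi, gamma, n):
    """Expected Sarsa tail: gamma^n * sum_a pi(a|state) * Q(state, a).
    Replaces the sampled tail gamma^n * Q[state, action] used by n-step Sarsa."""
    return gamma ** n * float(np.dot(pi(state), Q[state]))
```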
Off-policy n-step Learning
● Need importance sampling
● Update the target policy π's values using returns generated by the behavior policy b, weighted by the importance sampling ratio:
  $\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$
  $V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \, \rho_{t:t+n-1} \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big]$
● Generalizes the on-policy case
  ○ If $\pi = b$, then $\rho_{t:t+n-1} = 1$ and the update reduces to on-policy n-step TD
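A small sketch of the ratio, assuming pi(s, a) and b(s, a) return the probability of taking action a in state s:

```python
def importance_ratio(states, actions, pi, b, t, h, T):
    """rho_{t:h} = product over k = t .. min(h, T-1) of pi(A_k|S_k) / b(A_k|S_k).
    pi(s, a) and b(s, a) are assumed to return the probability of action a in s."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi(states[k], actions[k]) / b(states[k], actions[k])
    return rho
```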
Off-policy n-step Sarsa
● Update Q instead of V
● The importance sampling ratio starts one step later for Q values: use $\rho_{t+1:t+n}$
  ○ $A_t$ is already chosen, so no correction is needed for it:
    $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \, \rho_{t+1:t+n} \big[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \big]$
Off-policy n-step Sarsa: Pseudocode
Off-policy n-step Expected Sarsa
● The importance sampling ratio ends one step earlier for Expected Sarsa: use $\rho_{t+1:t+n-1}$
  ○ All actions at the last state are already accounted for by the expectation
● Use the expected n-step return (bootstrapping with $\bar{V}_{t+n-1}(S_{t+n})$) as the target
Per-decision Off-policy Methods: Intuition*
● A more efficient off-policy n-step method
● Write returns recursively: $G_{t:h} = R_{t+1} + \gamma G_{t+1:h}$
● Naive importance sampling simply weights the return: $\rho_t G_{t:h}$, with $\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$
  ○ If $\rho_t = 0$, the target collapses to zero
  ○ The estimate shrinks toward zero, producing higher variance
Per-decision Off-policy Methods*
● Better: if $\rho_t = 0$, leave the estimate unchanged by falling back on the current value estimate:
  $G_{t:h} \doteq \rho_t \big( R_{t+1} + \gamma G_{t+1:h} \big) + (1 - \rho_t) V_{h-1}(S_t)$
  ○ The second term, $(1 - \rho_t) V_{h-1}(S_t)$, is called a control variate
● The expected update is unchanged, since $\mathbb{E}_b[\rho_t] = 1$ and the control variate has expected value zero
● Used with the ordinary TD update, without additional importance sampling
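A recursive sketch of this per-decision return with the control variate; the argument conventions (rewards[k] holding R_k, probability callables pi and b) are assumptions:

```python
def per_decision_return(t, h, T, states, actions, rewards, V, pi, b, gamma=1.0):
    """Per-decision importance-sampled return with a control variate:
    G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}) + (1 - rho_t) * V(S_t).
    Conventions assumed: rewards[k] holds R_k; pi(s, a) and b(s, a) return
    action probabilities; V is indexed by state."""
    if t == T:                       # episode has ended: no further return
        return 0.0
    if t == h:                       # horizon reached: bootstrap, G_{h:h} = V(S_h)
        return V[states[t]]
    rho = pi(states[t], actions[t]) / b(states[t], actions[t])
    tail = per_decision_return(t + 1, h, T, states, actions, rewards, V, pi, b, gamma)
    # If rho == 0, the target collapses to V(S_t): the estimate is left unchanged.
    return rho * (rewards[t + 1] + gamma * tail) + (1 - rho) * V[states[t]]
```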
Per-decision Off-policy Methods: Q*
● For action values, start from Expected Sarsa's n-step return
● Off-policy form with a control variate:
  $G_{t:h} \doteq R_{t+1} + \gamma \big( \rho_{t+1} \big( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \big) + \bar{V}_{h-1}(S_{t+1}) \big)$
● Analogous to Expected Sarsa once combined with the TD update algorithm
http://auai.org/uai2018/proceedings/papers/282.pdf
n-step Tree Backup Algorithm
● Off-policy learning without importance sampling
● Update from the entire tree of estimated action values
  ○ Leaf action nodes (not selected) contribute their current estimates to the target
  ○ Selected action nodes do not contribute directly; their policy probability weights all action values at the next level
n-step Tree Backup Algorithm: n-step Return
● 1-step return (same as Expected Sarsa):
  $G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a)$
● 2-step return:
  $G_{t:t+2} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}$
● n-step return (recursive form):
  $G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n}$
n-step Tree Backup Algorithm: Pseudocode
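A sketch of the target computation in this pseudocode, which builds the tree-backup return backwards from the horizon; pi(s) is an assumed callable returning a vector of action probabilities:

```python
import numpy as np

def tree_backup_return(tau, t, T, states, actions, rewards, Q, pi, gamma=1.0):
    """Tree-backup target G_{tau:tau+n}, built backwards from the horizon t + 1
    (in the main loop, tau = t + 1 - n). pi(s) is an assumed callable returning
    a vector of action probabilities."""
    if t + 1 >= T:
        G = rewards[T]                               # episode ended: last reward R_T
    else:
        s = states[t + 1]
        G = rewards[t + 1] + gamma * float(np.dot(pi(s), Q[s]))  # Expected Sarsa leaf
    for k in range(min(t, T - 1), tau, -1):          # k = min(t, T-1) .. tau+1
        s, a = states[k], actions[k]
        probs = pi(s)
        # Non-selected actions contribute their current estimates; the selected
        # action's probability weights the deeper (already computed) return G.
        expected_rest = float(np.dot(probs, Q[s])) - probs[a] * Q[s, a]
        G = rewards[k] + gamma * expected_rest + gamma * probs[a] * G
    return float(G)
```

The caller then applies the usual update: Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]]).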
A Unifying Algorithm: n-step Q(σ)*
● Unifies Sarsa, Tree Backup, and Expected Sarsa
  ○ Decide on each step whether to use the sampled action (Sarsa) or the expectation over all actions (Tree Backup)
A Unifying Algorithm: n-step Q(σ): Equations*
● $\sigma_t \in [0, 1]$: the degree of sampling on time step t (1 = full sampling, 0 = pure expectation)
● Slide linearly between the two weights:
  ○ Sarsa: the importance sampling ratio $\rho_t$
  ○ Tree Backup: the policy probability $\pi(A_t \mid S_t)$
● n-step return (recursive form):
  $G_{t:h} \doteq R_{t+1} + \gamma \big( \sigma_{t+1} \rho_{t+1} + (1 - \sigma_{t+1}) \pi(A_{t+1} \mid S_{t+1}) \big) \big( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \big) + \gamma \bar{V}_{h-1}(S_{t+1})$
A Unifying Algorithm: n-step Q(σ): Pseudocode*
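A sketch of the backward target computation from this pseudocode; sigma(k), pi(s), and b(s) are assumed callables (the per-step sampling degree and action-probability vectors):

```python
import numpy as np

def q_sigma_return(tau, t, T, states, actions, rewards, Q, pi, b, sigma, gamma=1.0):
    """n-step Q(sigma) target for Q(S_tau, A_tau), built backwards with horizon
    h = t + 1 = tau + n. sigma(k) in [0, 1] picks sampling (1, Sarsa-like) vs
    expectation (0, tree-backup-like) at step k; pi(s) and b(s) are assumed
    callables returning vectors of action probabilities."""
    if t + 1 < T:
        G = Q[states[t + 1], actions[t + 1]]      # base case: G_{h:h} = Q(S_h, A_h)
    for k in range(min(t + 1, T), tau, -1):       # k = min(t+1, T) .. tau+1
        if k == T:
            G = rewards[T]                        # last step of the episode: R_T
            continue
        s, a = states[k], actions[k]
        probs = pi(s)
        v_bar = float(np.dot(probs, Q[s]))        # expected value under pi
        rho = probs[a] / b(s)[a]                  # per-step importance ratio
        weight = sigma(k) * rho + (1 - sigma(k)) * probs[a]
        G = rewards[k] + gamma * weight * (G - Q[s, a]) + gamma * v_bar
    return float(G)
```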
Summary
● n-step: Look ahead to the next n rewards, states, and actions
+ Typically perform better than either extreme (MC or 1-step TD)
+ Escapes the tyranny of the single time step
- Delay of n steps before learning
- More memory and computation per timestep
● Extended to Eligibility Traces (Ch. 12)
+ Minimize additional memory and computation
- More complex
● Two approaches to off-policy n-step learning
○ Importance sampling: high variance
○ Tree backup: bootstrapping effectively spans only a few steps when the two policies differ greatly, even if n is large
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content at
● github.com/seungjaeryanlee
● www.endtoend.ai
