Temporal-difference Learning
Jie-Han Chen
NetDB, National Cheng Kung University
5/15, 2018 @ National Cheng Kung University, Taiwan
1
The content and images in these slides were borrowed from:
1. Rich Sutton’s textbook
2. David Silver’s Reinforcement Learning class in UCL
3. Künstliche Intelligenz’s slides
2
Disclaimer
Outline
● Recap DP, MC method
● TD learning
● Sarsa, Q-learning
● N-step bootstrapping
● TD (lambda)
3
Value-based method
Generalized policy iteration (GPI) plays the most
important role in reinforcement learning. All we
need to do is figure out how to update the value function.
4
Dynamic Programming
Figure source: David Silver’s slides
5
Monte Carlo Method
Figure source: David Silver’s slides
6
Dynamic Programming & Monte Carlo method
Dynamic Programming
● updates per step, using bootstrapping
● needs a model
● high computation cost
Monte Carlo Method
● updates per episode
● model-free
● hard to apply to continuing tasks
7
Can we combine the advantages of Dynamic
Programming and the Monte Carlo method?
8
Temporal-difference learning
9
Unlike the MC method, each sample in TD learning covers just a few steps, not the
whole trajectory. TD learning bases its update in part on an existing estimate, so it is
also a bootstrapping method.
The TD method is a policy evaluation method (without control),
which is used to predict the value of a fixed policy.
Temporal-difference learning
10
backup diagram of TD(0)
Temporal-difference learning
We want to improve our estimate of V by computing these averages:
11
Temporal-difference learning
We want to improve our estimate of V by computing these averages:
sample 1:
12
Temporal-difference learning
We want to improve our estimate of V by computing these averages:
sample 1:
sample 2:
13
Temporal-difference learning
We want to improve our estimate of V by computing these averages:
sample 1:
sample 2:
sample 3:
14
Temporal-difference learning
In model-free RL, we use samples to estimate the expectation of future total
rewards.
sample 1:
sample 2:
…
sample n:
15
Temporal-difference learning
sample 1:
sample 2:
sample n:
16
Temporal-difference learning
sample 1:
sample 2:
sample n:
17
But we cannot rewind time to get
sample after sample from S_t!
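(The sampled targets on the last few slides were images. A hedged reconstruction in Sutton's notation: each sample of the value of S_t would have the form

sample = R_{t+1} + \gamma V(S_{t+1}),

and ideally we would average many such samples, V(S_t) \approx \frac{1}{n} \sum_{k=1}^{n} sample_k. The problem raised above is that we only get one draw per visit to S_t.)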
Temporal-difference learning
We can use a weighted average to update the value function:
which is equal to
The weight α can be seen as a kind of learning rate.
18
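A sketch of the update this slide shows as an image, assuming the standard running-average form: the weighted update

V(S_t) \leftarrow (1 - \alpha) V(S_t) + \alpha \cdot sample

is equal to the incremental form

V(S_t) \leftarrow V(S_t) + \alpha (sample - V(S_t)),

where sample = R_{t+1} + \gamma V(S_{t+1}) and \alpha plays the role of a learning rate.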
Exponential Moving Average
● The running interpolation update:
● Makes recent samples more important:
● Forgets about the past; α is its forget rate.
19
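Unrolling the interpolation update referred to above (my notation, assuming a constant \alpha) shows the exponential weighting:

\bar{x}_n = (1-\alpha)\bar{x}_{n-1} + \alpha x_n = \alpha x_n + \alpha(1-\alpha) x_{n-1} + \alpha(1-\alpha)^2 x_{n-2} + \dots,

so older samples are forgotten geometrically.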
Temporal-difference learning
TD update, one-step TD/TD(0):
The quantity in the brackets is a sort of error, called the TD error.
The target of the TD method
20
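For reference, the TD(0) update this slide shows as an image is, in Sutton's notation,

V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right],

where the TD target is R_{t+1} + \gamma V(S_{t+1}) and the TD error is \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).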
Temporal-difference learning
● Model-free
● Online learning (fully incremental method)
○ Can be applied to continuing task
● Better convergence properties
○ In practice, converges faster than the Monte Carlo method
21
How to choose α?
Stochastic approximation theory (the Robbins-Monro conditions) tells us there are two
conditions for the previous exponential moving average to converge stably.
1.
2.
22
How to choose α?
Stochastic approximation theory (the Robbins-Monro conditions) tells us there are two
conditions for the previous exponential moving average to converge stably.
1.
2.
23
A p-series (e.g., α_n = 1/n) could be chosen to satisfy these two
conditions, but learning rates chosen this way tend
to converge slowly in practice.
A small constant α works well in most cases.
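The two Robbins-Monro conditions (shown as images above) are standard:

1. \sum_{n=1}^{\infty} \alpha_n = \infty
2. \sum_{n=1}^{\infty} \alpha_n^2 < \infty

For example, \alpha_n = 1/n satisfies both; a constant \alpha violates the second, yet is commonly used in practice, especially for non-stationary problems.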
Temporal-difference learning with Control
In the previous slides, we introduced TD learning, which is used to predict the value
function from one-step samples.
Now, we’ll introduce two classic methods in TD control:
● Sarsa
● Q-learning
24
Sarsa
● Inspired by policy iteration
● Replace value function by action-value function
25
Sarsa
● Inspired by policy iteration
● Replace value function by action-value function
26
Sarsa
● Inspired by policy iteration
● Replace value function by action-value function
27
Sarsa
● Inspired by policy iteration
● The behavior policy and the target policy are the same
28
Sarsa
● Inspired by policy iteration
● The behavior policy and the target policy are the same
29
Sarsa
● Inspired by policy iteration
● The behavior policy and the target policy are the same
30
In model-free methods, we don't know the transition
probabilities. All we need to do is use many
experience samples to estimate values.
The experience sample in Sarsa is (s, a, r, s', a')
Sarsa
● Inspired by policy iteration
● Sarsa update:
31
Sarsa
● Inspired by policy iteration
● Sarsa update:
32
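Since the Sarsa update formula on these slides is an image, here is a minimal tabular Sarsa sketch in Python. The environment interface (reset/step returning (next_state, reward, done)), the epsilon_greedy helper, and the hyperparameters are illustrative assumptions, not from the slides.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # Behavior policy: explore with probability eps, otherwise act greedily on Q.
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

def sarsa_episode(env, Q, n_actions, alpha=0.5, gamma=1.0, eps=0.1):
    # One episode of tabular Sarsa; env.reset() returns a state and
    # env.step(a) is assumed to return (next_state, reward, done).
    s = env.reset()
    a = epsilon_greedy(Q, s, n_actions, eps)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, n_actions, eps)
        # Sarsa target uses the action a_next the same policy will actually take.
        target = r + gamma * (0.0 if done else Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next

# Usage: Q = defaultdict(float); call sarsa_episode(...) for many episodes.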
Sarsa
33
Sarsa
34
On-policy!
Q-learning
● Inspired by value iteration
35
Q-learning
● Inspired by value iteration
36
Q-learning
● Inspired by value iteration
37
Q-learning
● Inspired by value iteration
38
Q-learning
● Inspired by value iteration
● Q-learning update:
39
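A matching sketch of the Q-learning update, reusing the hypothetical tabular setup from the Sarsa sketch above; the only change is that the target maximizes over next actions instead of using the action actually taken, which is what makes it off-policy.

def q_learning_update(Q, s, a, r, s_next, done, n_actions, alpha=0.5, gamma=1.0):
    # Q-learning target bootstraps from the greedy (max) action in s_next,
    # regardless of which action the behavior policy takes next.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# During interaction, actions can still be chosen with epsilon_greedy as before.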
Q-learning
40
SARSA vs. Q-Learning
● On-policy: the sampling policy is the same as the learning policy (target policy),
e.g., Sarsa, Policy Gradient
● Off-policy: the sampling policy is different from the learning policy (target policy),
e.g., Q-learning, Deep Q-Network
41
SARSA vs. Q-Learning: The Cliff Walk
42
Additional information:
https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/
TD learning
43
Monte Carlo Method
Figure source: David Silver’s slides
44
Monte Carlo method
Monte Carlo update:
45
Monte Carlo return:
the sum of all rewards following this
step
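The MC formulas shown as images here would presumably be the return

G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T

and the constant-\alpha Monte Carlo update

V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right].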
Monte Carlo method
Monte Carlo update:
● Unbiased
○ The target is the true return of the sample.
● High variance
○ The targets vary greatly across samples
because they depend on the whole trajectory.
46
TD method
TD update:
47
TD return:
we just add the immediate reward
and the estimated value of the next state.
TD method
TD update:
● Biased
○ The target is itself an estimate (because
it uses V(s))
● Low variance
○ The targets differ only slightly, because each
takes just one step.
48
Can we evaluate the policy with fewer steps than
Monte Carlo but more than one-step TD?
49
n-step bootstrapping
● One-step TD learning:
○ target of TD:
○ Bellman equation:
● Monte Carlo method:
○ target of MC:
○ Bellman equation:
50
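For reference, the two targets and Bellman equations referred to above are, in standard notation:

TD(0) target: R_{t+1} + \gamma V(S_{t+1}), from v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]
MC target: G_t, from v_\pi(s) = E_\pi[G_t \mid S_t = s]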
n-step TD
51
n-step TD
n-step TD return:
Bellman equation:
52
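A sketch of the missing definitions in Sutton's notation: the n-step return is

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}),

and the n-step TD update is V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right].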
53
Performance of n-step TD: Random walk
● 2 terminal states
● with 19 states instead of 5
54
Performance of n-step TD: Random walk
55
n-step Sarsa
56
n-step Sarsa
n-step Sarsa return:
Bellman equation:
57
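Analogously, the n-step Sarsa return bootstraps from the action-value function:

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n}),

with update Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q(S_t, A_t) \right].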
58
n-step Sarsa
59
In the previous slides, we have seen TD(0). What does the 0 mean?
one-step TD/TD(0) update:
60
We have already introduced n-step TD learning:
1-step, 3-step, 8-step, etc. Maybe the 4-step TD return is best
for learning.
Can we combine them to get better
performance?
61
2-step TD & 4-step TD
Recap: Performance of n-step TD: Random walk
62
A simple way to combine n-step TD returns is to average them,
as long as the weights on the component returns are positive and
sum to 1.
63
A simple way to combine n-step TD returns is to average them,
as long as the weights on the component returns are positive and
sum to 1.
64
This is called a compound return.
The λ-return is one particular way of
averaging n-step updates.
This average contains all the n-step
updates, each weighted proportionally to
λ^(n−1).
Besides, the n-step returns are
normalized by a factor of (1 − λ) to
ensure that the weights sum to 1.
65
The return of TD(λ) is defined as
follows:
We call it the λ-return.
66
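The definition shown as an image here is, in Sutton's notation, the λ-return

G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}.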
Another form is to separate
post-termination terms from the main
sum.
67
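Separating the post-termination terms (every n-step return that reaches past the terminal time T equals the full return G_t) gives the equivalent form

G_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t.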
68
λ = 0 / one-step TD
λ = 1 / Monte Carlo
● Where
n-step backups
● Backup (on-line or off-line):
● Off-line: the increments are accumulated "on the side" and are not used to
change value estimates until the end of the episode.
● On-line: the updates are done during the episode, as soon as the increment is
computed.
69
● Update the value function towards the λ-return
● Forward view looks into the future to compute the λ-return
● Like MC, can only be computed from complete episodes (off-line learning)
70
Forward
Forward
71
In the forward view, after looking forward from and updating one state, we
move on to the next and never have to work with the preceding state
again.
TD(λ) vs. n-step TD: 19-state random walk
72
Backward
● Forward view provides theory
● Backward view provides mechanism
● Shout the TD error δ_t backward over time
● The strength of your voice decreases with temporal distance by γλ
73
Backward
The eligibility trace, denoted by E_t(s), keeps track of the weight for updating the value
function for every state.
74
Eligibility Trace
75
Accumulating eligibility trace for a certain state s
Visits to state s
From David’s slides
Backward
● Keep an eligibility trace for every state s
● Update the value V(s) for every state s at every single step
● In proportion to the TD error δ_t and the eligibility trace E_t(s)
76
From David’s slides
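The backward-view rules sketched on this slide are, in David Silver's notation:

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)   (accumulating trace)
V(s) \leftarrow V(s) + \alpha \, \delta_t \, E_t(s)   for every state s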
● When λ = 0, only the current state is updated
● This is exactly equivalent to the TD(0) update
77
Telescoping in TD(1)
78
Online Tabular TD(λ)
79
Online Tabular TD(λ)
80
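The algorithm boxes on these two slides were images; here is a minimal online tabular TD(λ) sketch with accumulating traces. The env/policy interface and hyperparameters are illustrative assumptions, not from the slides.

from collections import defaultdict

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=1.0, lam=0.8):
    # Online tabular TD(lambda), backward view, accumulating eligibility traces.
    E = defaultdict(float)                      # eligibility trace per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]   # TD error
        E[s] += 1.0                             # accumulate trace for the visited state
        for state in list(E.keys()):            # update every traced state
            V[state] += alpha * delta * E[state]
            E[state] *= gamma * lam             # decay all traces
        s = s_next

# Usage: V = defaultdict(float); run for many episodes under a fixed policy.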
Backward part!
81
82
83
From David’s slides
Off-line update VS On-line update
Off-line updates
● updates are accumulated within episode
● but applied in batch at the end of episode
On-line updates
● updates are applied online at each step within episode
● can be applied in continuing task
84
Cons of the tabular method
In the previous methods, we used a large table to store the value of each state or
state-action pair; this is called the tabular method.
In real scenarios, there are too many state-action pairs to store. Besides, the state
/observation can also be much more complicated, for example an image with high
resolution. This causes the curse of dimensionality.
85
Cons of the tabular method
In the previous methods, we used a large table to store the value of each state or
state-action pair; this is called the tabular method.
In real scenarios, there are too many state-action pairs to store. Besides, the state
/observation can also be much more complicated, for example an image with high
resolution. This causes the curse of dimensionality.
86
We can use a function approximator to estimate the value
function, and it generalizes across states!
Relationship between DP and TD
87
From David’s slides
On-policy & Off-policy 補充
On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas
off-policy methods evaluate or improve a policy different from that used to generate the data.
88
Recommended Papers
1. Schulman et al., High-Dimensional Continuous Control Using Generalized
Advantage Estimation (ICLR 2016)
2. De Asis et al., Multi-step Reinforcement Learning: A Unifying Algorithm
(AAAI 2018)
89
Reference
1. Sutton’s textbook Chapter 6, 7, 12
2. Reinforcement Learning Lecture 4 from UCL (David Silver)
3. Künstliche Intelligenz’s slides:
https://www.tu-chemnitz.de/informatik/KI/scripts/ws0910/ml09_7.pdf
90