Temporal-difference Learning
Jie-Han Chen
NetDB, National Cheng Kung University
5/15, 2018 @ National Cheng Kung University, Taiwan
1
The content and images in these slides were borrowed from:
1. Rich Sutton’s textbook
2. David Silver’s Reinforcement Learning class in UCL
3. Künstliche Intelligenz’s slides
2
Disclaimer
Outline
● Recap DP, MC method
● TD learning
● Sarsa, Q-learning
● N-step bootstrapping
● TD (lambda)
3
Value-based method
Generalized policy iteration (GPI) plays the most
important part in reinforcement learning. All we
need to do is decide how to update the value function.
4
Dynamic Programming
Figure source: David Silver’s slides
5
Monte Carlo Method
Figure source: David Silver’s slides
6
Dynamic Programming & Monte Carlo method
Dynamic Programming
● update per step, using bootstrapping
● needs a model
● high computation cost
Monte Carlo Method
● update per episode
● model-free
● hard to apply to continuing tasks
7
Can we combine the advantages of Dynamic
Programming and the Monte Carlo method?
8
Temporal-difference learning
9
Different from the MC method, each sample in TD learning covers just a few steps, not the
whole trajectory. TD learning bases its update in part on an existing estimate, so it is
also a bootstrapping method.
The TD method here is a policy evaluation method (without control),
used to predict the value of a fixed policy.
Temporal-difference learning
10
backup diagram of TD(0)
Temporal-difference learning
We want to improve our estimate of V by computing these averages:
11
Temporal-difference learning
We want to improve our estimate of V by computing these averages:
sample 1:
12
Temporal-difference learning
We want to improve our estimate of V by computing these averages:
sample 1:
sample 2:
13
Temporal-difference learning
We want to improve our estimate of V by computing these averages:
sample 1:
sample 2:
sample 3:
14
Temporal-difference learning
In model-free RL, we use samples to estimate the expectation of future total
rewards.
sample 1:
sample 2:
…
sample n:
15
Temporal-difference learning
sample 1:
sample 2:
sample n:
16
Temporal-difference learning
sample 1:
sample 2:
sample n:
17
But we cannot rewind time to get
sample after sample from St!
Temporal-difference learning
We can use a weighted average to update the value function:
which is equal to
The coefficient α acts as a kind of learning rate.
18
Exponential Moving Average
● The running interpolation update:
● Makes recent samples more important:
● Forgets about the past; α is its forgetting rate.
19
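In symbols (the slide's equations were images; this is the standard running-average form they refer to):

```latex
\bar{x}_n \;=\; (1-\alpha)\,\bar{x}_{n-1} + \alpha\, x_n \;=\; \bar{x}_{n-1} + \alpha\,(x_n - \bar{x}_{n-1})
```

Each new sample pulls the estimate toward itself by a fraction α, and the weight on older samples decays geometrically.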
Temporal-difference learning
TD update, one-step TD/TD(0):
The quantity in the brackets is a sort of error, called the TD error.
The target of the TD method
20
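The update equation on this slide was an image; the standard one-step TD(0) update it describes is:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \bigl[\, \underbrace{R_{t+1} + \gamma V(S_{t+1})}_{\text{TD target}} - V(S_t) \,\bigr],
\qquad
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \quad (\text{TD error})
```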
Temporal-difference learning
● Model-free
● Online learning (fully incremental method)
○ Can be applied to continuing task
● Better convergence speed
○ In practice, converges faster than the Monte Carlo method
21
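A minimal tabular TD(0) prediction sketch in Python, assuming a hypothetical Gym-style environment with `reset()`/`step(action)` and a fixed `policy(state)` function (these names are illustrative, not from the slides):

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate the value function of a fixed policy with one-step TD (TD(0))."""
    V = defaultdict(float)                              # value table, defaults to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                      # follow the fixed policy
            next_state, reward, done, _ = env.step(action)
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])  # TD(0) update
            state = next_state
    return V
```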
How to choose α?
Stochastic approximation theory (the Robbins-Monro conditions) tells us there are two
constraints on the step sizes αₙ that make the previous exponential moving average converge stably:
1. Σₙ αₙ = ∞
2. Σₙ αₙ² < ∞
22
How to choose α?
Stochastic approximation theory (the Robbins-Monro conditions) tells us there are two
conditions on the step sizes αₙ that make the previous exponential moving average converge stably:
1. Σₙ αₙ = ∞
2. Σₙ αₙ² < ∞
23
A p-series step size (e.g. αₙ = 1/n) is one choice
that satisfies these two conditions, but such step
sizes make learning converge slowly in practice.
A constant α works well in most cases.
Temporal-difference learning with Control
In the previous slides, we introduced TD learning, which is used to predict the value
function from one-step samples.
Now, we’ll introduce two classic methods in TD control:
● Sarsa
● Q-learning
24
Sarsa
● Inspired by policy iteration
● Replace value function by action-value function
25
Sarsa
● Inspired by policy iteration
● Replace value function by action-value function
26
Sarsa
● Inspired by policy iteration
● Replace value function by action-value function
27
Sarsa
● Inspired by policy iteration
● The behavior policy and the target policy are the same
28
Sarsa
● Inspired by policy iteration
● The behavior policy and the target policy are the same
29
Sarsa
● Inspired by policy iteration
● The behavior policy and the target policy are the same
30
In model-free methods, we don't know the transition
probabilities. All we need to do is use a lot of
experience samples to estimate values.
The experience sample in Sarsa is (s, a, r, s', a')
Sarsa
● Inspired by policy iteration
● Sarsa update:
31
Sarsa
● Inspired by policy iteration
● Sarsa update:
32
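The Sarsa update rule itself was shown as an image; in standard notation it is:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[\, R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \,\bigr]
```

where (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) is exactly the (s, a, r, s', a') experience tuple named above.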
Sarsa
33
Sarsa
34
On-policy!
Q-learning
● Inspired by value iteration
35
Q-learning
● Inspired by value iteration
36
Q-learning
● Inspired by value iteration
37
Q-learning
● Inspired by value iteration
38
Q-learning
● Inspired by value iteration
● Q-learning update:
39
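Again the rule was an image on the slide; the standard Q-learning update is:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[\, R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \,\bigr]
```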
Q-learning
40
SARSA V.S. Q-Learning
● On-policy: the sampling policy is the same as the learning policy (target policy),
e.g. Sarsa, Policy Gradient
● Off-policy: the sampling policy is different from the learning policy (target policy),
e.g. Q-learning, Deep Q-Network
41
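To make the distinction concrete, here is a minimal Python sketch of the two updates side by side, assuming Q is a dict (or 2-D array) mapping each state to a NumPy array of action values; the `epsilon_greedy` helper is illustrative, not from the slides:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps=0.1):
    """Behavior policy: random action with probability eps, else greedy."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_step(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: the target uses the action a_next actually chosen by the behavior policy.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: the target uses the greedy action, whatever the behavior policy did.
    target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```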
SARSA V.S. Q-Learning: The Cliff walk
42
Additional information:
https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/
TD learning
43
Monte Carlo Method
Figure source: David Silver’s slides
44
Monte Carlo method
Monte Carlo update:
45
Monte Carlo return:
the discounted sum of all rewards
following this step
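Written out (the slide's formula was an image), the Monte Carlo target and update are:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{\,T-t-1} R_T,
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \bigl[\, G_t - V(S_t) \,\bigr]
```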
Monte Carlo method
Monte Carlo update:
● Unbiased
○ The target is the true return of the sample.
● High variance
○ The targets differ a lot across samples
because they depend on the whole trajectory.
46
TD method
TD update:
47
TD return:
just adds the immediate reward
and the estimated value of the next state.
TD method
TD update:
● Biased
○ The target is itself an estimate (because
it uses V(s))
● Low variance
○ The targets differ only slightly, because each
sample is just one step.
48
Can we evaluate the policy with fewer steps than
Monte Carlo but more than one-step TD?
49
n-step bootstrapping
● One-step TD learning:
○ target of TD:
○ Bellman equation:
● Monte Carlo method:
○ target of MC:
○ Bellman equation:
50
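The missing formulas here are the standard ones; the two targets and their Bellman-equation counterparts are:

```latex
\text{TD:}\quad R_{t+1} + \gamma V(S_{t+1}),
\qquad v_\pi(s) = \mathbb{E}_\pi\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \bigr] \\
\text{MC:}\quad G_t = \textstyle\sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1},
\qquad v_\pi(s) = \mathbb{E}_\pi\bigl[ G_t \mid S_t = s \bigr]
```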
n-step TD
51
n-step TD
n-step TD return:
Bellman equation:
52
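The n-step return the slide refers to (shown there as an image) is, in standard notation:

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n}\, V(S_{t+n}),
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \bigl[\, G_{t:t+n} - V(S_t) \,\bigr]
```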
53
Performance of n-step TD: Random walk
● 2 terminal states
● with 19 states instead of 5
54
Performance of n-step TD: Random walk
55
n-step Sarsa
56
n-step Sarsa
n-step Sarsa return:
Bellman equation:
57
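Analogously (again reconstructing the image), the n-step Sarsa return replaces the state value at the horizon with an action value:

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n}\, Q(S_{t+n}, A_{t+n}),
\qquad
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[\, G_{t:t+n} - Q(S_t, A_t) \,\bigr]
```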
58
n-step Sarsa
59
In the previous slides, we have already seen TD(0). What does the 0 mean?
one-step TD/TD(0) update:
60
We have already introduced n-step TD learning:
1-step, 3-step, 8-step, etc. Maybe the 4-step TD return is best
for learning.
Can we combine them to get better
performance?
61
2-step TD & 4-step TD
Recap: Performance of n-step TD: Random walk
62
A simple way to combine n-step TD returns is to average them,
as long as the weights on the component returns are positive and
sum to 1.
63
A simple way to combine n-step TD returns is to average them,
as long as the weights on the component returns are positive and
sum to 1.
64
This is called a compound return
The λ-return is one particular way of
averaging n-step updates.
This average contains all the n-step
updates, each weighted proportionally to
λⁿ⁻¹.
Besides, the n-step returns are
normalized by a factor of (1 − λ) to
ensure that the weights sum to 1.
65
The return of TD(λ) is defined as
follows:
We call it the λ-return
66
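The defining equation was shown as an image; the standard λ-return it refers to is:

```latex
G_t^{\lambda} \;=\; (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_{t:t+n}
```

with the factor (1 − λ) normalizing the geometric weights λⁿ⁻¹ so that they sum to 1.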
Another form is to separate
post-termination terms from the main
sum.
67
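For an episode that terminates at time T, every n-step return with n ≥ T − t equals the full return G_t, so the same quantity can be written with the post-termination terms separated out:

```latex
G_t^{\lambda} \;=\; (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{\,n-1}\, G_{t:t+n} \;+\; \lambda^{\,T-t-1}\, G_t
```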
68
λ = 0 / one-step TD
λ = 1 / Monte Carlo
● Where
n-step backups
● Backup (on-line or off-line):
● Off-line: the increments are accumulated "on the side" and are not used to
change value estimates until the end of the episode.
● On-line: the updates are done during the episode, as soon as the increment is
computed.
69
● Update the value function towards the λ-return
● The forward view looks into the future to compute the λ-return
● Like MC, it can only be computed from complete episodes (off-line learning)
70
Forward
Forward
71
In the forward view, after looking forward from one state and updating it, we
move on to the next and never have to work with the preceding state
again.
TD(λ) vs n-step TD: 19-state random walk
72
Backward
● The forward view provides the theory
● The backward view provides the mechanism
● Shout the TD error backward over time
● The strength of your voice decreases with temporal distance by γλ
73
Backward
The eligibility trace, denoted Eₜ(s), keeps track of the weight with which the value
function of each state is updated.
74
Eligibility Trace
75
Accumulating eligibility trace for a certain state s
Visits to state s
From David’s slides
Backward
● Keep an eligibility trace for every state s
● Update the value V(s) of every state s at every single step
● In proportion to the TD error and the eligibility trace
76
From David’s slides
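In equations (reconstructed in the standard form, since the slide showed them as images), the accumulating trace and the backward-view TD(λ) update are:

```latex
E_t(s) = \gamma \lambda\, E_{t-1}(s) + \mathbf{1}(S_t = s),
\qquad
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t),
\qquad
V(s) \leftarrow V(s) + \alpha\, \delta_t\, E_t(s) \;\; \text{for every state } s
```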
● When λ = 0, only the current state is updated
● This is exactly equivalent to the TD(0) update
77
Telescoping in TD(1)
78
Online Tabular TD(λ)
79
Online Tabular TD(λ)
80
Backward part!
81
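As a hedged sketch of the online tabular TD(λ) algorithm these slides depict, using the same hypothetical `env`/`policy` interface as the TD(0) sketch earlier:

```python
from collections import defaultdict

def td_lambda_online(env, policy, num_episodes=1000,
                     alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view tabular TD(lambda) with accumulating traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                  # eligibility traces, reset per episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0                     # accumulating trace for the visited state
            for s in list(E.keys()):            # backward part: update all traced states
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam             # decay every trace
            state = next_state
    return V
```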
82
83
From David’s slides
Off-line update VS On-line update
Off-line updates
● updates are accumulated within the episode
● but applied in a batch at the end of the episode
On-line updates
● updates are applied online at each step within the episode
● can be applied to continuing tasks
84
Cons of Tabular method
In the previous methods, we used a large table to store the value of each state or
state-action pair; this is called the tabular method.
In real-world settings, there are too many state-action pairs to store. Besides, the state
/observation can also be much more complicated, for example an image with high
resolution. This causes the curse of dimensionality.
85
Cons of Tabular method
In the previous methods, we used a large table to store the value of each state or
state-action pair; this is called the tabular method.
In real-world settings, there are too many state-action pairs to store. Besides, the state
/observation can also be much more complicated, for example an image with high
resolution. This causes the curse of dimensionality.
86
We can use a function approximator to estimate the value
function, and it generalizes across states!
Relationship between DP and TD
87
From David’s slides
On-policy & Off-policy (supplement)
On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas
off-policy methods evaluate or improve a policy different from that used to generate the data.
88
Recommended Papers
1. Schulman et al., High-Dimensional Continuous Control Using Generalized
Advantage Estimation (ICLR 2016)
2. De Asis et al., Multi-step Reinforcement Learning: A Unifying Algorithm
(AAAI 2018)
89
Reference
1. Sutton's textbook, Chapters 6, 7, 12
2. Reinforcement Learning, Lecture 4, UCL (David Silver)
3. Künstliche Intelligenz’s slides:
https://www.tu-chemnitz.de/informatik/KI/scripts/ws0910/ml09_7.pdf
90