Introduction to Reinforcement Learning
Part II: Basic tabular methods in RL
Mikko Mäkipää 31.1.2022
Agenda
• Last time: Part I
• Intro: Reinforcement learning as a ML approach
• Basic building blocks: agent and environment, MDP, policies, value functions, Bellman
equations, optimal policies and value functions
• Basic dynamic programming algorithms illustrated on a simple maze: Value iteration, Policy
iteration
Agenda
• This time: Part II
• Some more building blocks: GPI, bandits, exploration, TD updates,…
• Basic model-free methods using tabular value representation
• …illustrated on Blackjack: Monte Carlo on- vs off-policy; Sarsa, Expected Sarsa, Q-learning
• Next time: Part III
• Value function approximation-based methods
• Semi-gradient descent with Sarsa and different linear representations; polynomial, tile
coding, Fourier cosine basis
• Batch updates LSPI-LSTDQ
Recap – we’ll be briefly revisiting the
following concepts
• RL problem setting; Agent and environment
• Markov Decision Process, MDP
• Policy
• Policy Iteration
• Discounted return, Utility
• Value function, state-value function, action-value function
• Bellman equations, update rules (backups)
RL problem setting: Agent and environment
[Figure: the agent performs an action in the environment; it then observes the environment state and a reward.]
*) This would be a fully observable environment
RL problem setting: Agent and environment
[Figure: the agent-environment loop, annotated as follows.]
• The agent performs an action; it observes the environment state and reward
• The agent models the environment as a Markov Decision Process
• The agent creates an internal representation of state
• The agent maintains a policy that defines what action to take when in a state
• The agent approximates the value function of each state and action
Markov Decision Process (MDP)
A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, with
• States $\mathcal{S}$
• Actions $\mathcal{A}$
• Transition probabilities $\mathcal{P}(s' \mid s, a)$
• Rewards $\mathcal{R}$, with reward function $r(s, a)$
• Discount factor $\gamma$, $\gamma \in [0, 1]$
• When we know all of these, we have a fully defined MDP
From fully defined MDPs to model-free
methods
Methods for fully defined MDPs:
• We know all of states, actions, transition probabilities, rewards, and the discount factor
• We can use Dynamic Programming algorithms, such as policy iteration, to find the optimal value function and policy
• Algorithms typically use full sweeps of the state space to update values

Model-based methods:
• We know states, actions, and the discount factor (from the problem definition)
• A model gives a prediction of the next state and reward when taking an action in a state; so, we use a model to estimate transition probabilities and rewards
• We can use the model-augmented MDP with DP methods, learn the model from experience as in model-free methods, or create simulated experience based on the model for model-free methods

Model-free methods:
• We know states, actions, and the discount factor (from the problem definition)
• We don't know transition probabilities or rewards; we don't have a model, and don't need one
• We employ an agent to explore the environment, and we use the agent's direct experience to update our estimates of the value function and policy
• We can use episodes or individual action steps to update values
• We need to have some approach to selecting which states and actions the agent explores
We discussed Policy iteration
– a DP algorithm for fully-defined MDPs
• Perform policy evaluation: evaluate state values under the current policy*
• Improve the policy by determining the best action in each state, using the state-value function determined during the policy evaluation step
• Stop when the policy no longer changes
• Each improvement step improves the value function; when improvement stops, the optimal policy and value function have been found
*) using either iterative policy evaluation or solving the linear system
Towards Generalized Policy Iteration
source: David Silver: UCL Course on RL
The process of making a new policy that improves on
an original policy, by making it greedy with respect to
the value function of the original policy, is called
policy improvement
Policy improvement theorem: if $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in \mathcal{S}$, then $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in \mathcal{S}$
Blackjack as an MDP
• States: current set of cards for dealer and
player
• Actions: HIT – more cards, STAND – no
more cards
• Transition probabilities: A stochastic
environment due to randomly drawing
cards – exact probabilities difficult to
determine, though
• Rewards: at the end of an episode, +1 for winning, −1 for losing, 0 for a draw; 0 otherwise
• Discounting: not used, $\gamma = 1$ (a minimal environment sketch follows below)
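To make the example concrete, below is a minimal Python sketch of such a Blackjack environment. The infinite deck, the dealer's hit-below-17 rule, and the (next_state, reward, done) step convention are illustrative assumptions, not something specified on the slides.

```python
import random

def draw_card():
    # Infinite deck: ranks 1..13, face cards count as 10, ace as 1
    return min(random.randint(1, 13), 10)

def add_card(total, usable_ace, card):
    # Count an ace as 11 while that does not bust the hand
    if card == 1 and total + 11 <= 21:
        return total + 11, True
    total += card
    if total > 21 and usable_ace:
        return total - 10, False
    return total, usable_ace

def step(state, action):
    """One step of an episode. state = (dealer_showing, player_sum, usable_ace);
    returns (next_state, reward, done)."""
    dealer, player, ace = state
    if action == "HIT":
        player, ace = add_card(player, ace, draw_card())
        if player > 21:
            return None, -1, True                   # player busts, reward -1
        return (dealer, player, ace), 0, False      # reward 0 mid-episode
    # STAND: the dealer plays a fixed policy (hit while below 17)
    d_total, d_ace = add_card(0, False, dealer)
    while d_total < 17:
        d_total, d_ace = add_card(d_total, d_ace, draw_card())
    if d_total > 21 or d_total < player:
        return None, +1, True                       # dealer busts or loses
    return None, (0 if d_total == player else -1), True
```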
Blackjack as an MDP - example
[Figure: the dealer's and player's hands for the state (dealer: 5, player: 13, no ace); the player chooses HIT.]
Blackjack as an MDP - example
[Figure: the transition tree from state (5, 13, no ace): HIT yields reward 0 and a successor state such as (5, 20, no ace); from there, STAND can end with the dealer busting and reward +1.]
If in the previous state the player's sum is 13, HIT leads to one of 8 states where the player's sum is 14, 15, …, 21. If in the previous state the player has 20, the only surviving state is 21; all other cards lead to a loss.
Policy
• A policy defines how the agent behaves in an MDP environment
• A policy is a mapping from each state to an action
• A deterministic policy always returns the same action for a state
• A stochastic policy gives a probability for each action in a state
One possible deterministic policy for the maze
Flipsism (Höpsismi) as a policy
• A random policy
• For MDPs with two actions in each state
• Equal probability, 0.5, of choosing either action
Multi-armed bandits
– a slightly more formal approach to stochastic policies
• We can choose from four actions: a, b, c or d
• Whenever we choose an action,
we receive a reward with an
unknown probability distribution
• We have now chosen an action
six times, a and b twice, c and d
once
• We have received the rewards
shown
• We want to maximize the reward
we receive over time
• What action would you select
next, why?
This would be a 4-armed bandit
Multi-armed bandits
• Now we have selected each
action six times and the
reward situation is as
shown
• How would you continue
from here? Why?
Exploration vs exploitation
• Exploitation: we exploit the
information we already have to
maximize reward
• Maintain estimates of the
values of actions
• Select the action whose
estimated value is greatest
• This is called the greedy
action
• Exploration: we choose some other action than the greedy one to gain information and to improve our estimates
We have now chosen an action 40 000 times: 10 000 times each of a, b, c and d. We can estimate that we have lost about 85 000 in value compared to the optimal strategy of choosing b every time.
Epsilon-greedy policy
• The $\varepsilon$-greedy policy is a strategy to balance exploration and exploitation
• We choose the greedy action with probability $1 - \varepsilon$, and a random action with probability $\varepsilon$
• For this to work in theory, all states are to be visited infinitely often and $\varepsilon$ needs to decrease towards zero, so that the policy converges to the greedy policy*
• In practice, it might be enough to decrease epsilon towards the greedy policy
• A simple, often proposed strategy is to decrease epsilon as $\varepsilon_k = 1/k$, but this might be a bit fast in practice (a sketch follows below)
*) GLIE: Greedy in the Limit with Infinite Exploration
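As a minimal sketch, $\varepsilon$-greedy action selection with the $1/k$ schedule might look as follows; the Q-table layout (a dict keyed by (state, action) pairs) is an assumption used throughout these sketches.

```python
import random

def epsilon_greedy(Q, state, actions, k):
    # GLIE-style schedule epsilon_k = 1/k; often decays too fast in practice
    eps = 1.0 / k
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit (greedy action)
```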
-greedy method for bandits
• It was said that ”we maintain estimates of the values of actions”
• For this, we use incrementally computed sample averages:
$N(A) \leftarrow N(A) + 1$
$Q(A) \leftarrow Q(A) + \frac{1}{N(A)}\,[R - Q(A)]$
• And use an $\varepsilon$-greedy policy for selecting an action (a sketch follows below)
Source: Sutton-Barto 2nd ed
For calculating the incremental mean, we maintain two parameters: N, the current visit count for each action (pulling an arm of the bandit), and Q, the current estimated value of the action
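A sketch of the resulting bandit algorithm, assuming a reward-sampling function pull(a) is given; the Gaussian reward means in the usage example are made up for illustration.

```python
import random

def run_bandit(arms, pull, steps=1000, eps=0.1):
    """arms: action labels; pull(a) -> sampled reward (assumed to be given)."""
    N = {a: 0 for a in arms}    # visit count per arm
    Q = {a: 0.0 for a in arms}  # estimated action value per arm
    for _ in range(steps):
        if random.random() < eps:
            a = random.choice(arms)            # explore
        else:
            a = max(arms, key=lambda x: Q[x])  # exploit the greedy arm
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]              # incremental sample average
    return Q, N

# Hypothetical four-armed bandit with Gaussian rewards; b is the best arm
means = {"a": 1.0, "b": 3.0, "c": 0.0, "d": 2.0}
Q, N = run_bandit(list("abcd"), lambda a: random.gauss(means[a], 1.0))
```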
General update rule for RL
• Note the format of the update rule in the method on the previous slide
• We can consider the form
$\text{NewEstimate} \leftarrow \text{OldEstimate} + \alpha\,[\text{Target} - \text{OldEstimate}]$
as a general update rule, where Target represents our current target value, $[\text{Target} - \text{OldEstimate}]$ is the error of our current estimate, and $\alpha$ is a decreasing step-size or learning-rate parameter
• Expect to see more of these soon…
Discounted return, utility
• An agent exploring the MDP environment would observe a sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots$
• Discounted return, or utility, from time step $t$ onwards is the sum of discounted rewards received (a sketch follows below):
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
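A small sketch of computing $G_t$ from a list of rewards; iterating backwards avoids computing explicit powers of $\gamma$.

```python
def discounted_return(rewards, gamma=0.9):
    # rewards[0] is R_{t+1}; fold backwards: G = R + gamma * G_next
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

assert discounted_return([1, 0, 2], gamma=0.5) == 1 + 0.5 * 0 + 0.25 * 2
```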
The state-value function
• If the agent was following a policy $\pi$, then in each state $s$ the agent would select the action defined by that policy
• The state-value function of a state $s$ under policy $\pi$, denoted $v_\pi(s)$, is the expected discounted return when following the policy from state $s$ onwards:
$v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,]$
• The recursive relationship between the value of a state and its successor states is called the Bellman expectation equation for the state-value function:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma\, v_\pi(s')\,]$
The action-value function
• The action-value function $q_\pi(s, a)$ for policy $\pi$ defines the expected utility when starting in state $s$, performing action $a$, and following the policy $\pi$ thereafter:
$q_\pi(s, a) = \mathbb{E}_\pi[\,G_t \mid S_t = s, A_t = a\,]$
State-action value function
[Figure: learned action values for the Blackjack example states.] The state-action value when the state is (5, 13, no ace) and the action is HIT is $Q(S, A) \approx -0.255$; when the state is (5, 20, no ace) and the action is STAND, $Q(S, A) \approx 0.669$.
Greedy policy from action-value function
• To derive the policy from the state-value function $v(s)$, we need to know the transition probabilities and rewards:
$\pi(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma\, v(s')\,]$
• But we can extract the policy directly from the action-value function:
$\pi(s) = \arg\max_a q(s, a)$
• So, working with $q(s, a)$ enables us to be model-free (see the one-liner below)
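The extraction itself is a one-liner; a sketch assuming the same tabular Q layout as above:

```python
def greedy_action(Q, state, actions):
    # No transition probabilities or rewards needed: just maximize Q
    return max(actions, key=lambda a: Q[(state, a)])
```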
First RL algorithm: Monte Carlo
• Sample a full episode from the MDP using an $\varepsilon$-greedy policy
• For each state-action pair, estimate the value using average sample returns
• Maintain visit counts for each state-action pair
• Update value estimates based on the incremental average of observed returns (a sketch follows below)
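A sketch of this on-policy (every-visit) Monte Carlo control loop; env.reset() and env.step(a) returning (next_state, reward, done) are assumed conventions, matching the Blackjack sketch earlier.

```python
import random
from collections import defaultdict

def mc_control(env, actions, episodes=100_000, gamma=1.0, eps=0.1):
    Q = defaultdict(float)  # Q[(s, a)]: estimated action value
    N = defaultdict(int)    # visit counts per state-action pair
    for _ in range(episodes):
        # Sample a full episode with the current epsilon-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        # Iterate backwards, accumulating the return and updating the averages
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            N[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]  # incremental average
    return Q
```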
One more concept: on-policy vs off-policy
• On-policy learning: apply a policy to choose actions and learn the value-function
for that policy
• Monte Carlo algorithm presented in the previous slide is an on-policy method
• In practice, we start with a stochastic policy to sample all possible state-action
pairs
• and gradually adjust the policy towards a deterministic optimal policy (GLIE?)
• Off-policy learning: apply one policy, but learn the value function for some other policy
• Typically in off-policy learning, we apply a behavior policy that allows for exploration, while learning about an optimal target policy
Towards Off-policy Monte Carlo
• To use returns generated by behavior policy $b$ to evaluate target policy $\pi$, we apply importance sampling, a technique to estimate expected values for one distribution using samples from another
• The probability of observing a sequence of states and actions from step $t$ to episode end $T$ under policy $\pi$ is
$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$
• We form importance-sampling ratios, the ratio of the probabilities of the sequence under the target and behavior policies; the unknown transition probabilities cancel:
$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$
• And apply those to weight our observed returns
Off-policy Monte Carlo
[Figure: the off-policy MC control algorithm, annotated: generate an episode with the behavior policy; iterate backwards through it; accumulate discounted returns; MC update, now with importance sampling; policy improvement, greedy w.r.t. the value function; incremental weight update. A sketch follows below.]
Source: Sutton-Barto 2nd ed
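A sketch of the algorithm with weighted importance sampling, following the annotated steps above; behavior(s), assumed to return the chosen action together with its probability b(a|s), is a hypothetical helper.

```python
from collections import defaultdict

def off_policy_mc(env, actions, behavior, episodes=100_000, gamma=1.0):
    """behavior(s) -> (action, probability b(a|s)) is an assumed helper."""
    Q = defaultdict(float)   # action-value estimates
    C = defaultdict(float)   # cumulative importance-sampling weights
    target = {}              # greedy (deterministic) target policy
    for _ in range(episodes):
        episode, s, done = [], env.reset(), False
        while not done:                                # generate episode
            a, prob = behavior(s)
            s2, r, done = env.step(a)
            episode.append((s, a, r, prob))
            s = s2
        g, w = 0.0, 1.0
        for s, a, r, prob in reversed(episode):        # iterate backwards
            g = r + gamma * g                          # accumulate discounted return
            C[(s, a)] += w
            Q[(s, a)] += (w / C[(s, a)]) * (g - Q[(s, a)])     # weighted IS update
            target[s] = max(actions, key=lambda x: Q[(s, x)])  # greedy improvement
            if a != target[s]:
                break            # pi(a|s) = 0, so earlier steps contribute nothing
            w /= prob            # incremental weight update; pi(a|s) = 1 for greedy pi
    return Q, target
```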
Temporal-difference methods
• Recall our general update rule from a couple of slides back
• Monte Carlo methods use the return from a full episode as the learning target
• Temporal-difference methods instead use a one-step sample return, bootstrapping on the current value estimate: the target is $R_{t+1} + \gamma\, V(S_{t+1})$
• We can apply temporal-difference methods with incomplete sequences, or when we don't have terminating episodes
If one had to identify one idea as central and novel to reinforcement learning, it would
undoubtedly be temporal-difference (TD) learning
- Sutton and Barto
Recap: Bellman equations
First TD algorithm: Sarsa
• Generate samples from the MDP using an $\varepsilon$-greedy policy
• For each sample $(S, A, R, S', A')$, update the state-action value using the discounted one-step sample return (a sketch follows below):
$Q(S, A) \leftarrow Q(S, A) + \alpha\,[\,R + \gamma\, Q(S', A') - Q(S, A)\,]$
Here $R + \gamma\, Q(S', A')$ is the TD-target, the whole bracketed difference is the TD-error, and $\alpha$ is the learning-rate parameter.
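A sketch of tabular Sarsa under the same assumed environment conventions:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=100_000, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)

    def policy(s):  # epsilon-greedy over the current Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = policy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = None if done else policy(s2)
            target = r if done else r + gamma * Q[(s2, a2)]  # TD-target
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # step along TD-error
            s, a = s2, a2
    return Q
```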
Three TD algorithms in just one slide
• Sarsa: samples the next action from the policy; target $R + \gamma\, Q(S', A')$
• Q-learning: target uses the greedy value of the next state, $R + \gamma\, \max_{a'} Q(S', a')$
• Expected Sarsa: target uses the expected value under the policy, $R + \gamma \sum_{a'} \pi(a' \mid S')\, Q(S', a')$
(The three targets are shown side by side in the sketch below.)
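The three TD-targets side by side, as expressions over a tabular Q; pi_probs(s), returning a dict of action probabilities, is a hypothetical helper:

```python
def sarsa_target(Q, r, s2, a2, gamma):
    return r + gamma * Q[(s2, a2)]                       # next action sampled from policy

def q_learning_target(Q, r, s2, actions, gamma):
    return r + gamma * max(Q[(s2, a)] for a in actions)  # greedy next value

def expected_sarsa_target(Q, r, s2, actions, pi_probs, gamma):
    return r + gamma * sum(pi_probs(s2)[a] * Q[(s2, a)] for a in actions)
```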
Q-learning again
• Considered ”one of the early breakthroughs in RL”
• Published by Watkins in 1989
• It is an off-policy algorithm that directly approximates the optimal action-value function
• State-action pairs are selected for evaluation by an $\varepsilon$-greedy behavior policy
• But the next action, and thus the next state-action value used in the update, is replaced by the greedy action for that state (a sketch follows below)
Source: Sutton-Barto 2nd ed
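A sketch of tabular Q-learning with the random behavior policy used in the experiments below; environment conventions are assumed as before.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=100_000, alpha=0.1, gamma=1.0):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions)  # random behavior policy
            s2, r, done = env.step(a)
            best = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])  # greedy bootstrap
            s = s2
    return Q
```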
Simulation experiments: Reference result
[Figure panels: greedy policy; action-value function; difference in value between actions.]
Monte Carlo off-policy; 100 000 000 episodes; random behavior policy; no discounting
Monte Carlo On-policy
• 100 000 learning episodes
• Decreasing epsilon according to state-action visit count
• Initial epsilon
Learning results: Action value function
[Figure] So, this illustrates Monte Carlo on-policy after 100 000 learning episodes
Battle of TD-agents
• Participating agents:
• Monte Carlo on-policy as episodic reference, on-policy, decreasing epsilon
• Sarsa, on-policy, decreasing epsilon
• Expected Sarsa, as on-policy, decreasing epsilon
• Expected Sarsa, as off-policy, random behavior policy,
• Q-learning, random behavior policy,
• 100 000 learning episodes for each
• Schedule for alpha: exponential decay, initial 0.2, reaching the target value 0.01 at round 90 000
• Schedule for epsilon: scaled by state-action visit count
MSE and wrong action calls*
*) When compared to reference case
Q-learning
• 100 000 learning episodes
• Constant epsilon:
So…
• We have covered basic model-free RL algorithms
• Algorithms that learn from episodes or from TD-updates
• That apply GPI: they work with a value function, in particular the state-action value function, and derive the corresponding policy from it
• That store the values of state-action pairs, i.e. use a tabular value representation