Introduction to Reinforcement Learning
Part III: Basic approximate methods in RL
Mikko Mäkipää 4.5.2022
Agenda
• First time: Part I
• Intro: Reinforcement learning as a ML approach
• Basic building blocks: agent and environment, MDP, policies, value functions, Bellman
equations, optimal policies and value functions
• Basic dynamic programming algorithms illustrated on a simple maze: Value iteration, Policy
iteration
• Last time: Part II
• Some more building blocks: GPI, bandits, exploration, TD updates,…
• Basic model-free methods using tabular value representation
• …illustrated on Blackjack: Monte Carlo on- vs off-policy; Sarsa, Expected Sarsa, Q-learning
Agenda
• This time: Part III
• Value-function-approximation-based methods
• Examples of linear representations: polynomial, tile coding, Fourier cosine basis
• Stochastic gradient descent and semi-gradient descent
• Batch updates: LSPI-LSTDQ
• Python implementation
• Results from simulation experiments
Recap: Markov Decision Process (MDP)
• A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, with states, actions, transition probabilities, rewards and a discount factor
• If we know all of these, we have a fully defined MDP and can apply Dynamic Programming (DP) (as covered in Part I)
• If we don’t, we can use a model to augment the MDP with estimates of, in practice, the transition probabilities and rewards - this would be a model-based approach
• Or, we use the agent’s direct experience with the environment to update our estimates of the value function - this would be a model-free approach
Recap: Discounted return, utility
• An agent exploring the MDP environment would observe a sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots$
• Discounted return, or utility, from time step $t$ onwards is the sum of discounted rewards received: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Recap: The state-value function
• If the agent was following a policy $\pi$, then in each state $s$, the agent would select the action $a = \pi(s)$ defined by that policy
• The state-value function of a state $s$ under policy $\pi$, denoted $v_\pi(s)$, is the expected discounted return when following the policy from state $s$ onwards: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
• Here we apply the Bellman expectation equation
Recap: The action-value function
• The action-value function $q_\pi(s,a)$ for policy $\pi$ defines the expected utility when starting in state $s$, performing action $a$ and following the policy thereafter: $q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Recap: Greedy policy from $q(s,a)$
• To derive the policy from the state-value function $v(s)$, we need to know the transition probabilities and rewards: $\pi(s) = \arg\max_a \sum_{s'} p(s' \mid s,a)\,[r(s,a,s') + \gamma v(s')]$
• But we can extract the policy directly from the action-value function: $\pi(s) = \arg\max_a q(s,a)$
• So, working with $q(s,a)$ enables us to be model-free
Recap: Monte Carlo – full episodes
• Sample a full episode from the MDP using an $\epsilon$-greedy policy
• For each state-action pair, estimate the value using average sample returns
• Maintain visit-counts $N(s,a)$ for each state-action pair
• Update value estimates based on an incremental average of the observed return: $Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)}\big(G - Q(s,a)\big)$
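A minimal sketch of this incremental-average Monte Carlo update (illustrative only, not the repository's implementation):

```python
# Incremental-average Monte Carlo update for one observed episode return G.
from collections import defaultdict

Q = defaultdict(float)   # action-value estimates, keyed by (state, action)
N = defaultdict(int)     # visit counts per (state, action)

def mc_update(state, action, G):
    """Update Q(s, a) with the incremental average of observed returns."""
    key = (state, action)
    N[key] += 1
    Q[key] += (G - Q[key]) / N[key]   # Q <- Q + (G - Q) / N
```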
Recap: Sarsa – Temporal-difference TD(0)
• Generate samples from the MDP using an $\epsilon$-greedy policy
• For each sample, update the state-action value using the discounted sample return: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\big[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\big]$
• Here $R_{t+1} + \gamma Q(S_{t+1},A_{t+1})$ is the TD-target, the bracketed difference is the TD-error, and $\alpha$ is the learning-rate parameter
Recap: Three TD algorithms
• Sarsa: Samples $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ and uses the target $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$
• Q-learning: Samples $(S_t, A_t, R_{t+1}, S_{t+1})$ and uses the target $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$
• Expected Sarsa: Samples $(S_t, A_t, R_{t+1}, S_{t+1})$ and uses the target $R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\,Q(S_{t+1}, a)$
Tabular methods
• Last time we discussed tabular methods, where action-values are stored in a
lookup table structure
• These algorithms read current values from and update values to table storage, e.g. for the Sarsa update: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\big[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\big]$ (read the current and next state-action values, then update the current value)
• Good: simple, fast, can be used to store other values as well (e.g. visit counts)
• Limitations: large state-action spaces, continuous values, does not generalize
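A minimal sketch of such a tabular Sarsa update with a dictionary as the lookup table (a hypothetical helper, not the repository code):

```python
# Tabular Sarsa update: read Q(s, a) and Q(s', a') from the table,
# then write the updated value back.
from collections import defaultdict

Q = defaultdict(float)  # lookup table keyed by (state, action)

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    td_target = r + gamma * Q[(s_next, a_next)]   # read next state-action value
    td_error = td_target - Q[(s, a)]              # read current value
    Q[(s, a)] += alpha * td_error                 # update current value
```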
Value function approximation
• The action-value function is represented with a parametric approximation $\hat{q}(s,a,\mathbf{w}) \approx q_\pi(s,a)$
• where $\mathbf{w}$ is a $d$-dimensional vector of weights
• We use $\mathbf{x}(s,a)$ to denote features representing a state-action pair
• If the approximate function is a linear function of the weight vector $\mathbf{w}$, we have $\hat{q}(s,a,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s,a) = \sum_{i=1}^{d} w_i\, x_i(s,a)$
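A minimal sketch of the linear approximation, assuming a feature function `x(s, a)` that returns a d-dimensional NumPy vector (an assumed interface, not the repository's API):

```python
import numpy as np

def q_hat(s, a, w, x):
    """Linear approximation: q_hat(s, a, w) = w . x(s, a)."""
    return np.dot(w, x(s, a))

def greedy_action(s, w, x, actions=("HIT", "STAND")):
    """Greedy policy derived from the approximate action-values."""
    return max(actions, key=lambda a: q_hat(s, a, w, x))
```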
RL problem setting and value function approximation
[Diagram: agent-environment loop] The agent performs an action and receives a raw signal from the environment; it pre-processes the received observation, creates features to represent the observed state, and approximates the value function of each state and action.
Sidenote:
Domain knowledge?
”Demonstrate superhuman performance
without human domain knowledge”
• prior knowledge == bad
“Make money at the casino playing RL
guided Blackjack”
• maximally exploit prior knowledge
Silver et al 2017: Mastering the game of Go without human knowledge
For our case of simple Blackjack
• We represent the state-action pair with four variables:
• total sum of dealer’s cards (2–11)
• total sum of player’s cards (4–21)
• does the player have a soft, usable ace (True, False)
• action to take (HIT, STAND)
• This gives 560 state-action pairs for a tabular lookup table*
• Or, we can derive feature vectors
for approximate methods using these
four variables
[Example hand: dealer showing 5, player total 20, no usable ace]
*) Compare this to Backgammon with ~10^20 states and Go with ~10^170 states
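A quick sanity check of the 560 state-action pairs, assuming hard player totals 4–21, soft (usable-ace) totals 12–21, dealer card values 2–11 and two actions; the exact ranges are our reading of the slide, not repository code:

```python
dealer_sums = range(2, 12)                           # 10 values
player_states = [(p, False) for p in range(4, 22)] + \
                [(p, True) for p in range(12, 22)]   # 18 hard + 10 soft = 28
actions = ("HIT", "STAND")

pairs = [(d, p, ace, a) for d in dealer_sums
         for (p, ace) in player_states
         for a in actions]
print(len(pairs))   # 560
```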
How to approximate this?
• Each subplot on the right shows
the action-values by one of
the four variables - dealer’s sum,
player’s sum, soft ace, action
• Now, we need to define a mapping from each state-action pair, i.e. from the values (dealer’s sum, player’s sum, soft ace, action), to features $\mathbf{x}(s,a)$
• And then approximate the action-value function with a suitable linear model with parameters $\mathbf{w}$
Estimated action-values for each of the 560 state-action pairs for simple blackjack
Note: shown values represent our reference result, a test run of 100 000 000 episodes of tabular off-policy Monte Carlo
How to approximate this? Start from here…?
• Subplots on the right show the
action-values for each of the 560
state-action pairs
• Values are grouped into four groups based on the (soft ace, action) combination
• x-axis shows values for player’s cards
• Equal values for dealer’s card are shown connected
[Legend: dealer’s card from 2 to 11]
Note: shown values represent our reference result, a test run of 100 000 000 episodes of tabular off-policy Monte Carlo
Three different value representations
• In the following we will briefly cover three different approaches to generating
features for linear models
• Polynomial approximation
• Fourier-cosine basis
• Tile Coding
• Notable omissions:
• Radial Basis Functions, i.e. Gaussians with fixed means and variances, appropriately positioned (as we have not tried them, but yes, they would make sense)
• Neural Networks (as they are non-linear)
First: Polynomial approximation
• We want to approximate the action-value function using a linear model
• using the variables we have available (from preprocessing the input signal)
• where the features are terms of an $n$th-degree polynomial of the form $x_i(s,a) = d^{c_{i,1}}\, p^{c_{i,2}}\, s^{c_{i,3}}\, a^{c_{i,4}}$, with $d$ the dealer’s sum, $p$ the player’s sum, $s$ the soft-ace indicator and $a$ the action indicator
• the exponent coefficients $c_{i,1}, c_{i,2}$ are from the set $\{0, 1, \dots, n\}$ and the indicator exponents $c_{i,3}, c_{i,4}$ are either 0 or 1
• For order $n = 4$, this would give us 100 terms. How to select a meaningful subset?
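A sketch of generating such polynomial features for one state-action pair, following our reading of the form above (not the repository code); `d` and `p` are assumed to be the (possibly scaled) dealer and player sums, `s` and `a` the 0/1 indicators:

```python
from itertools import product
import numpy as np

def poly_features(d, p, s, a, n=4):
    """Terms d^i * p^j * s^k * a^l with i, j in {0..n} and k, l in {0, 1}."""
    terms = []
    for i, j, k, l in product(range(n + 1), range(n + 1), (0, 1), (0, 1)):
        terms.append((d ** i) * (p ** j) * (s ** k) * (a ** l))
    return np.array(terms)

print(len(poly_features(0.5, 0.8, 1, 0)))   # (n+1)^2 * 2 * 2 = 100 terms for n = 4
```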
Experiment: Model selection for polynomial
features
• Model search against our reference result
obtained using a tabular method
• Forward-backward search applying Akaike
information criterion (AIC)
• start with a constant
• test adding remaining terms one at a
time
• add the one that improves AIC the most
• test if removing terms improves AIC
• repeat and stop when AIC no longer
improves
• Using the resulting polynomial model, we
apply approximate RL and compare the
result to the LS fit
[Diagram: model-search and evaluation setup - a polynomial model candidate is scored with AIC during the model search; the selected polynomial features are then both least-squares fitted against the reference result and used as the RL model in the RL method, and the two are compared in a performance comparison]
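A simplified sketch of forward-backward selection with AIC, under the assumptions of a least-squares fit and the Gaussian-error criterion AIC = n·ln(RSS/n) + 2k; this is illustrative, not the code used for the slides:

```python
import numpy as np

def aic(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ w) ** 2)
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

def forward_backward_select(features, y):
    """features: dict name -> column vector; returns selected feature names."""
    cols = {"const": np.ones_like(y), **features}
    selected = ["const"]
    best = aic(np.column_stack([cols[f] for f in selected]), y)
    improved = True
    while improved:
        improved = False
        # forward step: add the term that improves AIC the most
        candidates = [f for f in cols if f not in selected]
        scores = {f: aic(np.column_stack([cols[g] for g in selected + [f]]), y)
                  for f in candidates}
        if scores and min(scores.values()) < best:
            f = min(scores, key=scores.get)
            selected, best, improved = selected + [f], scores[f], True
        # backward step: drop a term if that improves AIC
        for f in [g for g in selected if g != "const"]:
            reduced = [g for g in selected if g != f]
            score = aic(np.column_stack([cols[g] for g in reduced]), y)
            if score < best:
                selected, best, improved = reduced, score, True
    return selected
```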
Selected polynomial: c 25 d4p3sa
• AIC-search gives us a polynomial with 27 terms as shown below
• Subplots on the right show the least-squares fit of this polynomial model against the reference result
• Reference values are shown as gray dotted lines; colored lines show the estimated action-values
[Legend: dealer’s card from 2 to 11]
Second: Fourier cosine basis
• A Fourier series approximates a periodic function as a weighted sum of sine and cosine basis functions of different frequencies
• Within a half-interval, the sine terms can be dropped, and the function can be represented as closely as desired with just the cosine terms
• So, we select our period as 2 and restrict the features to the half-interval [0,1]
• For the one-dimensional case, we get the feature terms $x_i(s) = \cos(i \pi s)$, $i = 0, 1, \dots, n$
• For example, a small order $n$ gives the features shown on the right
Fourier cosine basis in 2D
• When we represent each state-action pair with more than one dimension, the basis formula becomes $x_i(\mathbf{s}) = \cos(\pi\, \mathbf{c}^i \cdot \mathbf{s})$
• where $\mathbf{s} = (s_1, \dots, s_k)$ is the input scaled to $[0,1]^k$ and $\mathbf{c}^i = (c_1^i, \dots, c_k^i)$ with each $c_j^i \in \{0, 1, \dots, n\}$
• In our blackjack case, we use a different model for each of the four (soft ace, action) combinations, so the input is the two-dimensional vector of dealer’s sum and player’s sum, scaled to $[0,1]$
• We would get the cosine basis functions shown on the right for a low order $n$
• For RL testing, we use $n = 6$, giving 49 features for each (soft ace, action)-defined plane
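A sketch of the 2D Fourier cosine-basis features as described above (our reconstruction, with the scaling of the inputs assumed):

```python
from itertools import product
import numpy as np

def fourier_features(dealer, player, n=6):
    """dealer, player already scaled to [0, 1]; returns (n+1)^2 features
    x_i(s) = cos(pi * c_i . s) over all integer vectors c_i in {0..n}^2."""
    s = np.array([dealer, player])
    C = np.array(list(product(range(n + 1), repeat=2)))
    return np.cos(np.pi * C @ s)

print(fourier_features(0.3, 0.7).shape)   # (49,) for n = 6
```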
Third: Tile coding
• For tile coding (a form of coarse coding) the
feature space is divided into partitions, or
tiles
• We get binary features: for each input, a feature has either the value 1 (input within the tile) or 0 (outside the tile)
• For this to work, the tiles have to overlap, so that multiple tiles are active for each input
• For testing, we use 160 tiles in total; the overlapping tilings create 3x3 areas with four different tiles active for each input
• With this, we get a different set of tiles to cover each of the (soft ace, action) combinations
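A generic tile-coding sketch over the 2D (dealer, player) input with a few offset tilings; the tiling sizes here are illustrative assumptions, not the exact 160-tile layout used in the slides:

```python
import numpy as np

def tile_features(dealer, player, n_tilings=4, tiles_per_dim=6):
    """Inputs scaled to [0, 1]; returns a binary feature vector with
    n_tilings active (value 1) entries, one per offset tiling."""
    features = np.zeros(n_tilings * tiles_per_dim * tiles_per_dim)
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)      # shift each tiling slightly
        i = min(int((dealer + offset) * tiles_per_dim), tiles_per_dim - 1)
        j = min(int((player + offset) * tiles_per_dim), tiles_per_dim - 1)
        features[t * tiles_per_dim ** 2 + i * tiles_per_dim + j] = 1.0
    return features

print(int(tile_features(0.4, 0.9).sum()))   # 4 tiles active
```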
Stochastic Gradient Descent
• So, we want to represent the action-value function with a parametric model $\hat{q}(s,a,\mathbf{w})$ with weights $\mathbf{w}$
• We can use different approaches to generate features for our model, such as the Fourier cosine basis…
• To determine values for the weights, we apply gradient descent… or a variant thereof
Backgrounder: Stochastic Gradient Descent
• We have an error function we want to minimize, in this case the MSE between the actual action-values and the estimated values given by our function approximation: $E(\mathbf{w}) = \big(q_\pi(s,a) - \hat{q}(s,a,\mathbf{w})\big)^2$
• The gradient of a function shows the direction of steepest ascent
• For the error function, the (negative half) gradient is given as $-\tfrac{1}{2}\nabla_{\mathbf{w}} E(\mathbf{w}) = \big(q_\pi(s,a) - \hat{q}(s,a,\mathbf{w})\big)\,\nabla_{\mathbf{w}} \hat{q}(s,a,\mathbf{w})$
Stochastic Gradient Descent (SGD)
• Stochastic gradient descent approximates the gradient using a single sample
• To reduce the error, we update the weights towards the negative gradient: $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\big(q_\pi(S_t,A_t) - \hat{q}(S_t,A_t,\mathbf{w}_t)\big)\,\nabla_{\mathbf{w}} \hat{q}(S_t,A_t,\mathbf{w}_t)$
• where $\alpha$ is a learning-rate parameter
• Gradient descent converges to a local minimum of the error function if the learning rate is decreased over iterations
SGD with linear model
• For a linear model of the action-value function, $\hat{q}(s,a,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s,a)$
• the gradient is given by $\nabla_{\mathbf{w}} \hat{q}(s,a,\mathbf{w}) = \mathbf{x}(s,a)$
• and the update rule reduces to $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\big(q_\pi(S_t,A_t) - \hat{q}(S_t,A_t,\mathbf{w}_t)\big)\,\mathbf{x}(S_t,A_t)$
Simple SGD example
• We want to find the best-fitting line $y = a x + b$ to the observations $(x_i, y_i)$
• Our weight vector is $\mathbf{w} = (a, b)$, and the partial derivatives for the gradient of the single-sample squared error $e_i = (y_i - a x_i - b)^2$ are $\partial e_i / \partial a = -2 x_i (y_i - a x_i - b)$ and $\partial e_i / \partial b = -2 (y_i - a x_i - b)$
• The SGD updates are $a \leftarrow a + \alpha\, x_i (y_i - a x_i - b)$ and $b \leftarrow b + \alpha\, (y_i - a x_i - b)$
• Light blue line is the (unknown) relation, from which the
observations are generated by adding N(0,1) noise
• Blue dots are the known observations
• Dark blue line is given by the least-squares estimate for a
and b
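A minimal sketch of this SGD line-fitting example, one sample per update; the data below is assumed for illustration, not the data behind the slide’s figures:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 200)    # unknown relation plus N(0,1) noise

a, b, alpha = 0.0, 0.0, 0.05
for _ in range(50):                          # several passes over the data
    for i in rng.permutation(len(x)):
        error = y[i] - (a * x[i] + b)        # residual of a single observation
        a += alpha * error * x[i]            # a <- a + alpha * x_i * (y_i - a*x_i - b)
        b += alpha * error                   # b <- b + alpha * (y_i - a*x_i - b)

print(a, b)   # should end up close to the least-squares estimate
```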
SGD example
Note: error surface calculated for all of the observations
Sample gradients
Above: the error surface calculated using all of the observations. Images on the right show the gradients calculated based on samples for the first eight iterations.
SGD example
• Figure on the right shows how the line
changes as slope and intercept are
updated during gradient descent
• Light green line shows the final line,
orange is the line obtained applying
least-squares
RL: Semi-gradient descent
• In RL, we do not have a target value for $q_\pi(s,a)$ that we could use in a supervised-learning-type update
• So, for model-free RL, we substitute an approximate target $U_t$ that we get by interacting with the environment: $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\big(U_t - \hat{q}(S_t,A_t,\mathbf{w}_t)\big)\,\nabla_{\mathbf{w}} \hat{q}(S_t,A_t,\mathbf{w}_t)$
• and we live with the fact that $U_t$* also depends on our current weights $\mathbf{w}_t$. Not being a true gradient descent method, this is called semi-gradient descent
*) such targets are called bootstrapping targets or bootstrapping estimates in the literature
RL: Semi-gradient descent
• For full-episode Monte Carlo, we use the observed return $G_t$ as the target
• to get the semi-gradient update $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\big(G_t - \hat{q}(S_t,A_t,\mathbf{w}_t)\big)\,\nabla_{\mathbf{w}} \hat{q}(S_t,A_t,\mathbf{w}_t)$
• For temporal-difference Sarsa, we use the sample return $U_t = R_{t+1} + \gamma\, \hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t)$ as the target
Semi-gradient Sarsa TD(0)
[Algorithm box, source: Sutton-Barto 2nd ed.] The agent learns from each episode, acting with an ε-greedy policy based on the estimated action-values, iterates forward through the episode, and applies a semi-gradient weight update at each step.
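A condensed sketch of episodic semi-gradient Sarsa with a linear model, following the pseudocode referenced above; `env`, its `reset`/`step` methods and the feature function `x(s, a)` are assumed interfaces, not the repository’s actual API:

```python
import numpy as np

def semi_gradient_sarsa(env, x, actions, d, episodes=1000,
                        alpha=0.01, gamma=1.0, eps=0.1):
    w = np.zeros(d)
    q = lambda s, a: np.dot(w, x(s, a))

    def policy(s):   # epsilon-greedy on the estimated action-values
        if np.random.rand() < eps:
            return np.random.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    for _ in range(episodes):
        s, done = env.reset(), False
        a = policy(s)
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                               # terminal: no bootstrap term
            else:
                a_next = policy(s_next)
                target = r + gamma * q(s_next, a_next)   # bootstrapping target
            w += alpha * (target - q(s, a)) * x(s, a)    # semi-gradient update
            if not done:
                s, a = s_next, a_next
    return w
```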
Convergence and the deadly triad
• Three elements, making up the deadly triad
• Function approximation: Generalizing over the state space (e.g., linear function approximation or ANNs)
• Bootstrapping: Update targets that include existing estimates rather than relying exclusively on actual rewards and complete returns
• Off-policy training: Training on a distribution other than that of the target policy
• Any two of these can be handled; combining all three leads to instability and divergence
Source: Sutton-Barto 2nd ed, Table source: David Silver: UCL Course on RL
Batch method: LSPI-LSTDQ
• So far, we have considered episodic methods, using return targets from either a full episode or TD samples
• Batch methods apply a different approach: the agent generates experience, a batch of samples from the environment, which is then used for learning
• Samples can be reused or replayed in a different order
• LSPI-LSTDQ is an example of a batch policy improvement algorithm, using a batch of samples to solve for the weights with least squares
• Solving the linear system involves a matrix (pseudo-)inversion at each iteration of the algorithm
LSPI-LSTDQ
Source: Lagoudakis, Parr 2003
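A compact sketch of the LSTDQ weight solve inside LSPI, in the spirit of Lagoudakis and Parr (2003); the sample format, feature function `x(s, a)` and policy representation are assumed interfaces, not the paper’s or repository’s exact code:

```python
import numpy as np

def lstdq(samples, x, policy, d, gamma=1.0):
    """One policy-evaluation step: accumulate A and b over the batch and
    solve A w = b with a pseudo-inverse."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, a, r, s_next, done in samples:
        phi = x(s, a)
        phi_next = np.zeros(d) if done else x(s_next, policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.pinv(A) @ b

def lspi(samples, x, actions, d, gamma=1.0, iters=20):
    w = np.zeros(d)
    for _ in range(iters):
        # greedy policy derived from the current weights
        policy = lambda s, w=w: max(actions, key=lambda a: np.dot(w, x(s, a)))
        w = lstdq(samples, x, policy, d, gamma)   # re-solve weights for that policy
    return w
```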
RL algorithm implementation
• Learning goals were
• improve Python skills beyond simple scripting and notebook use
• gain experience in modeling complex real-world (?) entities (such as
agents, policies and environments) in Python
• gain understanding on how the RL algorithms actually work
• develop ideas for real RL use cases to tackle later
• Common toy-problems that enable fast iterations and
straightforward visualization of both the data and results
• From-scratch ideology
• Downside: slow and painful, as excellent libraries with ready implementations are readily available
Tech stack
WSL
Ubuntu
Conda
VS Code
Python 3.10
Mypy
Numpy
Pandas
Jupyter
Matplotlib
Seaborn
Rough class diagram
GitHub Repositories
• Dynamic Programming – Jupyter notebooks
https://github.com/mmakipaa/dp
• RL algorithms – Python
https://github.com/mmakipaa/rl
• Results visualization – Jupyter notebooks
https://github.com/mmakipaa/rl-results
Simulation experiments: Reference result
[Panels: greedy policy; action-value function; difference in value between actions]
Monte Carlo off-policy; 100 000 000 episodes; random behavior policy; no discounting
So…
• We have covered basic model-free RL
algorithms
• Algorithms that learn from episodes or
from TD-updates, and a single example of
a batch algorithm
• That apply GPI; they work with the state-action value function, and derive the corresponding policy from that
• They store the values of state-action pairs, i.e. use a tabular value representation
• Or use linear function approximation to represent the value of state-action pairs, and apply a variant of gradient descent to learn from experience
What was not covered
• Introductions to RL typically cover both prediction, i.e. working with state-value function, and control, working with action-value function.
We pretty much skipped the prediction part (as we do not find it interesting or useful if control is feasible)
• Between presented single-step TD(0) and full episode Monte Carlo approaches, a set of methods exists that use several return steps in
updates, TD-lambda or n-step methods. These were not covered
• We have covered only action-value GPI methods that work with value-function and derive a corresponding policy from the action-values. A
family of methods that work directly with parametrized policy, policy gradient methods such as REINFORCE, were not covered
• We covered dynamic programming in the first part for fully defined cases. For situations where the MDP is not known, we concentrated on model-free methods only and did not cover model-based approaches
• We limited the discussion to linear function approximation methods, even hinting that applying non-linear methods would lead to issues
like divergence. This may be true. Yet, most of the groundbreaking results in RL have been obtained applying Deep Neural Networks. For
instance, DQN (Deep NNs and Q-learning) for Atari games is standard textbook material
• We did not discuss tree-search algorithms, such as MCTS
• There is a huge number of interesting recent algorithms and variants that were not covered. The discussion pretty much follows the
standard textbook level of knowledge from some years back, as covered in Sutton-Barto and David Silver’s lectures from 2015
• And notably, we omitted almost all of the convergence proofs that are diligently covered in most lecture materials and books. We are sorry
about that