Introduction to Artificial Intelligence
Reinforcement Learning
Andres Mendez-Vazquez
April 26, 2019
Outline
1 Reinforcement Learning
Introduction
Example
A K-Armed Bandit Problem
Exploration vs Exploitation
The Elements of Reinforcement Learning
Limitations
2 Formalizing Reinforcement Learning
Introduction
Markov Process and Markov Decision Process
Return, Policy and Value function
3 Solution Methods
Multi-armed Bandits
Problems of Implementations
General Formula in Reinforcement Learning
A Simple Bandit Algorithm
Dynamic Programming
Policy Evaluation
Policy Iteration
Policy and Value Improvement
Example
Temporal-Difference Learning
A Small Review of Monte Carlo Methods
The Main Idea
The Monte Carlo Principle
Going Back to Temporal Difference
Q-learning: Off-Policy TD Control
4 The Neural Network Approach
Introduction, TD-Gammon
TD(λ) by Back-Propagation
Conclusions
Reinforcement Learning is learning what to do
What does Reinforcement Learning want?
To maximize a numerical reward signal.
The learner is not told which actions to take.
Instead, it must discover which actions yield the most reward by trying them.
These ideas come from Dynamical Systems Theory
Specifically
From the optimal control of incompletely-known Markov Decision
Processes (MDPs).
Thus, the device must do the following
Interact with the environment and learn from the punishments and
rewards it receives.
A mobile robot
It decides whether it should enter a new room
in search of more trash to collect,
or go back to its base to recharge its battery.
It needs to take this decision based on
the current charge level of its battery,
and how quickly it can get back to its base.
A K-Armed Bandit Problem
Definition
You are faced repeatedly with a choice among k different options, or
actions.
After each choice you receive a numerical reward
drawn from a stationary probability distribution that depends on the action you selected.
Definition (Stationary Probability)
A stationary distribution of a Markov chain is a probability distribution
that remains unchanged as the chain progresses in time.
Example
A Two State System
Therefore
Each of the k actions has an expected value
q∗ (a) = E [Rt|At = a]
Something Notable
If you knew the value of each action,
It would be trivial to solve the k-armed bandit problem.
What do we want?
To find the estimated value of action a at time t, Qt (a), such that
|Qt (a) − q∗ (a)| < ε, with ε as small as possible
Greedy Actions
Intuitions
To choose the action with the greatest E [Rt|At = a]
This is known as Exploitation
You are exploiting your current knowledge of the values of the actions.
Contrasting with Exploration
When you select the non-greedy actions in order to improve your
estimates of their values.
An Interesting Conundrum
It has been observed that
Exploitation maximizes the expected reward on the single next step,
while Exploration may produce a greater total reward in the long run.
Thus, the idea
The value of a state s is the total amount of reward an agent can
expect to accumulate over the future, starting from s.
Thus, Values
They indicate the long-term profit of states after taking into account
the states that follow and their rewards.
Therefore, the parts of a Reinforcement Learning system
A Policy
It defines the learning agent’s way of behaving at a given time.
The policy is the core of a reinforcement learning system.
In general, policies may be stochastic.
A Reward Signal
It defines the goal of a reinforcement learning problem.
The system’s sole objective is to maximize the total reward over the
long run.
This encourages Exploitation
It is immediate... you might want something with a longer-term view.
A Value Function
It specifies what is good in the long run.
This encourages Exploration
Finally
A Model of the Environment
This simulates the behavior of the environment,
allowing us to make inferences about how the environment will behave.
Something Notable
Models are used for planning,
meaning we make decisions before they are executed.
Thus, Reinforcement Learning methods that use models are called
model-based.
Limitations
Given that
Reinforcement learning relies heavily on the concept of state,
it has the same problem as Markov Decision Processes:
defining the state so as to obtain a good representation of the search
space.
This is where the creative part of the problem comes in
As we have seen, solutions in AI rely a lot on the creativity of the
solution...
The System–Environment Interaction
In a Markov Decision Process
Markov Process and Markov Decision Process
Definition of Markov Process
A sequence of states is Markov if and only if the probability of moving
to the next state st+1 depends only on the present state st:
P [st+1 | st] = P [st+1 | s1, ..., st]
Something Notable
We always talk about time-homogeneous Markov chains in
Reinforcement Learning:
the probability of the transition is independent of t,
P [st+1 = s′ | st = s] = P [st = s′ | st−1 = s]
Formally
Definition (Markov Process)
A Markov Process (or Markov Chain) is a tuple (S, P), where
1 S is a finite set of states
2 P is a state transition probability matrix, Pss′ = P [st+1 = s′ | st = s]
Thus, the dynamic of the Markov process is as follows:
s0 −(Ps0s1)→ s1 −(Ps1s2)→ s2 −(Ps2s3)→ s3 −→ · · ·
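As a small illustration (not from the slides), here is a minimal Python sketch of sampling a trajectory from a Markov chain defined by a transition matrix P; the two-state matrix below is an assumed example.

```python
import numpy as np

# Assumed example: two-state transition matrix, P[s, s'] = P[s_{t+1} = s' | s_t = s]
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def sample_chain(P, s0, steps, rng=np.random.default_rng(0)):
    """Sample s0 -> s1 -> s2 -> ... by drawing each next state from row s of P."""
    states = [s0]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_chain(P, s0=0, steps=10))
```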
Then, we get the following familiar definition
Definition
An MDP is a tuple (S, A, P, R, α), where
1 S is a finite set of states
2 A is a finite set of actions
3 P is a state transition probability matrix,
Pass′ = P [st+1 = s′ | st = s, At = a]
4 α ∈ [0, 1] is called the discount factor
5 R : S × A → R is a reward function
Thus, the dynamic of the Markov Decision Process is as follows, using
a probability to pick each action:
s0 −(a0)→ s1 −(a1)→ s2 −(a2)→ s3 −→ · · ·
The Goal of Reinforcement Learning
We want to maximize the expected value of the return
i.e. To obtain the optimal policy!!!
Thus, as in our previous study of MDP’s:
Gt = Rt+1 + α Rt+2 + · · · = Σ_{k=0}^{∞} α^k R_{t+k+1}
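To make the return concrete, here is a small sketch (the reward sequence is an assumed example) computing Gt by the recursion Gt = Rt+1 + α Gt+1:

```python
def discounted_return(rewards, alpha):
    """G_t = sum_k alpha^k R_{t+k+1}, accumulated backwards as G_t = R_{t+1} + alpha * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + alpha * g
    return g

# Assumed example: rewards R_{t+1}, R_{t+2}, ...
print(discounted_return([1.0, 0.0, 2.0, 1.0], alpha=0.9))
```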
Remarks
Interpretation
The discounted future rewards can be interpreted as the current
value of future rewards.
The discount factor α
α close to 0 leads to “short-sighted” evaluation... you are only
looking for immediate rewards!!!
α close to 1 leads to “long-sighted” evaluation... you are willing to
wait for a better reward!!!
Why use this discount factor?
Basically
1 The discount factor avoids infinite returns in cyclic Markov processes,
given the properties of the geometric series.
2 There is some uncertainty about future rewards.
3 If the reward is financial, immediate rewards may earn more interest
than delayed rewards.
4 Animal/human behavior shows a preference for immediate reward.
A Policy
Using our MDP ideas...
A policy π is a distribution over actions given states:
π (a|s) = P [At = a|St = s]
A policy guides the choice of action at a given state,
and it is time independent.
Then, we introduce two value functions
State-value function.
Action-value function.
The state-value function vπ of an MDP
It is defined as
vπ (s) = Eπ [Gt | St = s]
This gives the long-term value of the state s:
vπ (s) = Eπ [Gt | St = s]
       = Eπ [ Σ_{k=0}^{∞} α^k R_{t+k+1} | St = s ]
       = Eπ [Rt+1 + α Gt+1 | St = s]
       = Eπ [Rt+1 | St = s] (Immediate Reward) + Eπ [α vπ (St+1) | St = s] (Discounted value of the successor state)
The Action-Value Function qπ (s, a)
It is the expected return starting from state s,
taking action a, and then following policy π:
qπ (s, a) = Eπ [Gt | St = s, At = a]
We can also obtain the following decomposition:
qπ (s, a) = Eπ [Rt+1 + α qπ (St+1, At+1) | St = s, At = a]
Now...
To simplify notation, we define
R^a_s = E [Rt+1 | St = s, At = a]
Then, we have
vπ (s) = Σ_{a∈A} π (a|s) qπ (s, a)
qπ (s, a) = R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′)
Then
Expressing qπ (s, a) in terms of vπ (s) in the expression of vπ (s) gives
The Bellman Equation
vπ (s) = Σ_{a∈A} π (a|s) [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′) ]
The Bellman equation relates the state-value function
of one state with that of other states.
Also
Bellman equation for qπ (s, a):
qπ (s, a) = R^a_s + α Σ_{s′∈S} P^a_{ss′} Σ_{a′∈A} π (a′|s′) qπ (s′, a′)
How do we solve this problem?
We have the following solutions
Action-Value Methods
One natural way to estimate this is by averaging the rewards actually
received:
Qt (a) = Σ_{i=1}^{t−1} Ri · 1{Ai = a} / Σ_{i=1}^{t−1} 1{Ai = a}
Nevertheless
If the denominator is zero, we define Qt (a) as some default value.
Now, as the denominator goes to infinity, by the law of large numbers,
Qt (a) −→ q∗ (a)
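A direct, if inefficient, translation of the sample-average formula into Python; the lists of past actions and rewards and the default value are assumptions made for illustration:

```python
def sample_average(actions, rewards, a, default=0.0):
    """Q_t(a): average of the rewards received on the steps where action a was taken."""
    picked = [r for act, r in zip(actions, rewards) if act == a]
    return sum(picked) / len(picked) if picked else default  # default when a was never taken

print(sample_average(actions=[0, 1, 0, 2], rewards=[1.0, 0.5, 3.0, 2.0], a=0))  # 2.0
```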
Furthermore
We call this the sample-average method
Not the best, but it allows us to introduce a way to select actions...
A classic one is the greedy action selection:
At = arg max_a Qt (a)
Therefore
Greedy action selection always exploits current knowledge to
maximize immediate reward.
It spends no time on inferior actions to see if they might really be
better.
And Here a Heuristic
A simple alternative is to act greedily most of the time, but
every once in a while, say with small probability ε, select randomly
from among all the actions with equal probability.
This is called the ε-greedy method
In the limit of infinitely many steps, this method ensures:
Qt (a) −→ q∗ (a)
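A minimal sketch of ε-greedy selection over the current estimates Q (an array indexed by action); breaking ties via argmax is an assumption:

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit
```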
It is like the Mean Filter
We have a nice result
Example
In the following example
The q∗ (a) were selected according to a Gaussian Distribution with
mean 0 and variance 1.
Then, when an action is selected, the reward Rt was selected from a
normal distribution with mean q∗ (At) and variance 1.
We have then
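A sketch of this testbed, assuming k = 10 arms and the distributions stated above; the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
k = 10
q_star = rng.normal(0.0, 1.0, size=k)      # true action values q*(a) ~ N(0, 1), drawn once

def bandit(action):
    """Reward for the chosen arm: R_t ~ N(q*(A_t), 1)."""
    return rng.normal(q_star[action], 1.0)
```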
Now, as the sampling goes up
We get
Imagine the following
Let Ri denote the reward received after the i-th selection of this action. Then
Qn = (R1 + R2 + · · · + Rn−1) / (n − 1)
The problem with this technique
The amount of memory and computation needed to keep updating the
estimate grows as more rewards arrive.
Therefore
We need something more efficient
We can do something like that by realizing the following:
Qn+1 = (1/n) Σ_{i=1}^{n} Ri
     = (1/n) [ Rn + Σ_{i=1}^{n−1} Ri ]
     = (1/n) [ Rn + (n − 1) Qn ]
     = Qn + (1/n) [ Rn − Qn ]
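The incremental form needs only the current estimate and the count, as in this small sketch:

```python
def update_estimate(q_n, n, reward):
    """Q_{n+1} = Q_n + (1/n) * (R_n - Q_n): same result as averaging all n rewards."""
    return q_n + (reward - q_n) / n
```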
General Formula in Reinforcement Learning
This pattern appears almost all the time in RL:
New Estimate = Old Estimate + Step Size × [Target − Old Estimate]
It looks a lot like gradient descent if we assume a first-order Taylor expansion
f (x) ≈ f (a) + f′(a) (x − a) =⇒ f′(a) ≈ (f (x) − f (a)) / (x − a)
Then
xn+1 = xn + γ f′(a) = xn + γ (f (x) − f (a)) / (x − a) = xn + γ̃ [f (x) − f (a)],
absorbing the factor (x − a) into the step size γ̃.
A Simple Bandit Algorithm
Initialize, for a = 1 to k:
1 Q (a) ← 0
2 N (a) ← 0
Loop until a certain threshold γ is reached:
1 A ← arg max_a Q (a) with probability 1 − ε, or a random action with probability ε
2 R ← bandit (A)
3 N (A) ← N (A) + 1
4 Q (A) ← Q (A) + (1 / N (A)) [R − Q (A)]
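A hedged Python sketch of the loop above; the number of steps, ε, and the bandit(A) reward function (for example the testbed sketched earlier) are assumptions:

```python
import numpy as np

def run_bandit(bandit, k, epsilon=0.1, steps=1000, rng=np.random.default_rng(0)):
    """epsilon-greedy action selection with incremental sample-average updates."""
    Q = np.zeros(k)                # Q(a) <- 0
    N = np.zeros(k, dtype=int)     # N(a) <- 0
    for _ in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))       # explore with probability epsilon
        else:
            A = int(np.argmax(Q))          # exploit with probability 1 - epsilon
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]          # Q(A) <- Q(A) + (1/N(A)) [R - Q(A)]
    return Q, N
```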
Remarks
Remember
This works well for a stationary problem.
For more on non-stationary problems, see
“Reinforcement Learning: An Introduction” by Sutton and Barto,
chapter 2.
Dynamic Programming
As with many things in Computer Science
We love Divide et Impera...
Then
In Reinforcement Learning, the Bellman equation gives a recursive
decomposition, and the value function stores and reuses solutions.
Policy Evaluation
For this, we assume as always
The problem can be solved with the Bellman Equation.
We start from an initial guess v1,
then we apply the Bellman equation iteratively:
v1 → v2 → · · · → vπ
For this, we use synchronous updates
We update the value function of the present iteration for all states at the
same time, based on the values of the previous iteration.
Then
At each iteration k + 1, for all states s ∈ S
We update vk+1 (s) from vk (s′) according to the Bellman equation, where
s′ is a successor state of s:
vk+1 (s) = Σ_{a∈A} π (a|s) [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vk (s′) ]
Something Notable
The iterative process stops when the maximum difference between the
value function at the current step and that at the previous step is
smaller than some small positive constant.
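A sketch of synchronous iterative policy evaluation for a small finite MDP; the array layout (P[a, s, s'], R[s, a], pi[s, a]) is an assumption made for illustration:

```python
import numpy as np

def policy_evaluation(P, R, pi, alpha=0.9, theta=1e-6):
    """Iterate v_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + alpha * sum_s' P(s'|s,a) v_k(s') ].

    P[a, s, t]: probability of moving from s to t under action a.
    R[s, a]:    expected immediate reward for taking a in s.
    pi[s, a]:   probability of taking a in s.
    """
    v = np.zeros(P.shape[1])
    while True:
        q = R + alpha * np.einsum('ast,t->sa', P, v)   # q(s, a) under the current v_k
        v_new = np.sum(pi * q, axis=1)                 # synchronous update for every state
        if np.max(np.abs(v_new - v)) < theta:          # stop when the maximum change is small
            return v_new
        v = v_new
```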
However
Our ultimate goal is to find the optimal policy
For this, we can use the policy iteration process
Algorithm
1 Initialize π randomly
2 Repeat until the previous policy and the current policy are the same:
1 Evaluate vπ by policy evaluation.
2 Using synchronous updates, for each state s, let
π′ (s) = arg max_{a∈A} qπ (s, a) = arg max_{a∈A} [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′) ]
Now
Something Notable
This policy iteration algorithm will improve the policy at each step.
Assume that
We have a deterministic policy π, and after one improvement step we get π′.
We know that the value improves in one step because
qπ (s, π′ (s)) = max_{a∈A} qπ (s, a) ≥ qπ (s, π (s)) = vπ (s)
Then
We can see that
vπ (s) ≤ qπ (s, π′ (s)) = Eπ′ [Rt+1 + α vπ (St+1) | St = s]
      ≤ Eπ′ [Rt+1 + α qπ (St+1, π′ (St+1)) | St = s]
      ≤ Eπ′ [Rt+1 + α Rt+2 + α² qπ (St+2, π′ (St+2)) | St = s]
      ≤ · · · ≤ Eπ′ [Rt+1 + α Rt+2 + · · · | St = s] = vπ′ (s)
Then
If improvements stop, then
qπ (s, π′ (s)) = max_{a∈A} qπ (s, a) = qπ (s, π (s)) = vπ (s)
This means that the Bellman optimality equation has been satisfied:
vπ (s) = max_{a∈A} qπ (s, a)
Therefore, vπ (s) = v∗ (s) for all s ∈ S and π is an optimal policy.
Once the policy has been improved
We can iteratively build a sequence of monotonically improving policies and
values:
π0 −E→ vπ0 −I→ π1 −E→ vπ1 −I→ π2 −E→ · · · −I→ π∗ −E→ v∗
with E = evaluation and I = improvement.
Policy Iteration (using iterative policy evaluation)
Step 1. Initialization
V (s) ∈ R and π (s) ∈ A (s) arbitrarily for all s ∈ S
Step 2. Policy Evaluation
1 Loop:
2 ∆ ← 0
3 Loop for each s ∈ S:
4 v ← V (s)
5 V (s) ← Σ_{s′,r} p (s′, r | s, π (s)) [r + αV (s′)]
6 ∆ ← max (∆, |v − V (s)|)
7 until ∆ < θ (a small threshold for accuracy)
Further
Step 3. Policy Improvement
1 policy-stable ← true
2 For each s ∈ S:
3 old_action ← π (s)
4 π (s) ← arg max_a Σ_{s′,r} p (s′, r | s, a) [r + αV (s′)]
5 if old_action ≠ π (s), then policy-stable ← false
6 If policy-stable, then stop and return V ≈ v∗ and π ≈ π∗; else go to Step 2.
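A sketch of the complete loop, reusing the policy_evaluation sketch above and the same assumed array layout; ties in the arg max are broken arbitrarily:

```python
import numpy as np

def policy_iteration(P, R, alpha=0.9, theta=1e-6):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # deterministic policy: state -> action
    while True:
        pi = np.eye(n_actions)[policy]            # one-hot pi[s, a] for evaluation
        v = policy_evaluation(P, R, pi, alpha, theta)
        q = R + alpha * np.einsum('ast,t->sa', P, v)
        new_policy = np.argmax(q, axis=1)         # pi'(s) = argmax_a [ R(s,a) + alpha sum_t P v ]
        if np.array_equal(new_policy, policy):    # policy stable: v ~ v*, policy ~ pi*
            return policy, v
        policy = new_policy
```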
Jack’s Car Rental
Setup
We have a car rental business with two locations (s1, s2).
Each time Jack has a car available and rents it out, he is credited $10 by
the company.
If Jack is out of cars at a location, that business is lost.
Cars become available for renting the day after they are returned.
Jack has options
He can move cars between the two locations overnight, at a cost of
$2 per car, and at most 5 cars can be moved.
We choose a probability for modeling requests and returns
Additionally, renting at each location follows a Poisson distribution:
P (n | λ) = (λ^n / n!) e^{−λ}
We simplify the problem by having an upper bound n = 20
Basically, any additional car generated by the returns simply disappears!!!
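A small sketch of the truncated Poisson probabilities used in this model; the cap n = 20 comes from the slide, while lumping the leftover tail mass at the cap is an assumption about how to handle it:

```python
import math

def truncated_poisson(lam, n_max=20):
    """P(n | lambda) = lambda^n e^{-lambda} / n! for n < n_max, with the tail mass placed at n_max."""
    probs = [lam ** n * math.exp(-lam) / math.factorial(n) for n in range(n_max)]
    probs.append(1.0 - sum(probs))     # any count beyond the cap is treated as n_max
    return probs

requests_at_s1 = truncated_poisson(3)  # lambda = 3 for requests at the first location
```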
Finally
We have the following λ’s for requests and returns
1 Requests at s1: λ = 3
2 Requests at s2: λ = 4
3 Returns at s1: λ = 3
4 Returns at s2: λ = 2
Each λ defines how many cars are requested or returned in a time
interval (one day).
69 / 111
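A minimal sketch of how these rates translate into expected daily rental revenue at one location, assuming the $10 credit per rental from the setup; the helper `expected_rentals` and the cap at 20 requests are illustrative choices, not part of the slides:

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    return lam ** n * exp(-lam) / factorial(n)

def expected_rentals(cars_available, lam_request, cap=20):
    """Expected number of cars actually rented at one location in a day:
    requests beyond the available cars are lost business.
    (The Poisson tail beyond the cap is ignored here.)"""
    expected = 0.0
    for n in range(cap + 1):
        expected += poisson_pmf(n, lam_request) * min(n, cars_available)
    return expected

if __name__ == "__main__":
    # location s1 with request rate 3, location s2 with request rate 4
    for cars in (0, 5, 10, 20):
        r1 = 10.0 * expected_rentals(cars, 3)
        r2 = 10.0 * expected_rentals(cars, 4)
        print(f"{cars:2d} cars: expected revenue s1 = ${r1:5.2f}, s2 = ${r2:5.2f}")
```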
Thus, we have the following run of the algorithm (policy iteration on this problem)
With a discount factor α = 0.9 (the resulting sequence of policies was shown as a figure)
70 / 111
History
There is original earlier work
By Arthur Samuel, which Sutton later built on to develop TD(λ)
(Temporal-Difference learning with eligibility traces)
It was famously applied by Gerald Tesauro
To build TD-Gammon
A program that learned to play backgammon at the level of expert
human players
72 / 111
Temporal-Difference (TD) Learning
Something Notable
It is possibly the central and novel idea of reinforcement learning
It has similarities with Monte Carlo Methods
Both use experience to solve the prediction problem (for more, look
at Sutton's book)
Monte Carlo methods wait until the return Gt is known

V(St) ← V(St) + γ [Gt − V(St)]  with  Gt = Σ_{k=0}^{∞} α^k R_{t+k+1}

Let us call this method constant-α Monte Carlo (a small sketch is given below).
73 / 111
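A minimal sketch of the constant step-size Monte Carlo update above, keeping this deck's notation (γ as the step size, α as the discount factor); the toy episode data is made up for illustration:

```python
def constant_step_monte_carlo(episodes, gamma_step=0.1, alpha_discount=0.9, V=None):
    """Constant step-size Monte Carlo: V(S_t) <- V(S_t) + gamma * (G_t - V(S_t)),
    where G_t is the full discounted return observed after visiting S_t."""
    V = {} if V is None else V
    for episode in episodes:            # episode: list of (state, reward) pairs
        G = 0.0
        # walk backwards so G_t = R_{t+1} + alpha * G_{t+1} is easy to accumulate
        for state, reward in reversed(episode):
            G = reward + alpha_discount * G
            V.setdefault(state, 0.0)
            V[state] += gamma_step * (G - V[state])
    return V

if __name__ == "__main__":
    # a toy 3-state episode: the reward is received on leaving each state
    episodes = [[("A", 0.0), ("B", 0.0), ("C", 1.0)]] * 200
    print(constant_step_monte_carlo(episodes))
```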
Fundamental Theorem of Simulation
Cumulative Distribution Function
With every random variable X, we associate a function called the
Cumulative Distribution Function (CDF), defined as follows:

F_X(x) = P(X ≤ x)

With properties:
0 ≤ F_X(x) ≤ 1
F_X(x) is a non-decreasing function of x
(an inverse-transform sampling sketch based on the CDF is given below)
75 / 111
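One standard consequence of the CDF definition is inverse-transform sampling: if U ∼ Uniform(0, 1), then F⁻¹(U) has CDF F. A minimal sketch for the exponential distribution, where F⁻¹ has a closed form (the exponential target is an illustrative choice, not from the slides):

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """Inverse-transform sampling: F(x) = 1 - exp(-lam * x), so
    F^{-1}(u) = -ln(1 - u) / lam.  If U ~ Uniform(0,1), then F^{-1}(U) ~ Exp(lam)."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

if __name__ == "__main__":
    lam = 2.0
    xs = sample_exponential(lam, 100_000)
    print("empirical mean:", sum(xs) / len(xs), " expected mean:", 1.0 / lam)
```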
Graphically
Example: Uniform distribution (figure)
76 / 111
Graphically
Example: Gaussian distribution (figure)
77 / 111
Fundamental Theorem of Simulation
Theorem
Simulating

X ∼ f(x)

is equivalent to simulating

(X, U) ∼ U({(x, u) : 0 < u < f(x)})
79 / 111
Geometrically
We have the region {(x, u) : 0 < u < f(x)} under the graph of f (figure)
Proof
We have that f(x) = ∫_0^{f(x)} du, so f is the marginal density of X under the
uniform joint distribution of (X, U).
80 / 111
Therefore
One thing made clear by this theorem is that we can generate
X in three ways
1 First generate X ∼ f, and then U | X = x ∼ U{u | u ≤ f(x)}; but
this is useless, because we already have X and the extra U adds nothing.
2 First generate U, and then X | U = u ∼ U{x | u ≤ f(x)}.
3 Generate (X, U) jointly, a smart idea:
it allows us to generate the pair on a larger set where simulation is easier
(a rejection-sampling sketch of this route is given below).
It is not as simple as it looks.
81 / 111
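A minimal sketch of route 3: sample (X, U) uniformly on an easy enclosing box and keep only the pairs with u < f(x); the accepted x's are then draws from f. The Beta(2, 2) target and the bound f_max = 1.5 are illustrative assumptions:

```python
import random

def f(x):
    """Target density on [0, 1]: Beta(2, 2), f(x) = 6 x (1 - x), bounded by 1.5."""
    return 6.0 * x * (1.0 - x)

def sample_under_f(n_accept, f_max=1.5, seed=0):
    """Generate (X, U) uniformly on the box [0,1] x [0, f_max] (a larger, easy set)
    and keep the pairs with U < f(X); the accepted X's are draws from f."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accept:
        x = rng.random()
        u = rng.uniform(0.0, f_max)
        if u < f(x):
            accepted.append(x)
    return accepted

if __name__ == "__main__":
    xs = sample_under_f(50_000)
    print("empirical mean:", sum(xs) / len(xs), " (Beta(2,2) mean is 0.5)")
```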
Markov Chains
The random process Xt ∈ S for t = 1, 2, ..., T has the Markov property
iff
p(X_T | X_{T−1}, X_{T−2}, ..., X_1) = p(X_T | X_{T−1})
Finite-state Discrete-Time Markov Chains, |S| < ∞
The chain can be completely specified by the transition matrix

P = [p_ij]  with  p_ij = P[Xt = j | Xt−1 = i]

For irreducible chains, the stationary distribution π is the long-term
proportion of time that the chain spends in each state.
It satisfies π = πP (a sketch of computing π is given below).
82 / 111
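A minimal sketch of computing the stationary distribution π = πP by repeatedly applying the transition matrix (power iteration); the 3-state transition matrix is made up for illustration:

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Find pi with pi = pi P for a row-stochastic matrix P (list of rows),
    by repeatedly applying the transition matrix (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n                     # start from the uniform distribution
    for _ in range(iters):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

if __name__ == "__main__":
    # a small irreducible chain on 3 states (made-up numbers)
    P = [[0.5, 0.4, 0.1],
         [0.2, 0.5, 0.3],
         [0.1, 0.3, 0.6]]
    pi = stationary_distribution(P)
    print("pi   =", [round(p, 4) for p in pi])
    print("pi P =", [round(sum(pi[i] * P[i][j] for i in range(3)), 4) for j in range(3)])
```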
The Idea
Draw an i.i.d. set of samples {x^(i)}_{i=1}^N
From a target density p(x) defined on a high-dimensional space X
(the set of possible configurations of a system).
Thus, we can use those samples to approximate the target density

p_N(x) = (1/N) Σ_{i=1}^N δ_{x^(i)}(x)

where δ_{x^(i)}(x) denotes the Dirac delta mass located at x^(i).
84 / 111
Thus
Given the Dirac delta

δ(t) = 0 for t ≠ 0, and ∞ for t = 0,  with  ∫_{t1}^{t2} δ(t) dt = 1  (for any t1 < 0 < t2)

Therefore

δ(t) = lim_{σ→0} (1 / (√(2π) σ)) e^{−t² / (2σ²)}
85 / 111
Consequently
We can approximate the integrals by sample averages

I_N(f) = (1/N) Σ_{i=1}^N f(x^(i))  →  I(f) = ∫_X f(x) p(x) dx  almost surely as N → ∞

Where the variance of f(x) is

σ_f² = E_{p(x)}[f²(x)] − I²(f) < ∞

Therefore the variance of the estimator I_N(f) is

var(I_N(f)) = σ_f² / N

(a sketch of this estimator and its standard error is given below)
86 / 111
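A minimal sketch of the estimator I_N(f) and its estimated standard error σ_f/√N, using p(x) = standard normal and f(x) = x² (so the true value is 1); both choices are illustrative, not from the slides:

```python
import math
import random

def mc_integral(f, sampler, N, seed=0):
    """I_N(f) = (1/N) * sum f(x_i) for x_i ~ p, plus the estimated standard error
    sqrt(sample variance / N)."""
    rng = random.Random(seed)
    values = [f(sampler(rng)) for _ in range(N)]
    mean = sum(values) / N
    var = sum((v - mean) ** 2 for v in values) / (N - 1)
    return mean, math.sqrt(var / N)

if __name__ == "__main__":
    # p(x): standard normal, f(x) = x^2, so I(f) = E[X^2] = 1
    est, stderr = mc_integral(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0), N=100_000)
    print(f"I_N(f) = {est:.4f} +/- {stderr:.4f}  (true value 1)")
```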
Then
By Robert & Casella (1999), Section 3.2

√N (I_N(f) − I(f))  ⇒  N(0, σ_f²)  as N → ∞
87 / 111
What if we do not wait for Gt?
TD methods need to wait only until the next time step
At time t + 1, they use the observed reward Rt+1 and the estimate
V(St+1)
The simplest TD method

V(St) ← V(St) + γ [Rt+1 + αV(St+1) − V(St)]

The update is made immediately on the transition to St+1, upon receiving Rt+1
(a sketch of this update is given below).
89 / 111
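A minimal sketch of tabular TD(0) with the update above, keeping the deck's notation (γ step size, α discount); the small random-walk environment is made up for illustration:

```python
import random

def td0(env_step, start_state, episodes=500, gamma_step=0.1, alpha_discount=0.9, seed=0):
    """Tabular TD(0): V(S_t) <- V(S_t) + gamma * [R_{t+1} + alpha * V(S_{t+1}) - V(S_t)],
    applied immediately on every transition."""
    rng = random.Random(seed)
    V = {}
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            s_next, r, done = env_step(s, rng)
            V.setdefault(s, 0.0)
            target = r + (0.0 if done else alpha_discount * V.get(s_next, 0.0))
            V[s] += gamma_step * (target - V[s])
            s = s_next
    return V

def random_walk(s, rng):
    """5-state random walk on 1..5; terminating left pays 0, right pays 1."""
    s_next = s + rng.choice([-1, 1])
    if s_next == 0:
        return s_next, 0.0, True
    if s_next == 6:
        return s_next, 1.0, True
    return s_next, 0.0, False

if __name__ == "__main__":
    print(td0(random_walk, start_state=3))
```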
This method is called TD(0), or one-step TD
Actually
There are also n-step TD methods
Differences with the Monte Carlo methods
In the Monte Carlo update, the target is Gt
In the TD update, the target is Rt+1 + αV(St+1)
90 / 111
TD as a Bootstrapping Method
It is quite similar to Dynamic Programming
Additionally, we know that

vπ(s) = Eπ[Gt | St = s]
      = Eπ[Rt+1 + αGt+1 | St = s]
      = Eπ[Rt+1 + αvπ(St+1) | St = s]

Monte Carlo methods use an estimate of the first expression, Eπ[Gt | St = s], as a target
Instead, Dynamic Programming methods use an estimate of the last one,
Eπ[Rt+1 + αvπ(St+1) | St = s].
91 / 111
Furthermore
The Monte Carlo target is an estimate
because the expected value of Gt is not known; a sample return is
used instead
The DP target is an estimate
because vπ(St+1) is not known and the current estimate, V(St+1), is
used instead.
Here TD goes beyond and combines both ideas
It combines the sampling of Monte Carlo with the bootstrapping of
DP.
92 / 111
As in many methods, there is an error
And it is used to improve the policy

δt = Rt+1 + αV(St+1) − V(St)

Because the TD error depends on the next state and next reward
It is not actually available until one time step later:
δt is the error in V(St), available at time t + 1.
93 / 111
Finally
If the array V does not change during the episode,
the Monte Carlo error can be written as a sum of TD errors:

Gt − V(St) = Rt+1 + αGt+1 − V(St) + αV(St+1) − αV(St+1)
           = δt + α (Gt+1 − V(St+1))
           = δt + αδt+1 + α² (Gt+2 − V(St+2))
           = ...
           = Σ_{k=t}^{T−1} α^{k−t} δk

(a numerical check of this identity is sketched below)
94 / 111
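A quick numerical check of this identity, under the stated assumption that V is held fixed during the episode and that V(terminal) = 0; the episode and the values in V are made up:

```python
def check_identity(rewards, states, V, alpha=0.9):
    """For a fixed V with V(terminal) = 0, check that
    G_t - V(S_t) == sum_{k=t}^{T-1} alpha^{k-t} * delta_k,
    where delta_k = R_{k+1} + alpha * V(S_{k+1}) - V(S_k)."""
    T = len(rewards)                               # states has length T+1, last one terminal
    for t in range(T):
        # left-hand side: the Monte Carlo error
        G = sum(alpha ** (k - t) * rewards[k] for k in range(t, T))
        lhs = G - V[states[t]]
        # right-hand side: discounted sum of TD errors
        rhs = sum(alpha ** (k - t) * (rewards[k] + alpha * V[states[k + 1]] - V[states[k]])
                  for k in range(t, T))
        print(f"t={t}: lhs={lhs:+.6f}  rhs={rhs:+.6f}")

if __name__ == "__main__":
    states = ["A", "B", "C", "terminal"]           # S_0 .. S_T
    rewards = [1.0, 0.0, 2.0]                      # R_1 .. R_T
    V = {"A": 0.3, "B": -0.2, "C": 0.5, "terminal": 0.0}
    check_identity(rewards, states, V)
```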
Optimality of TD(0)
Under batch updating
TD(0) converges deterministically to a single answer, independent of the
step size, as long as γ is chosen to be sufficiently small.
The constant-α Monte Carlo method, V(St) ← V(St) + γ [Gt − V(St)],
also converges deterministically under the same conditions, but to a different answer.
95 / 111
Q-learning: Off-Policy TD Control
One of the early breakthroughs in reinforcement learning
The development of an off-policy TD control algorithm, Q-learning
(Watkins, 1989)
It is defined as

Q(St, At) ← Q(St, At) + γ [Rt+1 + α max_a Q(St+1, a) − Q(St, At)]

Here
The learned action-value function, Q, directly approximates q∗, the
optimal action-value function,
independent of the policy being followed.
97 / 111
Furthermore
The policy still has an effect
in that it determines which state-action pairs are visited and updated.
Therefore
All that is required for correct convergence is that all pairs continue
to be updated.
Under this assumption and a variant of the usual stochastic
approximation conditions
Q has been shown to converge with probability 1 to q∗.
98 / 111
Q-learning (Off-Policy TD control) for estimating π ≈ π∗
Algorithm parameters
Step size γ ∈ (0, 1] and a small ε > 0
Initialize
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily, except that
Q(terminal, ·) = 0
99 / 111
Then
Loop for each episode
1 Initialize S
2 Loop for each step of the episode:
3 Choose A from S using a policy derived from Q (for example, ε-greedy)
4 Take action A, observe R, S′

Q(S, A) ← Q(S, A) + γ [R + α max_a Q(S′, a) − Q(S, A)]

5 S ← S′
6 Until S is terminal
(a runnable sketch of this loop is given below)
100 / 111
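A minimal runnable sketch of this loop for a tabular problem, keeping the deck's symbols (γ step size, α discount, ε-greedy exploration); the tiny chain environment is an illustrative assumption, not part of the slides:

```python
import random
from collections import defaultdict

def q_learning(env_step, actions, start_state, episodes=5000,
               gamma_step=0.1, alpha_discount=0.9, eps=0.2, seed=0):
    """Tabular Q-learning with an eps-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)                      # Q[(state, action)], zero-initialised

    def eps_greedy(s):
        if rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env_step(s, a, rng)
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += gamma_step * (r + alpha_discount * best_next - Q[(s, a)])
            s = s_next
    return Q

def chain_env(s, a, rng):
    """States 0..4; 0 and 4 are terminal. Action +1 moves right, -1 moves left;
    reaching state 4 pays +1, reaching state 0 pays 0."""
    s_next = s + a
    if s_next <= 0:
        return 0, 0.0, True
    if s_next >= 4:
        return 4, 1.0, True
    return s_next, 0.0, False

if __name__ == "__main__":
    Q = q_learning(chain_env, actions=[-1, +1], start_state=2)
    print({k: round(v, 3) for k, v in sorted(Q.items())})
```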
One of the earliest successes
Backgammon is a major game
with numerous tournaments and regular world championship matches.
It is a stochastic game (through the dice) in which the agent has a full
description of the world: the board state is fully observable.
102 / 111
What did they use?
TD-Gammon used a nonlinear form of TD(λ)
An estimated value v̂(s, w) is used to estimate the probability of
winning from state s.
Then
To achieve this, rewards were defined as zero for all time steps except
those on which the game is won.
103 / 111
Implementation of the value function in TD-Gammon
They decided to use a standard multilayer ANN (shown as a figure in the original slides)
104 / 111
The interesting part
Tesauro's representation
For each point on the backgammon board,
four units indicate the number of white pieces on the point.
Basically
If there are no white pieces, all four units take the value zero;
if there is one white piece, the first unit takes the value one;
and so on...
Therefore, with four units for white and another four for black at
each of the 24 points,
we have 4·24 + 4·24 = 192 units.
105 / 111
Then
We have that
Two additional units encoded the number of white and black pieces
on the bar (each took the value n/2, with n the number on the bar).
Two others represented the number of black and white pieces already
removed from the board (these took the value n/15).
Two units indicated, in a binary fashion, whether it was white's or black's turn,
giving 192 + 2 + 2 + 2 = 198 input units in total (an encoding sketch is given below).
106 / 111
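A minimal sketch of this input encoding. The slide only says "and so on" for the four per-point units, so the rule used for the fourth unit ((n − 3)/2 for n > 3, following Tesauro's usual description) is an assumption of this sketch:

```python
def point_units(n):
    """Four units for one colour on one point (the fourth-unit rule is an
    assumption beyond the slide's 'and so on'): units 1-3 are truncated unary,
    unit 4 encodes the excess beyond three pieces."""
    return [1.0 if n >= 1 else 0.0,
            1.0 if n >= 2 else 0.0,
            1.0 if n >= 3 else 0.0,
            (n - 3) / 2.0 if n > 3 else 0.0]

def encode_board(white_points, black_points, white_bar, black_bar,
                 white_off, black_off, white_to_move):
    """198 inputs: 4 units x 24 points x 2 colours = 192, plus bar counts (n/2),
    borne-off counts (n/15) and the two turn-indicator units."""
    assert len(white_points) == len(black_points) == 24
    units = []
    for w, b in zip(white_points, black_points):
        units += point_units(w) + point_units(b)
    units += [white_bar / 2.0, black_bar / 2.0]
    units += [white_off / 15.0, black_off / 15.0]
    units += [1.0, 0.0] if white_to_move else [0.0, 1.0]
    return units

if __name__ == "__main__":
    # opening position for white only, as a quick shape check
    white = [0] * 24
    white[0], white[11], white[16], white[18] = 2, 5, 3, 5
    x = encode_board(white, [0] * 24, 0, 0, 0, 0, True)
    print(len(x), "input units")   # 198
```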
TD(λ) by Back-Propagation
TD-Gammon uses a semi-gradient form of TD(λ)
TD(λ) has the following general update rule:

w_{t+1} = w_t + γ [Rt+1 + αv̂(St+1, w_t) − v̂(St, w_t)] z_t

where w_t is the vector of all modifiable parameters
and z_t is a vector of eligibility traces.
Eligibility traces unify and generalize TD and Monte Carlo methods.
They work as a kind of short-term memory:
The rough idea is that when a component of w_t participates in producing an
estimate, the corresponding component of z_t is bumped up and then begins to fade away.
108 / 111
How is this done?
Update Rule

z_t = αλ z_{t−1} + ∇_w v̂(St, w_t),  with  z_0 = 0

Something Notable
The gradient in this equation can be computed efficiently by the
back-propagation procedure.
For this, we set the error at the output of the ANN to

v̂(St+1, w_t) − v̂(St, w_t)

(a sketch of these two updates for a linear v̂ is given below)
109 / 111
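A minimal sketch of these two updates for a linear value function v̂(s, w) = w·x(s), where the gradient is simply the feature vector; TD-Gammon itself used a multilayer network with back-propagated gradients, so the linear choice here is only to keep the sketch short:

```python
def td_lambda_linear(episodes, n_features, gamma_step=0.05, alpha_discount=0.9, lam=0.7):
    """Semi-gradient TD(lambda) with a linear value function v_hat(s, w) = w . x(s).
    For a linear v_hat, the gradient with respect to w is just the feature vector x(s)."""
    w = [0.0] * n_features
    for episode in episodes:           # episode: list of (features, reward, features_next, done)
        z = [0.0] * n_features         # eligibility traces, reset at the start of each episode
        for x, r, x_next, done in episode:
            v = sum(wi * xi for wi, xi in zip(w, x))
            v_next = 0.0 if done else sum(wi * xi for wi, xi in zip(w, x_next))
            delta = r + alpha_discount * v_next - v
            # z_t = alpha * lambda * z_{t-1} + grad v_hat(S_t, w)   (grad = x for linear v_hat)
            z = [alpha_discount * lam * zi + xi for zi, xi in zip(z, x)]
            # w_{t+1} = w_t + gamma * delta_t * z_t
            w = [wi + gamma_step * delta * zi for wi, zi in zip(w, z)]
    return w

if __name__ == "__main__":
    # two-state toy problem with one-hot features; reaching the end pays 1
    A, B = [1.0, 0.0], [0.0, 1.0]
    episode = [(A, 0.0, B, False), (B, 1.0, B, True)]
    print(td_lambda_linear([episode] * 500, n_features=2))
```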
Conclusions
With new tools such as Reinforcement Learning,
the new Artificial Intelligence has a lot to do in the world.
Thank you for being in this class
As they say, we have a lot to do...
111 / 111
Images/cinvestav
Conclusions
With this new tools as Reinforcement Learning
The New Artificial Intelligence has a lot to do in the world
Thanks to be in this class
As they say we have a lot to do...
111 / 111

More Related Content

What's hot

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningJungyeol
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and Applications
Sangwoo Mo
 
ddpg seminar
ddpg seminarddpg seminar
ddpg seminar
민재 정
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)
Taehoon Kim
 
An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)
pauldix
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
Bill Liu
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
Jie-Han Chen
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
DongHyun Kwak
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
Jie-Han Chen
 
Reinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesReinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based Policies
Sangwoo Mo
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
Jie-Han Chen
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Ding Li
 
Lec3 dqn
Lec3 dqnLec3 dqn
Lec3 dqn
Ronald Teo
 
[1312.5602] Playing Atari with Deep Reinforcement Learning
[1312.5602] Playing Atari with Deep Reinforcement Learning[1312.5602] Playing Atari with Deep Reinforcement Learning
[1312.5602] Playing Atari with Deep Reinforcement Learning
Seung Jae Lee
 
강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction
Taehoon Kim
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
Dongmin Lee
 
파이썬으로 나만의 강화학습 환경 만들기
파이썬으로 나만의 강화학습 환경 만들기파이썬으로 나만의 강화학습 환경 만들기
파이썬으로 나만의 강화학습 환경 만들기
정주 김
 
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Chris Ohk
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
Dongmin Lee
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Shahan Ali Memon
 

What's hot (20)

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and Applications
 
ddpg seminar
ddpg seminarddpg seminar
ddpg seminar
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)
 
An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
 
Reinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesReinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based Policies
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Lec3 dqn
Lec3 dqnLec3 dqn
Lec3 dqn
 
[1312.5602] Playing Atari with Deep Reinforcement Learning
[1312.5602] Playing Atari with Deep Reinforcement Learning[1312.5602] Playing Atari with Deep Reinforcement Learning
[1312.5602] Playing Atari with Deep Reinforcement Learning
 
강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 
파이썬으로 나만의 강화학습 환경 만들기
파이썬으로 나만의 강화학습 환경 만들기파이썬으로 나만의 강화학습 환경 만들기
파이썬으로 나만의 강화학습 환경 만들기
 
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 

Similar to 25 introduction reinforcement_learning

Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Hung Le
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
Chandra Meena
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
VaishnavGhadge1
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
DongHyun Kwak
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
YasutoTamura1
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
DataScienceLab
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
MLconf
 
Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Shanghai deep learning meetup 4
Shanghai deep learning meetup 4
Xiaohu ZHU
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
Salem-Kabbani
 
Dp
DpDp
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Universitat Politècnica de Catalunya
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
IDEAS - Int'l Data Engineering and Science Association
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNING
pradiprahul
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
NAVER Engineering
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
Prabhu Kumar
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2
Flavian Vasile
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design
Dan Elton
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
Ben Ball
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
RithikRaj25
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Ashwin Rao
 

Similar to 25 introduction reinforcement_learning (20)

Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 
Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Shanghai deep learning meetup 4
Shanghai deep learning meetup 4
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Dp
DpDp
Dp
 
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNING
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
 

More from Andres Mendez-Vazquez

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
Andres Mendez-Vazquez
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
Andres Mendez-Vazquez
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
Andres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
Andres Mendez-Vazquez
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
Andres Mendez-Vazquez
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
Andres Mendez-Vazquez
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
Andres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
Andres Mendez-Vazquez
 
Zetta global
Zetta globalZetta global
Zetta global
Andres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
Andres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
Andres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
Andres Mendez-Vazquez
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
Andres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
Andres Mendez-Vazquez
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
Andres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
Andres Mendez-Vazquez
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
Andres Mendez-Vazquez
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
Andres Mendez-Vazquez
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
Andres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
Andres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 

Recently uploaded

AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
  • 1. Introduction to Artificial Intelligence: Reinforcement Learning. Andres Mendez-Vazquez, April 26, 2019.
  • 2. Outline: 1 Reinforcement Learning, 2 Formalizing Reinforcement Learning, 3 Solution Methods, 4 The Neural Network Approach.
  • 4. Reinforcement Learning is learning what to do. What does Reinforcement Learning want? To maximize a numerical reward signal. The learner is not told which actions to take; it must discover which actions yield the most reward.
  • 6. These ideas come from Dynamical Systems Theory, specifically the optimal control of incompletely known Markov Decision Processes (MDPs). Thus, the device must interact with the environment and learn by using punishments and rewards.
  • 8. Outline: Example.
  • 9. Example: a mobile robot decides whether it should enter a new room in search of more trash to collect or go back to recharge its battery. It needs to make the decision based on the current charge level of its battery and how quickly it can get back to its base.
  • 11. Outline: A K-Armed Bandit Problem.
  • 12. A K-Armed Bandit Problem. Definition: you are faced repeatedly with a choice among $k$ different options, or actions; after each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the chosen action. Definition (stationary probability): a stationary distribution of a Markov chain is a probability distribution that remains unchanged as the chain progresses in time.
  • 16. Therefore, each of the $k$ actions has an expected value $q_*(a) = E[R_t \mid A_t = a]$. Something notable: if you knew the value of each action, it would be trivial to solve the $k$-armed bandit problem.
  • 18. What do we want? To find an estimate $Q_t(a)$ of the value of action $a$ at time $t$ such that $|Q_t(a) - q_*(a)| < \epsilon$, with $\epsilon$ as small as possible.
  • 19. Outline: Exploration vs Exploitation.
  • 20. Greedy Actions. Intuition: choose the action with the greatest $E[R_t \mid A_t = a]$. This is known as exploitation: you are exploiting your current knowledge of the values of the actions. Contrast this with exploration: selecting non-greedy actions to obtain better knowledge of their values.
  • 23. An Interesting Conundrum. It has been observed that exploitation maximizes the expected reward on a single step, whereas exploration may produce the greater total reward in the long run. Thus the idea: the value of a state $s$ is the total amount of reward an agent can expect to accumulate over the future, starting from $s$. Values therefore indicate the long-term profit of states after taking into account the states that follow and their rewards.
  • 26. Outline: The Elements of Reinforcement Learning.
  • 27. Therefore, the parts of a Reinforcement Learning system. A policy: it defines the learning agent's way of behaving at a given time; the policy is the core of a reinforcement learning system, and in general policies may be stochastic. A reward signal: it defines the goal of the reinforcement learning problem; the system's sole objective is to maximize the total reward over the long run. The reward is immediate and encourages exploitation, but you might want something more long-term. A value function: it specifies what is good in the long run, which encourages exploration.
  • 36. Finally, a model of the environment: it simulates the behavior of the environment, allowing the agent to make inferences about how the environment will behave. Something notable: models are used for planning, meaning decisions are evaluated before they are executed. Thus, reinforcement learning methods that use models are called model-based.
  • 38. Outline: Limitations.
  • 39. Limitations. Reinforcement learning relies heavily on the concept of state, so it has the same problem as Markov Decision Processes: defining the state so as to obtain a good representation of the search space. This is where the creative part of the problem comes in; as we have seen, solutions in AI rely a lot on the creativity of the designer.
  • 42. Outline: Formalizing Reinforcement Learning, Introduction.
  • 43. The System–Environment Interaction in a Markov Decision Process (figure of the agent–environment loop).
  • 44. Outline: Markov Process and Markov Decision Process.
  • 45. Markov Process and Markov Decision Process. Definition of a Markov process: a sequence of states is Markov if and only if the probability of moving to the next state $s_{t+1}$ depends only on the present state $s_t$: $P[s_{t+1} \mid s_t] = P[s_{t+1} \mid s_1, \ldots, s_t]$. Something notable: in Reinforcement Learning we always talk about time-homogeneous Markov chains, where the transition probability is independent of $t$: $P[s_{t+1} = s' \mid s_t = s] = P[s_t = s' \mid s_{t-1} = s]$.
  • 47. Formally. Definition (Markov Process): a Markov process (or Markov chain) is a tuple $(S, P)$, where (1) $S$ is a finite set of states and (2) $P$ is a state transition probability matrix, $P_{ss'} = P[s_{t+1} = s' \mid s_t = s]$. Thus, the dynamics of the Markov process are $s_0 \xrightarrow{P_{s_0 s_1}} s_1 \xrightarrow{P_{s_1 s_2}} s_2 \xrightarrow{P_{s_2 s_3}} s_3 \rightarrow \cdots$
  • 49. Then, we get the following familiar definition. Definition: an MDP is a tuple $(S, A, P, R, \alpha)$, where (1) $S$ is a finite set of states, (2) $A$ is a finite set of actions, (3) $P$ is a state transition probability matrix with $P^a_{ss'} = P[s_{t+1} = s' \mid s_t = s, A_t = a]$, (4) $\alpha \in [0, 1]$ is called the discount factor, and (5) $R: S \times A \rightarrow \mathbb{R}$ is a reward function. Thus, the dynamics of the Markov Decision Process, using a probability (the policy) to pick each action, are $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \rightarrow \cdots$
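To make the tuple concrete, here is a minimal sketch (the two-state example, the dictionary layout, and all names are illustrative assumptions, not taken from the slides) of how a small finite MDP can be stored in plain Python.

```python
# Hypothetical toy MDP (S, A, P, R, alpha) used only for illustration.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a] maps each successor s' to P^a_{ss'}; each row sums to 1.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0},            "move": {"s0": 0.7, "s1": 0.3}},
}

# R[s][a] is the expected immediate reward R^a_s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

alpha = 0.9  # discount factor
```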
  • 56. Outline: Return, Policy and Value function.
  • 57. The Goal of Reinforcement Learning: we want to maximize the expected value of the return, i.e., to obtain the optimal policy. Thus, as in our previous study of MDPs, $G_t = R_{t+1} + \alpha R_{t+2} + \cdots = \sum_{k=0}^{\infty} \alpha^k R_{t+k+1}$.
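As a quick numeric illustration of the return formula, the following sketch (the reward sequence is hypothetical) accumulates the discounted return for a finite episode.

```python
def discounted_return(rewards, alpha):
    """G_t = sum_k alpha^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (alpha ** k) * r
    return g

# Rewards R_{t+1}, ..., R_{t+4} = 1, 0, 0, 10 with alpha = 0.9:
print(discounted_return([1, 0, 0, 10], alpha=0.9))  # 1 + 0.9**3 * 10 = 8.29
```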
  • 59. Remarks. Interpretation: the discounted future rewards can be read as the present value of future rewards. The discount factor $\alpha$: a value close to 0 leads to a "short-sighted" evaluation, where you only look for immediate rewards; a value close to 1 leads to a "long-sighted" evaluation, where you are willing to wait for a better reward.
  • 61. Why use this discount factor? Basically: (1) the discount factor avoids infinite returns in cyclic Markov processes, thanks to the geometric series; (2) there is some uncertainty about future rewards; (3) if the reward is financial, immediate rewards may earn more interest than delayed rewards; (4) animal and human behavior shows a preference for immediate reward.
  • 65. A Policy. Using our MDP ideas, a policy $\pi$ is a distribution over actions given states: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$. A policy guides the choice of action at a given state, and it is time independent. We then introduce two value functions: the state-value function and the action-value function.
  • 67. The state-value function $v_\pi$ of an MDP. It is defined as $v_\pi(s) = E_\pi[G_t \mid S_t = s]$ and gives the long-term value of the state $s$: $v_\pi(s) = E_\pi[G_t \mid S_t = s] = E_\pi\left[\sum_{k=0}^{\infty} \alpha^k R_{t+k+1} \mid S_t = s\right] = E_\pi[R_{t+1} + \alpha G_{t+1} \mid S_t = s] = E_\pi[R_{t+1} \mid S_t = s]$ (immediate reward) $+\, E_\pi[\alpha v_\pi(S_{t+1}) \mid S_t = s]$ (discounted value of the successor state).
  • 71. The Action-Value Function $q_\pi(s, a)$. It is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$. We can also obtain the following decomposition: $q_\pi(s, a) = E_\pi[R_{t+1} + \alpha q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.
  • 73. Now, to simplify notation we define $R^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$. Then we have $v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a)$ and $q_\pi(s, a) = R^a_s + \alpha \sum_{s' \in S} P^a_{ss'}\, v_\pi(s')$.
  • 75. Then, expressing $q_\pi(s, a)$ in terms of $v_\pi(s)$ inside the expression for $v_\pi(s)$ gives the Bellman equation: $v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\left[R^a_s + \alpha \sum_{s' \in S} P^a_{ss'}\, v_\pi(s')\right]$. The Bellman equation relates the state-value function of one state with that of other states.
  • 77. Also, the Bellman equation for $q_\pi(s, a)$: $q_\pi(s, a) = R^a_s + \alpha \sum_{s' \in S} P^a_{ss'} \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a')$. How do we solve these equations? We have the following solution methods.
  • 79. Outline: Solution Methods, Multi-armed Bandits.
  • 80. Action-Value Methods. One natural way to estimate the action values is by averaging the rewards actually received: $Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i\, I_{A_i = a}}{\sum_{i=1}^{t-1} I_{A_i = a}}$. Nevertheless, if the denominator is zero we define $Q_t(a)$ to be some default value. As the denominator goes to infinity, by the law of large numbers, $Q_t(a) \rightarrow q_*(a)$.
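A minimal sketch of this sample-average estimate, written directly from the indicator formula above (the action and reward histories are hypothetical).

```python
def sample_average(actions_taken, rewards, a, default=0.0):
    """Q_t(a): average of the rewards received on the steps where action a was taken."""
    selected = [r for a_i, r in zip(actions_taken, rewards) if a_i == a]
    return sum(selected) / len(selected) if selected else default

# Hypothetical history of chosen actions and the rewards they produced.
history_a = [0, 1, 0, 0, 2]
history_r = [1.0, 0.5, 2.0, 0.0, 1.5]
print(sample_average(history_a, history_r, a=0))  # (1.0 + 2.0 + 0.0) / 3 = 1.0
```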
  • 83. Furthermore, we call this the sample-average method. It is not the best estimator, but it allows us to introduce a way of selecting actions. A classic one is the greedy selection $A_t = \arg\max_a Q_t(a)$. Therefore, greedy action selection always exploits current knowledge to maximize the immediate reward; it spends no time on apparently inferior actions to see if they might really be better.
  • 87. And here a heuristic. A simple alternative is to act greedily most of the time, but every once in a while, say with small probability $\epsilon$, select an action at random from among all the actions with equal probability. This is called the $\epsilon$-greedy method. In the limit, this method ensures $Q_t(a) \rightarrow q_*(a)$ for every action.
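A sketch of $\epsilon$-greedy action selection under these assumptions (the function name is hypothetical and ties in the arg max are broken by the lowest index).

```python
import random

def epsilon_greedy(Q, epsilon):
    """With probability epsilon pick a random arm, otherwise pick argmax_a Q[a]."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                # explore
    return max(range(len(Q)), key=lambda a: Q[a])      # exploit

# Current estimates for a hypothetical 4-armed bandit.
print(epsilon_greedy([0.2, 1.3, 0.7, 0.1], epsilon=0.1))
```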
  • 89. It is like the mean filter: we obtain a nice result (figure).
  • 90. Example. In the following example, the $q_*(a)$ were drawn from a Gaussian distribution with mean 0 and variance 1. Then, when an action is selected, the reward $R_t$ is drawn from a normal distribution with mean $q_*(A_t)$ and variance 1. We then have the following behavior (figure).
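A sketch of how such a testbed can be generated; the number of arms $k = 10$ and the helper names are assumptions made only for illustration.

```python
import random

k = 10  # number of arms (assumed)
q_star = [random.gauss(0.0, 1.0) for _ in range(k)]  # true values q_*(a) ~ N(0, 1)

def bandit(a):
    """Sample a reward R_t ~ N(q_*(a), 1) for the chosen arm a."""
    return random.gauss(q_star[a], 1.0)

print(bandit(3))  # one noisy reward from arm 3
```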
  • 92. Now, as the number of samples goes up, we get the following behavior (figure).
  • 93. Outline: Problems of Implementations.
  • 94. Imagine the following. Let $R_i$ denote the reward received after the $i$-th selection of this action, so that $Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n - 1}$. The problem with this technique is the amount of memory and computational power needed to keep the estimates updated as more rewards arrive.
  • 96. Therefore, we need something more efficient. We can get it by rewriting the average incrementally: $Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\left[R_n + \sum_{i=1}^{n-1} R_i\right] = \frac{1}{n}\left[R_n + (n - 1) Q_n\right] = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$.
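The incremental form in code; a minimal sketch (names hypothetical) showing that the running update reproduces the batch average.

```python
def update_mean(q_n, r_n, n):
    """Q_{n+1} = Q_n + (1/n) * (R_n - Q_n): incremental sample average."""
    return q_n + (r_n - q_n) / n

rewards = [2.0, 4.0, 9.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update_mean(q, r, n)
print(q, sum(rewards) / len(rewards))  # both print 5.0
```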
  • 100. Outline: General Formula in Reinforcement Learning.
  • 101. General Formula in Reinforcement Learning. The pattern appears almost all the time in RL: NewEstimate = OldEstimate + StepSize [Target - OldEstimate]. This looks a lot like a gradient-descent step if we reason as follows. Assume a first-order Taylor expansion $f(x) = f(a) + f'(a)(x - a)$, so that $f'(a) = \frac{f(x) - f(a)}{x - a}$. Then $x_{n+1} = x_n + \gamma f'(a) = x_n + \gamma \frac{f(x) - f(a)}{x - a} = x_n + \gamma' \left[f(x) - f(a)\right]$, where the factor $1/(x - a)$ has been absorbed into the step size.
  • 104. Outline: A Simple Bandit Algorithm.
  • 105. A Simple Bandit Algorithm. Initialize, for $a = 1$ to $k$: $Q(a) \leftarrow 0$, $N(a) \leftarrow 0$. Loop until some stopping criterion: (1) $A \leftarrow \arg\max_a Q(a)$ with probability $1 - \epsilon$, or a random action with probability $\epsilon$; (2) $R \leftarrow bandit(A)$; (3) $N(A) \leftarrow N(A) + 1$; (4) $Q(A) \leftarrow Q(A) + \frac{1}{N(A)}\left[R - Q(A)\right]$.
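Putting the pieces together, here is a runnable sketch of this simple bandit algorithm; the value of $\epsilon$, the number of steps, and the Gaussian testbed are assumptions made for the example.

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]   # hidden true action values
    Q = [0.0] * k                                      # estimates Q(a)
    N = [0] * k                                        # counts N(a)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # random action with probability epsilon
        else:
            a = max(range(k), key=lambda i: Q[i])      # greedy action otherwise
        r = rng.gauss(q_star[a], 1.0)                  # R <- bandit(A)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental update
        total += r
    return Q, total / steps

Q, avg_reward = run_bandit()
print(avg_reward)
```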
  • 107. Remarks. Remember: this works well for a stationary problem. For more on non-stationary problems, see "Reinforcement Learning: An Introduction" by Sutton and Barto, chapter 2.
  • 109. Outline: Dynamic Programming.
  • 110. Dynamic Programming. As with many things in Computer Science, we love divide et impera. Then, in Reinforcement Learning, the Bellman equation gives a recursive decomposition and the value function stores and reuses solutions.
  • 112. Outline: Policy Evaluation.
  • 113. Policy Evaluation. For this we assume, as always, that the problem can be solved with the Bellman equation. We start from an initial guess $v_1$, then apply the Bellman backup iteratively: $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_\pi$. For this we use synchronous updates: the value of every state at the present iteration is updated at the same time, based only on the values from the previous iteration.
  • 116. Then, at each iteration $k + 1$ and for all states $s \in S$, we update $v_{k+1}(s)$ from $v_k(s')$ according to the Bellman equation, where $s'$ is a successor state of $s$: $v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s)\left[R^a_s + \alpha \sum_{s' \in S} P^a_{ss'}\, v_k(s')\right]$. Something notable: the iterative process stops when the maximum difference between the value function at the current step and that at the previous step is smaller than some small positive constant.
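A sketch of iterative policy evaluation over the dictionary-style MDP introduced earlier; the data layout, the stochastic policy format, and the threshold theta are assumptions.

```python
def policy_evaluation(states, P, R, policy, alpha=0.9, theta=1e-6):
    """Iterate v_{k+1}(s) = sum_a pi(a|s) [R^a_s + alpha * sum_s' P^a_{ss'} v_k(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta, V_new = 0.0, {}
        for s in states:
            v = 0.0
            for a, pi_as in policy[s].items():   # policy[s] is {action: probability}
                exp_next = sum(p * V[s2] for s2, p in P[s][a].items())
                v += pi_as * (R[s][a] + alpha * exp_next)
            V_new[s] = v
            delta = max(delta, abs(v - V[s]))
        V = V_new                                # synchronous update of all states
        if delta < theta:
            return V

# Example with the toy MDP above and a uniform random policy:
# policy = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 0.5, "move": 0.5}}
# print(policy_evaluation(states, P, R, policy, alpha))
```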
  • 118. Outline: Policy Iteration.
  • 119. However, our ultimate goal is to find the optimal policy. For this we can use the policy iteration process. Algorithm: (1) initialize $\pi$ randomly; (2) repeat until the previous policy and the current policy are the same: (a) evaluate $v_\pi$ by policy evaluation; (b) using synchronous updates, for each state $s$ let $\pi'(s) = \arg\max_{a \in A} q(s, a) = \arg\max_{a \in A}\left[R^a_s + \alpha \sum_{s' \in S} P^a_{ss'}\, v_\pi(s')\right]$.
  • 121. Now, something notable: this policy iteration algorithm improves the policy at each step. Assume we have a deterministic policy $\pi$ and after one improvement step we get $\pi'$. We know the value improves in one step because $q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$.
  • 124. Then, we can see that $v_\pi(s) \leq q_\pi(s, \pi'(s)) = E_{\pi'}[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s] \leq E_{\pi'}[R_{t+1} + \alpha q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \leq E_{\pi'}[R_{t+1} + \alpha R_{t+2} + \alpha^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s] \leq \cdots \leq E_{\pi'}[R_{t+1} + \alpha R_{t+2} + \cdots \mid S_t = s] = v_{\pi'}(s)$.
  • 128. Then, if improvements stop, $q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$. This means the Bellman optimality equation has been satisfied: $v_\pi(s) = \max_{a \in A} q_\pi(s, a)$. Therefore $v_\pi(s) = v_*(s)$ for all $s \in S$, and $\pi$ is an optimal policy.
  • 130. Outline: Policy and Value Improvement.
  • 131. Once the policy has been improved, we can iterate to obtain a sequence of monotonically improving policies and value functions: $\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$, where $E$ denotes evaluation and $I$ denotes improvement.
  • 132. Policy Iteration (using iterative policy evaluation). Step 1, initialization: $V(s) \in \mathbb{R}$ and $\pi(s) \in A(s)$ arbitrarily for all $s \in S$. Step 2, policy evaluation: loop: $\Delta \leftarrow 0$; for each $s \in S$: $v \leftarrow V(s)$; $V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s))\left[r + \alpha V(s')\right]$; $\Delta \leftarrow \max(\Delta, |v - V(s)|)$; until $\Delta < \theta$ (a small threshold controlling accuracy).
  • 134. Images/cinvestav Further
Step 3. Policy Improvement
1 policy-stable ← True
2 For each $s \in \mathcal{S}$:
3   old_action ← $\pi(s)$
4   $\pi(s) \leftarrow \arg\max_{a} \sum_{s', r} p\left(s', r \mid s, a\right)\left[r + \alpha V(s')\right]$
5   if old_action $\neq \pi(s)$, then policy-stable ← False
6 If policy-stable, then stop and return $V \approx v_*$ and $\pi \approx \pi_*$; else go to Step 2.
65 / 111
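To make the two steps above concrete, here is a minimal policy-iteration sketch in Python. The model representation P[s][a] = list of (probability, next_state, reward) tuples and the parameter names alpha and theta are assumptions of this sketch, not part of the slides.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, alpha=0.9, theta=1e-6):
    """Policy iteration for a finite MDP.
    P[s][a] is a list of (prob, next_state, reward) tuples (assumed format)."""
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy

    def q(s, a):
        # One-step look-ahead: sum_{s',r} p(s',r|s,a) [r + alpha V(s')]
        return sum(p * (r + alpha * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Step 2: iterative policy evaluation
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                V[s] = q(s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Step 3: policy improvement (greedy with respect to the current V)
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            pi[s] = max(range(n_actions), key=lambda a: q(s, a))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi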
  • 136. Images/cinvestav Jack's Car Rental Setup: Jack manages a car rental business with two locations $(s_1, s_2)$. Each time a car is available at a location, Jack rents it out and is credited $10 by the company. If Jack is out of cars at that location, the business is lost. Returned cars become available for renting the next day. Jack has options: he can move cars between the two locations overnight, at a cost of $2 per car moved, and at most 5 cars can be moved. 67 / 111
  • 141. Images/cinvestav We choose a probability distribution for modeling requests and returns: the number of rentals requested at each location follows a Poisson distribution,
$$P(n \mid \lambda) = \frac{\lambda^n}{n!} e^{-\lambda}$$
We simplify the problem with an upper bound of $n = 20$ cars per location; any additional car produced by returns simply disappears. 68 / 111
  • 143. Images/cinvestav Finally, we have the following $\lambda$'s for requests and returns:
1 Requests at $s_1$: $\lambda = 3$
2 Requests at $s_2$: $\lambda = 4$
3 Returns at $s_1$: $\lambda = 3$
4 Returns at $s_2$: $\lambda = 2$
Each $\lambda$ defines how many cars are requested or returned in a time interval (one day). 69 / 111
  • 148. Images/cinvestav Thus, running policy iteration on this problem with a discount factor $\alpha = 0.9$ yields a sequence of improving car-moving policies (shown as a figure in the original slides). 70 / 111
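As a rough illustration of how the rental example can be set up, the sketch below precomputes expected rentals with truncated Poisson distributions. The 20-car cap and the reward constants follow the problem statement above, but the helper names and the exact decomposition are assumptions of this sketch.

```python
from math import exp, factorial

MAX_CARS = 20            # cars above this bound are assumed to disappear
RENT_REWARD = 10.0
MOVE_COST = 2.0

def poisson(n, lam):
    """Truncated Poisson probability P(n | lambda)."""
    return lam**n / factorial(n) * exp(-lam)

def expected_rentals(n_cars, lam):
    """Expected number of cars actually rented at a location that starts
    the day with n_cars, when requests are Poisson(lam)."""
    return sum(poisson(k, lam) * min(k, n_cars) for k in range(MAX_CARS + 1))

def expected_reward(n1, n2, moved, lam_req1=3, lam_req2=4):
    """Expected immediate reward when `moved` cars go from s1 to s2 overnight."""
    n1, n2 = min(n1 - moved, MAX_CARS), min(n2 + moved, MAX_CARS)
    rentals = expected_rentals(n1, lam_req1) + expected_rentals(n2, lam_req2)
    return RENT_REWARD * rentals - MOVE_COST * abs(moved)
```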
  • 150. Images/cinvestav History: there is earlier original work by Arthur Samuel, which Sutton built on to invent Temporal-Difference learning, TD(λ). It was famously applied by Gerald Tesauro to build TD-Gammon, a program that learned to play backgammon at the level of expert human players. 72 / 111
  • 152. Images/cinvestav Temporal-Difference (TD) Learning. Something Notable: it is possibly the central and novel idea of reinforcement learning. It has similarities with Monte Carlo methods: both use experience to solve the prediction problem (for more, look at Sutton's book). Monte Carlo methods wait until the return $G_t$ is known:
$$V(S_t) \leftarrow V(S_t) + \gamma\left[G_t - V(S_t)\right], \qquad G_t = \sum_{k=0}^{\infty} \alpha^k R_{t+k+1}$$
Following Sutton, this is called constant-α Monte Carlo (note that these slides write the step size as $\gamma$ and the discount as $\alpha$). 73 / 111
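A minimal sketch of this constant-step-size Monte Carlo prediction, assuming episodes are given as lists of (state, reward) pairs and V is a defaultdict(float); the explicit names step_size and discount are used to avoid the slides' swapped symbols.

```python
def constant_step_mc(episodes, V, step_size=0.1, discount=0.9):
    """Every-visit constant-step-size Monte Carlo prediction.
    `episodes` is a list of trajectories [(S_t, R_{t+1}), ...] -- an assumed
    format for this sketch; V maps state -> value (e.g. defaultdict(float))."""
    for episode in episodes:
        G = 0.0
        # Work backwards so that G_t = R_{t+1} + discount * G_{t+1}
        for state, reward in reversed(episode):
            G = reward + discount * G
            V[state] = V[state] + step_size * (G - V[state])
    return V
```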
  • 156. Images/cinvestav Fundamental Theorem of Simulation. Cumulative Distribution Function: with every random variable $X$ we associate a function called the Cumulative Distribution Function (CDF), defined as
$$F_X(x) = P(X \leq x)$$
with properties: $F_X(x) \geq 0$ and $F_X(x)$ is a non-decreasing function of $x$. 75 / 111
  • 162. Images/cinvestav Fundamental Theorem of Simulation. Theorem: simulating $X \sim f(x)$ is equivalent to simulating
$$(X, U) \sim \mathcal{U}\left(\{(x, u) : 0 < u < f(x)\}\right)$$
79 / 111
  • 164. Images/cinvestav Geometrically, the pair $(X, U)$ is uniform on the region under the graph of $f$ (shown as a figure in the original slides). Proof: we have that
$$f(x) = \int_0^{f(x)} du$$
is the marginal pdf of $(X, U)$. 80 / 111
  • 166. Images/cinvestav Therefore, one thing made clear by this theorem is that we can generate $X$ in three ways:
1 First generate $X \sim f$, and then $U \mid X = x \sim \mathcal{U}\{u : u \leq f(x)\}$; but this is useless, because we already have $X$ and do not need $U$.
2 First generate $U$, and then $X \mid U = u \sim \mathcal{U}\{x : u \leq f(x)\}$.
3 Generate $(X, U)$ jointly, a smart idea: it allows us to generate the pair on a larger set where simulation is easier (see the accept-reject sketch below). Not as simple as it looks. 81 / 111
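Option 3 is essentially what accept-reject sampling does: draw (X, U) uniformly on an easier superset (here a bounding box) and keep only the pairs that fall under the graph of f. A minimal sketch, assuming f is supported on [a, b] and bounded above by M:

```python
import random

def sample_under_graph(f, a, b, M, n_samples):
    """Draw samples from a density f on [a, b] with f(x) <= M by sampling
    (X, U) uniformly on the box [a, b] x [0, M] and keeping the pairs that
    lie in the set {(x, u) : 0 < u < f(x)} (accept-reject)."""
    samples = []
    while len(samples) < n_samples:
        x = random.uniform(a, b)
        u = random.uniform(0.0, M)
        if u < f(x):                  # the pair falls under the graph of f
            samples.append(x)
    return samples

# Example: a triangular density on [0, 1], f(x) = 2x, bounded by M = 2
xs = sample_under_graph(lambda x: 2 * x, 0.0, 1.0, 2.0, 1000)
```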
  • 170. Images/cinvestav Markov Chains: the random process $X_t \in S$ for $t = 1, 2, \ldots, T$ has the Markov property iff
$$p\left(X_T \mid X_{T-1}, X_{T-2}, \ldots, X_1\right) = p\left(X_T \mid X_{T-1}\right)$$
A finite-state discrete-time Markov chain ($|S| < \infty$) can be completely specified by the transition matrix $P = [p_{ij}]$, with $p_{ij} = P[X_t = j \mid X_{t-1} = i]$. For irreducible chains, the stationary distribution $\pi$ is the long-term proportion of time that the chain spends in each state, computed by solving $\pi = \pi P$ (see the sketch below). 82 / 111
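A minimal sketch of computing the stationary distribution by iterating π ← πP until it stops changing; the tolerance and iteration cap are assumptions of this sketch.

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Stationary distribution of an irreducible finite Markov chain,
    found by iterating pi <- pi P from the uniform distribution."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pi = pi @ P
        if np.max(np.abs(new_pi - pi)) < tol:
            break
        pi = new_pi
    return pi

# Example: a two-state chain
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_distribution(P))   # approximately [0.833, 0.167]
```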
  • 175. Images/cinvestav The Idea: draw an i.i.d. set of samples $\{x^{(i)}\}_{i=1}^{N}$ from a target density $p(x)$ on a high-dimensional space $\mathcal{X}$ (the set of possible configurations of a system). Thus, we can use those samples to approximate the target density by
$$p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x)$$
where $\delta_{x^{(i)}}(x)$ denotes the Dirac delta mass located at $x^{(i)}$. 84 / 111
  • 178. Images/cinvestav Thus, recall the Dirac delta,
$$\delta(t) = \begin{cases} 0 & t \neq 0 \\ \infty & t = 0 \end{cases}, \qquad \int_{t_1}^{t_2} \delta(t)\, dt = 1 \text{ whenever } t_1 < 0 < t_2$$
Therefore it can be seen as a limit of Gaussians,
$$\delta(t) = \lim_{\sigma \to 0} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{t^2}{2\sigma^2}}$$
85 / 111
  • 180. Images/cinvestav Consequently, we can approximate integrals (expectations) by
$$I_N(f) = \frac{1}{N} \sum_{i=1}^{N} f\left(x^{(i)}\right) \xrightarrow[N \to \infty]{a.s.} I(f) = \int_{\mathcal{X}} f(x)\, p(x)\, dx$$
If the variance of $f(x)$,
$$\sigma_f^2 = \mathbb{E}_{p(x)}\left[f^2(x)\right] - I^2(f) < \infty,$$
then the variance of the estimator $I_N(f)$ is
$$\operatorname{var}\left(I_N(f)\right) = \frac{\sigma_f^2}{N}$$
86 / 111
  • 183. Images/cinvestav Then, by Robert & Casella (1999, Section 3.2), a central limit theorem holds:
$$\sqrt{N}\left(I_N(f) - I(f)\right) \underset{N \to \infty}{\Longrightarrow} \mathcal{N}\left(0, \sigma_f^2\right)$$
87 / 111
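A minimal Monte Carlo integration sketch illustrating the estimator I_N(f) and the σ_f/√N standard error suggested by the result above; the example integrand and sampler are assumptions of this sketch.

```python
import numpy as np

def mc_integral(f, sampler, N=100_000):
    """Monte Carlo estimate of I(f) = E_p[f(X)] together with the
    standard error sigma_f / sqrt(N)."""
    x = sampler(N)                        # i.i.d. draws x^(1), ..., x^(N) from p(x)
    fx = f(x)
    I_N = fx.mean()
    std_err = fx.std(ddof=1) / np.sqrt(N)
    return I_N, std_err

# Example: E[X^2] for X ~ N(0, 1) equals 1
rng = np.random.default_rng(0)
estimate, err = mc_integral(lambda x: x**2, lambda n: rng.standard_normal(n))
print(estimate, err)
```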
  • 186. Images/cinvestav What if we do not wait for $G_t$? TD methods need to wait only until the next time step: at time $t+1$, they use the observed reward $R_{t+1}$ and the current estimate $V(S_{t+1})$. The simplest TD method is
$$V(S_t) \leftarrow V(S_t) + \gamma\left[R_{t+1} + \alpha V(S_{t+1}) - V(S_t)\right]$$
applied immediately on the transition to $S_{t+1}$ and on receiving $R_{t+1}$. 89 / 111
  • 188. Images/cinvestav This method is called TD(0), or one-step TD. Actually, there are also $n$-step TD methods. Differences with the Monte Carlo methods: in the Monte Carlo update the target is $G_t$, while in the TD update the target is $R_{t+1} + \alpha V(S_{t+1})$ (a sketch of TD(0) prediction follows below). 90 / 111
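A minimal sketch of tabular TD(0) prediction. The Gym-like env.reset()/env.step() interface and the parameter names are assumptions of this sketch; recall that these slides write the step size as γ and the discount as α.

```python
def td0_prediction(env, policy, V, n_episodes, step_size=0.1, discount=0.9):
    """Tabular TD(0) prediction for a fixed policy.
    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done);
    V maps state -> value (e.g. a defaultdict(float))."""
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + discount * V[s_next]
            # V(S_t) <- V(S_t) + step_size [R_{t+1} + discount V(S_{t+1}) - V(S_t)]
            V[s] = V[s] + step_size * (target - V[s])
            s = s_next
    return V
```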
  • 190. Images/cinvestav TD as a Bootstrapping Method, quite similar to Dynamic Programming. Additionally, we know that
$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \alpha G_{t+1} \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s\right]$$
Monte Carlo methods use an estimate of $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ as their target; Dynamic Programming methods instead use an estimate of $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s]$. 91 / 111
  • 192. Images/cinvestav Furthermore, the Monte Carlo target is an estimate because the expected value above is not known; a sample return is used instead. The DP target is an estimate because $v_\pi(S_{t+1})$ is not known and the current estimate $V(S_{t+1})$ is used instead. Here TD goes beyond and combines both ideas: it combines the sampling of Monte Carlo with the bootstrapping of DP. 92 / 111
  • 195. Images/cinvestav As in many methods, there is an error, and it is used to drive the update:
$$\delta_t = R_{t+1} + \alpha V(S_{t+1}) - V(S_t)$$
Because the TD error depends on the next state and the next reward, it is not actually available until one time step later: $\delta_t$ is the error in $V(S_t)$, available at time $t + 1$. 93 / 111
  • 198. Images/cinvestav Finally, if the array $V$ does not change during the episode, the Monte Carlo error can be written as a sum of TD errors:
$$G_t - V(S_t) = R_{t+1} + \alpha G_{t+1} - V(S_t) + \alpha V(S_{t+1}) - \alpha V(S_{t+1}) = \delta_t + \alpha\left(G_{t+1} - V(S_{t+1})\right) = \delta_t + \alpha \delta_{t+1} + \alpha^2\left(G_{t+2} - V(S_{t+2})\right) = \cdots = \sum_{k=t}^{T-1} \alpha^{k-t} \delta_k$$
94 / 111
  • 199. Images/cinvestav Optimality of TD(0): under batch updating, TD(0) converges deterministically to a single answer, independent of the step size, as long as $\gamma$ is chosen to be sufficiently small. The same holds for constant-step-size Monte Carlo, $V(S_t) \leftarrow V(S_t) + \gamma\left[G_t - V(S_t)\right]$, although the two methods converge to different answers. 95 / 111
  • 201. Images/cinvestav Q-learning: Off-Policy TD Control. One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm, Q-learning (Watkins, 1989). It is defined as
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \gamma\left[R_{t+1} + \alpha \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$
Here, the learned action-value function $Q$ directly approximates $q_*$, the optimal action-value function, independent of the policy being followed. 97 / 111
  • 204. Images/cinvestav Furthermore, the policy still has an effect: it determines which state-action pairs are visited and updated. Therefore, all that is required for correct convergence is that all pairs continue to be updated. Under this assumption, and a variant of the usual stochastic approximation conditions, $Q$ has been shown to converge with probability 1 to $q_*$. 98 / 111
  • 207. Images/cinvestav Q-learning (Off-Policy TD control) for estimating $\pi \approx \pi_*$. Algorithm parameters: step size $\gamma \in (0, 1]$ and small $\epsilon > 0$. Initialize $Q(s, a)$ for all $s \in \mathcal{S}^+$, $a \in \mathcal{A}$, arbitrarily, except that $Q(\text{terminal}, \cdot) = 0$. 99 / 111
  • 209. Images/cinvestav Then, loop for each episode:
1 Initialize $S$
2 Loop for each step of the episode:
3   Choose $A$ from $S$ using a policy derived from $Q$ (for example, $\epsilon$-greedy)
4   Take action $A$, observe $R$, $S'$, and update
$$Q(S, A) \leftarrow Q(S, A) + \gamma\left[R + \alpha \max_a Q(S', a) - Q(S, A)\right]$$
5   $S \leftarrow S'$
6 Until $S$ is terminal
A minimal sketch of this loop follows below. 100 / 111
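A minimal tabular Q-learning sketch following the loop above. The Gym-like environment interface (reset(), step(a) returning (s', r, done), and a finite env.actions list) is an assumption of this sketch.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes, step_size=0.1, discount=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                # Q[(s, a)], implicitly zero everywhere

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action derived from the current Q
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions)
            # Q(S,A) <- Q(S,A) + step_size [R + discount max_a Q(S',a) - Q(S,A)]
            Q[(s, a)] += step_size * (r + discount * best_next - Q[(s, a)])
            s = s_next
    return Q
```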
  • 211. Images/cinvestav One of the earliest successes Backgammon is a major game with numerous tournaments and regular world championship matches. It is a stochastic game with a full description of the world 102 / 111
  • 213. Images/cinvestav What did they use? TD-Gammon used a nonlinear form of TD(λ): an estimated value $\hat{v}(s, \mathbf{w})$ is used to estimate the probability of winning from state $s$. To achieve this, rewards were defined as zero for all time steps except those on which the game is won. 103 / 111
  • 215. Images/cinvestav Implementation of the value function in TD-Gammon They took the decision to use a standard multilayer ANN 104 / 111
  • 216. Images/cinvestav The interesting part: Tesauro's representation. For each point on the backgammon board, four units indicate the number of white pieces on the point: if there are no white pieces, all four units take the value zero; if there is one white piece, the first unit takes the value one; and so on. Therefore, with four units for white pieces and another four for black pieces at each of the 24 points, we have 4·24 + 4·24 = 192 units. 105 / 111
  • 219. Images/cinvestav Then, two additional units encode the number of white and black pieces on the bar (each took the value n/2, with n the number on the bar), two other units represent the number of black and white pieces already borne off the board (these took the value n/15), and two units indicate, in a binary fashion, whether it is white's or black's turn (an encoding sketch follows below). 106 / 111
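A rough sketch of the 198-unit input encoding described above. The simple thermometer coding of the four per-point units is an assumption of this sketch; Tesauro's actual fourth unit encodes (n − 3)/2 when a point holds more than three pieces.

```python
import numpy as np

def encode_board(white_on_point, black_on_point, white_bar, black_bar,
                 white_off, black_off, white_to_move):
    """Sketch of a 198-unit TD-Gammon-style input encoding.
    white_on_point / black_on_point are length-24 sequences of piece counts."""
    features = []
    for n in list(white_on_point) + list(black_on_point):
        # four units per colour per point (thermometer coding, simplified)
        features += [1.0 if n >= k else 0.0 for k in (1, 2, 3, 4)]   # 192 units
    features += [white_bar / 2.0, black_bar / 2.0]                   # 2 bar units
    features += [white_off / 15.0, black_off / 15.0]                 # 2 borne-off units
    features += [1.0, 0.0] if white_to_move else [0.0, 1.0]          # 2 turn units
    return np.array(features)                                        # length 198
```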
  • 221. Images/cinvestav TD(λ) by Back-Propagation: TD-Gammon uses a semi-gradient form of TD(λ), whose general update rule is
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \gamma\left[R_{t+1} + \alpha \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\right] \mathbf{z}_t$$
where $\mathbf{w}_t$ is the vector of all modifiable parameters and $\mathbf{z}_t$ is a vector of eligibility traces. Eligibility traces unify and generalize TD and Monte Carlo methods; they work like a long/short-term memory: the rough idea is that when a component of $\mathbf{w}_t$ contributes to an estimate, the corresponding component of $\mathbf{z}_t$ is bumped up and then begins to fade away. 108 / 111
  • 224. Images/cinvestav How is this done? Update rule for the trace:
$$\mathbf{z}_t = \alpha \lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \qquad \mathbf{z}_0 = \mathbf{0}$$
Something Notable: the gradient in this equation can be computed efficiently by the back-propagation procedure; for this, we set the error at the output of the ANN to $\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$ (see the sketch below for the linear case). 109 / 111
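A minimal sketch of semi-gradient TD(λ) with a linear value function, where the gradient of v̂(s, w) is simply the feature vector of s. The Gym-like environment interface and the parameter names are assumptions of this sketch; TD-Gammon itself used a multilayer ANN and obtained the gradient by back-propagation.

```python
import numpy as np

def semi_gradient_td_lambda(env, policy, features, n_weights, n_episodes,
                            step_size=0.01, discount=0.9, lam=0.7):
    """Semi-gradient TD(lambda) with v_hat(s, w) = w . features(s),
    so grad v_hat(s, w) = features(s)."""
    w = np.zeros(n_weights)
    for _ in range(n_episodes):
        s = env.reset()
        z = np.zeros(n_weights)                    # eligibility trace, z_0 = 0
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            x = features(s)
            v = w @ x
            v_next = 0.0 if done else w @ features(s_next)
            z = discount * lam * z + x             # z_t = discount*lambda*z_{t-1} + grad v_hat
            delta = r + discount * v_next - v      # TD error
            w = w + step_size * delta * z
            s = s_next
    return w
```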
  • 228. Images/cinvestav Conclusions: with new tools such as Reinforcement Learning, the new Artificial Intelligence has a lot to do in the world. Thanks for being in this class; as they say, we have a lot to do... 111 / 111