Reinforcement Learning
www.credosystemz.com
Reinforcement Learning
• Reinforcement learning (RL) has its origins in the
psychology of animal learning.
• The basic idea is that of rewarding the learner (agent)
for correct actions, and punishing wrong actions.
• RL is a process of trial and error, combined with
learning.
• The agent decides on actions based on the current
environmental state, and through feedback in terms of
the desirability of the action, learns which action is
best associated with which state.
• The agent learns from interaction with the
environment.
Learning through Rewards
• Reinforcement learning is the learning of a mapping from situations to
actions with the main objective to maximize the scalar reward or
reinforcement signal.
• Reinforcement learning is defined as learning by trial and error from
performance feedback from the environment or an external evaluator.
• The agent has absolutely no prior knowledge of what action to take, and
has to discover (or explore) which actions yield the highest reward.
• The agent receives sensory inputs from its
environment, as a description of the current state of
the perceived environment.
• An action is executed, upon which the agent receives
the reinforcement signal or reward.
• This reward can be a positive or negative signal,
depending on the correctness of the action. A negative
reward has the effect of punishing the agent for a bad
action.
• The action may cause a change in the agent’s
environment, thereby affecting the future options and
actions of the agent. The effects of actions on the
environment and future states cannot always be
predicted. It is therefore necessary that the agent
frequently monitors its environment.
Exploration–Exploitation Trade-off
• One of the important issues in RL (which occurs in most
search methods) is that of the exploration–exploitation
trade-off.
• As already indicated, RL has two important components:
❑ A trial and error search to find good actions, which forms
the exploration component of RL.
❑ A memory of which actions worked well in which situations.
This is the exploitation component of RL.
• It is important that the agent exploits what it has already
learned, such that a reward can be obtained.
• However, via the trial and error search, the agent must also
explore to improve action selections in the future.
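A common way to balance these two components is an ε-greedy rule: with probability ε the agent explores a random action, otherwise it exploits the action with the highest estimated value. A minimal sketch (the function and parameter names here are illustrative, not from the text):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise exploit the action
    with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

Setting ε = 0 gives pure exploitation and ε = 1 pure exploration; in practice ε is often decayed over time, shifting the agent from exploration toward exploitation.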
A reinforcement learning agent has the
following components:
• A policy, which is the decision making function of the agent. This function
is used to specify which action to execute in each of the situations that the
agent may encounter. The policy is basically a set of associations between
actions and situations, or alternatively, a set of stimulus-response rules.
• A reward function, which defines the goal of the agent. The reward
function defines what are good and bad actions for the agent for specific
situations. The reward is immediate, and represents only the current
environment state. The goal of the agent is to maximize the total reward
that it receives over the long run.
• A value function, which specifies the goal in the long run. The value
function is used to predict future reward, and is used to indicate what is
good in the long run.
• Optionally, an RL agent may also have a model of the environment. The
environmental model mimics the behavior of the environment. This can
be done by transition functions that describe transitions between
different states.
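The way these components interact can be sketched as a generic agent–environment loop (all names here are illustrative, not from the text):

```python
def run_episode(env_reset, env_step, policy, max_steps=100):
    """Generic agent-environment loop: the policy maps a state to an
    action; the environment returns the next state, the immediate
    reward, and a termination flag.  Returns the accumulated reward."""
    s = env_reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(s)            # the policy picks an action for state s
        s, r, done = env_step(s, a)
        total_reward += r        # reward feedback from the environment
        if done:
            break
    return total_reward
```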
A number of models have been proposed to quantify the future
reward used by a value function. The most common is the
infinite-horizon discounted model, in which the value of a state is
the expected discounted sum of future rewards, with discount
factor γ ∈ [0, 1).
To find an optimal policy
• In order to find an optimal policy, π∗, it is necessary to find an optimal
value function. A candidate optimal value function is

V∗(s) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V∗(s′) ]

Where,
A is the set of all possible actions,
S is the set of environmental states,
R(s, a) is the reward function,
T(s, a, s′) is the transition function, and
γ is the discount factor.
• The value of a state, s, is the expected instantaneous reward, R(s, a), for
action a plus the expected discounted value of the next state, using the
best possible action.
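For a small, known model, the optimal value function described above can be computed by repeatedly applying this backup until the values stop changing (value iteration). A minimal sketch (function names are illustrative):

```python
def value_iteration(states, actions, R, T, gamma=0.9, tol=1e-6):
    """Repeat the backup V(s) <- max_a [R(s,a) + gamma*sum_s' T(s,a,s')V(s')]
    until the largest change in any state value falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R(s, a) + gamma * sum(T(s, a, s2) * V[s2] for s2 in states)
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```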
Model-Free Reinforcement Learning
• The objective is to obtain an optimal policy
without a model of the environment. This
section reviews two approaches,
1. Temporal difference (TD) learning
2. Q-learning
1. Temporal Difference Learning
• Temporal difference (TD) learning [824] learns the value function using
the update rule

V(s) ← V(s) + η [ r + γV(s′) − V(s) ]    (6.5)

• where η is a learning rate, r is the immediate reward, γ is the
discount factor, s is the current state, and s′ is the next state. Based
on equation (6.5), whenever a state, s, is visited, its estimated value
is updated to be closer to r + γV(s′).
• The above model is referred to as TD(0), where only one future step
is considered.
• The TD method has been generalized to TD(λ) strategies, where λ ∈
[0, 1] is a weighting on the relevance of recent temporal differences
relative to previous predictions.
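A single TD(0) backup for tabular values can be written in a few lines (a sketch; names are illustrative):

```python
def td0_update(V, s, r, s_next, eta=0.1, gamma=0.9):
    """Move the estimate V(s) toward the one-step target r + gamma*V(s')."""
    V[s] += eta * (r + gamma * V[s_next] - V[s])
    return V
```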
• For TD(λ), the value function is learned using

V(u) ← V(u) + η [ r + γV(s′) − V(s) ] e(u)    (6.6)

• where e(u) is the eligibility of state u. The eligibility of a
state is the degree to which the state has been visited
in the recent past, computed as

e(u) = Σ_{k=1}^{t} (λγ)^{t−k} χ(u, s_k)

Where

χ(u, s_k) = 1 if u = s_k, and 0 otherwise.

• The update in equation (6.6) is applied to every state,
according to its eligibility, and not just the previous
state as for TD(0).
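One step of tabular TD(λ) with accumulating eligibility traces might look as follows (a sketch; names are illustrative):

```python
def td_lambda_step(V, e, s, r, s_next, eta=0.1, gamma=0.9, lam=0.8):
    """Bump the eligibility of the current state, then apply the TD error
    to every state in proportion to its (decaying) eligibility."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    e[s] = e.get(s, 0.0) + 1.0                 # accumulating trace
    for u in list(e):
        V[u] = V.get(u, 0.0) + eta * delta * e[u]
        e[u] *= gamma * lam                    # traces fade each step
    return V, e
```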
2. Q-Learning
• In Q-learning, the task is to learn the expected discounted
reinforcement values, Q(s, a), of taking action a in state s,
then continuing by always choosing actions optimally. To
relate Q-values to the value function, note that

V∗(s) = max_{a} Q(s, a)

• where V∗(s) is the value of s assuming that the best action is
taken initially.
• The Q-learning rule is given as

Q(s, a) ← Q(s, a) + η [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]    (6.10)

• The agent then takes the action with the highest Q-value.
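One tabular Q-learning backup can be sketched as follows (names are illustrative):

```python
def q_update(Q, s, a, r, s_next, actions, eta=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + eta * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + eta * (r + gamma * best_next - old)
    return Q
```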
Neural Networks and Reinforcement
Learning
• Neural networks and reinforcement learning have been combined
in a number of ways.
• One approach to combining these models is to use a NN as an
approximator of the value function used to predict future reward.
• Another approach uses RL to adjust weights.
• Both these approaches are discussed in this section.
• As already indicated, the LVQ-II implements a form of RL.
• Weights of the winning output unit are positively updated only if
that output unit provided the correct response for the
corresponding input pattern. If not, weights are penalized through
adjustment away from that input pattern.
• Other approaches that use RL for NN training include RPROP,
gradient descent on the expected reward, and connectionist
Q-learning, in which a NN is used to approximate the value function.
1. RPROP
• Resilient propagation (RPROP) [727, 728] performs a direct adaptation of
the weight step using local gradient information. Weight adjustments are
implemented in the form of a reward or punishment, as follows: if the
partial derivative, ∂E/∂vji (or ∂E/∂wkj), of weight vji (or wkj) changes its
sign, the update value, Δji (Δkj), is decreased by the factor η−. The reason
for this penalty is that the last weight update was too large, causing the
algorithm to jump over a local minimum. On the other hand, if the
derivative retains its sign, the update value is increased by the factor η+
to accelerate convergence.
• For each weight, vji (and wkj), the update value is determined as

Δji(t) = η+ · Δji(t−1)   if ∂E/∂vji(t−1) · ∂E/∂vji(t) > 0
Δji(t) = η− · Δji(t−1)   if ∂E/∂vji(t−1) · ∂E/∂vji(t) < 0
Δji(t) = Δji(t−1)        otherwise

• Using the above, the change in weight is

Δvji(t) = −sign(∂E/∂vji(t)) · Δji(t)
• RPROP is summarized in Algorithm 6.1. The value of Δ0 is the
initial weight step, and is chosen as a small value, e.g. Δ0 = 0.1. It
has been shown that the performance of RPROP is insensitive to the
value of Δ0. Parameters Δmax and Δmin respectively specify upper
and lower limits on the update step sizes. The suggested values are
η− = 0.5 and η+ = 1.2.
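The per-weight rule can be sketched as a simplified single-weight version (variable names are illustrative):

```python
def rprop_step(grad, prev_grad, step, weight,
               eta_minus=0.5, eta_plus=1.2, step_min=1e-6, step_max=50.0):
    """Grow the step while the gradient keeps its sign, shrink it on a
    sign flip (the last step jumped over a minimum), then move the
    weight against the sign of the gradient."""
    if grad * prev_grad > 0:
        step = min(step * eta_plus, step_max)    # reward: accelerate
    elif grad * prev_grad < 0:
        step = max(step * eta_minus, step_min)   # punish: back off
        grad = 0.0                               # skip the next sign test
    if grad > 0:
        weight -= step
    elif grad < 0:
        weight += step
    return weight, step, grad
```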
2. Gradient Descent Reinforcement
Learning
• For problems where only the immediate reward is maximized (i.e. there is no value
function, only a reward function), Williams [911] proposed weight update rules
that perform a gradient descent on the expected reward. These rules are then
integrated with back-propagation. Weights are updated as follows:

Δwkj = ηkj (rp − θk) ekj

• where ηkj is a non-negative learning rate, rp is the reinforcement associated with
pattern zp, θk is the reinforcement threshold value, and ekj is the eligibility of
weight wkj, given as

ekj = ∂ ln gk / ∂wkj

• where gk is the probability density function used to randomly generate actions,
based on whether the target was correctly predicted or not. Thus, this NN
reinforcement learning rule computes a gradient descent in probability space.
• Similar update equations are used for the vji weights.
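For a single stochastic output unit, this rule reduces to a few lines. The sketch below assumes a Bernoulli unit with p = sigmoid(w·x) and uses the threshold θk as a `baseline` argument; all names are illustrative:

```python
import math
import random

def reinforce_bernoulli(w, x, r, baseline=0.0, eta=0.1, rng=random):
    """Sample action y ~ Bernoulli(p) with p = sigmoid(w.x), then update
    the weights along (r - baseline) * dln(g)/dw, where the eligibility
    of each weight is dln(g)/dw = (y - p) * x."""
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    y = 1.0 if rng.random() < p else 0.0       # stochastic action
    eligibility = [(y - p) * xi for xi in x]
    w = [wi + eta * (r - baseline) * ei for wi, ei in zip(w, eligibility)]
    return w, y
```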
3. Connectionist Q-Learning
• Neural networks have been used to learn the Q-function in
Q-learning.
• The NN is used to approximate the mapping between states and
actions, and even to generalize between states.
• The input to the NN is the current state of the environment, and
the output represents the action to execute. If there are na actions,
then either one NN with na output units can be used, or na NNs,
one for each of the actions, can be used.
• Assuming that one NN is used per action, Lin [527] used the
Q-learning rule in equation (6.10) to update weights as follows:

Δw(t) = η [ r + γ max_{a′} Q(s(t+1), a′) − Q(t) ] ∇wQ(t)

• where Q(t) is used as shorthand notation for Q(s(t), a(t)), and ∇wQ(t)
is the vector of output gradients, ∂Q(t)/∂w, which are calculated
by means of back-propagation. Similar equations are used for the vj
weights.
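For a linear approximator Q(s, a) = w·φ(s, a), the gradient ∇wQ is simply the feature vector φ, and the update takes a particularly compact form (a sketch; names are illustrative):

```python
def q_grad_update(w, phi, r, q_now, q_next_max, eta=0.1, gamma=0.9):
    """w <- w + eta * [r + gamma * max_a' Q(s',a') - Q(t)] * grad_w Q(t),
    where grad_w Q(t) = phi for a linear model."""
    delta = r + gamma * q_next_max - q_now   # temporal-difference error
    return [wi + eta * delta * fi for wi, fi in zip(w, phi)]
```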
• This gives an overall update of

Δw(t) = η [ r + γ max_{a′} Q(s(t+1), a′) − Q(t) ] e(t)

• where the eligibility is calculated using

e(t) = ∇wQ(t) + γλ e(t−1)    (6.26)

• Equation (6.26) keeps track of the weighted sum of
previous error gradients.
• The Q(λ) update algorithm is given in Algorithm 6.
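The eligibility recursion for a weight vector can be sketched in one line (names are illustrative):

```python
def q_lambda_trace(e_prev, grad_q, gamma=0.9, lam=0.8):
    """e(t) = grad_w Q(t) + gamma * lambda * e(t-1): a decaying sum of
    the previous output gradients."""
    return [g + gamma * lam * ep for g, ep in zip(grad_q, e_prev)]
```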