Reinforcement Learning
www.credosystemz.com
Reinforcement Learning
• Reinforcement learning (RL) has its origins in the
psychology of animal learning.
• The basic idea is that of rewarding the learner (agent)
for correct actions, and punishing wrong actions.
• RL is a process of trial and error, combined with
learning.
• The agent decides on actions based on the current
environmental state, and through feedback in terms of
the desirability of the action, learns which action is
best associated with which state.
• The agent learns from interaction with the
environment.
Learning through Rewards
• Reinforcement learning is the learning of a mapping from situations to
actions with the main objective to maximize the scalar reward or
reinforcement signal.
• Reinforcement learning is defined as learning by trial and error from
performance feedback from the environment or an external evaluator.
• The agent has absolutely no prior knowledge of what action to take, and
has to discover (or explore) which actions yield the highest reward.
• The agent receives sensory inputs from its
environment, as a description of the current state of
the perceived environment.
• An action is executed, upon which the agent receives
the reinforcement signal or reward.
• This reward can be a positive or negative signal,
depending on the correctness of the action. A negative
reward has the effect of punishing the agent for a bad
action.
• The action may cause a change in the agent’s
environment, thereby affecting the future options and
actions of the agent. The effects of actions on the
environment and future states cannot always be
predicted. It is therefore necessary that the agent
frequently monitors its environment.
Exploration–Exploitation Trade-off
• One of the important issues in RL (which occurs in most
search methods) is that of the exploration–exploitation
trade-off.
• As already indicated, RL has two important components:
❑ A trial and error search to find good actions, which forms
the exploration component of RL.
❑ A memory of which actions worked well in which situations.
This is the exploitation component of RL.
• It is important that the agent exploits what it has already
learned, such that a reward can be obtained.
• However, via the trial and error search, the agent must also
explore to improve action selections in the future.
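A common way to balance these two components is an ε-greedy rule: with probability ε the agent explores a random action, otherwise it exploits the action with the highest estimated value. A minimal sketch (the function and parameter names here are illustrative, not from the text):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise exploit the action
    with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

Setting ε = 0 gives pure exploitation and ε = 1 pure exploration; in practice ε is often decayed over time, shifting the agent from exploration toward exploitation.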
A reinforcement learning agent has the
following components:
• A policy, which is the decision making function of the agent. This function
is used to specify which action to execute in each of the situations that the
agent may encounter. The policy is basically a set of associations between
actions and situations, or alternatively, a set of stimulus-response rules.
• A reward function, which defines the goal of the agent. The reward
function defines what are good and bad actions for the agent for specific
situations. The reward is immediate, and represents only the current
environment state. The goal of the agent is to maximize the total reward
that it receives over the long run.
• A value function, which specifies the goal in the long run. The value
function is used to predict future reward, and is used to indicate what is
good in the long run.
• Optionally, an RL agent may also have a model of the environment. The
environmental model mimics the behavior of the environment. This can
be done by transition functions that describe transitions between
different states.
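The way these components interact can be sketched as a generic agent–environment loop (all names here are illustrative, not from the text):

```python
def run_episode(env_reset, env_step, policy, max_steps=100):
    """Generic agent-environment loop: the policy maps a state to an
    action; the environment returns the next state, the immediate
    reward, and a termination flag.  Returns the accumulated reward."""
    s = env_reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(s)            # the policy picks an action for state s
        s, r, done = env_step(s, a)
        total_reward += r        # reward feedback from the environment
        if done:
            break
    return total_reward
```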
A number of models have been proposed to quantify the future
reward used by a value function. The most common is the
infinite-horizon discounted model, in which the value of a state is
the expected discounted sum of future rewards, with discount
factor γ ∈ [0, 1).
To find an optimal policy
• In order to find an optimal policy, π∗, it is necessary to find an optimal
value function. A candidate optimal value function is

V∗(s) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V∗(s′) ]

Where,
A is the set of all possible actions,
S is the set of environmental states,
R(s, a) is the reward function,
T(s, a, s′) is the transition function, and
γ is the discount factor.
• The value of a state, s, is the expected instantaneous reward, R(s, a), for
action a plus the expected discounted value of the next state, using the
best possible action.
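For a small, known model, the optimal value function described above can be computed by repeatedly applying this backup until the values stop changing (value iteration). A minimal sketch (function names are illustrative):

```python
def value_iteration(states, actions, R, T, gamma=0.9, tol=1e-6):
    """Repeat the backup V(s) <- max_a [R(s,a) + gamma*sum_s' T(s,a,s')V(s')]
    until the largest change in any state value falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R(s, a) + gamma * sum(T(s, a, s2) * V[s2] for s2 in states)
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```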
Model-Free Reinforcement Learning
• The objective is to obtain an optimal policy
without a model of the environment. This
section reviews two approaches,
1. Temporal difference (TD) learning
2. Q-learning
1. Temporal Difference Learning
• Temporal difference (TD) learning [824] learns the value function using
the update rule

V(s) ← V(s) + η [ r + γV(s′) − V(s) ]    (6.5)

• where η is a learning rate, r is the immediate reward, γ is the
discount factor, s is the current state, and s′ is the next state. Based
on equation (6.5), whenever a state, s, is visited, its estimated value
is updated to be closer to r + γV(s′).
• The above model is referred to as TD(0), where only one future step
is considered.
• The TD method has been generalized to TD(λ) strategies, where λ ∈
[0, 1] is a weighting on the relevance of recent temporal differences
relative to previous predictions.
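A single TD(0) backup for tabular values can be written in a few lines (a sketch; names are illustrative):

```python
def td0_update(V, s, r, s_next, eta=0.1, gamma=0.9):
    """Move the estimate V(s) toward the one-step target r + gamma*V(s')."""
    V[s] += eta * (r + gamma * V[s_next] - V[s])
    return V
```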
• For TD(λ), the value function is learned using

V(u) ← V(u) + η [ r + γV(s′) − V(s) ] e(u)    (6.6)

• where e(u) is the eligibility of state u. The eligibility of a
state is the degree to which the state has been visited
in the recent past, computed as

e(u) = Σ_{k=1}^{t} (λγ)^{t−k} χ(u, s_k)

Where

χ(u, s_k) = 1 if u = s_k, and 0 otherwise.

• The update in equation (6.6) is applied to every state,
according to its eligibility, and not just the previous
state as for TD(0).
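One step of tabular TD(λ) with accumulating eligibility traces might look as follows (a sketch; names are illustrative):

```python
def td_lambda_step(V, e, s, r, s_next, eta=0.1, gamma=0.9, lam=0.8):
    """Bump the eligibility of the current state, then apply the TD error
    to every state in proportion to its (decaying) eligibility."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    e[s] = e.get(s, 0.0) + 1.0                 # accumulating trace
    for u in list(e):
        V[u] = V.get(u, 0.0) + eta * delta * e[u]
        e[u] *= gamma * lam                    # traces fade each step
    return V, e
```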
2. Q-Learning
• In Q-learning, the task is to learn the expected discounted
reinforcement values, Q(s, a), of taking action a in state s,
then continuing by always choosing actions optimally. To
relate Q-values to the value function, note that

V∗(s) = max_{a} Q(s, a)

• where V∗(s) is the value of s assuming that the best action is
taken initially.
• The Q-learning rule is given as

Q(s, a) ← Q(s, a) + η [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]    (6.10)

• The agent then takes the action with the highest Q-value.
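One tabular Q-learning backup can be sketched as follows (names are illustrative):

```python
def q_update(Q, s, a, r, s_next, actions, eta=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + eta * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + eta * (r + gamma * best_next - old)
    return Q
```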
Neural Networks and Reinforcement
Learning
• Neural networks and reinforcement learning have been combined
in a number of ways.
• One approach to combining these models is to use a NN as an
approximator of the value function used to predict future reward.
• Another approach uses RL to adjust weights.
• Both these approaches are discussed in this section.
• As already indicated, the LVQ-II implements a form of RL.
• Weights of the winning output unit are positively updated only if
that output unit provided the correct response for the
corresponding input pattern. If not, weights are penalized through
adjustment away from that input pattern.
• Other approaches that use RL for NN training include RPROP,
gradient descent on the expected reward, and connectionist
Q-learning, in which a NN is used to approximate the value function.
1. RPROP
• Resilient propagation (RPROP) [727, 728] performs a direct adaptation of
the weight step using local gradient information. Weight adjustments are
implemented in the form of a reward or punishment, as follows: if the
partial derivative, ∂E/∂vji (or ∂E/∂wkj), of weight vji (or wkj) changes its
sign, the update value, Δji (Δkj), is decreased by the factor η−. The reason
for this penalty is that the last weight update was too large, causing the
algorithm to jump over a local minimum. On the other hand, if the
derivative retains its sign, the update value is increased by the factor η+
to accelerate convergence.
• For each weight, vji (and wkj), the update value is determined as

Δji(t) = η+ · Δji(t−1)   if ∂E/∂vji(t−1) · ∂E/∂vji(t) > 0
Δji(t) = η− · Δji(t−1)   if ∂E/∂vji(t−1) · ∂E/∂vji(t) < 0
Δji(t) = Δji(t−1)        otherwise

• Using the above, the change in weight is

Δvji(t) = −sign(∂E/∂vji(t)) · Δji(t)
• RPROP is summarized in Algorithm 6.1. The value of Δ0 is the
initial weight step, and is chosen as a small value, e.g. Δ0 = 0.1. It
has been shown that the performance of RPROP is insensitive to the
value of Δ0. Parameters Δmax and Δmin respectively specify upper
and lower limits on the update step sizes. The suggested values are
η− = 0.5 and η+ = 1.2.
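The per-weight rule can be sketched as a simplified single-weight version (variable names are illustrative):

```python
def rprop_step(grad, prev_grad, step, weight,
               eta_minus=0.5, eta_plus=1.2, step_min=1e-6, step_max=50.0):
    """Grow the step while the gradient keeps its sign, shrink it on a
    sign flip (the last step jumped over a minimum), then move the
    weight against the sign of the gradient."""
    if grad * prev_grad > 0:
        step = min(step * eta_plus, step_max)    # reward: accelerate
    elif grad * prev_grad < 0:
        step = max(step * eta_minus, step_min)   # punish: back off
        grad = 0.0                               # skip the next sign test
    if grad > 0:
        weight -= step
    elif grad < 0:
        weight += step
    return weight, step, grad
```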
2. Gradient Descent Reinforcement
Learning
• For problems where only the immediate reward is maximized (i.e. there is no value
function, only a reward function), Williams [911] proposed weight update rules
that perform a gradient descent on the expected reward. These rules are then
integrated with back-propagation. Weights are updated as follows:

Δwkj = ηkj (rp − θk) ekj

• where ηkj is a non-negative learning rate, rp is the reinforcement associated with
pattern zp, θk is the reinforcement threshold value, and ekj is the eligibility of
weight wkj, given as

ekj = ∂ ln gk / ∂wkj

• where gk is the probability density function used to randomly generate actions,
based on whether the target was correctly predicted or not. Thus, this NN
reinforcement learning rule computes a gradient descent in probability space.
• Similar update equations are used for the vji weights.
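For a single stochastic output unit, this rule reduces to a few lines. The sketch below assumes a Bernoulli unit with p = sigmoid(w·x) and uses the threshold θk as a `baseline` argument; all names are illustrative:

```python
import math
import random

def reinforce_bernoulli(w, x, r, baseline=0.0, eta=0.1, rng=random):
    """Sample action y ~ Bernoulli(p) with p = sigmoid(w.x), then update
    the weights along (r - baseline) * dln(g)/dw, where the eligibility
    of each weight is dln(g)/dw = (y - p) * x."""
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    y = 1.0 if rng.random() < p else 0.0       # stochastic action
    eligibility = [(y - p) * xi for xi in x]
    w = [wi + eta * (r - baseline) * ei for wi, ei in zip(w, eligibility)]
    return w, y
```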
3. Connectionist Q-Learning
• Neural networks have been used to learn the Q-function in
Q-learning.
• The NN is used to approximate the mapping between states and
actions, and even to generalize between states.
• The input to the NN is the current state of the environment, and
the output represents the action to execute. If there are na actions,
then either one NN with na output units can be used, or na NNs,
one for each of the actions, can be used.
• Assuming that one NN is used per action, Lin [527] used the
Q-learning rule in equation (6.10) to update weights as follows:

Δw(t) = η [ r + γ max_{a′} Q(s(t+1), a′) − Q(t) ] ∇wQ(t)

• where Q(t) is used as shorthand notation for Q(s(t), a(t)), and ∇wQ(t)
is the vector of output gradients, ∂Q(t)/∂w, which are calculated
by means of back-propagation. Similar equations are used for the vj
weights.
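For a linear approximator Q(s, a) = w·φ(s, a), the gradient ∇wQ is simply the feature vector φ, and the update takes a particularly compact form (a sketch; names are illustrative):

```python
def q_grad_update(w, phi, r, q_now, q_next_max, eta=0.1, gamma=0.9):
    """w <- w + eta * [r + gamma * max_a' Q(s',a') - Q(t)] * grad_w Q(t),
    where grad_w Q(t) = phi for a linear model."""
    delta = r + gamma * q_next_max - q_now   # temporal-difference error
    return [wi + eta * delta * fi for wi, fi in zip(w, phi)]
```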
• This gives an overall update of

Δw(t) = η [ r + γ max_{a′} Q(s(t+1), a′) − Q(t) ] e(t)

• where the eligibility is calculated using

e(t) = ∇wQ(t) + γλ e(t−1)    (6.26)

• Equation (6.26) keeps track of the weighted sum of
previous error gradients.
• The Q(λ) update algorithm is given in Algorithm 6.
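The eligibility recursion for a weight vector can be sketched in one line (names are illustrative):

```python
def q_lambda_trace(e_prev, grad_q, gamma=0.9, lam=0.8):
    """e(t) = grad_w Q(t) + gamma * lambda * e(t-1): a decaying sum of
    the previous output gradients."""
    return [g + gamma * lam * ep for g, ep in zip(grad_q, e_prev)]
```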