Hybridization of Reinforcement Learning Agents
Héctor J. Fraire Huacuja
Juan J. González Barbosa
Jesús V. Flores Morfin
Department of Systems and Computation
Technological Institute of Ciudad Madero
Av. 1º de Mayo y Sor Juana Inés de la Cruz S/N, Col. Los Mangos
Tel.: (12) 10-04-15 (switchboard); (12) 10-29-02 (Computer Center)
Technological Institute of La Laguna
Summary

The purpose of this work was to show that classic reinforcement learning agents can achieve significant improvements in performance through hybridization techniques. The methodology applied consists of defining a mechanism to compare the learning capacity of the agents, testing the performance of the classic agents under similar conditions of interaction with their environment and, finally, modifying the worst of the agents in order to increase its performance. After analyzing the classic reinforcement learning agents in a simulated maze environment, the comparative tests identified the SARSA agent as having the worst performance and the SARSA(λ) agent as having the best. The most prominent result is that applying hybridization techniques to the agents considered allows the construction of hybrid agents that present notable improvements in performance.

Introduction

The origin of reinforcement learning came along with the beginnings of cybernetics, and it involves elements of statistics, psychology, neuroscience and the computational sciences. In the last ten years, interest in these techniques in the fields of Artificial Intelligence and machine learning has greatly increased. Reinforcement learning is a reward-based form of agent programming whose purpose is to make the agent perform a task without a precise specification of what to do. This approach modifies the prevailing programming paradigms and creates a very broad field of opportunities.
Reinforcement learning is a technique that allows the construction of intelligent agents with learning and adaptation capacities. An agent of this class learns from the interaction with the environment in which it operates and adapts to the changes that appear in its surroundings. The perception that the agent has of the environment through its sensors becomes a state vector. An elementary action is associated with each of the possible states, and the agent is capable of executing that action through its actuators. The state-action relation is the nucleus of this approach, since updating this function allows the agent to keep a record of the experience it has acquired. Using this information, the agent determines at each step the action that has contributed the most to achieving the global objectives. The action is selected this way most of the time; in a small proportion of the time the action is selected at random among all the possible actions. This mechanism, called exploration, allows the agent to test actions that in the past were found to be of little effectiveness, and it allows the agent to adapt its behavior to changes in the environment. Once the selected action is executed, a prize or a punishment for the action is generated, and the accumulated reward is stored in the state-action relation.

This type of agent was selected to evaluate its potential application in the construction of controllers for autonomous mobile robots, for two reasons. First, the learning is performed on line; this condition is fundamental for robot applications in which animal behaviors are emulated. Second, reinforcement learning is extremely economical in the computation and storage resources required for its implementation.

Reinforcement Learning

Learning has been the key factor of intelligent systems, since they have to make robots carry out tasks without the need for explicit programming. The main elements involved in the problem are the following:

Agent: The subject that carries out the learning tasks and takes decisions to reach a goal.

Environment: Everything that is not controlled by the agent; it generates states and receives actions. In other words, the agent and the environment interact in a sequence of steps.

State (s): The perception that the agent has of the environment.

Action (a): The conduct that the agent uses to modify its surroundings or environment.

Policy: It defines the way the agent should behave from a given state or from a given (state, action) combination. A policy can be represented in a table or in another type of structure; normally a policy is defined implicitly by the table associated with a function. The determination of a policy is the nucleus of the approach, since it defines the conduct of the agent.

Function of prizes (reward function): It defines the goal of the agent. It determines how desirable a state or a pair (state, action) can be; in a certain sense, which events are good or bad for the agent to achieve its goal. The agent's goal is to maximize the total amount of rewards (r) received during the time of the experiment. The reward function defines what is good for the agent in the immediate term.

Value function (state) V(s): The total quantity of rewards (r) that the agent expects to accumulate from state (s) at a given time (i). This specifies what is good for the agent in the long run.

• V(si) = E[ Σ ri | si ]

Value function (state-action) Q(s, a): The expected value of the rewards (r) starting from state (s) and action (a) at a given time (i). The estimation of the values of these functions is the nucleus of the activity of this method.

• Q(si, ai) = E[ Σ ri | si, ai ]
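The policy "defined implicitly by a table" and the exploration mechanism mentioned above can be illustrated with a minimal Python sketch. This is not code from the paper: the dictionary-based table, the four maze moves and the value ε = 0.1 are illustrative assumptions.

    import random
    from collections import defaultdict

    # Tabular value function Q(s, a): one entry per (state, action) pair,
    # unvisited pairs default to 0.0.
    Q = defaultdict(float)

    def epsilon_greedy(state, actions, epsilon=0.1):
        """Policy defined implicitly by the Q table: exploit the highest-valued
        action most of the time, explore a random action with probability epsilon."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    # Example: choose a move for maze square (2, 3).
    print(epsilon_greedy((2, 3), ["up", "down", "left", "right"]))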
In the reinforcement learning problem, the agent makes decisions as a function of a signal provided by the environment, called the environment's state. If each state of the environment has the characteristic of summarizing all the past information, in such a way that information about previous states is not required, the state signal is said to have the Markov property [4].

If the Markov property is not assumed, the task of maintaining the information of all the past states requires a large amount of available memory. A reinforcement learning task that satisfies the Markov property is called a Markov decision process.

The methods for solving the reinforcement learning problem are dynamic programming, Monte Carlo methods and temporal-difference learning.

Dynamic programming is a collection of algorithms that can be used to compute optimal policies given a model of the environment as a Markov decision process. Its practical utility is limited, since it requires a complete model of the environment; however, it is important to master these techniques, since they provide the basis of operation of the other techniques. All dynamic programming methods update the estimates of the values of a state based on estimates of the values of its successor states. This property is known as bootstrapping.

Monte Carlo methods do not require complete knowledge of the environment. They use on-line or simulated experience of the interaction with the environment (sequences of states, actions and rewards); the simulation requires only a partial description of the environment. These methods estimate the value function and optimal policies from experience in the form of episodes, and they are of practical utility for tasks that can be described in terms of subtasks or episodes. Bootstrapping is not used.

Temporal-difference learning is a combination of dynamic programming ideas and Monte Carlo ideas. Like Monte Carlo methods, it can learn directly from on-line experience without using a model of the environment; like dynamic programming, it updates the value of a state on the basis of the estimates of the values of other states, without waiting for a final outcome. The reinforcement learning problem is divided into two subproblems: a prediction problem and a control problem. The first tries to determine the value function V for a given policy; the second determines the optimal policy π* by convergence to the maximum of the value function Q*. The classic temporal-difference algorithms that help to solve this problem are SARSA, SARSA(λ) and Q-learning.
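Before turning to those control algorithms, the prediction step itself can be illustrated with a minimal TD(0) sketch in Python. This is not from the paper; the dictionary value table and the values α = 0.1, γ = 0.9 are illustrative assumptions. The value of a state is moved toward the reward plus the discounted estimate of the successor state (bootstrapping), without waiting for the end of the episode.

    from collections import defaultdict

    V = defaultdict(float)   # state-value estimates V(s)
    alpha, gamma = 0.1, 0.9  # illustrative step size and discount factor

    def td0_update(s, r, s_next):
        """TD(0) prediction step: move V(s) toward r + gamma * V(s'),
        using the current estimate of the successor state instead of
        waiting for the episode to finish, as a Monte Carlo method would."""
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

    # Example transition: from state 'A' to state 'B' with reward -1.
    td0_update('A', -1.0, 'B')
    print(V['A'])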
Classic Algorithms

Let:
Q(s, a) – the value function (state, action)
s – the present state at time t
a – the action for state s
s' – the following state, at time t+1
a' – the action for state s'
α – the step-size (learning rate) parameter
r – the reward
γ – the discount factor, between 0 and 1
Algorithm SARSA

Initialize Q(s, a) arbitrarily
Repeat for each episode
    Initialize s
    Select a from s using the policy derived from Q
    Repeat for each step of the episode
        Take action a, observe r and s'
        Select a' from s' using the policy derived from Q
        Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
        s ← s'; a ← a'
    until s is terminal
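A direct transcription of this update into Python may help fix ideas. The sketch below is not from the paper; the dictionary table, the ε-greedy selection and the values α = 0.1, γ = 0.9, ε = 0.1 are illustrative assumptions.

    import random
    from collections import defaultdict

    Q = defaultdict(float)                 # tabular Q(s, a)
    alpha, gamma, epsilon = 0.1, 0.9, 0.1  # illustrative parameters

    def select(state, actions):
        """Select an action from the policy derived from Q (epsilon-greedy)."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa_update(s, a, r, s_next, a_next):
        """SARSA update: the target uses the action a' actually selected
        for the next state, so the method is on-policy."""
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

    # Example transition in a maze with four moves.
    moves = ["up", "down", "left", "right"]
    a_next = select((0, 1), moves)
    sarsa_update((0, 0), "right", -1.0, (0, 1), a_next)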
Algorithm SARSA(λ)

Initialize Q(s, a) arbitrarily and e(s, a) = 0 for all s and a
Repeat for each episode
    Initialize s, a
    Repeat for each step of the episode
        Take action a, observe r and s'
        Select a' from s' using the policy derived from Q
        δ ← r + γ Q(s', a') − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + α δ e(s, a)
            e(s, a) ← γ λ e(s, a)
        s ← s'; a ← a'
    until s is terminal
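The eligibility-trace mechanism can be sketched the same way. Again this is an illustrative Python sketch, not the authors' code; α = 0.1, γ = 0.9 and λ = 0.8 are assumed values. Each visited (state, action) pair keeps a trace, every pair is credited with the current TD error in proportion to its trace, and all traces then decay by γλ.

    from collections import defaultdict

    Q = defaultdict(float)              # value function Q(s, a)
    E = defaultdict(float)              # eligibility traces e(s, a)
    alpha, gamma, lam = 0.1, 0.9, 0.8   # illustrative parameters

    def sarsa_lambda_update(s, a, r, s_next, a_next):
        """One SARSA(lambda) step, following the pseudocode above."""
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        E[(s, a)] += 1.0                       # bump the trace of the current pair
        for key in list(E):                    # for all (s, a) with a nonzero trace
            Q[key] += alpha * delta * E[key]   # credit proportionally to the trace
            E[key] *= gamma * lam              # decay the trace

    # Traces are commonly cleared (E.clear()) when a new episode starts.
    sarsa_lambda_update((0, 0), "right", -1.0, (0, 1), "up")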
Q-Learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat for each episode:
    Initialize s
    Repeat for each step of the episode:
        Choose a from s using the policy derived from Q
        Take action a, observe r and s'
        Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]
        s ← s'
    until s is terminal
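The only structural difference from SARSA is the target: Q-learning uses the best action available in the next state rather than the action that will actually be taken. A minimal illustrative Python sketch (α = 0.1 and γ = 0.9 are assumed values, not taken from the paper):

    from collections import defaultdict

    Q = defaultdict(float)
    alpha, gamma = 0.1, 0.9   # illustrative parameters

    def q_learning_update(s, a, r, s_next, actions):
        """Q-learning update: the target uses the max over a' in the next
        state (off-policy), independently of the action taken next."""
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    q_learning_update((0, 0), "right", -1.0, (0, 1), ["up", "down", "left", "right"])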
Hybrid Agent

The hybridization technique used consists of modifying the way rewards are computed after each action. From this mechanism, applied to SARSA, a quadratic combination is generated whose parameters are basically two constants (τ = 200 and µ = 75) that control the contribution, or weight, of each of the factors. In the experimental tests described below, the resulting agent, denominated Q_SARSA (Quick_SARSA), shows the potential of this approach to obtain improvements in the performance of the classic algorithms:

Algorithm Q-SARSA

Initialize Q(s, a) arbitrarily
Repeat for each episode
    Initialize s
    Select a from s using the policy derived from Q
    Repeat for each step of the episode
        Take action a, observe r and s'
        Select a' from s' using the policy derived from Q
        Q(s, a) ← Q(s, a) + α [ r + τ (γ Q(s', a') − Q(s, a)) + µ (γ Q(s', a') − Q(s, a))² ]
        s ← s'; a ← a'
    until s is terminal
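Read literally, the update adds a term that is quadratic in the one-step difference γQ(s', a') − Q(s, a), weighted by µ, to the usual linear term weighted by τ. A hedged Python transcription follows; it is not the authors' code. The dictionary table and the values α = 0.1, γ = 0.9 are illustrative assumptions, while τ = 200 and µ = 75 are the constants reported in the text.

    from collections import defaultdict

    Q = defaultdict(float)
    alpha, gamma = 0.1, 0.9   # illustrative step size and discount factor
    tau, mu = 200.0, 75.0     # weighting constants reported for Q_SARSA

    def q_sarsa_update(s, a, r, s_next, a_next):
        """Q_SARSA update as written above: the difference gamma*Q(s',a') - Q(s,a)
        contributes both linearly (weight tau) and squared (weight mu)."""
        diff = gamma * Q[(s_next, a_next)] - Q[(s, a)]
        Q[(s, a)] += alpha * (r + tau * diff + mu * diff ** 2)

    q_sarsa_update((0, 0), "right", -1.0, (0, 1), "up")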
Comparative Tests

In these tests the hybrid algorithm Q_SARSA is compared against the algorithms SARSA and SARSA(λ). The environment in which the agents evolve is a maze simulator, in which the agent is put to the following test: it is placed in the initial square of a maze and from there it must reach the square indicated as the goal. The global objective of the agent is to learn how to go from the initial square to the target square. After each attempt, the value function Q(s, a) allows a relation to be established between each of the possible states of the environment and the action with the greatest accumulated reward. The state-action relation of highest rewards that is settled at the end of each attempt is called the learned policy. The criterion to define at which moment the agent has learned to perform the task is the following: the agent is considered to have learned once it goes from the initial square to the target one without any variation in the learned policy during the last 200 attempts [8]. It is to be expected that, the greater the complexity of the task, the larger this number must be in order to ensure that the learned policy is indeed a solution to the problem the agent is trying to solve.

With this criterion the agents were tested on mazes of different complexity (easy, average and difficult). Mazes of easy complexity have a single solution path. Mazes of average complexity have one or more solution paths. Mazes of difficult complexity have dead-end paths and one or more solution paths.

A first indicator of the effectiveness of the agents is the total number of actions made by the agent in the test until the learned policy became stabilized.
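How these indicators might be measured can be sketched as follows. This is an illustrative reconstruction, not the authors' simulator: the environment object with reset()/step() and the agent with act()/learn()/Q are hypothetical interfaces, and the window of 200 unchanged attempts follows the criterion stated above.

    def greedy_policy(Q, states, actions):
        """The learned policy: for each state, the action of greatest accumulated reward."""
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

    def run_test(env, agent, states, actions, stable_window=200):
        """Run attempts until the learned policy shows no variation during the
        last `stable_window` attempts; return the two indicators reported below:
        total number of actions and number of attempts made."""
        total_actions, attempts, unchanged = 0, 0, 0
        last_policy = None
        while unchanged < stable_window:
            s, done = env.reset(), False
            while not done:                        # one attempt: initial square -> goal square
                a = agent.act(s)
                s_next, r, done = env.step(a)
                agent.learn(s, a, r, s_next)
                s, total_actions = s_next, total_actions + 1
            attempts += 1
            policy = greedy_policy(agent.Q, states, actions)
            unchanged = unchanged + 1 if policy == last_policy else 0
            last_policy = policy
        return total_actions, attempts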
The results of the tests in mazes of easy complexity with respect to this indicator are the following:

Total number of actions
SARSA       29,673,089
SARSA(λ)    4,045

This table shows clearly that the performance of Q_SARSA is more similar to that of SARSA(λ) than to that of SARSA. The relevance of this result is evident, since Q_SARSA is structurally of SARSA type. The performance of the algorithms in terms of the speed with which they stabilize the learned policy is observed by counting the number of attempts made by each algorithm until stabilization.

Number of attempts made
SARSA       211,003
Q_SARSA     204
SARSA(λ)    246

Results for mazes of average complexity:

Total number of actions
SARSA       4,430,244
Q_SARSA     61,542
SARSA(λ)    2,034

Number of attempts made
SARSA       283,699
Q_SARSA     255
SARSA(λ)    205

Results for mazes of difficult complexity:

Total number of actions
SARSA       16,777,216
Q_SARSA     3,659,667

Number of attempts made
SARSA       11,796
Q_SARSA     33,437
SARSA(λ)    356

Conclusions

This study showed that the Q_SARSA algorithm surpasses the SARSA algorithm and that the hybridization mechanism used is capable of achieving an improved performance of the agents. The type of hybridization experimented with maintains low operating costs and preserves the basic structure of the agents.
Future Work

It would be interesting to try other modifications of the hybrid agent, such as an evaluation of the results of the actions that the agent made in the past. The objective of this function would be to filter the actions, eliminating those that evidently do not contribute to the achievement of the agent's global goal.

It would also be interesting to implement the Q_SARSA learning algorithm in a robot.

Acknowledgments

Special thanks for the accomplishment of this project are given to the Council of the National System of Technological Education, to the General Direction of Technological Institutes and to the Direction of the Technological Institute of Ciudad Madero. Project supported by COSNET key
References

Mahadevan, Sridhar, Khaleeli, Nikfar, and Marchalleck, Nicholas. 1998. Designing Agent Controllers using Discrete-Event Markov Models. Michigan State University, MI, USA.

Mahadevan, Sridhar. 1997. Machine Learning for Robots: A Comparison of Different Paradigms. University of South Florida, USA.

Brooks, Rodney A. 1991. Intelligence Without Reason. MIT A.I. Memo No. 1293, USA.

Martin, Mario. 1998. Reinforcement learning for embedded agents facing complex tasks. Doctoral thesis, Universidad Politécnica de Cataluña.

Pendrith, Mark D. & Ryan, Malcolm R. K. 1997. C-Trace: A new algorithm for reinforcement learning of robotic control. The University of New South Wales, Sydney.

Sutton, Richard S., & Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, USA.

Kaelbling, Leslie P., Littman, Michael L., & Moore, Andrew W. 1996. Reinforcement Learning: A Survey. Brown University, USA.

Mohan Rao, K. Vijay. 1997. Learning Algorithms for Markov Decision Processes. Department of Computer Science and Automation, Indian Institute of Science, Bangalore – 560 012.