Hybridization of Reinforcement Learning Agents



  1. Hybridization of Reinforcement Learning Agents

Héctor J. Fraire Huacuja, Juan J. González Barbosa, Jesús V. Flores Morfin
Department of Systems and Computation, Technological Institute of Madero City
Avenue 1º de Mayo y Sor Juana Inés de la Cruz S/N, Colony Los Mangos
(12) 10-04-15, Switchboard (12) 10-29-02
Computer Center, Technological Institute of the Lagoon
jjgonzalezbarbosa@usa.net, hfraire@acm.org, jvfm@acm.org

Summary

The purpose of this work was to show that classic reinforcement learning agents [1] can achieve significant improvements in performance through hybridization techniques. The methodology consists of defining a mechanism to compare the learning capacity of the agents, testing the performance of the classic agents under similar conditions of interaction with their environment and, finally, altering the worst-performing agent in order to increase its performance. After analyzing the classic reinforcement learning agents in a simulated maze environment, the comparative tests identified the SARSA agent as having the worst performance and the SARSA(λ) agent as having the best. The most prominent result is that applying hybridization techniques to the agents considered allows the construction of hybrid agents that show notable improvements in performance.

Introduction

The origin of reinforcement learning came along with the beginnings of cybernetics and involves elements of statistics, psychology, neuroscience and computer science [2]. In the last ten years, interest in these techniques in the fields of artificial intelligence and machine learning has greatly increased [8]. Reinforcement learning is a form of agent programming, based on rewards, whose purpose is to make the agent perform a task without a precise specification of what to do. This approach modifies the prevailing programming paradigms and creates a wide field of opportunities.
  2. Reinforcement learning is a technique that allows the construction of intelligent agents with learning and adaptation capacities. An agent of this class learns from its interaction with the environment in which it operates and adapts to changes that appear in its surroundings. The perception that the agent has of the environment through its sensors becomes a state vector. An elementary action is associated with each of the possible states, and the agent is capable of carrying out the action through its actuators. The state-action relation is the nucleus of this approach, since updating this function allows the agent to keep a record of the experience acquired. Using this information, the agent determines at each step the action that has contributed most to achieving the global objectives. The action is selected this way most of the time. In a small proportion of the steps, the action is selected at random among all the possible actions. This mechanism, called exploration, allows the agent to test actions that in the past were found to be of little effectiveness. This capability allows the agent to adapt its performance to changes in the environment. Once the selected action is carried out, a reward or a punishment for the action is generated. The accumulated reward is stored in the state-action relation.

This type of agent has been selected to evaluate its potential application in the construction of controllers for autonomous mobile robots, for two reasons. First, the learning takes place on line. This condition is fundamental for robot applications in which animal behavior is emulated. The other reason is that reinforcement learning is extremely economical in the computing and storage resources required for its implementation.

Reinforcement Learning

Learning has been the key factor in intelligent systems, since it makes it possible for robots to carry out tasks without the need for explicit programming [7]. The main elements involved in the problem are the following:

Agent: The subject that carries out the learning tasks and makes decisions to reach a goal.
Environment: Everything that is not controlled by the agent; it generates states and receives actions. In other words, the agent and the environment interact in a sequence of steps.
State (s): The perception that the agent has of the environment.
Action (a): The behavior that the agent uses to modify its surroundings or environment.
Policy: Defines how the agent should behave in a given state or for a given (state, action) combination. A policy can be represented in a table or in another type of structure. Normally a policy is defined implicitly by the table associated with a function. Determining a policy is the nucleus of the approach, since the policy defines the behavior of the agent.
Reward function: Defines the goal of the agent. It determines how desirable a state or a (state, action) pair is; in a sense, it tells which events are good or bad for the agent in achieving its goal. The agent's goal is to maximize the total amount of reward (r) received during the experiment. The reward function defines what is good for the agent in the immediate sense.
Value function (state) V(s): The total quantity of rewards (r) that the agent expects to accumulate from state (s) at a given time (i). This specifies
  3. what is good for the agent in the long run.

    V(s_i) = E[ Σ r_i | s_i ]

Value function (state-action) Q(s, a): The expected value of the rewards (r), starting from state (s) and action (a) at a given time (i). Estimating the values of these functions is the core activity of this method.

    Q(s_i, a_i) = E[ Σ r_i | s_i, a_i ]

In the reinforcement learning problem, the agent makes decisions as a function of a signal provided by the environment, called the environment's state. If each state of the environment summarizes all the past information, in such a way that information about previous states is not required, the state signal is said to have the Markov property [4].

If the Markov property is not assumed, maintaining the information of all past states requires a large amount of available memory. A reinforcement learning task that satisfies the Markov property is called a Markov decision process.

The methods for solving the reinforcement learning problem are dynamic programming, Monte Carlo methods and temporal-difference learning [3].

Dynamic programming is a collection of algorithms that can be used to compute optimal policies given a model of the environment as a Markov decision process. Its practical utility is limited, since it requires a complete model of the environment. However, it is important to master this technique, since it provides the operating basis of the other techniques. All dynamic programming methods update the estimates of the values of a state based on estimates of the values of successor states. This property is known as bootstrapping.

Monte Carlo methods do not require complete knowledge of the environment. They use on-line or simulated experience of the interaction with the environment (sequences of states, actions and rewards). Simulation requires only a partial description of the environment. These methods estimate the value function and optimal policies from experience in the form of episodes. They are of practical use for tasks that can be described in terms of subtasks or episodes. Bootstrapping is not used.

Temporal-difference learning is a combination of dynamic programming ideas and Monte Carlo ideas. Like Monte Carlo methods, it can learn directly from on-line experience without using a model of the environment. Like dynamic programming, it updates the value of a state on the basis of the estimated values of other states, without waiting for the final outcome of the episode.

The reinforcement learning problem is divided into two subproblems: a prediction problem and a control problem. The first tries to determine the value function V* for a given policy. The second determines the optimal policy π* by convergence to the maximum of the value function Q* [6]. The classic temporal-difference algorithms that help to solve this problem are SARSA, SARSA(λ) and Q-learning.

Classic Algorithms

Let:
Q(s, a): the value function (state, action)
s: the present state at time t
a: the action for state s
s': the following state at time t+1
a': the action for state s'
  4. α: the step-size parameter (learning rate) applied at each step
r: the reward
γ: the discount rate, between 0 and 1

Algorithm SARSA

    Initialize Q(s, a) arbitrarily
    Repeat for each episode:
        Initialize s
        Select a from s using the policy derived from Q
        Repeat for each step of the episode:
            Take action a, observe r and s'
            Select a' from s' using the policy derived from Q
            Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
            s ← s'; a ← a'
        until s is terminal

Algorithm SARSA(λ)

    Initialize Q(s, a) arbitrarily and e(s, a) = 0 for all s and a
    Repeat for each episode:
        Initialize s, a
        Repeat for each step of the episode:
            Take action a, observe r and s'
            Select a' from s' using the policy derived from Q
            δ ← r + γ Q(s', a') − Q(s, a)
            e(s, a) ← e(s, a) + 1
            For all s, a:
                Q(s, a) ← Q(s, a) + α δ e(s, a)
                e(s, a) ← γ λ e(s, a)
            s ← s'; a ← a'
        until s is terminal

Algorithm Q-learning

    Initialize Q(s, a) arbitrarily
    Repeat for each episode:
        Initialize s
        Repeat for each step of the episode:
            Choose a from s using the policy derived from Q
            Take action a, observe r and s'
            Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]
            s ← s'
        until s is terminal

Hybrid Agent

The hybridization technique used consists of modifying the way rewards are calculated after each action. Starting from the mechanism applied by SARSA, a quadratic combination is generated whose parameters are essentially two constants (τ = 200 and µ = 75) that control the contribution, or weight, of each factor. In the experimental tests described, the resulting agent, called Q_SARSA (Quick_SARSA), shows the potential of this approach to improve the performance of the classic algorithms:

Algorithm Q_SARSA

    Initialize Q(s, a) arbitrarily
    Repeat for each episode:
        Initialize s
        Select a from s using the policy derived from Q
        Repeat for each step of the episode:
            Take action a, observe r and s'
            Select a' from s' using the policy derived from Q
            Q(s, a) ← Q(s, a) + α [ r + τ (γ Q(s', a') − Q(s, a)) + µ (γ Q(s', a') − Q(s, a))² ]
            s ← s'; a ← a'
        until s is terminal

Comparative Tests

In the tests presented, the hybrid algorithm Q_SARSA is compared against the SARSA and SARSA(λ) algorithms. The environment in which the agents operate is a maze simulator, in which the agent is put to the following test: it is placed in the initial square of a maze and from there it must reach the square designated as the goal. The global objective of the agent is to learn how to go from the initial square to the target square. After each attempt, the value function Q(s, a) makes it possible to establish a relation between each one of the
  5. possible states of the environment and the action with the greatest accumulated reward. The state-action relation of highest reward that is settled at the end of each attempt is called the learned policy. The criterion for deciding when the agent has learned to perform the task is the following: the agent has learned how to go from the initial square to the target square with no variation in the learned policy during the last 200 attempts [8]. It can be expected that, as the complexity of the task grows, this number must be increased to ensure that the learned policy is indeed a solution to the problem the agent is trying to solve.

With this criterion, the agents were tested on mazes of different complexity (easy, average and difficult).
Mazes of easy complexity have a single solution path.
Mazes of average complexity have one or more solution paths.
Mazes of difficult complexity have dead-end paths and one or more solution paths.

A first indicator of the effectiveness of the agents is the total number of actions taken by the agent in the test until the learned policy stabilized. The results of the test on mazes of easy complexity with respect to this indicator are the following:

    Algorithm    Total number of actions
    SARSA        29,673,089
    Q_SARSA      72,964
    SARSA(λ)     4,045

This table shows clearly that the performance of Q_SARSA is closer to that of SARSA(λ) than to that of SARSA. The relevance of this result is evident, since Q_SARSA is structurally an algorithm of the SARSA type. The performance of the algorithms in terms of how quickly they stabilize the learned policy is observed by counting the number of attempts made by each algorithm until stabilization.

    Algorithm    Number of attempts
    SARSA        211,003
    Q_SARSA      204
    SARSA(λ)     246

Results for mazes of average complexity:

    Algorithm    Total number of actions
    SARSA        4,430,244
    Q_SARSA      61,542
    SARSA(λ)     2,034

    Algorithm    Number of attempts
    SARSA        283,699
    Q_SARSA      255
    SARSA(λ)     205

Results for mazes of difficult complexity:

    Algorithm    Total number of actions
    SARSA        16,777,216
    Q_SARSA      3,659,667
    SARSA(λ)     17,710

    Algorithm    Number of attempts
    SARSA        11,796
    Q_SARSA      33,437
    SARSA(λ)     356

Conclusions

This study showed that the Q_SARSA algorithm surpasses the SARSA algorithm and that the hybridization mechanism used is capable of improving the performance of the agents. The type of hybridization experimented with maintains low operating costs and the basic structure of the agents.

Future Work
  6. It would be interesting to try other modifications of the hybrid agent, such as evaluating the results of the actions that the agent took in the past. The objective of this function would be to filter the actions in order to eliminate those that evidently do not contribute to achieving the global goal of the agent.

It would also be interesting to implement the Q_SARSA learning algorithm in a robot.

Acknowledgments

Special thanks for the accomplishment of this project are given to the Council of the National System of Technological Education (COSNET), the General Directorate of Technological Institutes and the Directorate of the Technological Institute of Madero City. Project supported by COSNET, key 700.99-P.

References

[1] Pendrith, Mark D., and Ryan, Malcolm R. K. 1997. C-Trace: A new algorithm for reinforcement learning of robotic control. The University of New South Wales, Sydney, Australia.
[2] Sutton, Richard S., and Barto, Andrew G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, USA.
[3] Kaelbling, Leslie P., Littman, Michael L., and Moore, Andrew W. 1996. Reinforcement Learning: A Survey. Brown University, USA.
[4] Mohan Rao, K. Vijay. 1997. Learning Algorithms for Markov Decision Processes. Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012.
[5] Mahadevan, Sridhar, Khaleeli, Nikfar, and Marchalleck, Nicholas. 1998. Designing Agent Controllers using Discrete-Event Markov Models. Michigan State University, MI, USA.
[6] Mahadevan, Sridhar. 1997. Machine Learning for Robots: A Comparison of Different Paradigms. University of South Florida, USA.
[7] Brooks, Rodney A. 1991. Intelligence Without Reason. MIT Press A.I. Memo No. 1293. USA.
[8] Martin, Mario. 1998. Reinforcement learning for embedded agents facing complex task. Doctoral thesis. Universidad Politecnica de Cataluña.
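
As a supplement to the pseudocode in the Classic Algorithms and Hybrid Agent sections, the SARSA and Q_SARSA update rules can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names, the dictionary-based Q table, the toy states and the values chosen for α and γ are assumptions made for illustration. Only the two update formulas and the constants τ = 200 and µ = 75 come from the text.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Classic SARSA step: Q(s,a) <- Q(s,a) + alpha*[r + gamma*Q(s',a') - Q(s,a)]."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]


def q_sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, tau=200.0, mu=75.0):
    """Hybrid step as transcribed from the Q_SARSA pseudocode:
    Q(s,a) <- Q(s,a) + alpha*[r + tau*d + mu*d**2], with d = gamma*Q(s',a') - Q(s,a).
    """
    d = gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * (r + tau * d + mu * d ** 2)
    return Q[(s, a)]


# Toy transition: the state names, action name, alpha and gamma are
# hypothetical values chosen for illustration, not taken from the paper.
Q1 = defaultdict(float)
Q1[("s1", "right")] = 1.0
classic = sarsa_update(Q1, "s0", "right", 1.0, "s1", "right", alpha=0.5, gamma=0.5)

Q2 = defaultdict(float)
Q2[("s1", "right")] = 1.0
hybrid = q_sarsa_update(Q2, "s0", "right", 1.0, "s1", "right", alpha=0.5, gamma=0.5)

print(classic, hybrid)
```

Written this way, the relation between the two agents is easy to see: calling `q_sarsa_update` with `tau=1` and `mu=0` gives exactly the classic SARSA update, so the hybrid rule differs only in how the term γ Q(s', a') − Q(s, a) is weighted.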