International Journal of Artificial Intelligence and Applications (IJAIA), Vol.11, No.5/6, November 2020
DOI: 10.5121/ijaia.2020.11605
AUTOMATIC TRANSFER RATE ADJUSTMENT FOR
TRANSFER REINFORCEMENT LEARNING
Hitoshi Kono1, Yuto Sakamoto2, Yonghoon Ji3 and Hiromitsu Fujii4

1 Department of Engineering, Tokyo Polytechnic University, Kanagawa, Japan
2 Department of Electronics and Mechatronics, Tokyo Polytechnic University, Kanagawa, Japan
3 Graduate School for Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Ishikawa, Japan
4 Department of Advanced Robotics, Chiba Institute of Technology, Chiba, Japan
ABSTRACT
This paper proposes a novel parameter for transfer reinforcement learning to avoid over-fitting when an
agent uses a policy transferred from a source task. Learning robot systems have recently been studied for
many applications, such as home robots, communication robots, and warehouse robots. However, if the
agent reuses knowledge that has been sufficiently learned in the source task, deadlock may occur and
appropriate transfer learning may not be realized. In previous work, a parameter called the transfer rate
was proposed to adjust the ratio of transfer, and its contributions include avoiding deadlock in the target
task. However, adjusting the parameter depends on human intuition and experience, and a method for
deciding the transfer rate has not been discussed. Therefore, an automatic method for adjusting the
transfer rate using a sigmoid function is proposed in this paper. Further, computer simulations are used to
evaluate the effectiveness of the proposed method in improving the environmental adaptation performance in
a target task, which refers to the situation of reusing knowledge.
KEYWORDS
Reinforcement Learning, Transfer Learning, Transfer Rate, Overfitting, Overlearning
1. INTRODUCTION
Machine learning systems and intelligent robot systems are increasingly being deployed to solve
practical problems, such as house cleaning and conveyance systems in warehouses [1] [2]. The
reinforcement learning framework has been widely discussed in applications of machine learning,
such as deep Q-networks [3] [4]. Basic reinforcement learning techniques are usually time
consuming. Thus, they are disadvantageous for implementation in actual applications, such as
robot systems. To address this problem, transfer learning has been proposed for reinforcement
learning [5-8]. Transfer learning theory allows the application of prior knowledge to another
similar task. In reinforcement learning, the learning agent transfers policies drawn from a previous task
(the source task) to the current task (the target task). The advantages of transfer learning in
reinforcement learning include improved learning speed, fast adaptation to environments, and
discovery of more effective behavior. Agent systems that include transfer learning have
been successful in some cases. However, transfer learning is not successful in all applications,
and in most cases it is necessary to adjust the learning conditions and avoid negative transfer
situations, such as deadlock. In recent years, adjusting the ratio at which a policy is reused has been
proposed as a method to increase the environmental adaptation performance in the target task.
The transfer rate adjustment method is a mechanism that determines the action value of the
reused policy and then uses it in the target task [9] [10]. However, the transfer rate decision
depends on human intuition and experience, and a method to determine the transfer rate has not
been discussed. In addition, the transfer rate is a fixed value, and it is desirable to adjust it
adaptively according to the environment and behavioral conditions. Therefore, a method for adjusting
the transfer rate using a sigmoid function is proposed in this paper. The proposed method
adjusts the transfer rate automatically, discounting its value when knowledge reuse leads to a poor
learning situation in the target task, such as a collision with an obstacle. Computer simulation is used
to confirm whether the proposed method can adjust the transfer rate in transfer learning for
reinforcement learning. The obtained knowledge is not reused in negative transfer situations, such as
deadlock, but can be reused in other cases. In particular, transfer in reinforcement learning for
multi-agent systems and human-robot teams has been discussed in recent years, and it is considered
essential to avoid deadlock, that is, negative transfer caused by transfer learning [11-13].
The remainder of this paper is organized as follows. Section 2 provides an overview of
reinforcement learning and transfer learning and discusses previous methods that adjust the ratio at
which transferred knowledge is reused. Section 3 describes the proposed method, which adjusts this
ratio using a sigmoid function. Section 4 presents the computer simulation experiments and results
used to evaluate the performance of the proposed method against previous methods. Section 5
presents the concluding remarks.
2. PREVIOUS WORKS
2.1. Reinforcement Learning
Reinforcement learning is a machine learning framework [3]. A reinforcement learning agent
explores the optimal solution via trial and error and builds its own knowledge as a policy. Thus,
reinforcement learning does not require training datasets, unlike supervised learning. Many types
of reinforcement learning algorithms have been developed over the past decades. In this study, Q-
learning was adopted as the learning algorithm [14]. Q-learning is defined by
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left\{ r_{t+1} + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t) \right\}, \qquad (1)$$
where s ∈ S is the state of the environment in the state space, a ∈ A is the action of the agent in the action
space, Ξ± is the learning rate (0 < Ξ± ≀ 1), Ξ³ is the discount rate (0 < Ξ³ ≀ 1), and r denotes the
reward. Q(s, a), also known as the Q-table, holds an action value for every state-action pair of the
environment.
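As an illustrative sketch (not part of the original paper), the update of Eq. (1) can be written for a tabular Q-table as follows; the grid size and function name are assumptions, while Ξ± and Ξ³ follow the values used in Section 4:

```python
import numpy as np

# Minimal sketch of the tabular Q-learning update in Eq. (1).
# The state/action counts and the function name are illustrative assumptions.
n_states, n_actions = 25, 4
Q = np.zeros((n_states, n_actions))      # Q-table Q(s, a)
alpha, gamma = 0.1, 0.99                 # learning rate and discount rate (Section 4 values)

def q_update(s_t, a_t, r_t1, s_t1):
    """Apply one update for the transition (s_t, a_t, r_{t+1}, s_{t+1})."""
    td_target = r_t1 + gamma * np.max(Q[s_t1])        # r_{t+1} + gamma * max_a Q(s_{t+1}, a)
    Q[s_t, a_t] += alpha * (td_target - Q[s_t, a_t])  # Eq. (1)
```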
To select actions, the agent employs an action selection method, as in previous
reinforcement learning research. In this study, the Boltzmann distribution model is adopted as the
action selection function [3]. The action selection probability derived from the Boltzmann
distribution is defined by
P(π‘Žπ‘–|𝑠𝑑) =
exp {
𝑄(𝑠𝑑, π‘Žπ‘–)
𝑇
⁄ }
βˆ‘ exp {
𝑄(𝑠𝑑, π‘Ž)
𝑇
⁄ }
π‘Žβˆˆπ΄
, (2)
where T is a parameter that determines the randomness of action selection. The following
discussion is based on Watkins's Q-learning, with the action selection function derived from the
Boltzmann distribution model.
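As a minimal sketch, the selection probability of Eq. (2) can be implemented as a softmax over the current row of the Q-table; the function name is an assumption, and the max-subtraction is only a numerical-stability device that does not change the distribution:

```python
import numpy as np

def boltzmann_action(Q, s_t, T=0.01, rng=np.random.default_rng()):
    """Sample an action a_i with probability P(a_i | s_t) from Eq. (2)."""
    prefs = Q[s_t] / T
    prefs = prefs - prefs.max()   # subtract the max for numerical stability
    probs = np.exp(prefs)
    probs = probs / probs.sum()   # normalize over the action set A
    return rng.choice(len(probs), p=probs)
```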
2.2. Policy Copy in Transfer Learning
For traditional transfer learning in reinforcement learning, Taylor et al. defined a method for
transferring the learned action-value function [5] [6]. This method is referred to as policy copy in
transfer learning, and is defined as follows:
$$Q_t(s, a) \leftarrow Q_t(s, a) + Q_s(s, a). \qquad (3)$$
Here, Q_t(s, a) is the policy in the target task, including its initial values; it is used for learning
and action selection. Q_s(s, a) is the policy reused from the source task and holds the state-action
values obtained there. In Taylor's method, the agent in the initial state of the target task sums the
initial values of Q_t(s, a) and the transferred policy Q_s(s, a). The agent then learns using the
assimilated policy Q_t(s, a) in the target task. In short, the agent reuses the transferred policy and
simultaneously overwrites it through trial and error.
Originally, Taylor's method includes inter-task mapping, which defines the mapping between the
agent's observable environmental states S and executable actions A. Inter-task mapping is not
discussed in this paper for simplicity of description.
In Taylor's method, if the transferred policy has been sufficiently learned in the source task, the agent
will behave according to that policy even when it is used in a different environment, and deadlock, such as
a collision with obstacles, may occur. This phenomenon is called over-fitting or over-learning.
The agent can overwrite the policy through its behavior in the target task, but this requires additional
learning time.
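As an illustrative sketch, Eq. (3) amounts to an element-wise addition of two Q-tables; the function name and the assumption that the source and target tasks share the same state and action spaces (i.e., no inter-task mapping) follow the setting of this paper:

```python
# Sketch of policy copy, Eq. (3): the transferred source policy is added onto
# the target policy once, at the start of the target task.
def policy_copy(Q_t, Q_s):
    Q_t += Q_s          # Q_t(s, a) <- Q_t(s, a) + Q_s(s, a)
    return Q_t
```

After this single assimilation step, learning and action selection both operate directly on Q_t, so the transferred values are gradually overwritten by the updates of Eq. (1).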
2.3. Adjusting the Transfer Ratio
Takano et al. proposed a method to adjust the action value of the transferred policy [9] by
introducing a parameter ΞΆ into the policy assimilation equation. One of the methods in Takano's study is
defined by
$$Q_c(s, a) \leftarrow \frac{1}{2}(1 - \zeta)\, Q_t(s, a) + \zeta Q_s(s, a), \qquad (4)$$
where ΞΆ is the adjusting parameter (0 < ΞΆ < 1). Q_c(s, a) is the assimilated policy and is used
for action selection. Learning in the target task uses the policy Q_t(s, a). This method
discounts Q_s(s, a) by ΞΆ and, conversely, weights Q_t(s, a) by the amount 1 βˆ’ ΞΆ. Thus, the effect on
the target task can be reduced even if a policy Q_s(s, a) with high action values is reused.
However, Takano's method requires the value of ΞΆ to be determined before operation, which
depends on human intuition and experience. Furthermore, the policy Q_t(s, a) that should be
trusted is also discounted.
2.4. Transfer Rate
Kono et al. proposed the transfer rate to adjust the ratio at which the obtained policy is transferred [10].
The method is based on Takano's method but is simpler. The transfer method with the transfer rate
can be defined as
$$Q_c(s, a) \leftarrow Q_t(s, a) + \tau Q_s(s, a), \qquad (5)$$
where Ο„ (0 < Ο„ ≀ 1) is the transfer rate and Q_c(s, a) is the assimilated policy, which is used for
action selection in the target task. Q_t(s, a) is the policy in the target task and is used for learning
in the target task. Q_s(s, a) is the reused policy learned in the source task. Kono's method
also requires the parameter Ο„ to be determined before the learning operation, and the value setting
depends on human intuition and experience. In this paper, apart from Taylor's method, Kono's
and Takano's methods are collectively referred to as Takano's method.
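Both fixed-ratio rules can be sketched as one-line array operations over NumPy Q-tables; the function names are assumptions, Eq. (4) is reproduced as printed, and the default Ο„ = 0.1 follows the value used in Section 4:

```python
# Sketch of the fixed-ratio assimilation rules of Takano (Eq. (4)) and Kono (Eq. (5)).
def assimilate_takano(Q_t, Q_s, zeta):
    """Eq. (4): weight the target policy by (1 - zeta)/2 and the source policy by zeta."""
    return 0.5 * (1.0 - zeta) * Q_t + zeta * Q_s

def assimilate_kono(Q_t, Q_s, tau=0.1):
    """Eq. (5): keep Q_t unchanged and scale the transferred policy by tau."""
    return Q_t + tau * Q_s    # assimilated policy Q_c(s, a), used only for action selection
```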
3. PROPOSED METHOD
In previous studies, the policy transfer methods either overwrote the transferred policy or used fixed
discount parameters. An automatic method for adjusting the parameter is proposed in this section. In a
negative transfer situation, an action that the agent could originally execute cannot be executed in the
target task because the policy is reused, and thus a collision with an obstacle occurs. To avoid this
situation, the transfer rate should be lowered when the agent becomes unable to act. To do so, the
transfer rate is adjusted with a sigmoid function according to the number of times the agent cannot act.
The sigmoid function is defined as:
$$\frac{1}{1 + e^{-\sigma g}}, \qquad (6)$$
where g is a gain set to an arbitrary value and Οƒ is the input value of the sigmoid function. The
transfer method is reformulated using the sigmoid function as
$$Q_c(s, a) \leftarrow Q_t(s, a) + \frac{1}{1 + e^{-\sigma g}}\, Q_s(s, a), \qquad (7)$$
where Οƒ is decremented when s_t = s_{t+1} and incremented when s_t β‰  s_{t+1}. The inability of
the agent to act means that the current environmental state s_t and the next environmental state
s_{t+1} are the same; thus, Οƒ is decreased by an arbitrary constant d. Conversely, if the action
is performed by reusing the policy, Οƒ is increased by the constant d for each action. This
mechanism is defined as follows:
$$\sigma = \begin{cases} \sigma - d & (s_t = s_{t+1}) \\ \sigma + d & (s_t \neq s_{t+1}) \end{cases} \qquad (8)$$
The above update is calculated for each action, i.e., at every step of the agent. By comparing the
observed environmental states at each time step, the rate at which the policy is reused can be adjusted
in a generalized form, without referring to specific situations such as collisions with obstacles.
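The whole mechanism of Eqs. (6)-(8) can be sketched as a small helper class; the class and method names and the initial value of Οƒ are assumptions (Οƒ is initialized here so that the initial rate is close to 1, matching the behavior described in Section 4), while g = 9.0 and d = 0.1 follow the experimental settings:

```python
import numpy as np

class SigmoidTransferRate:
    """Sketch of the proposed automatic transfer rate adjustment, Eqs. (6)-(8)."""

    def __init__(self, gain=9.0, d=0.1, sigma_init=1.0):
        self.g = gain            # sigmoid gain g
        self.d = d               # per-step increment/decrement d
        self.sigma = sigma_init  # sigmoid input sigma (initial value is an assumption)

    def rate(self):
        """Eq. (6): current transfer rate 1 / (1 + exp(-sigma * g))."""
        return 1.0 / (1.0 + np.exp(-self.sigma * self.g))

    def update(self, s_t, s_t1):
        """Eq. (8): lower sigma when the agent could not move, raise it otherwise."""
        self.sigma += -self.d if s_t == s_t1 else self.d

    def assimilate(self, Q_t, Q_s):
        """Eq. (7): blend the transferred policy into the assimilated policy Q_c."""
        return Q_t + self.rate() * Q_s
```

In use, the agent would call update(s_t, s_t1) after observing each state transition and assimilate(Q_t, Q_s) before selecting the next action, so the reuse ratio is recomputed at every step.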
4. EVALUATION WITH COMPUTER SIMULATION
4.1. Experimental Setup
This experiment aims to confirm the performance of the agent's environmental adaptation using
the proposed method. The shortest path problem is adopted in this study as a basic
evaluation with a single agent. The learning environment used in the computer simulation is shown in
Figure 1. Figures 1 (a) and (b) show the source task environment and the target task environment,
respectively. In this environment, if the agent reaches the goal, it obtains a reward
r = 1 from the environment. The agent can observe its own position. The reinforcement learning
parameters are set as Ξ± = 0.1 and Ξ³ = 0.99. The parameter T of the Boltzmann distribution model
is set as T = 0.01. These parameters are common to all conditions. In each experiment, 300
episodes were conducted for the source and target tasks. The agent can move one grid cell per
step, and it can move in four directions: front, right, back, and left. One episode is defined as the
period from when the agent starts until it reaches the goal.
(a) Source task (b) Target task
Figure 1. Grid world of the shortest path problem. The circle mark is the agent and its initial position in
each simulation. The green square is the goal position. In each sub-figure, red indicates the shortest path
between the start and goal positions found by reinforcement learning.
For the experimental conditions, the learning results are compared among Taylor's method, Takano's
method, and the proposed method. In Taylor's method, reinforcement learning is first executed in the
environment shown in Figure 1 (a), and the acquired policy is transferred to the agent in Figure 1 (b)
for transfer learning. For Takano's method, Kono's formulation was adopted in the implementation of
this experiment. Takano's method is prone to overfitting and requires a considerable amount of time to
learn the target task; therefore, the agent is set up to obtain a negative reward r = βˆ’1 in the event
of a collision with a wall. In addition, it was confirmed in advance that transfer learning was
possible, and the transfer rate was set to Ο„ = 0.1. In the proposed method, the parameter d is set
to 0.1, and the gain g is set to 9.0.
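For reference, the simulation parameters listed above can be collected into a single configuration; the dictionary itself is only an illustrative summary of the values stated in this section:

```python
# Experimental settings from Section 4.1 (values as stated in the text).
CONFIG = {
    "alpha": 0.1,         # learning rate
    "gamma": 0.99,        # discount rate
    "temperature": 0.01,  # Boltzmann temperature T
    "episodes": 300,      # episodes per task (source and target)
    "goal_reward": 1.0,   # reward r on reaching the goal
    "wall_reward": -1.0,  # negative reward r on wall collision (Takano's condition)
    "tau": 0.1,           # fixed transfer rate for Takano's (Kono's) method
    "d": 0.1,             # sigma increment/decrement in the proposed method
    "gain": 9.0,          # sigmoid gain g in the proposed method
}
```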
4.2. Results
In this section, the experimental results under the three conditions are presented. The obtained
learning curves are shown in Figure 2. A learning curve represents performance over the course of
learning, where the horizontal axis is the number of learning episodes and the vertical axis is the
number of steps required to move from the start to the goal. Each learning curve is the average of 10
trials. In Figure 2, the learning curve of reinforcement learning without policy reuse in the target
task is the baseline for performance evaluation. The black line in Figure 2 represents this
reinforcement learning curve.
4.2.1. Result of Taylor’s Method
Taylor's method shows a high number of steps in the early stage of learning, which suggests that an
overfitting state emerges. However, because the policy is relearned in the target task, the learning
curve eventually converges to the optimal solution, i.e., the shortest path, as does the learning curve
of reinforcement learning.
4.2.2. Result of Takano’s Method
The learning curve of Takano's method has a relatively small number of steps from the early stage of
learning compared with the learning curve of reinforcement learning, and its convergence is
considered to be equivalent to that of reinforcement learning. High performance in
the early stages of the learning curve is called a jump start and is one of the basic benefits of
transfer reinforcement learning. It has been confirmed that relearning in the target task is not
possible when the transfer rate is Ο„ = 1.0, because as learning in the target task progresses the
update values become small, and it is time-consuming to cancel out the action values of the transferred
policy.
Figure 2. Learning curves
4.2.3. Result of the Proposed Method
The learning curve of the proposed method also shows a jump start. Its speed of convergence is
slower than that of Takano's method and is close to the learning curve of reinforcement learning.
From the figure, it can be observed that the proposed method performs better than reinforcement
learning. In the initial state, the proposed method is equivalent to Takano's method with a transfer
rate of Ο„ = 1.0, which normally requires more time for learning. However, by adaptively changing the
transfer rate, the proposed method achieves performance comparable to Takano's method with a
transfer rate of Ο„ = 0.1.
4.2.4. Discussion
The transfer ratio was used to quantitatively compare performance. The transfer ratio evaluates the
improvement in learning time relative to a baseline learning curve, such as that of reinforcement
learning. The transfer ratio r can be defined as follows:
π‘Ÿ =
βˆ‘ 𝐿𝑑(𝑑) βˆ’ βˆ‘ 𝐿𝑀𝑑(𝑑)
βˆ‘πΏπ‘€π‘‘(𝑑)
, (9)
where L_t(t) is the learning curve with transfer and t is the number of episodes. Thus,
Ξ£ L_t(t) is the area under the learning curve with transfer. In addition, L_wt(t) is the learning curve
without transfer, which corresponds to plain reinforcement learning in the target task, and Ξ£ L_wt(t) is
the area under the learning curve without transfer. Table 1 presents the transfer ratios of Taylor's
method, Takano's method, and the proposed method.
Table 1. Comparison of the transfer ratio under each condition
In Table 1, it can be observed that only Taylor’s method increases the learning time by
approximately 1.41 times. Takano’s method and the proposed method decrease learning time.
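As a minimal sketch, the transfer ratio of Eq. (9) can be computed directly from two learning curves stored as arrays of steps per episode; the variable and function names are assumptions:

```python
import numpy as np

def transfer_ratio(steps_with_transfer, steps_without_transfer):
    """Eq. (9): r = (sum L_t(t) - sum L_wt(t)) / sum L_wt(t).

    A negative r indicates reduced learning time relative to plain
    reinforcement learning; a positive r indicates increased learning time.
    """
    area_t = np.sum(steps_with_transfer)      # area under the curve with transfer
    area_wt = np.sum(steps_without_transfer)  # area under the baseline curve
    return (area_t - area_wt) / area_wt
```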
Figure 3 shows the transition of the output value of the sigmoid function at the beginning and end
of learning. In the early stage of learning, the value corresponding to the transfer rate is adjusted
by interaction with the environment. Sometimes, the transferred policy is used by the agent
moving to the goal. At the end of learning, the reusing policy may not be used once. It time-
consuming to use the policy at the end of learning than in the early stages of learning because the
agent learns a new policy and obtains a high action value; therefore, it becomes possible to reach
the goal without being affected by the reusing policy. However, if the agent obtains sufficient
policy, there is no need to adjust the ratio of reusing policy. The learning curve of the proposed
method does not converge because the ratio is still adjusted, even at the end of learning.
Therefore, compared with Takano's method, the results suggest that the proposed method allows the
agent to automatically adjust the ratio of policy reuse through interaction with the environment,
without manually adjusting or determining the transfer rate. Moreover, compared with the
previous method, it was shown that the agent can improve its environmental adaptability while
maintaining the characteristic of not overwriting the policy reused from the source task.
(a) Early stage of learning (1 episode) (b) End of learning (300 episodes)
Figure 3. Transition of the value output from the sigmoid function
5. CONCLUSIONS
This paper proposed a novel parameter for transfer reinforcement learning to avoid over-fitting in
the relearning process of the target task. In the proposed method, the ratio at which the policy from
the source task is reused is adjusted by a sigmoid function whose input reflects events such as
collisions with obstacles. Basic experiments were performed on the shortest path problem using transfer
reinforcement learning based on Taylor's method without inter-task mapping. The results suggest
that the learning agent can adapt to the environment. Moreover, compared with the previous
method, it was shown that the agent can improve its environmental adaptability while
maintaining the characteristic of not overwriting the policy reused from the source task. This result
suggests that the approach can be applied to reuse-policy selection in future applications of
reinforcement learning agents.

The proposed method does not show convergence in the target task compared with the previous
method. For future work, it is necessary to discuss how to adjust the reuse ratio of the transferred
policy at the end of learning. In the proposed method, the reuse ratio is adjusted adaptively, so the
effect of the transferred policy cannot be ignored even at the end of learning.
Moreover, the proposed method was evaluated only in a simple task and environment in this study. The
computer simulation environment is a Markov decision process (MDP), which differs from
real-world environments. To demonstrate the effectiveness of the proposed method, it is
necessary to carry out more complex experiments and evaluations of various types. In addition,
the proposed method should be evaluated not only in various environments but also in non-MDP
settings, such as multi-agent systems.
ACKNOWLEDGEMENTS
This study was partially supported by JSPS KAKENHI Grant Number JP18K18133 and the
SUZUKI Foundation. This work was funded by the Japan Atomic Energy Agency (JAEA)
FY2020 Center of World Intelligence Project for Nuclear Science/Technology and Human
Resource Development (Grant No. R02I015).
REFERENCES
[1] B. Ramalingam, A. K. Lakshmanan, M. Ilyas, A. V. Le, and M. R. Elara (2018). "Cascaded Machine-Learning Technique for Debris Classification in Floor-Cleaning Robot Application." Applied Sciences 8(12): 1-19.
[2] R. D'Andrea (2012). "Guest Editorial: A Revolution in the Warehouse: A Retrospective on Kiva Systems and the Grand Challenges Ahead." IEEE Transactions on Automation Science and Engineering 9(4): 638-639.
[3] R. S. Sutton and A. G. Barto (1998). "Reinforcement Learning: An Introduction." MIT Press.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015). "Human-level control through deep reinforcement learning." Nature 518: 529-533.
[5] M. E. Taylor and P. Stone (2009). "Transfer learning for reinforcement learning domains: A survey." Journal of Machine Learning Research 10(Jul): 1633-1685.
[6] M. E. Taylor (2009). "Transfer in Reinforcement Learning Domains." Studies in Computational Intelligence 216: Springer.
[7] A. Lazaric (2012). "Transfer in Reinforcement Learning: A Framework and a Survey." Reinforcement Learning. Adaptation, Learning, and Optimization. Berlin, Heidelberg, Springer. 12: 143-173.
[8] Q. Yang, Y. Zhang, W. Dai, and S. J. Pan (2020). "Transfer Learning." Cambridge University Press.
[9] T. Takano, H. Takase, H. Kawanaka, H. Kita, T. Hayashi, and S. Tsuruoka (2011). "Transfer Learning based on Forbidden Rule Set in Actor-Critic Method." International Journal of Innovative Computing, Information and Control 7(5(B)).
[10] H. Kono, A. Kamimura, K. Tomita, Y. Murata, and T. Suzuki (2014). "Transfer Learning Method Using Ontology for Heterogeneous Multi-agent Reinforcement Learning." International Journal of Advanced Computer Science and Applications 5(10): 156-164.
[11] R. Ramakrishnan, C. Zhang, and J. Shah (2017). "Perturbation Training for Human-Robot Teams." Journal of Artificial Intelligence Research 59: 495-541.
[12] F. L. Da Silva and A. H. R. Costa (2019). "A Survey on Transfer Learning for Multiagent Reinforcement Learning Systems." Journal of Artificial Intelligence Research 69: 645-703.
[13] F. L. Da Silva, G. Warnell, A. H. R. Costa, and P. Stone (2020). "Agents teaching agents: a survey on inter-agent transfer learning." Autonomous Agents and Multi-Agent Systems 34(9).
[14] C. J. C. H. Watkins and P. Dayan (1992). "Q-Learning." Machine Learning 8: 279-292.
Β 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
Β 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
Β 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
Β 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
Β 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
Β 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Β 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
Β 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
Β 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
Β 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
Β 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
Β 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Β 

AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNING

  • 1. International Journal of Artificial Intelligence and Applications (IJAIA), Vol.11, No.5/6, November 2020 DOI: 10.5121/ijaia.2020.11605 47 AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNING Hitoshi Kono1 , Yuto Sakamoto2 , Yonghoon Ji3 and Hiromitsu Fujii4 1 Department of Engineering, Tokyo Polytechnic University, Kanagawa, Japan 2 Department of Electronics and Mechatronics, Tokyo Polytechnic University, Kanagawa, Japan 3 Graduate School for Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Ishikawa, Japan 4 Department of Advanced Robotics, Chiba Institute of Technology, Chiba, Japan ABSTRACT This paper proposes a novel parameter for transfer reinforcement learning to avoid over-fitting when an agent uses a transferred policy from a source task. Learning robot systems have recently been studied for many applications, such as home robots, communication robots, and warehouse robots. However, if the agent reuses the knowledge that has been sufficiently learned in the source task, deadlock may occur and appropriate transfer learning may not be realized. In the previous work, a parameter called transfer rate was proposed to adjust the ratio of transfer, and its contribution include avoiding dead lock in the target task. However, adjusting the parameter depends on human intuition and experiences. Furthermore, the method for deciding transfer rate has not discussed. Therefore, an automatic method for adjusting the transfer rate is proposed in this paper using a sigmoid function. Further, computer simulations are used to evaluate the effectiveness of the proposed method to improve the environmental adaptation performance in a target task, which refers to the situation of reusing knowledge. KEYWORDS Reinforcement Learning, Transfer Learning, Transfer rate, Overfitting, Overlearning 1. INTRODUCTION Machine learning systems and intelligent robot systems are increasingly being deployed to solve practical problems, such as house cleaning and conveyance systems in warehouses [1] [2]. The reinforcement learning framework has been widely discussed in applications of machine learning, such as deep Q-networks [3] [4]. Basic reinforcement learning techniques are usually time consuming. Thus, they are disadvantageous for implementation in actual applications, such as robot systems. To address this problem, transfer learning has been proposed for reinforcement learning [5-8]. Transfer learning theory allows the application of prior knowledge to another similar task. In reinforcement learning, a learning agent is used to draw and transfer policies from previous tasks (source task) to current tasks (target task). Advantages of transfer learning in reinforcement learning include: improved learning speed, fast adaptation to environments, and exploration of more effective performances. Agent systems, including transfer learning, have been successful in some cases. However, transfer learning is not successful in all applications, and in most cases, it is necessary to adjust learning conditions and avoid negative transfer situations, such as deadlock. In the previous years, adjusting the ratio of reusing policy had been proposed as a method to increase the environmental adaptation performance in the target task. The method of adjusting the transfer rate is a mechanism that determines the action value of the
reusing policy, and then uses it in the target task [9] [10]. However, the decision of the transfer rate depends on human intuition and experience, and a method for determining the transfer rate has not been discussed. In addition, the transfer rate is a fixed value, whereas it is desirable to adjust it adaptively according to the environment and behavioral conditions. Therefore, a method for adjusting the transfer rate using a sigmoid function is proposed in this paper. The proposed method adjusts the transfer rate automatically: the value is discounted when the reuse of knowledge causes a bad learning situation in the target task, such as a collision with an obstacle. Computer simulations are used to confirm whether the proposed method can adjust the transfer rate in transfer reinforcement learning. The transferred knowledge is not reused in a negative transfer situation, such as a deadlock, but it can be reused in other cases. In particular, transfer reinforcement learning in multi-agent systems and human-robot teams has recently been discussed, and avoiding deadlock, that is, negative transfer, is considered essential [11-13].

The remainder of this paper is organized as follows. Section 2 provides an overview of reinforcement learning and transfer learning, and discusses previous methods that adjust the ratio of reusing knowledge. Section 3 describes the proposed method for adjusting the transfer ratio with a sigmoid function. Section 4 presents the computer simulation experiments and results used to evaluate the performance of the proposed method against previous methods. Section 5 presents the concluding remarks.

2. PREVIOUS WORKS

2.1. Reinforcement Learning

Reinforcement learning is a machine learning algorithm [3]. A reinforcement learning agent explores the optimal solution via trial and error and builds its own knowledge as a policy. Thus, reinforcement learning does not require training datasets, unlike supervised learning. Many types of reinforcement learning algorithms have been developed in the past decades. In this study, Q-learning was adopted as the learning algorithm [14]. Q-learning is defined by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left\{ r_{t+1} + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t) \right\}, \qquad (1)$$

where $s \in S$ is the state of the environment in a state space, $a \in A$ is the action of the agent in an action space, $\alpha$ is the learning rate ($0 < \alpha \leq 1$), $\gamma$ is the discount rate ($0 < \gamma \leq 1$), and $r$ denotes the reward. $Q(s, a)$, also known as the Q-table, contains all environmental states and the value of each state-action pair. To select an action, an action selection method is employed, as in previous reinforcement learning research. In this study, the Boltzmann distribution model is adopted as the action selection function [3]. The action selection probability derived from the Boltzmann distribution is defined by

$$P(a_i \mid s_t) = \frac{\exp\{Q(s_t, a_i)/T\}}{\sum_{a \in A} \exp\{Q(s_t, a)/T\}}, \qquad (2)$$

where $T$ is a parameter that determines the randomness of action selection. The following discussion is based on Watkins's Q-learning, with the action selection function derived from the Boltzmann distribution model.
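The update rule of Eq. (1) and the Boltzmann action selection of Eq. (2) can be written compactly in code. The following Python sketch is illustrative only and is not the implementation used in this study; the class and method names, and the use of a dictionary as the Q-table, are assumptions made for the example.

```python
import math
import random
from collections import defaultdict


class QLearningAgent:
    """Illustrative tabular Q-learning agent with Boltzmann action selection."""

    def __init__(self, actions, alpha=0.1, gamma=0.99, temperature=0.01):
        self.actions = actions          # action space A
        self.alpha = alpha              # learning rate (0 < alpha <= 1)
        self.gamma = gamma              # discount rate (0 < gamma <= 1)
        self.temperature = temperature  # randomness T of the Boltzmann model
        self.q = defaultdict(float)     # Q-table: (state, action) -> value

    def update(self, s, a, reward, s_next):
        """Eq. (1): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(s, a)]
        self.q[(s, a)] += self.alpha * td_error

    def select_action(self, s):
        """Eq. (2): sample an action from the Boltzmann (softmax) distribution."""
        prefs = [self.q[(s, a)] / self.temperature for a in self.actions]
        m = max(prefs)  # subtract the maximum for numerical stability
        weights = [math.exp(p - m) for p in prefs]
        total = sum(weights)
        return random.choices(self.actions, weights=[w / total for w in weights], k=1)[0]
```

The default values of $\alpha$, $\gamma$, and $T$ in the constructor match the parameter settings later listed in Section 4.1.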
2.2. Policy Copy in Transfer Learning

For traditional transfer learning in reinforcement learning, Taylor et al. defined a method for transferring the learned action-value function [5] [6]. This method is referred to here as policy copy, and is defined as follows:

$$Q_t(s, a) \leftarrow Q_t(s, a) + Q_s(s, a). \qquad (3)$$

Here, $Q_t(s, a)$ is the policy, including its initial values, in the target task; it is used for learning and action selection. $Q_s(s, a)$ is the reused policy from the source task and contains the state-action values obtained there. In Taylor's method, the agent in the initial state of the target task sums the initial values of $Q_t(s, a)$ and the transferred policy $Q_s(s, a)$, and then learns with the assimilated policy $Q_t(s, a)$ in the target task. In other words, the agent reuses the transferred policy and overwrites it through trial and error at the same time. Originally, Taylor's method includes inter-task mapping, which defines the mapping between the agent's observable environmental states $S$ and executable actions $A$; for simplicity of description, inter-task mapping is not discussed in this paper. In Taylor's method, if the transferred policy has been sufficiently learned in the source task, the agent will follow that policy when it is used in a different environment, and deadlock, such as collision with obstacles, may occur. This phenomenon is called over-fitting or over-learning. The agent can overwrite the policy through its behavior in the target task, but this requires additional learning time.

2.3. Adjusting the Transfer Ratio

Takano et al. proposed a method that adjusts the action value of the transferred policy [9] by introducing a parameter $\zeta$ into the policy assimilation equation. One of the methods in Takano's study is defined by

$$Q_c(s, a) \leftarrow \frac{1}{2}(1 - \zeta)\, Q_t(s, a) + \zeta Q_s(s, a), \qquad (4)$$

where $\zeta$ is the adjusting parameter ($0 < \zeta < 1$). $Q_c(s, a)$ is the assimilated policy and is used for action selection, whereas $Q_t(s, a)$ is used for learning in the target task. This method discounts $Q_s(s, a)$ by $\zeta$ and, conversely, uses $Q_t(s, a)$ by the amount $1 - \zeta$. Thus, it is possible to reduce the effect on the target task even if a policy $Q_s(s, a)$ with high action values is reused. However, Takano's method requires the value of $\zeta$ to be determined before operation, which depends on human intuition and experience. Furthermore, the policy $Q_t(s, a)$, which should be trusted, is also discounted.

2.4. Transfer Rate

Kono et al. proposed the transfer rate to adjust the ratio of transfer of the obtained policy [10]. The method is based on Takano's method, and it is similar but simpler. The transfer method with the transfer rate can be defined as

$$Q_c(s, a) \leftarrow Q_t(s, a) + \tau Q_s(s, a), \qquad (5)$$

where $\tau$ ($0 < \tau \leq 1$) is the transfer rate and $Q_c(s, a)$ is the assimilated policy used for action selection in the target task. $Q_t(s, a)$ is the policy in the target task and is used for learning there. $Q_s(s, a)$ is the reused policy learned in the source task. Kono's method also requires the parameter $\tau$ to be determined before the learning operation, and the value setting depends on human intuition and experience. In this paper, apart from Taylor's method, Kono's and Takano's methods are collectively called Takano's method.
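For comparison, the three assimilation rules of Eqs. (3)-(5) can be sketched as follows. This is a minimal illustration assuming Q-tables stored as Python dictionaries keyed by (state, action) pairs; the function names are not from the original works.

```python
def policy_copy(q_target, q_source):
    """Eq. (3), Taylor-style copy: add the source values into the target table once,
    then learn on (and gradually overwrite) the combined table in the target task."""
    for key, value in q_source.items():
        q_target[key] = q_target.get(key, 0.0) + value
    return q_target


def assimilate_takano(q_target, q_source, zeta):
    """Eq. (4): assimilated policy Q_c = 0.5 * (1 - zeta) * Q_t + zeta * Q_s.
    Q_c is only used for action selection; Q_t keeps being updated by learning."""
    keys = set(q_target) | set(q_source)
    return {k: 0.5 * (1.0 - zeta) * q_target.get(k, 0.0) + zeta * q_source.get(k, 0.0)
            for k in keys}


def assimilate_kono(q_target, q_source, tau):
    """Eq. (5): assimilated policy Q_c = Q_t + tau * Q_s, with a fixed transfer rate tau."""
    keys = set(q_target) | set(q_source)
    return {k: q_target.get(k, 0.0) + tau * q_source.get(k, 0.0) for k in keys}
```

The difference between Eqs. (4) and (5) is visible directly in the two assimilation functions: Takano's rule discounts the target policy by $1 - \zeta$, whereas Kono's rule leaves the target policy untouched and only scales the transferred policy.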
3. PROPOSED METHOD

In the previous studies, the transfer methods either overwrote the transferred policy or discounted it with a fixed parameter. An automatic method for adjusting this parameter is proposed in this section. In a negative transfer situation, an action that the agent could originally execute cannot be executed in the target task because the policy is reused, and a collision with an obstacle therefore occurs. To avoid this situation, the value of the transfer rate should be lowered whenever the agent becomes unable to act. The value is adjusted with a sigmoid function according to the number of times the agent cannot act. The sigmoid function is defined as

$$\frac{1}{1 + e^{-\sigma g}}, \qquad (6)$$

where $g$ is a gain set to an arbitrary value and $\sigma$ is the input value of the sigmoid function. The transfer method is reformulated using the sigmoid function as

$$Q_c(s, a) \leftarrow Q_t(s, a) + \frac{1}{1 + e^{-\sigma g}} Q_s(s, a). \qquad (7)$$

The input $\sigma$ is decremented or incremented depending on whether $s_t = s_{t+1}$ or $s_t \neq s_{t+1}$, respectively. The inability of the agent to act means that the current environmental state $s_t$ and the next environmental state $s_{t+1}$ are the same; in this case, $\sigma$ is decreased by an arbitrary constant $d$. Conversely, if an action is performed successfully while reusing the policy, $\sigma$ is increased by the constant $d$ at each action. This mechanism is defined as follows:

$$\sigma = \begin{cases} \sigma - d & (s_t = s_{t+1}) \\ \sigma + d & (s_t \neq s_{t+1}) \end{cases}. \qquad (8)$$

This update is calculated for each action, that is, at each step of the agent. Because it compares the environmental state observed at each time step, the ratio of policy reuse can be adjusted in a generalized form, regardless of the actual situation, such as a collision with an obstacle.
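A minimal sketch of the proposed adjustment, assuming the same dictionary-based Q-tables as above, is shown below. The class name and the initial value of $\sigma$ are assumptions: the paper does not state the initial $\sigma$, but Section 4.2.3 reports that the initial behavior is equivalent to a transfer rate of $\tau = 1.0$, so the sketch starts from $\sigma = 1.0$, which with $g = 9.0$ gives a rate close to 1.

```python
import math


class AutoTransferRate:
    """Illustrative sketch of the proposed automatic adjustment (Eqs. 6-8):
    the effective transfer rate is sigmoid(sigma * g), and sigma moves up or
    down by a constant d depending on whether the agent actually changed state."""

    def __init__(self, gain=9.0, delta=0.1, sigma_init=1.0):
        self.gain = gain          # gain g of the sigmoid (9.0 in the experiments)
        self.delta = delta        # constant d added to or subtracted from sigma (0.1)
        self.sigma = sigma_init   # input of the sigmoid; initial value is an assumption

    def rate(self):
        """Eq. (6): current value of 1 / (1 + exp(-sigma * g))."""
        return 1.0 / (1.0 + math.exp(-self.sigma * self.gain))

    def observe_step(self, s_t, s_next):
        """Eq. (8): decrease sigma when the agent could not act (s_t == s_next),
        increase it otherwise."""
        if s_t == s_next:
            self.sigma -= self.delta
        else:
            self.sigma += self.delta

    def assimilate(self, q_target, q_source):
        """Eq. (7): Q_c = Q_t + sigmoid(sigma * g) * Q_s, used for action selection."""
        tau = self.rate()
        keys = set(q_target) | set(q_source)
        return {k: q_target.get(k, 0.0) + tau * q_source.get(k, 0.0) for k in keys}
```

In use, `observe_step` would be called once per agent step, and `assimilate` would provide the policy used for action selection, while learning continues to update the target-task Q-table as before.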
4. EVALUATION WITH COMPUTER SIMULATION

4.1. Experimental Setup

This experiment aims to confirm the environmental adaptation performance of an agent using the proposed method. The shortest path problem is adopted as a basic evaluation with a single agent. The learning environment of the computer simulation is shown in Figure 1; Figures 1 (a) and (b) show the source task and target task environments, respectively. In this environment, the agent obtains a reward $r = 1$ from the environment when it reaches the goal, and the agent can observe its own location. The reinforcement learning parameters are set to $\alpha = 0.1$ and $\gamma = 0.99$, and the parameter $T$ of the Boltzmann distribution model is set to $T = 0.01$. These parameters are common to all conditions. In each experiment, 300 episodes were conducted for both the source and target tasks. The agent can move by one grid cell per step in four directions: front, right, back, and left. The period in which the agent moves from the start to the goal is counted as a single episode.

(a) Source task (b) Target task
Figure 1. Grid world of the shortest path problem. The circle mark is the agent and its initial position in each simulation, the green square is the goal position, and in each sub-figure the red line is the shortest path between the start and goal positions obtained by reinforcement learning.

For the experimental conditions, the learning results are compared among Taylor's method, Takano's method, and the proposed method. In Taylor's method, reinforcement learning is executed in the environment of Figure 1 (a), and the acquired policy is transferred to the agent in Figure 1 (b) for transfer learning. For Takano's method, Kono's formulation was adopted in the implementation of this experiment. Because Takano's method clearly over-fits and requires a considerable amount of time to learn in the target task, the agent is set up to receive a negative reward $r = -1$ when it collides with a wall. In addition, it was confirmed in advance that transfer learning was possible, and the transfer rate was set to $\tau = 0.1$. In the proposed method, the parameter $d$ is set to 0.1 and the gain $g$ is set to 9.0.
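The experimental setup can be sketched as follows, assuming the QLearningAgent class from the earlier sketch. The grid dimensions, wall set, and start/goal coordinates are placeholders: the actual maze layouts of Figure 1 are not reproduced in the text, so this sketch only illustrates the episode loop, the one-cell moves in four directions, and the reward $r = 1$ at the goal.

```python
# Minimal sketch of the shortest-path experiment in Section 4.1.
# Walls, grid size, start, and goal below are placeholders, not the paper's maps.

ACTIONS = ["front", "right", "back", "left"]
MOVES = {"front": (0, -1), "right": (1, 0), "back": (0, 1), "left": (-1, 0)}


def step(state, action, width=10, height=10, walls=frozenset()):
    """Move one grid cell; stay in place on a wall or boundary collision."""
    dx, dy = MOVES[action]
    nx, ny = state[0] + dx, state[1] + dy
    if not (0 <= nx < width and 0 <= ny < height) or (nx, ny) in walls:
        return state  # blocked: the state does not change
    return (nx, ny)


def run_episode(agent, start, goal, max_steps=10_000):
    """One episode: move from start to goal, reward r = 1 only at the goal."""
    state, steps = start, 0
    while state != goal and steps < max_steps:
        action = agent.select_action(state)
        next_state = step(state, action)
        reward = 1.0 if next_state == goal else 0.0
        agent.update(state, action, reward, next_state)
        state, steps = next_state, steps + 1
    return steps
```

Running 300 such episodes per task and recording the returned step counts yields the learning curves analyzed in the next subsection.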
4.2. Results

In this section, the experimental results under the three conditions are presented. The obtained learning curves are shown in Figure 2. A learning curve represents the performance over the learning progress: the horizontal axis is the number of learning episodes, and the vertical axis is the number of steps required from the start to the goal. Each learning curve is the average of 10 trials. In Figure 2, the learning curve of reinforcement learning that does not reuse any policy in the target task serves as the baseline for performance evaluation and is drawn as the black line.

Figure 2. Learning curves

4.2.1. Result of Taylor's Method

Taylor's method shows a high number of steps in the early stage of learning, which indicates that an over-fitting state emerges. However, because the policy is relearned in the target task, the learning curve converges to the optimal solution, i.e., the shortest path, as does the learning curve of reinforcement learning.

4.2.2. Result of Takano's Method

The learning curve of Takano's method has a relatively small number of steps from the early stage of learning compared with the learning curve of reinforcement learning, and its convergence is considered equivalent to that of reinforcement learning. High performance in the early stage of the learning curve is called a jump start, and it is one of the basic benefits of transfer reinforcement learning. It has been confirmed that relearning in the target task is not possible when the transfer rate is $\tau = 1.0$: as learning in the target task progresses, the update values become small, and it is time-consuming to cancel the action values of the transferred policy.

4.2.3. Result of the Proposed Method

The learning curve of the proposed method also shows a jump start. Its convergence is slower than that of Takano's method and close to the learning curve of reinforcement learning; from the figure, it can be observed that the proposed method still performs better than reinforcement learning. In the initial state, the proposed method is equivalent to Takano's method with a transfer rate of $\tau = 1.0$, which normally requires a long learning time. However, because the transfer rate is changed adaptively, the performance of the proposed method approaches that of Takano's method with a transfer rate of $\tau = 0.1$.

4.2.4. Discussion

The transfer ratio is used to compare the performance quantitatively. It evaluates the improvement in learning time relative to a baseline learning curve, here the curve of reinforcement learning without transfer. The transfer ratio $r$ can be defined as follows:

$$r = \frac{\sum L_t(t) - \sum L_{wt}(t)}{\sum L_{wt}(t)}, \qquad (9)$$

where $L_t(t)$ is the learning curve with transfer and $t$ is the number of episodes; $\sum L_t(t)$ is therefore the area under the learning curve with transfer. Similarly, $L_{wt}(t)$ is the learning curve without transfer, i.e., reinforcement learning in the target task, and $\sum L_{wt}(t)$ is the area under the learning curve without transfer.
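A direct transcription of Eq. (9), assuming each learning curve is given as a list of steps-to-goal per episode (averaged over trials), could look like the following; the function name is illustrative.

```python
def transfer_ratio(steps_with_transfer, steps_without_transfer):
    """Eq. (9): (area under the transfer learning curve - area under the baseline curve)
    divided by the area under the baseline curve. Each argument is a list of
    steps-to-goal per episode; a negative value indicates reduced learning time."""
    area_t = sum(steps_with_transfer)      # area under the curve with transfer
    area_wt = sum(steps_without_transfer)  # area under the baseline curve
    return (area_t - area_wt) / area_wt
```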
Table 1 presents the transfer ratios of Taylor's method, Takano's method, and the proposed method after evaluation.

Table 1. Comparison of transfer ratios under each condition

Table 1 shows that only Taylor's method increases the learning time, by approximately 1.41 times; Takano's method and the proposed method both decrease the learning time. Figure 3 shows the transition of the output value of the sigmoid function at the beginning and at the end of learning. In the early stage of learning, the value corresponding to the transfer rate is adjusted through interaction with the environment, and the transferred policy is sometimes used while the agent moves to the goal. At the end of learning, the reused policy may not be used even once. The policy is needed less at the end of learning than in the early stages because the agent has learned a new policy with high action values and can therefore reach the goal without being affected by the reused policy. Once the agent has obtained a sufficient policy, there is no need to keep adjusting the ratio of policy reuse; the learning curve of the proposed method does not converge because the ratio is still being adjusted even at the end of learning. Compared with Takano's method, the results therefore suggest that the proposed method allows the agent to adjust the ratio of policy reuse automatically through interaction with the environment, without manually setting or determining the transfer rate. Moreover, compared with the previous method, the agent can improve its environmental adaptability while maintaining the characteristic of not overwriting the policy reused from the source task.

(a) Early stage of learning (1 episode) (b) End of learning (300 episodes)
Figure 3. Transition of the value output from the sigmoid function

5. CONCLUSIONS

This paper proposed a novel parameter for transfer reinforcement learning to avoid over-fitting in the relearning process of the target task. In the proposed method, the ratio of reusing the policy from the source task is adjusted by a sigmoid function whose input reflects events such as collisions with obstacles. Basic experiments were performed on the shortest path problem with transfer reinforcement learning based on Taylor's method without inter-task mapping. The results suggest that the learning agent can adapt to the environment. Moreover, compared with the previous method, it was shown that the agent can improve its environmental adaptability while maintaining the characteristic of not overwriting the policy reused from the source task. This result suggests that the method can be applied to the selection of reused policies in future applications of reinforcement learning agents.

The proposed method does not show convergence in the target task compared with the previous method. For future work, it is necessary to discuss how to adjust the ratio of reusing the transferred policy at the end of learning: because the ratio is adjusted adaptively, the effect of the transferred policy cannot be ignored even at the end of learning.
Moreover, the proposed method was evaluated only in a simple task and environment in this study. The computer simulation environment is a Markov decision process (MDP), which differs from real-world environments. To demonstrate the effectiveness of the proposed method, more complex experiments and evaluations of various types are necessary. In addition, the proposed method should be evaluated not only in various environments but also in non-MDP settings, such as multi-agent systems.

ACKNOWLEDGEMENTS

This study was partially supported by JSPS KAKENHI Grant Number JP18K18133 and the SUZUKI Foundation. This work was funded by the Japan Atomic Energy Agency (JAEA) FY2020 Center of World Intelligence Project for Nuclear Science/Technology and Human Resource Development (Grant No. R02I015).

REFERENCES

[1] B. Ramalingam, A. K. Lakshmanan, M. Ilyas, A. V. Le, and M. R. Elara (2018). "Cascaded Machine-Learning Technique for Debris Classification in Floor-Cleaning Robot Application." Applied Sciences 8(12): 1-19.
[2] R. D'Andrea (2012). "Guest Editorial: A Revolution in the Warehouse: A Retrospective on Kiva Systems and the Grand Challenges Ahead." IEEE Transactions on Automation Science and Engineering 9(4): 638-639.
[3] R. S. Sutton and A. G. Barto (1998). "Reinforcement Learning: An Introduction." MIT Press.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015). "Human-level control through deep reinforcement learning." Nature 518: 529-533.
[5] M. E. Taylor and P. Stone (2009). "Transfer learning for reinforcement learning domains: A survey." Journal of Machine Learning Research 10(Jul): 1633-1685.
[6] M. E. Taylor (2009). "Transfer in Reinforcement Learning Domains." Studies in Computational Intelligence 216: Springer.
[7] A. Lazaric (2012). "Transfer in Reinforcement Learning: A Framework and a Survey." Reinforcement Learning. Adaptation, Learning, and Optimization. Berlin, Heidelberg, Springer. 12: 143-173.
[8] Q. Yang, Y. Zhang, W. Dai, and S. J. Pan (2020). "Transfer Learning." Cambridge University Press.
[9] T. Takano, H. Takase, H. Kawanaka, H. Kita, T. Hayashi, and S. Tsuruoka (2011). "Transfer Learning based on Forbidden Rule Set in Actor-Critic Method." International Journal of Innovative Computing, Information and Control 7(5(B)).
[10] H. Kono, A. Kamimura, K. Tomita, Y. Murata, and T. Suzuki (2014). "Transfer Learning Method Using Ontology for Heterogeneous Multi-agent Reinforcement Learning." International Journal of Advanced Computer Science and Applications 5(10): 156-164.
[11] R. Ramakrishnan, C. Zhang, and J. Shah (2017). "Perturbation Training for Human-Robot Teams." Journal of Artificial Intelligence Research 59: 495-541.
[12] F. L. Da Silva and A. H. R. Costa (2019). "A Survey on Transfer Learning for Multiagent Reinforcement Learning Systems." Journal of Artificial Intelligence Research 69: 645-703.
[13] F. L. Da Silva, G. Warnell, A. H. R. Costa, and P. Stone (2020). "Agents teaching agents: a survey on inter-agent transfer learning." Autonomous Agents and Multi-Agent Systems 34(9).
[14] C. J. C. H. Watkins and P. Dayan (1992). "Q-Learning." Machine Learning 8: 279-292.