Optimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithm
FLORIN STOICA, DANA SIMIAN
Department of Informatics
“Lucian Blaga” University of Sibiu
Str. Dr. Ion Ratiu 5-7, 550012, Sibiu
ROMANIA
florin.stoica@ulbsibiu.ro, dana.simian@ulbsibiu.ro
Abstract: - Using Stochastic Learning Automata, we can build robust learning systems without complete knowledge of their environments. A Stochastic Learning Automaton is a learning entity that learns the optimal action to use from its set of possible actions. The algorithm that guarantees the desired learning process is called a reinforcement scheme. A major advantage of reinforcement learning compared to other learning approaches is that it requires no information about the environment except for the reinforcement signal. The drawback is that a reinforcement learning system is slower than other approaches for most applications, since every action needs to be tested a number of times before a good performance is reached. In our approach, the learning process must be much faster than the environment changes, and to accomplish this we need efficient reinforcement schemes. The aim of this paper is to present a reinforcement scheme which satisfies all necessary and sufficient conditions for absolute expediency in a stationary environment. Our scheme provides better results compared with other nonlinear reinforcement schemes. Furthermore, using a Breeder genetic algorithm, we provide the optimal learning parameters for our scheme, in order to reach the best performance.
Key-Words: - Reinforcement Learning, Breeder genetic algorithm, Stochastic Learning Automata.
1 Introduction
An automaton is a machine or control mechanism designed to automatically follow a predetermined sequence of operations or to respond to encoded instructions. The term stochastic emphasizes the adaptive nature of the automaton we describe here: it does not follow predetermined rules, but adapts to changes in its environment. This adaptation is the result of the learning process. Learning is defined as any permanent change in behaviour as a result of past experience, and a learning system should therefore have the ability to improve its behaviour with time, toward a final goal.
The stochastic automaton attempts a solution of the
problem without any information on the optimal action.
One action is selected at random, the response from the
environment is observed, action probabilities are updated
based on that response, and the procedure is repeated. A
stochastic automaton acting as described to improve its
performance is called a learning automaton. The
algorithm that guarantees the desired learning process is
called a reinforcement scheme [5].
The response values from the environment can be represented in three different models. In the P-model the response values are either 0 or 1, in the S-model the response values are continuous in the range (0, 1), and in the Q-model the values belong to a finite set of discrete values in the range (0, 1). In this paper we use the P-model for our new reinforcement scheme.
The aim of this paper is to present a new reinforcement scheme with better performance than other existing schemes ([5]-[7], [15]-[17]), and then to optimize its learning parameters using a Breeder genetic algorithm.
The remainder of this paper is organized as follows. In section 2 we present the mathematical model of a stochastic learning automaton with variable structure. Section 3 presents the theoretical basis of absolutely expedient reinforcement schemes. A new reinforcement scheme is presented in section 4, and its optimization using a Breeder genetic algorithm is detailed in section 5. Conclusions are presented in section 6.
2 Mathematical model of Variable Structure Automaton
The mathematical model of a stochastic automaton with variable structure is defined by a triple {α, c, β}, where α = {α_1, α_2, ..., α_r} represents a finite set of actions, being the input to the environment, β = {β_1, β_2} represents a binary response set, and c = {c_1, c_2, ..., c_r} is a set of penalty probabilities, where c_i is the probability that action α_i will result in an unfavourable response.
α(n) is an element of the set {α_1, α_2, ..., α_r} and represents the action selected by the automaton at time instant n (n = 0, 1, 2, ...). Given that β(n) = 0 is a favourable outcome and β(n) = 1 is an unfavourable outcome at time instant n, the element c_i of c is defined mathematically by:

c_i = P(β(n) = 1 | α(n) = α_i), i = 1, 2, ..., r
The environment can further be split up in two types,
stationary and nonstationary. In a stationary environment
the penalty probabilities will never change. In a
nonstationary environment the penalties will change
over time. In the following we will consider only
stationary random environments.
In order to describe the reinforcement schemes, we define p(n), the vector of action probabilities:

p_i(n) = P(α(n) = α_i), i = 1, ..., r

Updating the action probabilities can be represented as follows:

p(n+1) = T[p(n), α(n), β(n)]

where T is a mapping. This formula says that the next probability vector p(n+1) is computed from the current vector p(n), the selected action α(n) and the environment response β(n). If p(n+1) is a linear function of p(n), the reinforcement scheme is said to be linear; otherwise it is a nonlinear scheme.
A learning automaton generates a sequence of actions on
the basis of its interaction with the environment. If the
automaton is “learning” in the process, its performance
must be superior to “intuitive” methods. The evaluation of the performance of a learning automaton requires the definition of a quantitative norm of behavior [7].
We define a quantity M(n) as the average penalty for a
given action probability vector:
$$M(n) = P(\beta(n) = 1 \mid p(n)) = \sum_{i=1}^{r} P(\beta(n) = 1 \mid \alpha(n) = \alpha_i) \cdot P(\alpha(n) = \alpha_i) = \sum_{i=1}^{r} c_i\, p_i(n)$$
An automaton is absolutely expedient if the expected value of the average penalty at one iteration step is strictly less than it was at the previous step, for all steps:

E[M(n+1) | p(n)] < M(n) for all n [8].
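For example, for a two-action automaton with penalty probabilities c = (0.2, 0.8) and current probabilities p(n) = (0.5, 0.5), the average penalty is M(n) = 0.2 · 0.5 + 0.8 · 0.5 = 0.5; an absolutely expedient scheme must, on average, shift p(n) toward the action with the smaller penalty probability, so that M decreases at every step (here, toward its lower bound M = 0.2).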
Absolutely expedient learning schemes are presently the
only class of schemes for which necessary and sufficient
conditions of design are available. The algorithm we will
present in this paper is derived from a nonlinear
absolutely expedient reinforcement scheme presented in
[17].
3 Absolutely expedient reinforcement
schemes
The reinforcement scheme is the basis of the learning
process for learning automata. The general solution for
absolutely expedient schemes was found by
Lakshmivarahan and Thathachar [11].
A learning automaton may send its action to multiple environments at the same time. In that case, the action of the automaton results in a vector of responses from the environments (or “teachers”). In a stationary N-teacher P-model environment, if the automaton produced the action α_i at time instant n and the environment responses are β^j, j = 1, ..., N, then the vector of action probabilities p(n) is updated as follows [7]:
$$p_i(n+1) = p_i(n) + \left[\frac{1}{N}\sum_{k=1}^{N}\beta^{k}\right]\sum_{\substack{j=1\\ j\ne i}}^{r}\varphi_j(p(n)) - \left[1-\frac{1}{N}\sum_{k=1}^{N}\beta^{k}\right]\sum_{\substack{j=1\\ j\ne i}}^{r}\psi_j(p(n)) \quad (1)$$

$$p_j(n+1) = p_j(n) - \left[\frac{1}{N}\sum_{k=1}^{N}\beta^{k}\right]\varphi_j(p(n)) + \left[1-\frac{1}{N}\sum_{k=1}^{N}\beta^{k}\right]\psi_j(p(n)) \quad (2)$$
for all j ≠ i, where the functions φ_i and ψ_i satisfy the following conditions:
$$\frac{\varphi_1(p(n))}{p_1(n)} = \dots = \frac{\varphi_r(p(n))}{p_r(n)} = \lambda(p(n)) \le 0 \quad (3)$$

$$\frac{\psi_1(p(n))}{p_1(n)} = \dots = \frac{\psi_r(p(n))}{p_r(n)} = \mu(p(n)) \le 0 \quad (4)$$

$$p_i(n) + \sum_{\substack{j=1\\ j\ne i}}^{r} \varphi_j(p(n)) > 0 \quad (5)$$

$$p_i(n) - \sum_{\substack{j=1\\ j\ne i}}^{r} \psi_j(p(n)) < 1 \quad (6)$$

$$p_j(n) + \psi_j(p(n)) > 0 \quad (7)$$

$$p_j(n) - \varphi_j(p(n)) < 1 \quad (8)$$

for all j ∈ {1, ..., r} \ {i}.
The conditions (5)-(8) ensure that 0 < p_k(n) < 1, k = 1, ..., r [16].
Theorem. If the functions λ(p(n)) and μ(p(n)) satisfy the following conditions:

$$\lambda(p(n)) \le 0$$
$$\mu(p(n)) \le 0 \quad (9)$$
$$\lambda(p(n)) + \mu(p(n)) < 0$$
then the automaton with the reinforcement scheme given
in (1)-(2) is absolutely expedient in a stationary
environment.
The proof of this theorem can be found in [9].
4 A new nonlinear reinforcement
scheme
Because the above theorem is also valid for a single-teacher
model, we can define a single environment
response that is a function f of many teacher outputs.
Thus, we can update the above algorithm as follows:
$$p_i(n+1) = p_i(n) + f \cdot \big[-\delta(1-\theta)H(n)\big] \cdot [1 - p_i(n)] - (1-f) \cdot \big[-\theta(1-\delta)\big] \cdot [1 - p_i(n)]$$

$$p_j(n+1) = p_j(n) - f \cdot \big[-\delta(1-\theta)H(n)\big] \cdot p_j(n) + (1-f) \cdot \big[-\theta(1-\delta)\big] \cdot p_j(n) \quad (10)$$

for all j ≠ i, i.e.:

$$\psi_k(p(n)) = -\theta(1-\delta)\,p_k(n)$$
$$\varphi_k(p(n)) = -\delta(1-\theta)H(n)\,p_k(n)$$
where the learning parameters θ and δ are real values which satisfy 0 < θ < 1 and 0 < δ < 1.
The function H is defined as:
$$H(n) = \min\left\{1;\ \max\left\{\min\left(\frac{p_i(n)}{\delta(1-\theta)(1-p_i(n))} - \varepsilon,\ \min_{\substack{j=1,\dots,r\\ j\ne i}}\frac{1-p_j(n)}{\delta(1-\theta)\,p_j(n)} - \varepsilon\right);\ 0\right\}\right\}$$

The parameter ε is an arbitrarily small positive real number.
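To make the update rule concrete, a minimal sketch of the scheme (10) in Java is given below (Java being the language of the implementation mentioned in section 6); the class name NonlinearScheme, the method names and the concrete value of ε are our illustrative choices, not part of the original implementation. The response f is assumed to be 0 for a favourable outcome and 1 for an unfavourable one, as in the P-model above.

// Sketch of the update rule (10); all names are illustrative.
public class NonlinearScheme {
    private final double theta;      // learning parameter, 0 < theta < 1
    private final double delta;      // learning parameter, 0 < delta < 1
    private final double eps = 1e-6; // the arbitrarily small epsilon above

    public NonlinearScheme(double theta, double delta) {
        this.theta = theta;
        this.delta = delta;
    }

    // H(n) as defined above, for the currently selected action i
    private double h(double[] p, int i) {
        double bound = p[i] / (delta * (1 - theta) * (1 - p[i])) - eps;
        for (int j = 0; j < p.length; j++) {
            if (j == i) continue;
            bound = Math.min(bound, (1 - p[j]) / (delta * (1 - theta) * p[j]) - eps);
        }
        return Math.min(1.0, Math.max(bound, 0.0));
    }

    // Applies (10): i is the chosen action, f the environment response
    public void update(double[] p, int i, int f) {
        double penalty = delta * (1 - theta) * h(p, i); // -lambda(p(n))
        double reward = theta * (1 - delta);            // -mu(p(n))
        for (int j = 0; j < p.length; j++) {
            if (j == i) {
                p[j] += -f * penalty * (1 - p[j]) + (1 - f) * reward * (1 - p[j]);
            } else {
                p[j] += f * penalty * p[j] - (1 - f) * reward * p[j];
            }
        }
    }
}

Note that for f = 0 the probability of the chosen action grows by θ(1−δ)(1−p_i(n)) while every other component shrinks by θ(1−δ)p_j(n), so the components still sum to 1; the symmetric argument holds for f = 1.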
Our reinforcement scheme differs from the schemes given in [5]-[7] and [16], [17] through the definition of the functions H, ψ_k and φ_k.
In the following we prove that our scheme satisfies all the conditions (3)-(9), and thus we can declare it an absolutely expedient reinforcement scheme.
From (3) and (4) we have:

$$\lambda(p(n)) = \frac{\varphi_k(p(n))}{p_k(n)} = \frac{-\delta(1-\theta)H(n)\,p_k(n)}{p_k(n)} = -\delta(1-\theta)H(n)$$

$$\mu(p(n)) = \frac{\psi_k(p(n))}{p_k(n)} = \frac{-\theta(1-\delta)\,p_k(n)}{p_k(n)} = -\theta(1-\delta)$$
The remaining conditions translate to the following:
Condition (5):
$$p_i(n) + \sum_{\substack{j=1\\ j\ne i}}^{r} \varphi_j(p(n)) > 0 \ \Leftrightarrow\ p_i(n) - \delta(1-\theta)H(n)(1-p_i(n)) > 0 \ \Leftrightarrow\ H(n) < \frac{p_i(n)}{\delta(1-\theta)(1-p_i(n))}$$
This condition is satisfied by the definition of the
function H(n) .
Condition (6):
$$p_i(n) - \sum_{\substack{j=1\\ j\ne i}}^{r} \psi_j(p(n)) < 1 \ \Leftrightarrow\ p_i(n) + \theta(1-\delta)(1-p_i(n)) < 1$$

But p_i(n) + θ(1−δ)(1−p_i(n)) < p_i(n) + 1 − p_i(n) = 1, since 0 < θ < 1 and 0 < δ < 1.
Condition (7):
$$p_j(n) + \psi_j(p(n)) > 0 \ \Leftrightarrow\ p_j(n) - \theta(1-\delta)\,p_j(n) > 0$$

for all j ∈ {1, ..., r} \ {i}. But p_j(n) − θ(1−δ)p_j(n) = p_j(n)·(1 − θ(1−δ)) > 0, since 0 < θ < 1, 0 < δ < 1 and 0 < p_j(n) < 1 for all j ∈ {1, ..., r} \ {i}.
Condition (8):
$$p_j(n) - \varphi_j(p(n)) < 1 \ \Leftrightarrow\ p_j(n) + \delta(1-\theta)H(n)\,p_j(n) < 1 \ \Leftrightarrow\ H(n) < \frac{1-p_j(n)}{\delta(1-\theta)\,p_j(n)}$$

for all j ∈ {1, ..., r} \ {i}.
This condition is satisfied by the definition of the
function H(n) .
With all conditions of the equations (1)-(2) satisfied, we
conclude that the reinforcement scheme is a candidate
for absolute expediency.
Furthermore, the functions λ and μ of our nonlinear scheme satisfy:

$$\lambda(p(n)) = -\delta(1-\theta)H(n) \le 0$$
$$\mu(p(n)) = -\theta(1-\delta) < 0$$
$$\lambda(p(n)) + \mu(p(n)) < 0$$

because 0 < θ < 1, 0 < δ < 1 and 0 ≤ H(n) ≤ 1.
In conclusion, the algorithm given in equations (10) is absolutely expedient in a stationary environment.
5 A Breeder genetic algorithm for
reinforcement scheme optimization
In this section we present our approach to the optimization of the new reinforcement scheme using genetic algorithms. The aim is to find optimal values for the learning parameters θ and δ.
Because the parameters are real values, we use a Breeder genetic algorithm for this task, in order to avoid a weak point of classical GAs: their discrete representation of solutions, which limits the power of the optimization process.
The Breeder genetic algorithm, proposed by Mühlenbein and Schlierkamp-Voosen [18], represents solutions (chromosomes) as vectors of real numbers, much closer to reality than classical GAs.
Selection is performed randomly among the T% best elements of the current population, where T is a constant of the algorithm (usually, T = 40 provides the best results). Within each generation, two elements are selected from the T% best chromosomes and the crossover operator is applied to them. The mutation operator is then applied to the child obtained from the mating of the parents. The process is repeated until N-1 new individuals are obtained, where N represents the size of the initial population. The best chromosome (evaluated through the fitness function) is inserted into the new population (1-elitism). Thus, the new population will also have N elements.
5.1 The Breeder genetic operators
Let x = {x_1, x_2, ..., x_n} and y = {y_1, y_2, ..., y_n} be two chromosomes, where x_i ∈ R and y_i ∈ R, i = 1, ..., n. The crossover operator produces a new chromosome whose genes are given by z_i = x_i + α_i·(y_i − x_i), i = 1, ..., n, where α_i is a random variable uniformly distributed over [−δ, 1+δ], and δ depends on the problem to be solved, typically lying in the interval [0, 0.5].
The probability of mutation is typically chosen as 1/n. The mutation scheme is given by x_i = x_i + s_i·r_i·a_i, i = 1, ..., n, where: s_i ∈ {−1, +1} is chosen uniformly at random; r_i is the range of variation of x_i, defined as r_i = r·domain(x_i), where r is a value between 0.1 and 0.5 (typically 0.1) and domain(x_i) is the domain of the variable x_i; and a_i = 2^(−k·α), where α ∈ [0, 1] is chosen uniformly at random and k is the number of bytes used to represent a number in the machine on which the Breeder algorithm is executed (the mutation precision).
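A possible Java rendering of these two operators is sketched below; the representation of a chromosome as a double[], the helper names and the clamping of mutated genes to their domain are our assumptions, not prescriptions of [18].

import java.util.Random;

// Sketch of the Breeder genetic operators described above.
public class BreederOperators {
    private static final Random RND = new Random();

    // Crossover: z_i = x_i + alpha_i * (y_i - x_i), alpha_i uniform in [-d, 1+d]
    public static double[] crossover(double[] x, double[] y, double d) {
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            double alpha = -d + RND.nextDouble() * (1 + 2 * d);
            z[i] = x[i] + alpha * (y[i] - x[i]);
        }
        return z;
    }

    // Mutation: with probability 1/n, x_i = x_i + s_i * r_i * a_i, where
    // r_i = r * (size of the domain of x_i) and a_i = 2^(-k * alpha)
    public static void mutate(double[] x, double lo, double hi, double r, int k) {
        int n = x.length;
        for (int i = 0; i < n; i++) {
            if (RND.nextDouble() < 1.0 / n) {
                int s = RND.nextBoolean() ? 1 : -1;
                double ai = Math.pow(2.0, -k * RND.nextDouble());
                double xi = x[i] + s * r * (hi - lo) * ai;
                x[i] = Math.min(hi, Math.max(lo, xi)); // clamp to the domain
            }
        }
    }
}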
5.2 The Breeder genetic algorithm
With the above definitions of the Breeder genetic
operators, the skeleton of the Breeder genetic algorithm
may be defined as follows:
Procedure Breeder
begin
  t = 0
  Randomly generate an initial population P(t) of N individuals
  Evaluate P(t) using the fitness function
  while (termination criterion not fulfilled) do
    for i = 1 to N-1 do
      Randomly choose two elements from the T% best elements of P(t)
      Apply the crossover operator
      Apply the mutation operator on the child
      Insert the result in the new population P'(t)
    end for
    Choose the best element from P(t) and insert it into P'(t)
    P(t+1) = P'(t)
    t = t + 1
  end while
end
5.3 Optimization of the new nonlinear
reinforcement scheme
In order to find the best values for learning parameters
δ and θ of our reinforcement scheme, let us consider a
simple example. Figure 1 illustrates a grid world in
which a robot navigates. Shaded cells represent barriers.
Fig. 1 A grid world for robot navigation
The current position of the robot is marked by a circle. Navigation is done using four actions α = {N, S, E, W}, the actions denoting the four possible movements along the coordinate directions.
The algorithm used in the learning process is:
Step 1. Choose an action α(n) = α_i based on the action probability vector p(n).
Step 2. Compute the environment response f.
Step 3. Update the action probabilities p(n) according to the new reinforcement scheme.
Step 4. Go to Step 1.
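As an illustration, this loop can be sketched in Java as follows, reusing the NonlinearScheme class sketched in section 4; the sampling helper and all other names are ours, and the response f = 0 rewards the single optimal action, as described below.

import java.util.Random;

// Sketch of the learning loop for the grid-world example.
// Action indices: 0 = N, 1 = S, 2 = E, 3 = W; S is the rewarded action.
public class GridWorldLearning {
    private static final Random RND = new Random();
    static final int OPTIMAL = 1;

    // Runs the loop until p[OPTIMAL] reaches the given target and
    // returns the number of iterations (the "number of steps" below).
    static int run(NonlinearScheme scheme, double[] p, double target) {
        int steps = 0;
        while (p[OPTIMAL] < target) {
            int a = sample(p);              // Step 1: choose an action
            int f = (a == OPTIMAL) ? 0 : 1; // Step 2: response, 0 = reward
            scheme.update(p, a, f);         // Step 3: update probabilities
            steps++;                        // Step 4: repeat
        }
        return steps;
    }

    // Samples an action index according to the probability vector p
    private static int sample(double[] p) {
        double u = RND.nextDouble(), acc = 0.0;
        for (int i = 0; i < p.length; i++) {
            acc += p[i];
            if (u < acc) return i;
        }
        return p.length - 1;
    }

    public static void main(String[] args) {
        double[] p = {0.25, 0.25, 0.25, 0.25}; // uniform initial probabilities
        int steps = run(new NonlinearScheme(0.44, 0.50), p, 0.9999);
        System.out.println("Steps to convergence: " + steps);
    }
}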
Because in the given situation there is a single optimal action, we stop the execution when the probability of the optimal action reaches a certain value (0.9999). As a performance measure, we count how many times the above algorithm is executed until the stop condition is reached, and denote this in the following as the “number of steps”.
The data in Table 1 represent results for different values of θ and δ, using two initial conditions: in the first case all action probabilities are initially equal, and in the second case the optimal action initially has a small probability value (0.0005). In the learning process only one action receives reward: the optimal action, which is a movement to the South.
Average number of steps to reach p_opt = 0.9999

θ      δ      Number of steps                Number of steps
              (4 actions with p_i(0) = 1/4,  (4 actions with p_opt(0) = 0.0005,
              i = 1, ..., 4)                 p_i(0) = 0.9995/3 for i ≠ opt)
0.44   0.50   31.56                          48.06
0.44   0.25   23.87                          60.07
0.44   0.20   22.92                          66.51
0.44   0.15   22.27                          75.29
0.44   0.10   21.15                          96.18
0.33   0.50   31.02                          48.09
0.33   0.25   30.94                          61.05
0.33   0.20   30.03                          67.99
0.33   0.15   29.29                          77.86
0.33   0.10   28.80                          94.28
0.22   0.50   33.12                          52.40
0.22   0.25   39.31                          66.90
0.22   0.20   40.26                          73.26
0.22   0.15   41.98                          84.28
0.22   0.10   42.26                          100.68

Table 1 Convergence rates for a single optimal action of a 4-action automaton in a stationary environment (200 runs for each parameter set)
Analyzing the values in the corresponding columns, we conclude that our algorithm converges to a solution faster than the one obtained in [17] (using another reinforcement scheme) when the optimal action has a small probability value assigned at startup (48.06 versus 54.08 steps). However, it is difficult to find, intuitively, values of the learning parameters that beat the best solution found in [17] (19.51 steps on average) for the test case in which all action probabilities are initially equal.
Using the Breeder genetic algorithm, we can provide the
optimal learning parameters for our scheme, in order to
reach the best performance.
Each chromosome contains two genes, representing the real values δ and θ. The fitness function used for chromosome evaluation is the number of steps needed by the learning process to reach a certain value (0.999) for the probability of the optimal action. In our tests, the parameters of the Breeder algorithm are assigned the following values: δ = 0 (the crossover parameter, not to be confused with the learning parameter δ), r = 0.1, k = 8. The initial population has 400 chromosomes and the algorithm is stopped after 600 generations.
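Under the same assumptions as the sketches above, the fitness of a chromosome (δ, θ) could be evaluated as below; averaging over several runs to smooth out the randomness is our choice (Table 1 averages over 200 runs), not a detail stated for the Breeder setup itself.

// Sketch of the fitness evaluation: steps to reach p_opt = 0.999,
// averaged over 'runs' independent runs; lower fitness is better.
static double fitness(double delta, double theta, int runs) {
    long total = 0;
    for (int r = 0; r < runs; r++) {
        double[] p = {0.25, 0.25, 0.25, 0.25};
        total += GridWorldLearning.run(new NonlinearScheme(theta, delta), p, 0.999);
    }
    return (double) total / runs;
}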
Table 2 shows the results provided by the Breeder genetic algorithm.
Optimal values for learning parameters provided by the Breeder algorithm

                  4 actions with p_i(0) = 1/4,  4 actions with p_opt(0) = 0.0005,
                  i = 1, ..., 4                 p_i(0) = 0.9995/3 for i ≠ opt
δ                 0.175798                      0.477907
θ                 0.874850                      0.298292
Number of steps   9.68                          45.63

Table 2 Optimal values for learning parameters provided by the Breeder genetic algorithm
Comparing the solutions in Tables 1 and 2, we can conclude that the Breeder genetic algorithm is capable of providing the best values for the learning parameters, and thus our scheme was optimized for best performance. In both test cases, the results obtained by the new optimized nonlinear scheme are significantly better than those obtained in [16], [17].
6 Conclusion
The reinforcement scheme presented in this paper satisfies all necessary and sufficient conditions for absolute expediency in a stationary environment, and the nonlinear algorithm based on this scheme is found to converge to the “optimal” action faster than the nonlinear schemes previously defined in [5]-[7], [16], [17].
The learning parameters δ and θ of the new scheme are both situated in the interval (0, 1), which makes their adjustment easier.
Using a Breeder genetic algorithm, we can automatically find the optimal values of the learning parameters for the reinforcement scheme, in order to reach the best performance.
This new reinforcement scheme was used within a simulator for an Intelligent Vehicle Control System, in a multi-agent approach [17]. The entire system was implemented in Java and is based on the JADE platform.
In this real-time environment, the learning process must be much faster than the environment changes, and to accomplish this we need efficient reinforcement schemes. After evaluation, we found the new reinforcement scheme very suitable for applications with requirements for fast learning algorithms.
References:
[1] A. Barto, S. Mahadevan, Recent advances in
hierarchical reinforcement learning, Discrete-Event
Systems journal, Special issue on Reinforcement
Learning, 2003.
[2] R. Sutton, A. Barto, Reinforcement learning: An introduction, MIT Press, Cambridge, MA, 1998.
[3] O. Buffet, A. Dutech, F. Charpillet, Incremental reinforcement learning for designing multi-agent systems, in J. P. Müller, E. Andre, S. Sen, C. Frasson (eds.), Proceedings of the Fifth International Conference on Autonomous Agents, Montreal, Canada, 2001, ACM Press, pp. 31-32.
[4] J. Moody, Y. Liu, M. Saffell, K. Youn, Stochastic direct reinforcement: Application to simple games with recurrence, in Proceedings of Artificial Multiagent Learning, Papers from the 2004 AAAI Fall Symposium, Technical Report FS-04-02.
[5] C. Ünsal, P. Kachroo, J. S. Bay, Simulation Study of
Learning Automata Games in Automated Highway
Systems, 1st IEEE Conference on Intelligent
Transportation Systems (ITSC’97), Boston,
Massachusetts, Nov. 9-12, 1997
[6] C. Ünsal, P. Kachroo, J. S. Bay, Simulation Study of
Multiple Intelligent Vehicle Control using Stochastic
Learning Automata, TRANSACTIONS, the Quarterly
Archival Journal of the Society for Computer Simulation
International, volume 14, number 4, December 1997.
[7] C. Ünsal, P. Kachroo, J. S. Bay, Multiple Stochastic
Learning Automata for Vehicle Path Control in an
Automated Highway System, IEEE Transactions on
Systems, Man, and Cybernetics -part A: systems and
humans, vol. 29, no. 1, 1999
[8] K. S. Narendra, M. A. L. Thathachar, Learning
Automata: an introduction, Prentice-Hall, 1989.
[9] N. Baba, New Topics in Learning Automata: Theory and Applications, Lecture Notes in Control and Information Sciences, Springer-Verlag, Berlin, Germany, 1984.
[10] M. Dorigo, Introduction to the Special Issue on Learning Autonomous Robots, IEEE Trans. on Systems, Man and Cybernetics - part B, Vol. 26, No. 3, 1996, pp. 361-364.
[11] S. Lakshmivarahan, M.A.L. Thathachar, Absolutely Expedient Learning Algorithms for Stochastic Automata, IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-6, 1973, pp. 281-286.
[12] K. S. Narendra, M. A. L. Thathachar, Learning
Automata: an introduction, Prentice-Hall, 1989
[13] C. Rivero, Characterization of the absolutely
expedient learning algorithms for stochastic automata in
a non-discrete space of actions, ESANN'2003
proceedings - European Symposium on Artificial Neural
Networks Bruges (Belgium), 2003, pp. 307-312
[14] K.P. Topon, I. Hitoshi, Reinforcement Learning
Estimation of Distribution Algorithm, Proceedings of the
Genetic and Evolutionary Computation Conference 2003
(GECCO2003)
[15] F. Stoica, D. Simian, Automatic control based on
Wasp Behavioral Model and Stochastic Learning
Automata. Mathematics and Computers in Science and
Engineering Series, Proceedings of 10th WSEAS Int.
Conf. On Mathematical Methods, Computational
Techniques and Intelligent Systems (MAMECTIS '08),
Corfu 2008, 2008, WSEAS Press pp. 289-295
[16] F. Stoica, E. M. Popa, An Absolutely Expedient Learning Algorithm for Stochastic Automata, WSEAS Transactions on Computers, Issue 2, Volume 6, 2007, pp. 229-235.
[17] F. Stoica, E. M. Popa, I. Pah, A new reinforcement
scheme for stochastic learning automata – Application to
Automatic Control, Proceedings of the International
Conference on e-Business, 2008, Porto, Portugal
[18] H. Mühlenbein, D. Schlierkamp-Voosen, The
science of breeding and its application to the breeder
genetic algorithm, Evolutionary Computation, vol. 1,
1994, pp. 335-360