Proceedings of the 12th WSEAS International Conference on AUTOMATIC CONTROL, MODELLING & SIMULATION 
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata 
DANA SIMIAN, FLORIN STOICA 
Department of Informatics 
“Lucian Blaga” University of Sibiu 
Str. Dr. Ion Ratiu 5-7, 550012, Sibiu 
ROMANIA 
dana.simian@ulbsibiu.ro , florin.stoica@ulbsibiu.ro 
Abstract: - Reinforcement schemes represent the basis of the learning process for stochastic learning automata, 
generating their learning behavior. An automaton using a reinforcement scheme can decide the best action, based on 
past actions and environment responses. The aim of this paper is to introduce a new reinforcement scheme for 
stochastic learning automata. We test our scheme and compare it with other nonlinear reinforcement schemes. The results 
reveal a faster convergence of the new scheme to the "optimal" action. 
Key-Words: - Stochastic Learning Automata, Reinforcement Scheme 
1 Introduction 
A stochastic learning automaton is a control mechanism which adapts to changes in its environment based on a learning process. Rule-based systems may perform well in many control problems, but they require modifications for any change in the problem space. 
Narendra, in [5], emphasizes that the major advantage of reinforcement learning compared with other learning approaches is that it needs no information about the environment except the reinforcement signal. A stochastic learning automaton attempts to solve the problem without any information on the optimal action and learns based on the environment response. Initially, equal probabilities are attached to all the actions, and then the probabilities are updated based on the environment response. There are three models for representing the response values. In the P-model, the response values are either 0 or 1; in the S-model the response values are continuous in the range (0, 1); and in the Q-model the values belong to a finite set of discrete values in the range (0, 1). The algorithm behind the desired learning process is named a reinforcement scheme. 
Stochastic learning automata have many applications, 
one of them being the Intelligent Vehicle Control 
System. 
A detailed presentation of stochastic learning automata and reinforcement schemes, as well as of applications in Intelligent Navigation of Autonomous Vehicles in an Automated Highway System, can be found in [10]. 
The aim of this article is to build a new reinforcement scheme with better performance than other existing schemes ([7] – [11]). 
The remainder of this paper is organized as follows. In section 2 we present the theoretical basis of our main results. The new reinforcement scheme is presented in section 3. Experimental results are presented in section 4. Section 5 presents conclusions and further directions of study. 
2 Mathematical model 
The mathematical model of a stochastic automaton is defined by a triple {α, c, β}. α = {α_1, α_2, ..., α_r} is a finite set of actions and defines the input of the environment. β = {β_1, β_2} represents a binary response set and defines the environment response (the outcomes). We will next consider only the P-model. We set β(n) = 0 for a favorable outcome and β(n) = 1 for an unfavorable outcome at the moment n (n = 1, 2, ...). 
c = {c_1, c_2, ..., c_r} is a set of penalty probabilities, where c_i is the probability that action α_i will result in an unfavorable response: 
c_i = P(β(n) = 1 | α(n) = α_i), i = 1, 2, ..., r 
Taking into account the changing of the penalty probabilities over time, environments are classified as stationary and nonstationary. In a stationary environment the penalty probabilities never change, while in a nonstationary environment the penalties change over time. 
In the following we will consider only the case of stationary random environments. 
Each action α_i has associated, at time instant n, a probability 
p_i(n) = P(α(n) = α_i), i = 1, ..., r 
We denote by p(n) the vector of action probabilities: p(n) = {p_i(n), i = 1, ..., r}. 
The next action probability vector p(n+1) depends on the current probability vector p(n), the action α(n) performed at instant n, and the environment response β(n). The update of the action probabilities is given by 
p(n+1) = T[p(n), α(n), β(n)] 
The type of the mapping T gives the type of the reinforcement scheme, which can be linear or nonlinear. There exist also hybrid schemes (see [5]). 
For a given action probability vector p(n), we define the average penalty M(n): 
M(n) = P(β(n) = 1 | p(n)) = Σ_{i=1}^{r} P(β(n) = 1 | α(n) = α_i) · P(α(n) = α_i) = Σ_{i=1}^{r} c_i · p_i(n) 
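As a small illustration, the Java sketch below (our own naming, not code from the paper) computes M(n) from given penalty probabilities c_i and action probabilities p_i(n):

```java
// Minimal sketch: average penalty M(n) = sum_i c_i * p_i(n) for a P-model automaton.
// Array and method names are illustrative, not taken from the paper.
public final class AveragePenalty {
    static double averagePenalty(double[] c, double[] p) {
        double m = 0.0;
        for (int i = 0; i < c.length; i++) {
            m += c[i] * p[i];   // P(beta(n)=1 | alpha(n)=alpha_i) * P(alpha(n)=alpha_i)
        }
        return m;
    }

    public static void main(String[] args) {
        double[] c = {0.2, 0.6, 0.8, 0.4};       // penalty probabilities c_i
        double[] p = {0.25, 0.25, 0.25, 0.25};   // uniform initial action probabilities
        System.out.println("M(n) = " + averagePenalty(c, p));   // prints M(n) = 0.5
    }
}
```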
An important class of automata is that of absolutely expedient automata. For absolutely expedient learning schemes, necessary and sufficient design conditions are available. An automaton is absolutely expedient if (see [3]): 
M(n+1) < M(n), ∀ n ∈ N 
The evaluation of the performance of a learning automaton requires a quantitative basis for assessing the learning behavior. The problem of establishing "norms of behavior" is quite complex, even in the simplest P-model and in stationary random environments. We will consider only the notions we need for introducing our new reinforcement scheme. Many details about the mathematical model of the stochastic learning automaton and the evaluation of its performance can be found in [10]. 
The general solution for absolutely expedient schemes was found by Lakshmivarahan in 1973 ([3]). Our new reinforcement scheme starts from a nonlinear absolutely expedient reinforcement scheme for a stationary N-teacher P-model environment, presented by Ünsal ([10] – [11]). The action of the automaton results in a vector of responses from the environments (or "teachers"). If the automaton performs, at the moment n, the action α_i and the environment responses are β_i^j, j = 1, ..., N, then the vector of action probabilities p(n) is updated as follows ([1]): 
p_i(n+1) = p_i(n) + [(1/N) · Σ_{k=1}^{N} β_i^k] · Σ_{j=1, j≠i}^{r} φ_j(p(n)) − [1 − (1/N) · Σ_{k=1}^{N} β_i^k] · Σ_{j=1, j≠i}^{r} ψ_j(p(n))    (1) 

p_j(n+1) = p_j(n) − [(1/N) · Σ_{k=1}^{N} β_i^k] · φ_j(p(n)) + [1 − (1/N) · Σ_{k=1}^{N} β_i^k] · ψ_j(p(n)),   ∀ j ≠ i    (2) 
The functions φ_j and ψ_j satisfy the following conditions, ∀ j ∈ {1, ..., r}\{i}: 
φ_1(p(n)) / p_1(n) = ... = φ_r(p(n)) / p_r(n) = λ(p(n))    (3) 

ψ_1(p(n)) / p_1(n) = ... = ψ_r(p(n)) / p_r(n) = μ(p(n))    (4) 
p_i(n) + Σ_{j=1, j≠i}^{r} φ_j(p(n)) > 0    (5) 

p_i(n) − Σ_{j=1, j≠i}^{r} ψ_j(p(n)) < 1    (6) 

p_j(n) + ψ_j(p(n)) > 0    (7) 

p_j(n) − φ_j(p(n)) < 1    (8) 

The conditions (5) – (8) ensure that 0 < p_k < 1, k = 1, ..., r. 
If the functions λ(p(n)) and μ(p(n)) satisfy the 
conditions: 
λ(p(n)) ≤ 0 
μ(p(n)) ≤ 0    (9) 
λ(p(n)) + μ(p(n)) < 0 
then the automaton with the reinforcement scheme given 
in (1) – (2) is absolutely expedient in a stationary 
environment ([1], [10]). 
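To make the structure of (1) – (2) easier to follow, here is a minimal Java sketch of one update step with the functions φ_j and ψ_j passed in as parameters; it is only a transcription of the two equations under our own naming, not code from the paper.

```java
import java.util.function.BiFunction;

// Sketch of one update step of the general scheme (1)-(2) for an N-teacher P-model.
// phi.apply(p, j) and psi.apply(p, j) stand for phi_j(p(n)) and psi_j(p(n)); beta holds
// the N responses beta_i^k received for the chosen action i. Names are illustrative only.
public final class GeneralScheme {
    static double[] update(double[] p, int i, double[] beta,
                           BiFunction<double[], Integer, Double> phi,
                           BiFunction<double[], Integer, Double> psi) {
        int r = p.length;
        double avgBeta = 0.0;
        for (double b : beta) avgBeta += b;
        avgBeta /= beta.length;                      // (1/N) * sum_k beta_i^k

        double sumPhi = 0.0, sumPsi = 0.0;
        for (int j = 0; j < r; j++) {
            if (j == i) continue;
            sumPhi += phi.apply(p, j);
            sumPsi += psi.apply(p, j);
        }

        double[] next = new double[r];
        // equation (1): the chosen action alpha_i
        next[i] = p[i] + avgBeta * sumPhi - (1.0 - avgBeta) * sumPsi;
        // equation (2): every other action j != i
        for (int j = 0; j < r; j++) {
            if (j == i) continue;
            next[j] = p[j] - avgBeta * phi.apply(p, j) + (1.0 - avgBeta) * psi.apply(p, j);
        }
        return next;
    }
}
```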
3 Main results. A new nonlinear 
reinforcement scheme 
Because the theorem from [1], which gives the conditions for absolute expediency of a reinforcement scheme in a stationary environment, is also valid for a single-teacher model, we can define a single environment response as a function f which combines the outputs of the N teachers. Thus, we can update Ünsal's algorithm. We have the following update rule: 
p_i(n+1) = p_i(n) + f · (δ · (1−θ) · H(n)) · [1 − p_i(n)] − (1−f) · θ · (1−δ) · [1 − p_i(n)]    (10) 
p_j(n+1) = p_j(n) − f · (δ · (1−θ) · H(n)) · p_j(n) + (1−f) · θ · (1−δ) · p_j(n),   ∀ j ≠ i    (11) 
Hence, our scheme is derived from the reinforcement scheme given in (1) – (2) by setting 
ψ_k(p(n)) = −θ · (1−δ) · p_k(n) 
φ_k(p(n)) = −δ · (1−θ) · H(n) · p_k(n) 
where the learning parameters θ and δ are real values which satisfy the conditions 0 < θ < 1, 0 < δ < 1. 
The function H is defined as: 
H(n) = min{ 1 ; max{ min{ p_i(n) / (δ · (1−θ) · (1 − p_i(n))) − ε ; (1 − p_j(n)) / (δ · (1−θ) · p_j(n)) − ε, j = 1, ..., r, j ≠ i } ; 0 } } 
Parameter ε is an arbitrarily small positive real number. 
Our reinforcement scheme differs from the schemes given in [10] – [11] and [9] by the definition of the functions H, ψ_k and φ_k. 
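As a reading aid, the following Java sketch implements the new scheme exactly as written in (10) – (11), with H(n) computed from its definition above; the class and variable names are ours, and the combined environment response f is simply taken as an argument.

```java
// Sketch of the new nonlinear scheme, equations (10)-(11), with H(n) as defined above.
// Class, field and method names are illustrative, not taken from the paper.
public final class NewScheme {
    final double theta, delta, eps;   // learning parameters 0 < theta, delta < 1; small eps > 0

    NewScheme(double theta, double delta, double eps) {
        this.theta = theta; this.delta = delta; this.eps = eps;
    }

    // H(n) for the chosen action i and the current probability vector p
    double h(double[] p, int i) {
        double d = delta * (1.0 - theta);
        double inner = p[i] / (d * (1.0 - p[i])) - eps;                             // term coming from (5)
        for (int j = 0; j < p.length; j++) {
            if (j != i) inner = Math.min(inner, (1.0 - p[j]) / (d * p[j]) - eps);   // terms coming from (8)
        }
        return Math.min(1.0, Math.max(inner, 0.0));
    }

    // One step of (10)-(11): i is the chosen action, f the combined environment response.
    void update(double[] p, int i, double f) {
        double a = delta * (1.0 - theta) * h(p, i);   // delta*(1-theta)*H(n)
        double b = theta * (1.0 - delta);             // theta*(1-delta)
        for (int j = 0; j < p.length; j++) {
            if (j == i) {
                p[i] = p[i] + f * a * (1.0 - p[i]) - (1.0 - f) * b * (1.0 - p[i]);   // (10)
            } else {
                p[j] = p[j] - f * a * p[j] + (1.0 - f) * b * p[j];                   // (11)
            }
        }
    }
}
```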
We prove next that our scheme satisfies all the conditions of the reinforcement scheme defined in (1) – (2). 
λ(p(n)) = φ_k(p(n)) / p_k(n) = −δ · (1−θ) · H(n) · p_k(n) / p_k(n) = −δ · (1−θ) · H(n) 

μ(p(n)) = ψ_k(p(n)) / p_k(n) = −θ · (1−δ) · p_k(n) / p_k(n) = −θ · (1−δ) 
Condition (5) is satisfied by the definition of function 
H(n): 
p_i(n) + Σ_{j=1, j≠i}^{r} φ_j(p(n)) > 0 ⇔ 
p_i(n) − δ · (1−θ) · H(n) · (1 − p_i(n)) > 0 ⇔ 
δ · (1−θ) · H(n) · (1 − p_i(n)) < p_i(n) ⇔ 
H(n) < p_i(n) / (δ · (1−θ) · (1 − p_i(n))) 
Condition (6) is equivalent to: 
p_i(n) − Σ_{j=1, j≠i}^{r} ψ_j(p(n)) < 1 ⇔ p_i(n) + θ · (1−δ) · (1 − p_i(n)) < 1 
It is satisfied because, for 0 < θ < 1, 0 < δ < 1, we have 
p_i(n) + θ · (1−δ) · (1 − p_i(n)) < p_i(n) + 1 − p_i(n) = 1 
Condition (7) is equivalent to: 
p_j(n) + ψ_j(p(n)) > 0 ⇔ p_j(n) − θ · (1−δ) · p_j(n) > 0, for all j ∈ {1, ..., r}\{i} 
It holds because 
p_j(n) − θ · (1−δ) · p_j(n) = p_j(n) · (1 − θ · (1−δ)) > 0 
for 0 < θ < 1, 0 < δ < 1 and 0 < p_j(n) < 1, for all j ∈ {1, ..., r}\{i}. 
Condition (8) is equivalent to: 
p_j(n) − φ_j(p(n)) < 1 ⇔ 
p_j(n) + δ · (1−θ) · H(n) · p_j(n) < 1 ⇔ 
H(n) < (1 − p_j(n)) / (δ · (1−θ) · p_j(n)), ∀ j ∈ {1, ..., r}\{i} 
This condition is satisfied by the definition of the function H(n). 
Since all the conditions (5) – (8) are satisfied, our reinforcement scheme is a candidate for absolute expediency. 
Furthermore, the functions λ and μ of our nonlinear scheme satisfy the conditions (9): 
λ(p(n)) = −δ · (1−θ) · H(n) ≤ 0 
μ(p(n)) = −θ · (1−δ) < 0 
λ(p(n)) + μ(p(n)) < 0 
In conclusion, we state that the algorithm given in equations (10) – (11) is absolutely expedient in a stationary environment. 
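As a quick numerical sanity check (our own addition, not part of the paper), the following self-contained Java sketch samples random probability vectors and learning parameters, computes H(n), φ_j and ψ_j as defined above, and asserts conditions (5) – (8); the sampling ranges are arbitrary choices.

```java
import java.util.Random;

// Numerical sanity check of conditions (5)-(8) for the new scheme, using
// phi_k = -delta*(1-theta)*H(n)*p_k and psi_k = -theta*(1-delta)*p_k. Our own code.
public final class ConditionsCheck {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double eps = 1e-3;
        for (int trial = 0; trial < 100_000; trial++) {
            int r = 2 + rnd.nextInt(5);                       // 2..6 actions
            double[] p = new double[r];
            double s = 0.0;
            for (int k = 0; k < r; k++) { p[k] = 0.01 + rnd.nextDouble(); s += p[k]; }
            for (int k = 0; k < r; k++) p[k] /= s;            // random interior point of the simplex
            int i = rnd.nextInt(r);                           // chosen action
            double theta = 0.05 + 0.9 * rnd.nextDouble();
            double delta = 0.05 + 0.9 * rnd.nextDouble();
            double d = delta * (1.0 - theta);

            double inner = p[i] / (d * (1.0 - p[i])) - eps;
            for (int j = 0; j < r; j++)
                if (j != i) inner = Math.min(inner, (1.0 - p[j]) / (d * p[j]) - eps);
            double H = Math.min(1.0, Math.max(inner, 0.0));

            check(p[i] - d * H * (1.0 - p[i]) > 0.0);                 // condition (5)
            check(p[i] + theta * (1.0 - delta) * (1.0 - p[i]) < 1.0); // condition (6)
            for (int j = 0; j < r; j++) {
                if (j == i) continue;
                check(p[j] - theta * (1.0 - delta) * p[j] > 0.0);     // condition (7)
                check(p[j] + d * H * p[j] < 1.0);                     // condition (8)
            }
        }
        System.out.println("Conditions (5)-(8) held on all sampled configurations.");
    }

    static void check(boolean ok) { if (!ok) throw new AssertionError("condition violated"); }
}
```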
4 Experimental results 
Reinforcement learning is justified if it is easier to 
implement the reinforcement function than the desired 
behavior, or if the behavior generated by an agent 
presents properties which cannot be directly built. 
For this second reason, reinforcement learning is used in 
autonomous robotics ([2], [10]). 
To demonstrate the performance of our new scheme, we show that our algorithm converges to a solution faster than the one given in [11] and than our algorithm presented in [9]. In order to do this, we use the example from [9]. Figure 1 illustrates a grid world in which a robot navigates. Shaded cells represent barriers. The current position of the robot is marked by a circle. Navigation is done using four actions α = {N, S, E, W}, denoting the four possible movements along the coordinate directions. 
Figure 1 – experimental problem for testing 
The algorithm used is: 
While the probability of the optimal action is lower than a certain value (0.9999) do 
  Step 1. Choose an action α(n) = α_i based on the action probability vector p(n). 
  Step 2. Read the environment responses β_i^j, j = 1, ..., 4, from the sensor modules. 
  Step 3. Compute the combined environment response f. 
  Step 4. Update the action probabilities p(n) according to the predefined reinforcement scheme. 
End while 
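A possible Java sketch of this loop is shown below; it reuses the NewScheme class sketched in section 3, while the sensor responses and their combination into f are placeholders of our own (a deterministic environment rewarding a single action), since they are not specified here.

```java
import java.util.Random;

// Sketch of the learning loop above. The sensor model and the combination function f
// are placeholders (our assumptions); only the loop structure follows the algorithm.
// Reuses the NewScheme class from the sketch in section 3.
public final class LearningLoop {
    static final int OPTIMAL = 0;                  // index of the single rewarded action

    static int sampleAction(double[] p, Random rnd) {
        double u = rnd.nextDouble(), acc = 0.0;
        for (int i = 0; i < p.length; i++) {
            acc += p[i];
            if (u < acc) return i;
        }
        return p.length - 1;                       // guard against rounding
    }

    // Runs one experiment and returns the number of steps needed to reach p_opt = 0.9999.
    static int runOnce(double theta, double delta, Random rnd) {
        double[] p = {0.25, 0.25, 0.25, 0.25};     // actions N, S, E, W
        NewScheme scheme = new NewScheme(theta, delta, 1e-3);
        int steps = 0;
        while (p[OPTIMAL] < 0.9999) {
            int i = sampleAction(p, rnd);                                      // Step 1
            double[] beta = new double[4];                                     // Step 2 (placeholder sensors)
            for (int j = 0; j < 4; j++) beta[j] = (i == OPTIMAL) ? 0.0 : 1.0;  // reward only the optimal action
            double f = 0.0;                                                    // Step 3 (assumed combination:
            for (double b : beta) f += (1.0 - b) / beta.length;                // fraction of favorable responses)
            scheme.update(p, i, f);                                            // Step 4
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println("Converged in " + runOnce(0.33, 0.50, new Random()) + " steps");
    }
}
```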
Table 1 presents the results for different values of θ and δ and for two different initial conditions: in the first case all action probabilities are initially equal, and in the second case the optimal action initially has a small probability value (0.0005). Only one action receives reward (there is a single optimal action). 
Average number of steps to reach p_opt = 0.9999 

                     New alg.                       New alg.
 θ       δ           4 actions with                 4 actions with
                     p_i(0) = 1/4, i = 1,...,4      p_opt(0) = 0.0005, p_i≠opt(0) = 0.9995/3
 0.33    0.50        31.02                           48.09
         0.25        30.94                           61.05
         0.20        30.03                           67.99
         0.15        29.29                           77.86
         0.10        28.80                           94.28
 0.22    0.50        33.12                           52.40
         0.25        39.31                           66.90
         0.20        40.26                           73.26
         0.15        41.98                           84.28
         0.10        42.26                          100.68
 0.11    0.50        53.22                           66.54
         0.25        56.49                           74.19
         0.20        53.18                           84.91
         0.15        66.31                          100.82
         0.10        73.73                          124.60

Table 1: Convergence rates for a single optimal action of a 4-action automaton (200 runs for each parameter set) 
Comparing values from corresponding columns, we 
conclude that our algorithm converges to a solution 
faster than the one obtained in [9] when the optimal 
action has a small probability value assigned at startup. 
5 Conclusions 
Reinforcement learning is a topic of great interest in the machine learning and artificial intelligence communities. It offers a way of programming agents by reward and punishment without needing to specify how the task (i.e., the behavior) is to be achieved. 
In this article we built a new reinforcement scheme. We proved that our new scheme satisfies all necessary and sufficient conditions for absolute expediency in a stationary environment, and the nonlinear algorithm based on this scheme converges to the "optimal" action faster than the nonlinear schemes previously defined in [11] and [9]. The values of both learning parameters are situated in the interval (0, 1), giving more stability to our scheme. 
Using this new reinforcement scheme, we developed a simulator for an Intelligent Vehicle Control System in a multi-agent approach. The entire system was implemented in Java and is based on the JADE platform. The simulator will be presented and analyzed in another article. 
As a further improvement, we will use a genetic algorithm for finding the optimal values of the learning parameters of our reinforcement scheme. The fitness function for the evaluation of chromosomes will take into account the number of steps necessary for the learning process to satisfy the stop condition. 
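One possible shape of such a fitness function (our assumption, not the authors' implementation) is sketched below, reusing the LearningLoop class from section 4: a chromosome encodes (θ, δ) and is scored by the average number of steps needed to reach the stop condition.

```java
import java.util.Random;

// Sketch of a fitness evaluation for a chromosome encoding (theta, delta).
// Lower fitness is better: fewer steps until the stop condition p_opt = 0.9999 is met.
public final class ParameterFitness {
    static double fitness(double theta, double delta, int runs, Random rnd) {
        double total = 0.0;
        for (int run = 0; run < runs; run++) {
            total += LearningLoop.runOnce(theta, delta, rnd);   // reuses the loop sketch above
        }
        return total / runs;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        System.out.println("avg steps for theta=0.33, delta=0.50: "
                + fitness(0.33, 0.50, 200, rnd));
    }
}
```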
References: 
[1] Baba N., New Topics in Learning Automata: Theory and Applications, Lecture Notes in Control and Information Sciences, Berlin, Germany: Springer-Verlag, 1984. 
[2] Dorigo M., Introduction to the Special Issue on Learning Autonomous Robots, IEEE Trans. on Systems, Man and Cybernetics - part B, Vol. 26, No. 3, 1996, pp. 361-364. 
[3] Lakshmivarahan S., Thathachar M.A.L., Absolutely Expedient Learning Algorithms for Stochastic Automata, IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-6, 1973, pp. 281-286. 
[4] Moody J., Liu Y., Saffell M., Youn K., Stochastic direct reinforcement: Application to simple games with recurrence, In Proceedings of Artificial Multiagent Learning, Papers from the 2004 AAAI Fall Symposium, Technical Report FS-04-02. 
[5]Narendra K. S., Thathachar M. A. L., Learning 
Automata: an introduction, Prentice-Hall, 1989. 
[6]Rivero C., Characterization of the absolutely 
expedient learning algorithms for stochastic automata 
in a non-discrete space of actions, ESANN'2003 
proceedings - European Symposium on Artificial 
Neural Networks, Bruges (Belgium), 2003, pp. 307- 
312 
[7] Stoica F., Simian D., Automatic control based on Wasp Behavioral Model and Stochastic Learning Automata, Mathematics and Computers in Science and Engineering Series, Proceedings of the 10th WSEAS Int. Conf. on Mathematical Methods, Computational Techniques and Intelligent Systems (MAMECTIS '08), Corfu, 2008, WSEAS Press, pp. 289-295. 
[8]Stoica F., Popa E. M., An Absolutely Expedient 
Learning Algorithm for Stochastic Automata, 
WSEAS Transactions on Computers, Issue 2, Volume 
6, 2007 , pp. 229-235. 
[9]Stoica F., Popa E. M., Pah I., A new reinforcement 
scheme for stochastic learning automata – 
Application to Automatic Control, Proceedings of the 
International Conference on e-Business, 2008, Porto, 
Portugal 
[10]Ünsal C, Intelligent Navigation of Autonomous 
Vehicles in an Automated Highway System: Learning 
Methods and Interacting Vehicles Approach, 
dissertation thesis, Pittsburg University, Virginia, 
USA, 1997 
[11]Ünsal C., Kachroo P., Bay J. S., Multiple Stochastic 
Learning Automata for Vehicle Path Control in an 
Automated Highway System, IEEE Transactions on 
Systems, Man, and Cybernetics -part A: systems and 
humans, vol. 29, no. 1, 1999. 
