Analysis and Improvement of Multiple Optimal Learning Factors for Feed-Forward Networks

Praveen Jesudhas, Michael T. Manry, Rohit Rawat, and Sanjeev Malalur
Abstract—The effects of transforming the net function
vector in the multilayer perceptron are analyzed. The use
of optimal diagonal transformation matrices on the net
function vector is proved to be equivalent to training the
network using multiple optimal learning factors
(MOLF). A method for linearly compressing large ill-
conditioned MOLF Hessian matrices into smaller well-
conditioned ones is developed. This compression
approach is shown to be equivalent to using several
hidden units per learning factor. The technique is
extended to large networks. In simulations, the proposed
algorithm performs almost as well as the Levenberg
Marquardt algorithm with the computational complexity
of a first order training algorithm.
I. INTRODUCTION
Multilayer perceptrons (MLPs) have found wide
acceptance in a variety of applications ranging from
parameter estimation [1][2], document analysis and
recognition [3], finance, manufacturing [4] and data mining
[5]. The dynamically tuned basis functions of the MLP along
with its approximation capabilities [6] [7] make it an
effective tool for classification and approximation. The
relatively short evaluation time makes it a better choice than
many other classification and approximation methods such
as support vector machines (SVM) [8]. The universal
approximation [9] property of the MLP along with its ability
to mimic Bayes discriminant [10], optimal L2 norm
estimates [11] and maximum a-posteriori (MAP) [12]
estimates give it a strong theoretical foundation.
Despite these advantages, the MLP is limited by the lack
of scalable and effective training algorithms.
Training algorithms for the MLP can be broadly classified as
first order and second order algorithms. First order algorithms like
back-propagation (BP) [13] and output weight optimization-
backpropagation (OWO-BP) require fewer multiplies and
data passes and hence take less time per iteration. However,
they are sensitive to input means and gains [14]. Note that
conjugate gradient (CG) [15] has the same sensitivity; CG
can be considered first order because the MLP error function is not
quadratic. Second order methods, on the other hand, are not
as sensitive to input means and gains, but they do have their
own inherent limitations. Second order algorithms related to
Manuscript received February 10, 2011.
P. Jesudhas, M. Manry, and R. Rawat are with the University of Texas at
Arlington, TX 76010, USA (e-mail: manry@uta.edu).
S. Malalur is with O-Geo Well LLC in Arlington, Texas (e-mail:
sanjeev.malalur@gmail.com).
Newton’s method often have non-positive definite or
singular Hessian matrices [16][17] which result in unstable
training. Hence the Levenberg-Marquardt (LM) algorithm
[18][19] is used instead. The LM algorithm is
computationally intensive due to the large size of its
Hessian matrix, so its use in training large networks is often
limited.
In [20] a two-stage training algorithm known as multiple
optimal learning factors (MOLF-BP) is found to produce
results comparable to the LM algorithm with computations
similar to first order training algorithms. However, the
MOLF-BP approach still has certain limitations. The MOLF
Hessian can be ill-conditioned. The size of the MOLF
Hessian matrix Hmolf can also become prohibitively large for
larger networks, resulting in scalability problems.
This paper gives a strong theoretical foundation to the
basic MOLF algorithm by relating it to optimally
transforming the net function vector of the MLP. For large
networks, the scalability issues of the basic MOLF algorithm
are dealt with by reducing the size of the MOLF Hessian matrix
via a reduction in the number of optimal learning factors.
The organization of this paper is as follows. Section II
starts with an introduction to the MLP notation along with
the basic back-propagation training of the MLP. Section III
reviews the MOLF procedure in detail. An
analysis of the existing MOLF procedure using equivalent
networks is provided in Section IV. The idea of collapsing
the MOLF Hessian matrix, along with its advantages, is
explained in Section V. Section VI presents experimental
results.
II. THE MULTILAYER PERCEPTRON
A. Notation
In the fully connected MLP of Fig. 1, input weights
w(k,n) connect the nth
input to the kth
hidden unit. Output
weights woh(m,k) connect the kth
hidden unit’s activation
op(k) to the mth
output yp(m), which has a linear activation.
The bypass weight woi(m,n) connects the nth
input to the mth
output. The training data, described by the set {xp,tp}
consists of N-dimensional input vectors xp and M-
dimensional desired output vectors, tp. The pattern number p
varies from 1 to Nv, where Nv denotes the number of training
vectors present in the data set.
In order to handle thresholds in the hidden and output
layers, the input vectors are augmented by an extra element
as xp = [xp(1), xp(2),…., xp(N+1)]T
where xp(N+1) = 1. Let
Nh denote the number of hidden units. The dimensions of the
weight matrices W, Woh and Woi are respectively Nh by
(N+1), M by Nh and M by (N+1).

Fig. 1. A fully connected Multilayer Perceptron

Note that bypass weights woi relieve the hidden units of the burden of approximating
the linear part of the desired mapping. This reduces the
necessary number of hidden units. The vector of hidden
layer net functions, np and the actual output of the network,
yp can be written as
$$\mathbf{n}_p = \mathbf{W}\cdot\mathbf{x}_p, \qquad \mathbf{y}_p = \mathbf{W}_{oi}\cdot\mathbf{x}_p + \mathbf{W}_{oh}\cdot\mathbf{o}_p \tag{1}$$
where the kth
element of the hidden unit activation vector op
is calculated as op(k) = f(np(k)) and f(.) denotes the hidden
layer activation function. Training an MLP typically
involves minimizing the mean squared error between the
desired and the actual network outputs, defined as
$$E = \frac{1}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]^2 = \frac{1}{N_v}\sum_{p=1}^{N_v}E_p, \qquad E_p = \sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]^2 \tag{2}$$
where Ep is the cumulative squared error for pattern p.
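As a concrete illustration of the notation, the sketch below evaluates the forward pass of equation (1) and the error of equation (2) with NumPy. The array names, the random data, and the sigmoid choice for f(·) are assumptions made for illustration; the paper does not fix a particular hidden activation.

```python
import numpy as np

def forward(W, Woi, Woh, X):
    """Forward pass of the fully connected MLP with bypass weights (eq. 1).

    X   : Nv x (N+1) matrix of augmented input vectors, last column all ones
    W   : Nh x (N+1) input weight matrix
    Woi : M  x (N+1) bypass weight matrix
    Woh : M  x Nh    output weight matrix
    """
    Np = X @ W.T                    # net function vectors n_p, shape Nv x Nh
    Op = 1.0 / (1.0 + np.exp(-Np))  # hidden activations o_p(k) = f(n_p(k)); sigmoid assumed
    Y = X @ Woi.T + Op @ Woh.T      # y_p = Woi x_p + Woh o_p
    return Np, Op, Y

def mse(T, Y):
    """Error E of eq. (2): per-pattern squared error summed over outputs, averaged over patterns."""
    return np.sum((T - Y) ** 2) / T.shape[0]

# Small random example: N = 4 inputs, Nh = 3 hidden units, M = 2 outputs, Nv = 10 patterns.
rng = np.random.default_rng(0)
N, Nh, M, Nv = 4, 3, 2, 10
X = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])
T = rng.normal(size=(Nv, M))
W = rng.normal(size=(Nh, N + 1))
Woi = rng.normal(size=(M, N + 1))
Woh = rng.normal(size=(M, Nh))
print("E =", mse(T, forward(W, Woi, Woh, X)[2]))
```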
B. Discussion of Back-Propagation
BP training [13, 22, 23, 24] is a gradient based learning
algorithm which involves changing system parameters so
that the output error is reduced. For a MLP, the expression
for actual output found from the input is given by equation
(1). The weights of the MLP are changed by BP so that the
error E of equation (2), which computes the sum of the
squared errors between the actual and desired outputs,
decreases. The weights are updated using gradients of the
error with respect to the corresponding weights.
The required negative error gradients are calculated using
the chain rule utilizing delta functions, which are negative
gradients of Ep with respect to net functions. For the pth
pattern, output and hidden layer delta functions [13][21] are
respectively found as,
$$\delta_{po}(i) = 2\bigl[t_p(i)-y_p(i)\bigr], \qquad \delta_p(k) = f'\!\bigl(n_p(k)\bigr)\sum_{i=1}^{M}\delta_{po}(i)\,w_{oh}(i,k) \tag{3}$$
Now, the negative gradient of E with respect to w(k,n) is,
$$g(k,n) = \frac{-\partial E}{\partial w(k,n)} = \frac{1}{N_v}\sum_{p=1}^{N_v}\delta_p(k)\,x_p(n) \tag{4}$$
The matrix of negative partial derivatives can therefore be
written as,
$$\mathbf{G} = \frac{1}{N_v}\sum_{p=1}^{N_v}\boldsymbol{\delta}_p\,\mathbf{x}_p^T \tag{5}$$

where $\boldsymbol{\delta}_p = [\delta_p(1),\,\delta_p(2),\,\ldots,\,\delta_p(N_h)]^T$.
If steepest descent is used to modify the hidden weights,
W is updated in a given iteration as,
$$\mathbf{W} \leftarrow \mathbf{W} + \Delta\mathbf{W}, \qquad \Delta\mathbf{W} = z\cdot\mathbf{G} \tag{6}$$
where z is the learning factor, which can be obtained
optimally from the expression,
$$z = \frac{-\partial E/\partial z}{\partial^2 E/\partial z^2} \tag{7}$$
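A minimal sketch of equations (3) through (6), reusing the `forward` helper from the previous listing: output deltas, hidden deltas, and the negative gradient matrix G are accumulated over all patterns in vectorized form. The sigmoid derivative is an assumption carried over from that sketch.

```python
def input_weight_gradient(W, Woi, Woh, X, T):
    """Negative gradient G of E with respect to the input weights W (eqs. 3-5)."""
    Nv = X.shape[0]
    Np, Op, Y = forward(W, Woi, Woh, X)
    delta_po = 2.0 * (T - Y)             # output deltas, eq. (3), shape Nv x M
    fprime = Op * (1.0 - Op)             # f'(n_p(k)) for the assumed sigmoid
    delta_p = fprime * (delta_po @ Woh)  # hidden deltas, eq. (3), shape Nv x Nh
    return (delta_p.T @ X) / Nv          # eq. (5): (1/Nv) sum_p delta_p x_p^T
```

A single learning factor z, obtained for example from a numerical line search approximating equation (7), then gives the steepest descent update of equation (6) as `W += z * G`.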
III. REVIEW OF MOLF ALGORITHM
The MOLF algorithm makes use of the BP procedure to
update the input layer weights W, where a different learning
factor zk is used to update weights feeding into the kth
hidden
unit. The output weights Woh and bypass weights Woi are
found using output weight optimization (OWO).
Output weight optimization (OWO) is a technique for
finding weights connected to the outputs of the network.
Since the outputs have linear activations, finding the weights
connected to the outputs is equivalent to solving a system of
linear equations. The expression for the outputs given in (1)
can be re-written as
$$\mathbf{y}_p = \mathbf{W}_o\cdot\tilde{\mathbf{x}}_p \tag{8}$$

where $\tilde{\mathbf{x}}_p = [\mathbf{x}_p^T,\,\mathbf{o}_p^T]^T$ is the augmented input vector of length
$N_u = N + N_h + 1$, $\mathbf{W}_o$ is formed as $[\mathbf{W}_{oi} : \mathbf{W}_{oh}]$
and has a size of M by Nu. The output weights can be solved
for by setting $\partial E/\partial \mathbf{W}_o = 0$, which leads to a set of linear
equations given by,
$$\mathbf{C}_a = \mathbf{R}_a\cdot\mathbf{W}_o^T \tag{9}$$

where

$$\mathbf{C}_a = \frac{1}{N_v}\sum_{p=1}^{N_v}\tilde{\mathbf{x}}_p\,\mathbf{t}_p^T, \qquad \mathbf{R}_a = \frac{1}{N_v}\sum_{p=1}^{N_v}\tilde{\mathbf{x}}_p\,\tilde{\mathbf{x}}_p^T$$
Equation (9) is most easily solved using orthogonal least
squares (OLS) [22].
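The sketch below carries out OWO by forming the correlation matrices of equation (9) and solving for the combined output weight matrix Wo = [Woi : Woh]. A NumPy least-squares solve is used here in place of the orthogonal least squares procedure of [22]; that substitution, and the sigmoid activation, are assumptions for illustration.

```python
import numpy as np

def owo(W, X, T):
    """Output weight optimization: solve Ca = Ra Wo^T for Wo (eqs. 8-9)."""
    Nv = X.shape[0]
    Op = 1.0 / (1.0 + np.exp(-(X @ W.T)))   # hidden activations (sigmoid assumed)
    Xa = np.hstack([X, Op])                 # augmented vector [x_p^T, o_p^T]^T of length Nu
    Ra = (Xa.T @ Xa) / Nv                   # autocorrelation matrix, Nu x Nu
    Ca = (Xa.T @ T) / Nv                    # cross-correlation matrix, Nu x M
    Wo = np.linalg.lstsq(Ra, Ca, rcond=None)[0].T   # M x Nu
    return Wo[:, :X.shape[1]], Wo[:, X.shape[1]:]   # Woi, Woh
```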
The input weight connecting the nth
input to the kth
hidden
unit is updated using,
$$w(k,n) \leftarrow w(k,n) + z_k\,g(k,n) \tag{10}$$
where, zk denotes the learning factor corresponding to the kth
hidden unit. The vector z containing the different learning
factors zk can be found using OLS from the following
relation,
$$\mathbf{H}_{molf}\cdot\mathbf{z} = \mathbf{g}_{molf}, \qquad g_{molf}(j) = \frac{-\partial E}{\partial z_j}, \quad h_{molf}(k,j) = \frac{\partial^2 E}{\partial z_k\,\partial z_j} \tag{11}$$
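The sketch below assembles one MOLF step, reusing `forward` and `mse` from the earlier listings. Because the closed-form entries of g_molf and H_molf are not reproduced here, the derivatives in equation (11) are estimated numerically with finite differences; this numerical stand-in, and the least-squares solve in place of OLS, are assumptions for illustration only.

```python
def molf_step(W, Woi, Woh, X, T, G, eps=1e-4):
    """One MOLF update: solve H_molf z = g_molf (eq. 11), then apply eq. (10).

    G is the negative gradient matrix of eq. (5); z[k] scales the update of
    the weights feeding hidden unit k."""
    Nh = W.shape[0]

    def E_of(z):                               # error after the per-unit scaled update
        _, _, Y = forward(W + z[:, None] * G, Woi, Woh, X)
        return mse(T, Y)

    z0, I = np.zeros(Nh), np.eye(Nh)
    g = np.zeros(Nh)
    H = np.zeros((Nh, Nh))
    for k in range(Nh):
        g[k] = -(E_of(eps * I[k]) - E_of(-eps * I[k])) / (2 * eps)        # -dE/dz_k
        for j in range(k, Nh):
            H[k, j] = H[j, k] = (E_of(eps * (I[k] + I[j])) - E_of(eps * I[k])
                                 - E_of(eps * I[j]) + E_of(z0)) / eps**2  # d2E/dz_k dz_j
    z = np.linalg.lstsq(H, g, rcond=None)[0]
    return W + z[:, None] * G
```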
IV. ANALYSIS OF THE MOLF ALGORITHM
In this section, the structure of the MLP is analyzed in
detail using equivalent networks. The relationship between
MOLF and linear transformations of the net function vector
is investigated.
A. Discussion of Equivalent Networks
Let MLP 1 have a net function vector np and output vector
yp after back-propagation training as discussed previously in
Section II-B. In a second network, MLP 2, trained
similarly, the net function vector and the output vector are
respectively denoted by n′p and yp′. The bypass and output
weight matrices, along with the hidden unit activation functions,
are considered to be equal for both MLP 1 and MLP 2.
Based on this information the output vectors for MLP 1 and
MLP 2 are respectively,
$$y_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\,f\!\bigl(n_p(k)\bigr) \tag{12}$$

$$y'_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\,f\!\bigl(n'_p(k)\bigr) \tag{13}$$
In order to make MLP 2 strongly equivalent to MLP 1
their respective output vectors yp and yp′ have to be equal.
Based on equations (12) and (13) this can be achieved only
when their respective net function vectors are equal [23].
MLP 2 can be made strongly equivalent to MLP 1 by
linearly transforming its net function vector n′p before
passing it as an argument to the activation function f(·), given by

$$\mathbf{n}_p = \mathbf{C}\cdot\mathbf{n}'_p \tag{14}$$
where C is a linear transformation matrix of size Nh by Nh.
On applying the linear transformation to the net function
vector in equation (13) we get,
$$y'_p(i) = y_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\,f\!\left(\sum_{m=1}^{N_h} c(k,m)\,n'_p(m)\right) \tag{15}$$
After making MLP 2 strongly equivalent to MLP 1, the
net function vector np can be related to its input weights W
and the MLP 2 net function vector n′p as,

$$n_p(k) = \sum_{n=1}^{N+1} w(k,n)\,x_p(n) = \sum_{m=1}^{N_h} c(k,m)\,n'_p(m) \tag{16}$$
Similarly, the MLP 2 net function vector n′p can be related
to its weights W′ as,

$$n'_p(k) = \sum_{n=1}^{N+1} w'(k,n)\,x_p(n) \tag{17}$$
On substituting (17) into (16) we get,
$$n_p(k) = \sum_{m=1}^{N_h} c(k,m)\sum_{n=1}^{N+1} w'(m,n)\,x_p(n) \tag{18}$$
It is observed above that the net function of MLP 1 is
found by linear transformation of the input weight matrix W′
of MLP 2 given by,
$$\mathbf{W} = \mathbf{C}\cdot\mathbf{W}' \tag{19}$$

If the elements of the C matrix are found through an
optimality criterion, then optimally transformed input weights W
for MLP 1 can be computed from the input weight matrix W′ of MLP 2.
B. Optimal transformation of net function
The discussion of equivalent networks in Section IV-A
suggests that an optimal set of net functions can be obtained
by linear transformation of the input weights W. In this
section, the input weight update equations are derived for
OWO-BP training, using the linear transformation matrix C
from MLP 2.
MLP 1 and MLP 2 are strongly equivalent as before, so the
outputs of the two networks after the transformation are given
by equation (15). The elements of the negative gradient
matrix G′ of MLP 2, found from equations (2) and (15), are
defined as
$$g'(u,v) = \frac{-\partial E}{\partial w'(u,v)} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]\frac{\partial y_p(i)}{\partial w'(u,v)} \tag{20}$$

$$\frac{\partial y_p(i)}{\partial w'(u,v)} = \sum_{k=1}^{N_h} w_{oh}(i,k)\,o'_p(k)\,c(k,u)\,x_p(v)$$
Rearranging terms in (20) results in,
$$g'(u,v) = \sum_{k=1}^{N_h} c(k,u)\left\{\frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]\,w_{oh}(i,k)\,o'_p(k)\,x_p(v)\right\} = \sum_{k=1}^{N_h} c(k,u)\,\frac{-\partial E}{\partial w(k,v)} \tag{21}$$
which is abbreviated as,
$$\mathbf{G}' = \mathbf{C}^T\cdot\mathbf{G} \tag{22}$$
The input weight update equation for MLP 2 based on its
gradient matrix G′ is given by,
$$\mathbf{W}' = \mathbf{W}' + z\cdot\mathbf{G}' \tag{23}$$
On pre-multiplying this equation by C and using equations
(19) and (22) we obtain
$$\mathbf{W} = \mathbf{W} + z\cdot\mathbf{G}'' \tag{24}$$

where

$$\mathbf{G}'' = \mathbf{R}\cdot\mathbf{G}, \qquad \mathbf{R} = \mathbf{C}\cdot\mathbf{C}^T$$
Equation (24) is nothing but the weight update equation of
MLP 1, which would result in the network being strongly
equivalent to MLP 2. Thus using equation (24), the input
weights of MLP 1 can be updated so that its net functions
are optimal as in MLP 2.
Lemma 1: For a given R matrix, there are an uncountably
infinite number of C matrices.
The transformed gradient G′′ of MLP 1 is given in terms
of the original gradient G as,
$$g''(u,v) = \sum_{k=1}^{N_h} r(u,k)\,g(k,v) \tag{25}$$

$$g(k,v) = \frac{-\partial E}{\partial w(k,v)} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]\,w_{oh}(i,k)\,o'_p(k)\,x_p(v) \tag{26}$$
Equations (24) and (25) suggest that MLP 1 could be
trained with optimal net functions using only the knowledge
of the linear transformation matrix R.
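Lemma 1 and the transformed update of equation (24) can be illustrated numerically: any orthogonal matrix Q yields a second matrix C·Q with the same R = C·C^T, so the update W ← W + z·R·G is unchanged. The random matrices below are assumptions used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
Nh = 4
C = rng.normal(size=(Nh, Nh))
Q, _ = np.linalg.qr(rng.normal(size=(Nh, Nh)))  # random orthogonal matrix
C2 = C @ Q                                      # a different C giving the same R
print(np.allclose(C @ C.T, C2 @ C2.T))          # True: many C matrices share one R
```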
C. Multiple optimal learning factors
Equations (24) and (25) give a method for optimally
transforming the net functions of an MLP by using the
transformation matrix R. In this subsection, the MOLF
approach is derived from equations (24) and (25). Let the R
matrix in equation (24) be diagonal. In this case equation
(24) becomes,
$$w(k,n) = w(k,n) + z\,r(k)\,g(k,n) \tag{27}$$

where r(k) denotes the kth diagonal element of R. On
comparing equations (10) and (27), the expression for the
optimal learning factor zk is

$$z_k = z\cdot r(k) \tag{28}$$
Equation (28) suggests that using the MOLF algorithm for
training a MLP is equivalent to optimally transforming the
net function vector using a diagonal transformation matrix.
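The equivalence stated in equation (28) can be checked directly: applying the update of equation (24) with a diagonal R is the same as giving each hidden unit its own learning factor z_k = z·r(k), as in equation (27). The random W, G, and r below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Nh, Ncols = 3, 5
W = rng.normal(size=(Nh, Ncols))
G = rng.normal(size=(Nh, Ncols))                # negative gradient matrix of eq. (5)
z = 0.1
r = rng.uniform(0.5, 2.0, size=Nh)              # diagonal elements of R
W_transform = W + z * np.diag(r) @ G            # eq. (24) with diagonal R
W_molf = W + (z * r)[:, None] * G               # eq. (27): per-unit factors z_k = z r(k)
print(np.allclose(W_transform, W_molf))         # True
```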
V. EFFECTS OF COLLAPSING THE MOLF HESSIAN
This section presents a computational analysis of the
existing MOLF algorithm followed by a proposed method to
collapse the MOLF Hessian matrix to create fewer optimal
learning factors.
A. Computational Analysis of the MOLF algorithm
In the existing MOLF algorithm, a significant amount of
the computational burden is attributed to inverting the
Hessian matrix to find the optimal learning factors, usually
through the OLS procedure. The size of the Hessian matrix,
which is determined by the number of hidden units in the network,
directly affects the computational load of the MOLF
algorithm.
The gradient and Hessian equations of the MOLF
algorithm are given in equation (11). The computation of the
multiple optimal learning factor vector z requires that
equation (11) be solved. The number of multiplies required
for computing z through OLS is,
$$M_{ols\text{-}molf} = (N_h+1)\,N_h\,(N_h+2) \tag{29}$$
Equation (29) shows that the number of multiplications
required increases cubically with the number of hidden units.
For networks having large values of Nh, this can be time
consuming since solving (11) has to be done for every
iteration of the training algorithm. If some hidden units have
linearly dependent outputs [20], the MOLF Hessian matrix
Hmolf can be ill conditioned. This can affect MLP training.
Thus having the number of optimal learning factors equal
to the number of hidden units can result in a high
computational load and also could lead to an ill-conditioned
MOLF Hessian matrix.
B. Training with several hidden units per OLF
In this section, we modify the MOLF approach so that
each OLF is assigned to one or more hidden units. The
Hessian for this new approach is compared to Hmolf.
Let NOLF be the number of optimal learning factors used
for training in the proposed variable optimal learning factors
(VOLF) method. NOLF is selected such that it divides Nh,
with no remainder. Each optimal learning factor zv then
applies to Nh/NOLF hidden units. The MLP output based on
these conditions is given by,
$$y_p(m) = \sum_{n=1}^{N+1} w_{oi}(m,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(m,k)\,f\!\left(\sum_{n=1}^{N+1}\bigl(w(k,n) + z_v\,g(k,n)\bigr)x_p(n)\right) \tag{30}$$

$$v = \mathrm{ceil}\!\left(k\cdot\frac{N_{OLF}}{N_h}\right)$$
where the function ceil() rounds non-integer arguments up to
the next integer.
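A short sketch of the VOLF grouping rule: each hidden unit k is mapped to the learning factor v = ceil(k·N_OLF/N_h) of equation (30), which is consistent with the index map k(j,c) used in equation (31), assigning N_h/N_OLF consecutive hidden units to each factor. The function name is illustrative.

```python
import math

def volf_groups(Nh, Nolf):
    """Learning factor index v for each hidden unit k (eq. 30), and the hidden
    units k(j, c) assigned to each factor j (index map of eq. 31); 1-based."""
    assert Nh % Nolf == 0, "N_OLF must divide N_h with no remainder"
    v = [math.ceil(k * Nolf / Nh) for k in range(1, Nh + 1)]
    groups = {j: [c + (j - 1) * (Nh // Nolf) for c in range(1, Nh // Nolf + 1)]
              for j in range(1, Nolf + 1)}
    return v, groups

v, groups = volf_groups(Nh=6, Nolf=3)
print(v)       # [1, 1, 2, 2, 3, 3]
print(groups)  # {1: [1, 2], 2: [3, 4], 3: [5, 6]}
```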
The negative gradient of the error in equation (2) with
respect to each of the optimal learning factors zv, denoted as
gvolf is computed based on equation (30) as,
$$g_{volf}(j) = \frac{-\partial E}{\partial z_j} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{m=1}^{M}\left[t'_p(m) - \sum_{k=1}^{N_h} w_{oh}(m,k)\,o_p(z_v,k)\right]\cdot\sum_{c=1}^{N_h/N_{OLF}} w_{oh}\bigl(m,k(j,c)\bigr)\,o'_p\bigl(k(j,c)\bigr)\,n'_p\bigl(k(j,c)\bigr) \tag{31}$$

where

$$t'_p(m) = t_p(m) - \sum_{n=1}^{N+1} w_{oi}(m,n)\,x_p(n), \qquad k(j,c) = c + (j-1)\,\frac{N_h}{N_{OLF}},$$

$$o_p(z_v,k) = f\!\left(\sum_{n=1}^{N+1}\bigl(w(k,n) + z_v\,g(k,n)\bigr)x_p(n)\right), \qquad n'_p(k) = \sum_{n=1}^{N+1} g(k,n)\,x_p(n)$$
The Hessian matrix Hvolf is derived from (2), (30) and (31)
as,
$$h_{volf}(i,j) = \frac{\partial^2 E}{\partial z_i\,\partial z_j} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{m=1}^{M}\left[\sum_{d=1}^{N_h/N_{OLF}} w_{oh}\bigl(m,k(i,d)\bigr)\,o'_p\bigl(k(i,d)\bigr)\,n'_p\bigl(k(i,d)\bigr)\right]\cdot\left[\sum_{c=1}^{N_h/N_{OLF}} w_{oh}\bigl(m,k(j,c)\bigr)\,o'_p\bigl(k(j,c)\bigr)\,n'_p\bigl(k(j,c)\bigr)\right] \tag{32}$$
The vector z of variable optimal learning factors is found
from the negative gradient vector and Hessian matrix via the
relation,
$$\mathbf{H}_{volf}\cdot\mathbf{z} = \mathbf{g}_{volf} \tag{33}$$

where Hvolf is NOLF by NOLF and gvolf is a column vector of
dimension NOLF. Thus, by varying the number of optimal
learning factors required for training, the vector z is found
using the method discussed in Section III.
The computation required for solving equation (33) can be
adjusted by choosing the number of optimal learning factors.
When NOLF equals one, the procedure reduces to using a
single optimal learning factor as discussed in Section II; it
requires the least computation but is also the least effective.
When NOLF equals Nh, the algorithm reduces to the MOLF
algorithm described in Section III. Thus, by varying the
number of optimal learning factors between one and Nh, the
algorithm interpolates between the OLF and MOLF cases.
C. Collapsing the MOLF Hessian
Occasionally the MOLF approach can fail because of
distortion in Hmolf due to linearly dependent inputs [20] or
because it is ill-conditioned due to linearly dependent hidden
units [20]. In these cases we don’t have to redo the current
training iteration. Instead we can collapse Hmolf and gmolf
down to a smaller size and use the approach of the previous
subsection. The elements of the original MOLF Hessian
matrix Hmolf for a MLP with Nh hidden units are denoted as,
$$\mathbf{H}_{molf} = \begin{bmatrix} h_{molf}(1,1) & \cdots & h_{molf}(1,N_h) \\ \vdots & \ddots & \vdots \\ h_{molf}(N_h,1) & \cdots & h_{molf}(N_h,N_h) \end{bmatrix} \tag{34}$$

Collapsing the Hessian matrix Hmolf to a matrix Hmolf1 of size
NOLF by NOLF, we get

$$\mathbf{H}_{molf1} = \begin{bmatrix} h_{molf1}(1,1) & \cdots & h_{molf1}(1,N_{OLF}) \\ \vdots & \ddots & \vdots \\ h_{molf1}(N_{OLF},1) & \cdots & h_{molf1}(N_{OLF},N_{OLF}) \end{bmatrix} \tag{35}$$
The relationship between the elements of Hmolf1 and Hmolf is
described by,
$$h_{molf1}(m,n) = \sum_{c=1}^{N_h/N_{OLF}}\sum_{d=1}^{N_h/N_{OLF}} h_{molf}\bigl(k(m,c),\,k(n,d)\bigr) \tag{36}$$
Equations (11) and (32) show that collapsing the MOLF
Hessian matrix to size NOLF by NOLF results in a VOLF
Hessian matrix of that size. Thus the Hmolf1 Hessian matrix
of equation (35) is equivalent to the VOLF Hessian
matrix Hvolf of equation (32).
On collapsing the negative gradient vector gmolf to a vector
gmolf1 having NOLF elements we get,
$$\mathbf{g}_{molf1} = \bigl[g_{molf1}(1),\,g_{molf1}(2),\,\ldots,\,g_{molf1}(N_{OLF})\bigr]^T \tag{37}$$
The relationship between gmolf1 and gmolf is described by,
$$g_{molf1}(m) = \sum_{c=1}^{N_h/N_{OLF}} g_{molf}\bigl(k(m,c)\bigr) \tag{38}$$
Equations (11) and (31) show that collapsing the MOLF
gradient vector to size NOLF by 1 results in the VOLF
gradient vector.
Therefore collapsing the MOLF Hessian matrix Hmolf and
gradient vector gmolf is equivalent to training the MLP with
the VOLF algorithm.
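A compact sketch of the collapsing step of equations (36) and (38), assuming NumPy arrays and the block-structured index map k(m,c): with N_h/N_OLF consecutive hidden units per learning factor, the collapse is a block sum of H_molf and g_molf.

```python
def collapse_molf(H_molf, g_molf, Nolf):
    """Collapse the Nh x Nh MOLF Hessian and Nh-element gradient into the
    NOLF-sized VOLF system by summing over each factor's hidden units (eqs. 36, 38)."""
    Nh = g_molf.shape[0]
    S = Nh // Nolf                                     # hidden units per learning factor
    H1 = H_molf.reshape(Nolf, S, Nolf, S).sum(axis=(1, 3))
    g1 = g_molf.reshape(Nolf, S).sum(axis=1)
    return H1, g1   # the collapsed system H1 z = g1 is then solved in place of eq. (11)
```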
VI. RESULTS
A. Computational Burden of Different Algorithms
In this section the computational burden is calculated for
different MLP training algorithms in terms of required
multiplies. All the algorithms were implemented in
Microsoft Visual Studio 2005.
Let Nu = N + Nh + 1 denote the number of weights
connected to each output. The total number of weights in the
network is denoted as Nw = M(N + Nh + 1) + Nh(N + 1). The
number of multiplies required to solve for the output weights
using orthogonal least squares [22] is given by

$$M_{ols} = N_u\left[M(N_u+2) + \tfrac{3}{4}\,N_u(N_u+1)\right] \tag{39}$$
The numbers of multiplies required per training iteration
using BP, OWO-BP and LM are respectively given by,
$$M_{bp} = N_v\bigl[M N_u + 2N_h(N+1) + M(N+6N_h+4)\bigr] + N_w \tag{40}$$

$$M_{owo\text{-}bp} = N_v\left[2N_h(N+2) + M(N_u+1) + \frac{N_u(N+1)}{2} + M(N+6N_h+4)\right] + M_{ols} + N_h(N+1) \tag{41}$$

$$M_{lm} = M_{bp} + N_v\bigl[M N_u\bigl(N_u + 3N_h(N+1)\bigr) + 4N_h^2(N+1)^2\bigr] + N_w^2 + N_w^3 \tag{42}$$
Equation (29) gives the multiplies required for computing
the optimal learning factors of the MOLF algorithm.
Similarly the number of multiplies required for the VOLF
algorithm is given as,
$$M_{ols\text{-}volf} = (N_{OLF}+1)\,N_{OLF}\,(N_{OLF}+2) \tag{43}$$
The total number of multiplies required for the MOLF and
VOLF algorithms are respectively given as,
$$M_{molf} = M_{owo\text{-}bp} + N_v\bigl[N_h(N+4) - M(7N - N_h + 4)\bigr] + N_h^3 + M_{ols\text{-}molf} \tag{44}$$

$$M_{volf} = M_{owo\text{-}bp} + N_v\bigl[N_h(N+4) - M(7N - N_h + 4)\bigr] + N_h^3 + M_{ols\text{-}volf} \tag{45}$$
Equations (44) and (45) show that the number of
multiplies required in the MOLF and VOLF algorithms
consists of the OWO-BP multiplies and the multiplies needed to compute
the Hessian and negative gradient matrix, along with the
number of multiplies to invert the Hessian matrix. The
matrices Hmolf and Hvolf are respectively inverted in the
MOLF and VOLF algorithms.
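The sketch below evaluates the OLS solve counts of equations (29) and (43), as reconstructed above, for a few network sizes, showing the cubic growth with N_h and the reduction obtained by collapsing to N_OLF = N_h/2 learning factors.

```python
def m_ols_factors(n):
    """Multiplies for solving an n-factor system through OLS (eqs. 29 and 43)."""
    return (n + 1) * n * (n + 2)

for Nh in (10, 20, 40, 80):
    print(Nh, m_ols_factors(Nh), m_ols_factors(Nh // 2))
```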
B. Experimental results
Here the performance of VOLF is compared with those of
MOLF, OWO-BP, LM and CG. In CG and LM, all weights
are varied in every iteration. In MOLF, VOLF and OWO-BP
we first solve linear equations for the output weights and
subsequently update the input weights.
The data sets used for the simulations are listed in Table I.
The training for all the data sets is done on inputs
normalized to zero mean and unit variance.
The optimal number of hidden units to be used in the
MLP is determined by network pruning using the method of
[24]. Then the k-fold validation procedure is used to obtain
the average training and validation errors. In k-fold
validation, the data set is split into k non-overlapping parts
of equal size, and (k − 1) parts are used for training and the
remaining one part is used for validation. The procedure is
repeated k times, picking a different part for use as the
validation set every time.

Fig. 2. Twod data set: average error vs. number of iterations.
Fig. 3. Twod data set: average error vs. number of multiplies.

For the simulations, k is chosen as
10. The number of iterations used for OWO-BP, CG, MOLF
and VOLF is chosen as 4000. For the LM algorithm 300
iterations are performed. For the VOLF algorithm NOLF =
Nh/2.
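A brief sketch of the k-fold procedure described above, assuming a generic, hypothetical `train_and_score` routine that returns the training and validation MSE for one fold; only the splitting and averaging logic is shown.

```python
import numpy as np

def k_fold_errors(X, T, train_and_score, k=10):
    """Average training and validation MSE over k non-overlapping folds."""
    folds = np.array_split(np.random.permutation(X.shape[0]), k)
    e_trn, e_val = [], []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        etr, ev = train_and_score(X[trn], T[trn], X[val], T[val])  # hypothetical trainer
        e_trn.append(etr)
        e_val.append(ev)
    return np.mean(e_trn), np.mean(e_val)
```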
The average training error and the number of multiplies are
calculated for every iteration on a particular data set using the
different training algorithms. These measurements are then
plotted to provide a graphical representation of the
efficiency and quality of the different training algorithms.
These plots for different datasets are shown below.
For the Twod.tra data file [25], the MLP is trained with 30
hidden units. In Fig. 2, the average mean square error (MSE)
for training from 10-fold validation is plotted versus the
number of iterations for each algorithm (shown on a log10
scale). In Fig. 3, the average training MSE from 10-fold
validation is plotted versus the required number of multiplies
(shown on a log10 scale). The LM algorithm shows the
lowest error of all the algorithms used for training, but its
computational burden may make it unsuitable for training
purposes. It is noted that the average error of the VOLF
algorithm lies between that of the OWO-BP and MOLF
algorithms for every iteration.
Fig. 4. Single2 data set: average error vs. number of iterations.
Fig. 5. Single2 data set: average error vs. number of multiplies.
TABLE I
Data set description
Data Set Name No. of Inputs No. of Outputs No. of Patterns
Twod 8 7 1768
Single2 16 3 10000
Power12 12 1 1414
Concrete 8 1 1030
Fig. 6. Power12 data set: average error vs. number of iterations.
Fig. 7. Power12 data set: average error vs. number of multiplies.
For the Single2.tra data file [25], the MLP is trained with
20 hidden units. In Fig. 4, the average mean square error
(MSE) for training from 10-fold validation is plotted versus
the number of iterations for each algorithm (shown on a
log10 scale). In Fig. 5, the average training MSE from 10-
fold validation is plotted versus the required number of
multiplies (shown on a log10 scale). In the initial iterations,
the MOLF algorithm shows the least error. But overall, as
for the previous dataset, the LM algorithm shows the lowest
error of all the algorithms used for training. As before the
average error of the VOLF algorithm lies between that of the
OWO-BP and MOLF algorithms.
For the Power12trn.tra data file [25], the MLP is trained
with 25 hidden units. In Fig. 6, the average mean square
error (MSE) for training from 10-fold validation is plotted
versus the number of iterations for each algorithm (shown on
a log10 scale). In Fig. 7, the average training MSE from 10-
fold validation is plotted versus the required number of
multiplies (shown on a log10 scale). For this dataset the
MOLF algorithm performs better than all the other
algorithms. As seen previously, the computational
requirements of the LM algorithm are very high. As before,
the average error of the VOLF algorithm lies between that of
the OWO-BP and MOLF algorithms for each iteration.

Fig. 8. Concrete data set: average error vs. number of iterations.
Fig. 9. Concrete data set: average error vs. number of multiplies.
For the Concrete data file [26], the MLP is trained with 15
hidden units. In Fig. 8, the average mean square error (MSE)
for training from 10-fold validation is plotted versus the
number of iterations for each algorithm (shown on a log10
scale).
In Fig. 9, the average training MSE from 10-fold
validation is plotted versus the required number of multiplies
(shown on a log10 scale). As before, the average error of the
VOLF algorithm lies between that of the OWO-BP and
MOLF algorithms for every iteration.
Table II compares the average training and validation
errors of the MOLF and VOLF algorithms with the other
algorithms for different data files. For each data set, the
average training and validation errors are found after 10-fold
validation.
From the plots and the table presented, it can be inferred
that the error performance of the VOLF algorithm lies
between those of the MOLF and OWO-BP algorithms. The
VOLF and MOLF algorithms are also found to produce
good results approaching those of the LM algorithm, with
computational requirements on the order of those of first order
training algorithms.
VII. CONCLUSION
In this paper we have put the MOLF training algorithm on
a firm theoretical footing by showing its derivation from an
optimal linear transformation of the hidden layer's net
function vector. We have also developed the VOLF training
algorithm, which associates several hidden units with each
optimal learning factor. This modification of the MOLF
approach will allow us to train much larger networks
efficiently.
REFERENCES
[1] R. C. Odom, P. Pavlakos, S.S. Diocee, S. M. Bailey, D. M. Zander,
and J. J. Gillespie, “Shaly sand analysis using density-neutron
porosities from a cased-hole pulsed neutron system,” SPE Rocky
Mountain regional meeting proceedings: Society of Petroleum
Engineers, pp. 467-476, 1999.
[2] A. Khotanzad, M. H. Davis, A. Abaye, D. J. Maratukulam, “An
artificial neural network hourly temperature forecaster with
applications in load forecasting,” IEEE Transactions on Power
Systems, vol. 11, No. 2, pp. 870-876, May 1996.
[3] S. Marinai, M. Gori, G. Soda, “Artificial neural networks for
document analysis and recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 27, No. 1, pp. 23-35, 2005.
[4] J. Kamruzzaman, R. A. Sarker, R. Begg, “Artificial Neural Networks:
Applications in Finance and Manufacturing,” Idea Group Inc (IGI),
2006.
[5] L. Wang and X. Fu, Data Mining With Computational Intelligence,
Springer-Verlag, 2005.
[6] K. Hornik, M. Stinchcombe, and H.White, “Multilayer feedforward
networks are universal approximators.” Neural Networks, Vol. 2, No.
5, 1989, pp.359-366.
[7] K. Hornik, M. Stinchcombe, and H. White, “Universal approximation
of an unknown mapping and its derivatives using multilayer
feedforward networks,” Neural Networks,vol. 3,1990, pp.551-560.
[8] C. Cortes and V. Vapnik. Support-vector networks. Machine
Learning, 20:273-297, November 1995.
[9] G. Cybenko, “Approximations by superposition of a sigmoidal
function,” Mathematics of Control, Signals, and Systems (MCSS),
vol. 2, pp. 303-314, 1989.
[10] D. W. Ruck et al., “The multi-layer perceptron as an approximation to
a Bayes optimal discriminant function,” IEEE Transactions on Neural
Networks, vol. 1, No. 4, 1990.
[11] Michael T. Manry, Steven J. Apollo, and Qiang Yu, “Minimum mean
square estimation and neural networks,” Neurocomputing, vol. 13,
September 1996, pp.59-74.
[12] Q. Yu, S.J. Apollo, and M.T. Manry, “MAP estimation and the multi-
layer perceptron,” Proceedings of the 1993 IEEE Workshop on
Neural Networks for Signal Processing, pp. 30-39, Linthicum Heights,
Maryland, Sept. 6-9, 1993.
[13] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, “Learning internal
representations by error propagation,” in D.E. Rumelhart and J.L.
McClelland (Eds.), Parallel Distributed Processing, vol. I, Cambridge,
Massachusetts: The MIT Press, 1986.
[14] Changhua Yu, Michael T. Manry, and Jiang Li, "Effects of
nonsingular pre-processing on feed-forward network training ".
International Journal of Pattern Recognition and Artificial Intelligence
, Vol. 19, No. 2 (2005) pp. 217-247.
[15] J.P. Fitch, S.K. Lehman, F.U. Dowla, S.Y. Lu, E.M. Johansson, and
D.M. Goodman, "Ship Wake-Detection Procedure Using Conjugate
Gradient Trained Artificial Neural Networks," IEEE Trans. on
Geoscience and Remote Sensing, Vol. 29, No. 5, September 1991, pp.
718-726.
[16] S. McLoone and G. Irwin, “A variable memory Quasi-Newton
training algorithm,” Neural Processing Letters, vol. 9, pp. 77-89,
1999.
[17] A. J. Shepherd Second-Order Methods for Neural Networks, Springer-
Verlag New York, Inc., 1997.
[18] K. Levenberg, “A method for the solution of certain problems in least
squares,” Quart. Appl. Math., Vol. 2, pp. 164-168, 1944.
[19] D. Marquardt, “An algorithm for least-squares estimation of nonlinear
parameters,” SIAM J. Appl. Math., Vol. 11, pp. 431-441, 1963.
[20] S. S. Malalur, M. T. Manry, "Multiple optimal learning factors for
feed-forward networks," accepted by The SPIE Defense, Security and
Sensing (DSS) Conference, Orlando, FL, April 2010
[21] T.H. Kim “Development and evaluation of multilayer perceptron
training algorithms”, Phd. Dissertation, The University of Texas at
Arlington, 2001.
[22] W. Kaminski, P. Strumillo, “Kernel orthonormalization in radial basis
function neural networks,” IEEE Transactions on Neural Networks,
vol. 8, Issue 5, pp. 1177 - 1183, 1997
[23] R.P Lippman, “An introduction to computing with Neural Nets,”
IEEE ASSP Magazine,April 1987.
[24] Pramod L. Narasimha, Walter H. Delashmit, Michael T. Manry, Jiang
Li, Francisco Maldonado, “An integrated growing-pruning method for
feedforward network training,” Neurocomputing vol. 71, pp. 2831–
2847, 2008.
[25] Univ. of Texas at Arlington, Training Data Files –
http://wwwee.uta.edu/eeweb/ip/training_data_files.html.
[26] Univ. of California, Irvine, Machine Learning Repository -
http://archive.ics.uci.edu/ml/
TABLE II
Average 10-fold training and validation error

Data Set          OWO-BP    CG        VOLF      MOLF      LM
Twod      Etrn    0.22      0.20      0.18      0.17      0.16
          Eval    0.25      0.22      0.21      0.20      0.17
Single2   Etrn    0.83      0.63      0.51      0.16      0.03
          Eval    1.02      0.99      0.64      0.19      0.11
Power12   Etrn    6959.89   6471.98   5475.01   4177.45   4566.34
          Eval    7937.21   7132.13   6712.34   5031.23   5423.54
Concrete  Etrn    38.98     19.88     21.81     21.56     16.98
          Eval    65.12     56.12     37.23     37.21     33.12
2600

More Related Content

What's hot

A scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringA scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clustering
AllenWu
 
A comparison of efficient algorithms for scheduling parallel data redistribution
A comparison of efficient algorithms for scheduling parallel data redistributionA comparison of efficient algorithms for scheduling parallel data redistribution
A comparison of efficient algorithms for scheduling parallel data redistribution
IJCNCJournal
 
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data SetsMethods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
Ryan B Harvey, CSDP, CSM
 
Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)
Mostafa G. M. Mostafa
 
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
IJERD Editor
 
Incremental collaborative filtering via evolutionary co clustering
Incremental collaborative filtering via evolutionary co clusteringIncremental collaborative filtering via evolutionary co clustering
Incremental collaborative filtering via evolutionary co clustering
Allen Wu
 
PRML 5.5
PRML 5.5PRML 5.5
PRML 5.5
Ryuta Shitomi
 
Restricted Boltzman Machine (RBM) presentation of fundamental theory
Restricted Boltzman Machine (RBM) presentation of fundamental theoryRestricted Boltzman Machine (RBM) presentation of fundamental theory
Restricted Boltzman Machine (RBM) presentation of fundamental theory
Seongwon Hwang
 
A general multiobjective clustering approach based on multiple distance measures
A general multiobjective clustering approach based on multiple distance measuresA general multiobjective clustering approach based on multiple distance measures
A general multiobjective clustering approach based on multiple distance measures
Mehran Mesbahzadeh
 
DL for molecules
DL for moleculesDL for molecules
DL for molecules
Dai-Hai Nguyen
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
YaswanthHariKumarVud
 
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
IOSRJVSP
 
Lecture 5: Neural Networks II
Lecture 5: Neural Networks IILecture 5: Neural Networks II
Lecture 5: Neural Networks II
Sang Jun Lee
 
Tutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Tutorial on Markov Random Fields (MRFs) for Computer Vision ApplicationsTutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Tutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Anmol Dwivedi
 
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Anmol Dwivedi
 
reportVPLProject
reportVPLProjectreportVPLProject
reportVPLProject
Sebastien Speierer
 
Classification of handwritten characters by their symmetry features
Classification of handwritten characters by their symmetry featuresClassification of handwritten characters by their symmetry features
Classification of handwritten characters by their symmetry features
AYUSH RAJ
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning
Masahiro Suzuki
 
El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711
RCCSRENKEI
 

What's hot (20)

A scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringA scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clustering
 
A comparison of efficient algorithms for scheduling parallel data redistribution
A comparison of efficient algorithms for scheduling parallel data redistributionA comparison of efficient algorithms for scheduling parallel data redistribution
A comparison of efficient algorithms for scheduling parallel data redistribution
 
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data SetsMethods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
 
Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)
 
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
 
Incremental collaborative filtering via evolutionary co clustering
Incremental collaborative filtering via evolutionary co clusteringIncremental collaborative filtering via evolutionary co clustering
Incremental collaborative filtering via evolutionary co clustering
 
PRML 5.5
PRML 5.5PRML 5.5
PRML 5.5
 
Restricted Boltzman Machine (RBM) presentation of fundamental theory
Restricted Boltzman Machine (RBM) presentation of fundamental theoryRestricted Boltzman Machine (RBM) presentation of fundamental theory
Restricted Boltzman Machine (RBM) presentation of fundamental theory
 
A general multiobjective clustering approach based on multiple distance measures
A general multiobjective clustering approach based on multiple distance measuresA general multiobjective clustering approach based on multiple distance measures
A general multiobjective clustering approach based on multiple distance measures
 
DL for molecules
DL for moleculesDL for molecules
DL for molecules
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
 
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
 
Lecture 5: Neural Networks II
Lecture 5: Neural Networks IILecture 5: Neural Networks II
Lecture 5: Neural Networks II
 
Tutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Tutorial on Markov Random Fields (MRFs) for Computer Vision ApplicationsTutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Tutorial on Markov Random Fields (MRFs) for Computer Vision Applications
 
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
 
reportVPLProject
reportVPLProjectreportVPLProject
reportVPLProject
 
Classification of handwritten characters by their symmetry features
Classification of handwritten characters by their symmetry featuresClassification of handwritten characters by their symmetry features
Classification of handwritten characters by their symmetry features
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning
 
El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711
 

Similar to Analysis_molf

1-s2.0-S092523121401087X-main
1-s2.0-S092523121401087X-main1-s2.0-S092523121401087X-main
1-s2.0-S092523121401087X-main
Praveen Jesudhas
 
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTIONMODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
ijwmn
 
Improving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning AlgorithmImproving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning Algorithm
ijsrd.com
 
parallel
parallelparallel
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKSSLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
IJCI JOURNAL
 
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
ArchiLab 7
 
An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...
IJMIT JOURNAL
 
International Journal of Managing Information Technology (IJMIT)
International Journal of Managing Information Technology (IJMIT)International Journal of Managing Information Technology (IJMIT)
International Journal of Managing Information Technology (IJMIT)
IJMIT JOURNAL
 
An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...
IJMIT JOURNAL
 
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithmEfficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Cemal Ardil
 
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on Cooperative
ESCOM
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
IJDKP
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
Wireilla
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ijfls
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
ESCOM
 
directed-research-report
directed-research-reportdirected-research-report
directed-research-report
Ryen Krusinga
 
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and SystemsDSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
Amr E. Mohamed
 
Black-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX modelBlack-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX model
IJECEIAES
 
BNL_Research_Report
BNL_Research_ReportBNL_Research_Report
BNL_Research_Report
Brandon McKinzie
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
IJERA Editor
 

Similar to Analysis_molf (20)

1-s2.0-S092523121401087X-main
1-s2.0-S092523121401087X-main1-s2.0-S092523121401087X-main
1-s2.0-S092523121401087X-main
 
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTIONMODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
 
Improving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning AlgorithmImproving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning Algorithm
 
parallel
parallelparallel
parallel
 
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKSSLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
 
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
 
An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...
 
International Journal of Managing Information Technology (IJMIT)
International Journal of Managing Information Technology (IJMIT)International Journal of Managing Information Technology (IJMIT)
International Journal of Managing Information Technology (IJMIT)
 
An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...
 
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithmEfficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithm
 
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on Cooperative
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
directed-research-report
directed-research-reportdirected-research-report
directed-research-report
 
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and SystemsDSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
 
Black-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX modelBlack-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX model
 
BNL_Research_Report
BNL_Research_ReportBNL_Research_Report
BNL_Research_Report
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
 

Analysis_molf

  • 1. Abstract—The effects of transforming the net function vector in the multilayer perceptron are analyzed. The use of optimal diagonal transformation matrices on the net function vector is proved to be equivalent to training the network using multiple optimal learning factors (MOLF). A method for linearly compressing large ill- conditioned MOLF Hessian matrices into smaller well- conditioned ones is developed. This compression approach is shown to be equivalent to using several hidden units per learning factor. The technique is extended to large networks. In simulations, the proposed algorithm performs almost as well as the Levenberg Marquardt algorithm with the computational complexity of a first order training algorithm. I. INTRODUCTION Multilayer perceptrons (MLPs) have found wide acceptance in a variety of applications ranging from parameter estimation [1][2], document analysis and recognition [3], finance, manufacturing [4] and data mining [5]. The dynamically tuned basis functions of the MLP along with its approximation capabilities [6] [7] make it an effective tool for classification and approximation. The relatively short evaluation time makes it a better choice than many other classification and approximation methods such as support vector machines (SVM) [8]. The universal approximation [9] property of the MLP along with its ability to mimic Bayes discriminant [10], optimal L2 norm estimates [11] and maximum a-posteriori (MAP) [12] estimates give it a strong theoretical foundation. Despite of the above advantages there are limitations due to the lack of scalable and effective training algorithms. Training algorithms for the MLP can be briefly classified as first and second order algorithms. First order algorithms like back-propagation (BP) [13] and output weight optimization- backpropagation (OWO-BP) require fewer multiplies and data passes and hence take less time per iteration. However, they are sensitive to input means and gains [14]. Note that conjugate gradient (CG) [15] has the same sensitivity. CG can be considered first order as the error function is not quadratic. Second order methods on the other hand, are not as sensitive to input means and gains, but they do have their own inherent limitations. Second order algorithms related to Manuscript received February 10, 2011. P. Jesudhas, M. Manry, and R. Rawat are with the University of Texas at Arlington, TX 76010, USA (e-mail: manry@uta.edu). S. Malalur is with O-Geo Well LLC in Arlington, Texas (e-mail: sanjeev.malalur@gmail.com). Newton’s method often have non-positive definite or singular Hessian matrices [16][17] which result in unstable training. Hence the Levenberg-Marquardt (LM) algorithm [18][19] is used instead. The LM algorithm is very computationally intensive due to the large size of the Hessian matrix so its use in training large networks is often limited. In [20] a two-stage training algorithm known as multiple optimal learning factors (MOLF-BP) is found to produce results comparable to the LM algorithm with computations similar to first order training algorithms. However, the MOLF-BP approach still has certain limitations. The MOLF Hessian can be ill-conditioned. The size of the MOLF Hessian matrix Hmolf can also become prohibitively large for larger networks, resulting in scalability problems. This paper gives a strong theoretical foundation to the basic MOLF algorithm by relating it to optimally transforming the net function vector of the MLP. 
For large networks, the scalability issues of the basic MOLF algorithm are dealt with by reducing the MOLF Hessian matrice's size via a reduction in the number of optimal learning factors. The organization of this paper is as follows. Section II starts with an introduction to the MLP notation along with the basic back-propagation training of the MLP. Section III includes a review of the MOLF procedure in detail. An analysis of the existing MOLF procedure with equivalent networks is provided in section IV. The idea of collapsing the MOLF Hessian matrix is explained in section V along with its advantages. Section VI presents experimental results. II. THE MULTILAYER PERCEPTRON A. Notation In the fully connected MLP of Fig. 1, input weights w(k,n) connect the nth input to the kth hidden unit. Output weights woh(m,k) connect the kth hidden unit’s activation op(k) to the mth output yp(m), which has a linear activation. The bypass weight woi(m,n) connects the nth input to the mth output. The training data, described by the set {xp,tp} consists of N-dimensional input vectors xp and M- dimensional desired output vectors, tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training vectors present in the data set. In order to handle thresholds in the hidden and output layers, the input vectors are augmented by an extra element as xp = [xp(1), xp(2),…., xp(N+1)]T where xp(N+1) = 1. Let Nh denote the number of hidden units. The dimensions of the weight matrices W, Woh and Woi are respectively Nh by (N+1), M by Nh and M by (N+1). Note that bypass weights Analysis and Improvement of Multiple Optimal Learning Factors for Feed-Forward Networks Praveen Jesudhas, Michael T. Manry, Rohit Rawat, and Sanjeev Malalur Proceedings of International Joint Conference on Neural Networks, San Jose, California, USA, July 31 – August 5, 2011 978-1-4244-9637-2/11/$26.00 ©2011 IEEE 2593
  • 2. Fig. 1. A fully connected Multilayer Perceptron woi relieve the hidden units of the burden of approximating the linear part of the desired mapping. This reduces the necessary number of hidden units. The vector of hidden layer net functions, np and the actual output of the network, yp can be written as np = W · xp (1) yp = Woi · xp + Woh · op where the kth element of the hidden unit activation vector op is calculated as op(k) = f(np(k)) and f(.) denotes the hidden layer activation function. Training an MLP typically involves minimizing the mean squared error between the desired and the actual network outputs, defined as E = 1 Nv [ tp i - yp i ] 2 M i=1 Nv p=1 = 1 Nv Ep, Nv p=1 (2) Ep = [ tp i - yp i ] 2 M i=1 where Ep is the cumulative squared error for pattern p. B. Discussion of Back-Propagation BP training [13, 22, 23, 24] is a gradient based learning algorithm which involves changing system parameters so that the output error is reduced. For a MLP, the expression for actual output found from the input is given by equation (1). The weights of the MLP are changed by BP so that the error E of equation (2), which computes the sum of the squared errors between the actual and desired outputs, decreases. The weights are updated using gradients of the error with respect to the corresponding weights. The required negative error gradients are calculated using the chain rule utilizing delta functions, which are negative gradients of Ep with respect to net functions. For the pth pattern, output and hidden layer delta functions [13][21] are respectively found as, δpo i = 2 tp i - yp i 3 δp k = f' np k δpo M i=1 (i)woh(i,k) Now, the negative gradient of E with respect to w(k,n) is, g k,n = -∂E ∂w(k,n) = 1 Nv δp Nv p=1 k xp n (4) The matrix of negative partial derivatives can therefore be written as, G = 1 Nv δp Nv p=1 · xp T (5) where δp = [δp(1) , δ p(2)… , δ p(Nh) ]T . If steepest descent is used to modify the hidden weights, W is updated in a given iteration as, W ← W + ∆ W, (6) ∆ W = z·G where z is the learning factor, which can be obtained optimally from the expression, z = -∂E/ ∂z ∂2 E/ ∂z2 (7) III. REVIEW OF MOLF ALGORITHM The MOLF algorithm makes use of the BP procedure to update the input layer weights W, where a different learning factor zk is used to update weights feeding into the kth hidden unit. The output weights Woh and bypass weights Woi are found using output weight optimization (OWO). Output weight optimization (OWO) is a technique for finding weights connected to the outputs of the network. Since the outputs have linear activations, finding the weights connected to the outputs is equivalent to solving a system of linear equations. The expression for the outputs given in (1) can be re-written as yp = Wo· xp (8) where xp=[ T , T ] T is the augmented input vector of length Nu where Nu equals N + Nh + 1, Wo is formed as [Woi : Woh] and has a size of M by Nu . The output weights can be solved for by setting ∂E ∂Wo⁄ =0 which leads to a set of linear equations given by, Ca= Ra·Wo T (9) where, Ca= 1 Nv xp Nv p=1 · tp T , Ra= 1 Nv xp· Nv p=1 xp T Equation (9) is most easily solved using orthogonal least squares (OLS) [22]. The input weight connecting the nth input to the kth hidden unit is updated using, w k,n ← w k,n + zk·g k,n (10) where, zk denotes the learning factor corresponding to the kth hidden unit. The vector z containing the different learning xp(1) xp(2) xp(3) xp(N+1) yp(1) yp(2) yp(3) yp(M) W Woh np(1) op(1) op(Nh) Woi Input Layer Hidden Layer Output Layer np(Nh) 2594
The vector z containing the different learning factors zk can be found using OLS from the relation

Hmolf · z = gmolf,    gmolf(j) = -∂E/∂zj,    hmolf(k,j) = ∂^2E/(∂zk ∂zj)    (11)

IV. ANALYSIS OF THE MOLF ALGORITHM

In this section, the structure of the MLP is analyzed in detail using equivalent networks, and the relationship between MOLF and linear transformations of the net function vector is investigated.

A. Discussion of Equivalent Networks

Let MLP 1 have net function vector np and output vector yp after back-propagation training, as discussed in section II-B. In a second, similarly trained network called MLP 2, the net function vector and the output vector are denoted by n'p and y'p respectively. The bypass and output weight matrices, along with the hidden unit activation functions, are the same for MLP 1 and MLP 2. The output vectors of MLP 1 and MLP 2 are respectively

yp(i) = Σ_{n=1}^{N+1} woi(i,n) xp(n) + Σ_{k=1}^{Nh} woh(i,k) f(np(k))    (12)

y'p(i) = Σ_{n=1}^{N+1} woi(i,n) xp(n) + Σ_{k=1}^{Nh} woh(i,k) f(n'p(k))    (13)

In order to make MLP 2 strongly equivalent to MLP 1, their respective output vectors yp and y'p have to be equal. Based on equations (12) and (13), this can be achieved only when their net function vectors are equal [23]. MLP 2 can be made strongly equivalent to MLP 1 by linearly transforming its net function vector before passing it as an argument to the activation function f(.), so that

np = C · n'p    (14)

where C is a linear transformation matrix of size Nh by Nh. Applying this linear transformation to the net function vector in equation (13), we get

y'p(i) = yp(i) = Σ_{n=1}^{N+1} woi(i,n) xp(n) + Σ_{k=1}^{Nh} woh(i,k) f( Σ_{m=1}^{Nh} c(k,m) n'p(m) )    (15)

After making MLP 2 strongly equivalent to MLP 1, the net function vector np can be related to the input weights W of MLP 1 and to the net function vector n'p as

np(k) = Σ_{n=1}^{N+1} w(k,n) xp(n) = Σ_{m=1}^{Nh} c(k,m) n'p(m)    (16)

Similarly, the MLP 2 net function vector can be related to its weights W' as

n'p(k) = Σ_{n=1}^{N+1} w'(k,n) xp(n)    (17)

Substituting (17) into (16), we get

np(k) = Σ_{m=1}^{Nh} c(k,m) Σ_{n=1}^{N+1} w'(m,n) xp(n)    (18)

Thus the net functions of MLP 1 are obtained by a linear transformation of the input weight matrix W' of MLP 2, given by

W = C · W'    (19)

If the elements of the C matrix are found through an optimality criterion, then an optimal input weight matrix W for MLP 1 can be computed from the input weight matrix W' of MLP 2.
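The weight relationship of equation (19) is easy to verify numerically. The short sketch below (illustrative NumPy with arbitrary small sizes, not the authors' code) builds MLP 2 weights W', forms W = C · W', and checks that the two net function vectors satisfy np = C · n'p, as in equations (14), (16) and (17).

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nh = 5, 4

W_prime = rng.standard_normal((Nh, N + 1))   # input weights of MLP 2
C = rng.standard_normal((Nh, Nh))            # linear transformation matrix
W = C @ W_prime                              # eq. (19): input weights of MLP 1

xp = np.append(rng.standard_normal(N), 1.0)  # augmented input
n_prime = W_prime @ xp                       # net functions of MLP 2, eq. (17)
n_mlp1 = W @ xp                              # net functions of MLP 1, eq. (16)

# eq. (14): the two net function vectors are related by the same transformation C
assert np.allclose(n_mlp1, C @ n_prime)
```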
B. Optimal transformation of the net function

The discussion of equivalent networks in section IV-A suggests that an optimal set of net functions can be obtained by a linear transformation of the input weights. In this section, the input weight update equations are derived for OWO-BP training using the linear transformation matrix C from MLP 2. MLP 1 and MLP 2 are strongly equivalent as before, so the output of the two networks after the transformation is given by equation (15). The elements of the negative gradient matrix G' of MLP 2, found from equations (2) and (15), are

g'(u,v) = -∂E/∂w'(u,v) = (2/Nv) Σ_{p=1}^{Nv} Σ_{i=1}^{M} [tp(i) - yp(i)] · ∂yp(i)/∂w'(u,v),    ∂yp(i)/∂w'(u,v) = Σ_{k=1}^{Nh} woh(i,k) o'p(k) c(k,u) xp(v)    (20)

where o'p(k) = f'(np(k)). Rearranging terms in (20) results in

g'(u,v) = Σ_{k=1}^{Nh} c(k,u) { (2/Nv) Σ_{p=1}^{Nv} Σ_{i=1}^{M} [tp(i) - yp(i)] woh(i,k) o'p(k) xp(v) } = Σ_{k=1}^{Nh} c(k,u) [ -∂E/∂w(k,v) ]    (21)

which is abbreviated as

G' = C^T · G    (22)

The input weight update equation for MLP 2, based on its gradient matrix G', is

W' ← W' + z · G'    (23)

Pre-multiplying this equation by C and using equations (19) and (22), we obtain

W ← W + z · G'',    G'' = R · G,    R = C · C^T    (24)

Equation (24) is simply the weight update equation of MLP 1 that keeps the network strongly equivalent to MLP 2. Thus, using equation (24), the input weights of MLP 1 can be updated so that its net functions are optimal, as in MLP 2.

Lemma 1: For a given R matrix, there is an uncountably infinite number of corresponding C matrices.

The transformed gradient G'' of MLP 1 is given in terms of the original gradient G as

g''(u,v) = Σ_{k=1}^{Nh} r(u,k) g(k,v)    (25)

g(k,v) = -∂E/∂w(k,v) = (2/Nv) Σ_{p=1}^{Nv} Σ_{i=1}^{M} [tp(i) - yp(i)] woh(i,k) o'p(k) xp(v)    (26)

Equations (24) and (25) show that MLP 1 can be trained with optimal net functions using only knowledge of the linear transformation matrix R.

C. Multiple optimal learning factors

Equations (24) and (25) give a method for optimally transforming the net functions of an MLP using the transformation matrix R. In this subsection, the MOLF approach is derived from equations (24) and (25). Let the R matrix in equation (24) be diagonal. In this case, equation (24) becomes

w(k,n) ← w(k,n) + z · r(k) g(k,n)    (27)

where r(k) denotes the kth diagonal element of R. Comparing equations (10) and (27), the optimal learning factors zk can be expressed as

zk = z · r(k)    (28)

Equation (28) shows that using the MOLF algorithm to train an MLP is equivalent to optimally transforming the net function vector using a diagonal transformation matrix.

V. EFFECTS OF COLLAPSING THE MOLF HESSIAN

This section presents a computational analysis of the existing MOLF algorithm, followed by a proposed method that collapses the MOLF Hessian matrix to create fewer optimal learning factors.

A. Computational Analysis of the MOLF algorithm

In the existing MOLF algorithm, a significant part of the computational burden is attributed to inverting the Hessian matrix to find the optimal learning factors, usually through the OLS procedure. The size of this Hessian matrix, which equals the number of hidden units in the network, directly affects the computational load of the MOLF algorithm. The gradient and Hessian of the MOLF algorithm are given in equation (11), and computing the multiple optimal learning factor vector z requires solving that equation. The number of multiplies required to compute z through OLS is

Mols-molf = Nh (Nh + 1)(Nh + 2)    (29)

Equation (29) shows that the number of multiplies increases cubically with the number of hidden units. For networks with large Nh, this can be time consuming, since (11) has to be solved in every iteration of the training algorithm. In addition, if some hidden units have linearly dependent outputs [20], the MOLF Hessian matrix Hmolf can be ill conditioned, which can hinder MLP training. Thus, having one optimal learning factor per hidden unit can result in a high computational load and an ill-conditioned MOLF Hessian matrix.
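For reference, a single MOLF input-weight update can be sketched as follows. This is our own NumPy illustration, not the authors' implementation: Hmolf and gmolf are assumed to have been accumulated as in equation (11), and a least-squares solve stands in for the OLS procedure.

```python
import numpy as np

def molf_update(W, G, H_molf, g_molf):
    """One MOLF input-weight update, eqs. (10) and (11).

    W : Nh by (N+1) input weights, G : negative gradient matrix of eq. (5),
    H_molf : Nh by Nh Hessian, g_molf : length-Nh negative gradient vector.
    """
    # Solve H_molf z = g_molf for one learning factor per hidden unit;
    # lstsq tolerates an ill-conditioned H_molf better than a direct inverse.
    z, *_ = np.linalg.lstsq(H_molf, g_molf, rcond=None)
    return W + z[:, None] * G          # w(k,n) <- w(k,n) + z_k g(k,n)
```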
B. Training with several hidden units per OLF

In this section, we modify the MOLF approach so that each OLF is assigned to one or more hidden units, and the Hessian of this new approach is compared to Hmolf. Let NOLF be the number of optimal learning factors used for training in the proposed variable optimal learning factors (VOLF) method. NOLF is selected so that it divides Nh with no remainder. Each optimal learning factor zv then applies to Nh/NOLF hidden units. The MLP output under these conditions is

yp(m) = Σ_{n=1}^{N+1} woi(m,n) xp(n) + Σ_{k=1}^{Nh} woh(m,k) f( Σ_{n=1}^{N+1} [w(k,n) + zv g(k,n)] xp(n) ),    v = ceil(k · NOLF / Nh)    (30)

where the function ceil() rounds non-integer arguments up to the next integer. The negative gradient of the error in equation (2) with respect to each optimal learning factor zv, denoted gvolf, is computed from equation (30) as

gvolf(j) = -∂E/∂zj = (2/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} [ t'p(m) - Σ_{k=1}^{Nh} woh(m,k) op(zv,k) ] · Σ_{c=1}^{Nh/NOLF} woh(m,k(j,c)) o'p(k(j,c)) n'p(k(j,c))    (31)

where

t'p(m) = tp(m) - Σ_{n=1}^{N+1} woi(m,n) xp(n),    k(j,c) = c + (j-1) · Nh/NOLF,    op(zv,k) = f( Σ_{n=1}^{N+1} [w(k,n) + zv g(k,n)] xp(n) )
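The assignment of hidden units to learning factors can be written out directly from the index maps above. The sketch below is an illustrative Python implementation of v = ceil(k · NOLF/Nh) from equation (30) and k(j,c) = c + (j-1)·Nh/NOLF from equation (31), using 1-based indices as in the paper; the sizes are arbitrary examples.

```python
import math

Nh, NOLF = 30, 6                 # NOLF must divide Nh with no remainder
units_per_olf = Nh // NOLF

# eq. (30): learning factor index v for hidden unit k (1-based)
def olf_index(k):
    return math.ceil(k * NOLF / Nh)

# eq. (31): the c-th hidden unit governed by learning factor j
def hidden_index(j, c):
    return c + (j - 1) * units_per_olf

groups = {j: [hidden_index(j, c) for c in range(1, units_per_olf + 1)]
          for j in range(1, NOLF + 1)}
# e.g. groups[1] == [1, 2, 3, 4, 5] and olf_index(7) == 2
```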
The Hessian matrix Hvolf is derived from (2), (30) and (31) as

hvolf(i,j) = ∂^2E/(∂zi ∂zj) = (2/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} [ Σ_{d=1}^{Nh/NOLF} woh(m,k(i,d)) o'p(k(i,d)) n'p(k(i,d)) ] · [ Σ_{c=1}^{Nh/NOLF} woh(m,k(j,c)) o'p(k(j,c)) n'p(k(j,c)) ]    (32)

The vector z of variable optimal learning factors is found from the negative gradient vector and the Hessian matrix through

Hvolf · z = gvolf    (33)

where Hvolf is NOLF by NOLF and gvolf is a column vector of dimension NOLF. Thus, by varying the number of optimal learning factors used for training, the vector z is found using the method discussed in section III. The computation required to solve equation (33) can be adjusted by choosing the number of optimal learning factors. When NOLF equals one, the procedure is similar to using a single fixed optimal learning factor as discussed in section II; it requires less computation but is also less effective. When NOLF equals Nh, the algorithm reduces to the MOLF algorithm described in section III. Thus, by varying the number of optimal learning factors between one and Nh, the algorithm interpolates between the OLF and MOLF cases.

C. Collapsing the MOLF Hessian

Occasionally the MOLF approach can fail because Hmolf is distorted by linearly dependent inputs [20] or is ill-conditioned due to linearly dependent hidden units [20]. In these cases we do not have to redo the current training iteration. Instead, we can collapse Hmolf and gmolf down to a smaller size and use the approach of the previous subsection. The MOLF Hessian matrix for an MLP with Nh hidden units has elements

Hmolf = [hmolf(k,j)],    1 ≤ k, j ≤ Nh    (34)

Collapsing the Hessian matrix Hmolf to a matrix Hmolf1 of size NOLF by NOLF, we get

Hmolf1 = [hmolf1(m,n)],    1 ≤ m, n ≤ NOLF    (35)

The relationship between the elements of Hmolf1 and Hmolf is

hmolf1(m,n) = Σ_{c=1}^{Nh/NOLF} Σ_{d=1}^{Nh/NOLF} hmolf(k(m,c), k(n,d))    (36)

Equations (11) and (32) show that collapsing the MOLF Hessian matrix to size NOLF by NOLF results in the VOLF Hessian matrix of that size. Thus the collapsed Hessian Hmolf1 of equation (35) equals the VOLF Hessian matrix Hvolf of equation (32). Collapsing the negative gradient vector gmolf to a vector gmolf1 having NOLF elements gives

gmolf1 = [gmolf1(1), gmolf1(2), ..., gmolf1(NOLF)]^T    (37)

The relationship between gmolf1 and gmolf is

gmolf1(m) = Σ_{c=1}^{Nh/NOLF} gmolf(k(m,c))    (38)

Equations (11) and (31) show that collapsing the MOLF gradient vector to size NOLF by 1 results in the VOLF gradient vector. Therefore, collapsing the MOLF Hessian matrix Hmolf and gradient vector gmolf is equivalent to training the MLP with the VOLF algorithm.
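A direct way to form the collapsed quantities of equations (36) and (38) is to sum Hmolf and gmolf over the block of hidden units assigned to each learning factor. The sketch below is an illustrative NumPy implementation (0-based indexing replaces the paper's 1-based index map k(m,c)); it is not the authors' code.

```python
import numpy as np

def collapse_molf(H_molf, g_molf, NOLF):
    """Collapse H_molf (Nh x Nh) and g_molf (Nh,) to size NOLF, eqs. (36) and (38)."""
    Nh = g_molf.shape[0]
    s = Nh // NOLF                       # hidden units per learning factor
    H1 = np.zeros((NOLF, NOLF))
    g1 = np.zeros(NOLF)
    for m in range(NOLF):
        rows = slice(m * s, (m + 1) * s)
        g1[m] = g_molf[rows].sum()       # eq. (38): sum over the m-th block
        for n in range(NOLF):
            cols = slice(n * s, (n + 1) * s)
            H1[m, n] = H_molf[rows, cols].sum()   # eq. (36): block sum
    return H1, g1
```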
VI. RESULTS

A. Computational Burden of Different Algorithms

In this section, the computational burden of the different MLP training algorithms is measured in terms of required multiplies. All the algorithms were implemented in Microsoft Visual Studio 2005. Let Nu = N + Nh + 1 denote the number of weights connected to each output, and let Nw = M(N + Nh + 1) + Nh(N + 1) denote the total number of weights in the network. The number of multiplies required to solve for the output weights using orthogonal least squares [22] is

Mols = Nu [ M(Nu + 2) + (3/4) Nu (Nu + 1) ]    (39)

The numbers of multiplies required per training iteration for BP, OWO-BP and LM are respectively

Mbp = Nv [ M Nu + 2 Nh (N+1) + M(N + 6Nh + 4) ] + Nw    (40)

Mowo-bp = Nv [ 2 Nh (N+2) + M(Nu + 1) + Nu (N+1)/2 + M(N + 6Nh + 4) ] + Mols + Nh (N+1)    (41)

Mlm = Mbp + Nv [ M Nu (Nu + 3 Nh (N+1)) + 4 Nh^2 (N+1)^2 ] + Nw^2 + Nw^3    (42)

Equation (29) gives the multiplies required to compute the optimal learning factors of the MOLF algorithm. Similarly, the number of multiplies required to solve for the VOLF learning factors is

Mols-volf = NOLF (NOLF + 1)(NOLF + 2)    (43)

The total numbers of multiplies required for the MOLF and VOLF algorithms are respectively

Mmolf = Mowo-bp + Nv Nh (N+4) - M(7N - Nh + 4) + Nh^3 + Mols-molf    (44)

Mvolf = Mowo-bp + Nv Nh (N+4) - M(7N - Nh + 4) + Nh^3 + Mols-volf    (45)

Equations (44) and (45) show that the multiplies required by the MOLF and VOLF algorithms consist of the OWO-BP multiplies, the multiplies needed to compute the Hessian and negative gradient, and the multiplies needed to invert the Hessian matrix; the matrices Hmolf and Hvolf are respectively inverted in the MOLF and VOLF algorithms.
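As a rough illustration of the savings, and assuming the learning-factor solve counts of equations (29) and (43) as reconstructed above, the snippet below compares the cost of solving for Nh learning factors with the cost of solving for NOLF of them; it ignores the Hessian and gradient accumulation terms shared by (44) and (45), and the example sizes are ours.

```python
def lf_solve_multiplies(n):
    # reconstructed count of eqs. (29)/(43): multiplies for the OLS solve of an n x n system
    return n * (n + 1) * (n + 2)

for Nh, NOLF in [(30, 15), (30, 6), (100, 10)]:
    ratio = lf_solve_multiplies(Nh) / lf_solve_multiplies(NOLF)
    print(f"Nh={Nh}, NOLF={NOLF}: MOLF/VOLF solve cost ratio ~ {ratio:.1f}")
```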
B. Experimental results

Here the performance of VOLF is compared with those of MOLF, OWO-BP, LM and CG. In CG and LM, all weights are varied in every iteration. In MOLF, VOLF and OWO-BP, we first solve linear equations for the output weights and subsequently update the input weights. The data sets used for the simulations are listed in Table I.

TABLE I. Data set description
Data Set Name   No. of Inputs   No. of Outputs   No. of Patterns
Twod            8               7                1768
Single2         16              3                10000
Power12         12              1                1414
Concrete        8               1                1030

For every data set, training is done on inputs normalized to zero mean and unit variance. The number of hidden units used in each MLP is determined by network pruning using the method of [24]. The k-fold validation procedure is then used to obtain the average training and validation errors. In k-fold validation, the data set is split into k non-overlapping parts of equal size; (k - 1) parts are used for training and the remaining part is used for validation. The procedure is repeated k times, picking a different part as the validation set each time. For the simulations, k is chosen as 10.
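The 10-fold procedure described above can be sketched generically as follows (our own NumPy code, not the authors' experimental setup): the pattern indices are split into k non-overlapping parts, and each part in turn serves as the validation set while the rest are used for training.

```python
import numpy as np

def k_fold_splits(Nv, k=10, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold validation."""
    idx = np.random.default_rng(seed).permutation(Nv)
    folds = np.array_split(idx, k)          # k nearly equal, non-overlapping parts
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# The training and validation errors are then averaged over the k splits, as in Table II.
```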
The number of iterations used for OWO-BP, CG, MOLF and VOLF is 4000; for the LM algorithm, 300 iterations are performed. For the VOLF algorithm, NOLF = Nh/2. The average training error and the number of multiplies are calculated for every iteration of each training algorithm on each data set. These measurements are then plotted to give a graphical comparison of the efficiency and quality of the different training algorithms. The plots for the different data sets are shown below.

For the Twod.tra data file [25], the MLP is trained with 30 hidden units. In Fig. 2, the average training mean square error (MSE) from 10-fold validation is plotted versus the number of iterations for each algorithm (shown on a log10 scale). In Fig. 3, the average training MSE from 10-fold validation is plotted versus the required number of multiplies (shown on a log10 scale). The LM algorithm shows the lowest error of all the algorithms, but its computational burden may make it unsuitable for training. The average error of the VOLF algorithm lies between those of the OWO-BP and MOLF algorithms at every iteration.

Fig. 2. Twod data set: average error vs. number of iterations.
Fig. 3. Twod data set: average error vs. number of multiplies.

For the Single2.tra data file [25], the MLP is trained with 20 hidden units. In Fig. 4, the average training MSE from 10-fold validation is plotted versus the number of iterations for each algorithm (shown on a log10 scale). In Fig. 5, it is plotted versus the required number of multiplies (shown on a log10 scale). In the initial iterations the MOLF algorithm shows the least error, but overall, as for the previous data set, the LM algorithm shows the lowest error of all the algorithms. As before, the average error of the VOLF algorithm lies between those of the OWO-BP and MOLF algorithms.

Fig. 4. Single2 data set: average error vs. number of iterations.
Fig. 5. Single2 data set: average error vs. number of multiplies.

For the Power12trn.tra data file [25], the MLP is trained with 25 hidden units. In Fig. 6, the average training MSE from 10-fold validation is plotted versus the number of iterations for each algorithm (shown on a log10 scale). In Fig. 7, it is plotted versus the required number of multiplies (shown on a log10 scale). For this data set, the MOLF algorithm performs better than all the other algorithms. As seen previously, the computational requirement of the LM algorithm is very high. As before, the average error of the VOLF algorithm lies between those of the OWO-BP and MOLF algorithms at each iteration.

Fig. 6. Power12 data set: average error vs. number of iterations.
Fig. 7. Power12 data set: average error vs. number of multiplies.

For the Concrete data file [26], the MLP is trained with 15 hidden units. In Fig. 8, the average training MSE from 10-fold validation is plotted versus the number of iterations for each algorithm (shown on a log10 scale). In Fig. 9, it is plotted versus the required number of multiplies (shown on a log10 scale). As before, the average error of the VOLF algorithm lies between those of the OWO-BP and MOLF algorithms at every iteration.

Fig. 8. Concrete data set: average error vs. number of iterations.
Fig. 9. Concrete data set: average error vs. number of multiplies.

Table II compares the average training and validation errors of the MOLF and VOLF algorithms with those of the other algorithms for the different data files. For each data set, the average training and validation errors are obtained from 10-fold validation.

TABLE II. Average 10-fold training and validation error
Data Set            OWO-BP     CG         VOLF       MOLF       LM
Twod      Etrn      0.22       0.20       0.18       0.17       0.16
          Eval      0.25       0.22       0.21       0.20       0.17
Single2   Etrn      0.83       0.63       0.51       0.16       0.03
          Eval      1.02       0.99       0.64       0.19       0.11
Power12   Etrn      6959.89    6471.98    5475.01    4177.45    4566.34
          Eval      7937.21    7132.13    6712.34    5031.23    5423.54
Concrete  Etrn      38.98      19.88      21.81      21.56      16.98
          Eval      65.12      56.12      37.23      37.21      33.12
From the plots and the table presented, it can be inferred that the error performance of the VOLF algorithm lies between those of the MOLF and OWO-BP algorithms. The VOLF and MOLF algorithms also produce results approaching those of the LM algorithm, with computational requirements on the order of those of first order training algorithms.

VII. CONCLUSION

In this paper we have put the MOLF training algorithm on a firm theoretical footing by showing its derivation from an optimal linear transformation of the hidden layer's net function vector. We have also developed the VOLF training algorithm, which associates several hidden units with each optimal learning factor. This modification of the MOLF approach will allow us to train much larger networks efficiently.

REFERENCES
[1] R. C. Odom, P. Pavlakos, S. S. Diocee, S. M. Bailey, D. M. Zander, and J. J. Gillespie, "Shaly sand analysis using density-neutron porosities from a cased-hole pulsed neutron system," SPE Rocky Mountain Regional Meeting Proceedings: Society of Petroleum Engineers, pp. 467-476, 1999.
[2] A. Khotanzad, M. H. Davis, A. Abaye, and D. J. Maratukulam, "An artificial neural network hourly temperature forecaster with applications in load forecasting," IEEE Transactions on Power Systems, vol. 11, no. 2, pp. 870-876, May 1996.
[3] S. Marinai, M. Gori, and G. Soda, "Artificial neural networks for document analysis and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 23-35, 2005.
[4] J. Kamruzzaman, R. A. Sarker, and R. Begg, Artificial Neural Networks: Applications in Finance and Manufacturing, Idea Group Inc (IGI), 2006.
[5] L. Wang and X. Fu, Data Mining With Computational Intelligence, Springer-Verlag, 2005.
[6] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[7] K. Hornik, M. Stinchcombe, and H. White, "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks," Neural Networks, vol. 3, pp. 551-560, 1990.
[8] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, November 1995.
[9] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems (MCSS), vol. 2, pp. 303-314, 1989.
[10] D. W. Ruck et al., "The multilayer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Transactions on Neural Networks, vol. 1, no. 4, 1990.
[11] M. T. Manry, S. J. Apollo, and Q. Yu, "Minimum mean square estimation and neural networks," Neurocomputing, vol. 13, pp. 59-74, September 1996.
[12] Q. Yu, S. J. Apollo, and M. T. Manry, "MAP estimation and the multilayer perceptron," Proceedings of the 1993 IEEE Workshop on Neural Networks for Signal Processing, pp. 30-39, Linthicum Heights, Maryland, Sept. 6-9, 1993.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing, vol. I, Cambridge, Massachusetts: The MIT Press, 1986.
[14] C. Yu, M. T. Manry, and J. Li, "Effects of nonsingular pre-processing on feed-forward network training," International Journal of Pattern Recognition and Artificial Intelligence, vol. 19, no. 2, pp. 217-247, 2005.
[15] J. P. Fitch, S. K. Lehman, F. U. Dowla, S. Y. Lu, E. M. Johansson, and D. M. Goodman, "Ship wake-detection procedure using conjugate gradient trained artificial neural networks," IEEE Transactions on Geoscience and Remote Sensing, vol. 29, no. 5, pp. 718-726, September 1991.
Goodman, "Ship Wake-Detection Procedure Using Conjugate Gradient Trained Artificial Neural Networks," IEEE Trans. on Geoscience and Remote Sensing, Vol. 29, No. 5, September 1991, pp. 718-726. [16] S. McLoone and G. Irwin, “A variable memory Quasi-Newton training algorithm,” Neural Processing Letters, vol. 9, pp. 77-89, 1999. [17] A. J. Shepherd Second-Order Methods for Neural Networks, Springer- Verlag New York, Inc., 1997. [18] K. Levenberg, “A method for the solution of certain problems in least squares,” Quart. Appl. Math., Vol. 2, pp. 164.168, 1944. [19] D. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters,” SIAM J. Appl. Math., Vol. 11, pp. 431.441, 1963. [20] S. S. Malalur, M. T. Manry, "Multiple optimal learning factors for feed-forward networks," accepted by The SPIE Defense, Security and Sensing (DSS) Conference, Orlando, FL, April 2010 [21] T.H. Kim “Development and evaluation of multilayer perceptron training algorithms”, Phd. Dissertation, The University of Texas at Arlington, 2001. [22] W. Kaminski, P. Strumillo, “Kernel orthonormalization in radial basis function neural networks,” IEEE Transactions on Neural Networks, vol. 8, Issue 5, pp. 1177 - 1183, 1997 [23] R.P Lippman, “An introduction to computing with Neural Nets,” IEEE ASSP Magazine,April 1987. [24] Pramod L. Narasimha, Walter H. Delashmit, Michael T. Manry, Jiang Li, Francisco Maldonado, “An integrated growing-pruning method for feedforward network training,” Neurocomputing vol. 71, pp. 2831– 2847, 2008. [25] Univ. of Texas at Arlington, Training Data Files – http://wwwee.uta.edu/eeweb/ip/training_data_files.html. [26] Univ. of California, Irvine, Machine Learning Repository - http://archive.ics.uci.edu/ml/ TABLE II Average 10-fold training and validation error Data Set OWO-BP CG VOLF MOLF LM Twod Etrn 0.22 0.20 0.18 0.17 0.16 Eval 0.25 0.22 0.21 0.20 0.17 Single2 Etrn 0.83 0.63 0.51 0.16 0.03 Eval 1.02 0.99 0.64 0.19 0.11 Power12 Etrn 6959.89 6471.98 5475.01 4177.45 4566.34 Eval 7937.21 7132.13 6712.34 5031.23 5423.54 Concrete Etrn 38.98 19.88 21.81 21.56 16.98 Eval 65.12 56.12 37.23 37.21 33.12 2600